Hi, all.
I am using JEDI-FV3 v1.1.2 to test the CONUS domain (RRFS ESG grid, 1749x1049) of the FV3-LAM applications. I run on the Frontera machine at TACC, but I have lost queue access twice while testing the JEDI EnKF for the CONUS domain with this code.
The TACC admins said:
Your account was disabled again as your application was beating one OST heavily instead of spreading the load onto multiple OSTs on /scratch1. Were you able to execute dry-run tests to verify your changes? The administrator has left a note to recommend user to fix IO and use striping or other parallel IO methods.
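My understanding is that the striping fix they mention would look something like the commands below (the run directory path and stripe count are placeholders; I have not actually applied these yet):

# check the current stripe layout of my run directory on /scratch1
lfs getstripe $SCRATCH/jedi_runs/letkf_conus

# spread files newly created in that directory across, e.g., 8 OSTs with a 4 MB stripe size
lfs setstripe -c 8 -S 4m $SCRATCH/jedi_runs/letkf_conus

# note: existing files keep their old layout; they would need to be copied into the restriped directory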
When I checked the log file, my application stopped after these lines:
patch center=[-122.161,22.4423] patch radius=107023
ObsSpaces::ObsSpaces : conf LocalConfiguration[root={name => Radar , distribution => Halo , obsdatain => {obsfile => -------/INPUTS/obs/newtest_dbz_fv3lam_202105201200_5.nc} , simulated variables => (dbz) , center => (-122.161,22.4423) , radius => 107023}]
ObsSpaceBase: seed = 112557149200
Radar vars: 1 variables: dbz
Halo constructed: center: {-122.161,22.4423} radius: 107023
ObsFrameRead: maximum frame size: 10000
Radar: 0 observations are outside of time window out of 3232593
(excerpt from letkf_srw_c1_conus.log)
I think this is around the point where observations are redistributed with the Halo observation distribution for EnKF applications. I do not know whether the heavy I/O onto a single OST occurs at this stage. (I cannot see any temporary files in the work path.)
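One thing I plan to check (this is only my guess at the cause) is whether the load comes from every MPI task reading the same radar observation file with about 3.2 million observations, which would hammer a single OST if the file is not striped:

# does the radar obs file sit on a single OST? a stripe count of 1 would mean yes
lfs getstripe -------/INPUTS/obs/newtest_dbz_fv3lam_202105201200_5.nc

# list the OSTs backing /scratch1, for reference
lfs osts /scratch1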
Has anyone encountered these issues before?
Below is my module environment for JEDI-FV3 v1.1.2, which I compiled with JEDI-STACK.
Currently Loaded Modules:
1) git/2.24.1 8) intel/19.1.1 15) lapack/3.8.0 22) ecbuild/ecmwf-3.6.1 29) fckit/ecmwf-0.9.2
2) autotools/1.2 9) jedi-intel/19.1.1 16) bufr/noaa-emc-11.5.0 23) boost-headers/1.68.0 30) atlas/ecmwf-0.24.1
3) python3/3.7.0 10) impi/19.0.9 17) hdf5/1.12.0 24) eigen/3.3.7 31) pio/2.5.1
4) pmix/3.1.4 11) jedi-impi/19.0.9 18) netcdf/4.7.4 25) gsl_lite/0.37.0
5) hwloc/1.11.12 12) zlib/1.2.11 19) pnetcdf/1.12.1 26) json/3.9.1
6) xalt/2.10.34 13) udunits/2.2.28 20) nccmp/1.8.7.0 27) pybind11/2.5.0
7) TACC 14) szip/2.1.1 21) cmake/3.20.0 28) eckit/ecmwf-1.16.0
Since I also suspect my HPC environment, I checked my JEDI compile log.
I expected that employing PIO (module 31 above) would keep the I/O burden from being too large, but that does not seem to be the case.
I also checked that the PIO environment variables are set in 'ecbuild.log'.
Is there any way to check whether my JEDI executables are compiled with parallel I/O support?
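For reference, this is roughly what I was planning to try myself (the executable name is a placeholder since I am not sure of the exact binary name, and I am only assuming a PIO-enabled build would expose these libraries and symbols):

# is the LETKF executable linked against the PIO shared libraries?
ldd fv3jedi_letkf.x | grep -i pio

# if PIO is linked statically, look for PIO C symbols in the binary instead
nm -C fv3jedi_letkf.x | grep -i PIOc_

# look for PIO-related entries in the build configuration
grep -i pio CMakeCache.txt
grep -i pio ecbuild.log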
Thanks,
Jun