JEDI EnKF issues with CONUS FV3-LAM domain

Hi, all.

I am using JEDI-FV3 v1.1.2 to test an FV3-LAM application on a CONUS domain (RRFS ESG grid, 1749x1049). I run on the Frontera machine at TACC, but I have lost queue access twice while testing the JEDI EnKF on this CONUS domain with that code.

The TACC administrators said:

Your account was disabled again because your application was beating one OST heavily instead of spreading the load onto multiple OSTs on /scratch1. Were you able to execute dry-run tests to verify your changes? The administrator has left a note recommending that the user fix the I/O and use striping or other parallel I/O methods.
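For context, my understanding of that note is that the run directory on /scratch1 should be striped before the job writes to it. A minimal sketch of the standard Lustre commands I have in mind is below; the path, stripe count, and stripe size are placeholders, not values TACC prescribed.

# Check the current (default) stripe layout of the run directory;
# a stripe count of 1 means every file created there lands on a single OST
lfs getstripe -d /scratch1/xxxxx/jedi_letkf_run    # placeholder path

# Spread new files created in this directory across 8 OSTs with 4 MB stripes
# (example values only; files that already exist keep their old layout)
lfs setstripe -c 8 -S 4M /scratch1/xxxxx/jedi_letkf_run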

When I checked the log file, my application had stopped after these lines:


patch center=[-122.161,22.4423] patch radius=107023
ObsSpaces::ObsSpaces : conf LocalConfiguration[root={name => Radar , distribution => Halo , obsdatain => {obsfile => -------/INPUTS/obs/newtest_dbz_fv3lam_202105201200_5.nc} , simulated variables => (dbz) , center => (-122.161,22.4423) , radius => 107023}]
ObsSpaceBase: seed = 112557149200
Radar vars: 1 variables: dbz
Halo constructed: center: {-122.161,22.4423} radius: 107023
ObsFrameRead: maximum frame size: 10000
Radar: 0 observations are outside of time window out of 3232593
(tail of letkf_srw_c1_conus.log)

I think this is around the point where the observations are redistributed with the Halo observation distribution for the EnKF application, but I don’t know whether the heavy I/O onto a single OST happens at this stage. (I can’t see any temporary files in the work path.)
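If, as I suspect, every MPI task reads this same radar file while the Halo distribution is set up, then the stripe layout of the input file itself may be the problem. A rough sketch of the check I am considering, where $OBS_DIR is only a placeholder for the directory holding my obs file and the stripe settings are example values:

# See how the 3.2-million-observation radar file is currently striped
lfs getstripe $OBS_DIR/newtest_dbz_fv3lam_202105201200_5.nc

# If it sits on a single OST, copy it into a striped directory
# (new files inherit the directory's layout) and point obsdatain at the copy
mkdir -p $OBS_DIR/striped
lfs setstripe -c 8 -S 4M $OBS_DIR/striped
cp $OBS_DIR/newtest_dbz_fv3lam_202105201200_5.nc $OBS_DIR/striped/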

Has anyone run into these issues before?

Below is my environment for JEDI-FV3 v1.1.2, which I compiled with JEDI-STACK.
Currently Loaded Modules:
1) git/2.24.1 8) intel/19.1.1 15) lapack/3.8.0 22) ecbuild/ecmwf-3.6.1 29) fckit/ecmwf-0.9.2
2) autotools/1.2 9) jedi-intel/19.1.1 16) bufr/noaa-emc-11.5.0 23) boost-headers/1.68.0 30) atlas/ecmwf-0.24.1
3) python3/3.7.0 10) impi/19.0.9 17) hdf5/1.12.0 24) eigen/3.3.7 31) pio/2.5.1
4) pmix/3.1.4 11) jedi-impi/19.0.9 18) netcdf/4.7.4 25) gsl_lite/0.37.0
5) hwloc/1.11.12 12) zlib/1.2.11 19) pnetcdf/1.12.1 26) json/3.9.1
6) xalt/2.10.34 13) udunits/2.2.28 20) nccmp/1.8.7.0 27) pybind11/2.5.0
7) TACC 14) szip/2.1.1 21) cmake/3.20.0 28) eckit/ecmwf-1.16.0

Since I also suspect my HPC environment, I checked my JEDI compile log as well.
I expected that using PIO (module 31) would avoid putting a large burden on the I/O system, but that does not seem to be the case.
I also confirmed that the PIO environment variables are set in ‘ecbuild.log’.
Is there any way to check whether my JEDI executables were compiled with parallel I/O support?
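The only checks I could come up with on my own are along these lines; the executable name below is a placeholder for whatever the LETKF binary is called in your build, and I am not sure how conclusive they are:

# If the binary is dynamically linked, list which I/O libraries it actually pulls in
ldd build/bin/fv3jedi_letkf.x | grep -Ei 'pio|pnetcdf|netcdf|hdf5'

# Ask the netCDF-C installation that JEDI was built against whether parallel I/O is enabled
nc-config --has-parallel

# Look for PIO-related variables recorded by ecbuild/CMake at configure time
grep -i pio build/CMakeCache.txt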

Thanks,

Jun

Hi @junpark217,

Sorry about the slow response. This issue is being worked on now. We are aware of the poor performance of the IODA reader (which is where these messages are coming from), and my current project is to improve the performance (and parallel processing) of the IODA I/O system (reader and writer).

The JEDI project is under heavy development, and the latest release is not yet optimized for large jobs. We are making progress with optimization but are not finished yet; this is typical of a project at this stage of development. We will be scheduling another release that will hopefully contain major performance improvements (including to the IODA I/O system).

Thank you for your understanding and patience while we resolve these issues.

Stephen

Hi Stephen,

Thanks for the reply.
It’s great to see the steady improvement of the JEDI infrastructure, and I am relieved that this issue is being handled now.

Jun