Jedi-bundle ctest failures on Orion

In getting back to JEDI, I’m attempting to build the develop branch of jedi-bundle and run it, and SkyLab, on Orion. I’m getting a small number of ctest failures and am wondering whether these are expected for develop, or whether I have a problem on my end. I’ve been following the instructions on readthedocs: I ran ctest -R get_ on the front end before running ctest -E get_ in an interactive session on a compute node. I haven’t looked at the full output of every failure, but some seg fault and others appear to have what looks like an ioda marshalling problem. The following was run, both at compile time and at ctest time, to set up the environment:

module purge
module use /work/noaa/da/role-da/spack-stack/modulefiles
module load miniconda/3.9.7
module load ecflow/5.8.4
module load mysql/8.0.31

module use /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0/envs/unified-env-v2/install/modulefiles/Core
module load stack-intel/2022.0.2
module load stack-intel-oneapi-mpi/2021.5.1
module load stack-python/3.9.7

module load jedi-fv3-env/unified-dev
module load ewok-env/unified-dev
module load soca-env/unified-dev

module unload crtm

module list

ulimit -s unlimited
ulimit -v unlimited
export SLURM_EXPORT_ENV=ALL
export HDF5_USE_FILE_LOCKING=FALSE
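
For reference, the overall sequence I followed looks roughly like this (the salloc options below are just how I happened to request the interactive session, and the account is a placeholder; nothing here is prescribed beyond running the get_ tests on the front end and everything else on a compute node):

# on the front end, from the build directory
cd /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build
ctest -R get_

# then in an interactive session on a compute node
salloc --nodes=1 --ntasks-per-node=24 --partition=orion --qos=batch --account=<account> --time=04:00:00
cd /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build
ctest -E get_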

The following tests are failing:

99% tests passed, 16 tests failed out of 1918

...

The following tests FAILED:
	588 - test_k_matrix_ChannelSubset_iasi_metop-b (Failed)
	1471 - fv3jedi_test_tier1_hofx_fv3lm (Failed)
	1473 - fv3jedi_test_tier1_hofx_nomodel (Failed)
	1475 - fv3jedi_test_tier1_hofx_nomodel_amsua_radii (Failed)
	1476 - fv3jedi_test_tier1_hofx_nomodel_abi_radii (Failed)
	1496 - fv3jedi_test_tier1_hyb-3dvar (Failed)
	1512 - fv3jedi_test_tier1_hyb-4dvar_pseudo-geos (Failed)
	1519 - fv3jedi_test_tier1_diffstates_gfs (Failed)
	1520 - fv3jedi_test_tier1_diffstates_geos (Failed)
	1522 - fv3jedi_test_tier1_addincrement_gfs (Failed)
	1523 - fv3jedi_test_tier1_letkf (Failed)
	1525 - fv3jedi_test_tier1_lgetkf (Failed)
	1531 - fv3jedi_test_tier1_enshofx_fv3lm (Failed)
	1533 - fv3jedi_test_tier1_eda_3dvar (Failed)
	1917 - test_coupled_hofx3d_fv3_mom6 (Failed)
	1918 - test_coupled_hofx3d_fv3_mom6_dontusemom6 (Failed)

Your modules and setup look correct. I haven’t run ctests on Orion in the last three weeks since I was on vacation, but I know that not too long ago all of them passed. We’ll try to reproduce the problem on our end!

If instead of grabbing develop I clone with git clone --branch 5.0.0 https://github.com/jcsda/jedi-bundle, then I get only two failed tests:

The following tests FAILED:
	517 - test_k_matrix_ChannelSubset_iasi_metop-b (Failed)
	1397 - fv3jedi_test_tier1_lgetkf (Timeout)

But then there are also only 1476 tests instead of 1918.

More confusion… Today, when I try to run ctest -R get_ from an Orion front end using the develop branch, I get:

(base) charrop@Orion-login-4:/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build> ctest -R get_
Test project /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build
    Start  507: get_crtm_coeffs
1/7 Test  #507: get_crtm_coeffs ........................   Passed   42.42 sec
    Start  778: get_ioda_test_data
2/7 Test  #778: get_ioda_test_data .....................***Failed    0.25 sec
    Start  951: ufo_get_ufo_test_data
3/7 Test  #951: ufo_get_ufo_test_data ..................***Failed    0.11 sec
    Start  952: ufo_get_crtm_test_data
4/7 Test  #952: ufo_get_crtm_test_data .................   Passed    0.52 sec
    Start  981: test_ufo_geovals_get_nonexistent_var
5/7 Test  #981: test_ufo_geovals_get_nonexistent_var ...   Passed    1.94 sec
    Start 1423: fv3jedi_get_fv3-jedi_test_data
6/7 Test #1423: fv3jedi_get_fv3-jedi_test_data .........***Failed    0.11 sec
    Start 1424: fv3jedi_get_crtm_test_data
7/7 Test #1424: fv3jedi_get_crtm_test_data .............   Passed    0.21 sec

57% tests passed, 3 tests failed out of 7

And when I use verbose mode, it tells me a directory is missing:

(base) charrop@Orion-login-4:/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build> ctest -VV -R  get_ioda_test_data
UpdateCTestConfiguration  from :/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/DartConfiguration.tcl
Parse Config file:/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/DartConfiguration.tcl
UpdateCTestConfiguration  from :/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/DartConfiguration.tcl
Parse Config file:/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/DartConfiguration.tcl
Test project /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 778
    Start 778: get_ioda_test_data

778: Test command: /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/bin/ioda_data_checker.py "/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/jedi-bundle/ioda-data"
778: Environment variables: 
778:  OMP_NUM_THREADS=1
778: Test timeout computed to be: 1500
778: /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/jedi-bundle/ioda-data does not exist
1/1 Test #778: get_ioda_test_data ...............***Failed    0.27 sec

0% tests passed, 1 tests failed out of 1

This is different behavior from what I was seeing before, and it happens with develop only. I’m using the exact same commands/scripts for 5.0.0, and that works fine.

I don’t know what to make of this, to be honest. But I ran the ctests for develop with a slightly newer spack-stack version (1.4.1) and got all tests to pass on Orion. The 1.4.1 version was created specifically for UFS and shouldn’t give different results for jedi-bundle than 1.4.0, though. Here is the job_card that I used to run the tests; I load the same modules to build the code. Make sure you have nothing in your ~/.bashrc, ~/.profile, etc. that modifies the environment. Dirty user environments have led to many problems in the past.

#!/usr/bin/bash
#SBATCH --job-name=ctest-jedi-bundle
#SBATCH --nodes=1
#SBATCH --tasks-per-node=24
#SBATCH --account=da-cpu
#SBATCH --partition=orion
#SBATCH --qos=batch
#SBATCH --time=08:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=dom.heinzeller@ucar.edu

# Insert the module purge and load statements in here

module purge
module use /work/noaa/da/role-da/spack-stack/modulefiles
module load miniconda/3.9.7
module load ecflow/5.8.4
module load mysql/8.0.31

module use /work/noaa/epic/role-epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/modulefiles/Core
module load stack-intel/2022.0.2
module load stack-intel-oneapi-mpi/2021.5.1
module load stack-python/3.9.7

module load jedi-fv3-env/unified-dev
module load ewok-env/unified-dev
module load soca-env/unified-dev
module unload crtm

module list
ulimit -s unlimited
ulimit -v unlimited

export SLURM_EXPORT_ENV=ALL
export HDF5_USE_FILE_LOCKING=FALSE

cd /work2/noaa/da/dheinzel-new/skylab-test-20230815-orion/build-release
ctest -E get_

The output:

100% tests passed, 0 tests failed out of 1926

Label Time Summary:
CRTM_Tests            = 228.32 sec*proc (154 tests)
GEOS                  =  23.11 sec*proc (4 tests)
HofX                  =  43.86 sec*proc (11 tests)
Q                     =   5.77 sec*proc (1 test)
QC                    =  67.36 sec*proc (12 tests)
TLAD                  =   4.24 sec*proc (1 test)
UV                    =   5.75 sec*proc (1 test)
actions               =  13.15 sec*proc (4 tests)
aircraft              =  10.40 sec*proc (2 tests)
compo                 =  10.77 sec*proc (2 tests)
coupling              = 181.76 sec*proc (9 tests)
crtm                  = 154.19 sec*proc (51 tests)
crtm_tests            = 228.32 sec*proc (154 tests)
errors                =  81.32 sec*proc (8 tests)
executable            = 549.03 sec*proc (219 tests)
femps                 =   8.77 sec*proc (1 test)
filters               = 778.76 sec*proc (172 tests)
fortran               =  12.41 sec*proc (3 tests)
fov                   =   2.82 sec*proc (2 tests)
fv3-jedi              = 13455.83 sec*proc (125 tests)
fv3jedi               = 13459.23 sec*proc (126 tests)
gnssro                =   5.65 sec*proc (1 test)
gsw                   =   6.81 sec*proc (6 tests)
instrument            = 121.99 sec*proc (25 tests)
ioda                  = 300.46 sec*proc (285 tests)
metoffice             =   6.22 sec*proc (2 tests)
mpi                   = 17377.59 sec*proc (855 tests)
obsfunctions          = 193.41 sec*proc (78 tests)
oops                  = 780.36 sec*proc (303 tests)
openmp                = 1116.22 sec*proc (236 tests)
operators             = 293.77 sec*proc (141 tests)
ozone                 =   2.66 sec*proc (2 tests)
predictors            =  56.17 sec*proc (20 tests)
profile               = 119.84 sec*proc (41 tests)
radarVAD              =   9.73 sec*proc (2 tests)
rass                  =   9.54 sec*proc (2 tests)
saber                 = 973.46 sec*proc (175 tests)
satwinds              =  10.00 sec*proc (2 tests)
scatwinds             =   9.69 sec*proc (2 tests)
script                = 17546.07 sec*proc (1596 tests)
sfcLand               =   9.81 sec*proc (2 tests)
sfcMarine             =  18.09 sec*proc (3 tests)
soca                  = 532.71 sec*proc (74 tests)
sonde                 =  25.65 sec*proc (5 tests)
ufo                   = 1600.73 sec*proc (468 tests)
ufo_data              =  60.83 sec*proc (302 tests)
ufo_data_validate     =  60.83 sec*proc (302 tests)
unit_tests            = 265.89 sec*proc (89 tests)
utils                 =   2.94 sec*proc (2 tests)
vader                 =  10.12 sec*proc (28 tests)
variablenamemap       =   1.59 sec*proc (1 test)
variabletransforms    =  87.07 sec*proc (26 tests)

Total Test time (real) = 18150.89 sec
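
As a quick way to spot anything suspicious (just a suggestion on my part, not something from the official instructions), you can grep your shell init files for lines that touch modules, conda, or exported variables:

grep -nE 'module (use|load)|conda |export ' ~/.bashrc ~/.bash_profile ~/.profile 2>/dev/null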

I’m trying again after removing anything suspicious from my bash init files. But now, for develop, I’m getting failures when running ctest -R get_ and ctest -R bumpparameters on an Orion front end. It was my impression that this is needed before doing the ctest -E get_ on the compute node. It worked earlier, but now it doesn’t.

charrop@Orion-login-3:/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build> ctest -R get_
Test project /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build
    Start  507: get_crtm_coeffs
1/7 Test  #507: get_crtm_coeffs ........................   Passed   56.91 sec
    Start  778: get_ioda_test_data
2/7 Test  #778: get_ioda_test_data .....................***Failed    0.17 sec
    Start  952: ufo_get_ufo_test_data
3/7 Test  #952: ufo_get_ufo_test_data ..................***Failed    0.15 sec
    Start  953: ufo_get_crtm_test_data
4/7 Test  #953: ufo_get_crtm_test_data .................   Passed    0.29 sec
    Start  982: test_ufo_geovals_get_nonexistent_var
5/7 Test  #982: test_ufo_geovals_get_nonexistent_var ...   Passed    0.28 sec
    Start 1424: fv3jedi_get_fv3-jedi_test_data
6/7 Test #1424: fv3jedi_get_fv3-jedi_test_data .........***Failed    0.15 sec
    Start 1425: fv3jedi_get_crtm_test_data
7/7 Test #1425: fv3jedi_get_crtm_test_data .............   Passed    0.27 sec

57% tests passed, 3 tests failed out of 7

Do you know why I would be getting errors about directories not existing?

charrop@Orion-login-3:/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build> ctest -VV -R get_ioda_test_data
UpdateCTestConfiguration  from :/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/DartConfiguration.tcl
Parse Config file:/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/DartConfiguration.tcl
UpdateCTestConfiguration  from :/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/DartConfiguration.tcl
Parse Config file:/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/DartConfiguration.tcl
Test project /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 778
    Start 778: get_ioda_test_data

778: Test command: /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/build/bin/ioda_data_checker.py "/work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/jedi-bundle/ioda-data"
778: Environment variables: 
778:  OMP_NUM_THREADS=1
778: Test timeout computed to be: 1500
778: /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/jedi-bundle/ioda-data does not exist
1/1 Test #778: get_ioda_test_data ...............***Failed    0.15 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
ioda      =   0.15 sec*proc (1 test)
script    =   0.15 sec*proc (1 test)

Total Test time (real) =   0.62 sec

The following tests FAILED:
	778 - get_ioda_test_data (Failed)

I don’t see any instructions about creating these directories, and I didn’t need to do anything extra in my previous attempt on Monday.

It looks like if I mkdir the missing directories manually, I can get the tests to pass. But it’s confusing because this behavior is new.
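
For the record, the workaround is just creating the directories the get_ tests complain about and re-running them (only ioda-data is named explicitly in the log above; the ufo and fv3-jedi directory names below are my guess based on the corresponding test names):

cd /work/noaa/gsd-hpcs/charrop/SENA/JEDI/develop/jedi-bundle
mkdir -p ioda-data ufo-data fv3-jedi-data
cd ../build
ctest -R get_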

Hi, I also tried to run the ctests on Orion. However, I only got 39% tests passed, with 993 tests failed out of 1617.
I cloned with git clone https://github.com/JCSDA/jedi-bundle.
Below is the job_card that I used to run the tests. Could you please help me check what’s going on?

#!/usr/bin/bash
#SBATCH --job-name=ctest-jedi-bundle
#SBATCH --nodes=1
#SBATCH --tasks-per-node=24
#SBATCH --account=aoml-hafsda
#SBATCH --partition=orion
#SBATCH --qos=batch
#SBATCH --time=08:00:00
#SBATCH --output=./ctest.out
#SBATCH --error=./ctest.err

# Insert the module purge and load statements in here

module purge
module use /work/noaa/da/role-da/spack-stack/modulefiles
module load miniconda/3.9.7
module load ecflow/5.8.4
module load mysql/8.0.31

module use /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0/envs/unified-env-v2/install/modulefiles/Core
module load stack-gcc/10.2.0
module load stack-openmpi/4.0.4
module load stack-python/3.9.7

module load jedi-fv3-env/unified-dev
module load ewok-env/unified-dev
module load soca-env/unified-dev
module unload crtm

module list
ulimit -s unlimited
ulimit -v unlimited

export SLURM_EXPORT_ENV=ALL
export HDF5_USE_FILE_LOCKING=FALSE

cd /work/noaa/aoml-hafsda/danwu/jedi/build/
ctest -E get_

@climbfuji - Did you produce these results for develop of jedi-bundle from github.com/JCSDA-internal/ or github.com/JCSDA?

@DWu - If you don’t first run ctest -R get_ and possibly also ctest -R bumpparameters on a front-end to download the data, you will get a LOT of test failures. Make sure you’ve run those tests on the front-end first, and then do the ctest -E get_ on a compute node to run the full suite.

JCSDA-internal, although in theory (…) the develop branches of all the internal repos get pushed to JCSDA (public) automatically.

As far as I know, bumpparameters doesn’t need to run on the login node and doesn’t need to run first (the dependencies should be set up correctly).


@cwharrop Thanks for your reply.
I ran ctest -R get_ as suggested and got the same error as you: the non-existent directories. I manually created those directories and got the tests to pass.

Then I ran ctest -R bumpparameters on the front end; however, all of the tests failed.
0% tests passed, 5 tests failed out of 5

Label Time Summary:
fv3-jedi   =  63.91 sec*proc (5 tests)
fv3jedi    =  63.91 sec*proc (5 tests)
mpi        =  63.91 sec*proc (5 tests)
script     =  63.91 sec*proc (5 tests)

Total Test time (real) = 67.42 sec

The following tests FAILED:
	1492 - fv3jedi_test_tier1_bumpparameters_nicas_geos (Failed)
	1493 - fv3jedi_test_tier1_bumpparameters_nicas_geos_cf (Failed)
	1494 - fv3jedi_test_tier1_bumpparameters_nicas_gfs (Failed)
	1495 - fv3jedi_test_tier1_bumpparameters_nicas_gfs_aero (Failed)
	1496 - fv3jedi_test_tier1_bumpparameters_nicas_lam_cmaq (Failed)

The error message looks like this:
NetCDF error, aborting …, nf90_open Data/fv3files/akbk127.nc4. Error code: NetCDF: Unknown file format
NetCDF error, aborting …, nf90_open Data/fv3files/akbk127.nc4. Error code: NetCDF: Unknown file format
NetCDF error, aborting …, nf90_open Data/fv3files/akbk127.nc4. Error code: NetCDF: Unknown file format
[Orion-login-1.HPC.MsState.Edu:277583] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[Orion-login-1.HPC.MsState.Edu:277583] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[Orion-login-1.HPC.MsState.Edu:277583] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193

Does anyone know why I got this error? Thanks!

@DWu - It sounds like you need to run git lfs install on Orion after you load the modules, and before you clone the repositories.

@DWu - To clarify, you only have to run git lfs install once. The reason I suggested doing it after loading the modules is that it’s probably better to make sure you have the proper versions of git and git-lfs in your environment when you run that command. You have to do it BEFORE you clone because it is what enables downloading of LFS files during the cloning process. If you do it after, there is a git lfs command you can run to grab the files, but I forget what it is right now (maybe git lfs fetch or something, I don’t remember).
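
Roughly, the sequence I have in mind is the following (the fetch/checkout lines at the end are my best recollection of the recovery commands for a repository that was cloned before LFS was enabled, so double-check them):

# after loading the modules, so the stack's git and git-lfs are first in your PATH
git lfs install                       # one-time per user; enables LFS during clone
git clone --branch develop https://github.com/JCSDA/jedi-bundle
# if a repository was already cloned without LFS enabled, try, from inside it:
git lfs fetch --all
git lfs checkout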

@cwharrop I tried git lfs install before cloning the repositories. However, I still cannot pass ctest -R bumpparameters. And for ctest -E get_, I got only 57% tests passed, with 701 tests failed out of 1615.

Below is the error message. It seems I don’t have permission to access the needed files.

Running case 0: distribution/Distribution/testConstructor …
Completed case 0: distribution/Distribution/testConstructor
Running case 1: distribution/Distribution/testDistributionConstructedManually …
Completed case 1: distribution/Distribution/testDistributionConstructedManually
Running case 2: distribution/Distribution/testDistributionConstructedByObsSpace …
HDF5-DIAG: Error detected in HDF5 (1.14.0) MPI-process 1:
#000: /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0/cache/build_stage/spack-stage-hdf5-1.14.0-aq2zjiuzjn3c7ocaky4vzc7wispcxs6s/spack-src/src/H5F.c line 836 in H5Fopen(): unable to synchronously open file
major: File accessibility
minor: Unable to open file
#001: /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0/cache/build_stage/spack-stage-hdf5-1.14.0-aq2zjiuzjn3c7ocaky4vzc7wispcxs6s/spack-src/src/H5F.c line 796 in H5F__open_api_common(): unable to open file
major: File accessibility
minor: Unable to open file

@DWu - I am also having trouble with git lfs not downloading the data for the tests. I don’t know what is going on. I’ve tried a few different things, but the data directories remain unpopulated and git lfs thinks there’s nothing to download. I didn’t have any of these issues with 5.0.0. Not only that, but it had worked for me a couple of times earlier, and now it doesn’t.

Can you check whether you get an error message from git saying “rate limit exceeded”?
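
If it isn’t obvious from the ctest output, something like this (a sketch; adjust the path to wherever the data repository ended up in your bundle) should surface any rate-limit message from GitHub:

cd /path/to/jedi-bundle/ioda-data            # or whichever data repository is staying empty
git lfs logs last                            # shows the most recent git-lfs error, if any
GIT_TRACE=1 git lfs pull 2>&1 | grep -iE 'rate limit|error'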