FV3-JEDI ctest failed with core dumped after turning on "debug"

I turned on the compiler's debug mode to track down a NaN issue in my case:

ecbuild --build=debug …

Compilation succeeded, but some unit tests, such as the geometry tests, failed in ctest:

  Start  1: fv3jedi_test_tier1_coding_norms

1/71 Test #1: fv3jedi_test_tier1_coding_norms … Passed 2.94 sec
Start 2: fv3jedi_test_tier1_get_test_data_fv3-jedi
2/71 Test #2: fv3jedi_test_tier1_get_test_data_fv3-jedi … Passed 21.53 sec
Start 3: fv3jedi_test_tier1_get_test_data_ioda
3/71 Test #3: fv3jedi_test_tier1_get_test_data_ioda … Passed 10.39 sec
Start 4: fv3jedi_test_tier1_get_test_data_crtm
4/71 Test #4: fv3jedi_test_tier1_get_test_data_crtm … Passed 184.67 sec
Start 5: fv3jedi_test_tier1_geometry_gfs
5/71 Test #5: fv3jedi_test_tier1_geometry_gfs …***Failed 8.54 sec
Start 6: fv3jedi_test_tier1_geometry_gfs127
6/71 Test #6: fv3jedi_test_tier1_geometry_gfs127 …***Failed 17.31 sec
Start 7: fv3jedi_test_tier1_geometry_geos
7/71 Test #7: fv3jedi_test_tier1_geometry_geos …***Failed 16.11 sec
Start 8: fv3jedi_test_tier1_geometry_lam_cmaq
8/71 Test #8: fv3jedi_test_tier1_geometry_lam_cmaq …***Failed 20.25 sec
Start 9: fv3jedi_test_tier1_geometry_2d
9/71 Test #9: fv3jedi_test_tier1_geometry_2d …***Failed 15.21 sec

BTW, the same tests pass when --build=debug is turned off. Has anyone encountered a similar issue?

Thanks

Hi @ytangnoaa - what system are you running on, and what compiler suite? Also, have you made sure you have sufficient memory with
ulimit -s unlimited
ulimit -v unlimited

Also, what is the error message from the fv3jedi_test_tier1_geometry_gfs test?

I am running on Hera with intel/18.0.5.274 (the newer intel/2020.2 gave me an internal compiler error during compilation). I tried:

ulimit -s unlimited
bash: ulimit: stack size: cannot modify limit: Operation not permitted
ulimit -v unlimited (OK)
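
The "Operation not permitted" message usually means the hard stack limit for the session is below "unlimited", so the soft limit cannot be raised that far. A minimal sketch for checking both limits and raising the soft limit as far as the hard limit allows (the function name report_stack_limits is only illustrative, and whether the resulting stack is large enough for the debug build is not something this check can tell you):

```shell
#!/usr/bin/env bash
# Print the soft and hard stack limits for the current shell.
report_stack_limits() {
    echo "stack soft limit: $(ulimit -Ss)"
    echo "stack hard limit: $(ulimit -Hs)"
}

report_stack_limits

# An unprivileged user may raise the soft limit up to the hard limit, not past it.
if [ "$(ulimit -Ss)" != "$(ulimit -Hs)" ]; then
    ulimit -Ss "$(ulimit -Hs)"
    report_stack_limits
fi
```

If the hard limit itself is too small, it has to be changed by the system administrators or through the batch system's own settings.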

With the debug option, the gfs test still failed:

ctest -VV -R fv3jedi_test_tier1_geometry_gfs
UpdateCTestConfiguration from :/home/YouHua.Tang/noscrub/test/build2/DartConfiguration.tcl
Parse Config file:/home/YouHua.Tang/noscrub/test/build2/DartConfiguration.tcl
UpdateCTestConfiguration from :/home/YouHua.Tang/noscrub/test/build2/DartConfiguration.tcl
Parse Config file:/home/YouHua.Tang/noscrub/test/build2/DartConfiguration.tcl
Test project /home/YouHua.Tang/noscrub/test/build2
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph…
Checking test dependency graph end
test 1017
Start 1017: fv3jedi_test_tier1_geometry_gfs

1017: Test command: /apps/slurm/default/bin/srun "-n" "6" "-q" "debug" "/home/YouHua.Tang/noscrub/test/build2/bin/test_fv3jedi_geometry.x" "testinput/geometry_gfs.yaml"
1017: Environment variables:
1017: OOPS_TRAPFPE=0
1017: OMP_NUM_THREADS=1
1017: Test timeout computed to be: 1500
1017: srun: job 17685159 queued and waiting for resources
1017: srun: job 17685159 has been allocated resources
1017: OOPS Starting 2021-04-05 21:09:16 (UTC+0000)
1017: OOPS Starting 2021-04-05 21:09:16 (UTC+0000)
1017: OOPS Starting 2021-04-05 21:09:16 (UTC+0000)
1017: OOPS Starting 2021-04-05 21:09:16 (UTC+0000)
1017: OOPS Starting 2021-04-05 21:09:16 (UTC+0000)
1017: OOPS Starting 2021-04-05 21:09:16 (UTC+0000)
1017: Configuration input file is: testinput/geometry_gfs.yaml
1017: Full configuration is:YAMLConfiguration[path=testinput/geometry_gfs.yaml, root={geometry => {nml_file_mpp => Data/fv3files/fmsmpp.nml , trc_file => Data/fv3files/field_table , akbk => Data/fv3files/akbk64.nc4 , layout => (1,1) , io_layout => (1,1) , npx => 13 , npy => 13 , npz => 64 , ntiles => 6 , do_write_geom => true , fieldsets => ({fieldset => Data/fieldsets/aerosols_gfs.yaml},{fieldset => Data/fieldsets/dynamics.yaml},{fieldset => Data/fieldsets/ufo.yaml})}}]
1017: OOPS_STATS ObjectCountHelper started.
1017: Run: Starting oops::Test running test::Geometry
1017: Running 1 tests:
1017: Running 1 tests:
1017: Running 1 tests:
1017: Running 1 tests:
1017: Running 1 tests:
1017: Running case "interface/Geometry/testConstructor" …
1017: Running case "interface/Geometry/testConstructor" …
1017: Running case "interface/Geometry/testConstructor" …
1017: Running case "interface/Geometry/testConstructor" …
1017: Running 1 tests:
1017: Running case "interface/Geometry/testConstructor" …
1017: Running case "interface/Geometry/testConstructor" …
1017: srun: error: h23c08: tasks 0-5: Segmentation fault (core dumped)
1017: srun: launch/slurm: _step_signal: Terminating StepId=17685159.0
1/2 Test #1017: fv3jedi_test_tier1_geometry_gfs …***Failed 11.91 sec
test 1018
Start 1018: fv3jedi_test_tier1_geometry_gfs127

1018: Test command: /apps/slurm/default/bin/srun "-n" "6" "-q" "debug" "/home/YouHua.Tang/noscrub/test/build2/bin/test_fv3jedi_geometry.x" "testinput/geometry_gfs127.yaml"
1018: Environment variables:
1018: OOPS_TRAPFPE=0
1018: OMP_NUM_THREADS=1
1018: Test timeout computed to be: 1500
1018: srun: job 17685208 queued and waiting for resources
1018: srun: job 17685208 has been allocated resources
1018: OOPS Starting 2021-04-05 21:09:28 (UTC+0000)
1018: OOPS Starting 2021-04-05 21:09:28 (UTC+0000)
1018: OOPS Starting 2021-04-05 21:09:28 (UTC+0000)
1018: OOPS Starting 2021-04-05 21:09:28 (UTC+0000)
1018: OOPS Starting 2021-04-05 21:09:28 (UTC+0000)
1018: OOPS Starting 2021-04-05 21:09:28 (UTC+0000)
1018: Configuration input file is: testinput/geometry_gfs127.yaml
1018: Full configuration is:YAMLConfiguration[path=testinput/geometry_gfs127.yaml, root={geometry => {nml_file_mpp => Data/fv3files/fmsmpp.nml , trc_file => Data/fv3files/field_table , akbk => Data/fv3files/akbk127.nc4 , layout => (1,1) , io_layout => (1,1) , npx => 13 , npy => 13 , npz => 127 , ntiles => 6 , do_write_geom => true , fieldsets => ({fieldset => Data/fieldsets/aerosols_gfs.yaml},{fieldset => Data/fieldsets/dynamics.yaml},{fieldset => Data/fieldsets/ufo.yaml})}}]
1018: OOPS_STATS ObjectCountHelper started.
1018: Run: Starting oops::Test running test::Geometry
1018: Running 1 tests:
1018: Running 1 tests:
1018: Running 1 tests:
1018: Running case "interface/Geometry/testConstructor" …
1018: Running case "interface/Geometry/testConstructor" …
1018: Running case "interface/Geometry/testConstructor" …
1018: Running 1 tests:
1018: Running 1 tests:
1018: Running case "interface/Geometry/testConstructor" …
1018: Running case "interface/Geometry/testConstructor" …
1018: Running 1 tests:
1018: Running case "interface/Geometry/testConstructor" …
1018: srun: error: h25c01: tasks 0-5: Segmentation fault (core dumped)
1018: srun: launch/slurm: _step_signal: Terminating StepId=17685208.0
2/2 Test #1018: fv3jedi_test_tier1_geometry_gfs127 …***Failed 11.95 sec

0% tests passed, 2 tests failed out of 2

Label Time Summary:
fv3-jedi = 23.86 sec*proc (2 tests)
fv3jedi = 23.86 sec*proc (2 tests)
mpi = 23.86 sec*proc (2 tests)
script = 23.86 sec*proc (2 tests)

Total Test time (real) = 24.25 sec

The following tests FAILED:
1017 - fv3jedi_test_tier1_geometry_gfs (Failed)
1018 - fv3jedi_test_tier1_geometry_gfs127 (Failed)
Errors while running CTest
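
When many of the 71 tests are involved, it can help to pull the failing test names out of a saved ctest log before rerunning them one by one. A minimal sketch, assuming the log has the same line format as the output above (the function name failed_tests and the log file name ctest.log are only illustrative):

```shell
#!/usr/bin/env bash
# Read ctest output on stdin and print the name of every test marked ***Failed.
# The test name is the field that follows the "#<number>:" token on each line.
failed_tests() {
    awk '/\*\*\*Failed/ {
        for (i = 1; i < NF; i++) {
            if ($i ~ /^#[0-9]+:$/) { print $(i + 1); break }
        }
    }'
}
```

For example, after saving the output with "ctest 2>&1 | tee ctest.log", running "failed_tests < ctest.log" lists the failing names, and each one can then be passed to "ctest -VV -R" as above. ctest also provides --rerun-failed and --output-on-failure for rerunning only the failed tests and showing their output.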

Is this the same problem that @TingLei-NOAA found?

It seems not. This issue occurs only after turning on the --build=debug option, and only in FV3-JEDI tests such as geometry. Other unit tests, like qg and l95, pass with debug on.

@CoryMartin-NOAA, thanks for bringing this discussion to my attention. @ytangnoaa, I am not sure whether you are hitting the same issue I encountered. On Hera, when I first turned on debug mode, quite a few tests such as hyb-gfs failed, and I traced the failures to FMS. To quote my earlier words: "It was found the issue occurred in mpp_util_mpi.inc of FMS when fms compiled with option use_libMPI (for more details, we can discuss off-line). Changes to use_libSMA would avoid this problem." The tricky part is that use_libSMA seems to come from the recent changes in the 19.0.5 module setup by @Ryan.Honeyager. I haven't found where to change it from my side.
So, if you trace the problem to the FMS MPI initialization step, you can try what worked for me.
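
To see which of the two options a given build actually used, one rough approach is to search the build tree for the corresponding preprocessor definitions. The sketch below assumes a CMake Makefile build that records compile definitions in flags.make files, that the options are passed as -Duse_libMPI or -Duse_libSMA, and that GNU grep is available; the function name fms_comm_flags is only illustrative.

```shell
#!/usr/bin/env bash
# List the distinct FMS communication definitions found in a build tree.
# $1: path to the build directory.
fms_comm_flags() {
    grep -rhoE --include='flags.make' -e '-Duse_lib(MPI|SMA)' "$1" | sort -u
}
```

For the build above this would be called as "fms_comm_flags /home/YouHua.Tang/noscrub/test/build2". An empty result only means the definitions were not found in that form, not that neither option is in effect.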