@climbfuji - I thought I would post an update on my Hercules experiences thus far. I am curious to know if this is consistent with your observations.
For the sake of completeness, this post is verbose and includes everything I’m doing so that there are no hidden steps. I hope others find it useful, and I’m sure there are improvements to be made. All scripts embed the appropriate Slurm directives and are submitted using sbatch <script_name>.
I can build jedi-bundle on Hercules just fine now, but I am getting a large number of ctest failures in some repositories (coupling, crtm, fv3-jedi, and ufo). Here is a summary list of test results:
[charrop@hercules-login-1 develop]$ grep "tests pass" ctest.*.out
ctest.coupling.out:63% tests passed, 3 tests failed out of 8
ctest.crtm.out:17% tests passed, 130 tests failed out of 156
ctest.femps.out:100% tests passed, 0 tests failed out of 1
ctest.fv3-jedi.out:90% tests passed, 13 tests failed out of 129
ctest.gsw.out:100% tests passed, 0 tests failed out of 2
ctest.ioda.out:100% tests passed, 0 tests failed out of 308
ctest.oops.out:100% tests passed, 0 tests failed out of 304
ctest.saber.out:100% tests passed, 0 tests failed out of 175
ctest.soca.out:100% tests passed, 0 tests failed out of 74
ctest.ufo-data.out:100% tests passed, 0 tests failed out of 302
ctest.ufo.out:90% tests passed, 46 tests failed out of 469
ctest.vader.out:100% tests passed, 0 tests failed out of 28
[charrop@hercules-login-1 develop]$
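To make the failing repositories stand out in a summary like the one above, the grep output can be filtered further. This is only a minimal sketch: the function name failing_repos is made up for illustration, and it assumes the summary lines have exactly the "ctest.<repo>.out:NN% tests passed, N tests failed out of M" shape shown above (which is what grep prints when it is given more than one file).

```shell
#!/usr/bin/bash
# failing_repos (hypothetical helper): read ctest summary lines on stdin and
# print "<repo> <failed>/<total>" only for repositories with failures.
failing_repos() {
  sed -n -E 's#^ctest\.(.+)\.out:[0-9]+% tests passed, ([0-9]+) tests failed out of ([0-9]+)$#\1 \2/\3#p' \
    | awk '$2 !~ /^0\//'   # drop repositories reporting 0 failures
}

# Usage, from the directory that holds the ctest.*.out files:
#   grep "tests pass" ctest.*.out | failing_repos
```

With the summary above as input, this should reduce the twelve lines to the four repositories that report failures.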
For building and testing I am using this environment setup:
[charrop@hercules-login-1 JEDI]$ cat setupenv-hercules.sh
module purge
module use /work/noaa/epic-ps/role-epic-ps/spack-stack/modulefiles
module load ecflow/5.8.4-hercules
module load mysql/8.0.31-hercules
module use /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/envs/unified-env-v2/install/modulefiles/Core
module load stack-intel/2021.7.1
module load stack-intel-oneapi-mpi/2021.7.1
module load stack-python/3.9.14
module load jedi-fv3-env/unified-dev
module load ewok-env/unified-dev
module load soca-env/unified-dev
module unload crtm
module list
ulimit -s unlimited
ulimit -v unlimited
export SLURM_EXPORT_ENV=ALL
export HDF5_USE_FILE_LOCKING=FALSE
[charrop@hercules-login-1 JEDI]$
I am not doing anything special for the ecbuild or make steps. I am just using the following scripts, submitted to the service partition when internet access is needed and to the compute partition (hercules) otherwise.
For cloning:
[charrop@hercules-login-1 JEDI]$ cat clone.sh
#!/usr/bin/bash
#SBATCH -A gsd-hpcs
#SBATCH --time=00:10:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --partition=service
#SBATCH --qos=batch
source /etc/bashrc
# Get version to install
export JEDI_VERSION=${1:-develop}
# Set location of JEDI source and build
export WORK=/work/noaa/gsd-hpcs/charrop/hercules/SENA/JEDI
#export WORK=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
export JEDI_ROOT=${WORK}/${JEDI_VERSION}
export JEDI_SRC=${JEDI_ROOT}/jedi-bundle
export JEDI_BUILD=${JEDI_ROOT}/build
# Setup software environment
. ${WORK}/setupenv-hercules.sh
# Make sure git-lfs activated
git lfs install --skip-repo
# Clone jedi-bundle
rm -rf ${JEDI_ROOT}
mkdir -p ${JEDI_ROOT}
cd ${JEDI_ROOT}
git clone --branch ${JEDI_VERSION} https://github.com/JCSDA-internal/jedi-bundle > ${JEDI_ROOT}/clone.out 2>&1
[charrop@hercules-login-1 JEDI]$
For ecbuild:
[charrop@hercules-login-1 JEDI]$ cat ecbuild.sh
#!/usr/bin/bash
#SBATCH -A gsd-hpcs
#SBATCH --time=00:30:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --partition=service
#SBATCH --qos=batch
source /etc/bashrc
# Get version to install
export JEDI_VERSION=${1:-develop}
# Set location of JEDI source and build
export WORK=/work/noaa/gsd-hpcs/charrop/hercules/SENA/JEDI
#export WORK=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
export JEDI_ROOT=${WORK}/${JEDI_VERSION}
export JEDI_SRC=${JEDI_ROOT}/jedi-bundle
export JEDI_BUILD=${JEDI_ROOT}/build
# Setup software environment
. ${WORK}/setupenv-hercules.sh
# Run ecbuild
rm -rf ${JEDI_BUILD}
mkdir -p ${JEDI_BUILD}
cd ${JEDI_BUILD}
ecbuild ${JEDI_SRC} > ${JEDI_ROOT}/ecbuild.out 2>&1
For make:
[charrop@hercules-login-1 JEDI]$ cat make.sh
#!/usr/bin/bash
#SBATCH -A gsd-hpcs
#SBATCH --time=02:00:00
#SBATCH -N 1
#SBATCH -n 24
#SBATCH --partition=hercules
#SBATCH --qos=batch
source /etc/bashrc
# Get version to install
export JEDI_VERSION=${1:-develop}
# Set location of JEDI source and build
export WORK=/work/noaa/gsd-hpcs/charrop/hercules/SENA/JEDI
#export WORK=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
export JEDI_ROOT=${WORK}/${JEDI_VERSION}
export JEDI_SRC=${JEDI_ROOT}/jedi-bundle
export JEDI_BUILD=${JEDI_ROOT}/build
# Setup software environment
. ${WORK}/setupenv-hercules.sh
# Run make
cd ${JEDI_BUILD}
make -j 24 VERBOSE=1 > ${JEDI_ROOT}/make.out 2>&1
[charrop@hercules-login-1 JEDI]$
For retrieving the test data:
[charrop@hercules-login-1 JEDI]$ cat get_test_data.sh
#!/usr/bin/bash
#SBATCH -A gsd-hpcs
#SBATCH --time=01:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --partition=service
#SBATCH --qos=batch
source /etc/bashrc
# Get version to install
export JEDI_VERSION=${1:-develop}
# Set location of JEDI source and build
export WORK=/work/noaa/gsd-hpcs/charrop/hercules/SENA/JEDI
#export WORK=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
export JEDI_ROOT=${WORK}/${JEDI_VERSION}
export JEDI_SRC=${JEDI_ROOT}/jedi-bundle
export JEDI_BUILD=${JEDI_ROOT}/build
# Setup software environment
. ${WORK}/setupenv-hercules.sh
# Retrieve the test data (runs only the tests whose names match get_)
cd ${JEDI_BUILD}
ctest -R get_ > ${JEDI_ROOT}/get_test_data.out 2>&1
[charrop@hercules-login-1 JEDI]$
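The four scripts above have to run in order (clone, ecbuild, make, get test data). One possible way to avoid submitting each by hand is to chain them with Slurm job dependencies. The sketch below is not what I am doing now; submit_chain and the SBATCH_CMD override are hypothetical names introduced here, while --parsable and --dependency=afterok are standard sbatch options.

```shell
#!/usr/bin/bash
# submit_chain (hypothetical helper): submit each script in turn so that it
# starts only after the previous job finished successfully.
# SBATCH_CMD is an assumed override hook that allows a dry run with a stub.
submit_chain() {
  local version=$1; shift
  local prev="" jobid script
  for script in "$@"; do
    if [ -z "$prev" ]; then
      jobid=$(${SBATCH_CMD:-sbatch} --parsable "$script" "$version") || return 1
    else
      jobid=$(${SBATCH_CMD:-sbatch} --parsable --dependency=afterok:"$prev" \
                "$script" "$version") || return 1
    fi
    jobid=${jobid%%;*}          # --parsable may print "jobid;cluster"
    echo "$script -> $jobid"
    prev=$jobid
  done
}

# Usage:
#   submit_chain develop clone.sh ecbuild.sh make.sh get_test_data.sh
```

If any job in the chain fails, the jobs that depend on it will not start, so the output files (clone.out, ecbuild.out, make.out) would still need to be checked by hand.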
To run the tests on compute nodes I have to split them up by repository, because otherwise they won’t fit into the 8-hour maximum walltime of a compute job. To do this, I’m using the following two scripts.
For testing:
[charrop@hercules-login-1 JEDI]$ cat test.sh
#!/usr/bin/bash
# Get version to install
export JEDI_VERSION=${1:-develop}
# Set location of JEDI source and build
export WORK=/work/noaa/gsd-hpcs/charrop/hercules/SENA/JEDI
#export WORK=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
export JEDI_ROOT=${WORK}/${JEDI_VERSION}
export JEDI_SRC=${JEDI_ROOT}/jedi-bundle
export JEDI_BUILD=${JEDI_ROOT}/build
for r in coupling crtm femps fv3-jedi gsw ioda oops saber soca ufo ufo-data vader; do
sbatch ctest.sh ${JEDI_VERSION} $r
done
[charrop@hercules-login-1 JEDI]$
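The repository list in test.sh is hard-coded. A possible alternative is to derive it from the build tree, on the assumption that every testable component has a CTestTestfile.cmake in its top-level build subdirectory. The helper name list_test_repos is hypothetical, and the result may include directories that contain no tests, so treat this as a sketch rather than a drop-in replacement.

```shell
#!/usr/bin/bash
# list_test_repos (hypothetical helper): print the name of every immediate
# subdirectory of the build directory that contains a CTestTestfile.cmake.
list_test_repos() {
  local build=$1 d
  for d in "$build"/*/; do
    if [ -f "${d}CTestTestfile.cmake" ]; then
      basename "$d"
    fi
  done
}

# Usage inside test.sh, replacing the hard-coded list:
#   for r in $(list_test_repos "${JEDI_BUILD}"); do
#     sbatch ctest.sh "${JEDI_VERSION}" "$r"
#   done
```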
and
[charrop@hercules-login-1 JEDI]$ cat ctest.sh
#!/usr/bin/bash
#SBATCH -A gsd-hpcs
#SBATCH --time=08:00:00
#SBATCH -N 1
#SBATCH --partition=hercules
#SBATCH --qos=batch
source /etc/bashrc
# Get version to install
export JEDI_VERSION=${1:-develop}
# Get repository to test
export TEST_REPO=$2
# Set location of JEDI source and build
export WORK=/work/noaa/gsd-hpcs/charrop/hercules/SENA/JEDI
#export WORK=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
export JEDI_ROOT=${WORK}/${JEDI_VERSION}
export JEDI_SRC=${JEDI_ROOT}/jedi-bundle
export JEDI_BUILD=${JEDI_ROOT}/build
# Setup software environment
. ${WORK}/setupenv-hercules.sh
# Run ctest for this repository, excluding the get_ data-retrieval tests
cd ${JEDI_BUILD}/${TEST_REPO}
ctest -E get_ --timeout=3600 --rerun-failed --output-on-failure > ${JEDI_ROOT}/ctest.${TEST_REPO}.out 2>&1
[charrop@hercules-login-1 JEDI]$
The specific tests that are currently failing for me are listed below, grouped by repository in this order: coupling, crtm, fv3-jedi, ufo.
1 - coupling_coding_norms (Failed)
7 - test_coupled_hofx3d_fv3_mom6 (Failed)
8 - test_coupled_hofx3d_fv3_mom6_dontusemom6 (Failed)
The following tests FAILED:
1 - test_check_crtm (Failed)
2 - test_check_crtm_random (Failed)
3 - Unit_TL_TEST (Failed)
12 - Unit_test_emis_coeff_io_nc (Failed)
14 - test_forward_Simple_atms_n21 (Failed)
15 - test_forward_Simple_cris-fsr_n21 (Failed)
16 - test_forward_Simple_v.abi_g18 (Failed)
17 - test_forward_Simple_atms_npp (Failed)
18 - test_forward_Simple_cris399_npp (Failed)
19 - test_forward_Simple_v.abi_gr (Failed)
20 - test_forward_Simple_abi_g18 (Failed)
21 - test_forward_Simple_modis_aqua (Failed)
30 - test_forward_Zeeman_ssmis_f18 (Failed)
32 - test_forward_Zeeman_ssmis_f16 (Failed)
33 - test_forward_ChannelSubset_iasi_metop-b (Failed)
34 - test_forward_ClearSky_atms_n21 (Failed)
35 - test_forward_ClearSky_cris-fsr_n21 (Failed)
36 - test_forward_ClearSky_v.abi_g18 (Failed)
37 - test_forward_ClearSky_atms_npp (Failed)
38 - test_forward_ClearSky_cris399_npp (Failed)
39 - test_forward_ClearSky_v.abi_gr (Failed)
40 - test_forward_ClearSky_abi_g18 (Failed)
41 - test_forward_ClearSky_modis_aqua (Failed)
42 - test_forward_Aircraft_cris-fsr_n21 (Failed)
43 - test_forward_Aircraft_crisB1_npp (Failed)
44 - test_forward_ScatteringSwitch_atms_n21 (Failed)
45 - test_forward_ScatteringSwitch_cris-fsr_n21 (Failed)
46 - test_forward_ScatteringSwitch_v.abi_g18 (Failed)
47 - test_forward_ScatteringSwitch_atms_npp (Failed)
48 - test_forward_ScatteringSwitch_cris399_npp (Failed)
49 - test_forward_ScatteringSwitch_v.abi_gr (Failed)
50 - test_forward_ScatteringSwitch_abi_g18 (Failed)
51 - test_forward_ScatteringSwitch_modis_aqua (Failed)
52 - test_forward_SOI_atms_n21 (Failed)
53 - test_forward_SOI_cris-fsr_n21 (Failed)
54 - test_forward_SOI_v.abi_g18 (Failed)
55 - test_forward_SOI_atms_npp (Failed)
56 - test_forward_SOI_cris399_npp (Failed)
57 - test_forward_SOI_v.abi_gr (Failed)
58 - test_forward_SOI_abi_g18 (Failed)
59 - test_forward_SOI_modis_aqua (Failed)
60 - test_forward_SSU_ssu_n06 (Failed)
61 - test_forward_SSU_ssu_n14 (Failed)
62 - test_forward_VerticalCoordinates_atms_n21 (Failed)
63 - test_forward_VerticalCoordinates_cris-fsr_n21 (Failed)
64 - test_forward_VerticalCoordinates_v.abi_g18 (Failed)
65 - test_forward_VerticalCoordinates_atms_npp (Failed)
66 - test_forward_VerticalCoordinates_cris399_npp (Failed)
67 - test_forward_VerticalCoordinates_v.abi_gr (Failed)
68 - test_forward_VerticalCoordinates_abi_g18 (Failed)
69 - test_forward_VerticalCoordinates_modis_aqua (Failed)
70 - test_k_matrix_Simple_atms_n21 (Failed)
71 - test_k_matrix_Simple_cris-fsr_n21 (Failed)
72 - test_k_matrix_Simple_v.abi_g18 (Failed)
73 - test_k_matrix_Simple_atms_npp (Failed)
74 - test_k_matrix_Simple_cris399_npp (Failed)
75 - test_k_matrix_Simple_v.abi_gr (Failed)
76 - test_k_matrix_Simple_abi_g18 (Failed)
77 - test_k_matrix_Simple_modis_aqua (Failed)
85 - test_k_matrix_Zeeman_ssmis_f19 (Failed)
87 - test_k_matrix_Zeeman_ssmis_f17 (Failed)
88 - test_k_matrix_Zeeman_ssmis_f16 (Failed)
89 - test_k_matrix_ChannelSubset_iasi_metop-b (Failed)
90 - test_k_matrix_ClearSky_atms_n21 (Failed)
91 - test_k_matrix_ClearSky_cris-fsr_n21 (Failed)
92 - test_k_matrix_ClearSky_v.abi_g18 (Failed)
93 - test_k_matrix_ClearSky_atms_npp (Failed)
94 - test_k_matrix_ClearSky_cris399_npp (Failed)
95 - test_k_matrix_ClearSky_v.abi_gr (Failed)
96 - test_k_matrix_ClearSky_abi_g18 (Failed)
97 - test_k_matrix_ClearSky_modis_aqua (Failed)
98 - test_k_matrix_ScatteringSwitch_atms_n21 (Failed)
99 - test_k_matrix_ScatteringSwitch_cris-fsr_n21 (Failed)
100 - test_k_matrix_ScatteringSwitch_v.abi_g18 (Failed)
101 - test_k_matrix_ScatteringSwitch_atms_npp (Failed)
102 - test_k_matrix_ScatteringSwitch_cris399_npp (Failed)
103 - test_k_matrix_ScatteringSwitch_v.abi_gr (Failed)
104 - test_k_matrix_ScatteringSwitch_abi_g18 (Failed)
105 - test_k_matrix_ScatteringSwitch_modis_aqua (Failed)
106 - test_k_matrix_SOI_atms_n21 (Failed)
107 - test_k_matrix_SOI_cris-fsr_n21 (Failed)
108 - test_k_matrix_SOI_v.abi_g18 (Failed)
109 - test_k_matrix_SOI_atms_npp (Failed)
110 - test_k_matrix_SOI_cris399_npp (Failed)
111 - test_k_matrix_SOI_v.abi_gr (Failed)
112 - test_k_matrix_SOI_abi_g18 (Failed)
113 - test_k_matrix_SOI_modis_aqua (Failed)
114 - test_k_matrix_SSU_ssu_n06 (Failed)
115 - test_k_matrix_SSU_ssu_n14 (Failed)
116 - test_k_matrix_VerticalCoordinates_atms_n21 (Failed)
117 - test_k_matrix_VerticalCoordinates_cris-fsr_n21 (Failed)
118 - test_k_matrix_VerticalCoordinates_v.abi_g18 (Failed)
119 - test_k_matrix_VerticalCoordinates_atms_npp (Failed)
120 - test_k_matrix_VerticalCoordinates_cris399_npp (Failed)
121 - test_k_matrix_VerticalCoordinates_v.abi_gr (Failed)
122 - test_k_matrix_VerticalCoordinates_abi_g18 (Failed)
123 - test_k_matrix_VerticalCoordinates_modis_aqua (Failed)
124 - test_adjoint_Simple_atms_n21 (Failed)
125 - test_adjoint_Simple_cris-fsr_n21 (Failed)
126 - test_adjoint_Simple_v.abi_g18 (Failed)
127 - test_adjoint_Simple_atms_npp (Failed)
128 - test_adjoint_Simple_cris399_npp (Failed)
129 - test_adjoint_Simple_v.abi_gr (Failed)
130 - test_adjoint_Simple_abi_g18 (Failed)
131 - test_adjoint_Simple_modis_aqua (Failed)
132 - test_adjoint_ClearSky_atms_n21 (Failed)
133 - test_adjoint_ClearSky_cris-fsr_n21 (Failed)
134 - test_adjoint_ClearSky_v.abi_g18 (Failed)
135 - test_adjoint_ClearSky_atms_npp (Failed)
136 - test_adjoint_ClearSky_cris399_npp (Failed)
137 - test_adjoint_ClearSky_v.abi_gr (Failed)
138 - test_adjoint_ClearSky_abi_g18 (Failed)
139 - test_adjoint_ClearSky_modis_aqua (Failed)
140 - test_tangent_linear_Simple_atms_n21 (Failed)
141 - test_tangent_linear_Simple_cris-fsr_n21 (Failed)
142 - test_tangent_linear_Simple_v.abi_g18 (Failed)
143 - test_tangent_linear_Simple_atms_npp (Failed)
144 - test_tangent_linear_Simple_cris399_npp (Failed)
145 - test_tangent_linear_Simple_v.abi_gr (Failed)
146 - test_tangent_linear_Simple_abi_g18 (Failed)
147 - test_tangent_linear_Simple_modis_aqua (Failed)
148 - test_tangent_linear_ClearSky_atms_n21 (Failed)
149 - test_tangent_linear_ClearSky_cris-fsr_n21 (Failed)
150 - test_tangent_linear_ClearSky_v.abi_g18 (Failed)
151 - test_tangent_linear_ClearSky_atms_npp (Failed)
152 - test_tangent_linear_ClearSky_cris399_npp (Failed)
153 - test_tangent_linear_ClearSky_v.abi_gr (Failed)
154 - test_tangent_linear_ClearSky_abi_g18 (Failed)
155 - test_tangent_linear_ClearSky_modis_aqua (Failed)
156 - test_forward_OMPoverChannels_atms_npp (Failed)
The following tests FAILED:
61 - fv3jedi_test_tier1_hofx_fv3lm (Failed)
63 - fv3jedi_test_tier1_hofx_nomodel (Failed)
65 - fv3jedi_test_tier1_hofx_nomodel_amsua_radii (Failed)
66 - fv3jedi_test_tier1_hofx_nomodel_abi_radii (Failed)
87 - fv3jedi_test_tier1_hyb-3dvar (Failed)
103 - fv3jedi_test_tier1_hyb-4dvar_pseudo-geos (Failed)
112 - fv3jedi_test_tier1_diffstates_gfs (Failed)
113 - fv3jedi_test_tier1_diffstates_geos (Failed)
115 - fv3jedi_test_tier1_addincrement_gfs (Failed)
116 - fv3jedi_test_tier1_letkf (Failed)
119 - fv3jedi_test_tier1_lgetkf (Failed)
125 - fv3jedi_test_tier1_enshofx_fv3lm (Failed)
127 - fv3jedi_test_tier1_eda_3dvar (Failed)
The following tests FAILED:
139 - ufo_test_tier1_test_ufo_amsr2_qc (Failed)
140 - ufo_test_tier1_test_ufo_amsua_qc (Failed)
141 - ufo_test_tier1_test_ufo_amsua_allsky_gsi_qc (Failed)
142 - ufo_test_tier1_test_ufo_amsua_qc_clwretmw (Failed)
143 - ufo_test_tier1_test_ufo_amsua_qc_filters (Failed)
144 - ufo_test_tier1_test_ufo_amsua_qc_filters_geos (Failed)
145 - ufo_test_tier1_test_ufo_amsua_qc_miss_val (Failed)
146 - ufo_test_tier1_test_ufo_atms_qc_filters (Failed)
147 - ufo_test_tier1_test_ufo_atms_n20_qc_filters_geos (Failed)
148 - ufo_test_tier1_test_ufo_cris_qc (Failed)
149 - ufo_test_tier1_test_ufo_cris_qc_land (Failed)
152 - ufo_test_tier1_test_ufo_mhs_qc_filters_geos (Failed)
349 - ufo_test_tier1_test_ufo_opr_abi_ahi_crtm (Failed)
350 - ufo_test_tier1_test_ufo_linopr_abi_ahi_crtm (Failed)
351 - ufo_test_tier1_test_ufo_opr_airs_crtm (Failed)
352 - ufo_test_tier1_test_ufo_linopr_airs_crtm (Failed)
353 - ufo_test_tier1_test_ufo_opr_amsr2_crtm (Failed)
356 - ufo_test_tier1_test_ufo_opr_amsua_crtm (Failed)
357 - ufo_test_tier1_test_ufo_linopr_amsua_crtm (Failed)
358 - ufo_test_tier1_test_ufo_opr_amsua_geos_crtm (Failed)
359 - ufo_test_tier1_test_ufo_linopr_amsua_geos_crtm (Failed)
360 - ufo_test_tier1_test_ufo_opr_atms_crtm (Failed)
361 - ufo_test_tier1_test_ufo_linopr_atms_crtm (Failed)
362 - ufo_test_tier1_test_ufo_opr_cris_crtm (Failed)
363 - ufo_test_tier1_test_ufo_linopr_cris_crtm (Failed)
364 - ufo_test_tier1_test_ufo_opr_gmi_crtm (Failed)
365 - ufo_test_tier1_test_ufo_opr_hirs4_crtm (Failed)
366 - ufo_test_tier1_test_ufo_linopr_hirs4_crtm (Failed)
367 - ufo_test_tier1_test_ufo_opr_iasi_crtm (Failed)
368 - ufo_test_tier1_test_ufo_opr_mhs_crtm (Failed)
369 - ufo_test_tier1_test_ufo_linopr_mhs_crtm (Failed)
370 - ufo_test_tier1_test_ufo_opr_seviri_crtm (Failed)
371 - ufo_test_tier1_test_ufo_linopr_seviri_crtm (Failed)
372 - ufo_test_tier1_test_ufo_opr_smap_crtm (Failed)
373 - ufo_test_tier1_test_ufo_linopr_smap_crtm (Failed)
374 - ufo_test_tier1_test_ufo_opr_sndrd1-4_crtm (Failed)
375 - ufo_test_tier1_test_ufo_linopr_sndrd1-4_crtm (Failed)
376 - ufo_test_tier1_test_ufo_obsdiag_crtm_airs_optics (Failed)
377 - ufo_test_tier1_test_ufo_obsdiag_crtm_amsua_optics (Failed)
378 - ufo_test_tier1_test_ufo_obsdiag_crtm_atms_optics (Failed)
379 - ufo_test_tier1_test_ufo_amsua_crtm_bc_channelnobc_geos (Failed)
380 - ufo_test_tier1_test_ufo_amsua_crtm_bc_geos (Failed)
381 - ufo_test_tier1_test_ufo_amsua_crtm_bc_obsoperator (Failed)
382 - ufo_test_tier1_test_ufo_amsua_crtm_bc_tlad (Failed)
395 - ufo_test_tier1_test_ufo_ssmis_f17_gfs_backgroundcheck_bc (Failed)
396 - ufo_test_tier1_test_ufo_ssmis_f17_gfs_backgroundcheck_nbc (Failed)
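Many of the failing test names above contain "crtm", so it may be worth tallying how many failures in each output file are CRTM-related. A minimal sketch, assuming the failure lines keep the "<number> - <name> (<status>)" shape shown above; failed_test_names is a made-up helper name.

```shell
#!/usr/bin/bash
# failed_test_names (hypothetical helper): print just the test names from
# lines of the form "  12 - some_test_name (Failed)" in the given files.
failed_test_names() {
  sed -n -E 's/^[[:space:]]*[0-9]+ - ([^[:space:]]+) \([^)]+\).*$/\1/p' "$@"
}

# Usage: count the failures in one repository whose names mention crtm
#   failed_test_names ctest.ufo.out | grep -c crtm
```

This only counts names; it does not show why a test failed, so the --output-on-failure text in each ctest.*.out file is still the place to look for the actual errors.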