Nicas keeps getting terminated

xtian15 · July 5, 2023, 8:00pm

Hi,
I am trying to train a static B at C192 resolution with release v4. The NICAS step for ozone mixing ratio and specific humidity keeps getting terminated with the following messages:

Info     :              Count and find nearest neighbors, compute distances: 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
Info     :              Compute weights: 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
Info     :           Compute internal normalization
Info     :           Compute normalization
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node tomorrowbe-cn-8 exited on signal 9 (Killed).
--------------------------------------------------------------------------
real    52m13.253s
user    219m59.659s
sys     1640m26.638s
Wed Jul  5 19:48:55 UTC 2023
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=36426.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The run always stops immediately after the "Compute normalization" step.
Is there any parameter in the YAML file that can be tuned to avoid this behavior? Thanks in advance!

benjaminmenetrier · July 7, 2023, 6:59am

As the oom-kill suggests, this is an “Out Of Memory” issue. To reduce the memory footprint, you can either reduce the nicas.resolution parameter, or limit the convolution length-scales by specifying a reasonable maximum length-scale with the parameter general.universe length-scale. If these options do not help, you can send me the full log.
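
For reference, the two parameters mentioned above might appear in the BUMP part of the YAML roughly as follows. This is only a sketch: the section nesting is inferred from the dotted names (general.universe length-scale, nicas.resolution), the values are illustrative placeholders rather than recommendations, and the length-scale is assumed to be expressed in meters. The rest of the configuration is omitted.

```yaml
# Sketch only: values are placeholders, not tuned recommendations.
general:
  # Maximum length-scale (assumed to be in meters, here 3000 km);
  # limits the convolution length-scales.
  universe length-scale: 3000000.0
nicas:
  # Lower values reduce the memory footprint and the cost,
  # at the price of a less smooth function.
  resolution: 4.0
```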

xtian15 · July 7, 2023, 12:33pm

Thanks for the suggestion! What would a lower nicas.resolution imply, and where can I find the related documentation?

benjaminmenetrier · July 7, 2023, 1:28pm

Reducing nicas.resolution would make the correlation or localization function represented by NICAS more “bumpy”, i.e. less smooth. But the computational cost and the memory footprint scale with the square of this parameter, so it is a good parameter for trading off cost against accuracy.
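
As a rough illustration of that quadratic scaling, the snippet below estimates the relative cost and memory footprint when nicas.resolution is changed. It is a back-of-the-envelope sketch, not part of SABER/BUMP: the helper name relative_cost is hypothetical, it assumes the scaling is exactly quadratic and that nothing else changes, and the values 10 and 4 are the ones that come up later in this thread.

```python
def relative_cost(new_resolution: float, old_resolution: float) -> float:
    """Estimated cost/memory ratio, assuming both scale as resolution squared."""
    if new_resolution <= 0 or old_resolution <= 0:
        raise ValueError("resolutions must be positive")
    return (new_resolution / old_resolution) ** 2


# Going from resolution=10 to resolution=4 (hypothetical helper, idealized scaling).
ratio = relative_cost(4.0, 10.0)
print(f"Estimated cost and memory relative to the original: {ratio:.2f}")
```

Under this idealized assumption, going from 10 to 4 leaves roughly 16% of the original footprint; the actual saving may differ, since other parts of the run do not depend on this parameter.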

xtian15 · July 7, 2023, 2:00pm

Would it be viable to combine variables trained with different NICAS resolutions for 3DVar? All other variables are trained with resolution=10; only ozone and specific humidity are trained with resolution=4.

benjaminmenetrier · July 7, 2023, 3:27pm

Yes, NICAS applications for each variable are independent. Not a problem!
