NICAS keeps getting terminated

I am trying to train a static B at c192 resolution with release v4. The NICAS step for ozone mixing ratio and specific humidity keeps getting terminated with the following messages:

Info     :              Count and find nearest neighbors, compute distances: 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
Info     :              Compute weights: 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
Info     :           Compute internal normalization
Info     :           Compute normalization
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 7 with PID 0 on node tomorrowbe-cn-8 exited on signal 9 (Killed).
real    52m13.253s
user    219m59.659s
sys     1640m26.638s
Wed Jul  5 19:48:55 UTC 2023
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=36426.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The run always stops immediately after "Compute normalization".
Is there any parameter in the YAML file that can be tuned to avoid this? Thanks in advance!

As the oom-kill event in the log indicates, this is an out-of-memory issue. To reduce the memory footprint, you can either reduce the nicas.resolution parameter or limit the convolution length-scales by setting a reasonable maximum length-scale with the general.universe length-scale parameter. If neither option works, you can send me the full log.

Thanks for the suggestion! What would a lower nicas.resolution imply, and where can I find the related documentation?

Reducing nicas.resolution makes the correlation or localization function represented by NICAS "bumpier", i.e. less smooth. However, the computational cost and the memory footprint scale as the square of this parameter, so it is a good knob for trading accuracy against cost.
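
To get a rough feel for the trade-off, here is a small back-of-the-envelope sketch. It assumes pure quadratic scaling of cost and memory with nicas.resolution, as described above; real runs will also have fixed overheads, so treat the numbers as an estimate only. The function name relative_cost is hypothetical and is not part of any NICAS tooling.

```python
def relative_cost(new_resolution: float, old_resolution: float) -> float:
    """Estimated cost/memory at new_resolution relative to old_resolution.

    Assumption: cost and memory scale exactly with the square of the
    resolution parameter (no fixed overhead is modelled).
    """
    return (new_resolution / old_resolution) ** 2


if __name__ == "__main__":
    baseline = 10.0  # resolution used as the reference in this estimate
    for candidate in (8.0, 6.0, 4.0):
        ratio = relative_cost(candidate, baseline)
        print(f"resolution {candidate:g} vs {baseline:g}: "
              f"roughly {ratio:.0%} of the cost and memory")
```

Under this assumption, going from resolution 10 to 4 cuts the estimated cost and memory to about 16% of the original, which is why a modest reduction of the parameter can be enough to get under a memory limit.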

Would it be viable to combine variables trained with different NICAS resolutions in the same 3DVar? All other variables are trained with resolution=10; only ozone and specific humidity use resolution=4.

Yes, NICAS applications for each variable are independent. Not a problem!