Hi,
I am trying to train a static B at c192 resolution with release v4. The NICAS step for ozone mixing ratio and specific humidity keeps getting killed, with the following messages:
Info : Count and find nearest neighbors, compute distances: 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
Info : Compute weights: 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
Info : Compute internal normalization
Info : Compute normalization
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node tomorrowbe-cn-8 exited on signal 9 (Killed).
--------------------------------------------------------------------------
real 52m13.253s
user 219m59.659s
sys 1640m26.638s
Wed Jul 5 19:48:55 UTC 2023
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=36426.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
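The job always stops immediately after the "Compute normalization" message, and the slurmstepd error above indicates that the processes were killed by the cgroup out-of-memory handler.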
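A minimal sketch for checking this across several job logs, assuming each log is plain text in the same format as the output above. The function name `summarize_log` and the shortened `sample` are illustrative only and not part of any JEDI/SABER tooling:

```python
import re

def summarize_log(text):
    """Return the last 'Info :' stage, the killed MPI rank/signal, and the OOM-kill count."""
    last_stage = None
    killed = None
    oom_events = 0
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"Info\s*:\s*(.+)", line)
        if m:
            # Keep only the stage name; drop any progress percentages after a colon
            last_stage = m.group(1).split(":")[0].strip()
        m = re.search(r"process rank (\d+) .* exited on signal (\d+)", line)
        if m:
            killed = (int(m.group(1)), int(m.group(2)))
        m = re.search(r"Detected (\d+) oom-kill event", line)
        if m:
            oom_events += int(m.group(1))
    return {"last_stage": last_stage, "killed": killed, "oom_events": oom_events}

# Shortened excerpt of the log shown above (progress percentages truncated for brevity)
sample = """Info : Compute weights: 0% 5% 10%
Info : Compute internal normalization
Info : Compute normalization
mpirun noticed that process rank 7 with PID 0 on node tomorrowbe-cn-8 exited on signal 9 (Killed).
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=36426.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
"""

result = summarize_log(sample)
print(result)
```

Running this over each failed log would show whether the last reported stage and the OOM-kill count are the same every time.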
Is there any parameter in the YAML file that can be tuned to reduce the memory usage of this step? Thanks in advance!