Hi, long time contributor, first time caller here.
I’m trying to run the fv3-jedi HofXNoModel application (release v5) using GEOS cubed-sphere history files on the c960 grid as input. Those files are generated from c2880 output by taking every 3rd centered grid point (including for the latitude and longitude arrays); only cell-centered fields are used. I’m simulating aircraft, GNSSRO (NBAM), and sonde observations.
On 24 processors with a [2, 2] layout (2x2x6 == 24), the application finishes successfully, using ~554 GB of RAM. With 48 processors and a [4, 2] layout, or 96 processors and a [4, 4] layout, I receive the following error message, apparently during the horizontal interpolation:
###################################################################################
stripack_addnod: colinear points
Abort(1) on node 81 (rank 81 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 81
###################################################################################
I tried with a large number of nodes such that the maximum available memory was ~1600 GB. So I do not think this is a memory issue related to halo footprint.
Does anybody have insights into what might be going wrong? Or ideas of debugging routes?
Thanks,
JJ
If you remove ‘sondes’, does it work? We’ve run into issues before where, if a processor ends up with no sonde obs, it will fail.
Thanks for the idea @CoryMartin-NOAA. Unfortunately, that did not fix the problem. My colleague @xtian15 has also tried this same case with release v4 and sometimes hit the same error and sometimes not, depending on the layout chosen. For 96 processors, [2, 8], [1, 16], and [16, 1] worked, but no other layout did (e.g., [4, 4]).
Actually, I tried again just now with release v5 using 96 PEs and a [1, 16] layout, and the error message goes away. So maybe there is a problem with squarish layouts? Not sure whether that is only triggered for large grids, or why. Any thoughts @fmahebert?
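For reference, these are the per-tile layouts consistent with 96 tasks on the 6 cubed-sphere tiles (a trivial helper, nothing fv3-jedi-specific):

```python
def candidate_layouts(ntasks, ntiles=6):
    """All [x, y] per-tile layouts satisfying ntasks == x * y * ntiles."""
    per_tile, rem = divmod(ntasks, ntiles)
    if rem:
        return []
    return [[x, per_tile // x] for x in range(1, per_tile + 1) if per_tile % x == 0]

print(candidate_layouts(96))  # [[1, 16], [2, 8], [4, 4], [8, 2], [16, 1]]
```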
EDIT: or is there a requirement for the number of processors per node requested vs. the layout?
Hi JJ — My suspicion is that the triangulation algorithm (we currently use stripack to triangulate the grid in the oops::UnstructuredInterpolator) is not robust enough. Your grid is fairly high resolution, certainly higher than anything I’ve tested with. I expect the larger number of points makes it more likely that you hit the pathological case of trying to insert a point directly in between two other points.
I think I can add some fake try-catch logic here (because Fortran) to reshuffle the points in those pathological cases, and hopefully the reshuffled list will avoid the problem. I can try this next week, I think.
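To make the pathological case concrete: three points on the sphere sit on a common great circle exactly when their position vectors are coplanar, and a regularly decimated grid produces many such exactly aligned triples. A rough illustration (plain numpy, nothing to do with the stripack internals):

```python
import numpy as np

def lonlat_to_xyz(lon_deg, lat_deg):
    """Unit position vector on the sphere for a (lon, lat) point in degrees."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def nearly_collinear(p1, p2, p3, tol=1e-12):
    """Three points on the sphere lie on one great circle when their
    position vectors are coplanar, i.e. the determinant is ~0."""
    return abs(np.linalg.det(np.column_stack([p1, p2, p3]))) < tol

# Three points on the same meridian -- the kind of aligned triple a
# triangulation can choke on without a tie-break:
a = lonlat_to_xyz(10.0, -5.0)
b = lonlat_to_xyz(10.0,  0.0)
c = lonlat_to_xyz(10.0,  5.0)
print(nearly_collinear(a, b, c))  # True
```

Which triples a given MPI task sees depends on the decomposition, which would explain why some layouts trip over this and others don’t.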
The better long-term solution would be to use an interpolator that’s aware of the grid structure… but we’ll be working on that soon.
Hi Francois. Thanks for those insights and plans. Very helpful!
One short-term idea that may help verify the above suspicion: if you go to src/oops/external/stripack/stripack.cc:49 and put in a different random integer, does that fix your failing case? If it does, that gives me some reassurance your abort may just be “bad luck”. If not, then perhaps we have to think harder, or at least be prepared to try several random shuffles to get “good luck”…
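For concreteness, the “try several random shuffles” fallback I have in mind is shaped roughly like this — illustrative Python, not the oops/stripack Fortran, and `triangulate` is a hypothetical callable that signals failure by raising:

```python
import numpy as np

def triangulate_with_retries(points, triangulate, max_tries=5, seed=42):
    """Sketch of a retry-with-reshuffle wrapper: if the triangulation hits a
    degenerate insertion order, permute the points and try again.
    `points` is an (n, 3) array; `triangulate` stands in for the real routine."""
    rng = np.random.default_rng(seed)
    order = np.arange(len(points))
    for _ in range(max_tries):
        try:
            return triangulate(points[order]), order
        except RuntimeError:    # stand-in for the Fortran abort / status flag
            rng.shuffle(order)  # new insertion order; hopefully no exact tie this time
    raise RuntimeError(f"triangulation still failing after {max_tries} shuffles")
```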
Hi @jjguerrette, nice to read from you here. Speaking with @fmahebert about it this morning, we recalled seeing this error message in the past when more memory was needed. A few comments… I don’t know how many tasks per node you are using, but I’d play with that number. I assume you are using exclusive nodes in your tests, right? Can you isolate your problem with a zero-obs and/or single-obs test?
Using a wide/tall tile aspect ratio (1:16 for 96 cores) works for what we need in the short term, I think. I can test more edge cases in September, but I don’t have time right now to set up the zero-obs/single-obs case. I did try reducing to fewer than 3K aircraft obs and the problem persisted, while the wide/tall aspect ratio has not failed yet, so I don’t see how this could be an out-of-memory issue.