Hi, long time contributor, first time caller here.
I’m trying to run the fv3-jedi HofXNoModel application (release v5) using GEOS cubed-sphere history files on the c960 grid as input. Those files are generated from c2880 output by taking every 3rd centered grid point (including for the latitude and longitude arrays); only cell-centered fields are used. I’m simulating aircraft, GNSSRO (NBAM), and sonde observations.
On 24 processors with a [2, 2] layout (2x2x6 == 24), the application finishes successfully, using ~554 GB of RAM. With 48 processors and a [4, 2] layout, or 96 processors and a [4, 4] layout, I receive the following error message, apparently during the horizontal interpolation:
###################################################################################
stripack_addnod: colinear points
Abort(1) on node 81 (rank 81 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 81
###################################################################################
I tried with a large number of nodes such that the maximum available memory was ~1600 GB. So I do not think this is a memory issue related to halo footprint.
Does anybody have insights into what might be going wrong? Or ideas of debugging routes?
Thanks,
JJ
If you remove ‘sondes’, does it work? We’ve run into issues before where, if a processor ends up with no sonde obs, it will fail.
Thanks for the idea @CoryMartin-NOAA. Unfortunately, that did not fix the problem. My colleague @xtian15 has also tried this same case with release v4 and sometimes hit the same error and sometimes not, depending on the layout chosen. For 96 processors, [2, 8], [1, 16], and [16, 1] worked, but no other layout did (e.g., [4, 4]).
Actually, I tried again just now with release v5 using 96 PEs and a [1, 16] layout, and the error message goes away. So maybe there is a problem with squarish layouts? Not sure whether that is only triggered for large grids, or why. Any thoughts @fmahebert?
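For reference, these are the per-tile layouts consistent with 96 tasks on the 6 cubed-sphere tiles (a trivial helper, nothing fv3-jedi-specific):

```python
def candidate_layouts(ntasks, ntiles=6):
    """All [x, y] per-tile layouts satisfying ntasks == x * y * ntiles."""
    per_tile, rem = divmod(ntasks, ntiles)
    if rem:
        return []
    return [[x, per_tile // x] for x in range(1, per_tile + 1) if per_tile % x == 0]

print(candidate_layouts(96))  # [[1, 16], [2, 8], [4, 4], [8, 2], [16, 1]]
```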
EDIT: or is there a requirement for the number of processors per node requested vs. the layout?
Hi JJ — My suspicion is that the triangulation algorithm (we currently use stripack to triangulate the grid in the oops::UnstructuredInterpolator) is not robust enough. Your grid is fairly high resolution, certainly higher than anything I’ve tested with. I expect the larger number of points makes it more likely that you hit the pathological case of trying to insert a point directly in between two other points.
I think I can add some fake try-catch logic here (because Fortran) to reshuffle the points in those pathological cases, and hopefully the reshuffled list will avoid the problem. I can try this next week, I think.
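To make the pathological case concrete: three points on the sphere sit on a common great circle exactly when their position vectors are coplanar, and a regularly decimated grid produces many such exactly aligned triples. A rough illustration (plain numpy, nothing to do with the stripack internals):

```python
import numpy as np

def lonlat_to_xyz(lon_deg, lat_deg):
    """Unit position vector on the sphere for a (lon, lat) point in degrees."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def nearly_collinear(p1, p2, p3, tol=1e-12):
    """Three points on the sphere lie on one great circle when their
    position vectors are coplanar, i.e. the determinant is ~0."""
    return abs(np.linalg.det(np.column_stack([p1, p2, p3]))) < tol

# Three points on the same meridian -- the kind of aligned triple a
# triangulation can choke on without a tie-break:
a = lonlat_to_xyz(10.0, -5.0)
b = lonlat_to_xyz(10.0,  0.0)
c = lonlat_to_xyz(10.0,  5.0)
print(nearly_collinear(a, b, c))  # True
```

Which triples a given MPI task sees depends on the decomposition, which would explain why some layouts trip over this and others don’t.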
The better long-term solution would be to use an interpolator that’s aware of the grid structure… but we’ll be working on that soon.
Hi Francois. Thanks for those insights and plans. Very helpful!
One short-term idea that may help verify the above suspicion: if you go to src/oops/external/stripack/stripack.cc:49 and put in a different random integer, does that fix your failing case? If it does, that gives me some reassurance your abort may just be “bad luck”. If not, then perhaps we have to think harder, or at least be prepared to try several random shuffles to get “good luck”…
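For concreteness, the “try several random shuffles” fallback I have in mind is shaped roughly like this — illustrative Python, not the oops/stripack Fortran, and `triangulate` is a hypothetical callable that signals failure by raising:

```python
import numpy as np

def triangulate_with_retries(points, triangulate, max_tries=5, seed=42):
    """Sketch of a retry-with-reshuffle wrapper: if the triangulation hits a
    degenerate insertion order, permute the points and try again.
    `points` is an (n, 3) array; `triangulate` stands in for the real routine."""
    rng = np.random.default_rng(seed)
    order = np.arange(len(points))
    for _ in range(max_tries):
        try:
            return triangulate(points[order]), order
        except RuntimeError:    # stand-in for the Fortran abort / status flag
            rng.shuffle(order)  # new insertion order; hopefully no exact tie this time
    raise RuntimeError(f"triangulation still failing after {max_tries} shuffles")
```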
Hi @jjguerrette, nice to read from you here. Speaking with @fmahebert about it this morning, we recalled seeing this error message in the past when more memory was needed. A few comments… I don’t know how many tasks per node you are using, but I’d play with that number. I assume you are using exclusive nodes in your tests, right? Can you isolate your problem with a zero-obs and/or single-obs test?
Using a wide/tall tile aspect ratio (1:16 for 96 cores) works for what we need in the short term, I think. I can test more edge cases in September, but I don’t have time right now to set up the zero-obs/single-obs case. I did try reducing to fewer than 3K aircraft obs and the problem persisted, while the wide/tall aspect ratio has not failed yet, so I don’t see how this could be an out-of-memory issue.