JEDI MPI Issue on Derecho

Hello,

When I try to run mpasjedi_variational.x on Derecho, I’m getting a puzzling MPI error. It always occurs just after the ==> create geom messages start to appear. Here is the error:

MPICH ERROR [Rank 16] [job id 9682c896-2454-4b8a-90a2-ae0debb8f0d3] [Mon Dec 29 09:07:03 2025] [dec2433] - Abort(940160770) (rank 16 in comm 0): Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(161): MPI_Isend(buf=0x37a23130, count=-1, MPI_INTEGER, dest=6, tag=16, comm=0x84000001, request=0x37a31940) failed
PMPI_Isend(97).: Negative count, value is -1

My executable was compiled from the 3.0.2 version of JCSDA/mpas-bundle. Here is the environment setup file that I used for both compiling and running the executable.

#!/bin/bash
#
source /etc/profile.d/z00_modules.sh

export LMOD_TMOD_FIND_FIRST=yes

# Check if conda is installed. If it is, deactivate it.
if command -v conda >/dev/null 2>&1; then
    conda deactivate
fi

module purge
# ignore that the sticky module ncarenv/... is not unloaded
module load ncarenv/24.12
module use /glade/work/epicufsrt/contrib/spack-stack/derecho/modulefiles
module load ecflow/5.8.4
module load mysql/8.0.33

module use /glade/work/epicufsrt/contrib/spack-stack/derecho/spack-stack-1.9.3/envs/ue-gcc-12.4.0/install/modulefiles/Core

module load stack-gcc/12.4.0
module load stack-cray-mpich/8.1.29
module load stack-python/3.11.7
module load jedi-mpas-env
module list

ulimit -s unlimited
export GFORTRAN_CONVERT_UNIT='big_endian:101-200'
export LD_LIBRARY_PATH=`pwd`/lib:$LD_LIBRARY_PATH

Has this problem been encountered before, and if so, is there a way to fix it?

Thank you for your help!

Hi @RobinArmstrong,

I have not seen this error message before. Just to check: are you running the executable on a compute node (either in an interactive session or as a PBS job) rather than on the login node? If you are comfortable sharing your working directory, I can take a closer look.

Thank you,
BJ

Hi BJ,

Thanks for taking a look at this. I am indeed running my tests on a compute node, with 36 cores. My working directory is /glade/work/rarmstrong/jedi/testing/tests/3denvar, and in case it’s useful, run_log.txt has a full log of a run where this crash occurred.

-Robin

Hi @RobinArmstrong,

Thanks for sharing your working directory. I tried it myself and got the same error message.
In short, this seems to be a configuration issue.

Please see /glade/derecho/scratch/bjung/troubleshooting/RobinA_b.
I made two modifications.

  1. In the streams.atmosphere_240km file, the static stream should be the invariant stream.
  2. The ancillary datasets of MPAS-Model (such as some binary or ASCII tables) linked in the working directory are not compatible with the MPAS-Model version of your build. I therefore modified run_3denvar.sh so that the correct MPAS ancillary datasets are linked directly from the build directory.
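
As a rough sketch of how the first item could be checked before a run: the helper below only reports whether a streams file still declares a stream named static. It assumes the stream is declared with a name="static" attribute, and the function name check_static_stream is hypothetical (it is not part of MPAS or JEDI). It does not modify the file.

```shell
#!/bin/bash
# Hypothetical helper: report whether a streams file still declares a
# stream named "static". Assumes the declaration uses name="static".
check_static_stream() {
    local streams_file="$1"
    if grep -q 'name="static"' "$streams_file"; then
        echo "static stream found in $streams_file; it may need to be the invariant stream"
        return 1
    fi
    echo "no static stream found in $streams_file"
    return 0
}
```

For example, check_static_stream streams.atmosphere_240km returns a nonzero status if the old stream name is still present.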

Hope this helps!

Thank you,
BJ

Hi BJ,

Thank you as always for your help. The solution you gave me worked for a time, but now I’m getting new issues. Running with the debug and trace flags turned on for OOPS, I get this error:

dec1317.hsn.de.hpc.ucar.edu: rank 9 died from signal 11

There’s no error logfile from MPAS, as there normally would be if this was something trivial like a missing file, nor are there any obvious output messages to indicate what the problem might be. But it always appears right after these debug messages from OOPS:

OOPS_TRACE[1] State<MODEL>::State read starting
OOPS_TRACE[1] State::State create and read.

I’m wondering if this is an issue related to the test background state file; maybe MPAS has been updated in such a way that this file is no longer formatted correctly? You can find a complete log at /glade/work/rarmstrong/jedi/testing/tests/3denvar/run_log.txt.

Happy to move this to a new thread if you think that would be more appropriate.

-Robin

Hi @RobinArmstrong,

Can you let me know which version of mpas-bundle, especially mpas-jedi, is being used for your build (develop or release/3.0.x)?

We introduced the generic model variable naming convention early last year (2025!). If you are using the develop branch of the mpas-jedi repository, the variable names in the 3denvar YAML file should be revised accordingly.
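
One way to locate names that still need revising, as a minimal sketch: the helper below prints every line of a YAML file that mentions one of the variable names passed on the command line, matched as whole words. The function name find_variable_names is hypothetical, and the names to search for must be supplied by the caller; none are assumed here.

```shell
#!/bin/bash
# Hypothetical helper: for each name given after the file argument, print
# the lines (with line numbers) of the YAML file that contain that name.
find_variable_names() {
    local yaml_file="$1"
    shift
    local name
    for name in "$@"; do
        # -w: whole-word match, -F: treat the name as a literal string
        if grep -qwF -- "$name" "$yaml_file"; then
            echo "== $name"
            grep -nwF -- "$name" "$yaml_file"
        fi
    done
}
```

Calling it with the YAML file followed by the old-style names lists each place where such a name still occurs.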

Thank you,
BJ

Hi BJ,

Got it; I’m using the latest commit on develop. I updated all the model variable names in my YAML to match the names in an example from mpas-jedi/test/testinput, and this solved the problem. Thank you!

-Robin
