MPI.jl: Julia bindings for the Message Passing Interface

MPI.jl is a Julia package for using the Message Passing Interface (MPI), a standardized and widely-supported communication interface for distributed computing, with multiple open source and proprietary implementations. It roughly follows the C MPI interface, with some additional conveniences afforded by the Julia language such as automatic handling of buffer lengths and datatypes.


Introduction
Now over 25 years old, MPI is the stalwart of high-performance computing communication, supported on everything from single machines to billion-dollar supercomputers. Despite its age, it supports several models of communication, and significant engineering effort goes into optimizing performance and supporting the latest networking hardware. Although Julia provides its own suite of distributed computing tools via the Distributed standard library, it is based on a controller/worker model and is currently unable to leverage fast networking hardware such as InfiniBand, which limits its scalability to large problems. MPI.jl leverages this well-established and proven technology, including extensions such as the CUDA-aware interfaces for multi-GPU communication. It is used by multiple Julia projects, including the CliMA Earth system modeling project [5].

History
MPI.jl originated as a repository developed by Lucas Wilcox in July 2012, 4 months after Julia was first publicly announced. It was registered in the Julia package registry in 2014, and has since evolved considerably along with the Julia language itself. It has so far received contributions from 51 people, many of whom are users who required the addition of a specific piece of functionality. For most of its history, the maintainers have practiced a light touch: most new features were added by users, and major changes to the codebase were typically precipitated by changes in the Julia language itself, or by modifications to the build system to support different MPI implementations or platforms (section 3.4). In the past two years there have been some larger architectural changes, while keeping the overall interface consistent. These changes include CUDA-aware support and the addition of buffer types (section 3.3), automatic installation of binary dependencies (section 3.5), and moving code that integrated with Julia's Distributed standard library into a new package (section 3.6).

Simple example and running MPI programs
Most MPI programs utilize a single-program, multiple-data (SPMD) model, where multiple processes all run the same program and communicate via messages, with their data determined by the process rank (a 0-based ordering of the processes). An example of this is a simple "round-robin" communication pattern, in which each process sends a message containing its rank to its next neighbor using nonblocking point-to-point operations:
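A minimal sketch of such a program (argument order follows the older positional MPI.jl conventions):

    using MPI
    MPI.Init()

    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)

    dst = mod(rank + 1, nprocs)   # next neighbor in the ring
    src = mod(rank - 1, nprocs)   # previous neighbor

    send_buf = Float64[rank]
    recv_buf = similar(send_buf)

    # post the nonblocking receive and send, then wait on both
    rreq = MPI.Irecv!(recv_buf, src, 0, comm)
    sreq = MPI.Isend(send_buf, dst, 0, comm)
    MPI.Waitall!([rreq, sreq])

    println("rank $rank received $(recv_buf[1]) from rank $src")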

This can be run using the MPI launcher, typically called mpiexec:
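For example, to run on 4 processes (assuming the program above is saved as roundrobin.jl, a hypothetical filename):

    mpiexec -n 4 julia roundrobin.jl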

Functionality
Most commonly-used MPI functionality is currently available via the package, including dynamic process management, and point-to-point (blocking and nonblocking), collective, one-sided, and I/O operations. The MPI 3.1 standard defines several hundred functions; limited maintainer resources mean that the addition of features is mainly driven by the needs of contributors and requests from users, so lesser-used MPI features are not yet available. These include neighborhood and non-blocking collectives, persistent operations, and the buffered/synchronous/ready point-to-point modes. Adding these features should not require any major architectural changes, and should be able to leverage much of the work described below.

Allocation and serialization
For communication operations which receive data, MPI.jl typically defines two separate functions:
- one function in which the output buffer is supplied by the user: as it mutates this value, it adopts the Julia convention of suffixing with ! (e.g. MPI.Recv!, MPI.Reduce!);
- one function which allocates the buffer for the output (MPI.Recv, MPI.Reduce).
Additionally, we adopt the convention from mpi4py of using lowercase names for functions which are able to handle arbitrary objects. These are typically slower as they rely on serialization and are not type-stable, but can be convenient as they don't require that the object type or size be known by the receiver. Currently only a small number of these functions are provided.
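As a sketch of the three conventions (positional argument order per older MPI.jl versions; the allocating single-value Recv method is an assumption here):

    # mutating variant: the caller supplies the output buffer
    buf = Vector{Float64}(undef, 10)
    MPI.Recv!(buf, src, 0, comm)

    # allocating variant: receive a single value of a known type
    val = MPI.Recv(Float64, src, 0, comm)

    # generic lowercase variant: arbitrary objects via serialization
    obj, status = MPI.recv(src, 0, comm)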

Buffers, datatypes and operators
In C and Fortran, MPI communication functions require three arguments (address, count, and element datatype) to specify their input and/or output buffers; e.g. the MPI_Send signature in C has six arguments:
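    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm);

In MPI.jl the buffer address, count, and datatype are inferred from the Julia array itself, so the corresponding call reduces to roughly MPI.Send(data, dest, tag, comm) (older positional argument order).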

Application binary interface
The MPI standard specifies C and Fortran application programming interfaces (APIs), but not an application binary interface (ABI). Consequently, datatypes and enum values vary between implementations, and determining their precise values requires parsing C headers. After much experimentation, we have settled on two approaches that work reasonably well and require minimal user intervention:
- Attempt to identify the MPI implementation by querying MPI_Get_library_version, and use predefined constants and types if the implementation is known to be compatible with MPICH, Open MPI, or Microsoft MPI. This should cover most MPI implementations released since the completion of the MPICH ABI compatibility initiative in 2016.
- Otherwise, at build time MPI.jl compiles a small C program that outputs the type sizes and constants. One complication is that the opaque C handles might only be defined at link time: in this case, we convert them to the Fortran handle values (which are required to be integers), and convert back to C handles when calling MPI.Init(). A similar approach is used by the MPI bindings for Rust [4].
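A hedged sketch of the first approach (the shared library name libmpi is an assumption; MPI_Get_library_version may be called before MPI_Init, so the implementation can be identified before any ABI-dependent constants are needed):

    const libmpi = "libmpi"            # assumed shared library name

    # query the implementation's version string directly via ccall
    buf = Vector{UInt8}(undef, 8192)   # >= MPI_MAX_LIBRARY_VERSION_STRING
    len = Ref{Cint}()
    ccall((:MPI_Get_library_version, libmpi), Cint,
          (Ptr{UInt8}, Ref{Cint}), buf, len)
    version = String(buf[1:len[]])

    # match the implementation name at the start of the string
    impl = if startswith(version, "MPICH")
        "MPICH"
    elseif startswith(version, "Open MPI")
        "OpenMPI"
    elseif startswith(version, "Microsoft MPI")
        "MicrosoftMPI"
    else
        "unknown"   # fall back to generating constants at build time
    end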

Binary support
Similar to many Julia packages, MPI.jl uses BinaryBuilder and the Artifacts system to automatically install an MPI implementation when the package is installed (currently Microsoft MPI on Windows, MPICH on other platforms). This completely automates the installation procedure for users on single machines, meaning that MPI.jl can be added as a project dependency without users being required to correctly install their own MPI implementation. On high-performance computing systems, however, one would typically want to use a system-provided or other external MPI binary. To aid this, MPI.jl provides hooks to select the implementation at build time via environment variables, and a warning is shown if a user appears to be using the default MPI binary on an HPC cluster. Challenges remain in making it easier to switch implementations, such as how to correctly invalidate the package precompilation cache.
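For example, a system-provided implementation can be requested at build time (environment variable names per MPI.jl's build configuration of this era; check the documentation for your version):

    ENV["JULIA_MPI_BINARY"] = "system"      # use an external MPI library
    ENV["JULIA_MPI_PATH"] = "/path/to/mpi"  # hypothetical installation prefix
    using Pkg
    Pkg.build("MPI"; verbose=true)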
There are similar challenges when working with external libraries which also depend on MPI, such as HDF5 and PETSc. For example, HDF5.jl has recently added MPI support, but it requires that the user provide their own HDF5 library that has been correctly linked against the same MPI library used by MPI.jl. This makes it difficult to distribute Julia programs that make use of such functionality in a reproducible manner. We are investigating ways to improve this, such as integration with the Spack package manager [3].

Combining with other modes of Julia parallelism
Julia itself provides three models of parallelism: asynchronous tasks/coroutines (green threading), simultaneous multi-threading, and distributed computing. Though MPI nonblocking operations act in a similar manner to tasks, they cannot be integrated directly with Julia's runtime scheduler, as MPI does not provide a mechanism to interact with the Julia event loop. Furthermore, calls from Julia to external libraries block the event loop, which means that a blocking MPI operation called in a separate task will not yield until it completes. One simple but inefficient approach is to use a spinloop around a nonblocking query operation:

    # poll the request, yielding to the Julia scheduler until it completes
    # (a sketch; function names vary between MPI.jl versions)
    req = MPI.Irecv!(recv_buf, src, tag, comm)
    while !MPI.Test(req)
        yield()
    end

however concurrency will be limited by the number of available threads.
Julia's Distributed standard library is based on remote procedure calls. It is widely used, though by default it supports only IP socket-based communication. The MPIClusterManagers.jl package builds on MPI.jl to allow MPI and Distributed operations to be combined, as well as the use of MPI as a communication transport for Distributed (see the sketch below).
One useful feature of the Distributed library is that it enables interactive distributed computing through a REPL or notebook interface. Unfortunately, attempts to provide similar functionality for MPI.jl have so far had limited success: the input and output redirection imposed by MPI launchers makes it difficult to run MPI sessions interactively. Most common MPI implementations support "singleton" initialization, whereby a single process started outside a launcher can call MPI_Init; this is widely used by Julia MPI projects which support both interactive serial and batch parallel functionality. Though the MPI standard encourages implementations to allow singleton processes to connect to each other using the client/server interfaces, we have yet to find any implementations which support such operations.
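A minimal sketch of using MPIClusterManagers.jl to start Distributed workers over MPI (MPIManager and addprocs per that package's documentation):

    using Distributed, MPIClusterManagers
    # launch 4 MPI processes and connect them as Distributed workers
    manager = MPIManager(np = 4)
    addprocs(manager)
    # Distributed constructs now operate over the MPI-launched workers
    @everywhere println("hello from process $(myid())")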

Ping pong benchmark
The "ping pong" benchmark consists of two MPI processes which alternate sending messages between each other, and is a useful measure of how function call overhead affects communication latency.  Figure 1 compares the ping pong benchmark implemented in C, Julia using MPI.jl, and Python using mpi4py. The MPI.jl benchmark exhibits similar performance to C, whereas mpi4py is notable slower for smaller message sizes, likely due to the interpreter overhead of Python.
In addition, for MPI.jl and mpi4py we also compare the lowercase "generic" MPI.send and MPI.recv functions, which are able to handle arbitrary objects. Here MPI.jl is still faster than mpi4py for small messages, but slower for medium-sized messages. This appears to largely be determined by the relative performance of the serialization in each language. mpi4py is changing its choice of pickle protocol in its next release, which may affect these numbers.

Minimum-spanning tree broadcast
Julia syntax is close to the pseudo-code used in the literature to describe parallel algorithms. For example, consider the minimum-spanning tree broadcast algorithm in Figure 3a of [1]:
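A hedged sketch of this algorithm (generic MPI.send/MPI.recv; the recursion bisects the rank interval [left, right], with the data initially held by root):

    function mst_bcast(obj, root, left, right, comm)
        me = MPI.Comm_rank(comm)
        left == right && return obj
        mid = (left + right) ÷ 2
        # the root sends to a representative of the half not containing it
        dest = root <= mid ? right : left
        if me == root
            MPI.send(obj, dest, 0, comm)
        elseif me == dest
            obj, _ = MPI.recv(root, 0, comm)
        end
        # recurse on the half containing this rank
        if me <= mid
            return mst_bcast(obj, root <= mid ? root : dest, left, mid, comm)
        else
            return mst_bcast(obj, root <= mid ? dest : root, mid + 1, right, comm)
        end
    end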

This is nearly identical to the pseudo-code and can be called for all of the datatypes supported by MPI.send and MPI.recv, for example arrays, functions, and dictionaries.

Pooled variance using custom datatypes and operators
Computing the variance of a distributed array x illustrates the power of custom datatypes and reduction operations. Using the standard formula requires two rounds of communication (first to sum the $x_i$ and broadcast the mean, second to sum the squared differences). Using the sum-of-squares formula can be done in one communication operation (by summing both $x_i^2$ and $x_i$), but suffers from numerical cancellation error. The pooled variance formula is a numerically stable way of computing the variance of $x$ from the means and variances of a partitioning $(x^{(k)})_{k=1}^K$ of $x$:
$$\operatorname{Var}(x) = \sum_k \frac{n^{(k)}}{n} \left[ \operatorname{Var}(x^{(k)}) + \bar{x}^{(k)} \left( \bar{x}^{(k)} - \bar{x} \right) \right],$$
where $n^{(k)}$ is the length of $x^{(k)}$ and $n = \sum_k n^{(k)}$.
This can be applied recursively as a custom reduction operator on objects containing the mean, variance and length of each element of the partition, requiring only a single MPI.Reduce operation:

    # summary of a subset of x: its mean, variance and length
    struct SummaryStat
        mean::Float64
        var::Float64
        n::Float64
    end

    # pool two summaries into a summary of the combined subset
    function pool(S1::SummaryStat, S2::SummaryStat)
        n = S1.n + S2.n
        m = (S1.mean * S1.n + S2.mean * S2.n) / n
        v = (S1.n * (S1.var + S1.mean * (S1.mean - m)) +
             S2.n * (S2.var + S2.mean * (S2.mean - m))) / n
        SummaryStat(m, v, n)
    end
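A hedged usage sketch (the SummaryStat constructor from an array is an addition here; MPI.Reduce argument order per older MPI.jl versions):

    using Statistics
    # summarize this rank's local portion of the distributed array
    function SummaryStat(x::AbstractArray)
        m = mean(x)
        SummaryStat(m, varm(x, m; corrected = false), length(x))
    end

    summ = MPI.Reduce(SummaryStat(local_x), pool, 0, comm)
    # on the root rank, summ now holds the mean, variance and length of x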