Basic Threading Examples in JuliaLang v1.3

A major distinguishing point of any programming language is how it deals with concurrency. Programmers want to extract the best throughput possible for their applications, but it is well known that taking advantage of all available CPU cores correctly and efficiently is hard. Here, we look at how JuliaLang unleashes the full power of a modern CPU’s multiple cores. One of our key considerations is to reduce the programmer’s burden. We will discuss how JuliaLang aims to provide a range of modern primitives that are designed to automatically compose effectively, and some of the trade-offs we make to try to simplify the mental model for the programmer. We’ll also briefly discuss our thoughts on future development.


Introduction
A design principle for JuliaLang is to make common tasks easy and difficult tasks possible. This is demonstrated in multiple aspects of the language, from automatic memory management vs. manual memory reuse, to having one type of method with dispatch, to optional type-inference for performance. We now extend this to concurrency. By extending design work that has been present in the language for many years to provide concurrency, we have developed a means to add parallelism that preserves the (relative) simplicity of single-threaded execution for existing code, while allowing new code to benefit from multi-threaded execution. This work has been inspired by parallel programming systems such as Cilk, Intel Threading Building Blocks (TBB) and Go.
In this paradigm, any piece of a program can be marked for execution in parallel, and a task will be started to run that piece of code automatically on an available thread. A dynamic scheduler handles making cache-aware decisions on when and where to launch tasks. This model of parallelism has many helpful properties. We see it as somewhat analogous to garbage collection: with GC, you freely allocate objects without worrying about when and how they are freed. With task parallelism, you freely spawn tasks (potentially millions of them) without worrying about where they may eventually run.
The model is portable and free from low-level details. The programmer does not need to manage threads, nor even know how many processors or threads are available. The model is nestable and composable: parallel tasks can be started that call library functions that themselves start parallel tasks, and everything works correctly. This property is crucial for a high-level language where a lot of work is done by library functions. The programmer can write serial or parallel code without worrying about how the libraries used are implemented.
This model isn't limited to JuliaLang libraries alone: we've shown that it can be extended to native libraries such as FFTW and are working on extending it to OpenBLAS.

Background History
Initially, JuliaLang exclusively provided users the ability to use cooperatively scheduled workers. In other contexts, these may be known as "green threads", "threadlets", or "coroutines". In Julia, we've called these tasks. A task is a unit of work with its own context (stack) whose execution can be interleaved with that of any other task. Additionally, these have the ability to produce a value or an exception, making them useful for structured concurrency, without requiring an extra channel to manage. But they haven't had the ability to work in parallel (simultaneously). Tasks have been useful for writing generators and outstanding for dealing with I/O workloads. Such cases present unpredictable latency and therefore the ability to quickly switch between different control flows is essential. The arrival of an event requires a quick context switch to react to the event, and then a quick switch back to resume the original work. A common pattern for a server is to provide a context for each child. Consider the following code snippet for a toy socket echo server:
[Listing: toy socket echo server]
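As a rough stand-in for such a server, the sketch below uses the Sockets standard library. The names serve_clients and echo_server are hypothetical and are not taken from the paper's listing; the point is only to show how a do-block lets the accept loop, task creation, and error handling live in one helper.

```julia
using Sockets

# Hypothetical helper (not from the paper): accept connections until `server`
# is closed, running `handler(client)` for each one in its own task.
function serve_clients(handler, server)
    while true
        client = try
            accept(server)
        catch
            isopen(server) && rethrow()   # a genuine accept failure
            break                         # the server was closed: stop quietly
        end
        @async try
            handler(client)
        catch err
            @error "client handler failed" exception = err
        finally
            close(client)
        end
    end
end

# Echo every line back to the client that sent it.
echo_server(server) = serve_clients(server) do client
    while !eof(client)
        write(client, readline(client; keep = true))
    end
end
```

A caller would create the listening socket (for example with listen or listenany) and run echo_server(server) in a task; closing the server ends the accept loop.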
Notice here how we leverage the do-block syntax to additionally reuse the loop logic, task allocation, and error handling from a central place. This works well for latency-bound activities such as a web server. Note that this code only uses one core, although as it performs only thread-local actions, it could be run simultaneously on multiple cores without significant change. However, there is also a lot of code that is not written with the expectation of simultaneous access. As such, we want to continue to provide concurrency (footnote 1) as well as parallelism (footnote 2). If we define a thread as a unit of work managed by the runtime system, we can call this N:1 threading, where the runtime library manages N independent operations and maps them onto one system thread (approximately representing a CPU core). In the new system, we'll let the user now additionally have M CPU cores. This goes beyond the classic N:M threading model however, as the programmer can specify cooperative affinities where certain tasks won't interrupt each other. We'll call this N(k:1):M scheduling since we're combining the advantages of single-threaded work queues with multiple cores. In N(k:1):M scheduling, we have N units of work mapped onto M CPU threads. Additionally, each of those units of work may be composed of k cooperatively scheduled tasks. This is achieved by pinning the k tasks in a group to one CPU thread, while load balancing units of work across the available cores using the partr scheduler, a novel implementation of a parallel depth-first scheduling algorithm. This will be discussed further in section 5 on implementation.
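One way to picture this grouping is sketched below, under the assumption that @async tasks stay on the thread of the task that created them while Threads.@spawn tasks may start on any thread. The function run_groups is a hypothetical illustration, not part of the runtime: it starts N units of work and, inside each, k cooperative tasks that record the thread they ran on.

```julia
using Base.Threads: @spawn, threadid

# Hypothetical illustration: N units of work that may run on any thread,
# each made of k cooperatively scheduled tasks that share the unit's thread.
function run_groups(N::Int, k::Int)
    units = map(1:N) do _
        @spawn begin
            tids = zeros(Int, k)
            @sync for j in 1:k
                # @async creates a sticky task: it runs on its parent's thread
                @async (tids[j] = threadid())
            end
            tids
        end
    end
    return map(fetch, units)
end
```

Each returned vector is expected to contain a single repeated thread id, while different units may report different ids when Julia is started with several threads.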
We can extend this model further by factoring out a common factor of P:P to write this as N(k:1):M + P(k:1):P and derive one further useful use case: the ability to pin one thread to running one task (or task-group). These P tasks could be an over-subscription of the CPU cores, or take away from M, or both. Typical uses for this mode of operation would be high-availability tasks (with low latency requirements, but also minimal computation), such as background I/O processing, blocking work pools (foreign library integration), finalizers, or message server queues.
Footnote 1: When at least two threads are making progress; a more general form of parallelism that can include time-slicing.
Footnote 2: When at least two threads are executing simultaneously.
For a couple of years, JuliaLang has been able to perform simple loop parallelism with the '@threads for' macro, roughly analogous to OpenMP's '#pragma omp parallel for schedule(static)' without support for reductions. This had been labelled "experimental" while we focused on making JuliaLang's runtime reentrant and threadsafe and clarified requirements for the final parallelism capabilities. The experimental threading infrastructure had no scheduler, could not interact with regular tasks or do I/O, and parallel loops could not be nested. This made it nearly impossible to write many common algorithms or use a large portion of the language while running a threaded region of code.
The new threading runtime addresses all these shortcomings. Furthermore, in addition to the new parallelism constructs that have been introduced, the previous loop parallelism capability has been rewritten on top of this runtime, demonstrating its power and flexibility.

Running Julia with Threads
In the examples below, we will be using JuliaLang v1.3 launched with multiple threads. To follow along on your own machine, you will need to download the upcoming JuliaLang release (currently v1.3.0-rc1) from https://julialang.org/downloads. Run ./julia with the environment variable JULIA_NUM_THREADS set to the number of threads to use. Alternatively, after installing JuliaLang, follow the steps at http://docs.junolab.org/latest/man/installation/ to install the Juno IDE. It will automatically set the number of threads based on the number of available processor cores, and also provides a graphical interface for changing the number of threads.

Motivating Examples
The presence and usability aspects of threading, as exemplified here, reflect JuliaLang's general policy of giving users control. One driving philosophy is that users should have the ability to access the full power of their machine, and that doing so should be easy when needed but ignorable when not required. While many, or even most, programs can be written without needing to touch multithreading, some require it and others benefit from it. In this paper, we'll primarily examine some cases where threads aren't required, but are improved by their presence. Additionally we'll look at a case where the work can be run sequentially with cooperative scheduling, but at greatly reduced performance. Most thread-specific functionality is exported from the Threads submodule of the Base module. For example, we can query it for the runtime number of threads and the id of the current thread:
    julia> Threads.nthreads()
    4
    julia> Threads.threadid()
    1

Stochastic Ordering
One of the more visual ways to show we have threads working is to show the scheduler picking up work in semi-random, interleaved orders. JuliaLang's existing '@threads for' macro would split a range and run a portion on each thread with a static schedule. So in the range below, thread 1 would run items 1 and 2, thread 2 would run items 3 and 4, and so on. Now these threads support doing I/O too with that same schedule.
    bash$> JULIA_NUM_THREADS=8 julia <<EOF
    Threads.@threads for i = 1:12
        println(i, " on thread ", Threads.threadid())
    end
    EOF
    1 on thread 1
    3 on thread 2
    12 on thread 8
    9 on thread 5
    7 on thread 4
    2 on thread 1
    4 on thread 2
    5 on thread 3
    8 on thread 4
    11 on thread 7
    10 on thread 6
    6 on thread 3
But now, it's also possible to run the same example with a completely dynamic schedule. With the improved language runtime, this takes only a few small tweaks. We use the new '@spawn' macro with the existing '@sync' macro to delineate the work items.
The '@spawn' macro marks a block of code that can immediately start executing, asynchronously, on any free thread. The preexisting '@sync' macro then waits for all (lexical) subtasks to complete, eliminating the boilerplate necessary to track and wait on each task block separately.
    bash$> JULIA_NUM_THREADS=8 julia <<EOF
    @sync for i = 1:12
        Threads.@spawn println(i, " on thread ", Threads.threadid())
    end
    EOF
    2 on thread 5
    3 on thread 4
    8 on thread 7
    6 on thread 5
    12 on thread 7
    7 on thread 6
    9 on thread 8
    10 on thread 5
    4 on thread 3
    1 on thread 2
    5 on thread 1
    11 on thread 1
But on to even more fun stuff...

Parallel Merge Sort
A classic algorithm, parallel merge sort shows a nice performance benefit and scaling from using multiple threads. This function will create O(log(n)) subtasks which will sort independent portions of the array before merging them into a final sorted copy of the input. We use here the ability of each task to return a value to directly fetch the result without requiring an additional channel for data! This operation implicitly waits for the task to finish, then accesses the result value of the Task.
    # perform a merge sort on `v` using parallel threads
    function psort(v::AbstractVector)
        hi = length(v)
        if hi < 100000 # below some cutoff, run in serial
            return sort(v, alg = MergeSort)
        end
        # split the range and sort the halves in parallel recursively
        mid = (1 + hi) >>> 1
        half = Threads.@spawn psort(view(v, 1:mid))
        right = psort(view(v, (mid + 1):hi))
        left = fetch(half)::typeof(right)
        # perform the merge on the result
        out = similar(v)
        merge!(out, left, right)
        return out
    end
    function merge!(out, left, right)
        ll, lr = length(left), length(right)
        @assert ll + lr == length(out)
        i, il, ir = 1, 1, 1
        @inbounds while il <= ll && ir <= lr
            l, r = left[il], right[ir]
            if isless(r, l)
                out[i] = r
                ir += 1
            # the extracted listing is cut off here; what follows is the
            # standard remainder of a two-way merge
            else
                out[i] = l
                il += 1
            end
            i += 1
        end
        # copy whatever is left over in either half
        @inbounds while il <= ll
            out[i] = left[il]
            il += 1
            i += 1
        end
        @inbounds while ir <= lr
            out[i] = right[ir]
            ir += 1
            i += 1
        end
        return out
    end
To see the timing results as we add threads, refer to figure 3 at the end.
While not demonstrated here, fetch would also automatically propagate errors, with the result of it being an error thrown if the child task ended by throwing an exception.
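The error propagation mentioned above can be sketched in a few lines. This is a minimal illustration with a deliberately failing child task; the variable names are arbitrary, and the wrapper type shown (TaskFailedException) is what Julia 1.3 and later use when a fetched task has failed.

```julia
using Base.Threads: @spawn

# A child task that fails instead of returning a value.
child = @spawn error("sort failed")

# fetch rethrows the failure in the parent; since Julia 1.3 it arrives
# wrapped in a TaskFailedException that records the failed task.
caught = try
    fetch(child)
catch err
    err
end
```

The original exception remains reachable through the failed task stored in the wrapper, so a parent can decide whether to handle it or let it propagate further.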
Since we are using in-process threads, we could further optimize this to instead mutate the input in-place and to reuse work buffers for additional performance. We have elsewhere tested that and shown the performance improvement is as expected. However, since the scaling improvement was similar between them, we've opted not to include it here. On a single thread, this code is already quite competitive with the optimized serial implementation in the standard library, which does not use any threading:
    julia> @time psort(a);
      2.676906 seconds (3.06 k allocations: 1.416 GiB, 3.66% gc time)
    julia> @time sort(a);
      1.716137 seconds (2 allocations: 152.588 MiB)
    julia> @time sort(a, alg=MergeSort);
      2.123958 seconds (5 allocations: 228.882 MiB)
This shows we are adding some overhead, but it is not substantial. In fact, with 2 threads, we'll already be faster than the serial implementations!
The algorithm given here is limited in the theoretical scaling capability, since the merge step is not parallelized. On large core counts, that can be important, so please see our supplementary code in appendix A for the version with optimal theoretical scaling.

Parallel Primes Sieve
High-level threading operations can also be put to a more unusual use: (inefficiently) computing prime numbers with the sieve of Eratosthenes. This use of threaded channels is translated from example 6.1 of C. A. R. Hoare's seminal 1978 paper "Communicating Sequential Processes" [3]. It works by creating a task for each prime number being generated. Upon receiving (and outputting) a prime, each task then takes responsibility for filtering out multiples of that prime from the input list, as represented in figure 1. To see the timing results as we add threads, refer to figure 3 at the end.
Since we're creating one task for each prime, the overhead here overwhelms the computational cost of the additions. That makes this implementation much slower than the optimized routines typically used now, such as those provided in Primes.jl to compute primes. But it also means we show exceptional (super-linear) scaling. This is because we end up being able to run a better schedule when we can fill and empty the channels in parallel. That is also why the presence of at least a small buffer on the channel can be a significant advantage for the implementation.
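The pipeline described above can be sketched roughly as follows. This is an illustrative reconstruction rather than the paper's code: the names sieve_stage and csp_primes and the bufsize keyword are assumptions, and each stage is an ordinary task started with Threads.@spawn that communicates only through buffered channels.

```julia
using Base.Threads: @spawn

# One filter stage per prime: the first number received is a prime; every
# later number that is not a multiple of it is forwarded to the next stage.
function sieve_stage(input::Channel{Int}, primes::Channel{Int}, bufsize::Int)
    next = iterate(input)          # blocks until a value arrives or input closes
    if next === nothing
        close(primes)              # upstream is exhausted: no more primes
        return
    end
    p = first(next)
    put!(primes, p)
    output = Channel{Int}(bufsize)
    @spawn sieve_stage(output, primes, bufsize)
    for n in input
        n % p == 0 || put!(output, n)
    end
    close(output)
    return
end

# Collect all primes up to `limit` by feeding 2:limit into the first stage.
function csp_primes(limit::Int; bufsize::Int = 32)
    numbers = Channel{Int}(bufsize)
    primes = Channel{Int}(bufsize)
    @spawn begin
        for n in 2:limit
            put!(numbers, n)
        end
        close(numbers)
    end
    @spawn sieve_stage(numbers, primes, bufsize)
    return collect(primes)
end
```

Calling csp_primes(30) gathers the primes up to 30 in increasing order. Shrinking bufsize makes neighbouring stages wait on each other more often, which is one way to observe the buffering effect described above.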

Parallel Prefix Scan
Prefix-scan-sum is another classic algorithm that is able to benefit nicely from having multiple threads. Without going into any details about how this operation works or what it does, the short code below can take advantage of all cores and SIMD units available on the native machine, even with a generic ahead-of-time-compiled system image:
[Listing: parallel prefix scan]
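For readers without the listing at hand, one well-known way to write an in-place parallel prefix scan with '@threads for' is sketched below. It is not necessarily the code the paper used: the name prefix_threads! is an assumption, the operator op is assumed to be associative, and y is assumed to use 1-based indexing.

```julia
using Base.Threads: @threads

# In-place inclusive scan: afterwards y[i] holds op applied over the
# original y[1], ..., y[i], in order.
function prefix_threads!(op, y::AbstractVector)
    l = length(y)
    l <= 1 && return y
    k = ceil(Int, log2(l))
    # Reduce phase: combine pairs at growing strides, building partial results.
    for j in 1:k
        @threads for i in 2^j:2^j:min(l, 2^k)
            @inbounds y[i] = op(y[i - 2^(j - 1)], y[i])
        end
    end
    # Expand phase: push the partial results back down to the remaining slots.
    for j in (k - 1):-1:1
        @threads for i in 3 * 2^(j - 1):2^j:min(l, 2^k)
            @inbounds y[i] = op(y[i - 2^(j - 1)], y[i])
        end
    end
    return y
end
```

Within each inner loop the slots being written and the slots being read never overlap, which is what allows every iteration of that loop to run on a different thread.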
JuliaLang can express this operation so well because it defines an expressive front-end to describe optimizations to the compiler. Under the hood, it puts together a comprehensive set of features that free the user from dealing with memory management, thread management, or the compile/runtime distinction. The runtime is able to prepare a version of this function specifically optimized for the argument types. And it spawns closures to be run on all available CPUs. The compiler can also automatically specialize the function for the current processor (both ahead-of-time and just-in-time), adjusting the ABI on-the-fly (with trampolines as needed). And our lightweight threading system will dynamically schedule the work chunks.

Performance
Each of the examples above shows a performance benefit attained from adding threads! On a quad-core laptop (Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz), we observed the scaling and timing numbers shown in Tables 1 and 2. These are plotted in figures 2 and 3.

Integration into an existing language
Another challenge we faced was seeing what would be needed to integrate this work with pre-existing code. JuliaLang is an existing, post version 1.0 language with promises to maintain backwards compatibility and a large third-party code base that depends on it. Any changes needed to have an upgrade path. Wherever there was existing code that might reasonably be expected to be safe to use from multiple threads, that code needed to be identified and fixed. Fortunately, many key aspects of the language had previously been designed in expectation of becoming threaded. In some other popular languages, we see they have not been able to add unrestricted threading. There were several areas that needed to be tackled to determine the appropriate upgrade path:
User-facing APIs:
- concurrency basics: Task, and associated functions including schedule, yield, wait
- mutexes: ReentrantLock and Condition variables, including lock, unlock, wait
- synchronization primitives: Channel, Event, AsyncEvent, Semaphore
- IO and other delays: including read, write, open, close, sleep
- experimental Threads module: random assortment of building blocks and atomics
- memoization-type caches (e.g. inside Regex.PCRE and the Random.GLOBAL_RNG object reference)
Once we determined we wanted to make concurrency and parallelism use the same concept (named a Task), that set many priorities. Many of the APIs in our list of user-facing APIs were able to directly add thread-safety "under-the-hood", as they say. This meant that we found that typical user code that already interacted with IO, synchronization, locks, and tasks could continue to operate unchanged. In most cases, we achieved this by adding fine-grained locks on each critical resource. There were a few notable cases:

Changes to Tasks
The existing concurrency primitive of Tasks was enhanced by exposing a new, optional flag to enable thread-migration for it. We call this concept "sticky" tasks, as a default task is only cooperatively runnable on the thread that scheduled it. When set to false, however, the task becomes eligible to be picked up by any other thread. Combined with the internal changes to make wait on events and channels thread-safe, we believe this provides an easy-to-use mechanism for selecting between the simpler cooperatively concurrent usage (single-threaded) and the more general simultaneous parallelism (multi-threaded).
    t = Task(() -> closure_code())  # closure_code stands in for the work to run
    t.sticky = false  # t may now get run on any thread
    schedule(t)
    ...
    wait(t)
However, while conceptually simple, the above felt slightly awkward compared to the fairly succinct @async syntax used for creating a concurrent task. We wanted to make it similarly convenient, so we also created a new Threads.@spawn macro and integrated it with the existing @sync macro.
    using Base.Threads: @spawn
    @sync begin
        @async concurrent_closure()
        @spawn parallel_closure()
    end  # wait for all

Changes to Condition
The existing Condition object couldn't be made thread-safe. There were two replacements identified: one, replace it with an auto-resetting event with the same API; or two, replace it with a new mutex-based API. We decided to go with the latter option. This meant that existing usage of Condition was only correct if it remained on a single thread. We decided to mechanically enforce this by asserting on usage that it was always used from the same thread it was created on.
[Listing: waiting on a condition with the new mutex-based API]
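A minimal sketch of the mutex-based pattern, assuming a shared flag named ready (a hypothetical stand-in for whatever state the waiting code checks): the lock is held while the condition is tested, wait releases it while sleeping, and the notifier updates the state under the same lock.

```julia
using Base.Threads: @spawn

cond = Threads.Condition()   # a condition variable bundled with its own lock
ready = Ref(false)           # shared state guarded by cond's lock

waiter = @spawn begin
    lock(cond)
    try
        while !ready[]       # re-check the state after every wakeup
            wait(cond)       # releases the lock while sleeping
        end
        :done
    finally
        unlock(cond)
    end
end

# The notifying side changes the state and signals while holding the lock.
lock(cond)
try
    ready[] = true
    notify(cond)
finally
    unlock(cond)
end

result = fetch(waiter)
```

Because the flag is only read and written under the lock, it does not matter whether the waiter starts before or after the notification: it either sees the flag already set or is woken by notify.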
Previously, this would have been more simply c = Condition(); isconditionmet() || wait(c). While this change may seem more difficult at first glance, we observed that while the lock acquisition here could be hidden inside wait in the first replacement, all of this structure will usually still be required by the isconditionmet function. And the code would become much more complicated by the need to release the lock before calling wait. We concluded therefore that in most cases the code would be made simpler and faster by changing the API to the second option. This also meant that when code was being changed to be thread-safe, it would need to replace uses of Condition with the new Threads.Condition.

Changes to I/O
Changing the I/O code (files, streams, folders, and other platform code) to work from any thread was another big project. The existing design requires an underlying asynchronous library, with a design similar to Windows IOCP, to efficiently manage large numbers of open event sources and provide the simplicity and concision of the logic shown in section 1.1 on all platforms. For this, we have been using the libuv library. This lets us have most platform-specific code isolated in a separately tested library and provide more commonality in our runtime library. As an initial implementation to make this library safe to use from threads, we've used one big I/O lock around all calls to it. However, this library also has callbacks and will block to wait for external incoming events, so we also needed to integrate it fully with the task scheduler to get it to cooperatively release the lock on demand. We were able to do so by adding an asynchronous channel (uv_async_t) to wake the one thread running the event loop while all other threads sleep on a system condition variable (uv_cond_t) when there is no work for them to perform. When we try entering the event loop, we do so only if the count of currently waiting tasks is zero. In the future, this work may allow us to move the event loop entirely to a separate thread (and/or multiple threads). It seems that this design change may thus be making threading support a mandatory requirement for the underlying VM, with the advantage that we can get more throughput on the large-core systems that are only becoming more common.

Changes to Memoization Caches
The usual strategy for dealing with these was to turn them from true globals into thread-local variables. To assist in that goal, we assign all threads a low numbered threadid. This can then be used to index a global array to access the cache for that thread. For example, instead of one global Random.GLOBAL_RNG object representing the global MersenneTwister pseudo-random number generator (PRNG) state, we use a Random.default_rng() function to retrieve the current PRNG for that thread (or to lazy-initialize one from system randomness on first use).
    function default_rng()
        tid = Threads.threadid()
        @assert 0 < tid <= length(THREAD_RNGs)
        if @inbounds isassigned(THREAD_RNGs, tid)
            @inbounds MT = THREAD_RNGs[tid]
        else
            MT = MersenneTwister()
            @inbounds THREAD_RNGs[tid] = MT
        end
        return MT
    end
    function __init__()
        resize!(empty!(THREAD_RNGs), Threads.nthreads())
    end

Changes to the Julia Runtime Library
The functionality provided in libjulia also needed to be thread-safe. While some of it consists of stateless helper functions, much of it is where the shared global state for the language lives (by contrast, much of the system library is written in the JuliaLang language itself and as a general principle, the whole system has avoided using mutable global state unless essential). Due to the design of the rest of the language avoiding access to mutable state inside the runtime library, we felt it would be acceptable to use fine-grained locks for protecting most accesses. Many of these were added in an earlier version of JuliaLang, while threading was still under highly experimental development. These included such aspects as code-generation (JIT compilation) and GC (memory allocation and freeing). Discussion of the GC design and subsequent updates to make it work well with threads could occupy an entire article of its own, so it will not be discussed here, although some improvements being investigated for the compiler are discussed in the future work section later in this paper.
When using locks, there is a hierarchy of access that must be respected to avoid deadlocks. This is documented somewhat sparsely at https://docs.julialang.org/en/v1/devdocs/locks/. Over time, we'll extend this list as we discover problems or are able to simplify shared resources. There are some known issues already, such as the lack of a lock around certain "toplevel-only" operations and an invalid design for the ordering of the Module->lock. These issues will be addressed in time; they are not believed to be insurmountable. The missing toplevel lock is interesting, since it is a lock against concurrent execution of any other code. This will require halting all other threads in some way to inhibit accidental observation of the global state while it's in an intermediate inconsistent state. This should be possible in coordination with the GC-safepoint lock, which already has a very similar problem.
Some aspects were still too performance critical however to be able to use a lock there, so we also make careful use of atomic pointer-publishing updates in a few specific places. As special cases of that, we use RCU-type (read-copy-update) updates in some places and write-once in other places. This is known to work on most computer architectures. Others, such as the notorious DEC Alpha, we are content to exclude. In a code-base that already supports garbage-collection, the RCU algorithm is greatly simplified (and writers pay no additional cost), so this is typically preferred if mutation is absolutely required and reads must be fast. Otherwise, a simple lock is used.

Implementation
A prototype implementation of the partr scheduler was first written for us in C by Kiran Pamnany of Intel back in late 2016 (footnote 3), following research done on scheduling threads for beneficial cache sharing for best throughput [1]. The goal of this work was effortless composition of threading-capable libraries with a globally depth-first work ordering (as opposed to 1:1/preemptive scheduling, which would try to make progress on all work, or work-stealing scheduling, which is only depth-first local to a thread and is globally breadth-first). The next stage of this work was then to integrate it with the existing JuliaLang runtime system and hoist as much of the implementation as possible into native Julian code. (Aside: one outcome of this work has been to allow us to delete much of the special support code from the C runtime for our prior experimental '@threads' fork/join-style API!)
A big challenge of this work has been implementing a sound algorithm for determining when threads should "park" themselves in a sleep mutex or wait for I/O. This requires careful coordination to ensure we don't create a single contention point when trying to schedule and run tasks, but also are responsive to resume when new work arrives (either internally, from another thread, or externally, from I/O streams). This is done by setting a flag in the task to notify it after work is added to the queue. If the running task sees that the thread was previously sleeping, it then additionally notifies its condition variable to wake it up.

Foreign Libraries
An important motivation for this work was our desire to better support multi-threaded capable libraries, without considerable CPU over-subscription killing performance due to cache-thrashing and frequent preemptive CPU context switches. Previously, the only options were often for the user to decide up-front to limit JuliaLang to N threads, and tell the threaded library (such as libfftw or libblas) to use M ÷ N (floordiv) cores. The most common choices were probably 1 and M, so only part of the application and running time was able to benefit from the presence of multiple cores in the system. However, given our ability to quickly create and run work items in our thread pool, we are looking at how to work with external libraries and let them integrate with our existing thread-pool as well. This is an on-going area of exploration as we get feedback on the performance and API needs of various libraries. We've successfully adapted FFTW to run on top of our threading runtime instead of its own (footnote 4) (a pthreads-based workpool). This took us only a few hours (we were fortunate to be able to enlist the assistance of that library's author). Without any performance tweaking

Future work
While this work has been ongoing for several years already, there are still many interesting and important improvements to consider. We'd like to investigate ways to further expand the thread-safe API surface and integrate powerful thread-sanitizer tooling to help users write better code. There's also substantial room for the standard library to start using this threading runtime whenever possible. However, we need to explore ways to safely and conveniently expose this option to users (two goals which often seem contradictory).
Additional performance testing is necessary to fine-tune the heuristic numbers. For example, when adding work items to the dynamic scheduler to run on P cores, what is a good ratio factor k to use when creating chunks of work? Should we make 1P items (assume a static schedule)? Or 105P (assume a static schedule on either of P or P − 1 processors for P = 4, 6, 8, 16)? Or perhaps simply 3P is sufficient to balance out much variation? And yet other workloads may want one work item for every input value (like Distributed.pmap does)!
We specifically highlight one additional item: concurrent garbage collection. Currently, JuliaLang's runtime library needs to wait for all threads to arrive at a safe-point or be in a safe-region (such as foreign code) before GC can start. This can introduce long pauses if one or more threads are far away from hitting such a region. Presently, those only exist where manually inserted into the code, such as while waiting for a lock or doing allocation. In the future, we intend to investigate options for automatic placement of safe-points by the system to minimize GC start latency without unduly impacting allocation-free code. There are certainly more approaches to handle releasing memory than there are language implementations in existence, possibly multiple times over. So suffice to say this is an area with many possible trade-offs!
For an example of where we might also go with this, please take a look at the Mono project's documentation on cooperative thread suspension [5] for how a different language, which shares a common code-generation strategy, handles this.
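To make the chunking question raised earlier in this section concrete, the sketch below splits a reduction into roughly k work items per thread and lets the dynamic scheduler balance them. The helper chunked_sum and its keyword k are hypothetical names for illustration only, the default of 3 is just one of the candidate ratios mentioned above, and the input is assumed to be a non-empty, 1-based vector.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper: sum f(x) over xs, creating roughly k work items per
# thread and leaving their placement to the scheduler.
function chunked_sum(f, xs::AbstractVector; k::Int = 3)
    n = length(xs)
    (n > 0 && k > 0) || throw(ArgumentError("need a non-empty xs and k >= 1"))
    nchunks = min(n, k * nthreads())
    len = cld(n, nchunks)          # items per chunk, rounded up
    tasks = [@spawn(sum(f, view(xs, lo:min(lo + len - 1, n)))) for lo in 1:len:n]
    return sum(fetch, tasks)
end
```

Timing calls such as chunked_sum(abs2, xs; k = 1) against k = 3 or k = 105 on a workload with uneven per-item cost is one simple way to explore the trade-off between scheduling overhead and load balance.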

Conclusions and summary
JuliaLang's approach to multi-threading combines many previously known ideas in a novel framework. While each in isolation is useful, we believe that, as is so often the case, the sum is greater than the parts.

Bad puns
We liken the addition of thread-safety to moving from the age of mechanization... Figure 4 is a windmill. Figure 5 is an atom.