Penguin Parallel: Parallel abstractions
`PenguinParallel` is a toolkit of abstractions for parallel programming. It
contains a number of small, single-purpose components that can be combined to
build higher-level abstractions.
User-friendly APIs
The user-friendly APIs for expressing parallelism represent the two key types of parallelism explored in this library.
Key abstractions include:
- Pipeline iterators allow users to structure data and compute using
iterators that can be parallelized across multiple cores. Pipeline iterators
leverage two kinds of parallelism: (a) homogeneous parallelism, where the same
function is applied to data in parallel, and (b) heterogeneous (pipeline)
parallelism, where two different kinds of operations occur simultaneously on
two independent threads. Homogeneous parallelism is exposed via
transformations such as `interleave` and `map`, which run the same computation
on different data in two or more threads concurrently. Pipeline parallelism is
epitomized by the `prefetch` transform, which runs the computation “before”
the transformation on a background thread, so that all operations that come
“after” can occur in parallel.
- Parallel operations on arrays allow users to compute data-parallel
algorithms easily. Operations such as `reduce` (and friends `max`, `min`,
`sum`, `product`, …), `sort`, `filter`, `group`, and `map` (and many more) can
be easily parallelized across all available cores of a machine. A sketch of
both styles of parallelism appears after this list.
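To make the two styles concrete, here is a minimal, self-contained Swift sketch. `parallelMap` and `PrefetchingIterator` are hypothetical stand-ins illustrating the underlying ideas (homogeneous and pipeline parallelism, respectively); they are not PenguinParallel's actual APIs.

```swift
import Foundation

// Homogeneous parallelism: apply the same function to different elements on
// different cores, in the spirit of a parallelized `map`. Hypothetical name.
func parallelMap<T, U>(_ input: [T], _ transform: (T) -> U) -> [U] {
  var results = [U?](repeating: nil, count: input.count)
  results.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: input.count) { i in
      buffer[i] = transform(input[i])  // Each index is written by exactly one thread.
    }
  }
  return results.map { $0! }
}

// Heterogeneous (pipeline) parallelism: a background thread eagerly computes
// elements from `base` into a bounded buffer while the consumer processes
// earlier elements, in the spirit of a `prefetch` transform.
final class PrefetchingIterator<Base: IteratorProtocol>: IteratorProtocol {
  private var buffer: [Base.Element?] = []
  private let condition = NSCondition()

  init(_ base: Base, capacity: Int = 8) {
    var base = base
    Thread.detachNewThread { [self] in
      while true {
        let element = base.next()  // The "before" work runs on this background thread.
        condition.lock()
        while buffer.count >= capacity { condition.wait() }
        buffer.append(element)
        condition.signal()
        condition.unlock()
        if element == nil { return }  // Upstream exhausted; stop prefetching.
      }
    }
  }

  // Sketch assumes `next()` is not called again after it returns nil.
  func next() -> Base.Element? {
    condition.lock()
    while buffer.isEmpty { condition.wait() }
    let element = buffer.removeFirst()
    condition.signal()
    condition.unlock()
    return element
  }
}
```

For example, `parallelMap(Array(0..<1_000_000)) { $0 * $0 }` squares a million integers across all cores, while wrapping an iterator in `PrefetchingIterator` overlaps upstream computation with downstream consumption.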
For more details, please check out the documentation for each type.
Thread pool
From first principles, a (CPU) compute-bound application will run at peak performance when overheads are minimized. Once enough parallelism is exposed to leverage all cores, the key overheads to minimize are context switching and thread creation & destruction. The optimal system configuration is thus a fixed-size threadpool with exactly one thread per CPU core (or rather, hyperthread). This configuration results in zero context switching, no additional kernel calls for thread creation & deletion, and full utilization of the hardware.
Unfortunately, in practice, it is infeasible to statically schedule work a priori onto a fixed pool of threads. Even when applying the same operation to a homogeneous dataset, there will inevitably be variability in execution time. (This can arise from I/O interrupts briefly taking over a core, page faults, or even different latencies for memory access across NUMA domains.) As a result, it is important for peak performance to build abstractions that are flexible and dynamic in their work allocation.
The `ThreadPool` protocol is a foundational API designed to enable efficient use of hardware resources. There are two APIs exposed to support the two kinds of parallelism:
- `prun` is used for “fire-and-forget” style parallelism, where closures are dynamically scheduled to execute on the (fixed-size) threadpool.
- `pJoin` is used for Cilk-style fork-join parallelism, using work-stealing to efficiently leverage available resources on a machine. The key abstraction is a single function, `pJoin`, which takes two closures and runs them (optionally) in parallel.
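As an illustration of the fork-join style, here is a hedged sketch of a `pJoin`-like primitive and a classic recursive use of it. The name `pseudoJoin`, the Dispatch-based scheduling, and the sequential cutoff are assumptions for exposition; the real `pJoin` runs on the work-stealing threadpool.

```swift
import Dispatch

// A sketch of a fork-join primitive in the spirit of `pJoin`: take two
// closures and run them (potentially) in parallel. A work-stealing pool would
// push one closure onto a deque for an idle worker to steal; this sketch
// simply runs both halves concurrently.
func pseudoJoin(_ a: () -> Void, _ b: () -> Void) {
  DispatchQueue.concurrentPerform(iterations: 2) { i in
    if i == 0 { a() } else { b() }
  }
}

// Classic fork-join recursion: sum an array by splitting it in half, summing
// each half in parallel, and combining the results.
func parallelSum(_ values: ArraySlice<Int>) -> Int {
  if values.count < 1_024 { return values.reduce(0, +) }  // Sequential cutoff.
  let mid = values.startIndex + values.count / 2
  var left = 0
  var right = 0
  pseudoJoin(
    { left = parallelSum(values[..<mid]) },    // Fork: left half.
    { right = parallelSum(values[mid...]) })   // Fork: right half.
  return left + right                          // Join.
}

// Example: parallelSum(ArraySlice(Array(1...1_000_000))) == 500_000_500_000
```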
Note: while there should be only one “physical” threadpool process-wide, there can be many virtual
threadpools that compose on top of this one to allow configuration and tuning. (This is why
`ThreadPool` is a protocol and not a collection of static methods.) Examples of additional
threadpool abstractions could include a separate threadpool per NUMA domain, threadpools that
support different priorities for tasks, or higher-level parallelism primitives such as
“wait-groups”.
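To illustrate why making `ThreadPool` a protocol enables this kind of composition, here is a hedged sketch. The protocol requirements shown and the `PriorityPool` wrapper are invented for illustration and do not match PenguinParallel's actual protocol.

```swift
// A hypothetical, simplified stand-in for the real `ThreadPool` protocol.
protocol SketchThreadPool {
  /// Fire-and-forget dispatch, in the spirit of `prun`.
  func dispatch(_ task: @escaping () -> Void)
  /// Fork-join execution, in the spirit of `pJoin`.
  func join(_ a: () -> Void, _ b: () -> Void)
}

/// A "virtual" pool that composes on top of an underlying physical pool,
/// associating every task with a fixed priority (bookkeeping elided).
struct PriorityPool<Underlying: SketchThreadPool>: SketchThreadPool {
  var underlying: Underlying
  var priority: Int

  func dispatch(_ task: @escaping () -> Void) {
    // A real implementation would enqueue into a priority-aware queue; this
    // sketch simply forwards to the shared physical pool's threads.
    underlying.dispatch(task)
  }

  func join(_ a: () -> Void, _ b: () -> Void) {
    underlying.join(a, b)
  }
}
```

Because both the physical pool and its virtual wrappers conform to the same protocol, user code written against the protocol runs unchanged on either, which is what allows per-NUMA-domain pools, priority tiers, or wait-group-style primitives to layer over the single process-wide pool.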