Parallel Programming

Let us now turn our focus to parallel programming. For many large problems, we could divide them into chunks and evaluate the solution for these chunks at the same time on multiple cores, before combining the results, just like a divide-and-conquer approach. However, doing so is traditionally seen as difficult, and we usually use the same libraries and language primitives that are used for concurrency to develop a parallel program. Writing parallel programs in general-purpose imperative languages can be complex and tedious.

While we could certainly use Haskell’s concurrency features like forkIO, MVar and Chan to develop parallel code, there is a much simpler approach available to us. All we need to do is to annotate some sub-expressions in our functions to make them evaluated in parallel.

Non-Strict Evaluation

In the very beginning of this course, we described Haskell as a non-strict evaluation language. That is, Haskell decides the evaluation strategy for us, unlike other strict evaluation languages where things are evaluated in a deterministic and specific format. For example, in Python, a function call is evaluated by first fully evaluating its arguments, then executing each statement in the function from top down. Haskell generally only evaluates terms by need, giving rise to a notion of lazy evaluation.

The key idea of attaining parallelism in Haskell is by specifying parallel evaluation strategies.

Strict Evaluation

Before we bgin describing how to evaluate terms in parallel, we must first describe how we can even force the evaluation of a term in the first place. For example, in the following program:

ghci> x = [1..]
ghci> y = sum x

virtually nothing is evaluated, and GHCI does not enter an infinite loop. This is because there is as yet no demand for the evaluation of y. Of course, if we attempt to evaluate y, we do arrive at an infinite loop, because evaluating the actual sum of x is required to determine what y is.

Therefore, whenever an expression is encountered, Haskell allocates a thunk as a uncomputed placeholder for the result of the expression evaluation. The thunk is only evaluated by need (usually as little as possible) to evaluate other parts of code.

For example:

ghci> x = [1..]
ghci> case x of { [] -> 0; (x:xs) -> x }
1

Notice that the case expression demands the evaluation of x. However, it does not demand the complete evaluation of x. Instead, it only demands to know the constructor of x. Therefore, when executing x = [1..], Haskell puts a completely unevaluated thunk, for x, and the case expression then evaluates x to head normal form (HNF) (evaluating to the constructor but not its arguments)¹ to perform the case analysis.

Another example of lazy evaluation is with let expressions:

ghci> let x = [1..]; y = sum x in 1 + 2
3

Again, Haskell does not evaluate y at all since it is not demanded in the evaluation of 1 + 2!

This may be a problem for concurrency and parallelism, because it is possible for forkIO to push an I/O action to a different thread, only for that thread to allocate an unevaluated thunk for it, and when its evaluation is demanded, the evaluation is done on the main thread!

expensive :: MVar String -> IO ()
expensive var = do
    putMVar var expensivelyComputedString

main :: IO ()
main = do
    var <- newEmptyMVar
    forkIO $ expensive var
    whatever
    result <- takeMVar var
    print result

The program above gives the impression that the expensive computation is done on the forked thread. However, in reality, what could happen is that the thread running expensive only allocates a thunk for expensivelyComputedString, and returns. Then, when the result is demanded in the main I/O action running in the main thread, it is the main thread that computes the expensively computed string, thereby, achieving nothing from the concurrency.

It is for this reason that Haskell exposes primitives for deciding the evaluation of expressions. The one most used is seq, which introduces an artificial demand for an expression to be evaluated to head normal form:

ghci> :t seq
seq :: a -> b -> b

The expression x `seq` y evaluates to y, but creates an artificial demand for the evaluation of x as well. Therefore, evaluating the following expression does not terminate:

ghci> let x = [1..]; y = sum x in y `seq` 1 + 2

However, notice that the following does terminate:

ghci> let x = [1..] in x `seq` 1 + 2
3

This is because seq only creates an artificial demand for x to be evaluated to head normal form, i.e. up to the evaluation of its constructor.

What we can do instead is to introduce a new evaluation strategy for forcing the full evaluation of a list:

ghci> :{
ghci| deepSeq :: [a] -> b -> b
ghci| deepSeq [] x = x
ghci| deepSeq (x:xs) y = x `seq` deepSeq xs y
ghci| :}
ghci> x = [1..]
ghci> x `seq` 1
1
ghci> x `deepSeq` 1

Using deepSeq now forces the full evaluation of x, which obviously does not terminate because x is infinitely large! However, note that deepSeq only evaluates the elements to HNF—therefore, if x were a, for example, a two-dimensional list, the individual one-dimensional lists in x are only evaluated to HNF, i.e. only their constructors are evaluated.

Parallel Evaluation

Since parallel programming is all about deciding what expressions to evaluate in parallel, all we need is some primitives that tell the compiler to evaluate an expression in parallel, just like seq! The GHC.Conc module exposes two evaluation primitives, par and pseq that allows us to do parallel programming easily:

ghci> import GHC.Conc
ghci> :t par
par :: a -> b -> b
ghci> :t pseq
pseq :: a -> b -> b

par is straightforward to understand: x `par` y is an expression stating that there is an artificial demand for x that could be evaluated to HNF in parallel. However, par does not guarantee the parallel evaluation of x. This is because x could be a cheap computation that does not need to be, and should not be, evaluated in parallel, or that there are not enough cores available for the parallel evaluation of x.

Then, what is pseq for? Notice this: in an expression x `par` f x y, we claim to want to evaluate x in parallel to HNF, and then combine it with y using f in the current thread. However, this requires a guarantee that y is evaluated on the current thread before the current thread attempts to evaluate x. Otherwise, it could be that par will queue a spark for the evaluation of x, and before a new thread can be sparked for that evaluation, the current thread evaluates f x y, which performs the evaluation of x first; therefore, no parallel evaluation of x happens, defeating of par in the first place.

Therefore, we need some primitive that performs the evaluation of an expression to HNF before another expression. seq does not do this; x `seq` y only claims to evaluate x to HNF, but does not enforce that to happen before y. In contrast, pseq does. x `pseq` y guarantees that the evaluation of x to HNF happens before the evaluation of y.

As such, par and pseq allow us to annotate computations with evaluation strategies to describe what computation happens in parallel, and what that computation is in parallel with. For example, the expression x `par` (y `pseq` f x y) states roughly that x happens in parallel with y, then the results are combined using f.

For example, let us try writing a parallel (but still exponential) fibonacci:

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = n1 `par` (n2 `pseq` (n1 + n2))
  where n1 = fib (n - 1)
        n2 = fib (n - 2)

Aside from the usual base cases, the recursive case computes the \(n-1\) and \(n - 2\) fibonacci numbers in parallel, then combines them together with addition. Described in words, the recursive case computes fib (n - 1) in parallel with fib (n - 2) by queueing a spark for fib (n - 1) and evaluating fib (n - 2) in the current thread, then adds the results together with plain addition.

Computing fib 45 shows that for large values, having more cores makes a big difference.

> time cabal run playground -- +RTS -N20
Number of cores: 20
1134903170

________________________________________________________
Executed in    3.29 secs    fish           external
   usr time   53.14 secs  319.00 micros   53.14 secs
   sys time    0.47 secs  129.00 micros    0.47 secs

> time cabal run playground -- +RTS -N1
Number of cores: 1
1134903170

________________________________________________________
Executed in   12.93 secs    fish           external
   usr time   12.61 secs  418.00 micros   12.61 secs
   sys time    0.08 secs  171.00 micros    0.08 secs

When Should We Parallelize?

However, notice the usr time for the case of running our program on 20 cores. Clearly, the CPU does more than 4x more work than the single core case; it just so happens that leveraging more cores makes the speed-ups outweigh the additional overhead. Indeed, while par is cheap, it is not free. Although Haskell threads are lightweight, threads in general will always incur some additional overhead, and at some point, the benefits of computing something in parallel are outweighed by the overhead of spawning a new thread for its computation. For example, in the case of computing fib 3, it is frankly completely unnecessary to compute fib 2 and fib 1 in parallel, since both are such small computations that run incredibly quickly.

Let us amend our implementation to only use parallelism for larger values. Smaller values are computed sequentially:

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
-- sequential for small n
fib n | n <= 10 = fib (n - 1) + fib (n - 2)
-- parallel for large n
fib n = n1 `par` (n2 `pseq` (n1 + n2))
  where n1 = fib (n - 1)
        n2 = fib (n - 2)

The execution time shows a significant speed-up on both the single core and multicore runtimes!

> time cabal run playground -- +RTS -N20
Number of cores: 20
1134903170

________________________________________________________
Executed in  892.37 millis    fish           external
   usr time   13.01 secs    646.00 micros   13.00 secs
   sys time    0.18 secs      0.00 micros    0.18 secs

> time cabal run playground -- +RTS -N1
Number of cores: 1
1134903170

________________________________________________________
Executed in    6.81 secs    fish           external
   usr time    6.71 secs  453.00 micros    6.71 secs
   sys time    0.03 secs    0.00 micros    0.03 secs

Generally speaking, knowing when to parallelize is a matter of experimentation, trial-and-error and engineering experience. It highly depends on the computation you are trying to parallelize, the kind of computation you are doing, the usual inputs to the computation, and so on.

Parallel Strategies

Let us try writing a parallel mergesort:

mergesort :: Ord a => [a] -> [a]
mergesort [] = []
mergesort [x] = [x]
mergesort ls
  | n < 100 = merge left' right'
  | otherwise = par left' $ pseq right' $ merge left' right'
    where n = length ls `div` 2
          merge [] ys = ys
          merge xs [] = xs
          merge (x:xs) (y:ys)
            | x <= y = x : merge xs (y : ys)
            | otherwise = y : merge (x:xs) ys
          (left, right) = splitAt n ls
          left' = mergesort left
          right' = mergesort right

Our mergesort function does a typical merge sort, except from the fact that we are using an immutable list. Let us write a supporting main function to test our program:

main :: IO ()
main = do
  n <- getNumCapabilities
  putStrLn $ "Number of cores: " ++ show n
  let ls :: [Int] = [10000000, 9999999..1]
      ls' = mergesort ls
  print $ length ls'

> time cabal run playground -- +RTS -N20
Number of cores: 20
10000000

________________________________________________________
Executed in    3.58 secs    fish           external
   usr time   16.02 secs  381.00 micros   16.02 secs
   sys time    1.39 secs  159.00 micros    1.39 secs

> time cabal run playground -- +RTS -N1
Number of cores: 1
10000000

________________________________________________________
Executed in    6.11 secs    fish           external
   usr time    5.62 secs    0.00 micros    5.62 secs
   sys time    0.43 secs  586.00 micros    0.43 secs

From before, recall that because Haskell is a lazy language, it may be the case that not all the supposedly parallel computation happens in the other thread. Since both par and pseq evaluate their first arguments only to HNF, it really only does evaluation up until it determines the constructor of the list after sorting, leaving the remainder of the list unevaluated. Then, in main, when we obtain the length of the list, the main thread may then have to evaluate the remainder of the list in the same thread. Let us extract some more performance out of our parallel evaluation by actually evaluating everything deeply in the parallel computation using deepSeq from before:

mergesort :: Ord a => [a] -> [a]
mergesort [] = []
mergesort [x] = [x]
mergesort ls
  | n < 100 = merge left' right'
  | otherwise = par (deepSeq left') $ pseq right' $ merge left' right'
    where n = length ls `div` 2
          merge [] ys = ys
          merge xs [] = xs
          merge (x:xs) (y:ys)
            | x <= y = x : merge xs (y : ys)
            | otherwise = y : merge (x:xs) ys
          (left, right) = splitAt n ls
          left' = mergesort left
          right' = mergesort right

deepSeq :: [a] -> ()
deepSeq [] = ()
deepSeq (x:xs) = x `seq` deepSeq xs

Now we should notice some more performance gains!

> time cabal run playground -- +RTS -N20
Number of cores: 20
10000000

________________________________________________________
Executed in    2.89 secs    fish           external
   usr time   18.04 secs  365.00 micros   18.04 secs
   sys time    0.68 secs  145.00 micros    0.67 secs

> time cabal run playground -- +RTS -N1
Number of cores: 1
10000000

________________________________________________________
Executed in    6.18 secs    fish           external
   usr time    5.59 secs  362.00 micros    5.59 secs
   sys time    0.46 secs  145.00 micros    0.46 secs

Some very smart people have also come up with nice and elegant ways to write parallel code. For example, using the parallel library, we can express parallel programs with Strategy's in the Eval monad:

mergesort :: (Ord a, NFData a) => [a] -> [a]
mergesort [] = []
mergesort [x] = [x]
mergesort ls
  | n < 100 = merge left' right'
  | otherwise = runEval $ do
      l <- rparWith rdeepseq left'
      r <- rseq right'
      return $ merge l r
  where n = length ls `div` 2
        (left, right) = splitAt n ls
        left' = mergesort left
        right' = mergesort right
        merge [] ys = ys
        merge xs [] = xs
        merge (x:xs) (y:ys)
          | x <= y = x : merge xs (y : ys)
          | otherwise = y : merge (x:xs) ys

Strategies also allow us to separate algorithm from evaluation. For example, we can write a parallel fibonacci like so:

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n | n <= 10 = fib (n - 1) + fib (n - 2)
fib n = runEval $ do 
    n1 <- rpar (fib (n - 1))
    n2 <- rseq (fib (n - 2))
    return $ n1 + n2

Alternatively, we can make clear the distinction between the underlying algorithm and the evaluation strategy with using:

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n | n <= 10   = n1 + n2
      | otherwise = (n1 + n2) `using` strat
  where n1 = fib (n - 1)
        n2 = fib (n - 2)
        strat v = do { rpar n1; rseq n2; return v }

We will leave it up to you to learn more about parallel Haskell with the parallel library. For more information, you may read the paper by Marlow et al.; 2010 that describes it. We shall not cover these because they, along with par and pseq, are much more Haskell-specific and less applicable to code written in general-purpose languages. The only goal of this chapter, which we hope has been achieved, is to show how easy it is to introduce parallelism to regular sequential programs in a purely functional programming language.

Usually expressions are evaluated to weak head normal form (WHNF), although the distinction is not crucial for our understanding.

References

Simon Marlow, Patrick Maier, Hans-Wolfgang Loidl, Mustafa K. Aswad, and Phil Trinder. 2010. seq No More: Better Strategies for Parallel Haskell. In Proceedings of the third ACM Haskell symposium on Haskell (Haskell '10). Association for Computing Machinery, New York, NY, USA, 91–102. https://doi.org/10.1145/1863523.1863535.

Typed Functional Programming