In 2019, Multicore software should be easier to write than ever. Modern programming languages such as Scala and Rust are maturing, programming frames are getting easier to use and C # and good old C ++ are embracing parallelism as part of their standard libraries.
However, in practice, it’s still a messy process. The whole thing turns out to be difficult to synchronize and once the software works, it mostly only runs a little or no faster at all on a multicore processor. And to make matters worse, it tends to evidence all kinds of elusive errors.
Parallel programming is just a very tough subject, where you run into all sorts of subtle, unexpected effects if you don’t understand what’s happening at all levels, tells Klaas van Gend, software architect at Sioux. ‘I’ve heard people talking about sharing nodes on a supercomputer using virtual machines. But they ruin each other's processor cache; they just get in one other's way.’
At university it was all about Dijkstra, which means mutexes, locks and condition variables. But the moment you turn on a lock, you only ensure that the code is executed on one core whilst the others temporarily do nothing. So, you really only learn how not to program for multicore.
According to Van Gend, the problem is that many developers failed to receive a pedagogically sound basis during their computer science training. ‘At university it was all about Dijkstra, which means mutexes, locks and condition variables. But the moment you turn on a lock, you only ensure that the code is executed on one core whilst the others temporarily do nothing. So, you really only learn how not to program for multicore,’ he says.
That is why Van Gend has taken the multicore training course given by his old employer Vector Fabrics, out of the mothballs. Until a few years ago, Vector Fabrics focused on tooling to provide insight into the perils of parallel software. Together with cto Jos van Eijndhoven and other employees, Van Gend provided training courses on the subject. The company went bankrupt in 2016, but Van Gend, in his current employment, has realised that the problem is still relevant. After having once again given the training course at his present employment, he now also offers it under the High Tech Institute flag, for third parties.
Klaas van Gend is the lecturer of the 3-day training 'Multicore programming in C++'.
One of the important matters when writing parallel software, is finding out how to make it work clearly across /on multiple levels, explains Van Gend. He always makes this point with a simple example: Conway’s Game of Life, the cellular automaton where boxes in a grid become black or white with each new round, depending on the status of their immediate neighbours. ‘At the bottom level of your program you have to check what your neighbouring cells are. You can do that with two for-runs/loops. And then you have a loop for a complete row, and above that for the complete set of rows.‘
‘Most programmers will begin to parallelize at those bottom loops. That is very natural, because that is a piece of code that you can still understand, that still fits in your mind. But it makes much more sense to sit/begin at a higher level and take that outer loop. Then you divide the field into multiple blocks of rows and your workload per core is much larger.’
If you look at matters in that way, it soon becomes clear that there are many things to watch out for. There are also programs where the load is variable. ‘For example, we have an exercise to calculate the first hundred prime numbers. There is already more than a factor of one hundred between prime number ten and prime number ninety-nine. Then you have to calculate load balancing.’
There are also differences in what you can parallelize: the data or the task. ‘Data parallelism is generally suitable for very specific applications, but otherwise you soon find a kind of decomposition of your task. This can be done with an actor model or with a Kahn process network, but data-parallelism can again be part of it. In practice you will see that you always end up with mixed forms.’ It has not just been about algorithms for some time now; the underlying hardware plays a key role. For example, if the programmer doesn’t take the caching mechanisms of the processor into account, the problem of false sharing may arise. ‘I have seen huge applications brought to their knees,' says Van Gend. ‘Suppose you have two threads that are both collecting metrics. If you divide those messily, counters from different threads can end up in the same cache line. The two processors then need to work simultaneously with the same cache and your cache mechanism constantly drags the lines back and forth. That lowers performance greatly.’ For that reason, Van Gend is also skeptical about the use of high-level languages in multicore designs; they have the tendency to abstract the details about the memory layout. 'With a language like C ++ it is still very clear that you are working on basic primitives and you can see that clearly. But high-ranking languages often hastily skim over the details of the data types, which means that the system can never really run smoothly.’
If you only partially understand the model, then you will run into problems. It works well for certain specific situations, but it can’t be used everywhere.
In any case, Van Gend thinks that new languages are no wonder cure for the multicore problem. As a rule, they assume a specific approach that doesn’t have to fit well /necessarily fit well with the application at all. ‘Languages such as Scala or Rust rely heavily on the actor model to make threading easier. If you only partially understand the model, then you will run into problems. It works well for certain specific situations, but it can’t be used everywhere.’
The modern versions of C ++ also offer additions to enable parallel programming. ‘Atomics are now fully involved, for example. With this you can often exchange data without stopping anything. We are also working on a library within which the locking is no longer visible to users at all. If it is necessary, it happens without the user seeing it and also with the shortest possible scope, so the lock is released as soon as possible,’ says Van Gend. Here, it is also important to understand what you are doing. Van Gend, for example, is a lot less enthusiastic about the execution policies’ addition to the standard library in C ++ 17. This allows a series of basic algorithms such as find, count, sort and transform to run in parallel by simply adding an extra parameter in the function call. ‘But that only works for some academic examples; in practice, it will not work,' Van Gend says. ‘These api's are based on a wrong basic assumption. And in the C # api they have made the same mistake again.’
The problem is that with this approach you can only make separate steps. ‘It stimulates the individual paralleling of each operation. You re-share your dataset with each operation, do something, then make it whole again and go on to the next operation. It is always parallel, sequential, parallel, sequential, and so on. That is conceptually very clear, but you have to wait all that time until all the threads are ready and then continue. It is a complete waste of time. On the other hand, with a library such as Openmp the entire set of operations is simply distributed over the threads. This means therefore that you don’t have to wait unnecessarily.’
The funny thing is that Microsoft also played a large part in the Par Lab at the University of Berkeley. This has resulted in a fairly large collection of design patterns for parallel programming, which I deal with extensively in the training course.
The gcc compiler doesn’t provide any support for these parallel functions. Visual Studio does, because the additions eventually come from Microsoft. ‘The funny thing is that Microsoft also played a large part in the Par Lab at the University of Berkeley. This has resulted in a fairly large collection of design patterns for parallel programming, which I deal with extensively in the training course. Microsoft has shown that they understand exactly how to do it properly.’