Multithreading and performance
Knuth said, "Premature optimization is the root of all evil".
A few months ago, I read a blog post estimating that for every core you add to a system, you can expect a 'measly' 30% extra application performance. Extrapolated out to a 32-core system, though, that's a huge increase, which led to all sorts of "wow" and "ooh" and "this is the second coming of computing". Never mind that CPU performance is just one factor in your total application performance, and often not even a significant one.
At the time, I was benchmarking and optimising a network intrusion prevention system. With one processor, we got 1x performance. With two processors, 1.3x. So far on track. At four processors, things got slower - only 1.2x.
What's going on here?
This looks a lot like economies of scale in reverse: as you add workers to a factory, you gain production capacity, but add too many and eventually the workers 'trip over each other' and start reducing the factory's total output.
The equivalent concept in computing is resource contention. If you have, say, four processors, you'll need at least four threads to fully utilise them, and up to that point, adding threads is profitable. As you continue adding threads beyond that, you might still see some slight improvements - maybe there's some I/O along the way, and one thread can do its I/O while the others think.
In a perfect world, where there's no overhead from multithreading, you could theoretically add threads forever and your performance would never drop. If there is a spare resource, a thread can step in and make use of it. What kills you in the end is contention and the overheads involved in managing multiple access to a single resource, be it CPU, disk, memory or network time. There's almost always a single resource that constrains your performance.
So why does something like distributed.net work, where there are millions of children and relatively few servers? Why isn't there any overhead there from what could be considered millions of workers in a huge factory?
In distributed.net, the workloads are almost completely independent. The only interaction between workers is when they ping the server to get more work. The workers are carefully engineered to always have work to do, even if they can't reach the server. But if you dive deep into the server code, eventually there has to be a lock that ensures mutual exclusion between children. If two children hit the server at the same time and request a work block, what stops them from both getting the same block? The mutex.
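To make that mutex concrete, here's a minimal sketch. The names and the block-numbering scheme are invented for illustration - this is not distributed.net's actual protocol - but it shows why the lock is unavoidable:

```python
import threading

class WorkServer:
    """Hypothetical server handing out work blocks to clients."""

    def __init__(self, total_blocks):
        self._lock = threading.Lock()   # the mutex discussed above
        self._next_block = 0
        self._total = total_blocks

    def get_block(self):
        # Without the lock, two clients arriving together could both
        # read the same _next_block value and be handed duplicate work.
        with self._lock:
            if self._next_block >= self._total:
                return None             # no work left
            block = self._next_block
            self._next_block += 1
            return block
```

Every client serialises on that one lock, however briefly - which is exactly the overhead that grows as you add clients.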
It's conceivable that if you added enough clients to the distributed.net network, the time spent waiting for the mutex would become significant, and clients would stall waiting for more work. An easier way to visualise the multithreading overhead in this situation is to realise that time spent serving clients is time not spent searching RC5 keys. I have no doubt that the keyservers run the distributed.net client themselves - but the overhead from serving clients is most definitely not zero. This is exactly the same situation as multiple cores in a single machine.
So how about a real-world example? I've been working on a little virus scanner project in my spare time. It's designed to run well across multiple cores - child threads do most of the work, and a dispatcher thread assigns work to the children through a work queue. Each child thread takes an item off the work queue (the name of a file that needs scanning) and scans it. The dispatcher thread receives a signal when the work queue level drops below a low-water mark and refills the queue to a high-water mark. This way, even if the dispatcher thread isn't scheduled regularly (which we'd prefer, since dispatcher time is dead time), the children shouldn't run out of work to do.
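A sketch of that arrangement, under stated assumptions: the thresholds, function names and the sentinel convention are placeholders of my choosing, not the project's real API, and `queue.Queue` stands in for whatever the actual work queue is:

```python
import queue
import threading

LOW_WATER, HIGH_WATER = 4, 16        # illustrative thresholds

def dispatcher(workqueue, need_refill, files, n_workers):
    """Refill the queue to HIGH_WATER whenever a child signals."""
    it = iter(files)
    while True:
        # Top the queue up to the high-water mark.
        while workqueue.qsize() < HIGH_WATER:
            try:
                workqueue.put(next(it))
            except StopIteration:
                # Out of files: tell each child to stop, then exit.
                for _ in range(n_workers):
                    workqueue.put(None)
                return
        need_refill.wait()           # sleep until a child trips the low-water mark
        need_refill.clear()

def worker(workqueue, need_refill, scan_file, results):
    """Take filenames off the queue and scan them."""
    while True:
        path = workqueue.get()
        if path is None:             # sentinel: no more work
            return
        if workqueue.qsize() < LOW_WATER:
            need_refill.set()        # wake the dispatcher
        results.append(scan_file(path))
```

The point of the two water marks is hysteresis: the dispatcher is woken in batches rather than once per item, so the children mostly leave it asleep.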
So let's think about this. Two children, one parent, one shared work queue. The work queue obviously needs some sort of mutual exclusion mechanism. A lock would be the obvious choice, but while the dispatcher holds it to refill the queue, neither child thread can take new work. I expect the dispatcher to be largely I/O bound - its job is merely to traverse the filesystem - while the children should be CPU-bound (scanning) with a little I/O (reading the file from disk).
What about a lock-free queue? Lock-free doesn't mean wait-free. Often, the performance cost is pushed to the enqueue or dequeue functions - in the multiple-writers situation, the enqueue function looks something like:
- read the index of the last empty slot into A
- atomically compare-and-swap: if the last-empty-slot index still equals A, write our item and advance the index
- if the compare-and-swap fails, someone else wrote at the same time we did. Start again.
Note the 'start again' step. You could theoretically wait forever for the enqueue to succeed. In the multiple-writers situation, overhead is unavoidable - you just shuffle it around. It's like Heisenberg's Uncertainty Principle - you can fiddle the numbers as much as you like, but somewhere along the line you have to pay your dues.
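That 'try, detect interference, start again' pattern rests on a compare-and-swap (CAS) primitive. Here's a sketch of the retry loop - with two loud caveats: the CAS is simulated with a lock purely so the example runs (real lock-free structures use the CPU's atomic CAS instruction), and I've used a linked push rather than the array slots described above, because it's the shortest honest demonstration of the same pattern:

```python
import threading

class AtomicRef:
    """Simulates hardware compare-and-swap. The internal lock is only a
    stand-in: a real lock-free structure uses the CPU's CAS instruction."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True          # nobody interfered; our write stuck
            return False             # someone else wrote first

def push(top, value):
    """Lock-free-style push: retry until our CAS succeeds."""
    while True:                      # the 'start again' step
        old_head = top.get()         # remember what we saw
        node = (value, old_head)     # prepare our replacement
        if top.compare_and_swap(old_head, node):
            return                   # CAS succeeded; done
```

Note that a thread losing the race does no damage - its CAS simply fails and it retries - which is what makes this safe, unlike clobber-and-restore schemes.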
So I ran some benchmarks on my AV scanner. I scanned the Windows/System32 directory on my laptop; I'm going to call it a typical workload. You can make a benchmark say anything you like if you adjust the test scenario properly. I'm not going to get into that debate here, because at the end of the day performance isn't my goal. This is just an interesting little side track.
With one thread, the scan took 401 seconds. Note that both cores were still enabled on the machine. The second core can still do useful work - OS overhead, maybe helping out with I/O. This is not strictly a single-core-vs-dual-core test.
With two threads, the scan took 245 seconds. Not a bad improvement, but not double. Whatever 'helping hand' the second core offered in the first test, it's not there now.
On a side note (from the side track), the laptop is not really designed to run both cores flat out. The fans run at full power and the CPU temperature still rises to about 70 degrees Celsius.
The two threads aren't just contending for that work queue lock any more - they're also contending for hardware resources other than CPU time. Obviously, disk utilisation is way up. Given the comparatively slow access times of hard drives, it's easy to imagine a situation where both threads are trying to read at the same time and the disk head is flicking back and forth between the two files, starving both simultaneously. What's a solution here? The first thing that comes to mind is to move file loading to the dispatcher thread - single disk, single thread. You can buffer the files in advance and ensure that at worst, one thread will always be working.
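That 'single disk, single thread' idea can be sketched like so - the names are hypothetical, and a real version would use a bounded queue so the read-ahead buffer doesn't swallow all your RAM:

```python
import queue

def dispatcher_prefetch(paths, workqueue, n_workers):
    """All disk reads happen here, in one thread, so the disk head
    serves one reader at a time instead of seeking between threads."""
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()              # the only thread touching the disk
        workqueue.put((path, data))      # children scan from memory
    for _ in range(n_workers):
        workqueue.put(None)              # sentinels: no more work
```

The children now dequeue (path, data) pairs and never block on the disk; at worst one child is always busy scanning buffered data.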
The two CPUs aren't maxed out. Does this mean that disk utilisation is starving them? The disk isn't maxed out either, but the disk idle time could be explained by the scanner threads working hard on the CPU, which is what we want. So, theoretically, adding more threads should increase overall resource utilisation. Right? A third thread can 'fill in the gaps' and use disk when it's idle, use CPU when it's idle.
A third thread completes its scan in 236 seconds - roughly the same as two threads. The CPUs and disk still aren't maxed out. So what's holding us back? I don't know, but I think we need more threads to fill in those CPU and disk gaps!
With 16 threads, things are significantly slower: 371 seconds for the same work volume. What's going on here? Looking at the console output (which can hide locks as well - only one thread can write to it at a time), most of the threads are sitting idle, and the same few threads are repeatedly scheduled. At a guess, I'd say there's contention for that work queue lock - but this is merely a theory. The easiest way to find your bottleneck is to use a profiler; oprofile on Linux is excellent.
Without a profiler - or if your profiler is giving you confusing results - you need to devise an experiment to be sure. To test whether lock contention is the problem, artificially increase the time the lock is held and measure the new performance against your baseline. If performance drops off sharply, you can say with more certainty that time spent holding the lock is limiting your performance. If there's no significant change, you should come up with a new theory.
Similarly, if you suspect that disk throughput or CPU performance is a bottleneck, you can reduce those resources to see if things get slower. It seems more intuitive to add more of a resource and see if things get faster - but in reality that's difficult. I can't just add a faster disk or a faster CPU to my laptop to see if things improve. If I halve my processor speed and only see a 10% reduction in application performance, I can be sure that CPU performance is not the bottleneck, and that getting a faster CPU probably won't make much difference to anything.
Using my scanner with a very small workload - about 50 megabytes - a single thread is actually faster than two threads. This is partly due to the initialisation time of the threads (which is several seconds) and partly due to the skewed workload (one thread finished all of the small files quickly, while the other thread took a long time to scan a single large file). You can lie with graphs, and you can lie with benchmarks. I have replicable proof that one core is faster than two!