Do NOT Explicitly Use Threads for Parallel Programming

Mon, November 10, 2008, 11:20 PM under ParallelComputing
I presume you are sold on the goal of achieving fine-grained parallelism; if not, please read that post.

In that post I talked quite a bit about the need to partition our work into many small chunks. So how could we partition a long-running work item that looks like this:
  static void Main()
  {
      Tree tr = Tree.CreateSomeTree(9, 1);
      Stopwatch sw = Stopwatch.StartNew();
      WalkTree(tr);
      Console.WriteLine("Elapsed= " + sw.ElapsedMilliseconds.ToString());
      Console.ReadLine();
  }
  static int ProcessItem(int treeData)
  {
      // for demo purposes
      Thread.SpinWait(4000000); //TODO something real with data
      return treeData; // not being used
  }
  static void WalkTree(Tree tree)
  {
      if (tree == null) return;
      WalkTree(tree.Left);
      WalkTree(tree.Righ);
      ProcessItem(tree.Data);
  }
…assuming that Tree is defined like this: Tree.cs.
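In case the Tree.cs link is unavailable, here is a minimal definition consistent with the calls in the sample above (CreateSomeTree, Left, Righ, Data) — the original file may well differ in its details:

```csharp
public class Tree
{
    public int Data;
    public Tree Left;
    public Tree Righ; // matches the member name used in WalkTree above

    // Builds a full binary tree of the given depth; the data values are
    // derived from the seed and are only for demo purposes.
    public static Tree CreateSomeTree(int depth, int seed)
    {
        if (depth == 0) return null;
        Tree t = new Tree();
        t.Data = seed;
        t.Left = CreateSomeTree(depth - 1, 2 * seed);
        t.Righ = CreateSomeTree(depth - 1, 2 * seed + 1);
        return t;
    }
}
```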

The immediate thought might be to use a thread, for example:
  static void WalkTree(Tree tree)
  {
      if (tree == null) return;
      Thread left = new Thread((o) => WalkTree(tree.Left));
      left.Start();
      Thread righ = new Thread((o) => WalkTree(tree.Righ));
      righ.Start();
      left.Join();
      righ.Join();
      ProcessItem(tree.Data);
  }
So now the code runs faster (e.g. on a quad-core machine) than the sequential approach, and the result is good. But the way we achieved it was not good, for at least 3 reasons:

1. By oversubscribing the CPU like this (creating 1023 threads on a quad-core machine) we pay a hefty price for context switching. Sure, the overall performance increased, but it could have increased even more had we used the correct number of threads. The correct number of threads is, of course, equal to the number of cores on the box: on a quad-core, 4 threads is the ideal number of threads to execute compute-bound work.

2. By making the OS schedule all those threads for our process, our process is getting an unfair advantage over other apps on the same box. That is being a bad citizen on the machine.

3. The most important reason, IMO, that the approach above is a bad idea is memory consumption. Since CLR 2.0, every managed thread uses 1MB of committed memory. Depending on your machine configuration and environment setup, the code above won't even run and will instead throw an OutOfMemoryException. That is a clue that this approach simply does not scale. We need an approach where we can increase our data load (e.g. the depth of the tree) and have our algorithm scale. This again goes back to point 1, which is that ideally the number of threads used by our process is equal to the number of cores on the machine.
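To put points 1 and 3 in numbers: a full tree of depth 9 has 2^9 - 1 = 511 nodes, the threaded WalkTree starts two threads per node (one per child, even when the child is null), and with the main thread that gives 1023 threads; at roughly 1MB of committed memory each, that is about 1GB just for thread stacks. A quick sketch of that arithmetic (the per-thread figure is the approximation quoted above):

```csharp
using System;

static class ThreadCost
{
    // Nodes in a full binary tree of the given depth.
    public static long Nodes(int depth)
    {
        return (1L << depth) - 1;
    }

    // The threaded WalkTree starts two threads per node, plus the main thread.
    public static long Threads(int depth)
    {
        return 2 * Nodes(depth) + 1;
    }

    static void Main()
    {
        foreach (int depth in new int[] { 9, 12, 15 })
        {
            // ~1MB of committed stack per managed thread (approximate).
            Console.WriteLine("depth {0}: {1} threads, ~{1} MB committed",
                depth, Threads(depth));
        }
    }
}
```

Notice that the cost grows with the data (the depth of the tree), not with the number of cores, which is exactly backwards.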

So even though partitioning the problem into many small chunks was a good thing, using threads was not. The fundamental problem here is that the unit of partitioning in our user mode app is also the unit of scheduling of the OS.

So, we need some kind of user mode "engine" to schedule only as many threads as the number of cores on the machine and we need this to take place automatically for us. We also need to be able to partition the overall compute-bound operation into many work items that will get executed by the "engine". If this sounds familiar, it should: ThreadPool. The CLR thread pool has the correct internal logic and a simple interface (QueueUserWorkItem) for scheduling work items to be executed by the threads it manages. This appears to be the answer to achieving fine grained parallelism.
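For reference, the entire surface you use today is essentially that one call. A trivial (made-up) usage looks like this; note that because we get no handle back, we must signal completion ourselves:

```csharp
using System;
using System.Threading;

static class PoolDemo
{
    static void Main()
    {
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            // Fire and forget: QueueUserWorkItem returns no handle to the
            // work item, so we signal completion with an event.
            ThreadPool.QueueUserWorkItem(delegate(object state)
            {
                Console.WriteLine("processed " + state);
                done.Set();
            }, 42);

            done.WaitOne(); // wait for the work item to finish
        }
    }
}
```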

The problem with the current ThreadPool API is that it has almost no API. You simply throw items at it in a "fire and forget" manner. You get back no handle to the work item: no way of cancelling it, waiting on it, composing a group of items in a structured way, handling exceptions thrown concurrently, or any other richer construct built on top of it. So while you should be using the ThreadPool today for parallelism, there is a considerable improvement coming in .NET 4 – I'll cover that in another blog post. In the meantime, I leave it to you, dear reader, to convert the tree walking sample above to directly use the ThreadPool instead of threads (hint: you'll need a ManualResetEvent or equivalent to simulate the Thread.Join semantics).
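For anyone who wants to check their answer afterwards, one possible shape for that conversion (a sketch following the hint above, not the only way to do it) queues each subtree walk to the pool and uses a ManualResetEvent per child in place of Thread.Join. Tree and ProcessItem are the ones from the sample, redeclared minimally here so the sketch compiles on its own:

```csharp
using System.Threading;

static class PoolWalk
{
    // Minimal stand-ins for the Tree and ProcessItem from the sample above.
    public class Tree { public int Data; public Tree Left; public Tree Righ; }
    public static int ProcessItem(int treeData) { return treeData; }

    public static void WalkTree(Tree tree)
    {
        if (tree == null) return;
        using (ManualResetEvent leftDone = new ManualResetEvent(false))
        using (ManualResetEvent righDone = new ManualResetEvent(false))
        {
            // Queue both subtree walks; the pool decides how many threads run.
            ThreadPool.QueueUserWorkItem(delegate { WalkTree(tree.Left); leftDone.Set(); });
            ThreadPool.QueueUserWorkItem(delegate { WalkTree(tree.Righ); righDone.Set(); });
            // Simulate Thread.Join on the two children.
            leftDone.WaitOne();
            righDone.WaitOne();
        }
        ProcessItem(tree.Data);
    }
}
```

Be aware that blocking pool threads while they wait on their children is itself wasteful – the pool has to inject extra threads to keep making progress – which is one of the rough edges the .NET 4 work addresses.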

As an aside, there are many people (e.g. Jeffrey Richter) who advise never explicitly using threads for anything at all – not just in the context of parallelism, but under no circumstances whatsoever. Their advice is to always use the ThreadPool wherever you would have considered using a thread! Can you argue with that?