Do NOT Explicitly Use Threads for Parallel Programming

Mon, November 10, 2008, 11:20 PM under ParallelComputing
I presume you are sold on the goal of achieving fine grained parallelism – if not please read that post.

In that post I talked quite a bit about the need to partition our work into many small chunks. So how could we partition a long running work item that looks like this:
  static void Main()
  {
      Tree tr = Tree.CreateSomeTree(9, 1);
      Stopwatch sw = Stopwatch.StartNew();
      WalkTree(tr);
      Console.WriteLine("Elapsed= " + sw.ElapsedMilliseconds.ToString());
  }
  static int ProcessItem(int treeData)
  {
      // for demo purposes
      Thread.SpinWait(4000000); //TODO something real with data
      return treeData; // not being used
  }
  static void WalkTree(Tree tree)
  {
      if (tree == null) return;
      WalkTree(tree.Left);
      WalkTree(tree.Righ);
      ProcessItem(tree.Data);
  }
…assuming that Tree is defined like this: Tree.cs.

The immediate thought might be to use a thread, for example:
  static void WalkTree(Tree tree)
  {
      if (tree == null) return;
      Thread left = new Thread((o) => WalkTree(tree.Left)); left.Start();
      Thread righ = new Thread((o) => WalkTree(tree.Righ)); righ.Start();
      ProcessItem(tree.Data);
      left.Join(); righ.Join();
  }
So now the code runs faster (e.g. on a quad core machine) than the sequential approach – the result is good. But the way we achieved this was not good, for at least 3 reasons:

1. By oversubscribing the CPU like this (creating 1023 threads on a quad core machine) we are paying a hefty price for context switching. Sure, the overall performance increased, but it could have increased even more if we used the correct number of threads. The correct number of threads is, of course, equal to the number of cores on the box. For example on a quad core, 4 threads is the ideal number of threads to execute compute bound work.

2. By making the OS schedule all those threads for our process, our process is getting an unfair advantage over other apps on the same box. That is being a bad citizen on the machine.

3. The most important reason IMO that the approach above is a bad idea is memory consumption. Since CLR 2.0, every managed thread uses 1MB of committed memory. Depending on your machine configuration and environment setup, the code above won't even run and will instead throw an OutOfMemoryException. That is a clue that this approach simply does not scale. We need an approach where we increase our data load (e.g. the depth of the tree) and have our algorithm scale. This again goes back to point 1, which is that ideally the number of threads used by our process is equal to the number of cores on the machine.

So even though partitioning the problem into many small chunks was a good thing, using threads was not. The fundamental problem here is that the unit of partitioning in our user mode app is also the unit of scheduling of the OS.

So, we need some kind of user mode "engine" to schedule only as many threads as the number of cores on the machine and we need this to take place automatically for us. We also need to be able to partition the overall compute-bound operation into many work items that will get executed by the "engine". If this sounds familiar, it should: ThreadPool. The CLR thread pool has the correct internal logic and a simple interface (QueueUserWorkItem) for scheduling work items to be executed by the threads it manages. This appears to be the answer to achieving fine grained parallelism.
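To make the "fire and forget" nature of the current API concrete, here is a minimal sketch (mine, for illustration, not code from the post) of queueing a compute-bound work item with QueueUserWorkItem and hand-rolling the completion signalling yourself:

```csharp
using System;
using System.Threading;

class PoolDemo
{
    // Queue one compute-bound work item and block until it completes.
    // The pool sizes itself to the machine; we never create a thread.
    public static int RunOne()
    {
        int result = 0;
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            // Fire-and-forget: no handle comes back, so completion is
            // only observable because we signal an event ourselves.
            ThreadPool.QueueUserWorkItem(state =>
            {
                result = 21 * 2; // stand-in for real compute-bound work
                done.Set();
            });
            done.WaitOne();
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine("Cores: " + Environment.ProcessorCount);
        Console.WriteLine("Result: " + RunOne());
    }
}
```

Note how even this trivial example needs an event and a lambda just to know the work finished; that is the missing richness (waiting, cancelling, composing) discussed above.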

The problem with the current ThreadPool API is that it has almost no API. You simply throw items to it in a "fire and forget" manner. You get back no handle to the work item. No way of cancelling it, waiting on it, composing a group of items in a structured way, handling exceptions thrown concurrently or any other richer construct built on top of it. So while you should be using the ThreadPool today for parallelism, there is a considerable improvement coming in .NET 4 – I'll cover that in another blog post. In the meantime, I leave it to you dear reader to convert the tree walking sample above to directly use the ThreadPool instead of threads (hint, you'll need a ManualResetEvent or equivalent to simulate the Thread.Join semantics).

As an aside, there are many people (e.g. Jeffrey Richter) that advise to never explicitly use threads for anything at all (not just in the context of parallelism, but under no circumstances whatsoever). Their advice is to always use the ThreadPool wherever you considered using a thread! Can you argue with that?
Tuesday, November 11, 2008 2:21:00 AM (Pacific Standard Time, UTC-08:00)
Hi Daniel,

Your last statement really challenged me to write one of my very few replies to blog posts. As you know, I really appreciate your work and (this part you might not know yet) absolutely enjoyed your presentation at PDC. I love the task framework and generally agree with the statement that threads are usually a bad idea when it comes to "workers".

However ... saying that threads should *never* be used, on the other hand, is in my opinion an over-generalization, as it really depends on the kind of application. I work mainly on the server side and here it *is* very common - and necessary - to use explicit threads for background/cleanup processing and even for finer grained units of work, which might be dispatched to NUMBER_OF_CORES threads using in-memory queues. (Hey ... we still run CLR 2 outside of Microsoft ... so we've had to manually build fine-grained-parallelism on top of Threads ;-)). The ThreadPool doesn't seem to be the right place to put this kind of long running (queue-dispatching) tasks, right?

Ingo Rammer
Tuesday, November 11, 2008 2:57:00 AM (Pacific Standard Time, UTC-08:00)
I was wondering when we'd see the first "threads considered bad" posting and here it is :-) Kind of reminds me of Don Box's famous "learn to let go" statement to C++ developers when pitching GC during the early days of .NET. Really enjoying your parallel blog posts Daniel, keep 'em coming.
Tuesday, November 11, 2008 4:51:00 AM (Pacific Standard Time, UTC-08:00)
Ok I'll give it a try:

void WalkTree(Tree tree)
{
    if (tree == null) return;
    ManualResetEvent left = new ManualResetEvent(false);
    ThreadPool.QueueUserWorkItem((o) => { WalkTree(tree.Left); left.Set(); });
    ManualResetEvent right = new ManualResetEvent(false);
    ThreadPool.QueueUserWorkItem((o) => { WalkTree(tree.Right); right.Set(); });
    ProcessItem(tree.Data);
    WaitHandle.WaitAll(new WaitHandle[] { left, right });
}

But I don't like the creation of a ManualResetEvent for each leaf that is to be walked.

// Ryan
Ryan Heath
Tuesday, November 11, 2008 4:56:00 AM (Pacific Standard Time, UTC-08:00)
When the CLR itself is using the ThreadPool (i.e. ASP.NET), I think it is better to not use the ThreadPool when you have long running tasks. You will risk not being able to serve your users' requests because the pool is busy completing your long running tasks.

// Ryan
Ryan Heath
Tuesday, November 11, 2008 8:37:00 AM (Pacific Standard Time, UTC-08:00)
I'm on Jeffrey Richter's side myself. Perhaps it makes sense to create a couple of threads at startup for tasks that will always run, but even then that code is probably better written using some sort of timer callback. The thing that gets me is that so many Windows apps use a ton of threads. IE has at least 4 per tab; Outlook usually has 30 or 40.

Your issue about CPU bound tasks starving out serving client requests is a very interesting one and you make an excellent point. One option (which makes things trickier, but could work quite well) is to break the operation into pieces and thus spread it across several work items. This will allow new requests to be scheduled on the thread pool. Of course you have to use some type of continuation passing mechanism that muddies the code. Even though it's not usually recommended on servers, ParallelFX might make that a lot easier actually.
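To illustrate the continuation-passing idea (a toy sketch of mine, not production code): each work item processes one chunk and then queues the next chunk as a fresh work item, carrying the partial result along, so other requests can interleave on the pool between chunks:

```csharp
using System;
using System.Threading;

class ChunkedWork
{
    // Process element i, then re-queue ourselves for element i + 1.
    // The accumulated result threads through as an argument, and the
    // final answer is delivered via the continuation delegate.
    static void Step(int i, int total, int acc, Action<int> continuation)
    {
        if (i == total) { continuation(acc); return; }
        ThreadPool.QueueUserWorkItem(_ => Step(i + 1, total, acc + i, continuation));
    }

    // Kick off the chunked computation and wait for the continuation.
    public static int Run(int total)
    {
        int result = 0;
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            Step(0, total, 0, sum => { result = sum; done.Set(); });
            done.WaitOne();
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine(ChunkedWork.Run(10)); // sums 0..9
    }
}
```

You can see how the control flow gets muddied: what was a simple loop is now a recursive chain of callbacks, which is exactly the trade-off mentioned above.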
Tuesday, November 11, 2008 9:36:00 AM (Pacific Standard Time, UTC-08:00)
If you use C/C++ or Fortran, OpenMP is the way to go... all you do is use #pragmas to tell the compiler which loops can be parallelized, and it will do so. The level of parallelization can be controlled with an environment variable. That takes a lot of manual work away from the programmer, and totally follows your advice of not explicitly using threads for parallel programming.
Tuesday, November 11, 2008 2:41:54 PM (Pacific Standard Time, UTC-08:00)
Andreas: Yes, there are many programming models for achieving parallelism and I do not believe there is one to rule them all, at this stage. Right now, I am fascinated with task and data parallelism and will touch on others (e.g. message passing) in the distant future. The only thing I do believe in is that people should stay away from using threads explicitly for parallelism and should instead choose a higher level programming model. The vision is that all of these models would eventually use the same resource manager (on Windows), so they would not compete with each other for the hardware threads. BTW, if you like OpenMP and you live in the native land, you should check out the Parallel Pattern Library (PPL) that ships with the next version of C++ in Visual Studio 2010.

Tom: Thanks, yeah, we are trying to raise the level of abstraction to other things (Tasks with a new improved thread pool engine under the covers) and that is what I am trying to lead to with this post. Stay tuned.

Ryan, Brad, Ingo: Note that the last comment is not one I personally endorsed and it was meant to evoke emotional responses of which I received quite a few in my inbox ;-) I will make no comment about it here, but will come back to that generalization in a future blog post.

Ingo: Also see my comment above. Back to the title of my blog post which I tried to elaborate throughout the blog post: The point is that for fine grained parallelism (I make it very clear at the top that buying into that is a prerequisite for the post), threads should not be used. If you are using threads but are somehow throttling the number you use (e.g. NUMBER_OF_CORES) to achieve fine grained parallelism then effectively you are writing your own mini thread pool, right? Sure, writing our own user mode scheduler is one of the few places where threads are the only thing you can use – no disagreement there.

All (inc. the people that sent me email): If fine grained parallelism is not what you are doing (instead you have long running threads or are a server side ASP.NET app or performing I/O operations instead of compute bound work etc) then this post is probably not relevant to you. Do not ignore the "compute bound" references in the blog post and do not ignore the "parallel" in the title. That term is used on this blog conforming to my own definition of parallelism.
Tuesday, November 11, 2008 2:47:00 PM (Pacific Standard Time, UTC-08:00)
If your threads do mostly computation then you may be right. However, if the threads do a lot of I/O, then using just 4 threads (in your example) is probably not the optimal number.

Also, the problem with the thread pool is when you schedule a lot of work items (more than the max threads in the thread pool): when you do that you get sequential behavior, and many times that is not the desired/optimal approach.
Tuesday, November 11, 2008 3:03:52 PM (Pacific Standard Time, UTC-08:00)
Arnon: Please read my replies to the other comments above. We are talking here about compute bound work. We are not talking about I/O. In hindsight, maybe I should have spelled that out even clearer so thank you for pointing it out.
Even for I/O, the thread pool (and any decent user mode scheduler) will inject additional threads as needed (one more reason to use one). Your comment on max number of threads is interesting. I would say that if you wish under normal circumstances to create more than 250 threads per CPU you have more serious issues in your app but even if you do, the maximum is configurable. Thanks for your comment.
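For reference, here is a quick sketch of the configuration knob mentioned above (GetMaxThreads/SetMaxThreads are the actual CLR thread pool APIs; the program itself is just an illustration):

```csharp
using System;
using System.Threading;

class PoolConfig
{
    // Read the current worker-thread ceiling and try to raise it.
    // Returns whether the change was accepted by the pool.
    public static bool RaiseCeiling(int extra)
    {
        int worker, io;
        ThreadPool.GetMaxThreads(out worker, out io);

        // Raising the ceiling succeeds; asking for fewer threads than
        // the number of cores would be rejected (SetMaxThreads returns false).
        return ThreadPool.SetMaxThreads(worker + extra, io);
    }

    static void Main()
    {
        int worker, io;
        ThreadPool.GetMaxThreads(out worker, out io);
        Console.WriteLine("Default max worker threads: " + worker);
        Console.WriteLine("Raised: " + RaiseCeiling(100));
    }
}
```

That said, as argued above, reaching for this knob under normal circumstances is usually a sign of a deeper design issue.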
Tuesday, November 11, 2008 11:00:00 PM (Pacific Standard Time, UTC-08:00)
I have just finished looking at (one of?) your session(s) at pdc2008.
Now I noticed righ was actually not a typo :)

The talk was very informative and entertaining, which is a good thing. I liked it!

// Ryan
Ryan Heath
Thursday, November 13, 2008 1:59:00 PM (Pacific Standard Time, UTC-08:00)
So, did Ryan get it right? Curious minds want to know :-)
Dan F
Friday, November 14, 2008 7:50:46 AM (Pacific Standard Time, UTC-08:00)
Ryan: Thanks - now you have seen where these blog posts are going :)

Dan: Yes Ryan's code would work. There are many ways you can achieve the goal, e.g.:
public static void WalkTree(Tree tree)
{
    if (tree == null) return;
    ManualResetEvent mre = new ManualResetEvent(false);

    int count = 2;
    ThreadPool.QueueUserWorkItem((o) =>
    {
        WalkTree(tree.Left);
        if (Interlocked.Decrement(ref count) == 0) mre.Set();
    });
    ThreadPool.QueueUserWorkItem((o) =>
    {
        WalkTree(tree.Righ);
        if (Interlocked.Decrement(ref count) == 0) mre.Set();
    });
    ProcessItem(tree.Data);
    mre.WaitOne();
}
Sunday, November 16, 2008 2:06:00 PM (Pacific Standard Time, UTC-08:00)
Hi Daniel,

I can see at least three scenarios for using 'raw' threads vs those from the pool (some already mentioned above).

1. long running tasks
2. need for STA
3. need for priority or identity

More interesting thing that pushed me to write a reply, is your

"...if you wish under normal circumstances to create more than 250 threads per CPU you have more serious issues in your app..."

When I faced that issue some time ago, the ThreadPool still had those 25 threads, and the explanation I remember was that overall perf starts to decline if you create more than 25 threads, because of context switching.

Now I cannot find any _real_ proof for this, and I wonder if you have anything to share: why was it 25?

Otherwise, perhaps I'd be better off writing some code and testing... :)

Monday, December 22, 2008 4:18:00 AM (Pacific Standard Time, UTC-08:00)
Andrew: You can change the priority of a threadpool thread in your callback method and reset it to what it was on exit – in general, though, that is a good point. An even better point is the STA example. Long running Tasks are an option that we will offer, so you won't need to directly use a thread for that.
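The priority pattern I have in mind looks roughly like this (a sketch for illustration, not tested production guidance):

```csharp
using System;
using System.Threading;

class PriorityDemo
{
    // Run a lowered-priority callback on a pool thread and report the
    // thread's priority after the finally block has restored it.
    public static ThreadPriority RunAndReport()
    {
        ThreadPriority after = ThreadPriority.Lowest;
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                ThreadPriority original = Thread.CurrentThread.Priority;
                try
                {
                    // Lower the priority for this callback only...
                    Thread.CurrentThread.Priority = ThreadPriority.BelowNormal;
                    Thread.SpinWait(100000); // stand-in for the real work
                }
                finally
                {
                    // ...and restore it before the thread returns to the pool,
                    // so later work items are not affected.
                    Thread.CurrentThread.Priority = original;
                }
                after = Thread.CurrentThread.Priority;
                done.Set();
            });
            done.WaitOne();
        }
        return after;
    }

    static void Main()
    {
        Console.WriteLine("Priority after callback: " + RunAndReport());
    }
}
```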