New and Improved CLR 4 Thread Pool Engine

Mon, November 24, 2008, 12:26 AM under ParallelComputing
A way to think of the CLR thread pool and how it gets used, e.g. when you call ThreadPool.QueueUserWorkItem, is to picture a global queue where work items (essentially delegates) get queued and multiple threads pick them out in First In First Out order. The FIFO order is not something that is documented or guaranteed, but my personal guess is that too many applications rely on it, so I don't see it changing any time soon.

The image on the left shows the main program thread as it is creating a work item; the second image shows the global thread pool queue after code has queued 3 work items; the third image shows 2 threads from the thread pool that have grabbed 2 work items and are executing them. If in the context of those work items (i.e. from the executing code in the delegates) more work items get created for the CLR thread pool, they end up on the global queue (see image on the right) and life goes on.

CLR Thread Pool v4 from a System.Threading.Tasks perspective
In CLR 4, the thread pool engine has had some improvements made to it (it has been receiving positive tweaks in every release of the CLR) and among these improvements are some performance gains that are achievable when using the new System.Threading.Tasks.Task type. I'll show a code example in another post, but you can think of creating and starting a Task (passing it a delegate) as the equivalent of calling QueueUserWorkItem on the ThreadPool. A way to visualize the CLR thread pool when used via the Task-based API is that, in addition to the single global queue, each thread in the thread pool has its own local queue:

Just as with normal thread pool usage, the main program thread may create Tasks that will get queued on the global queue (e.g. Task1 and Task2) and threads will grab those Tasks typically in a FIFO manner. Where things diverge is that any new Tasks (e.g. Task3) created in the context of the executing Task (e.g. Task2) end up on a local queue for that thread pool thread.
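To make that equivalence concrete, here is a rough sketch contrasting the two ways of queuing the same work. This is illustrative only and uses the Task API shape as it eventually shipped, so treat the exact member names as indicative rather than definitive:

  using System;
  using System.Threading;
  using System.Threading.Tasks;

  class Demo
  {
      static void DoWork() { /* some compute bound work */ }

      static void Main()
      {
          // Classic thread pool usage: fire and forget onto the global queue.
          ThreadPool.QueueUserWorkItem(state => DoWork());

          // Task-based equivalent: the delegate still runs on a thread pool
          // thread, but now we get back a handle (t) to the work item.
          Task t = new Task(() => DoWork());
          t.Start();

          t.Wait(); // waiting is something QueueUserWorkItem offers no direct way to do
          Console.ReadLine();
      }
  }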

Why Local Queues
With the era of manycore machines upon us and devs taking advantage of parallelism, the number of threads in the thread pool is likely to increase: at a minimum equal to the number of cores for compute bound operations, and likely more due to injection of additional threads as a result of IO bound operations or blocking calls stalling the CPU. Bottom line: more cores = more threads.

With more threads competing for work items, it is not optimal to put up with the contention issues of a single queue that all of them are trying to access safely. This would be amplified by the goal of fine grained parallelism where each work item finishes fairly quickly, so the trips to the global queue would be frequent.

It is for this reason we introduce a local queue per thread, where additional tasks get queued (with no contention) and will then be retrieved by that same thread (again with no contention).

LIFO
So, picking up from the picture further up, let's assume that Task2 additionally creates two more Tasks e.g. Task4 and Task5.

The tasks end up on the local queue as expected, but which Task does the thread pick to execute when it completes its current task (i.e. Task2)? The initially surprising answer is that it could be Task5, which is the last one that was queued – in other words a LIFO algorithm can be used for the local queues.

The reason that this is a good thing is locality. In most scenarios the data required by the last created Task in the queue is still hot in the cache, so it makes sense to pull that one out and execute it. Obviously, this means there are no promises on ordering, but some level of fairness is relinquished for the sake of better performance.

Work Stealing
If the only enhancement were the introduction of LIFO local queues, performance would already be greatly increased, but you should be concerned. You should be concerned about what happens when another thread in the thread pool (likely executing on another core) finishes its own work. Luckily, you don't have to be:

The other worker thread completes Task1 and then goes to its local queue and finds it empty; it then goes to the global queue and finds it empty. We don't want it sitting there idle, so a beautiful thing happens: work stealing. The thread goes to the local queue of another thread, "steals" a Task and executes it! That way we keep all our cores busy, which contributes to our load balancing goal for fine grained parallelism. In the image above, notice that "stealing" happens in a FIFO manner, which again is good for locality reasons (the stolen Task is the oldest one in that queue, so its data is likely to be cold in the original thread's cache anyway). Furthermore, in many divide and conquer scenarios, tasks generated earlier on are likely to generate more work themselves (e.g. Task6), which would now end up on this other thread's local queue and hence reduce the need for frequent stealing.
Nitpicking note: further up, I mentioned "no contention"; clearly work stealing is the exception to that rule.
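To illustrate the scheme (and only to illustrate it; this toy sketch is not the actual CLR implementation, which uses far more sophisticated lock-free structures), here is a per-thread queue where the owner pushes and pops at one end, giving LIFO, and other threads steal from the opposite end, giving FIFO:

  using System;
  using System.Collections.Generic;

  // Toy sketch of a per-thread work-stealing queue. The owner thread pushes
  // and pops at the tail (LIFO, data still hot in its cache); other threads
  // steal from the head (FIFO, data likely cold in the owner's cache anyway).
  class ToyWorkStealingQueue
  {
      private readonly LinkedList<Action> items = new LinkedList<Action>();
      private readonly object syncLock = new object();

      public void LocalPush(Action work)
      {
          lock (syncLock) { items.AddLast(work); }
      }

      public bool LocalPop(out Action work)
      {
          lock (syncLock)
          {
              if (items.Count == 0) { work = null; return false; }
              work = items.Last.Value;
              items.RemoveLast();
              return true;
          }
      }

      public bool TrySteal(out Action work)
      {
          lock (syncLock)
          {
              if (items.Count == 0) { work = null; return false; }
              work = items.First.Value;
              items.RemoveFirst();
              return true;
          }
      }
  }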

What's Next
In addition to the links sprinkled above, you can find a simple implementation of a work stealing threadpool on Joe's blog post, and some of the concepts above are touched on in this MSDN mag article by Eric and Erika. If you want to see the pictures above in a fully animated slide, get my PDC deck (slide 8) from this page. Do note that the implementation I am talking about above did not ship with the September CTP of Visual Studio 2010, but is slated for the next drop. Also note that all internal implementation details are subject to change and are just shared here for the geek factor ;-)

So you can see how using the Task-based API will yield cool perf gains under the covers. An additional reason to use the new Task-based API is its richness. I'll touch on that in a future blog post.

Debugging Parallel Applications with VS2010

Wed, November 19, 2008, 03:48 PM under ParallelComputing
I have mentioned previously on this blog the two new debugger toolwindows our team is adding to Visual Studio 2010: Parallel Tasks and Parallel Stacks. I will be blogging a lot more about these, so for now I encourage you to get a high level understanding by watching my 10-minute screencast. All feedback welcome.

Taking Parallelism Mainstream

Tue, November 18, 2008, 12:58 PM under Links
At PDC2008, attendees received USB hard disks with tons of software and information. One of the items was a technical whitepaper that my org published; you can open that XPS document directly here. For more, I have a permalink to our Parallel Computing Dev Center on the left pane of my blog.

How Many Cores does Windows Support?

Mon, November 17, 2008, 12:38 PM under Windows
At external and internal parallel programming sessions, one of my slides has a screenshot of the Task Manager window showing 64 cores (!). There is always someone in the audience who didn't know Windows can support that many. To see the screenshot for yourself, visit MarkRuss's blog post.

Even if you knew that, maybe you didn’t know that in the next version we are supporting 4 times as many:
"In Windows Server 2008 R2, we have built in support for up to 256 logical processors, which will allow our customers to more fully exploit today’s powerful CPUs,"
To get more info on that and the other Windows 7 enhancements for the manycore era (e.g. UMS), watch this interview with Mark Russinovich.

TIP: Task Manager Shortcuts

Sun, November 16, 2008, 01:10 AM under AboutPresenting
In my demos more often than not I bring up task manager (e.g. so I can show the CPU utilization). If you do the same, here are 4 tips.

1. Ctrl+Shift+ESC brings it up (no need to right click on the taskbar and select the Task Manager menu item).

2. On the Performance (or Networking) tab, double click anywhere to maximize that area (as per the screenshot on left). One of the advantages is that now the window can be resized to be much smaller than what it can be otherwise; plus you can focus more clearly without the other clutter. Double click again to restore or click and drag to move the window around.

3. If you want to point out something interesting in the "CPU Usage" and "CPU Usage History", just hold down the Ctrl key to PAUSE the dynamic updating of the graphs. Release the Ctrl key to resume the refreshing.

4. Change the refresh speed to High (View menu, Update Speed).

Parallel Programming Session from Tech Ed EMEA 2008

Sat, November 15, 2008, 04:02 PM under Events

Last week in Barcelona, for Tech Ed EMEA 2008, I repeated the PDC08 session that I gave in LA. The European recording also appears to be available online on the home page for your viewing pleasure (the direct link is here).

Fishbowl Video Interview on Parallelism at Tech Ed Online

Sat, November 15, 2008, 07:00 AM under Links

Mark Dunn asked me some questions on parallel computing during a 15 minute interview at the fishbowl in Barcelona. Find it on this web page (or watch it directly in your media player).

Podcast with NxtGenUG folks from PDC08

Sat, November 15, 2008, 05:37 AM under Links
You may (or may not) wish to listen in on a 5 minute random chat (after a few beers at the Ask The Experts) I had with Chris Hay and John McLoughlin (NxtGenUG boys) on their PDC podcast (38:45 – 43:25).

Do NOT Explicitly Use Threads for Parallel Programming

Mon, November 10, 2008, 11:20 PM under ParallelComputing
I presume you are sold on the goal of achieving fine grained parallelism – if not please read that post.

In that post I talked quite a bit about the need to partition our work into many small chunks. So how could we partition a long running work item that looks like this:
  static void Main()
  {
      Tree tr = Tree.CreateSomeTree(9, 1);
      Stopwatch sw = Stopwatch.StartNew();
      WalkTree(tr);
      Console.WriteLine("Elapsed= " + sw.ElapsedMilliseconds.ToString());
      Console.ReadLine();
  }

  static int ProcessItem(int treeData)
  {
      // for demo purposes
      Thread.SpinWait(4000000); //TODO something real with data
      return treeData; // not being used
  }

  static void WalkTree(Tree tree)
  {
      if (tree == null) return;
      WalkTree(tree.Left);
      WalkTree(tree.Righ);
      ProcessItem(tree.Data);
  }
…assuming that Tree is defined like this: Tree.cs.

The immediate thought might be to use a thread, for example:
  static void WalkTree(Tree tree)
  {
      if (tree == null) return;
      Thread left = new Thread((o) => WalkTree(tree.Left));
      left.Start();
      Thread righ = new Thread((o) => WalkTree(tree.Righ));
      righ.Start();
      left.Join();
      righ.Join();
      ProcessItem(tree.Data);
  }
So now the code runs faster (e.g. on a quad core machine) than the sequential approach – the result is good. But the way we achieved this was not good, for at least 3 reasons:

1. By oversubscribing the CPU like this (creating 1023 threads on a quad core machine) we are paying a hefty price for context switching. Sure, the overall performance increased, but it could have increased even more if we used the correct number of threads. The correct number of threads is, of course, equal to the number of cores on the box. For example on a quad core, 4 threads is the ideal number of threads to execute compute bound work.

2. By making the OS schedule all those threads for our process, our process is getting an unfair advantage over other apps on the same box. That is being a bad citizen on the machine.

3. The most important reason IMO that the approach above is a bad idea is memory consumption. Since CLR 2.0, every managed thread uses 1MB of committed memory. Depending on your machine configuration and environment setup, the code above won't even run and will instead throw an OutOfMemoryException. That is a clue that this approach simply does not scale. We need an approach where we can increase our data load (e.g. the depth of the tree) and have our algorithm scale. This again goes back to point 1, which is that ideally the number of threads used by our process is equal to the number of cores on the machine.

So even though partitioning the problem into many small chunks was a good thing, using threads was not. The fundamental problem here is that the unit of partitioning in our user mode app is also the unit of scheduling of the OS.

So, we need some kind of user mode "engine" to schedule only as many threads as the number of cores on the machine and we need this to take place automatically for us. We also need to be able to partition the overall compute-bound operation into many work items that will get executed by the "engine". If this sounds familiar, it should: ThreadPool. The CLR thread pool has the correct internal logic and a simple interface (QueueUserWorkItem) for scheduling work items to be executed by the threads it manages. This appears to be the answer to achieving fine grained parallelism.

The problem with the current ThreadPool API is that it has almost no API. You simply throw items to it in a "fire and forget" manner. You get back no handle to the work item. No way of cancelling it, waiting on it, composing a group of items in a structured way, handling exceptions thrown concurrently or any other richer construct built on top of it. So while you should be using the ThreadPool today for parallelism, there is a considerable improvement coming in .NET 4 – I'll cover that in another blog post. In the meantime, I leave it to you dear reader to convert the tree walking sample above to directly use the ThreadPool instead of threads (hint, you'll need a ManualResetEvent or equivalent to simulate the Thread.Join semantics).
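For reference, here is one shape that conversion could take (a sketch only, not necessarily the best approach; also note that blocking pool threads while they wait for their children is itself wasteful, which is one more reason the richer Task-based API is welcome):

  // One possible conversion of WalkTree to use the ThreadPool (sketch only).
  // Assumes the same Tree type and ProcessItem method as above.
  static void WalkTree(Tree tree)
  {
      if (tree == null) return;
      ManualResetEvent leftDone = new ManualResetEvent(false);
      ManualResetEvent righDone = new ManualResetEvent(false);
      ThreadPool.QueueUserWorkItem(o => { WalkTree(tree.Left); leftDone.Set(); });
      ThreadPool.QueueUserWorkItem(o => { WalkTree(tree.Righ); righDone.Set(); });
      // Simulate the Thread.Join semantics of the version that used threads.
      leftDone.WaitOne();
      righDone.WaitOne();
      ProcessItem(tree.Data);
  }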

As an aside, there are many people (e.g. Jeffrey Richter) who advise never explicitly using threads for anything at all (not just in the context of parallelism, but under no circumstances whatsoever). Their advice is to always use the ThreadPool wherever you would have considered using a thread! Can you argue with that?

Parallel Programming and the Virtual PC limitation

Fri, November 7, 2008, 11:24 PM under ParallelComputing
The question of "How do I try/show the parallel bits in VS2010 outside of the VPC?" or variants of that keeps appearing in my inbox. That is because of Virtual PC's limitation to only run one core, which kind of sucks for parallel programming.

Here is the story:
1. The VS2010/.NET 4.0 CTP is only available as a VPC. Sorry, no directly installable bits that I can share. You can still use the VPC to get a feel for the APIs and tools, though.
2. Parallel Extensions are now part of mscorlib.dll with dependencies on the CLR 4.0, so, no, you cannot copy bits from the VPC and use them with VS2008.
3. You can, however, run the virtual image on Hyper-V and then you have access to all the cores on your box. Instructions for that are here (thanks Grant for incorporating my comment into the blog post).
4. If that is not an option for you either, then your last resort is to use the old Parallel Extensions June CTP that works with VS2008. Of course, via that route, you cannot play with the new debugger bits.

In any case, have fun!

Fine Grained Parallelism

Thu, November 6, 2008, 10:32 PM under ParallelComputing
When trying to take advantage of multi-core machines, the first step is to identify the areas in your code that can literally run at the same time. Unless you are really lucky, what this step actually means is that you will have to partition your compute bound work into smaller chunks that can run in parallel.

For example, if you are walking a binary tree data structure (typically in a recursive manner) and there is some compute intensive operation carried out on each node, maybe you can partition the work in such a way that the left side of the tree gets processed at the same time as the right side. You have partitioned your work into 2 chunks. So, on a dual core machine, you should see a performance gain. What performance gain would you further see on a quad core or 24-way machine? None beyond what you saw on the dual core.

Our goal with parallel programming is to write once and have our code scale well as the hardware underneath it gets better, i.e. see incremental benefits when running our app on machines with more cores without changing the code. For this to be achieved, we need to partition our work into multiple chunks in such a way that, ideally: # chunks >= # cores of the highest-end machine we expect our app to run on.

The impact of the message in the previous paragraph is that, if we revisit the tree example, maybe we should process each node at the same time. So, if the tree has 500 nodes, our code would scale well up to a 500 core machine. An additional reason this would be a good thing is that if, as is typical, our problem size increases over time (i.e. the number of nodes in our example) our algorithm would also scale to take advantage of even more cores.

There is yet another reason that we need fine grained partitioning. Let's say that we partition our work into 8 work items and we know that the code will always execute on an 8-way machine, so we feel good about our approach. What if those 8 work items do not take the same time to complete? Let's say that 1 of them completes fairly quickly and the remainder take longer. Now we have a core sitting idle instead of helping with the overall work. Had we partitioned the work into much smaller chunks, when a core finished with the item it was executing, it would be able to start working on another chunk, hence achieving the desired load balancing.

Hopefully you agree that fine grained parallelism is one of the ways forward in the manycore era. Next I'll show a code example of how we partition the work and then schedule it on the cores.

Threading/Concurrency vs Parallelism

Mon, November 3, 2008, 02:24 AM under ParallelComputing
To take advantage of multiple cores from our software, ultimately threads have to be used. Because of this fact, some developers fall into the trap of equating multithreading to parallelism. That is not accurate.

You can have multithreading on a single core machine, but you can only have parallelism on a multi core machine (or multi proc, but I treat them the same). The quick test: If on a single core machine you are using threads and it makes perfect sense for your scenario, then you are not "doing parallelism", you are just doing multithreading. If that same code runs on a multi core machine, any overall speedups that you may observe are accidental – you did not "think parallelism".

The mainstream devs that I know claim they are comfortable with multithreading, and when you drill into what scenarios they are enabling when using threads, 2 patterns emerge. The first has to do with keeping the UI responsive (and the UI thread affinity issue): spin up a thread to carry out some work and then marshal the results back to the UI (and in advanced scenarios, communicate progress events and also offer cancellation). The second has to do with I/O of one form or another: call a web service on one thread and, while waiting for the results, do some other work on another; or carry out some file operation asynchronously and continue doing work on the initiating thread. When the results from the (disc/network/etc) IO operation are available, some synchronization takes place to merge the data from the executor to the requestor. The .NET framework has very good support for all of the above, for example with things like the BackgroundWorker and the APM pattern (BeginXYZ/EndXYZ) or even the event-based asynchronous pattern. Ultimately, it all comes down to using the ThreadPool directly or indirectly.
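As a tiny illustration of the second pattern (hiding I/O latency), here is a sketch using BackgroundWorker in a console app; SimulateSlowIO is just a hypothetical stand-in for a real web service or file call:

  using System;
  using System.ComponentModel;
  using System.Threading;

  class LatencyHidingDemo
  {
      static string SimulateSlowIO()
      {
          Thread.Sleep(2000); // pretend this is a network or disc operation
          return "data";
      }

      static void Main()
      {
          // Kick off the "I/O" on a BackgroundWorker (which uses the
          // ThreadPool under the covers) and keep this thread free.
          var worker = new BackgroundWorker();
          worker.DoWork += (s, e) => e.Result = SimulateSlowIO();
          worker.RunWorkerCompleted += (s, e) => Console.WriteLine("Got: " + e.Result);
          worker.RunWorkerAsync();

          Console.WriteLine("Initiating thread is free to do other work...");
          Console.ReadLine(); // keep the process alive until the result arrives
      }
  }

Note that this is concurrency that hides latency; running it on a multi core machine does not make it parallelism.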

The previous paragraph summarized the opportunities (and touched on the challenges) of leveraging concurrency on a single core machine (like most of us do today). The end user goal is to improve the perceived performance or to perhaps improve the overall performance by hiding latency. All of the above is applicable on multi-core machines too, but it is not parallelism. On a multi-core machine there is an additional opportunity to improve the actual performance of your compute bound operations, by bringing parallel programming into the picture. (More on this in another post).

Another way of putting it is that on a single core you can use threads and you can have concurrency, but to achieve parallelism on a multi-core box you have to identify in your code the exploitable concurrency: the portions of your code that can truly run at the same time.