Parallelising Loops in .NET 4

Wed, January 7, 2009, 05:58 AM under ParallelComputing
Often the source of performance issues in our code is loops, e.g. while, for, foreach. With .NET 4 it becomes easy to make such code perform better by taking advantage of multiple cores.

Parallel.ForEach
For example, given:
IEnumerable<string> arr = ...
foreach (string item in arr){
// Do something
}
, we can parallelise it is as follows:
Parallel.ForEach<string>(arr, delegate (string item){
// Do something
});
, or the tidier directly equivalent version (dropping the superfluous generic which can be inferred and turning the anonymous method syntax into a lambda statement)
Parallel.ForEach (arr, (string item) =>{
// Do something
});

Visual Distinctions
Notice the obvious visual similarities that make it almost automatic to parallelise a loop: the only difference in the parallel version is the modification in the first line (rearranging the "arr" and "string item", which are the real pieces of information) and the fact that after the closing brace at the end there is a closing parenthesis and semicolon. The crucial visual observation here is that the body of the loop remains intact.

Why Does It Work
Let's drill into why the modified code compiles and why it is equivalent in intent (even if it is obvious to some). We turned a block of code into a library method call. The .NET 4 (mscorlib) library offers the static class Parallel that (among others) offer the ForEach method. One of its overloads (its simplest) accepts 2 parameters: a source IEnumerable of TSource and a body of code (in the form of the Action of TSource delegate, of course) that accepts a single parameter which is also of TSource, of course. The method will take the body and call it once for each element in the source. If you reread the last 2 sentences you'll find that is exactly what the original loop construct does as well. The real difference here is that the original runs serially (using only a single core) while the modified runs in parallel (using, by default, all cores).

How Does It Work
Those of you that don't like black magic boxes will ask: what does that method actually do inside in order to run things in parallel? My answer: what do you think it needs to do? Remember our friendly task-based programming model? The implementation of the methods of the static Parallel class uses Tasks (and specifically SelfReplicating tasks). Given that the implementation is subject to change and more importantly given that we have not shipped .NET 4 yet, I will not go into exactly how it uses the Tasks, but I leave that to your imagination (or to your decompiler-assisted exploration ;)).

Trivial Demo Example
In a .NET 4 Console project paste the following in the Program.cs file:
  static string[] arr = Directory.GetFiles(@"C:\Users\Public\Pictures\Sample Pictures", "*.jpg");
static void SimulateProcessing() {
Thread.SpinWait(100000000);
}
static string TID {
get {
return " TID = " + Thread.CurrentThread. ManagedThreadId.ToString();
}
}
Now in the empty Main function paste the following:
    foreach (string ip in arr) {
Program.SimulateProcessing();
Console.WriteLine(ip + TID);
}
Console.ReadLine();
Run it and notice how long it takes as well as that only one thread gets used of course and in Task Manager notice the CPU usage. Now change the loop construct so they are as follows:
Parallel.ForEach(arr, (string ip) => {
Program.SimulateProcessing();
Console.WriteLine(ip + TID);
});
Re-run it and notice how much faster it runs and how the number of threads equals the number of cores on your machine and in Task Manager the CPU usage being at 100%.

Why Not Do It Automatically
Many that see this technology ask "Why not automatically change all loops to run parallelised?". The answer is that you cannot blindly apply a Parallel.ForEach wherever you have a foreach loop. If the body of the loop depends on some shared state, or if each loop iteration is not independent of every other iteration, then race conditions may arise by blindly parallelising. Ultimately, it is multiple threads that execute the body in parallel so there is no room for shared state etc. The static methods of the Parallel class have no magic to deal with that – it is still down to you. If you find yourself needing synchronization in the loop body, be sure to measure the performance because locks and such in a parallelisation scenario potentially negate (or severely limit) the benefits of parallelisation. It is for these reasons that parallelising a loop is an opt-in decision today that only you can make for your code.

A related question arises of why not embedding this functionality in the language (the obvious suggestion being introducing a pfor loop construct). The answer is that having it as a library offering instead of built-in to the language allows it to be used by *all* .NET languages instead of restricting it to a few. Also, once something is embedded into the language it typically stays there forever so we take great care about such decisions e.g. it is too early to tie C# or VB to the System.Threading.Tasks namespace.

For the distant imaginary future, we are thinking about automatically parallelising bits of code if we can (with hints from the application developer with regards to purity and side-effect-free regions of code) and also embedding parallel constructs deeper into the language. Too early to know if anything will come of that...

Can It Do More
Yes! We only saw above one of the overloads of one of the methods. Parallel.ForEach has ~20 other overloads, some of them taking up to 5 arguments and all of them having a return type too; what I am trying to say is that there is much more flexibility and richness even in this simple API. I encourage you to explore the other overloads and also the other two methods on the Parallel class: For and Invoke.
Thursday, January 08, 2009 5:44:00 AM (Pacific Standard Time, UTC-08:00)
I blogged more than a year ago about this and provided IMHO a better non-trivial example. Not that there's anything wrong with your post...

My post is called "Trivially paralellize your .NET CPU bound code" and can be found here:
http://drazen.dotlic.name/weblog/archive/2007/12/06/962.aspx
Thursday, January 08, 2009 6:35:13 AM (Pacific Standard Time, UTC-08:00)
Drazen: Thanks for pointing us to your post. I also blogged more than a year ago (http://www.danielmoth.com/Blog/2007/11/parallel-extensions.html ;-) about this and actually recorded videos showing it all in action. This post is part of a larger picture I am painting... stay tuned! Thanks again for your comment.
Friday, January 09, 2009 1:25:00 PM (Pacific Standard Time, UTC-08:00)
The way you write the ForEach loop can be simplified a little bit more thanks to type inference:

Parallel.ForEach (arr, (string item) =>{
// Do something
});


can become

Parallel.ForEach (arr, item =>
{
  // Do something
});


In a similar way, I've experimented the Parallel.For loop. It lets you write things like this:

Parallel.For(0, height, y =>
{
  for (int x = 0; x < width; x++)
  {
    ProcessPixel(x, y);
  }
}


And yes I love the fact that
Parallel.For(0, height, y =>
is so similar to
for (int y=0; y<height, i++)

I couldn't wait for .NET 4.0 and started using it with the June CTP of the System.Threading dll!
Sunday, January 11, 2009 10:04:31 AM (Pacific Standard Time, UTC-08:00)
odalet: Yes, type inference is one of the great benefits of lambdas esp. when combined with anonymous types (I touch on that halfway through my "Decomposing LINQ video": http://channel9.msdn.com/posts/DanielMoth/LINQs-relationship-to-the-new-C3-and-VB9-features/). Personally, I prefer being explicit unless there is reason not to. Glad you are enjoying the June CTP, just bear in mind that the API has changed since then and that there will never be a release for .NET 3.5. Have fun!
Friday, February 13, 2009 4:33:48 PM (Pacific Standard Time, UTC-08:00)
This library and the way you can use it is really nice and simplifies the code to run in parallel for a problem that is "embarassingly" parallel where we have a ton of data to crunch and each crunching operation is independant from the other ones.

Is there a way to control how many threads get spawned or is the library always trying to run n threads if you have n cores?

I can imagine a case where I would want to be able to control this behaviour in my code:

If I run on a server, I probably don't want to use all the available CPU cycles to run my stuff (well, maybe I do, maybe I don't) as I want to leave enough resources for the server to be able to process concurrent requests from other users.

Thanks.
Patrice
Friday, February 13, 2009 4:37:55 PM (Pacific Standard Time, UTC-08:00)
Patrice: Yes, there is a way to restrict the number of cores to be used by the Parallel.For. Like I mention, there are around 20 overloads to the basic method here, so there is a lot of tweaking you can perform.
Wednesday, November 25, 2009 1:22:11 AM (Pacific Standard Time, UTC-08:00)
are there experts out there who could parallelize this VB.NET function ?

Public Function RandomArray(ByVal iSamples As Integer, ByVal dMu As Double, ByVal dSigma As Double) As Double()
Dim oRnd As New Random
Dim dRandomVariate As Double
Dim dSample As Double
Dim lstSamples As New Generic.List(Of Double)

For i As Integer = 0 To iSamples - 1
dRandomVariate = Math.Sqrt(-2 * Math.Log(oRnd.NextDouble)) * Math.Sin(2 * Math.PI * oRnd.NextDouble)
dSample = Math.Exp(dMu + dSigma * dRandomVariate)

lstSamples.Add(dSample)
Next

Return lstSamples.ToArray
End Function
Thursday, December 03, 2009 7:31:32 PM (Pacific Standard Time, UTC-08:00)
John: The best place for questions not directly relevant to my blog posts are the forums. Please pick the appropriate one here:
http://social.msdn.microsoft.com/Forums/en-US/category/parallelcomputing
Comments are closed.