PLINQ

Sun, January 25, 2009, 06:59 PM under ParallelComputing | LINQ
With VS2008 (more specifically .NET Framework 3.5) a wonderful thing was introduced: LINQ. Given the declarative nature of Language Integrated Query it was a prime candidate for trying to inject automatic parallelization in it (i.e. to run faster by seamlessly taking advantage of multiple cores). The result of those efforts is what I mentioned 18 months ago (Parallel LINQ) and followed with a screencast 4 months later: see the 2nd link in the list here. In this post I'll do a written overview based on the latest bits. Before we continue, you should understand that PLINQ applies only to LINQ to Objects (i.e. IEnumerable-based sources where lambdas are bound to delegates, not IQueryable-based sources where the lambdas are bound to expressions). It also does not interfere with the deferred execution principles of LINQ, of course.

PLINQ as a black box
PLINQ is really simple if you want to treat it as a black box; all you do as a user is add the .AsParallel extension method to the source of you LINQ query and you are done! The following query
var result =
from x in source
where [some condition]
select [something]
...can be parallelized as follows:
var result =
from x in source.AsParallel()
where [some condition]
select [something]
Notice that the only difference is the AsParallel method call appended to the source and we can of course use this pattern with more complex queries.

Why Does It Work
To understand why the above compiles we have to remind ourselves of how LINQ works and that the first version of the code above is really equivalent to:
var result =  source.Where(x => [some condition]).Select(x => [something]);
...so when we parallelize it we are simply changing it to be the following:
var result =  source.AsParallel().Where(x => [some condition]).Select(x => [something]);
In other words the call to AsParallel returns something that also has the typical extension methods of LINQ (e.g. Where, Select and the other 100+ methods). However, with LINQ these methods live in the static System.Linq.Enumerable class whereas with PLINQ they live in the System.Linq.ParallelEnumerable class. How did we transition from one to the other? Well, AsParallel is itself an extension method on IEnumerable types and all it does is a "smart" cast of the source (the IEnumerable) to a new type which means the extension methods of this new type are picked up (instead of the ones directly on IEnumerable). In other words, by inserting the AsParallel method call, we are swapping out one implementation (Enumerable) for another (ParallelEnumerable). And that is why the code compiles fine when we insert the AsParallel method. For a more precise understanding, in the VS editor simply right click on AsParallel, choose Go To Definition and follow your nose from there…

How Does It Work
OK, so we can see why the above compiles when we change the original sequential query with our parallelised query, which we now understand is based on the introduction of new .NET 4 types such as ParallelQuery and ParallelEnumerable – all in System.Core.dll in the System.Linq namespace. But how does the new implementation take advantage (by default when it is worth it) of all the cores on your machine? Remember our friendly task-based programming model? The implementation of the methods of the static ParallelEnumerable class uses Tasks ;-). Given that the implementation is subject to change and more importantly given that we have not shipped .NET 4 yet, I will not go into exactly how it uses the Tasks, but I leave that to your imagination (or to your decompiler-assisted exploration ;)).

Simple Demo Example
Imagine a .NET 4 Console project with a single file and 3 methods, 2 of which are:
  static void Main()
{
Stopwatch sw = Stopwatch.StartNew();
DoIt();
Console.WriteLine("Elapsed = " + sw.ElapsedMilliseconds.ToString());
Console.ReadLine();
}
static bool IsPrime(int p)
{
int upperBound = (int)Math.Sqrt(p);
for (int i = 2; i <= upperBound; i++)
{
if (p % i == 0) return false;
}
return true;
}
…without worrying too much about the implementation details of IsPrime (I stole this method from the walkthrough you get in the VS2010 CTP). So the only question is where is the 3rd method, which clearly must be named Doit. Here you go:
  static void DoIt()
{
IEnumerable arr = Enumerable.Range(2, 4000000);
var q =
from n in arr
where IsPrime(n)
select n.ToString();
List list = q.ToList();
Console.WriteLine(list.Count.ToString());
}
Now if you run this you will notice that on your multi-core machine only 1 core gets used (e.g. 25% CPU utilization on my quad core). You'll also notice in the console the number of milliseconds it took to execute. How can you make this execute much faster (~4 times faster on my machine) by utilizing 100% of your total CPU power? Simply change one line of code in the Doit method:
from n in arr.AsParallel()
How cool is that?

Can It Do More
What the PLINQ implementation does is it partitions your source container into multiple chunks in order to operate on them in parallel. You can configure things such as the degree of parallelism, control ordering, specify buffering options, whether to run parts of the query sequentially etc. To experiment with all that, just explore the other new extension methods (e.g. AsOrdered, AsUnordered) and, finally, the new enumerations (e.g. ParallelQueryMergeOptions). I leave that experimentation to you dear reader ;)