The Moth
Developer, Former MVP, now at Microsoft - Best of
2004
,
2005
,
2006
,
2007
,
2008
,
2009
« Threading/Concurrency vs Parallelism
|
Home
|
Parallel Programming and the Virtual PC ... »
Fine Grained Parallelism
Thu, November 6, 2008, 10:32 PM under
ParallelComputing
When trying to take advantage of multi-core machines, the first step is to identify the areas in your code that can literally run at the same time. Unless you are really lucky, what this step really means is you will have to
partition your compute bound work into smaller chunks that can run in parallel
.
For example, if you are walking a binary tree data structure (typically in a recursive manner) and there is some compute intensive operation carried out on each node, maybe you can partition the work in such a way so the left side of the tree gets processed at the same time as the right side of the tree gets processed. You have partitioned your work into 2. So, on a dual core machine, you should see a performance gain. What performance gain would you further see on a quad core or 24-way machine? None beyond what you saw on the dual core.
Our goal with parallel programming is to write once and have our code scale well as the hardware underneath it gets better, i.e. see incremental benefits when running our app on machines with more cores without changing the code. For this to be achieved, we need to partition our work into multiple chunks in such a way so that ideally:
# chunks >= # cores
of the highest end machine we expect our app to run on.
The impact of the message in the previous paragraph is that, if we revisit the tree example, maybe we should process
each node
at the same time. So, if the tree has 500 nodes, our code would scale well up to a 500 core machine. An additional reason this would be a good thing is that if, as is typical, our problem size increases over time (i.e. the number of nodes in our example) our algorithm would also
scale to take advantage of even more cores
.
There is yet another reason that we need fine grained partitioning. Let's say that we partition our work into 8 work items and we know that the code will always execute on an 8-way machine, so we feel good about our approach. What if those 8 work items do not take the same time to complete? Let's say that 1 of them completes fairly quickly and the remainder take longer. Now we have a core sitting idle instead of helping with the overall work. Had we partitioned the work into much smaller chunks, when a core is finished with the item it executes, it would be able to start working on another chunk hence
achieving the desired load balancing
.
Hopefully you agree that fine grained parallelism is one of the ways forward in the manycore era. Next I'll show an example using code that shows how we partition the work and then schedule it on the cores.
Comments [1]
|
Permalink
Tuesday, June 02, 2009 9:43:23 AM (Pacific Daylight Time, UTC-07:00)
I took advantage of this approach using the thread pool for an ETL project. I broke the tasks down by top level entity (I think it was customer account). Before I took the approach, it took an hour to complete 10 thousand records. After I completed the entire database (500k records) in 30 minutes.
Brownie
Comments are closed.
About
My team's page on MSDN
Subscribe
Contact Form
My screencasts on channel9
Tags
AboutPresenting (5)
Blogging (7)
Career (5)
Communication (5)
dotNET (120)
Events (93)
GPGPU (2)
HPC (6)
IE7 RSS (6)
Links (129)
LINQ (23)
MobileAndEmbedded (148)
Orcas (128)
ParallelComputing (63)
Personal (32)
Random (42)
SideShow (12)
Silverlight (17)
SoftwareProcess (3)
UAC (14)
UserInterfaceDesign (5)
Vista (84)
VisualStudio (112)
Whidbey (31)
Windows (90)
WindowsServer2008 (3)
Latest Posts
Parallel Computing Platform Developer Lab
Slides and code for MPI Cluster Debugger
DirectCompute
GPGPU
Code for Parallelism Features Tour
Managed code and the Shell – Do?
Dev Lead Job opening on my team
Best of "The Moth" 2009
Bug Triage
Parallel Computing Features Tour in VS2010
MPI Cluster Debugger launch integration in VS2010
Parallel Debugging
"Parallel Programming Talk" show
Message Passing Interface (MPI)
Windows HPC Server links
Extension Manager in Visual Studio 2010
Core debugger enhancements in VS2010
Dump debugging with Parallel Stacks
Slides for Parallel Debugging windows
MPI Project Template for VS2010
Archives
March, 2010 (2)
February, 2010 (3)
January, 2010 (3)
December, 2009 (1)
November, 2009 (11)
October, 2009 (12)
September, 2009 (1)
August, 2009 (6)
July, 2009 (5)
June, 2009 (3)
May, 2009 (7)
April, 2009 (5)
March, 2009 (3)
February, 2009 (4)
January, 2009 (6)
December, 2008 (3)
November, 2008 (12)
October, 2008 (6)
September, 2008 (9)
August, 2008 (5)
July, 2008 (5)
June, 2008 (8)
May, 2008 (18)
April, 2008 (11)
March, 2008 (13)
February, 2008 (17)
January, 2008 (15)
December, 2007 (20)
November, 2007 (25)
October, 2007 (19)
September, 2007 (11)
August, 2007 (31)
July, 2007 (24)
June, 2007 (19)
May, 2007 (24)
April, 2007 (18)
March, 2007 (35)
February, 2007 (34)
January, 2007 (17)
December, 2006 (18)
November, 2006 (17)
October, 2006 (23)
September, 2006 (22)
August, 2006 (15)
July, 2006 (15)
June, 2006 (13)
May, 2006 (10)
April, 2006 (5)
March, 2006 (1)
February, 2006 (1)
January, 2006 (2)
December, 2005 (5)
November, 2005 (4)
October, 2005 (3)
September, 2005 (8)
August, 2005 (17)
July, 2005 (11)
June, 2005 (7)
May, 2005 (24)
April, 2005 (15)
March, 2005 (15)
February, 2005 (18)
January, 2005 (23)
December, 2004 (24)
November, 2004 (25)
October, 2004 (10)
September, 2004 (23)
August, 2004 (12)
July, 2004 (1)