Thursday, July 02, 2009

 

For those of you interested in my session on parallelism with VS2010, I gave an extended/longer version of it recently and included a bunch of new slides and updated the existing ones to match the recent release of VS2010 Beta1. Get the updated pptx here (checkout the annotated slides 16 and 17 on Parallel Tasks and Parallel Stacks).

Labels:


Wednesday, June 17, 2009

 

Whether you have read or not my blog posts on the new Visual Studio 2010 debugger windows, you now have the option of seeing them in action in these videos I published on channel9: 17-minute screencast on Parallel Stacks and 14-minute screencast on Parallel Tasks.

If you don't like watching videos and you don't like reading about features, maybe you prefer the guided hands on approach. Launch your VS2010 instance and go through my walkthrough published on MSDN.

Labels:


Saturday, June 06, 2009

 

The new Parallel Stacks window has a special feature (that applies to both Threads View and Tasks View) that we call Method View. It is accessible from the toolbar and it acts on the current stack frame (i.e. the only one with a green arrow icon or, in the absence of that, the one with the yellow arrow). It takes that method and pivots the diagram on it, coalescing all occurrences of that method into a single node in the center, clearly showing the callees and the callers. Threads that do not have that method on their stack are filtered out and not shown. Check out the before and after screenshots:

Switching to another stack frame in Method View will result in that method context becoming the "center of attention", and hence potentially some threads will disappear and others will re-appear on the diagram. The view persists cross debugging sessions (just like all the other options) and to get out of it you simply need to click on the Method View toolbar button again and return to Stack View.

Labels:


Wednesday, May 27, 2009

 

This blog post presumes you read my posts on Parallel Tasks and on Parallel Stacks, which describe the usefulness of the new VS2010 debugger windows for debugging multithreaded applications. In today's post I'll expand on the task-specific support in the Parallel Stacks window: we call this the Tasks View of the Parallel Stacks window and you can switch to that view from the combobox in the toolbar.


Threads versus Tasks
For the purposes of this window, there are 3 main differences when you are looking at call stacks of threads versus call stacks of tasks:

1. Some threads in your application will be running tasks and others will not.
2. A thread could be running more than 1 Task (only the task at the top part of the thread's call stack is actually running, you can assert that the other(s) would be in a waiting state).
3. Call stacks of threads are mapped back to the thread via the Thread ID; call stacks of tasks should be mapped back to the task via the Task ID.

These 3 differences map to the following 3 observations of what you see in the Tasks View of the Parallel Stacks window (compared to the separately covered Threads View of the Parallel Stacks window):

a) We don’t show the threads not executing tasks.
b) We split call stacks of threads executing multiple tasks into separate nodes, so you can clearly see each task call stack separately.
c) The tooltip of the header in the node and on each method context (displaying the stack frames) shows task IDs instead of thread IDs.

Here is an example where one thread is not running tasks (the Main thread):


Trimming frames from the bottom
You will notice in the screenshot above another difference not mentioned so far: some stack frames at the root of the call stack are trimmed. This is what I call showing the "real" call stacks of tasks. If you think about it, your code created a task pointing it to a method and it is from somewhere close to that method that you want to start looking at the call stack. All the other (typically non-user code) stack frames at the root of the thread's call stack are not of interest in this Tasks View, so we trim them out (as per screenshot above).

Trimming frames from the top
We also trim stack frames from the top of the call stack. The consistent decision could have been to trim *all* non-user code at the top since that is not code you would typically care too much about when debugging tasks. However, what is interesting in many scenarios is the first non-user method that was called from the top most user frame – so we kept that one frame and trimmed all the ones above it.

Here is an example of trimming stack frames from the top and also of a thread running 2 tasks:


"Go To Thread" and "Go To Task" menu
Before anyone panics with the stack trimming I mentioned above, remember that you can always view entire thread call stacks in this window by switching to the Threads View. In fact, we made it even easier to switch not just view, but also scroll to the exact position you want by offering a menu item (you'll recall I left that out from my context menu description last time) that switches views and scrolls to the same stack frame you are looking in the other view.

Status on header tooltip
I mentioned the method context tooltip showing task IDs instead of thread IDs and you can see that in the screenshot below. I also mentioned the header tooltip showing task IDs instead of thread IDs; for this tooltip we threw in bonus information by showing you the Status of each task


Feedback Please
Start playing with Visual Studio Beta 1 and let me have it!

Labels:


Wednesday, May 20, 2009

 

Assumed knowledge aka Background Reading
It is important to understand some of the existing Call Stack window features. Even if you don't care about this VS2010 feature, you can improve your debugging skills with any version of Visual Studio by reading that blog post.

Parallel Stacks debugger toolwindow
With applications increasingly having more than one thread and with parallelism gaining momentum, we need the ability to view (and navigate) more than 1 call stack from a single view. Previously I discussed the motivation behind this new window here, and I'll use the exact same code for the following screenshots. Here is the new Parallel Stacks window:

In the picture above you can see the call stacks of 3 threads in a single view. The way you read this picture is that you have one thread that went from Main to A to B. Two other threads started from the same external code and then went to A, but one of them continued to B and then to some external code, and the other continued to C and then some AnonymousMethod. This is also the active stack frame and this is the current thread.

Toolbar buttons from left to right
Threads/Tasks combobox: Switches the view between showing call stacks of threads to showing call stacks of Tasks (and vice versa). More on the Tasks view of the Parallel Stacks window in another post.

Show Only Flagged: Shows call stacks only for the threads that are flagged in the Threads window.

Toggle Method View: Switches between normal view and method view. Method view will be the subject of a separate post.

Toggle TopDown/BottomUp Display: By default stacks are drawn bottom up (so the frame at top of stack appears at the... top). This button allows drawing the diagram top down, more like a call graph.

Auto Scroll To Current Stack Frame: Auto-scrolls the diagram so the current stack frame is in view. Useful in large diagrams when changing the current stack frame via other windows or when hitting a new breakpoint.

Toggle Zoom Control: Shows or hides the zoom control. Note that zooming is always available via Ctrl+mousewheel.

Context Menu Items in random order
The last 5 menuitems are borrowed directly from the Call Stack window and introduce no new behaviors: Go To Source Code, Go To Disassembly, Show External Code, Hexadecimal Display, Symbol Load Information and Symbol Settings.

Go To Task (Thread): This performs the same function as the combobox on the toolbar does, but additionally keeps the same stack frame highlighted. It will be discussed in a separate post.

Switch To Frame: Same as menuitem on Call Stack window. Additionally, with Parallel Stacks, multiple frames may correspond to a method context so the menuitem has submenus, each representing a specific stack frame. If one of the stack frames is on the current thread (so by definition you are clicking on a node with a blue highlight around it) then that stack frame has a checkmark in front of it. For example, right clicking on Program.A and expanding the first menuitem results in this screenshot:


Annotations and Elements Listed
Call Stack Segment or Node: Each "box" contains a series of method contexts for one or more threads. If the node has no arrow lines connected to it, then it represents the entire call path for the thread(s), otherwise you need to follow the arrows to piece together the entire call path of a thread.

Blue Highlight: The blue highlight indicates the call path to which the current thread belongs to.

Arrow lines: Arrow lines connect nodes to make up the entire call path for the thread(s).

Tooltip on Node Header: The tooltip on each node shows the ID(s) of the thread(s) relevant to the node.


Method Context: Represents one or more stack frames in the same method.

Tooltip on method context: The tooltip on each method context shows the full details of all the stack frames it represents. If one of the stack frames is also on the current thread, it is shown in bold font.


Yellow Arrow Icon: A yellow arrow in front of a method context indicates that the active stack frame of the current thread is in that method context. As you'd expect, this icon would appear only at the last frame in a node that also has the blue highlight around it.

Cloth Threads Icon: A cloth threads icon in front of a method context indicates that the active stack frame of a non-current thread is in that method context. As you'd expect, this icon would appear only at the last frame in a node and there would be no blue highlight.

Green Arrow Icon: A green arrow in front of a method context indicates that the current stack frame is in that method context. That method context is also in bold font, and as you'd expect is in a node with a blue highlight. If the method context appears in other places/nodes on the diagram, it is bolded there too. E.g., if in the context menu shown above, we clicked on menuitem '5744: Program.A("hi") Line 23' we would be switching thread and current frame on that thread with a single action, resulting in the picture looking like this:


Navigation Aids
Bird's Eye View: When there are scrollbars present, the space between the scrollbars reveals a button that when pressed shows a thumbnail view of the picture which you can use to quickly scroll around it.

Zoom Control: The zoom control lets you zoom in and out, zoom to fit and zoom to 100%.

Panning: Pressing the mouse button on any white space and dragging, will scroll the diagram.

Call to Action
The call to action is identical to the Call to Action of my post on Parallel Tasks. Read it and come back here to comment ;-)

Labels:


Tuesday, May 19, 2009

 

Interested in the System.Threading.Tasks namespace? Read here the official MSDN documentation.

Labels:


Friday, May 15, 2009

 

Assumed knowledge aka Background Reading
.NET 4 introduces a new class that lives in mscorlib that I have blogged about before: System.Threading.Tasks.Task, which as you'll recall is also what Parallel.For depends on. Collectively, all those new classes is what we refer to as Task Parallel Library and the team that owns it has blogged what is new for TPL in Beta1. Another .NET 4 feature I blogged about before also depends on Tasks for its implementation: Parallel LINQ. The same team owns that and there is a blog post of what is new for PLINQ in Beta1.

If you are a C++ developer, then you'll be happy to know that the task concept also exists in C++ version 10 that ships with VS2010. In addition to the dedicated native concurrency blog I just linked to, there is a ch9 video on the topic.

Parallel Tasks debugger toolwindow
To support the new task-based programming model introduced with Visual Studio 2010 for both managed and native developers, we are introducing a new debugger window: Parallel Tasks. You can open it after you hit a breakpoint, via the same location where all debugger windows live.


Sorting, Reordering, Hiding/Showing columns
Columns are re-orderable (by dragging them left/right) and sortable (by clicking on them) as indicated by the little triangle, which also shows the sort direction. You can also hide or show columns by right clicking on any column and then (un)checking the column menuitem of interest


Grouping on a column
Via the context menu shown above you can group on your column of choice. What is cool about grouping is that it allows you to flag an entire group with one click (more on flagging later), it allows you to collapse the group (preserved between debugging steps) and it shows a count of the items in the group (even when it is collapsed).


Description of the 8 Columns
- Flag Column: Each item can be in a flagged or non-flagged state (the default). You can toggle the state of a row by clicking on the corresponding cell. You would do this to make it easier to keep an eye on an item of interest between debugging steps (e.g. breakpoints, F11) or to flag multiple items and then sort on the column to bring them all to the top. Flagging is not persisted between debugging sessions. Another use of flagging is to filter the task call stacks shown in the new Parallel Stacks window, but that is a topic of another blog post.

- Icon Column: This column is blank by default except for one Task which will have a yellow arrow in this cell. The yellow arrow indicates which one is the current task. The current task is one which is executing on the current thread and whose top stack frame is the current frame. Another icon that can be displayed in this column is the "pause" icon to indicate a frozen thread (more on freezing further down).

- Id: Each Task has a unique Id which can be retrieved programmatically by calling the Id property (for C++ tasks, this will be the memory address). This is useful so you can map the task you are seeing in the window with diagnostic output you may send to some stream; it is also useful for cross-referencing tasks between the Parallel Tasks and the Parallel Stacks windows (which I'll describe in another post).

- Status: The 4 potential values here are Running, Scheduled, Waiting and Waiting-Deadlocked. "Running" are the tasks that are executing code at the moment your app breaks in the debugger (they are at top of stack of a running thread); "Scheduled" are the ones that are sitting in some queue but have not been executed by a thread yet; "Waiting" are the ones that have run, but are now blocked on something, e.g. a Monitor, a critical section; "Waiting-Deadlocked" are the ones that are "Waiting" and additionally we have detected a wait chain with other tasks in the list. For the 2 Waiting states, hovering over the cell displays a tooltip which may have more information, like in the following screenshot.


- Location: The Location column displays the method that the task is currently in (i.e. the top most user frame of the task's call stack). Hovering over the cell displays in a tooltip the entire call stack for the task (not the entire call stack for the thread!). Naturally, Scheduled tasks do not have a value in this column.


- Task: The entry point method that was passed to the task when it was constructed. If there was a state argument explicitly passed to the task, the value of that is also shown.

- Thread Assignment: The id of the thread that the task is executing on. Remember, a task can only execute on a single thread, but a thread can be executing multiple tasks. Naturally, Scheduled tasks do not have a value in this column.

- Parent: The Id of the task that is the parent of this task (if there is one). This is not applicable for native scenarios where the concept of parent and child tasks is not a first class citizen in the programming model.

- Task Group: The address of the task_group that scheduled the task. This is not applicable for managed scenarios.

Parent Child View
When the value of any cell in the Parent column is not blank, we know we have parent child relationship(s) in our view. In that case, we can visualize this relationship slightly better by switching to "Parent Child View" via the column context menu (shown in image further above). This switches the first column to contain a treeview that allows collapsing the child nodes/tasks.


Task Context Menu
The ContextMenu described above is the one displayed when right clicking on a column; there is one that is displayed when right clicking on a (task) listviewitem.
- Copy: Copies the cells in view (tab delimited) to the clipboard.
- Select All: Or via Ctrl+A to select all tasks, in order to perform one of the other operations in bulk, e.g. copying.
- Hexadecimal Display: Global debugger setting that switches all debugger windows to hexadecimal display (or Decimal if unchecked).
- Switch To Task: Sets the selected task to be the current task (as discussed above) and hence get the yellow arrow icon. This action is also performed when double clicking on a Task listviewitem.
- Freeze Assigned Thread: The concept of freezing a thread existed before this Visual Studio release via the Threads window. The thread that is frozen does not continue execution when the debugger continues to the next step, until the user Thaws it (via the same menu). From the Parallel Tasks window, freezing affects the underlying thread and all tasks on it. I.e. this is still a thread concept rather than a task concept.
- Freeze All Threads But This: Freezes all threads in view, except the selected one.
- Flag: Same behaviour as clicking on the flag column, described further above.


Call to Action
When you get the VS2010 Beta 1, let me know what you think about this new debugger window as you write your own task-based algorithms: do you like it or do you love it? ;-) More seriously, I am genuinely interested in your bug reports (so the talented developers on my team can fix them for Beta2), what feature do you particularly like (so we make sure not to lose/break it) and what you'd like to see that is not there (so we can include it for the next release or, if it is a low hanging fruit, in this release).

Labels:


Saturday, May 09, 2009

 

My colleagues have released to DevLabs an incubation project: Axum (fka "Maestro").
"Axum aims to validate a safe and productive parallel programming model for .NET"

" Axum is a language that builds upon the architecture of the Web and principles of isolation, actors, and message-passing to increase application safety, responsiveness, scalability, and developer productivity.
Other advanced concepts we are exploring are data flow networks, asynchronous methods, and type annotations for taming side-effects."

After you visit the page and download the bits, head over to the dedicated Axum blog.

Labels:


Monday, February 23, 2009

 

During my working_from_Greece stint over the Christmas and New Year's period, I squeezed in presenting at an event in Athens. Parts of the recorded presentation are now available on the Greek MSDN pages. Scroll down the page to choose between the 2 presentations


The 1st presentation was this one. The 2nd presentation was a (interactive) Greek version of the code-heavy Parallel Programming session.

Finally, for those of you physically living in Greece, there is a 2-page article/interview on parallel computing in February's PC Magazine.

Labels: ,


Sunday, February 15, 2009

 

That was the title of a 45-minute (inc. questions) presentation I gave last December. It was a basic introduction to the manycore shift including what is parallelism and why software developers should care. The session ended by touching at a very high level on what Microsoft is doing in this space.

It was slides only (well, there were 3 demo/sample apps shown but no code) and you can download the deck in a ZIP file (the slides are a montage from many other decks of other Microsoft employees).

The basic flow has as follows
slide 3: Understanding Moore's Law
slide 4-7: Moore's law is still alive, but is not translated to higher frequencies. That is mainly due to the power wall, which at the end of the day means more heat than the CPU manufacturers can deal with
slide 8: So instead the manycore shift enters with CPU manufacturers adding more cores to the die rather than making a single one go faster. Predictions are for 100-core machines within 5 years (for the record, these are not my predictions)
slide 9: For us software developers, to take advantage of the increased total horsepower of manycore machines (on the client/desktop) you must employ parallelism. No other magic bullet or automatic gain. It is naïve to think that we will not need the increased speed or that we don’t know what to do with it:
a. We have been taking advantage (implicitly or explicitly) of increased CPU speeds over our entire existence. Why do we think we'll stop now?
b. Every shift in the software industry (whether it is the move from console to GUI apps, or desktop to mobile apps or even the recent client side web programming advancements) has been partly possible due to being able to take advantage of higher processor speeds. Why will the next shift in computing (whatever that is) be different?
slide 10: DEMO the morphing application (same one I showed at Tech Ed EMEA)
slide 11: Important to note that not all of the additional cores will be as fast as today's CPUs – they will more likely be of lower frequency; so to get even the same output that we get from one core today, we'll have to use parallelism to leverage more than one cores.
slide 12: Also important to note that it isn’t just Microsoft telling this story. Virtually every industry player is predicting the same situation.
slide 13: So the question is: what do I do with all those cores? Besides the same goals that good multithreading has (responsiveness, scalability and latency-awareness) parallelism takes it to the next level.
slide 14 to 15: Obey Amdahl's Law: do the same thing, but genuinely faster
slide 16: Obey Gustafson's law: do more stuff in the same time!
slide 17: Use speculative execution.
slide 18: DEMO the RayTracer application (same one I showed at PDC)
slide 19: "OK, I am sold. I must use parallelism. Show me how"… "Well, actually it is darn hard today if you try and use traditional multithreading to achieve parallelism"
slide 20: Microsoft established the Parallel Computing Initiative to address the goals/symptoms above
slide 21: Not the only team in Microsoft thinking about this problem. Attacking it from many angles.
slide 22: DEMO Baby Names application
slide 23: I bet you want to see some code… Read the Summary slide, and let's move on to the next session.

Labels:


Thursday, February 05, 2009

 

Last year gave the same session on Parallel Programming twice: at PDC2008 and Tech Ed EMEA 2008 (identical content). The fact that those sessions ended up on the #3 and #2 spots in the ranking order, speaks to the fact that people are really interested and accepting of this topic. It is also testament that the technology Microsoft is releasing with Visual Studio 2010 is very compelling. So I invite you here to take my content and reuse it in your local regions!

The recordings (and slides) of the two identical sessions are available so you can learn by watching them: links from here and here.

I have also captured the session content on this blog:

1. Briefly introduce the manycore shift and clarify the release vehicle for Parallel Extensions.
2. Run one of the samples that ship with Parallel Extensions to demonstrate the end user benefit (no code shown at this point).
3. Clarify the potential difference between parallelism and multi-threading.
4. DEMOnstrate Fine Grained Parallelism via the Task-based Programming model built on the new ThreadPool.
5. DEMOnstrate Debugging Parallel Applications.
6. DEMOnstrate Structured Parallelism via the static Parallel class, e.g. Imperative Data Parallelism.
7. DEMOnstrate Declarative Data Parallelism: PLINQ.

Many conferences/user groups are interested in technical sessions on Parallel Programming in .NET 4.0 and Visual Studio 2010 so use the links above to learn and share.

Labels:


Sunday, January 25, 2009

 

With VS2008 (more specifically .NET Framework 3.5) a wonderful thing was introduced: LINQ. Given the declarative nature of Language Integrated Query it was a prime candidate for trying to inject automatic parallelization in it (i.e. to run faster by seamlessly taking advantage of multiple cores). The result of those efforts is what I mentioned 18 months ago (Parallel LINQ) and followed with a screencast 4 months later: see the 2nd link in the list here. In this post I'll do a written overview based on the latest bits. Before we continue, you should understand that PLINQ applies only to LINQ to Objects (i.e. IEnumerable-based sources where lambdas are bound to delegates, not IQueryable-based sources where the lambdas are bound to expressions). It also does not interfere with the deferred execution principles of LINQ, of course.

PLINQ as a black box
PLINQ is really simple if you want to treat it as a black box; all you do as a user is add the .AsParallel extension method to the source of you LINQ query and you are done! The following query
var result =
from x in source
where [some condition]
select [something]
...can be parallelized as follows:
var result =
from x in source.AsParallel()
where [some condition]
select [something]
Notice that the only difference is the AsParallel method call appended to the source and we can of course use this pattern with more complex queries.

Why Does It Work
To understand why the above compiles we have to remind ourselves of how LINQ works and that the first version of the code above is really equivalent to:
var result =  source.Where(x => [some condition]).Select(x => [something]);
...so when we parallelize it we are simply changing it to be the following:
var result =  source.AsParallel().Where(x => [some condition]).Select(x => [something]);
In other words the call to AsParallel returns something that also has the typical extension methods of LINQ (e.g. Where, Select and the other 100+ methods). However, with LINQ these methods live in the static System.Linq.Enumerable class whereas with PLINQ they live in the System.Linq.ParallelEnumerable class. How did we transition from one to the other? Well, AsParallel is itself an extension method on IEnumerable types and all it does is a "smart" cast of the source (the IEnumerable) to a new type which means the extension methods of this new type are picked up (instead of the ones directly on IEnumerable). In other words, by inserting the AsParallel method call, we are swapping out one implementation (Enumerable) for another (ParallelEnumerable). And that is why the code compiles fine when we insert the AsParallel method. For a more precise understanding, in the VS editor simply right click on AsParallel, choose Go To Definition and follow your nose from there…

How Does It Work
OK, so we can see why the above compiles when we change the original sequential query with our parallelised query, which we now understand is based on the introduction of new .NET 4 types such as ParallelQuery and ParallelEnumerable – all in System.Core.dll in the System.Linq namespace. But how does the new implementation take advantage (by default when it is worth it) of all the cores on your machine? Remember our friendly task-based programming model? The implementation of the methods of the static ParallelEnumerable class uses Tasks ;-). Given that the implementation is subject to change and more importantly given that we have not shipped .NET 4 yet, I will not go into exactly how it uses the Tasks, but I leave that to your imagination (or to your decompiler-assisted exploration ;)).

Simple Demo Example
Imagine a .NET 4 Console project with a single file and 3 methods, 2 of which are:
  static void Main()
{
Stopwatch sw = Stopwatch.StartNew();
DoIt();
Console.WriteLine("Elapsed = " + sw.ElapsedMilliseconds.ToString());
Console.ReadLine();
}
static bool IsPrime(int p)
{
int upperBound = (int)Math.Sqrt(p);
for (int i = 2; i <= upperBound; i++)
{
if (p % i == 0) return false;
}
return true;
}
…without worrying too much about the implementation details of IsPrime (I stole this method from the walkthrough you get in the VS2010 CTP). So the only question is where is the 3rd method, which clearly must be named Doit. Here you go:
  static void DoIt()
{
IEnumerable arr = Enumerable.Range(2, 4000000);
var q =
from n in arr
where IsPrime(n)
select n.ToString();
List list = q.ToList();
Console.WriteLine(list.Count.ToString());
}
Now if you run this you will notice that on your multi-core machine only 1 core gets used (e.g. 25% CPU utilization on my quad core). You'll also notice in the console the number of milliseconds it took to execute. How can you make this execute much faster (~4 times faster on my machine) by utilizing 100% of your total CPU power? Simply change one line of code in the Doit method:
from n in arr.AsParallel()
How cool is that?

Can It Do More
What the PLINQ implementation does is it partitions your source container into multiple chunks in order to operate on them in parallel. You can configure things such as the degree of parallelism, control ordering, specify buffering options, whether to run parts of the query sequentially etc. To experiment with all that, just explore the other new extension methods (e.g. AsOrdered, AsUnordered) and, finally, the new enumerations (e.g. ParallelQueryMergeOptions). I leave that experimentation to you dear reader ;)

Labels: ,


Wednesday, January 07, 2009

 

Often the source of performance issues in our code is loops, e.g. while, for, foreach. With .NET 4 it becomes easy to make such code perform better by taking advantage of multiple cores.

Parallel.ForEach
For example, given:
IEnumerable<string> arr = ...
foreach (string item in arr){
// Do something
}
, we can parallelise it is as follows:
Parallel.ForEach<string>(arr, delegate (string item){
// Do something
});
, or the tidier directly equivalent version (dropping the superfluous generic which can be inferred and turning the anonymous method syntax into a lambda statement)
Parallel.ForEach (arr, (string item) =>{
// Do something
});

Visual Distinctions
Notice the obvious visual similarities that make it almost automatic to parallelise a loop: the only difference in the parallel version is the modification in the first line (rearranging the "arr" and "string item", which are the real pieces of information) and the fact that after the closing brace at the end there is a closing parenthesis and semicolon. The crucial visual observation here is that the body of the loop remains intact.

Why Does It Work
Let's drill into why the modified code compiles and why it is equivalent in intent (even if it is obvious to some). We turned a block of code into a library method call. The .NET 4 (mscorlib) library offers the static class Parallel that (among others) offer the ForEach method. One of its overloads (its simplest) accepts 2 parameters: a source IEnumerable of TSource and a body of code (in the form of the Action of TSource delegate, of course) that accepts a single parameter which is also of TSource, of course. The method will take the body and call it once for each element in the source. If you reread the last 2 sentences you'll find that is exactly what the original loop construct does as well. The real difference here is that the original runs serially (using only a single core) while the modified runs in parallel (using, by default, all cores).

How Does It Work
Those of you that don't like black magic boxes will ask: what does that method actually do inside in order to run things in parallel? My answer: what do you think it needs to do? Remember our friendly task-based programming model? The implementation of the methods of the static Parallel class uses Tasks (and specifically SelfReplicating tasks). Given that the implementation is subject to change and more importantly given that we have not shipped .NET 4 yet, I will not go into exactly how it uses the Tasks, but I leave that to your imagination (or to your decompiler-assisted exploration ;)).

Trivial Demo Example
In a .NET 4 Console project paste the following in the Program.cs file:
  static string[] arr = Directory.GetFiles(@"C:\Users\Public\Pictures\Sample Pictures", "*.jpg");
static void SimulateProcessing() {
Thread.SpinWait(100000000);
}
static string TID {
get {
return " TID = " + Thread.CurrentThread. ManagedThreadId.ToString();
}
}
Now in the empty Main function paste the following:
    foreach (string ip in arr) {
Program.SimulateProcessing();
Console.WriteLine(ip + TID);
}
Console.ReadLine();
Run it and notice how long it takes as well as that only one thread gets used of course and in Task Manager notice the CPU usage. Now change the loop construct so they are as follows:
Parallel.ForEach(arr, (string ip) => {
Program.SimulateProcessing();
Console.WriteLine(ip + TID);
});
Re-run it and notice how much faster it runs and how the number of threads equals the number of cores on your machine and in Task Manager the CPU usage being at 100%.

Why Not Do It Automatically
Many that see this technology ask "Why not automatically change all loops to run parallelised?". The answer is that you cannot blindly apply a Parallel.ForEach wherever you have a foreach loop. If the body of the loop depends on some shared state, or if each loop iteration is not independent of every other iteration, then race conditions may arise by blindly parallelising. Ultimately, it is multiple threads that execute the body in parallel so there is no room for shared state etc. The static methods of the Parallel class have no magic to deal with that – it is still down to you. If you find yourself needing synchronization in the loop body, be sure to measure the performance because locks and such in a parallelisation scenario potentially negate (or severely limit) the benefits of parallelisation. It is for these reasons that parallelising a loop is an opt-in decision today that only you can make for your code.

A related question arises of why not embedding this functionality in the language (the obvious suggestion being introducing a pfor loop construct). The answer is that having it as a library offering instead of built-in to the language allows it to be used by *all* .NET languages instead of restricting it to a few. Also, once something is embedded into the language it typically stays there forever so we take great care about such decisions e.g. it is too early to tie C# or VB to the System.Threading.Tasks namespace.

For the distant imaginary future, we are thinking about automatically parallelising bits of code if we can (with hints from the application developer with regards to purity and side-effect-free regions of code) and also embedding parallel constructs deeper into the language. Too early to know if anything will come of that...

Can It Do More
Yes! We only saw above one of the overloads of one of the methods. Parallel.ForEach has ~20 other overloads, some of them taking up to 5 arguments and all of them having a return type too; what I am trying to say is that there is much more flexibility and richness even in this simple API. I encourage you to explore the other overloads and also the other two methods on the Parallel class: For and Invoke.

Labels:


Tuesday, December 30, 2008

 

In a previous post I made the point about the need to finely partition our compute bound operations and enumerated the benefits of fine grained parallelism. In another post I showed how it is a mistake to directly use Threads to achieve fine grained parallelism. The problem was that the unit of partitioning in our user mode app was also the unit of scheduling of the OS.

System.Threading.Tasks.Task
We are introducing in mscorlib of .NET 4 the System.Threading.Tasks.Task type that represents a lightweight unit of work. The code from my previous post would look like this with Tasks (and it does not suffer from any of the 3 problems that the original code suffers from):
static void WalkTree(Tree tree) 
{
if (tree == null) return;
Task left = new Task((o) => WalkTree(tree.Left));
left.Start();
Task righ = new Task((o) => WalkTree(tree.Righ));
righ.Start();
left.Wait();
righ.Wait();
ProcessItem(tree.Data);
}
Tasks run on the new improved CLR 4 ThreadPool engine – I will not repeat here in this post the performance and load balancing benefits, but will instead focus on the rich API itself.

Creation and Scheduling
An example of the API is what we saw above where we used the Task with the same pattern that we use threads (create and then later start). You can see another example of the creation API if we modify the original Main method to look like this:
static void Main() 
{
Tree tr = Tree.CreateSomeTree(9, 1);
Stopwatch sw = Stopwatch.StartNew();
Task t =Task.StartNew(delegate { WalkTree(tr); });
t.Wait();
Console.WriteLine("Elapsed= " + sw.ElapsedMilliseconds.ToString());
Console.ReadLine();
}
Notice how we can create Tasks and start them with a single statement (StartNew), which is similar to how we use the ThreadPool.QueueUserWorkItem with the added benefit of having the reference to the work in the form of the variable 't'.

Waiting
Also notice above how we preserve the semantics of the code prior to the change by waiting for the work to complete before the Console.WriteLine statement. We saw this method further above in the method WalkTree. In fact in WalkTree, we can change the two calls (left.Wait and righ.Wait) with the more flexible Task.WaitAll(left, right) and there are other options such as a WaitAny method that would block only until any one of the tasks you pass into it complete.

Continuations
We can further change the body of the Main method as follows:
Tree tr = Tree.CreateSomeTree(9, 1);   
Stopwatch sw = Stopwatch.StartNew();
Task t = Task.StartNew(delegate{ WalkTree(tr);});
t.ContinueWith(tt => Console.WriteLine("Done"), TaskContinuationKind.OnAny);
t.Wait(2500);
Console.WriteLine("Elapsed= " + sw.ElapsedMilliseconds.ToString());
Notice how we are waiting with a timeout this time which means that after 2.5 seconds we will see on the console "Elapsed..." (given that our WalkTree work takes longer than that to complete). However, at that point the CPU usage will remain at 100% as our work is still being executed. When it completes, as the CPU usage drops down again, we will also see in the console "Done". This should verify your expectation of the self explanatory ContinueWith method. It is a very powerful method (more here) that enables patterns such as pipelining. You can have many continuations off the same task and you can configure the circumstances under which to continue via the TaskContinuationKind that I encourage you to explore along with the various overloads.

Cancellation
Cancellation is well integrated in the API. Cancelling a task that is scheduled in some queue and has not executed yet means that it will not be executed at all. For a task that is already running, cooperation is needed which means that the task can check a boolean property (IsCancellationRequested) to see if cancellation was requested and act accordingly. Finally, you can see if a task is actually cancelled via another boolean property (IsCanceled) on the Task type. If we modify the 2 lines of code above as follows:
    t.ContinueWith(tt => Console.WriteLine("done"));
t.Wait(2500);
t.Cancel();
...we will see the "Elapsed" message followed immediately by a drop in CPU utilization and the "Done" message.
Note that for the cancelation above to behave as expected, we are assuming that when we cancel a Task, all tasks created in that scope also get cancelled, i.e. when we cancel 't' all the tasks created in WalkTree also get cancelled. This is not the default, but we can easily configure it as such by changing the ctor call in WalkTree for both left and right to be as follows:
...= new Task((o) => WalkTree(tree.Left), TaskCreationOptions.RespectParentCancellation);

Parent Child Relationships
The above correctly implies that there is a parent child relationship between tasks that are created in the scope of an executing task. It is worth noting that parent tasks implicitly wait for their children to complete which is why the waiting worked as expected further above. If we wanted to opt out of that we can create detached children via the TaskCreationOptions.Detached option. I encourage you to experiment with the other TaskCreationOptions...

Task with Result
Let's go way back and peek at the original serial implementation of WalkTree and let's modify it so it actually returns a result:
static int WalkTree(Tree tree) 
{
if (tree == null) return 0;
int left = WalkTree(tree.Left);
int righ = WalkTree(tree.Righ);
return ProcessItem(tree.Data) + left + righ;
}
...as we ponder the question of "How do we parallelize that?" take look again at the code we have at the top of this post that parallelized the version that did not return results.
We can change it to return 0 when there are no more leaf nodes and change it to return the results of ProcessItem, but we have an issue with how to obtain the results of the WalkTree(righ) and WalkTree(left) and add them to our return results. In other words: we are passing a delegate to the Task ctor that returns a result and we need a way to store it somewhere. The obvious place to store it is the Task itself! However, we want this strongly typed so we use generics and we have type that inherits from Task which is Task<T> (in the CTP bits it is called a Future<T>). This new type has a property for returning the Value and the call will block if the task is still executing or it will return immediately if it has executed and the value is already stored. So the code can be modified as follows:
static int WalkTree(Tree tree) 
{
if (tree == null) return 0;
Task<int> left = new Task<int>((o) => WalkTree(tree.Left), TaskCreationOptions.RespectParentCanellation);
left.Start();
Task<int> righ = new Task<int> ((o) => WalkTree(tree.Righ) , TaskCreationOptions.RespectParentCanellation);
righ.Start();
return ProcessItem(tree.Data) + left.Value + righ.Value;
}
Note that if we did not want to block on Value then we could have queried the IsCompleted property of the Task.

In Summary
Above I have given you a brief glimpse of the rich API that Task exposes (and there is a lot more such as a nice exception handling model that aggregates exceptions thrown in parallel into a single AggregateException). Combined with my other posts referenced above, you should feel comfortable (if not compelled) to use this new Task API in all parallelism scenarios where previously you considered using directly Threads or the ThreadPool. Furthermore, the rich API has hopefully enticed you to use it even if you had not considered the ThreadPool or threads before.

Labels:


Monday, November 24, 2008

 

A way to think of the CLR thread pool and how it gets used, e.g. when you call ThreadPool.QueueUserWorkItem, is to picture a global queue where work items (essentially delegates) get queued on a global queue and multiple threads pick them out in a First In First Out order. The FIFO order is not something that is documented or guaranteed, but my personal guess is that too many applications rely on it, so I don’t see it changing any time soon.

The image on the left shows the main program thread as it is creating a work item; the second image shows the global thread pool queue after code has queued 3 work items; the third image shows 2 threads from the thread pool that have grabbed 2 work items and executing them. If in the context of those work items (i.e. from the executing code in the delegates) more work items get created for the CLR thread pool they end up on the global queue (see image on the right) and life goes on.

CLR Thread Pool v4 from a System.Threading.Tasks perspective
In CLR 4, the thread pool engine has had some improvements made to it (it has been having positive tweaks in every release of the CLR) and part of these improvements are some performance gains that are achievable when using the new System.Threading.Tasks.Task type. I'll show a code example in another post, but you can think of creating and starting a Task (passing it a delegate) as the equivalent of calling QueueUserWorkItem on the ThreadPool. A way to visualize the CLR thread pool when used via the Task-based API is that, in addition to the single global queue, each thread in the thread pool has its own local queue:

Just as with normal thread pool usage, the main program thread may create Tasks that will get queued on the global queue (e.g. Task1 and Task2) and threads will grab those Tasks typically in a FIFO manner. Where things diverge is that any new Tasks (e.g. Task3) created in the context of the executing Task (e.g. Task2) end up on a local queue for that thread pool thread.

Why Local Queues
With the era of manycore machines upon us and devs taking advantage of parallelism, the number of threads in the thread pool is likely to increase: at a minimum equal to the number of cores for compute bound operations, and likely more due to injection of additional threads as a result of IO bound operations or blocking calls stalling the CPU. Bottom line: more cores = more threads.

With more threads competing for work items, it is not optimal to put up with the contention issues of a single queue that all of them are trying to access safely. This would be amplified by the goal of fine grained parallelism where each work item finishes fairly quickly, so the trips to the global queue would be frequent.

It is for this reason we introduce a local queue per thread, where additional tasks get queued (with no contention) and will then be retrieved by that same thread (again with no contention).

LIFO
So, picking up from the picture further up, let's assume that Task2 additionally creates two more Tasks e.g. Task4 and Task5.

The tasks end up on the local queue as expected, but which Task does the thread pick to execute when it completes its current task (i.e. Task2)? The initially surprising answer is that it could be Task5, which is the last one that was queued – in other words a LIFO algorithm can be used for the local queues.

The reason that this is a good thing is locality. In most scenarios the data required by the last created Task in the queue is still hot in the cache, so it makes sense to pull that down and execute it. Obviously, this means there are no promises on ordering, but some level of fairness is relinquished in the sake of better performance.

Work Stealing
If the only enhancements were the introduction of LIFO local queues then performance is greatly increased, but you should be concerned. You should be concerned about what happens when another thread in the thread pool (likely executing on another core) finishes its work. Luckily, you don't have to be:

The other worker thread completes Task1 and then goes to its local queue and finds it empty; it then goes to the global queue and finds it empty. We don't want it sitting there idle so a beautiful thing happens: work stealing. The thread goes to a local queue of another thread and "steals" a Task and executes it! That way we keep all our cores busy and this contributes to our fine grained parallelism load balancing goal. In the image above notice that "stealing" happens in a FIFO manner, which again for locality reasons is good (its data would be cold in the cache). Furthermore, in many divide and conquer scenarios, tasks generated earlier on are likely to generate more work (e.g. Task6) themselves, which would end up now on this other thread's queue and hence reduce frequent stealing.
Nitpicking note: further up, I mentioned "no contention"; clearly work stealing is the exception to that rule.

What's Next
In addition to the links sprinkled above, you can find a simple implementation of a work stealing threadpool on Joe's blog post and some of the concepts above are touched on in this MSDN mag article by Eric and Erika. If you want to see the pictures above in a fully animated slide, get my PDC deck (slide 8) from this page. Do note that the implementation I am talking about above did not ship with the September CTP of Visual Studio 2010, but is slated for the next drop. Also note that all internal implementation details are subject to change and just shared here for the geek factor ;-)

So you can see how using the Task-based API will yield cool perf gains under the covers. An additional reason to use the new Task-based API is because of its richness. I'll touch on that in a next blog post.

Labels:


Wednesday, November 19, 2008

 

I have mentioned previously on this blog the two new debugger toolwindows our team is adding to Visual Studio 2010: Parallel Tasks and Parallel Stacks. I will be blogging a lot more about these, so for now I encourage you to get a high level understanding by watching my 10-minute screencast. All feedback welcome.

Labels:


Monday, November 10, 2008

 

I presume you are sold on the goal of achieving fine grained parallelism – if not please read that post.

In that post I talked quite a bit about the need to partition our work into many small chunks. So how could we partition a long running work item that looks like this:
  static void Main()
{
Tree tr = Tree.CreateSomeTree(9, 1);
Stopwatch sw = Stopwatch.StartNew();
WalkTree(tr);
Console.WriteLine("Elapsed= " + sw.ElapsedMilliseconds.ToString());
Console.ReadLine();
}
static int ProcessItem(int treeData)
{
// for demo purposes
Thread.SpinWait(4000000); //TODO something real with data
return treeData; // not being used
}
static void WalkTree(Tree tree)
{
if (tree == null) return;
WalkTree(tree.Left);
WalkTree(tree.Righ);
ProcessItem(tree.Data);
}
…assuming that Tree is defined like this: Tree.cs.

The immediate thought might be to use a thread, for example:
  static void WalkTree(Tree tree)
{
if (tree == null) return;
Thread left = new Thread((o) => WalkTree(tree.Left));
left.Start();
Thread righ = new Thread((o) => WalkTree(tree.Righ));
righ.Start();
left.Join();
righ.Join();
ProcessItem(tree.Data);
}
So now the code runs faster (e.g. on a quad core machine) than the sequential approach – the result is good. But the way we achieved this was not good, for at least 3 reasons:

1. By oversubscribing the CPU like this (creating 1023 threads on a quad core machine) we are paying a hefty price for context switching. Sure, the overall performance increased, but it could have increased even more if we used the correct number of threads. The correct number of threads is, of course, equal to the number of cores on the box. For example on a quad core, 4 threads is the ideal number of threads to execute compute bound work.

2. By making the OS schedule all those threads for our process, our process is getting an unfair advantage over other apps on the same box. That is being a bad citizen on the machine.

3. The most important reason IMO that the approach above is a bad idea is memory consumption. Since CLR 2.0, every managed thread uses 1MB of committed memory. Depending on your machine configuration and environment setup, the code above won't even run and will instead throw an OutOfMemoryException exception. That is a clue that this approach simply does not scale. We need an approach where we increase our data load (e.g. the depth of the tree) and have our algorithm scale. This again goes back to point 1, which is that ideally the number of threads used by our process is equal to the number of cores on the machine.

So even though partitioning the problem into many small chunks was a good thing, using threads was not. The fundamental problem here is that the unit of partitioning in our user mode app is also the unit of scheduling of the OS.

So, we need some kind of user mode "engine" to schedule only as many threads as the number of cores on the machine and we need this to take place automatically for us. We also need to be able to partition the overall compute-bound operation into many work items that will get executed by the "engine". If this sounds familiar, it should: ThreadPool. The CLR thread pool has the correct internal logic and a simple interface (QueueUserWorkItem) for scheduling work items to be executed by the threads it manages. This appears to be the answer to achieving fine grained parallelism.

The problem with the current ThreadPool API is that it has almost no API. You simply throw items to it in a "fire and forget" manner. You get back no handle to the work item. No way of cancelling it, waiting on it, composing a group of items in a structured way, handling exceptions thrown concurrently or any other richer construct built on top of it. So while you should be using the ThreadPool today for parallelism, there is a considerable improvement coming in .NET 4 – I'll cover that in another blog post. In the meantime, I leave it to you dear reader to convert the tree walking sample above to directly use the ThreadPool instead of threads (hint, you'll need a ManualResetEvent or equivalent to simulate the Thread.Join semantics).

As an aside, there are many people (e.g. Jeffrey Richter) that advise to never explicitly use threads for anything at all (not just in the context of parallelism, but under no circumstances whatsoever). Their advice is to always use the ThreadPool wherever you considered using a thread! Can you argue with that?

Labels:


Friday, November 07, 2008

 

The question of "How do I try/show the parallel bits in VS2010 outside of the VPC?" or variants of that keeps appearing in my inbox. That is because of Virtual PC's limitation to only run one core, which kind of sucks for parallel programming.

Here is the story:
1. The VS2010/.NET 4.0 CTP is only available as a VPC. Sorry, no directly installable bits that I can share. You can still use the VPC to get a feel for the APIs and tools, though.
2. Parallel Extensions are now part of mscorlib.dll with dependencies on the CLR 4.0, so no you cannot copy bits from the VPC and use it with VS2008.
3. You can however, run the virtual image on HyperV and then you have access to all the cores on your box. Instructions for that are here (thanks Grant for incorporating my comment into the blog post).
4. If that is not an option for you either, then your last resort is to use the old Parallel Extensions June CTP that works with VS2008. Of course, via that route, you cannot play with the new debugger bits.

In any case, have fun!

Labels:


Thursday, November 06, 2008

 

When trying to take advantage of multi-core machines, the first step is to identify the areas in your code that can literally run at the same time. Unless you are really lucky, what this step really means is you will have to partition your compute bound work into smaller chunks that can run in parallel.

For example, if you are walking a binary tree data structure (typically in a recursive manner) and there is some compute intensive operation carried out on each node, maybe you can partition the work in such a way so the left side of the tree gets processed at the same time as the right side of the tree gets processed. You have partitioned your work into 2. So, on a dual core machine, you should see a performance gain. What performance gain would you further see on a quad core or 24-way machine? None beyond what you saw on the dual core.

Our goal with parallel programming is to write once and have our code scale well as the hardware underneath it gets better, i.e. see incremental benefits when running our app on machines with more cores without changing the code. For this to be achieved, we need to partition our work into multiple chunks in such a way so that ideally: # chunks >= # cores of the highest end machine we expect our app to run on.

The impact of the message in the previous paragraph is that, if we revisit the tree example, maybe we should process each node at the same time. So, if the tree has 500 nodes, our code would scale well up to a 500 core machine. An additional reason this would be a good thing is that if, as is typical, our problem size increases over time (i.e. the number of nodes in our example) our algorithm would also scale to take advantage of even more cores.

There is yet another reason that we need fine grained partitioning. Let's say that we partition our work into 8 work items and we know that the code will always execute on an 8-way machine, so we feel good about our approach. What if those 8 work items do not take the same time to complete? Let's say that 1 of them completes fairly quickly and the remainder take longer. Now we have a core sitting idle instead of helping with the overall work. Had we partitioned the work into much smaller chunks, when a core is finished with the item it executes, it would be able to start working on another chunk hence achieving the desired load balancing.

Hopefully you agree that fine grained parallelism is one of the ways forward in the manycore era. Next I'll show an example using code that shows how we partition the work and then schedule it on the cores.

Labels:


Monday, November 03, 2008

 

To take advantage of multiple cores from our software, ultimately threads have to be used. Because of this fact, some developers fall in the trap of equating multithreading to parallelism. That is not accurate.

You can have multithreading on a single core machine, but you can only have parallelism on a multi core machine (or multi proc, but I treat them the same). The quick test: If on a single core machine you are using threads and it makes perfect sense for your scenario, then you are not "doing parallelism", you are just doing multithreading. If that same code runs on a multi core machine, any overall speedups that you may observe are accidental – you did not "think parallelism".

The mainstream devs that I know claim they are comfortable with multithreading and when you drill into "what scenarios they are enabling when using threads" there are 2 patterns that emerge. The first has to do with keeping the UI responsive (and the UI thread affinity issue): spin a thread to carry out some work and then marshal the results back to the UI (and in advanced scenarios, communicate progress events and also offer cancellation). The second has to do with I/O of one form or another: call a web service on one thread and while waiting for the results, do some other work on another; or carry out some file operation asynchronously and continue doing work on the initiating thread. When the results from the (disc/network/etc) IO operation are available, some synchronization takes place to merge the data from the executor to the requestor. The .NET framework has very good support for all of the above, for example with things like the BackgroundWorker and the APM pattern (BeginXYZ/EndXYZ) or even the event-based asynchronous pattern. Ultimately, it all comes down to using the ThreadPool directly or indirectly.

The previous paragraph summarized the opportunities (and touched on the challenges) of leveraging concurrency on a single core machine (like most of us do today). The end user goal is to improve the perceived performance or to perhaps improve the overall performance by hiding latency. All of the above is applicable on multi-core machines too, but it is not parallelism. On a multi-core machine there is an additional opportunity to improve the actual performance of your compute bound operations, by bringing parallel programming into the picture. (More on this in another post).

Another way of putting it is that on a single core you can use threads and you can have concurrency, but to achieve parallelism on a multi-core box you have to identify in your code the exploitable concurrency: the portions of your code that can truly run at the same time.

Labels:


Friday, October 17, 2008

 

Last week SeanNo, SteveTei and I spent 40 minutes with Charles from channel9 discussing and showing what we are working on in PDT for Visual Studio 2010 for enhancing the debugging and profiling experience for parallel programming.

Watch the ch9 video here.

Labels:


Saturday, October 11, 2008

 

There have been two CTPs for the "Parallel Extensions to the .NET Framework 3.5" (readers of this blog will remember November and June).

Aligned with the recent announcement about .NET 4.0 coming, the pfxteam announced yesterday that Parallel Extensions gets rolled into .NET Framework 4.0.

This is great news because it means that there should be no concerns from a deployment, support and future versioning perspective: it is the same as the .NET Framework :-)

IMO it is a misnomer now to refer to these bits as Parallel Extensions since they are now part of the core: mscorlib.dll and system.core.dll. It is however still relevant to use the terms TPL and PLINQ, the two major components and that is what I did for the abstract of my PDC session.

Labels:


Wednesday, October 01, 2008

 

Previously I asked about properties of Tasks that you'd like to see when debugging and suggested some:
[...]which ones are not scheduled yet [...], which ones are Running and which ones have run a bit and now are Waiting [...]. [...]quickly see the call stack of each Task and also which thread it is executing on. [...]parent-child relationship between Tasks.
Rather than talk about it in totally abstract terms, how about having a few mock up pictures.

After hitting a breakpoint, below are 2 Tasks running and 4 that are waiting (in our example there aren't any Tasks that are scheduled in a queue and not run yet). Then on the picture on the right, we switch to parent child view, we can see that four Tasks are actually waiting for their child tasks to complete:


In this different example below, we grouped by the Location column so we can see that 3 Tasks are running in method H of class P, and 1 task has at the top of its stack the method C of the same class. We also see the IDs of the underlying threads running those Tasks:


I can't wait for you to start working with the Task-based APIs, so you can provide feedback on what supporting tools you'd like to see. Hopefully the Parallel Tasks debugger toolwindow is a good start.

Labels:


Saturday, September 27, 2008

 

My previous post (on active stack frame and current thread) ended by raising an issue, a solution to which I propose below.

Consider the following screenshot of code (inc. snippet from Threads window) that has hit a breakpoint and envisage the call stacks for each thread before reading the rest of this blog post:


As a developer debugging a multi-threaded app, you need the ability to view call stacks of multiple threads at the same time – not just a single one at a time. What would you wish this to look like? The obvious solution would be a window like this:


If we can get to the stage above, why not aim for a few more goals: Draw it as a call graph (method calls go from top to down) and, more importantly, visually indicate when threads share common method call paths. Here is a picture (identical info to previous picture) delivering on those goals:


The “xN” indicates that the call stack segment box is shared by N threads. To piece together the entire call stack for a thread, we mentally concatenate the boxes vertically. So the diagram above reads as follows:
"One thread started in Main. It then called A; A was called by 3 threads in total, so we can deduce that 2 other threads have started their life in A. Of the 3 threads, 2 called B (and that is where their active stack frame is right now). The remaining one thread from A called C and subsequently an AnonymousMethod from C and that is where its active stack frame is now (this is also the current thread)."

What do you think? How can you make this better or what alternative do you propose?

Labels:


Monday, September 08, 2008

 

In a quiz I posted recently, I had some replies in the comments (which are incorrect btw) and more replies over emails that were correct (with a caveat). I am referring to question A (and I touched on the motivation for question B here).

In short, the correct answer to (trick) question A is that the only thing guaranteed is breakpoint #5 will be hit last and BP #1 happens before #2 and #3. There is a race between BP #4 and #1 and then a similar race between BP #2 and #3, which means that we cannot tell the overall order.

Now, I mentioned above that there was a caveat: In majority of replies they stressed that even though we cannot tell for sure, it is likely/probable that BP #4 would be hit before #1 due to overhead of setting up the child Task. Indeed if you run that exact code on your dual core machine, you’ll likely/probably find that that is the case. However, this is certainly not guaranteed or expected.

Take the original code: what 2 statements can we insert at the start of Main (before creating Task t1) that will result in BP #1 being hit before BP #4? Here is one answer:
Thread.CurrentThread.Priority = ThreadPriority.Lowest;
TaskManager tm = new TaskManager(new TaskManagerPolicy(1, 1));
...and of course passing tm as the last argument to the Create methods. Try it!

If you think that changing the thread priority was too artificial/contrived, consider that if it is a kernel event that wakes up the scheduler thread it could result in a priority boost, which has a similar risk of causing pre-emption.

If there is one guarantee about what threads run when, it is that there is no guarantee - and any internal engine (CLR, Windows) has the right to change their implementation details anyway so don't rely on empirical findings.

Labels:


Sunday, September 07, 2008

 

The second question (B) of the quiz I posted here was meant to make you think about the complications of adding a Task reference to the Watch window. I then went on to show how to use Tack.Current and make object id as a technique to help a bit. I continued by taking advantage of the DebuggerDisplayAttribute which essentially results in a productivity boost. The real question though is why during debugging would I want to watch a Task to start with?

Well, my hope is that the Task object would get more public properties that I can use to determine things like e.g. its status. But the real need is to be able to create a holistic view of my system in terms of *all* the Tasks that my code creates explicitly or implicitly. That need is what I’ve been trying to approximate with the Watch window, but so far it has ended up being a very poor solution.

So, my wish as a developer embracing the Task-based programming, is to have some way (e.g. a new VS debugger toolwindow?) of viewing all the Tasks in my system and being able to quickly determine which ones are not scheduled yet (let’s call them Initialized), which ones are Running and which ones have run a bit and now are Waiting on some resource before they can continue. If we imagine that we have this view, then I get greedy and I also want to be able to quickly see the call stack of each Task and also which thread it is executing on. My greed continues to the extent of wanting to easily view the parent-child relationship between Tasks. Have I missed any requirement? What information would you want to see in a dedicated Tasks window?

Labels:


Friday, August 22, 2008

 

Steps:
1. New C# Console Project in VS2008
2. Add Reference to System.Threading.dll (from Parallel Extensions June CTP)
3. Replace the entire contents of Program.cs with the following (overly simplified extract of a more complex piece of code):
using System;
using System.Threading.Tasks;

static class Program
{
static void Main()
{
Task t1 = Task.Create(delegate
{
Task t2 = Task.Create( // breakpoint #1
(o) =>
Console.WriteLine(o.ToString()) // breakpoint #2
, "hey");
t2.Wait(); // breakpoint #3
});
t1.Wait(); // breakpoint #4

Console.ReadLine(); // breakpoint #5
}
}

4. Insert a breakpoint (F9) wherever there is a comment.

Questions:
A. Without running the project, what can you predict about which breakpoints will be hit and in what order?
B. Can you predict at each breakpoint, which of the two variables (t1, t2) is in scope (e.g. if they were in the Watch window)?

Labels:


Monday, July 28, 2008

 

When most people's brains first light up on why parallelism is the next BigThing, some jump to the conclusion that Moore's law is over. Let's clear that up below.

All of you know Gordon Moore's law which boils down to the prediction of
"the number of transistors on a chip will double about every two years"
In the last 30-40 years the previous prediction has manifested itself in clock speed increases and that is what has tricked most of us to associate Moore's law with CPU speed.

So, now that chip manufacturers cannot make single CPUs any faster (well, they can, but they can't cool them down enough to make them useful), they are resorting to having chips with multiple cores, which we are terming the manycore shift. The manycore shift has a profound impact on developers (especially those programming for the desktop client) in that their software now has to learn how to take advantage of parallelism.

So if you followed the logical flow so far, you'll conclude that Moore's law is still alive: we are still getting more silicon, but it does not translate to increased linear speed, but rather to parallel "engines" that your software must learn to utilise.

I am glad we cleared that up :)

Labels:


Tuesday, July 22, 2008

 

One of my pet peeves is ensuring that whenever there is code that explicitly creates a thread, to always have an accompanying statement that names the thread. This is invaluable in debugging. With managed code, it is so simple, just one extra statement to set the thread's Name property (which you can only do once at runtime, of course).

This is such a useful thing to have when debugging, that in VS2008 the ability was added to name threads after execution had started. The idea there is for naming Threads that you don't control but still need to easily identify for debugging purposes. The name does not persist cross debugging sessions and indeed it is not used at runtime by anything other than your... eyes. I describe this feature as part of my video on all the new VS2008 threading enhancements (watch minutes 06:00 through 08:20, although I recommend the full 15 minutes).

At the OS level the ability to have names against threads is not supported (but you can use nice HEX ids to assist with your debugging ;)). Recently I discovered that there is an old trick (or hack IMO) that you can use with certain tools (inc. Visual Studio) to set names on native threads too. To get the details visit this short How To on MSDN. I did a quick search and discovered two more places on the web where this is described. One is on codeproject which is blatantly lifted from MSDN but it adds a couple of screenshots. The other also points to (an obsolete place on) MSDN and adds a bit more detail.

In short, for my managed friends: Search your code base now for the string "= new Thread" and for every match that you find, verify that you are also setting the Name property. You'll thank me later.

Labels:


Saturday, July 12, 2008

 

Yesterday a new 1-hour video interview on C# 4.0 was posted on channel9.

At some point the question is posed to Anders:
"What are some of the big issues that you are thinking about for the future?"
and Anders' reply (ffw to 39:56) identifies concurrency as the #1 thing:
"Concurrency is by far... profoundly changing... Moore's law... we've been ignoring concurrency because we could... now we can't... it is a damn hard problem..."
My favorite quote comes at 45:13 and includes hand gestures too:
"It is foolish to think that somehow we are going to be able to pepper concurrency on your code and you wouldn't need to modify anything in your apps today for them to run concurrently"
For the stuff in-between and to get the complete picture, watch the video.

Labels:


Monday, June 02, 2008

 

Following the first ever drop last December, the latest preview is now available. Ed has the link and details here.

Labels: ,


Friday, May 23, 2008

 

- Here are the slides (save as pptx).
- The demos were a subset of these videos: Samples, Task etc, Parallel class, PLINQ.

Thank you to those that attended my Parallel Extensions session earlier today at DevDays. I don't think I have ever seen so much interest in a technology before (I was answering questions for a good 20' after my 70' session ended). This is turning out to be one of my favourite talks – it just gives itself ;-)

Labels: ,


Friday, February 29, 2008

 

A while back I recorded some screencasts on this hot topic, and I see now that the popular VSJ developer magazine has published my article on Parallel Extensions to the .NET Framework.

For those of you that can't get a hard/physical copy, they have kindly made it available online for your reading pleasure.

Labels: ,


Friday, November 30, 2007

 

The Parallel Extensions to .NET Framework 3.5 is available! Download and play with the first ever public drop - the December CTP.

If you are then ready to dig into it, I have three 20' screencasts (plus more cooking):
1. Tour of the Samples. A tour of what gets installed to get you started.
2. Declarative data parallelism. This is about PLINQ.
3. Imperative data parallelism. This is about the static Parallel class.

After watching the above, visit the relevant MSDN dev centre. For any questions please use the dedicated online forums. For feedback please use the connect site. If you want to congratulate or blame the product team, visit their blog.

Labels: ,


Thursday, November 29, 2007

 

"Parallel Computing Initiative Ushers Computing into the Next Era"
A bit high level (marketing?) at times in this 10-page document, but there is definitely some interesting stuff in there - read it!

Labels: ,


Wednesday, November 28, 2007

 

I am a bit slow so I had to read this page twice on wikipedia, but it was worth it in the end, I think. See if you get it all in one pass: Futures and Promises. Because my imperative head wasn't completely blown away I also spent time trying to grok dataflow programming. Opens your mind doesn't it?

Labels: ,


 

Via Jason, I discovered this post on the concurrency learning curve and following links from there, I found Intel's view on "data and task parallelism". Interesting stuff!

Labels: ,


Tuesday, September 25, 2007

 

Rather than writing or screenshoting the new multithreaded debugging enhancements in Visual Studio 2008, I thought I'd create a 15' video to demonstrate them. See if you can spot what the bug is before I "discover" it in my demo ;-).

Labels: ,


Friday, September 21, 2007

 

Angelos found an interesting article via the UK MSDN Flash. If you did as well, check out more about that on the author's blog (one more reason to be sorry PDC was cancelled).

Anyway...

In a not so unrelated area, on a list that I am a member of, a question came up about affinitising a managed thread to a specific CPU. Jeffrey Richter came to the rescue by pointing out the System.Diagnostics.ProcessThread and its ProcessorAffinity property but then the harder question came along of how to associate that class with the System.Threading.Thread class. Below is the answer in C# from Mr Richter again:
      // Call this passing in 0
public static ProcessThread GetProcessThreadFromWin32ThreadId(Int32 threadId) {
if (threadId == 0) threadId = ThreadUtility.GetCurrentWin32ThreadId();
foreach (Process process in Process.GetProcesses()) {
foreach (ProcessThread processThread in process.Threads) {
if (processThread.Id == threadId) return processThread;
}
}
throw new InvalidOperationException("No thread matching specified thread Id was found.");
}
...where ThreadUtility is a class containing at least:
      [DllImport("Kernel32", EntryPoint = "GetCurrentThreadId", ExactSpelling = true)]
public static extern Int32 GetCurrentWin32ThreadId();

The code (and much more) is part the Power Threading Library which is available on the Wintellect site.

Labels: ,


Monday, July 09, 2007

 

Having done a lot of real world work with threading in the .NET Framework including working with the numerous limitations of NETCF v1.0 (inc. implementing the BackgroundWorker for it), I read with interest Sam's parallel computing blog post to see what links there would be in there and I wasn't disappointed. Check it out if you are new to threading in the managed world. That reminded me about a related topic that I keep mentioning at the end of my LINQ presentations and that I've been meaning to blog about in response to Tim's question: Parallel LINQ (PLINQ).

One of the numerous benefits of LINQ is the move to declarative programming, which has side benefits over and beyond the obvious ones. By writing code that tells the engine what it is that we want it to do rather than how to go about it, we open new possibilities where the engine can take our intent and split/execute it on multiple threads/cores (since ultimately it is responsible on the how). While most devs "get" that, it may sound a bit woolly to others, so here are some links/info about PLINQ.

I believe the first mention of PLINQ is in this eWeek article (August 2006). Joe Duffy announced his involvement (September 2006) and then followed it up with more info and a slide deck (January 2007). Bart watched a presentation on the topic and spills the beans on the AsParallel extension method which he follows with a NON-Plinq code example (April 2007). Finally, watch a real 5' demo by Luca from Tech Ed [between 51:26-56:40] (May 2007).

Of course, it is early days and PLINQ will not ship with VS2008 (or even as part of the wider Orcas-wave) but you get the idea... The earlier you start taking advantage of LINQ, the earlier you'll be able to take advantage of PLINQ when it eventually ships ;)

Labels: , ,


Copyright © Daniel Moth