parallel_for_each from amp.h – part 1

Tue, September 6, 2011, 07:52 PM under GPGPU | ParallelComputing

This posts assumes that you've read my other C++ AMP posts on index<N> and extent<N>, as well as about the restrict modifier. It also assumes you are familiar with C++ lambdas (if not, follow my links to C++ documentation).

Basic structure and parameters

Now we are ready for part 1 of the description of the new overload for the concurrency::parallel_for_each function. The basic new parallel_for_each method signature returns void and accepts two parameters:

  • a extent<N>
  • a restrict(amp) lambda, whose signature is such that it returns void and accepts an index of the same rank as the extent

So it looks something like this (with generous returns for more palatable formatting) assuming we are dealing with a 2-dimensional space:

  // some_code_A
    e, // e  is of type extent<2>
    [ ](index<2> idx) restrict(amp)
      // kernel code
  // some_code_B

The parallel_for_each will execute the body of the lambda (which must have the restrict modifier), on the GPU. We also call the lambda body the "kernel". The kernel will be executed multiple times, once per scheduled GPU thread. The only difference in each execution is the value of the index object (aka as the GPU thread ID in this context) that gets passed to your kernel code. The number of GPU threads (and the values of each index) is determined by the extent object you pass, as described next.

In this context, one way to think about it is that the extent generates a number of index objects. So for the example above, if your extent was setup by some_code_A as follows:

  extent<2> e(2,3);

...then given that: e.size()==6, e[0]==2, and e[1]==3

...the six index<2> objects it generates (and hence the values that your lambda would receive) are:

   (0,0) (1,0) (0,1) (1,1) (0,2) (1,2)

So what the above means is that the lambda body with the algorithm that you wrote will get executed 6 times and the index<2> object you receive each time will have one of the values just listed above (of course, each one will only appear once, the order is indeterminate, and they are likely to call your code at the same exact time). Obviously, in real GPU programming, you'd typically be scheduling thousands if not millions of threads, not just 6.

If you've been following along you should be thinking: "that is all fine and makes sense, but what can I do in the kernel since I passed nothing else meaningful to it, and it is not returning any values out to me?"

Passing data in and out

It is a good question, and in data parallel algorithms indeed you typically want to pass some data in, perform some operation, and then typically return some results out. The way you pass data into the kernel, is by capturing variables in the lambda (again, if you are not familiar with them, follow the links about C++ lambdas), and the way you use data after the kernel is done executing is simply by using those same variables.

In the example above, the lambda was written in a fairly useless way with an empty capture list: [ ](index<2> idx) restrict(amp), where the empty square brackets means that no variables were captured.

If instead I write it like this [&](index<2> idx) restrict(amp), then all variables in the some_code_A region are made available to the lambda by reference, but as soon as I try to use any of those variables in the lambda, I will receive a compiler error. This has to do with one of the amp restrictions, where essentially only one type can be captured by reference: objects of the new concurrency::array class that I'll introduce in the next post (suffice for now to think of it as a container of data).

If I write the lambda line like this [=](index<2> idx) restrict(amp), all variables in the some_code_A region are made available to the lambda by value. This works for some types (e.g. an integer), but not for all, as per the restrictions for amp. In particular, no useful data classes work except for one new type we introduce with C++ AMP: objects of the new concurrency::array_view class, that I'll also introduce in the next post. Also note that if you capture some variable by value, you could use it as input to your algorithm, but you wouldn’t be able to observe changes to it after the parallel_for_each call (e.g. in some_code_B region since it was passed by value) – the exception to this rule is the array_view since (as we'll see in a future post) it is a wrapper for data, not a container.

Finally, for completeness, you can write your lambda, e.g. like this [av, &ar](index<2> idx) restrict(amp) where av is a variable of type array_view and ar is a variable of type array - the point being you can be very specific about what variables you capture and how.

So it looks like from a large data perspective you can only capture array and array_view objects in the lambda (that is how you pass data to your kernel) and then use the many threads that call your code (each with a unique index) to perform some operation. You can also capture some limited types by value, as input only. When the last thread completes execution of your lambda, the data in the array_view or array are ready to be used in the some_code_B region. We'll talk more about all this in future posts…


Please note that the parallel_for_each executes as if synchronous to the calling code, but in reality, it is asynchronous. I.e. once the parallel_for_each call is made and the kernel has been passed to the runtime, the some_code_B region continues to execute immediately by the CPU thread, while in parallel the kernel is executed by the GPU threads. However, if you try to access the (array or array_view) data that you captured in the lambda in the some_code_B region, your code will block until the results become available. Hence the correct statement: the parallel_for_each is as-if synchronous in terms of visible side-effects, but asynchronous in reality.


That's all for now, we'll revisit the parallel_for_each description, once we introduce properly array and array_view – coming next.