concurrency::extent from amp.h

Mon, September 5, 2011, 06:23 PM under GPGPU | ParallelComputing

Overview

We saw in a previous post how index<N> represents a point in N-dimensional space, and in this post we'll see how to define the N-dimensional space itself.

With C++ AMP, an N-dimensional space can be specified with the template class extent<N> where you define the size of each dimension.

From a look and feel perspective, you'd expect the programmatic interface of a point type and size type to be similar (even though the concepts are different). Indeed, exactly like index<N>, extent<N> is essentially a coordinate vector of N integers ordered from most- to least- significant, BUT each integer represents the size for that dimension (and hence cannot be negative).

So, if you read the description of index, you won't be surprised by the description of extent<N> below:

  • There is the rank field returning the value of N you passed as the template parameter.
  • You can construct one extent from another (via the copy constructor or the assignment operator), you can construct it by passing an integer array, or via convenience constructor overloads for 1-, 2-, and 3-dimensional extents. Note that the parameterless constructor creates an extent of the specified rank with all bounds initialized to 0.
  • You can access the components of the extent through the subscript operator (passing it an integer).
  • You can perform some arithmetic operations between extent objects through operator overloading, i.e. ==, !=, +=, -=, +, -.
  • There are operator overloads so that you can perform operations between an extent and an integer: -- (pre- and post-decrement), ++ (pre- and post-increment), %=, *=, /=, +=, -= and, finally, there are additional overloads for plus and minus (+, -) between extent<N> and index<N> objects, returning a new extent object as the result.

In addition to the usual suspects, extent offers a contains function that tests if an index is within the bounds of the extent (assuming an origin of zero). It also has a size function that returns the total linear size of this extent<N> in units of elements.

Example code

  extent<2> e(3, 4);
  _ASSERT(e.rank == 2);
  _ASSERT(e.size() == 3 * 4);
  e += 3;
  e[1] += 6;
  e = e + index<2>(3,-4);
  _ASSERT(e == extent<2>(9, 9));
  _ASSERT( e.contains(index<2>(8, 8)));
  _ASSERT(!e.contains(index<2>(8, 9)));
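
For readers without amp.h at hand, the arithmetic behind size and contains can be followed with a small portable stand-in. This is only a sketch: extent2 and walk_through are hypothetical names introduced here and are not the amp.h implementation. It assumes an origin of zero, as described above.

```cpp
// Hypothetical stand-in for extent<2>; NOT the amp.h implementation.
struct extent2 {
    int dims[2];  // dims[0] is the most significant dimension

    // Total number of elements: the product of all dimension sizes.
    constexpr int size() const { return dims[0] * dims[1]; }

    // True when every component of the point lies in [0, dims[i]).
    constexpr bool contains(int i0, int i1) const {
        return i0 >= 0 && i0 < dims[0] && i1 >= 0 && i1 < dims[1];
    }
};

// Repeats the steps of the example above on the stand-in type.
constexpr extent2 walk_through() {
    extent2 e{{3, 4}};
    e.dims[0] += 3; e.dims[1] += 3;   // e += 3             -> (6, 7)
    e.dims[1] += 6;                   // e[1] += 6          -> (6, 13)
    e.dims[0] += 3; e.dims[1] += -4;  // e + index<2>(3,-4) -> (9, 9)
    return e;
}
```

With a (9, 9) extent the last valid point is (8, 8), which is why (8, 9) falls outside it.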

 

Usage

The extent class on its own simply defines the size of the N-dimensional space. We'll see in future posts that when you create containers (arrays) and wrappers (array_views) for your data, it is an extent<N> object that you'll need to use to create those (and use an index<N> object to index into them). We'll also see that it is an extent<N> object that you pass to the new parallel_for_each function that I'll cover in the next post.


concurrency::index from amp.h

Sun, September 4, 2011, 09:40 PM under GPGPU | ParallelComputing

Overview

C++ AMP introduces a new template class index<N>, where N can be any value greater than zero, that represents a unique point in N-dimensional space, e.g. if N=2 then an index<2> object represents a point in 2-dimensional space. This class is essentially a coordinate vector of N integers representing a position in space relative to the origin of that space. It is ordered from most-significant to least-significant (so, if the 2-dimensional space is rows and columns, the first component represents the rows). The underlying type is a signed 32-bit integer, and component values can be negative.

The rank field returns N.

Creating an index

The default parameterless constructor returns an index with each dimension set to zero, e.g.

  index<3> idx; //represents point (0,0,0)

An index can also be created from another index through the copy constructor or assignment, e.g.

  index<3> idx2(idx); //or index<3> idx2 = idx;

To create an index representing something other than 0, you call its constructor as per the following 4-dimensional example:

  int temp[4] = {2,4,-2,0};
  index<4> idx(temp);

Note that there are convenience constructors (that don’t require an array argument) for creating index objects of rank 1, 2, and 3, since those are the most common dimensions used, e.g.

  index<1> idx(3);
  index<2> idx(3, 6);
  index<3> idx(3, 6, 12);

Accessing the component values

You can access each component using the familiar subscript operator, e.g.

One-dimensional example:

  index<1> idx(4);
  int i = idx[0]; // i=4

Two-dimensional example:

  index<2> idx(4,5);
  int i = idx[0]; // i=4
  int j = idx[1]; // j=5

Three-dimensional example:

  index<3> idx(4,5,6);
  int i = idx[0]; // i=4
  int j = idx[1]; // j=5
  int k = idx[2]; // k=6

Basic operations

Once you have your multi-dimensional point represented in the index, you can now treat it as a single entity, including performing common operations between it and an integer (through operator overloading): -- (pre- and post-decrement), ++ (pre- and post-increment), %=, *=, /=, +=, -=, %, *, /, +, -. There are also operator overloads for operations between index objects, i.e. ==, !=, +=, -=, +, -.

Here is an example (where no assertions are broken):

  index<2> idx_a;
  index<2> idx_b(0, 0);
  index<2> idx_c(6, 9);
  _ASSERT(idx_a.rank == 2);
  _ASSERT(idx_a == idx_b);
  _ASSERT(idx_a != idx_c);

  idx_a += 5;
  idx_a[1] += 3;
  idx_a++;
  _ASSERT(idx_a != idx_b);
  _ASSERT(idx_a == idx_c);

  idx_b = idx_b + 10;
  idx_b -= index<2>(4, 1);
  _ASSERT(idx_a == idx_b);

Usage

You'll most commonly use index<N> objects to index into data types that we'll cover in future posts (namely array and array_view). Also when we look at the new parallel_for_each function we'll see that an index<N> object is the single parameter to the lambda, representing the (multi-dimensional) thread index…
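
To make the "most-significant to least-significant" ordering concrete, here is a portable sketch of how a multi-dimensional point could be mapped to a position in flat, row-major storage. The helper linear_offset and the sample arrays are hypothetical names introduced for illustration; they are not part of amp.h, and the sketch assumes the data is laid out with the first component varying slowest.

```cpp
// Hypothetical helper, not part of amp.h: folds an N-dimensional point into a
// linear offset, treating component 0 as the most significant one.
template <int N>
constexpr int linear_offset(const int (&idx)[N], const int (&ext)[N])
{
    int offset = 0;
    for (int d = 0; d < N; ++d)
        offset = offset * ext[d] + idx[d];  // shift by this dimension's size, then add
    return offset;
}

// Worked examples (values chosen for illustration only).
constexpr int point2[2] = {1, 2};     // row 1, column 2
constexpr int space2[2] = {3, 4};     // 3 rows, 4 columns
constexpr int point3[3] = {1, 2, 3};
constexpr int space3[3] = {2, 3, 4};
```

For the 2-dimensional case this reduces to row * columns + column, so the point (1, 2) in a 3 x 4 space lands at offset 1 * 4 + 2 = 6.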

In the next post we'll go beyond being able to represent an N-dimensional point in space, and we'll see how to define the N-dimensional space itself through the extent<N> class.


concurrency::accelerator_view

Sat, September 3, 2011, 08:32 PM under GPGPU | ParallelComputing

Overview

We saw previously that accelerator represents a target for our C++ AMP computation or memory allocation and that there is a notion of a default accelerator. We ended that post by introducing how one can obtain accelerator_view objects from an accelerator object through the accelerator class's default_view property and the create_view method.

The accelerator_view objects can be thought of as handles to an accelerator.

You can also construct an accelerator_view given another accelerator_view (through the copy constructor or the assignment operator overload). Speaking of operator overloading, you can also compare two accelerator_view objects (for equality and inequality) to determine if they refer to the same underlying accelerator.

We'll see later that when we use concurrency::array objects, the allocation of data takes place on an accelerator at array construction time, so there is a constructor overload that accepts an accelerator_view object. We'll also see later that a new concurrency::parallel_for_each function overload can take an accelerator_view object, so it knows on what target to execute the computation (represented by a lambda that the parallel_for_each also accepts).

Beyond normal usage, accelerator_view is a quality of service concept that offers isolation to multiple "consumers" of an accelerator. If in your code you are accessing the accelerator from multiple threads (or, in general, from different parts of your app), then you'll want to create separate accelerator_view objects for each thread.

flush, wait, and queuing_mode

When you create an accelerator_view via the create_view method of the accelerator, you pass in an option of queuing_mode_immediate or queuing_mode_automatic, which are the two members of the queuing_mode enum. At any point you can access this value from the queuing_mode property of the accelerator_view.

When the queuing_mode value is queuing_mode_automatic (which is the default), any commands sent to the device such as kernel invocations and data transfers (e.g. parallel_for_each and copy, as we'll see in future posts), will get submitted as soon as the runtime sees fit (that is the definition of automatic).

When the value of queuing_mode is queuing_mode_immediate, the commands will be submitted/flushed immediately.

To send all buffered commands to the device for execution, there is a non-blocking flush method that you can call. If you wish to block until all the commands have finished executing, there is a wait method you can call (which also flushes). You can read more to understand C++ AMP's queuing_mode.
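
The difference between the two modes can be pictured with a toy model in portable C++. It is only a sketch of the idea, not how the C++ AMP runtime works: toy_view, enqueue and submitted_after are hypothetical names, and "as soon as the runtime sees fit" is simplified here to "only when flush is called".

```cpp
// Toy model only; NOT the C++ AMP runtime or its API.
enum class queuing_mode { automatic, immediate };

struct toy_view {
    queuing_mode mode;
    int buffered = 0;   // commands accepted but not yet sent to the device
    int submitted = 0;  // commands sent to the device

    // Immediate mode sends each command straight away; automatic mode
    // buffers it (in this toy: until flush is called explicitly).
    constexpr void enqueue() {
        if (mode == queuing_mode::immediate) ++submitted;
        else ++buffered;
    }

    // Sends everything that is still buffered.
    constexpr void flush() { submitted += buffered; buffered = 0; }
};

// Enqueues `commands` commands and reports how many reached the device.
constexpr int submitted_after(queuing_mode mode, int commands, bool call_flush)
{
    toy_view view{mode};
    for (int i = 0; i < commands; ++i) view.enqueue();
    if (call_flush) view.flush();
    return view.submitted;
}
```

In this model, three commands in automatic mode stay buffered until flush is called, whereas in immediate mode all three are submitted without any flush.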

Querying information

Just like accelerator, accelerator_view exposes the is_debug and version properties. In fact, you can always access the accelerator object from the accelerator property on the accelerator_view class to access the accelerator interface we looked at previously.

accelerator_view also exposes a function that helps you stay aware of the progress of execution. You can read more about accelerator_view::create_marker.

Interop with D3D (aka DX)

If your app that uses C++ AMP to compute data also uses DirectX rendering shaders, e.g. pixel shaders, you can benefit by integrating C++ AMP into your graphics pipeline. One of the building blocks for that is being able to use the same device context from both the compute kernel and the other shaders. You can do that by going from accelerator_view to device context (and vice versa), through part of our interop API in amp.h: get_device, create_accelerator_view. You can read more on DirectX interop.


concurrency::accelerator

Wed, August 31, 2011, 07:12 PM under GPGPU | ParallelComputing

Overview

An accelerator represents a "target" on which C++ AMP code can execute and where data can reside. Typically (but not necessarily) an accelerator is a GPU device. Accelerators are represented in C++ AMP as objects of the accelerator class.

For many scenarios, you do not need to obtain an accelerator object, since the runtime has a notion of a default accelerator, which is what it thinks is the best one in the system. Examples where you need to deal with accelerator objects are if you need to pick your own accelerator (based on your specific criteria), or if you need to use more than one accelerator from your app.

Construction and operator usage

You can query and obtain a std::vector of all the accelerators on your system, which the runtime discovers on startup.

Beyond enumerating accelerators, you can also create one directly by passing to the constructor a system-wide unique path to a device if you know it (i.e. the “Device Instance Path” property for the device in Device Manager), e.g. accelerator acc(L"PCI\\VEN_1002&DEV_6898&SUBSYS_0B001002etc");

There are some predefined strings (for predefined accelerators) that you can pass to the accelerator constructor (and there are corresponding constants for those on the accelerator class itself, so you don’t have to hardcode them every time). Examples are the following:

  • accelerator::default_accelerator represents the default accelerator that the C++ AMP runtime picks for you if you don’t pick one (the heuristics of how it picks one will be covered in a future post). Example: accelerator acc;
  • accelerator::direct3d_ref represents the reference rasterizer emulator that simulates a direct3d device on the CPU (in a very slow manner). This emulator is available on systems with Visual Studio installed and is useful for debugging. More on debugging in general in future posts. Example: accelerator acc(accelerator::direct3d_ref);
  • accelerator::direct3d_warp represents WARP which is the current CPU fallback. Example: accelerator acc(accelerator::direct3d_warp);
  • accelerator::cpu_accelerator represents the CPU. In this first release the only use of this accelerator is for using the staging arrays technique. Example: accelerator acc(accelerator::cpu_accelerator);

You can also create an accelerator by shallow copying another accelerator instance (via the corresponding constructor) or simply assigning it to another accelerator instance (via the operator overloading of =). Speaking of operator overloading, you can also compare two accelerator objects (for equality and inequality) to determine if they refer to the same underlying device.

Querying accelerator characteristics

Given an accelerator object, you can access its description, version, device path, size of dedicated memory in KB, whether it is some kind of emulator, whether it has a display attached, whether it supports double precision, and whether it was created with the debugging layer enabled for extensive error reporting.

Below is example code that accesses some of the properties; in your real code you'd probably be checking one or more of them in order to pick an accelerator (or check that the default one is good enough for your specific workload):

  std::vector<accelerator> accs = accelerator::get_all();
  std::for_each(accs.begin(), accs.end(), [] (accelerator acc) 
  { 
    std::wcout << "New accelerator: " << acc.description << std::endl; 
    std::wcout << "device_path = " << acc.device_path << std::endl; 
    std::wcout << "version = " << (acc.version >> 16) << '.' << (acc.version & 0xFFFF) << std::endl; 
    std::wcout << "dedicated_memory = " << acc.dedicated_memory << " KB" << std::endl; 
    std::wcout << "doubles = " << ((acc.supports_double_precision) ? "true" : "false") << std::endl; 
    std::wcout << "limited_doubles = " << ((acc.supports_limited_double_precision) ? "true" : "false") << std::endl; 
    std::wcout << "has_display = " << ((acc.has_display) ? "true" : "false") << std::endl;
    std::wcout << "is_emulated = " << ((acc.is_emulated) ? "true" : "false") << std::endl; 
    std::wcout << "is_debug = " << ((acc.is_debug) ? "true" : "false") << std::endl; 
    std::wcout << std::endl;
  }); 
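
The version line in the snippet above splits a single integer into two halves. As a small portable sketch of that arithmetic, assuming (as the shifts in the snippet suggest) that the major number sits in the upper 16 bits and the minor number in the lower 16 bits, the helpers below decode and re-encode such a value. The function names are hypothetical and are not part of amp.h.

```cpp
// Hypothetical helpers, not part of amp.h. They assume the packing implied by
// the snippet above: major in the upper 16 bits, minor in the lower 16 bits.
constexpr unsigned version_major(unsigned version) { return version >> 16; }
constexpr unsigned version_minor(unsigned version) { return version & 0xFFFFu; }

// Inverse operation, handy for comparing against a required minimum version.
constexpr unsigned pack_version(unsigned major_part, unsigned minor_part)
{
    return (major_part << 16) | (minor_part & 0xFFFFu);
}
```

For example, a value of 0x000B0001 decodes to major 11 and minor 1 under this assumption.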

accelerator_view

In my next blog post I'll cover a related class: accelerator_view. Suffice it to say here that each accelerator may have 1..n related accelerator_view objects. You can get the accelerator_view from an accelerator via the default_view property, or create new ones by invoking the create_view method that creates an accelerator_view object for you (by also accepting a queuing_mode enum value of queuing_mode_automatic or queuing_mode_immediate that we'll also explore in the next blog post).


ScrollViewer.EnsureVisible for Windows Phone

Sat, July 30, 2011, 08:06 PM under MobileAndEmbedded

In my Translator By Moth app, on both the current and saved pivot pages the need arose to programmatically scroll to the bottom. In the former case, it is when a translation takes place (if the text is too long, I want to scroll to the bottom of the translation so the user can focus on that, and not their input text for translation). In the latter case, it is when a new translation is saved (it is added to the bottom of the list, so scrolling is required to make it visible). On both pages a ScrollViewer is used.

In my exploration of the APIs through intellisense and msdn I could not find a method that auto scrolled to the bottom. So I hacked together a solution where I added a blank textblock to the bottom of each page (within the ScrollViewer, but below the translated textblock and the saved list) and tried to scroll it into view from code. After searching the web I found a little algorithm that did most of what I wanted (sorry, I do not have the reference handy, but thank you whoever it was) that after minor tweaking I turned into an extension method for the ScrollViewer that is very easy to use:

	this.Scroller.EnsureVisible(this.BlankText);

The method itself I share with you here:

    public static void EnsureVisible(this System.Windows.Controls.ScrollViewer scroller, 
                                          System.Windows.UIElement uiElem)
    {
      System.Diagnostics.Debug.Assert(scroller != null);
      System.Diagnostics.Debug.Assert(uiElem != null);

      scroller.UpdateLayout();

      double maxScrollPos = scroller.ExtentHeight - scroller.ViewportHeight;
      double scrollPos = 
              scroller.VerticalOffset - 
              scroller.TransformToVisual(uiElem).Transform(new System.Windows.Point(0, 0)).Y;

      if (scrollPos > maxScrollPos) scrollPos = maxScrollPos;
      else if (scrollPos < 0) scrollPos = 0;

      scroller.ScrollToVerticalOffset(scrollPos);
    }

I am sure there are better ways, but this "worked for me" :-)


"Hello World" in C++ AMP

Sun, June 26, 2011, 06:02 PM under GPGPU | ParallelComputing

UPDATE: I encourage you to visit a newer and better post with a C++ AMP matrix multiplication example.

Some say that the equivalent of "hello world" code in the data parallel world is matrix multiplication :)

Below is the before C++ AMP and after C++ AMP code. For more on what it all means, watch the recording of my C++ AMP introduction (the example below is part of the session).

    void MatrixMultiply(vector<float>& vC,
                        const vector<float>& vA,
                        const vector<float>& vB,
                        int M, int N, int W )
    {
        for (int y = 0; y < M; y++) 
        {
            for (int x = 0; x < N; x++) 
            {
                float sum = 0;
                for(int i = 0; i < W; i++)
                {
                    sum += vA[y * W + i] * vB[i * N + x];
                }
                vC[y * N + x] = sum;
            }
        }
    }

Change the function to use C++ AMP and hence offload the computation to the GPU, and now the calling code (which I am not showing) needs no changes and the overall operation gives you a really nice speed-up for large datasets…

    #include <amp.h>
    using namespace concurrency;

    void MatrixMultiply(vector<float>& vC,
                        const vector<float>& vA,
                        const vector<float>& vB,
                        int M, int N, int W )
    {
        array_view<const float,2>      a(M, W, vA);
        array_view<const float,2>      b(W, N, vB);
        array_view<writeonly<float>,2> c(M, N, vC); 

        parallel_for_each(
            c.grid,
            [=](index<2> idx) mutable restrict(direct3d) 
            {
                float sum = 0;
                for(int i = 0; i < a.x; i++) 
                {
                    sum += a(idx.y, i) * b(i, idx.x);
                }
                c[idx] = sum;
            }
        );
    }

Again, you can understand the elements above by using my C++ AMP presentation slides and recording.
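
The calling code is not shown in this post, so as a rough illustration here is what a caller of the serial version might look like. The function multiply_example and its sample data are hypothetical; the sketch assumes, as the indexing in the loop implies, that each matrix is stored row by row in a flat vector (A is M x W, B is W x N, C is M x N).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial version from the post: C (M x N) = A (M x W) * B (W x N),
// with all three matrices stored row by row in flat vectors.
void MatrixMultiply(std::vector<float>& vC,
                    const std::vector<float>& vA,
                    const std::vector<float>& vB,
                    int M, int N, int W)
{
    for (int y = 0; y < M; y++)
    {
        for (int x = 0; x < N; x++)
        {
            float sum = 0;
            for (int i = 0; i < W; i++)
            {
                sum += vA[y * W + i] * vB[i * N + x];
            }
            vC[y * N + x] = sum;
        }
    }
}

// Hypothetical caller (the post does not show one): a 2x3 times 3x2 product.
std::vector<float> multiply_example()
{
    const int M = 2, N = 2, W = 3;
    const std::vector<float> a = {1, 2, 3,
                                  4, 5, 6};   // M x W
    const std::vector<float> b = {7, 8,
                                  9, 10,
                                  11, 12};    // W x N
    std::vector<float> c(M * N);              // M x N, filled by the call

    // The flat vectors must match the dimensions passed in.
    assert(a.size() == static_cast<std::size_t>(M * W));
    assert(b.size() == static_cast<std::size_t>(W * N));

    MatrixMultiply(c, a, b, M, N, W);
    return c;                                 // {58, 64, 139, 154}
}
```

Because the signature is the same in both versions, a caller like this would not need to change when the body is swapped for the C++ AMP one.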

Stay tuned for more…


Links to C++ documentation

Wed, June 22, 2011, 05:30 PM under C++

After a recent talk I gave on C++ AMP, one attendee was complaining that they were not familiar with lambdas and another found templates hard to parse. In case you are in the same boat, I thought I'd gather some essential reading material for you (also gives me one link to use in the future for referring people to ;-)

Lambdas are available (in some shape or form) in all modern languages, so do yourself a favor and learn about them:

Templates have been around in modern languages for even longer than lambdas (e.g. Generics in .NET), so again go dive in:

In fact, why don't you refresh your knowledge and read the entire msdn C++ Language Reference – that's what I am doing!

If you are looking to keep up to date with what is happening in the C++ world, stay tuned on the Visual C++ team (aka WinC++ team) blog and ask questions in the C++ forums.


C++ AMP recording and slides

Fri, June 17, 2011, 02:51 PM under Events | GPGPU | ParallelComputing

Yesterday we announced C++ Accelerated Massive Parallelism.

Many of you want to know more about the API instead of just meta information. I will trickle more code over the coming months leading up to the date when we will share actual bits. Until you have bits in your hand, it is only your curiosity that is blocked, so I ask you to be patient with that and allow me to release this on our own schedule ;-)

You can now watch my 45-minute session introducing C++ AMP on channel9. You will also want to download the slides (pdf), because they are not readable in the recording.


C++ Accelerated Massive Parallelism

Wed, June 15, 2011, 09:16 AM under GPGPU | ParallelComputing

At AMD's Fusion conference Herb Sutter announced in his keynote session a technology that our team has been working on that we call C++ Accelerated Massive Parallelism (C++ AMP) and during the keynote I showed a brief demo of an app built with our technology. After the keynote, I go deeper into the technology in my breakout session. If you read both those abstracts, you'll get some information about what C++ AMP is, without being too explicit since we published the abstracts before the technology was announced.

You can find the official online announcement at Soma's blog post.

Here, I just wanted to capture the key points about C++ AMP that can serve as an introduction and an FAQ. So, in no particular order…

C++ AMP

  1. lowers the barrier to entry for heterogeneous hardware programmability and brings performance to the mainstream, without sacrificing developer productivity or solution portability.
  2. is designed not only to help you address today's massively parallel hardware (i.e. GPUs and APUs), but it also future proofs your code investments with a forward looking design.
  3. is part of Visual C++. You don't need to use a different compiler or learn different syntax.
  4. is modern C++. Not C or some other derivative.
  5. is integrated and supported fully in Visual Studio 11. Editing, building, debugging, profiling and all the other goodness of Visual Studio work well with C++ AMP.
  6. provides an STL-like library as part of the existing concurrency namespace and delivered in the new amp.h header file.
  7. makes it extremely easy to work with large multi-dimensional data on heterogeneous hardware; in a manner that exposes parallelization.
  8. introduces only one core C++ language extension.
  9. builds on DirectX (and DirectCompute in particular) which offers a great hardware abstraction layer that is ubiquitous and reliable. The architecture is such, that this point can be thought of as an implementation detail that does not surface to the API layer.

Stay tuned on my blog for more over the coming months where I will switch from just talking about C++ AMP to showing you how to use the API with code examples…


Speaking at AMD Fusion conference

Thu, June 9, 2011, 05:44 PM under Events | ParallelComputing

UPDATE: C++ AMP session recording and slides now available.

Next Wednesday at 2pm I will be presenting a session at the AMD Fusion developer summit in Bellevue, Washington State.

For more on this conference please visit the official website. If you filter the catalog by 'Speaker Last Name' to "Moth", you'll find my talk.

For your convenience, below are the title and abstract:

Blazing-fast code using GPUs and more, with Microsoft Visual C++

To get full performance out of mainstream hardware, high-performance code needs to harness, not only multi-core CPUs, but also GPUs (whether discrete cards or integrated in the processor) and other compute accelerators to achieve orders-of-magnitude speed-up for data parallel algorithms. How can you as a C++ developer fully utilize all that heterogeneous hardware from your Visual Studio environment? How can your code benefit from this tremendous performance boost without sacrificing your developer productivity or the portability of your solution? The answers will be presented in this session that introduces a new technology from Microsoft.

Hope to see many of you there!