"Hello World" in C++ AMP

Sun, June 26, 2011, 06:02 PM under GPGPU | ParallelComputing

UPDATE: I encourage you to visit a newer and better post with a C++ AMP matrix multiplication example.

Some say that the equivalent of "hello world" code in the data parallel world is matrix multiplication :)

Below is the code before and after C++ AMP. For more on what it all means, watch the recording of my C++ AMP introduction (the example below is part of the session).

    #include <vector>
    using std::vector;

    // vC (M x N) = vA (M x W) * vB (W x N); all matrices are row-major.
    void MatrixMultiply(vector<float>& vC,
                        const vector<float>& vA,
                        const vector<float>& vB,
                        int M, int N, int W)
    {
        for (int y = 0; y < M; y++)
        {
            for (int x = 0; x < N; x++)
            {
                float sum = 0;
                for (int i = 0; i < W; i++)
                {
                    sum += vA[y * W + i] * vB[i * N + x];
                }
                vC[y * N + x] = sum;
            }
        }
    }
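
The post does not show the calling code, so here is a minimal sketch of what a caller might look like. The helper `MultiplySmallExample` and its 2x3 and 3x2 inputs are hypothetical, made up for illustration; the only thing taken from the post is the serial `MatrixMultiply` signature and its row-major layout, and the function is repeated so the snippet compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using std::vector;

// Serial version from the post: vC (M x N) = vA (M x W) * vB (W x N),
// with every matrix stored row-major in a flat vector.
void MatrixMultiply(vector<float>& vC,
                    const vector<float>& vA,
                    const vector<float>& vB,
                    int M, int N, int W)
{
    for (int y = 0; y < M; y++)
    {
        for (int x = 0; x < N; x++)
        {
            float sum = 0;
            for (int i = 0; i < W; i++)
            {
                sum += vA[y * W + i] * vB[i * N + x];
            }
            vC[y * N + x] = sum;
        }
    }
}

// Hypothetical caller (not from the post): multiplies a 2x3 matrix by a
// 3x2 matrix and returns the 2x2 result as a flat row-major vector.
vector<float> MultiplySmallExample()
{
    const int M = 2, N = 2, W = 3;
    const vector<float> vA = { 1, 2, 3,
                               4, 5, 6 };       // M x W
    const vector<float> vB = { 7,  8,
                               9,  10,
                               11, 12 };        // W x N
    vector<float> vC(M * N);                    // the caller allocates the output

    // The function trusts the caller to pass consistent sizes.
    assert(vA.size() == static_cast<std::size_t>(M * W));
    assert(vB.size() == static_cast<std::size_t>(W * N));

    MatrixMultiply(vC, vA, vB, M, N, W);
    return vC;                                  // {58, 64, 139, 154}
}
```

Because the function only sees flat vectors and three dimensions, a caller like this does not care whether the body runs on the CPU or is offloaded.
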

Change the function to use C++ AMP, and the computation is offloaded to the GPU. The calling code (which I am not showing) needs no changes, and the overall operation gives you a really nice speedup for large datasets…

    #include <amp.h>
    #include <vector>
    using namespace concurrency;
    using std::vector;

    void MatrixMultiply(vector<float>& vC,
                        const vector<float>& vA,
                        const vector<float>& vB,
                        int M, int N, int W)
    {
        // Wrap the flat vectors in 2-D views: rows, columns, backing container.
        array_view<const float,2>      a(M, W, vA);
        array_view<const float,2>      b(W, N, vB);
        array_view<writeonly<float>,2> c(M, N, vC);

        // The lambda is invoked once per element of c.
        parallel_for_each(
            c.grid,
            [=](index<2> idx) mutable restrict(direct3d)
            {
                float sum = 0;
                for (int i = 0; i < a.x; i++)
                {
                    sum += a(idx.y, i) * b(i, idx.x);
                }
                c[idx] = sum;
            }
        );
    }
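
To see how the 2-D indexing in the lambda lines up with the flat offsets of the serial version, here is a CPU-only sketch. It does not use `amp.h` at all: `RowMajorView`, `ComputeElement` and `MatrixMultiplyViews` are hypothetical names invented for this illustration, and the nested loops are only a serial stand-in for `parallel_for_each`. It assumes the inner loop bound in the listing above is the shared dimension W.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, NOT part of C++ AMP: a read-only 2-D view over a
// flat row-major vector, so that view(row, col) reads data[row * cols + col].
class RowMajorView
{
public:
    RowMajorView(int rows, int cols, const std::vector<float>& data)
        : rows_(rows), cols_(cols), data_(data)
    {
        assert(data.size() == static_cast<std::size_t>(rows) * cols);
    }

    int rows() const { return rows_; }
    int cols() const { return cols_; }

    float operator()(int row, int col) const
    {
        return data_[static_cast<std::size_t>(row) * cols_ + col];
    }

private:
    int rows_;
    int cols_;
    const std::vector<float>& data_;
};

// Work for a single output element (y, x); it mirrors the lambda body.
float ComputeElement(const RowMajorView& a, const RowMajorView& b, int y, int x)
{
    float sum = 0;
    for (int i = 0; i < a.cols(); i++)   // a.cols() is the shared dimension W
    {
        sum += a(y, i) * b(i, x);
    }
    return sum;
}

// Serial stand-in for the parallel loop: visits each (y, x) of the
// M x N output exactly once, in no order the result depends on.
void MatrixMultiplyViews(std::vector<float>& vC,
                         const std::vector<float>& vA,
                         const std::vector<float>& vB,
                         int M, int N, int W)
{
    RowMajorView a(M, W, vA);
    RowMajorView b(W, N, vB);
    for (int y = 0; y < M; y++)
    {
        for (int x = 0; x < N; x++)
        {
            vC[y * N + x] = ComputeElement(a, b, y, x);
        }
    }
}
```

Each output element is computed independently of the others, which is what makes the loop over the output a natural fit for a data parallel launch.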

Again, you can understand the elements above by going through my C++ AMP presentation slides and recording.

Stay tuned for more…

Wednesday, June 29, 2011 1:23:08 AM (Pacific Daylight Time, UTC-07:00)
Daniel, could you take some time to comment on the difficulty of parallelizing Strassen's algorithm (http://en.wikipedia.org/wiki/Strassen_algorithm) for matrix multiplication?

While multiplication with three nested for loops is the easiest, it is not the most efficient.
Tanveer Badar
Tuesday, July 5, 2011 7:58:57 PM (Pacific Daylight Time, UTC-07:00)
Tanveer, if you have really large matrices, then Strassen would be a good option (you'd partition the data on the CPU and make multiple GPU kernel invocations). That is obviously not a Hello World example (the title of this blog post). When we ship bits, I'll be sure to include an example like that... thanks for the idea.