Adding Concurrency Optimization in Silverlight 3

by EFVincent on 08/18/2009

About a week ago Joa Ebert posted Flirting with Silverlight, in which he discussed his experiences with a cool little Lorenz attractor application. The strange attractor app (which you can check out on his blog) was his way of playing around with the newly released Silverlight 3 (props to Joa for some very cool work, BTW). It calculates a 3D shape defined by some 300,000 particles and animates the spinning and rotating of that shape as the user moves the mouse across the display surface.

His point was to relate his first Silverlight experience, and how the non-optimized Silverlight application compared very favorably with the significantly optimized AS3 version. The HTML5 JavaScript version was a non-starter, barely able to achieve a few frames per second compared to 30 for AS3 and SL3.

Joa made his code available. After looking at the SL3 source code I thought it would be trivial to add some basic concurrency that would significantly increase performance; the frame-rate comparison between the versions appears further down in this post.

A point of interest: as far as I know, there is no way to do concurrency in ActionScript 3. In Joa's application, the bulk of the hard work is done in a loop that executes every frame:

void OnStoryboardCompleted(object sender, EventArgs eventArgs)
{
    targetX += (mouseX - targetX) * 0.1;
    targetY += (mouseY - targetY) * 0.1;

    var index = MaxScreen;
    var maxIndex = MaxScreen;

    while (--index > -1)
        bitmap.Pixels[index] = 0x000000;

    var particle = particles;
    var matrix = Matrix4x4.RotationY(targetX * 0.05) * Matrix4x4.RotationX(targetY * 0.05) * translationMatrix;

    var cx = 275.0;
    var cy = 200.0;

    var w = 0.0;
    var xi = 0;
    var yi = 0;
    var x = 0.0;
    var y = 0.0;
    var z = 0.0;
    var pz = 0.0;

    while (null != particle)
    {
        x = particle.X;
        y = particle.Y;
        z = particle.Z;

        pz = focalLength + x * matrix.I02 + y * matrix.I12 + z * matrix.I22 + matrix.I32;

        if (0.0 < pz)
        {
            xi = (int)( ( w = focalLength / pz ) * ( x * matrix.I00 + y * matrix.I10 + z * matrix.I20 ) + cx );
            yi = (int)( w * ( x * matrix.I01 + y * matrix.I11 + z * matrix.I21 ) + cy );

            index = xi + yi * ScreenWidth;

            if (index > -1 && index < maxIndex)
            {
                var color = bitmap.Pixels[index] + 0x202020;
                bitmap.Pixels[index] = color > 0xffffff ? 0xffffff : color;
            }
        }

        particle = particle.Next;
    }

    bitmap.Invalidate();

    storyboard.Begin();
}

What he’s doing here is writing pixels into a bitmap. The first while loop sets every pixel to zero (black), erasing the previous frame. The second loop walks the particle list and calculates each pixel that should be turned on for the upcoming frame. In both cases he’s dealing with a one-dimensional array of pixels. This is the perfect case for data parallelism: the work of setting pixels can be divided between processors.

In general, concurrency is not a simple technique to implement correctly. The number and type of problems you can run into are way beyond the scope of anything short of a very thick book. But this is perhaps the simplest case: simple, static-decomposition data parallelism. It’s static because you can pre-assign the amount of work each thread will do. It’s data parallelism because each thread will be doing the same kind of work (calculating and flipping pixels), as opposed to a solution where one thread does one thing (like talking on the network) while another thread does something else (like shredding XML files).

We get this done by building a function that behaves like a for loop but executes the body in parallel across several threads. If you’ve played with the recently released Visual Studio 2010 beta, which includes the .NET 4.0 framework, you’ll recognize this function; the next version of the framework ships it as Parallel.For (in a far more sophisticated and powerful form).

void ParallelFor(int lo, int hi, Action<int> body, int p)
{
    int chunk = (hi - lo) / p;      // Iterations per thread

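    // One latch per worker thread; each worker signals its latch when its chunk is done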
    AutoResetEvent[] latch = new AutoResetEvent[p];
    for (int i = 0; i < p; i++) latch[i] = new AutoResetEvent(false);

    // Schedule the events to run in parallel
    for (int i = 0; i < p; i++)
    {
        ThreadPool.QueueUserWorkItem((ob) =>
        {
            int pid = (int)ob;
            int start = lo + pid * chunk;
            int end = pid == p - 1 ? hi : start + chunk;
            for (int j = start; j < end; j++)
            {
                body(j);
            }
            latch[pid].Set();
        }, i);
    }
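    // Block the calling thread until every worker has signaled its latch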
    WaitHandle.WaitAll(latch);
}
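
The AutoResetEvents act as a simple latch: the calling thread queues p work items on the ThreadPool, each worker signals its own event when its chunk is finished, and WaitHandle.WaitAll() blocks the caller until all of them have fired. One cost of this approach, which comes up in the comments below, is that a fresh set of events is created and thrown away every frame, adding garbage-collection pressure.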

ParallelFor() takes lo and hi values, an Action<int>, and the desired degree of parallelism. The Action<int> is a delegate, which is like a pointer to a function; in this case, a pointer to a function that takes a single int parameter and returns void. This function is the work that happens on each iteration. Silverlight, .NET, and C# make it possible (I think it’s easy, but I’ve been doing this a while) to use anonymous methods and what amount to functional techniques to make Joa’s code multithreaded. It may be best to illustrate by example.

while (--index > -1)
    bitmap.Pixels[index] = 0x000000;

Joa’s code above, which clears the pixels in the bitmap, becomes:

ParallelFor(0, buffer.Pixels.Length, (i) => buffer.Pixels[i] = 0x0, numComputationThreads);

The parameters of ParallelFor are zero for where to start the loop, buffer.Pixels.Length for where to end it, the lambda function (i) => buffer.Pixels[i] = 0x0, and the number of threads to use for the work. The lambda function is anonymous (i.e., we don’t have to create a function with a name). The type of i is inferred from context to be an int. The body of the lambda (the part after the “=>”) is the work that is done each time through the loop.
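
If the lambda syntax is unfamiliar, here’s a minimal sketch (using the buffer and numComputationThreads fields from the post; ClearPixel is just an illustrative name) showing that the same work can be written as an ordinary loop, as an Action<int> pointing at a named method, or as an inline lambda:

// The plain, single-threaded version of the pixel-clearing loop.
for (int i = 0; i < buffer.Pixels.Length; i++)
    buffer.Pixels[i] = 0x0;

// Any method on the class that takes an int and returns void matches Action<int>...
void ClearPixel(int i)
{
    buffer.Pixels[i] = 0x0;
}

// ...so ParallelFor can be handed either a named method or an inline lambda.
ParallelFor(0, buffer.Pixels.Length, ClearPixel, numComputationThreads);
ParallelFor(0, buffer.Pixels.Length, i => buffer.Pixels[i] = 0x0, numComputationThreads);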

This makes the clearing of last frame’s bitmap concurrent. To make the rendering of the next frame concurrent is a bit more invasive, but still not bad. Here’s the concurrent version.

ParallelFor(0, particles.Length, (i) =>
{
    Particle part = particles[i];
    var x = part.X;
    var y = part.Y;
    var z = part.Z;

    var w = 0.0;
    var xi = 0;
    var yi = 0;

    var pz = focalLength + x * matrix.I02 + y * matrix.I12 + z * matrix.I22 + matrix.I32;

    if (0.0 < pz)
    {
        xi = (int)((w = focalLength / pz) * (x * matrix.I00 + y * matrix.I10 + z * matrix.I20) + cx);
        yi = (int)(w * (x * matrix.I01 + y * matrix.I11 + z * matrix.I21) + cy);

        var index = xi + yi * ScreenWidth;

        if (index > -1 && index < maxIndex)
        {
            var color = buffer.Pixels[index] + 0x202020;
            buffer.Pixels[index] = color > 0xffffff ? 0xffffff : color;
        }
    }

}, numComputationThreads);

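// Push the updated pixels to the screen and record when this frame finished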
buffer.Invalidate();
lastRender = elapsed;

}

There are a few changes I made to make the optimization a bit easier. Joa used a linked list of particles; I captured an array instead, just because it was a bit easier to make concurrent. But the significant difference is that the body of Joa’s loop is now in an anonymous function. Values that change in the body of the loop (x, y, z, and so on) are now declared inside the lambda. If the loop used variables defined outside the lambda, the different threads would all be reading and writing the same variables, which is obviously a problem; a sketch of that hazard follows. Other than that, the code is pretty much the same.
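
To see why those per-iteration locals matter, here is a minimal sketch of the hazard (Project() is a hypothetical stand-in for the projection math; particles and numComputationThreads are the fields from the post):

// BROKEN: x, y, and z live outside the lambda, so every worker thread shares
// the same three variables through the closure.
double x = 0, y = 0, z = 0;
ParallelFor(0, particles.Length, i =>
{
    x = particles[i].X;      // thread A writes x...
    y = particles[i].Y;      // ...but thread B may overwrite it before A reads it back
    z = particles[i].Z;
    Project(x, y, z);        // can end up projecting a mix of two different particles
}, numComputationThreads);

// SAFE: each invocation of the lambda gets its own locals, so nothing is shared.
ParallelFor(0, particles.Length, i =>
{
    var px = particles[i].X;
    var py = particles[i].Y;
    var pz = particles[i].Z;
    Project(px, py, pz);
}, numComputationThreads);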

So how much of an improvement is there? Yeah, a lot. I put a frame counter on Joa’s original code, and on my MacBook Pro (dual-core 2.4 GHz, 4 GB RAM) running the Windows 7 release it was getting about 30 frames per second. On OS X it was about the same. With the concurrency, set to use 4 threads, it got up to 45 fps. On my quad-core Dell M6400 (Core 2 Extreme Q9300 at 2.54 GHz, 8 GB RAM) it went up to 65 fps. I found it was best to use more threads than you have cores: 4 threads was good on the dual core, 8 on the quad core.

Of course, this means the app will absolutely peg the processor, which is good if you want maximum speed. But you can also throttle the frame rate back, say hold it at 30, and use less CPU to get the same fps as the original; a sketch of one way to do that follows.
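
One way to do that is to skip the heavy work when a frame arrives too early. This is just a sketch, not the code in the attached project; RenderFrame() stands in for the concurrent clear-and-plot work shown above, and the timing fields are illustrative:

// Hypothetical throttle: aim for roughly 30 fps by skipping frames that arrive early.
private DateTime lastFrame = DateTime.MinValue;
private const double TargetFrameMs = 1000.0 / 30.0;

void OnStoryboardCompleted(object sender, EventArgs e)
{
    var now = DateTime.Now;
    if ((now - lastFrame).TotalMilliseconds < TargetFrameMs)
    {
        storyboard.Begin();   // too early; just restart the storyboard and wait
        return;
    }
    lastFrame = now;

    RenderFrame();            // the concurrent clear-and-plot work
    storyboard.Begin();
}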

Silverlight 3 has caused a bit of a splash in the RIA community, if the Twitter chatter is any indication. There is some venomous talk out there; you know: such-and-such tech sucks, such-and-such tech rules, and of course Microsoft is evil. I choose to stay out of that fray; perhaps it’s hitting 40 that did it. I don’t care to argue with fanbois. I will say that the community will discover that the .NET framework and C# (even VB.NET) make for a serious development platform that will benefit the community at large. It’s a very good time to be a developer!

Here’s a link to a zip file containing my version of the project: Strange Attractor Project

Would you like to see the concepts in this post expanded? Want to know more about concurrency, functional patterns, generics? Leave me a note in the comments or send me a Twitter reply at @efvincent. Oh, and thanks for reading!

15 comments

Joa Ebert 08/18/2009 at 10:32 am

Great post and thank you for sharing. I can only agree to your last paragraphs. The one thing I really like about this little “experiment” is:
- You can achieve great results even if you are new to the platform.
- You can actually use your knowledge to improve things.

Kaspar Lüthi 08/18/2009 at 5:04 pm

The strange thing about this strange attractor is, the concurrent Silverlight version gives me a meager 5fps on my home PC. Changing the threads did not change much. In the office the same version was about 15% faster than the AS3 version. At home the AS3 version runs with about 18fps.

But even stranger, the “naive” Silverlight implementation runs very quick at home, feels like more than 25fps.

CPU at home is Pentium D, which seems to be a dual core processor. Maybe Silverlight uses the GPU more than we could possibly hope for? Graphics card is NVIDIA GeForce 9800, which maybe has no (supported) GPU.

Kaspar Lüthi 08/18/2009 at 7:04 pm

Update: now when checking again, with a fresh Firefox 3.5.2, much of the strangeness has gone. Now I have 22fps on the concurrent version. Still unclear, why only the concurrent version was affected by the undefined problem but the “naive” Silverlight version ran just fine.

EFVincent 08/18/2009 at 7:36 pm

Kaspar – Thanks for the comment… That is interesting behavior, especially the bit about the original Silverlight implementation running fastest. It could be that my algorithm needs improving. I wasn’t happy with the way the AutoResetEvents are being created and destroyed in the render loop; I just didn’t bother coming up with a better reuse strategy for them (I only spent an hour or so on the updated version). This could be putting more garbage-collection pressure on your Pentium D box than it can happily accept. As for the GPU, the original and updated Silverlight versions would both benefit the same amount from GPU processing, so that can’t be it.

I’ll see if I can’t optimize the AutoResetEvent object creation situation, and also add the ability to adjust the number of particles. Could be there’s a threshold where the number of particles is too much to calculate. It’s at something like 307,000 right now.

EFVincent 08/18/2009 at 7:40 pm

Well, that points at the browser having something to do with it. The interesting thing about the Silverlight plugin (and this goes for the Flash plugin as well) is that in addition to living in the browser, it has to share process space and other services of the browser. I can’t speak to this in intimate detail, but the plugin is like a parasite: it lives off the browser. It could be that there was something about that version of Firefox that caused one version of the SL3 application to run less efficiently. Of course, now I’m into pure speculation.

OJ 08/19/2009 at 5:37 am

It’s not a surprise that you’ve slowed down the application given the number of work items you’re queuing and the number of events you’re waiting on. If you’ve got a resolution of 300 x 300 (which you don’t, you have more) that’s 90,000 work items. The context switching and waiting would send performance through the floor.

You’d be better off dividing your pixel set by the number of physical hardware threads you have on your machine, and getting each one of those threads to work on a subset of the info. You don’t have to worry about contention if you force each work item to deal with its own stuff. Your overheads would drop substantially and you’d probably see at least some kind of performance improvement.

I wouldn’t blame the browser or the plugin. Too much multithreading is often worse than none!

Just my $0.02 :) Good luck!
OJ

Jerry 08/19/2009 at 8:24 am

Great post Eric!

Fun examples with great commentary and working source are always good in my book :-)

Silverlight, Concurrency, Functional Programming, Generics – all cool…

EFVincent 08/19/2009 at 9:18 am

OJ –

I’m not sure you looked at the code carefully enough; if there were 90,000 work items, we’d be talking about frames per minute or hour, not frames per second!

If you review the ParallelFor() method you’ll see that the pixels are indeed being divided between the threads, and there are the same number of AutoResetEvents as the number of threads.

The number of threads is dictated by the last parameter of the ParallelFor() function, which is driven by the value of the slider. The best performance comes from roughly 2x the number of cores, or 4 threads on a typical dual core box, and 8 from a quad core box.

Setting the number of threads to exactly the number of cores doesn’t produce the best possible frame rate. Slightly higher is better; the renderer then achieves the best balance between favorable scheduling of its threads and the penalty of context switching.

The slider allows you to split the work up into as many as 64 threads, which does indeed bring the frame rate down. The point is to show the effect of attempting to use too high a degree of parallelism.

With regard to blaming the browser – Kaspar experienced significantly different behavior running the same component in his particular environment with two different browsers. The browser version was the only (apparent) variable between the two sets of behavior, which makes it an excellent suspect when investigating the difference. It would be prudent for Silverlight (and Flash) developers to understand that there is a relationship between the plugin and the browser, and what the potential impact of that relationship is.

Thanks for the comment!
-e

OJ 08/19/2009 at 6:55 pm

Aloha!

You’re right, I didn’t read the code thoroughly :) Nor did I read the previous comments thoroughly. That’ll teach me eh? It appears you are doing everything that I had mentioned, and your comments re: the browser are indeed appropriate.

Next time, I promise to make sure I read things properly before stating stuff that’s already been done!

Cheers!
OJ

Kaspar Lüthi 08/20/2009 at 5:06 am

“The (SL3) plugin is like a parasite, it lives off the browser” – and off the Flash community…just kidding :)

Back with some news. Accidentally I discovered how to reproduce the slowdown. It’s when I have certain Adobe AIR applications running with visible windows that the concurrent SL3 version significantly slows down. When the AIR app is minimized or terminated, speed is back to normal. The “naive” SL3 version seems to be always OK.

You can try it yourself; one of the AIR apps slowing things down is “thwirl”, a Twitter client – try multiple open account windows. The Times Reader also slowed it down. Another app, Balsamiq, did not slow it down. It probably depends on how the app is authored.

I have AIR 1.5.2 installed. Interestingly the slowdown affects both FF and IE, so this is not a browser issue.

EFVincent 08/20/2009 at 8:36 am

Kaspar – interesting that running certain AIR apps causes the concurrent version to slow down. It would be interesting to profile your system while those other apps are running. What demands are on the CPU, memory, graphics subsystem, etc. at that time? Are any other applications affected?

Something like the strange attractor application is going to be particularly sensitive to CPU contention. Not many apps are so completely CPU-bound. Things like 3D renderers, programs that encode music or video, and operations like HDR Merge in Photoshop are sensitive to CPU contention in ways you might not notice while surfing the web or writing email.

Chad 08/22/2009 at 11:53 pm

Amazing article and a very interesting read.

Kaspar Lüthi 08/23/2009 at 2:22 pm

When having “twirl” open with 4 windows, the CPU is used between 1-3%. As I said, just try it yourself.

In this SL3 Quake demo, the frame rate is not affected by AIR apps: http://www.innoveware.com/ql3/QuakeLight.html

Ole Jak 10/05/2009 at 3:57 pm

Great! But it seems to me that Alchemy is FASTER, or at least the same speed))) http://www.unitzeroone.com/labs/alchemyPushingPixels/

EFVincent 12/10/2009 at 2:51 pm

Ole Jak – On my quad-core Dell, the Alchemy link you forwarded gets to about 31 FPS. If I set the Silverlight multithreaded version to one thread, it also achieves about 31 FPS. However, setting the SL version to 8 threads on a quad-core box yields about 65 FPS. Alchemy does seem to speed up rendering, but it still isn’t doing parallel calculations across multiple cores, which is what this experiment is about.

Even on my MacBook Pro, a dual core box, setting the Silverlight version to 4 threads yields about 45 FPS. Clearly the multithreading code has a very significant impact on computationally intensive applications where data parallelism is possible.

Sidebar: Since the original posting I’m now running SL4 Beta 1, and performance of this application is consistent with SL3 release.

