Parallel.For and Silver Bullets
I’m the first to agree that concurrency is a hard problem to solve. I have bad memories involving C, POSIX threads and mutexes that I’d rather just forget. In the end, dealing with threads sucks, we’re all just waiting for the abstraction, right?
There has been some positive progress in .NET land in terms of making concurrency easier. We know for certain that we don’t want to be responsible for creating threads ourselves anymore, so we now have the Task Parallel Library. The TPL also includes things like Parallel.For and Parallel.ForEach for magically converting our old decrepit loops into shiny new parallel ones.
I bet you’re wondering how much faster one of these shiny new parallel loops can count to 1,000,000 aren’t you? Truth be told, I’ve been listening to a dangerous amount of Tina Turner and I’m in a good mood, so I’ll indulge you. Let’s do it the old decrepit way first (so we can compare).
var timer = new Stopwatch();
timer.Start();
var countedTo = 0L;
foreach (var number in Enumerable.Range(1, 1000000))
{
++countedTo;
}
timer.Stop();
Console.WriteLine("Counted to {0} in {1} ms", countedTo, timer.ElapsedMilliseconds);
It took a mere 11 milliseconds to count to 1,000,000.

Let’s create the parallel version. I’m terribly excited – it’s going to absolutely smash 11 ms.
var timer = new Stopwatch();
timer.Start();
var countedTo = 0L;
Parallel.ForEach(Enumerable.Range(1, 1000000), n =>
{
++countedTo;
});
timer.Stop();
Console.WriteLine("Counted to {0} in {1} ms", countedTo, timer.ElapsedMilliseconds);
I’m on the edge of my chair. I didn’t even have to write any thread related code and I’m about to parallelise this beauty into a concurrent masterpiece. Here we go, are you ready?

Holy crap! What the hell! 116 milliseconds, are you kidding? 959,028? It hasn’t even counted to the end, there must be a bug in the framework.
*Deep Breath*, *Deep Breath*
There’s got to be a logical explanation. It couldn’t have crashed before counting to the end, or it would have never reached the Console.WriteLine, right? countedTo holds a value less than 1,000,000, which must mean that the statement ++countedTo was only successfully executed 959,028 times. It also took 10 times longer to run, which must mean it was doing 10 times more work.
Explanation
So, it turns out Parallel.ForEach isn’t a silver bullet, and was never meant to be. Parallel.ForEach is not an abstraction, it’s simply a way of making what is already possible easier. That in itself doesn’t remove the other complexities surrounding concurrency.
Let’s try and make sense of why we have those results.
First off, the count is way off. The reason for this is quite simple once you’re aware of it. A statement such as ++var is not atomic. An executing thread isn’t running your C# code, it’s running the machine instructions produced by your C# code. In this example, ++var is at least three separate machine instructions, which can be interrupted at any point. The statement looks something like this to the CPU:
mov eax, dword ptr [countedTo] add eax, 1 mov dword ptr [countedTo], eax
Given that there’s only one version of eax, consider what will happen if the first thread only manages to execute the first instruction before being interrupted by the second thread which completes all three. If you’re thinking that upon the return of the first thread, the change made by the second thread will be destroyed, you’d be right. Code running inside a thread must be thread safe, irrespective of which fancy library you used to create the thread.
That explains why the count was way off, but why did it take 10 times longer? Wasn’t this thing in parallel damnit? It’s important to remember that parallel doesn’t imply quicker, though in lots of cases it’s a vehicle for making an algorithm execute faster. Not everything can be parallelised, and even things that can be sometimes won’t benefit from the increased overhead of a parallel solution.
The problem here (besides the concurrency issue) is also one of overhead. The C# compiler is firstly having to create a closure over the countedTo variable in order to allow instances of the thread delegate to access it. Additionally, instead of having just three instructions to increment the count, we have all the additional instructions for managing the thread. In the end, it turns out it is actually doing 10 times more work.
Solution
The solution is to synchronise the executing threads, making sure that only one thread can execute the non-thread safe code. Typically we’ll use a form of locking for this, and in fact the .NET framework allows you to use the lock keyword for this very purpose. Here’s how you’d correct the previous example:
var timer = new Stopwatch();
timer.Start();
var countedTo = 0L;
var lockObject = new object();
Parallel.ForEach(Enumerable.Range(1, 1000000), n =>
{
lock (lockObject)
{
++countedTo;
}
});
timer.Stop();
Console.WriteLine("Counted to {0} in {1} ms", countedTo, timer.ElapsedMilliseconds);
We’re now using an instance of System.Object to lock on, ensuring that only one thread can execute the increment at any one time. Of course, this now creates even more overhead (for such a simple operation to start with) causing it to take even longer:

Conclusion
Parallel.For and Parallel.ForEach might look like silver bullets that will magically allow you to speed up a loop by making it execute in parallel, but this isn’t the case at all. Parallel doesn’t automatically mean performance improvement. It largely depends on whether the task is inherently parallelisable and the overhead involved in achieving it.
You can’t simply change a foreach loop into a Parallel.ForEach loop without thinking about concurrency issues. Neither of these methods will make your code thread safe, only you can do that.
They weren’t kidding, concurrency isn’t a simple problem to solve.


Instead of doing full locking, use the Interlocked functions if possible – thread synchronization not only is hard to get right, it can kill performance.
Btw, Herb Sutter has run a pretty nice column on concurrency related stuff, the last blog post in the series (http://herbsutter.com/2010/07/12/) has links to all the rest.
@f0dder Good advice. Atomicity is more efficiently achieved through an Interlocked* call than by locking, if you have that option.
Thanks for the link too, that certainly looks like interesting reading :)