Everyone talks about concurrency nowadays. …and sadly, everyone knows how to speed up each and every poor sequential algorithm by throwing multiple threads at the problem. - Right?

This series of posts talks about some of the most common mistakes made when programmers try to implement a concurrent system - and believe me when I tell you, there are a lot of senior code monkeys out there who are not aware of the issues covered here. All my examples are based on wannabe-concurrent code I’ve actually seen out in the wild, I just simplified them to give prominence to the actual problem. So let’s get into it!

Y U no do what I told ya?!

If you’ve implemented a concurrent system in the past you probably know the “Hey, I didn’t tell you to… oh wait.” syndrome. No? Ok, let me explain.
Consider the following C++ code.

1 int a = 0, b = 0;
3 void func()
4 {
5     a = b + 1;
6     b = 0;
7 }

Now that looks pretty straight forward, doesn’t it? We initialize the globals a and b with 0 and assuming someone calls func we add 1 to b, write the result to a and then write 0 to b. Right? … No!
The very cool thing about high level languages like C++ is that we don’t have to write assembler. We don’t have to tell the CPU instruction by instruction what it has to do in order to produce the output we desire. Every expression or statement written in a high level language is usually translated into multiple CPU instructions. C++ usually compiles to somewhere around 3-15 instructions per statement. A Python statement, for example, is somewhere between 300 and 1000 instructions. - So let’s take a look at the unoptimized compiler output of the example above.

 1 a:
 2 	.zero	4
 3 b:
 4 	.zero	4
 5 func():
 6 	pushq	%rbp
 7 	movq	%rsp, %rbp
 8 	movl	b(%rip), %eax
 9 	addl	$1, %eax
10 	movl	%eax, a(%rip)
11 	movl	$0, b(%rip)
12 	popq	%rbp
13 	ret

Ha! Just as you would expect. a and b are statically initialized to 0. pushq %rbp and movq %rsp, %rbp basically set up the stack. We can ignore them for now since we don’t actually create anything on the stack.
movl b, %eax moves b into an accumulator register and addl $1, %eax adds 1 to the contents of eax. So far so good. movl %eax, a then writes the contents of the accumulator register to b. Next, movl $0, b writes 0 to b. The last two lines revert the stack pointer; we can again ignore them for now.
Alright, this seems pretty much like what we would expect the compiler output to be. Right? Right. However, nobody actually sends out debug or unoptimized builds to customers. Right? Right. So let’s now have a look at what the compiler does when we enable the full set of optimizations.

 1 func():
 2 	movl	b(%rip), %eax
 3 	movl	$0, b(%rip)
 4 	addl	$1, %eax
 5 	movl	%eax, a(%rip)
 6 	ret
 7 b:
 8 	.zero	4
 9 a:
10 	.zero	4

“Whoops! That’s so totally not what I coded.” - “…well bro, it sort of is, actually.” Let’s analyze the compiler’s output oce again.
The first thing we notice is that the compiler actually removed everything related to the stack which is cool because we didn’t really need those instructions anyway. When entering func we first load b into the accumulator. - We basically have the same code so far. - However, the next instruction, movl $0, b, writes 0 into b! “But.. but… butt, wait. You didn’t add one yet :(“

To understand why the compiler has reordered this instruction we will need to take a look at how (main) memory interactions are actually handled by modern CPUs.
A CPU can do memory reads and writes more or less asynchronously meaning that an instruction can trigger a load from main memory into a register and the CPU will load the memory content in the background. This way it can continue executing instructions until the data requested is actually needed.
When we load b into the accumulator we do actually request a portion of memory to be loaded into a CPU register. The CPU will see the instruction and instruct the memory controller to start fetching the memory. While the memory controller is fetching some bits and bytes from main memory, the actual processing unit can keep executing until the data is needed. If, however, the next instruction already requires the data to sit ready in the accumulator, the CPU has to wait until everything has been fetched. A waiting CPU is sad, very sad. In fact, it will start to cry. Violently.
So better don’t f*ck your CPU. - Anyway, back to the example.

After writing 0 to b the processor adds 1 to the accumulator content and then writes the accumulator to a.

When working with concurrent system it is of extreme importance to keep in mind that the compiler (and the CPU too, more on that later) can reorder instructions however it likes as long as the observable result stays the same. In this case observing does not include other threads looking at us at any point in time.

“Alright. Cool. Now I know, the compiler optimizes memory writes. So what?” - Assume the following pseudo-concurrent code.

 1 int* someData = nullptr;
 2 bool preprocessingDone = false;
 4 void enqueueThread(int* aweSomeData) {
 5   someData = aweSomeData;
 6   preprocessingDone = true;
 7 }
 9 int workerThread() {
10   while(!preprocessingDone);
12   return *someData;
13 }

And the compiler-optimized output.

 1 workerThread():
 2 	cmpb	$0, preprocessingDone(%rip)
 3 	jne	.L2
 4 .L5:
 5 	jmp	.L5
 6 .L2:
 7 	movq	someData(%rip), %rax
 8 	movl	(%rax), %eax
 9 	ret
10 enqueueThread():
11 	subq	$8, %rsp
12 	movl	$4, %edi
13 	call	operator new(unsigned long)
14 	movb	$1, preprocessingDone(%rip)
15 	movl	$5, (%rax)
16 	movq	%rax, someData(%rip)
17 	addq	$8, %rsp
18 	ret
19 someData:
20 	.zero	8
21 preprocessingDone:
22 	.zero	1

Ok, let’s disassemble this assembly (Haha, ha, hahaha, get the pun? Disassemble the assembly. #lol #yolo).
The first instruction checks whether preprocessingDone has been set already or not. If it hasn’t been set, the CPU runs into a busy-loop waiting for preprocessingDone to be set. Once preprocessingDone is set the CPU will fall through to L2, dereference someData and return the dereferenced value according to standard calling conventions. Looks rock solid so far, doesn’t it?
The real problem lies in enqueueThread. Let’s ignore setting up the stack again and go straight to call operator new(ulong). This is where we allocate memory.

In C++ the statement a = new B(c) does multiple things. It allocates memory, initializes this memory (by executing a constructor for example) and it assigns the pointer to the allocated memory to a variable. As we’ve already learned. The compiler is free reorder instruction as long as the result is the same. So all we can be certain about is that when we actually use a later on in the same thread all three steps mentioned above have been executed. The order in which they are executed can be more or less arbitrary. (Of course, we would have to allocate memory before we can write anything into it.)

Back to the example. Since memory I/O takes a few cycles to complete the compiler gives the CPU some time to do so by pulling instructions not depending on the allocated memory up which is the reason the next instruction is movb $1 preprocessingDon.
Whoops. We didn’t set a yet. All we have is uninitialized memory and we’re already setting preprocessingDone to true. The two following instructions then do the actual memory initialization and variable assignment. - What does this mean?

Assume both threads are executed concurrently and now let’s assume enqueueThread gets some CPU time from the scheduler. The CPU gets interrupted just before movl $5, (%rax). So we have allocated memory which hasn’t been initialized. However, preprocessingDone has already been set. The OS now decides to give workerThread some CPU time. The CPU will exit the loop because our flag has already been set by the other thread. As soon as we try to dereference someData we will get an error because enqueueThread hasn’t gotten to the point where it sets a to point to the allocated memory. BOOM!! I can tell from experience that this sort of bug is extremely hard to track down because it only happens if the instructions are executed in this exact pattern. Which might not always be the case since CPU, OS and scheduler might be in completely different states every time you execute your code.

Alright, so far we have covered a simple synchronization issue. Most programmers who looked into multi-threading before probably already knew this. However, I still find myself looking at code which works completely fine until an updated compiler suddenly decides to reorder some writes here and some reads there and suddenly hell breaks loose because all the things break …occasionally.

The next post in this series will be about memory access pattern and how to avoid unintentional sequentialization of threads because of unthoughtful memory access.