Why (concurrency != concurrency) is actually true: Part 1
Everyone talks about concurrency nowadays. …and sadly, everyone knows how to speed up each and every poor sequential algorithm by throwing multiple threads at the problem. - Right?
This series of posts talks about some of the most common mistakes programmers make when they try to implement a concurrent system - and believe me when I tell you, there are a lot of senior code monkeys out there who are not aware of the issues covered here. All my examples are based on wannabe-concurrent code I’ve actually seen out in the wild; I’ve just simplified them to give prominence to the actual problem. So let’s get into it!
Y U no do what I told ya?!
If you’ve implemented a concurrent system in the past you probably know the “Hey, I didn’t tell you to… oh wait.” syndrome. No? Ok, let me explain.
Consider the following C++ code.
int a = 0, b = 0;

void func()
{
    a = b + 1;
    b = 0;
}
Now that looks pretty straightforward, doesn’t it? We initialize the globals a and b with 0 and, assuming someone calls func, we read b, add 1, write the result to a and then write 0 to b. Right? … No!
The very cool thing about high level languages like C++ is that we don’t have to write assembly. We don’t have to tell the CPU, instruction by instruction, what it has to do in order to produce the output we desire. Every expression or statement written in a high level language is usually translated into multiple CPU instructions. A C++ statement usually compiles to somewhere around 3-15 instructions; a Python statement, for comparison, lands somewhere between 300 and 1000 instructions. So let’s take a look at the unoptimized compiler output of the example above.
a:
        .zero 4
b:
        .zero 4
func():
        pushq   %rbp
        movq    %rsp, %rbp
        movl    b(%rip), %eax
        addl    $1, %eax
        movl    %eax, a(%rip)
        movl    $0, b(%rip)
        popq    %rbp
        ret
Ha! Just as you would expect. a and b are statically initialized to 0. pushq %rbp and movq %rsp, %rbp basically set up the stack frame; we can ignore them for now since we don’t actually create anything on the stack. movl b(%rip), %eax moves b into an accumulator register and addl $1, %eax adds 1 to the contents of eax. So far so good. movl %eax, a(%rip) then writes the contents of the accumulator register to a. Next, movl $0, b(%rip) writes 0 to b. The last two lines tear the stack frame back down; we can again ignore them for now.
Alright, this seems pretty much like what we would expect the compiler output to be. Right? Right. However, nobody actually
sends out debug or unoptimized builds to customers. Right? Right. So let’s now have a look at what the compiler does when
we enable the full set of optimizations.
func():
        movl    b(%rip), %eax
        movl    $0, b(%rip)
        addl    $1, %eax
        movl    %eax, a(%rip)
        ret
b:
        .zero 4
a:
        .zero 4
“Whoops! That’s so totally not what I coded.” - “…well bro, it sort of is, actually.” Let’s analyze the compiler’s output once again.
The first thing we notice is that the compiler removed everything related to the stack, which is cool because we didn’t really need those instructions anyway. When entering func we first load b into the accumulator - basically the same code so far. However, the next instruction, movl $0, b(%rip), writes 0 into b! “But… but… butt, wait. You didn’t add one yet :(”
To understand why the compiler has reordered this instruction we need to take a look at how (main) memory interactions are actually handled by modern CPUs.
A CPU can do memory reads and writes more or less asynchronously, meaning that an instruction can trigger a load from main memory into a register and the CPU will fetch the memory content in the background. This way it can continue executing instructions until the requested data is actually needed.
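To make that concrete, here is a tiny sketch (hideLatency and its parameters are made-up names, purely for illustration) of how independent work can hide load latency:

// Hypothetical illustration: independent work can hide load latency.
int hideLatency(const int* p, int a, int b)
{
    int x = *p;       // the load from memory is issued here...
    int y = a + b;    // ...independent arithmetic can run while it is in flight
    return x + y;     // first real use of x: only here must the CPU wait for the load
}

The more independent instructions the compiler can schedule between a load and its first use, the less time the CPU spends stalled.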
When we load b into the accumulator we actually request a portion of memory to be loaded into a CPU register. The CPU will see the instruction and instruct the memory controller to start fetching the memory. While the memory controller is fetching some bits and bytes from main memory, the actual processing unit can keep executing until the data is needed. If, however, the next instruction already requires the data to sit ready in the accumulator, the CPU has to wait until everything has been fetched. A waiting CPU is sad, very sad. In fact, it will start to cry. Violently. So better don’t f*ck with your CPU. - Anyway, back to the example.
After writing 0 to b the processor adds 1 to the accumulator content and then writes the accumulator to a.
When working with concurrent systems it is extremely important to keep in mind that the compiler (and the CPU too, more on that later) can reorder instructions however it likes, as long as the observable result stays the same. Crucially, “observable” here means observable from within the current thread - other threads peeking at our memory at any point in time do not count.
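As a minimal sketch of how to take that freedom away (assuming C++11 std::atomic; this is not the only way), declaring the globals as atomics forbids moving the two stores past each other:

#include <atomic>

std::atomic<int> a{0}, b{0};

void func()
{
    // With the default memory order (sequentially consistent), neither the
    // compiler nor the CPU may move the store to b ahead of the store to a.
    a = b + 1;
    b = 0;
}

Of course, atomics aren’t free - we only reach for them when other threads are actually watching.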
“Alright. Cool. Now I know, the compiler optimizes memory writes. So what?” - Assume the following pseudo-concurrent code.
int* someData = nullptr;
bool preprocessingDone = false;

void enqueueThread(int* aweSomeData) {
    someData = aweSomeData;
    preprocessingDone = true;
}

int workerThread() {
    while(!preprocessingDone);

    return *someData;
}
And the compiler-optimized output. (The listing apparently comes from a variant where the allocation happens inside the function - something like someData = new int(5) - which is what the call to operator new below corresponds to.)
workerThread():
        cmpb    $0, preprocessingDone(%rip)
        jne     .L2
.L5:
        jmp     .L5
.L2:
        movq    someData(%rip), %rax
        movl    (%rax), %eax
        ret
enqueueThread():
        subq    $8, %rsp
        movl    $4, %edi
        call    operator new(unsigned long)
        movb    $1, preprocessingDone(%rip)
        movl    $5, (%rax)
        movq    %rax, someData(%rip)
        addq    $8, %rsp
        ret
someData:
        .zero 8
preprocessingDone:
        .zero 1
Ok, let’s disassemble this assembly (Haha, ha, hahaha, get the pun? Disassemble the assembly. #lol #yolo).
Let’s start with workerThread. The first instruction checks whether preprocessingDone has been set already or not. If it has, the CPU jumps to .L2, dereferences someData and returns the dereferenced value according to the standard calling convention. If it hasn’t, the CPU runs into the busy-loop at .L5 - and notice that this loop never reads the flag again: since a data race on a plain bool is undefined behavior anyway, the compiler hoisted the check right out of the loop. Looks rock solid so far, doesn’t it?
The real problem lies in enqueueThread. Let’s ignore setting up the stack again and go straight to call operator new(unsigned long). This is where we allocate memory.
In C++ a statement like a = new B(c) does multiple things: it allocates memory, initializes this memory (by executing a constructor, for example) and assigns the pointer to the allocated memory to a variable. As we’ve already learned, the compiler is free to reorder instructions as long as the result is the same. So all we can be certain about is that when we actually use a later on in the same thread, all three steps mentioned above have been executed. The order in which they are executed can be more or less arbitrary. (Of course, the memory has to be allocated before we can write anything into it.)
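Spelled out as code, the decomposition looks roughly like this (a conceptual sketch - B and decomposedNew are made-up names, and the real lowering also involves exception handling):

#include <new>  // for placement (in-place) new

struct B { int value; explicit B(int c) : value(c) {} };

B* decomposedNew(int c)  // conceptually: a = new B(c);
{
    void* raw = operator new(sizeof(B));  // step 1: allocate raw memory
    B* tmp = new (raw) B(c);              // step 2: construct B in that memory
    return tmp;                           // step 3: the caller assigns the pointer to a
}
// Within a single thread the compiler may overlap or reorder parts of these
// steps, as long as that thread cannot tell the difference.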
Back to the example. Since memory I/O takes a few cycles to complete, the compiler gives the CPU some time to do so by pulling up instructions that don’t depend on the allocated memory - which is why the next instruction is movb $1, preprocessingDone(%rip).
Whoops. We didn’t set someData yet.
All we have is uninitialized memory, and we’re already setting preprocessingDone to true. The two following instructions then do the actual memory initialization and pointer assignment. - What does this mean?
Assume both threads are executed concurrently, and enqueueThread gets some CPU time from the scheduler. The thread gets interrupted just before movl $5, (%rax). So we have allocated memory which hasn’t been initialized - yet preprocessingDone has already been set. The OS now decides to give workerThread some CPU time. The CPU will exit the loop because our flag has already been set by the other thread. As soon as we try to dereference someData we crash, because enqueueThread hasn’t gotten to the point where it makes someData point to the allocated memory (the pointer is still null). BOOM!! I can tell you from experience that this sort of bug is extremely hard to track down, because it only happens if the instructions are executed in this exact pattern - which might not always be the case, since CPU, OS and scheduler might be in completely different states every time you execute your code.
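For reference, here is a minimal sketch of the usual fix (assuming C++11 or later): publish the pointer through an atomic flag with release/acquire ordering, which forbids exactly the reordering we just watched happen.

#include <atomic>

int* someData = nullptr;
std::atomic<bool> preprocessingDone{false};

void enqueueThread(int* aweSomeData) {
    someData = aweSomeData;  // this plain store...
    // ...is guaranteed to be visible before the flag, thanks to the release store:
    preprocessingDone.store(true, std::memory_order_release);
}

int workerThread() {
    // The acquire load pairs with the release store above; it is also
    // re-read on every iteration, so the loop can actually terminate.
    while (!preprocessingDone.load(std::memory_order_acquire))
        ;
    return *someData;
}

Whether you use release/acquire or the stronger (and simpler) default of sequential consistency is a trade-off between performance and headroom for mistakes.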
Alright, so far we have covered a simple synchronization issue. Most programmers who have looked into multi-threading before probably knew this already. However, I still find myself looking at code which works completely fine until an updated compiler suddenly decides to reorder some writes here and some reads there - and suddenly all hell breaks loose because everything breaks… occasionally.
The next post in this series will be about memory access patterns and how to avoid unintentionally sequentializing your threads through careless memory access.