Concurrency
Make things confusing fast 🧵
Examples in this workshop will be in Go. Reason: It's rather simple there. C requires pthreads, which is a bit of an arcane library. Python has threads, but they suck greatly (GIL). Other languages like JavaScript are single-threaded by nature (well, there are web workers, but that's embarrassing). Parallel programming in bash would be fun, but you might not share my sense of humor.
We will not talk about GPU programming, which is also a part of parallel programming, but one you seldom have to do yourself.
»The art of distributing work to maximize the use of resources with minimal overhead.«
(while not shooting yourself in the knee by getting confused with the mind-boggling behemoth you created)
Today I won't give you tips on how to optimize the shit out of your parallel programs. Parallelism is a big and rather obvious speed-up on its own, and I don't think I have to prove that programs that use several cores are faster than single-core ones, as long as the task at hand can be split up into independent subtasks. For parallel programs the same principles as in the CPU talk apply - especially true and false sharing. Correct use of the cache is the most important thing here. Today we will rather focus on how to write parallel programs in a safe way.
It really is an art, since there are no easy guidelines.
There are two ways to be comfortable writing parallel code:
Don't. ¹
¹ Unless you really, really need the performance and you proved this by benchmarks.
Just to repeat: Concurrency hurts readability, will almost inevitably cause bugs and eat your hamster. Proceed at your own risk.
If you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
This is especially true for parallel programming, as our brain is really not built to think this way. Our mental horizon is never far away when doing parallel programming.
What's the difference again?
Concurrent = concurrency is the composition of independently executing routines (execution might be interrupted at any time; how can we design our code to allow that?). Parallel = parallelism is the simultaneous execution of (possibly related) computations (several instructions get executed at the same time).
All parallel programs are also concurrent.
Why is the difference important? See here: https://go.dev/blog/waza-talk
How can we achieve parallel execution? We have to use several cores, and an easy way to do that without any real extra work is to use several processes. The OS scheduler will take care of maxing out the core usage.
Threads are scheduled by the kernel just like processes; in that regard the kernel does not really handle them differently. The only real difference is that threads share the same memory.
Coroutines are quite a different concept from the other two (processes and threads), as they are not powered by the operating system and differ wildly between implementations. Coroutines can be seen as concurrent routines, i.e. routines that can be halted and continued at any given time and can be composed to build parallel programs.
In the case of Go, there is a scheduler that is started inside every Go program. It starts a number of threads (see GOMAXPROCS) and schedules the set of goroutines over the set of threads.
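If you want to poke at this yourself, the standard runtime package lets you query and set it:

    fmt.Println(runtime.NumCPU())      // logical cores visible to the process
    fmt.Println(runtime.GOMAXPROCS(0)) // current setting; an argument of 0 only queries
    runtime.GOMAXPROCS(4)              // from now on at most 4 threads run Go code at once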
Note: The diagram is only for a single core. Several cores can of course do the same. There are no coroutines in this picture since the CPU does not see them. Technically it doesn't see the difference between threads and processes either.
How can we actually manage to have a function that can be interrupted? Well, we already know how: context switches! And those happen all the time, even if your program is not multithreaded.
Linux' scheduling is preemptive. This means that a high priority task can be worked on by interrupting a task with lower priority.
Preemption points: The scheduler can interrupt a process at pretty much any point in time. Normally this happens in one of these cases: the time slice is used up (timer interrupt), the task blocks (syscall, I/O, waiting on a lock), or a higher-priority task becomes runnable.
Benchmark of a webserver handling dummy requests:
"Concurrency" is the number of requests per sec thrown at the server, y axis is the actual handled requests.
Just throwing multithreading on a problem makes it complex, but does not necessarily solve it.
Source: https://unixism.net/loti/async_intro.html
More details on the individual benchmarks:
https://unixism.net/2019/04/linux-applications-performance-introduction
Most of the time, when used in the right dose, multithreaded programming can easily speed things up. That's why this part of the workshop focuses more on the safe use of parallel programming than on squeezing every last bit of performance out of it. Multithreading is basically one single optimization and therefore deserves its own chapter.
The contents in this part of the workshop are best applied with the understanding of the CPU and Memory chapters.
As with humans that work on a project in parallel, parallel jobs need to synchronize to be useful. There is a big toolbox to do so.
If you use processes you obviously need to synchronize too sometimes. Potential ways are filesystem locks (flock()) or mutexes/semaphores placed in shared memory.
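A minimal sketch of the file-lock variant, assuming flock() semantics on Linux (the lock file path and error handling are made up):

    f, err := os.OpenFile("/tmp/myapp.lock", os.O_CREATE|os.O_RDWR, 0o644)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Blocks until no other process holds the lock:
    if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
        log.Fatal(err)
    }
    defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)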
Failure to synchronize leads to race conditions and other bugs that are really not fun to find. Debuggers won't help much, and prints might change timings, so deadlocks or race conditions might not always occur.
package main

import (
    "fmt"
    "time"
)

var count int

func inc() {
    for idx := 0; idx < 100000; idx++ {
        // critical section start
        count++
        // critical section end
    }
}

func main() {
    go inc()
    go inc()
    time.Sleep(time.Second)
    fmt.Println(count)
}
Keep critical sections as small as possible - for performance & sanity.
Questions for you:
* Why does this not happen if we reduce the 100000 to e.g. 1000?
* Why is there a sleep in main()?
* What synchronisation primitives do you know?
If you don't mention "sleep" then you're a little dishonest ;-)
Never start a goroutine/thread/process without knowing how it will stop.
The previous slide had a race condition: the program might have exited before the goroutines did anything. In general: you should think about how your goroutines are terminated.
Why? Similar to memory leaks, the phenomenon of goroutine leaks exists. Most of them come from places where people think "Ah, I don't need to close that goroutine".
More background: https://dave.cheney.net/2016/12/22/never-start-a-goroutine-without-knowing-how-it-will-stop
Just kidding. Don't!
Okay, let's be honest. Sometimes you have no other way and we've all done it.
In all seriousness: there are some edge cases where sleep() is the only way to reach some sort of sync state. Especially when we have to wait on some external process that we cannot control.
If you happen to have such an edge case, then do it in a loop:
for sync_condition_not_met() {
    sleep(small_amount_of_time)
}
But often enough it's just a lazy way to wait until something has finished. This turns out to be flaky and often depends on the execution speed of the machine. Which is one reason why flaky unit tests exist.
And yes, I'm very guilty of this myself.
var count int
var mu sync.Mutex

func inc() {
    for idx := 0; idx < 100000; idx++ {
        mu.Lock()
        count++
        mu.Unlock()
    }

    // or better if a complete function is locked:
    // mu.Lock()
    // defer mu.Unlock()
}
Question: Would it still work if I moved the mutex into the function?
Variants:
// buffered channel with 10 items
c1 := make(chan int, 10)
c1 <- 1           // send
fmt.Println(<-c1) // recv

// unbuffered channel:
c2 := make(chan int)
go func() {
    c2 <- 1 // send, blocks until recv
}()
fmt.Println(<-c2)
Might be called a (blocking) queue, pipe or mailbox in other languages. Basically a slice or linked list protected with a mutex (in the case of a buffered channel) or a single data field (in the case of an unbuffered channel).
Channels can be buffered or unbuffered:
Channels can be closed, which can be used as a signal to stop. A send to a closed channel panics. A recv from a closed channel returns the zero value immediately.
A send on a nil channel blocks forever, as does a recv from a nil channel. (Closing a nil channel panics.)
We will see channels later in action.
c1 := make(chan int)     // unbuffered
c2 := make(chan int, 10) // buffered

// A send on c1 would block until another goroutine
// receives from it. On c2 we can send 10 times
// until the same happens.

// Channels are either open or closed.
// A send on a closed channel panics.
// A recv on a closed channel returns the zero value.
close(c1)
_, ok := <-c1 // ok is false.
The good thing about channels: the values you send over them are (shallowly) copied, not referenced! This means that the other thread can safely use them.
// Init the semaphore:
semaphore := make(chan bool, 10)
for i := 0; i < cap(semaphore); i++ {
    semaphore <- true
}

// Limit number of jobs to 10 parallel jobs:
for _, job := range jobs {
    <-semaphore
    go func(job Job) {
        // ... do work here ...
        semaphore <- true
    }(job)
}
A semaphore is a generalization of a mutex (a mutex is also called a binary semaphore).
Very easy way to limit the number of goroutines. Basically a lightweight pool - good for one-off jobs.
Metaphor: A bouncer in front of a club.
It's corona times and he knows that only 10 people are allowed in the club (sad times). He counts up when he lets somebody in and counts down when someone leaves. If the club is full, new visitors have to wait. When somebody leaves, a new person may enter the "critical section" (club).
select {
case <-c1:
    // executed when c1 has
    // incoming data.
case result := <-c2:
    // executed when c2 has
    // incoming data.
default:
    // executed when nothing is
    // ready on either channel.
    // Without a default, select blocks.
}
select exists to multiplex between several channels and to figure out whether we may send to or receive from a channel.
This feature does not exactly exist in most other languages. Outside of Go, condition variables are usually used for this, or something like async/await in languages that have it.
wg := &sync.WaitGroup{}
for idx := 0; idx < 10; idx++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        someJob(1, 2, 3)
    }()
}

// wait here for all jobs to finish:
wg.Wait()
A barrier is basically an inverted semaphore: instead of counting up until you hit a limit (too many jobs at the same time), you count down until you reach zero (all jobs are done). All threads have to arrive at a certain point before any of them can continue.
Alternative names: Wait Groups, Latch.
Question: Would it still be correct if we moved the wg.Add(1) into the goroutine? No! There's a chance that wg.Wait() would not wait at all, because no goroutine has started (and thus called Add) yet.
// Init:
var condition bool
m := sync.Mutex{}
c := sync.NewCond(&m)

// Sender:
c.L.Lock() // c.L == &m
condition = true
c.Broadcast() // or c.Signal() to wake a single goroutine.
c.L.Unlock()

// Receivers:
c.L.Lock()
for !condition {
    c.Wait()
}
// React on the condition change here.
c.L.Unlock()
Probably the most brainfuck-y of the primitives.
When to use:
Channels are a good replacement if you just need to wake up a single goroutine. If you need to wake up many goroutines at the same time (Broadcast()), condition variables are way more efficient.
Context is a pattern that can be used in a similar way (although rather exclusively for cancellation)
func fetchData(url string) <-chan Result {
    ch := make(chan Result, 1)
    go func() {
        // use `url` somehow and do some potentially
        // long running I/O work.
        ch <- Result{...}
    }()
    return ch
}

func main() {
    promise := fetchData("https://example.com")
    // ...do something else...

    // await the result:
    fmt.Println(<-promise)
}
Promises are a good way to make asynchronous code look like synchronous code. A good example is fetching stuff via HTTP. While waiting for the response you can potentially do something else.
You can also chain promises together, i.e. automatically do something once the promise returns - by adding another goroutine, as in the sketch below. This is called promise chaining.
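A minimal sketch of such chaining, reusing the Result type from the fetchData example; then() is a made-up helper, not a standard API:

    // then() chains another processing step onto a
    // channel-based promise with one more goroutine:
    func then(in <-chan Result, f func(Result) Result) <-chan Result {
        out := make(chan Result, 1)
        go func() {
            out <- f(<-in)
        }()
        return out
    }

    // e.g.: compressed := then(fetchData("..."), compress)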
Other languages like Python/JavaScript have first-class support for async/await, which is kind of the same thing under the hood. Goroutines are a more flexible concept however, and it's easy to write libraries that emulate this behaviour (and others have done so).
var n atomic.Uint64

n.Store(val uint64)
n.Load() (curr uint64)
n.Add(delta uint64) (new uint64)
n.Swap(val uint64) (old uint64)
n.CompareAndSwap(old, new uint64) (swapped bool)
Example: code/counter
Atomic: a thing that happens in one go. Either it fails completely and leaves no trace, or it works fully. Some operations can be executed on the CPU atomically, with the guarantee of never being interrupted by another thread, signal or ISR. Those are the operations above.
This works via special instructions that inform the CPU that the operation has to be executed atomically.
If you chain several atomic operations (e.g. Store+Load) they are of course not atomic together!
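A small illustration, using the counter n from above:

    // NOT atomic as a whole, even though each single call is:
    if n.Load() == 0 {
        // ...another goroutine may Store() right here...
        n.Store(42)
    }

    // The same check-then-act as one atomic step:
    n.CompareAndSwap(0, 42)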
type countdown struct {
    isStopped int32
}

func (cd *countdown) Stop() {
    cas := atomic.CompareAndSwapInt32
    if !cas(&cd.isStopped, 0, 1) {
        // cd.isStopped is already 1!
        // Do not run the stop code twice if
        // Stop() is called more than once.
        return
    }

    // Do actual stopping here.
}
The most exotic-looking one is CompareAndSwap, and surprisingly it's also the most important one. It is roughly comparable to this code:
if *n == old {
    *n = new
    return true
}
return false
Its main use is implementing lock-free data structures that notice when a value was modified behind their back.
Additional use: making sure that we don't stop twice. (actual code example in the firmware's ui)
// Pop has to do two things:
// read the next elem and change the queue's head.
// How to do this without locking?
func (q *Queue) Pop() *Elem {
    oldHead := q.head
    for !cas(&q.head, oldHead, oldHead.next) {
        // head was swapped behind our back - reload!
        oldHead = q.head
    }

    return oldHead
}
Let's recall what a Pop() actually does:
(this assumes that the queue is based on a linked list)
Those are two operations and they are not atomic together. If two threads call Pop() at the same time, we might have the issue that one thread overwrites the results of the other.
In a traditional implementation we could use a mutex to protect this. Modern CPUs have CAS instructions, so we can implement it without locks at all.
OS-level contention: handled by the process/thread scheduler. Threads wait in sleep, in syscalls or on a mutex.
Program level contention: waiting on locks, busy polling, atomics.
Lock-free data structures are so popular because they allow a little cheat here: No syscalls involved, so they do not get scheduled away. Mutex locks involve a call to futex() in some cases, which is a syscall.
How do you find out which threads contend with others, or which threads get starved? You're gonna need tracing tools!
Contention: 100 threads operate on a database that allows at most 10 parallel connections. 90 threads have to wait while 10 do work. Minimizing contention means giving the 90 threads some meaningful work while they wait.
Real world example for starvation: a telephone hotline where some callers automatically call the hotline all the time, while normal people don't get through anymore. With the example above: maybe some threads use an outdated db library that makes it less likely to get a connection from the connection pool.
Several primitives combined build a pattern.
Classical producer-consumer problem.
Pools often use a queue (i.e. a channel or some other queue structure). This means that, up to some point, you can produce more than you consume. Can be a problem - see the sketch below.
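A minimal sketch of such a pool with a bounded queue; Job, process and allJobs are placeholders:

    jobs := make(chan Job, 100) // bounded queue: producers block when it is full

    // The pool of consumers:
    for i := 0; i < 10; i++ {
        go func() {
            for job := range jobs {
                process(job)
            }
        }()
    }

    // The producer; close() ends the consumers' range loops:
    for _, job := range allJobs {
        jobs <- job
    }
    close(jobs)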
Example: code/producer_consumer
Features of a tracer:
Usage:
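For Go's execution tracer, that roughly boils down to:

    $ go test -trace=trace.out ./package
    $ go tool trace trace.out

(For non-test binaries there is also the runtime/trace package to start and stop tracing from within the program.)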
A bit more background: https://blog.gopheracademy.com/advent-2017/go-execution-tracer
Several pools connected over channels.
// DO NOT:
func work() {
    report := generateReport()
    encoded := report.Marshal()
    compressed := compress(encoded)
    sendToNSA(compressed)
}
Talk about the naive implementation, where the time of finish is dominated by a single long-running job.
// Instead:
c1 := make(chan type1)
c2 := make(chan type2)
c3 := make(chan type3)

go func() {
    for job := range c1 {
        c2 <- generateReport(job)
    }
}()

go func() {
    for report := range c2 {
        c3 <- report.Marshal()
    }
}()

// ...
This can also be easily combined with the pool pattern to start several go routines per pipeline step, allowing us to easily balance out steps that take longer than others. Not shown here, take this as homework.
func fibonacci() chan int {
    ch := make(chan int, 10)
    go func() {
        a, b := 1, 1
        for {
            ch <- a
            a, b = b, a+b
        }
    }()
    return ch
}

// usage:
for elem := range fibonacci() {
    // ...
}
Problem: How to stop? Best to use context.Context (see the sketch below).
Note: You should probably buffer a little here.
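A sketch of the context variant of the fibonacci generator above:

    func fibonacci(ctx context.Context) <-chan int {
        ch := make(chan int, 10)
        go func() {
            defer close(ch)
            a, b := 1, 1
            for {
                select {
                case ch <- a:
                    a, b = b, a+b
                case <-ctx.Done():
                    return // consumer is gone; no goroutine leak
                }
            }
        }()
        return ch
    }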
What kind of problems do we need to solve with primitives and patterns?
var counter int

func f() {
    for idx := 0; idx < 10000; idx++ {
        counter++
    }
}

// ...
go f()
go f()
Shared state / Copy over reference:
Channels copy data on send. A private copy cannot be modified behind your back by other threads. There is a small overhead of course, but it is much smaller than false sharing. This also means though: do not send pointers over channels, as the pointer value itself is copied, but of course not the value it points to.
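A tiny illustration (Msg is a made-up type):

    type Msg struct{ Data [64]byte }

    chVal := make(chan Msg)  // receiver gets its own copy - safe
    chPtr := make(chan *Msg) // both sides share the pointee - needs extra sync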
Less scope is better. If a variable is only visible to a single thread or goroutine, then it cannot have issues. Avoid global state anyways.
Proper synchronisation:
At some point you need to resort to sync primitives of course. If you need to use too much of it, chances are you have an issue in your architecture though.
Race detector / tests / Rust:
Parallel code is complicated. Use whatever tools are available to ensure correctness.
$ go test -race ./package
# or for non-tests:
$ go run -race main.go

WARNING: DATA RACE
Read by goroutine 185:
  net.(*pollServer).AddFD()
      src/net/fd_unix.go:89 +0x398
  ...
Previous write by goroutine 184:
  net.setWriteDeadline()
      src/net/sockopt_posix.go:135 +0xdf
  ...
Herr Rittler likes this.
More info: https://go.dev/doc/articles/race_detector
Disadvantages: noticeably higher memory usage and slower execution (the Go docs mention roughly 5-10x memory and 2-20x runtime), and it only reports races that actually occur during the run.
Still: You should probably enable it in your tests.
ch := make(chan int)

// thread1:
ch <- 42

// thread2:
if someCondition {
    result := <-ch
    // ...
}
Unbuffered channels are prone to deadlocks. In this example we will have a deadlock in thread1 if thread2 does not go into the if block.
For this we probably should have used a buffered channel.
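Sketch of the buffered fix:

    ch := make(chan int, 1)

    // thread1: the send completes immediately now,
    // even if thread2 never receives:
    ch <- 42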
func foo() error {
    mu.Lock()

    if err := bar(); err != nil {
        return err
    }

    mu.Unlock()
    return nil
}
Absolute classic. Forgetting to unlock in one error case.
Luckily, in Go we have the defer statement, so we can unlock the mutex in all cases.
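The same function with defer:

    func foo() error {
        mu.Lock()
        defer mu.Unlock() // runs on every return path

        if err := bar(); err != nil {
            return err
        }
        return nil
    }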
func foo() {
    mu1.Lock()
    mu2.Lock()
    defer mu1.Unlock()
    defer mu2.Unlock()
    // ...
}

func bar() {
    mu2.Lock()
    mu1.Lock()
    defer mu2.Unlock()
    defer mu1.Unlock()
    // ...
}
The lock hierarchy needs to be preserved. Otherwise deadlocks might happen.
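One possible fix, sketched here: make both functions agree on the same order.

    // Fixed: both functions take the locks in the same order.
    func foo() {
        mu1.Lock()
        defer mu1.Unlock()
        mu2.Lock()
        defer mu2.Unlock()
        // ...
    }

    func bar() {
        mu1.Lock()
        defer mu1.Unlock()
        mu2.Lock()
        defer mu2.Unlock()
        // ...
    }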
Tip: In Go programs you can press Ctrl+\ (which sends SIGQUIT) or send SIGABRT to the program to make it print a stack trace. Or use a debugger.
Example: code/deadlock
A system that does not make any progress for prolonged times. Relatively rare, but can happen.
Usual cause: Too primitive retry mechanism.
Real life example: two processes trying to execute SQL transactions that depend on each other. The SQL server will abort the transactions and make them retry - if both use the same retry mechanism, it might take a long time to resolve the situation.
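A sketch of a friendlier retry loop with exponential backoff and jitter (tryTransaction is a made-up placeholder):

    // The random jitter makes sure the two sides do not
    // keep retrying in lockstep forever:
    delay := 10 * time.Millisecond
    for tryTransaction() != nil {
        time.Sleep(delay + time.Duration(rand.Intn(50))*time.Millisecond)
        delay *= 2
    }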
resultCh := make(chan int)
go longRunningJob(resultCh)

// Give the job 5 seconds to complete:
select {
case result := <-resultCh:
    // do something with `result`
case <-time.After(5 * time.Second):
    log.Warnf("Oh no! No result yet.")
    // BUT: longRunningJob is still running!
}
// Init:
parentCtx := context.Background()
timeout := 5 * time.Second
ctx, cancel := context.WithTimeout(parentCtx, timeout)

// Check for cancellation:
select {
case <-ctx.Done():
    return ctx.Err()
default:
    // if not cancelled, we land here.
}

// Cancellation:
cancel()
Especially useful for HTTP request handlers. In Go, each of them has a context that is cancelled when the request is not needed anymore.
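Roughly like this in a handler (doWork is a made-up placeholder):

    func handler(w http.ResponseWriter, r *http.Request) {
        // r.Context() is cancelled when the client disconnects:
        select {
        case <-r.Context().Done():
            return // client is gone, stop wasting work
        case result := <-doWork(r.Context()):
            fmt.Fprintln(w, result)
        }
    }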
You almost made it! Just one slide left!
And it's a very easy one and won't take a lot of time at all!
Goal: No philosopher should starve.
We won't solve it today - possible solutions are on wikipedia: https://en.wikipedia.org/wiki/Dining_philosophers_problem
But please don't look them up - better come and talk to me to check whether you figured out a valid solution!
Bonus: If you can name all philosophers pictured above:
Two problems that can occur: deadlock (every philosopher holds one fork and waits for the other) and starvation (a philosopher never manages to grab both forks).
Solution:
🏁
Final words:
Hope you had some fun and learned something. After all, you endured 15 hours of me trying to speak English. I'm aware that most of you haven't done your homework yet. I can't and won't force you to do it, but in my opinion it is essential that you practice yourself. It's like in school, where the math lesson seemed obvious until you had a go at it yourself. So please make me proud. ;-)
There is a calendar entry in December where you can show your approach, if you want. It's totally optional, but if you want feedback by me or the others, then this is a good time for it.
Feedback time! How did you like the workshop? What could have been done better or differently? What did you like, what didn't you like? You can also give me feedback via Slack or in the December meeting.