Memory
Bookkeeping is hard 📝
Most stuff in this session is related to this PDF:
Two major types in use today:
SDRAM = Synchronous DRAM
DDR-SDRAM = Double Data Rate SDRAM
(and DDR2 doubles that and so on)
Dynamic sounds good, doesn't it? Well, it isn't...
Pros:
Cons:
So basically...
Again, hardware is at fault, and instead of fixing it with some clever trick we software devs have to cope with slow main memory.
Fun fact: DRAM enables a hardware-based security attack: ROWHAMMER. Repeatedly accessing (hammering) a row of DRAM cells can flip bits in a nearby row. This can be used to change data like "userIsLoggedIn".
ECC comes with a price & performance tag.
NUMA = Non-Uniform Memory Access
Is the access to all memory equally fast?
NUMA is a term you might come across.
Linux is NUMA capable and that's why it's such a popular server and supercomputer operating system. Or one of the reasons, at least.
The large sequential slab of memory needs to be distributed to all programs that require it.
Understanding how the kernel and processes manage their memory makes it possible to use less of it and to use it more efficiently.
For this we need to start at the basics...
func recursive(depth int) {
    if depth <= 0 {
        return
    }

    var a int
    fmt.Printf("%p\n", &a)
    recursive(depth - 1)
}

// ...
recursive(10)

// Output:
// 0xc000070e70   -> diff: 80 bytes due to:
// 0xc000070e20   -> stack pointer, frame pointer
// 0xc000070dd0   -> registers, params, local vars
// ...
More details on calling a function:
https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64
Stack growth direction: depends on the OS / your programming language.
Every function call causes a so-called "frame" to be placed on top of the stack. Each frame contains parameters and local variables, but also a return link to the previous frame.
The program knows which frame we are in by looking at a thing called the "base pointer": a special pointer kept in a fixed register and changed on each function call. There is also a pointer that points to the current end of the stack, so using both we know how big a frame is. When a function returns, we can simply restore those two registers using the "return link". The current stack frame is then deallocated and will be overwritten when another function is called.
Malicious software is sometimes able to overwrite the return link to point somewhere else, e.g. using a buffer overflow. This leads the program to execute other data as code, potentially doing evil things.
Registers:
ebp: Base pointer. Points to the start of the current function's frame. The cell at this address contains the "return link to the last function" (i.e. a pointer to an instruction offset).
esp: Initially the base pointer, but grows with each variable put on the stack.
eip: Points to the current instruction (not on the stack; your code lives somewhere else in memory).
Stack origin: ebp. Stack pointer: esp.
https://en.wikipedia.org/wiki/Stack-based_memory_allocation
Good explanation here too: https://people.cs.rutgers.edu/~pxk/419/notes/frames.html
Why not use the Stack for everything?
The stack is limited in size, so that endlessly recursive programs cannot break everything. Also, running over the extents of a buffer in C (security issue!) will overwrite parts of the stack, so limiting it makes sense.
Example: code/stackoverflow
//go:noinline
func f() *int {
    v := 3
    return &v
}

func main() {
    // Two for the stack:
    // a=0xc00009aef8 b=0xc00009aef0
    a, b := 23, 42

    // Two for the heap:
    // c=0xc0000b2000 d=0xc0000b2008
    c, d := f(), f()

    _, _, _, _ = a, b, c, d // keep the compiler happy about unused variables
}
In contrast to the stack, heap memory is not bound to a function and therefore survives the function's return. The downside is that the memory needs to be freed.
Languages like Go allocate automatically on the heap if they have to - they do this either when the compiler cannot prove that the value does not escape the function stack or when the allocation is too big for the stack. More on this later. Thanks to the GC memory is freed automatically after it's used. Having a GC is often understood as "I don't need to think about memory" though, which is not the case. You can help the GC to run faster and avoid memory leaks that can arise through edge cases.
Languages like Python allocate everything on the heap. They almost never use stack based memory for anything. Most interpreted languages use a combination of reference counting and garbage collection. Very convenient but also the slowest way to go.
Languages like C (and partly Rust) pass the duty of memory management to the programmer. While this makes it possible to be clever, it also opens up ways to fuck up tremendously by creating memory leaks, double frees, forgotten allocations or use-after-free scenarios.
Heap memory must be cleaned up after use. Go does this with a GC.
int *ptrs[100];
for (int i = 0; i < 100; i++) {
    ptrs[i] = malloc(i * sizeof(int));
}

// ... use memory ...

for (int i = 0; i < 100; i++) {
    free(ptrs[i]);
}
malloc() is a function that returns N bytes of memory, if available. It uses a kernel syscall (sbrk()), but most of the logic lives in a userspace library.
malloc() internally manages a pool of memory, from which it slices off the requested portions. Whenever the pool runs out of fresh memory, the malloc implementation will ask the kernel for a new chunk of memory. The exact mechanism is either sbrk(2) or mmap() (we will see mmap() later).
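For illustration, this is roughly the kind of request malloc() makes when its pool runs out, done by hand via mmap(); a Linux-only sketch using Go's syscall package (the 1 MiB size is arbitrary):

// imports: "syscall" (Linux only)
// Ask the kernel directly for 1 MiB of anonymous memory, roughly what
// a malloc() implementation does when its internal pool is exhausted.
mem, err := syscall.Mmap(-1, 0, 1<<20,
    syscall.PROT_READ|syscall.PROT_WRITE,
    syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS)
if err != nil {
    panic(err)
}

mem[0] = 42             // pages only get physically backed on first touch
_ = syscall.Munmap(mem) // hand the chunk back to the kernel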
As malloc() needs to cater for objects of many different sizes (as seen in the example above), it is prone to fragmentation.
malloc() can fail! Originally it was supposed to fail if there is no memory left, but on Linux there is "infinite" virtual memory and overcommitting (we come to those later), which is why it does not fail for this reason. It will fail, however, if you ask for a block that is too large, if there are memory restrictions on your process (cgroups), or for other administrative reasons.
As mentioned above, the memory allocated from the pool needs to be freed, so it can be re-used. This is done by the free() call.
malloc() needs to track which parts of its pool are in use and which can be issued on the next call. It does so by using free lists. Each block returned by malloc() has a small header (violet in the picture) that points to the next block. The memory returned by malloc() sits just behind this small header.
On allocation, a free block is taken out of the free list and added to the "allocated" list. This means that every allocation has a small space and time overhead.
On free(), the opposite happens: the block is put back into the free list and taken out of the "allocated" list. (i.e. an allocation is O(log n), instead of O(1) as with the stack)
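To make the free-list idea concrete, here is a toy sketch in Go (fixed-size blocks only and no real header trickery, unlike an actual malloc(); all names are made up for illustration):

// A toy free-list allocator over fixed-size 4 KiB blocks.
type block struct {
    next *block // the "header": points to the next free block
    data [4096]byte
}

type arena struct {
    freeList *block
}

func newArena(nBlocks int) *arena {
    a := &arena{}
    for i := 0; i < nBlocks; i++ {
        a.freeList = &block{next: a.freeList}
    }
    return a
}

// alloc takes a block out of the free list (what malloc() does).
func (a *arena) alloc() *block {
    b := a.freeList
    if b == nil {
        return nil // pool exhausted: a real allocator would ask the kernel now
    }
    a.freeList = b.next
    return b
}

// free puts the block back at the head of the free list (what free() does).
func (a *arena) free(b *block) {
    b.next = a.freeList
    a.freeList = b
}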
It is interesting to note that there are different implementations of this, with different advantages and drawbacks. One very performant implementation is jemalloc.
Useful Links:
// In C:
char *s;
s = malloc(20);
s = malloc(30); // leak: 20 bytes.
// In Go:
var m = map[string][]byte{}

func f(v int) int {
    // the slice will still be referenced after
    // the function returns, if not delete()'d
    m["blub"] = make([]byte, 100)
    return v * v
}
Other sources of memory leaks:
Use pprof to find memory leaks in Go.
In C it's very easy to forget a free(), therefore quite some impressive tooling has developed over the years. The most prominent example is valgrind: https://valgrind.org
Python: also has memory leaks, but finding them is much harder since the tooling is not great (at least when I last looked). Also: memory leaks can happen on the C side or in the Python code itself. If they happen in a C module you're pretty much fuc.. lost.
Heap
The heap requires some implementation of malloc(). There are many different implementations of it in C, using different strategies to perform well under certain loads. Choosing the right kind of allocator is a science in itself. More info can be obtained here:
https://en.wikipedia.org/wiki/Memory_management#Implementations
In languages like Go you don't have a choice of which memory allocator you get. The Go runtime provides one for you. This makes sense, as it is coupled very tightly with the garbage collector. Go uses a similar implementation, but a more sophisticated one. Main difference: it keeps pre-allocated arenas for differently sized objects, i.e. 4, 8, 16, 32, 64 bytes and so on.
The grow direction of the heap and stack is not really important; keep in mind that every thread/goroutine has its own stack and that there might even be more than one heap area, possibly backed by different malloc() implementations.
The GC is a utility that remembers allocations and scans the memory used by the program for references to them. If no references are found, it automatically cleans up the associated memory.
This is very ergonomic for the programmer, but comes with a performance impact. The GC needs to run regularly and has to, at least for a very short amount of time, stop the execution of the program.
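To see the GC at work you can ask the runtime for its statistics; a minimal sketch (the exact numbers will differ per machine and Go version):

// imports: "fmt", "runtime"
var sink []byte // package-level, so the allocations below escape to the heap

func main() {
    // Produce a lot of short-lived heap garbage to give the GC work.
    for i := 0; i < 1_000_000; i++ {
        sink = make([]byte, 128)
    }

    runtime.GC() // force a collection so the stats below are up to date

    var stats runtime.MemStats
    runtime.ReadMemStats(&stats)
    fmt.Println("completed GC cycles:", stats.NumGC)
    fmt.Println("heap in use (bytes):", stats.HeapInuse)
    fmt.Println("total GC pause (ns):", stats.PauseTotalNs)
}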
Good reference for the Go GC: https://tip.golang.org/doc/gc-guide
// Prefer this...
m := make(map[string]someStruct)

// ...over this:
m := make(map[string]*someStruct)
Example: code/allocs
# counting words with a map:
$ go test -v -bench=. -benchmem
noptr    577.7 ns/op    336 B/op     2 allocs/op
ptr      761.4 ns/op    384 B/op    10 allocs/op
"GC Pressure" describes the amount of load a garbage collector currently has. The more small objects it has to track, the higher the load. You can help it by reducing the amount of different objects and making use of sync.Pools (see later)
One way to less use memory is to use less pointers:
The "10" will increase with input size! Longer runs will cause more GC for the ptr case.
$ go build -gcflags="-m" .
./main.go:5:2: moved to heap: x
Only heap allocated data is managed by the garbage collector. The more you allocate on the heap, the more pressure you put on the memory bookkeeping and the garbage collector.
Good guide for the details: https://tip.golang.org/doc/gc-guide#Eliminating_heap_allocations
Picture source: https://dev.to/karankumarshreds/memory-allocations-in-go-1bpa
s := make([]int, 0, len(input))
m := make(map[string]int, 20)

// ...

// If you need to concatenate many strings:
var b strings.Builder
b.Grow(100 * 13)
for idx := 0; idx < 100; idx++ {
    b.WriteString("Hello World!\n")
}
fmt.Println(b.String())
Example: code/prealloc
This helps the GC in two ways:
s := make([]int, 0, maxLineLen)
for _, line := range lines {
    s = s[:0] // re-use underlying array!

    // do something with s and line here
}
// avoid expensive allocations by pooling:
var writerGzipPool = sync.Pool{
    // other good candidates: bytes.Buffer{},
    // big slices, empty objects used for unmarshal
    New: func() any {
        return gzip.NewWriter(ioutil.Discard)
    },
}

w := writerGzipPool.Get().(*gzip.Writer)
// ... use w ...
writerGzipPool.Put(w)
Example: code/mempool
Pooling is the general technique of keeping a set of objects around that are expensive to create, so they can be re-used. Typical examples are thread pools that keep running threads around instead of firing up a new one for every task. The same can be done for memory objects that are expensive to allocate (or have long-running init code like gzip.Writer).
Pools can be easily implemented using an array (or similar) and a mutex. sync.Pool is a Go-specific solution that has some knowledge of the garbage collector which would not be available to normal programs otherwise. It keeps a set of objects around until they would be garbage collected anyway, i.e. the objects in the pool get automatically freed after one or two GC runs.
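For comparison, a hand-rolled pool as just described could look roughly like this sketch (a slice guarded by a mutex; bufPool and the 4 KiB size are made up for illustration):

// imports: "sync"
type bufPool struct {
    mu   sync.Mutex
    free [][]byte
}

func (p *bufPool) Get() []byte {
    p.mu.Lock()
    defer p.mu.Unlock()

    if n := len(p.free); n > 0 {
        buf := p.free[n-1]
        p.free = p.free[:n-1]
        return buf[:0] // re-use the underlying array
    }
    return make([]byte, 0, 4096) // pool empty: allocate a fresh buffer
}

func (p *bufPool) Put(buf []byte) {
    p.mu.Lock()
    defer p.mu.Unlock()
    p.free = append(p.free, buf)
}

Unlike sync.Pool, nothing in this pool is ever handed back to the GC, so it can pin memory forever.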
// type StringHeader struct { Data uintptr; Len int }
func stringptr(s string) uintptr {
    return (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
}

func main() {
    s1 := "123"
    s2 := s1
    s3 := "1" + "2" + "3"
    s4 := "12" + strconv.FormatInt(3, 10)
    s5 := "12" + strconv.FormatInt(3, 10)
    fmt.Printf(
        "0x%x 0x%x 0x%x 0x%x 0x%x\n",
        stringptr(s1), // 0x000049a4c2
        stringptr(s2), // 0x000049a4c2
        stringptr(s3), // 0x000049a4c2
        stringptr(s4), // 0xc000074ed0
        stringptr(s5), // 0xc000074f80
    )
}
type stringInterner map[string]string

func (si stringInterner) Intern(s string) string {
    if interned, ok := si[s]; ok {
        return interned
    }

    si[s] = s
    return s
}

func main() {
    si := stringInterner{}
    s1 := si.Intern("123")
    s2 := si.Intern(strconv.Itoa(123))
    fmt.Println(stringptr(s1) == stringptr(s2)) // true
}
Advantage:
Further examples and the full impressive benchmark can be found here:
https://artem.krylysov.com/blog/2018/12/12/string-interning-in-go
// Measuring speed of string comparisons:
BenchmarkStringCompare1-4            1.873 ns/op
BenchmarkStringCompare10-4           4.816 ns/op
BenchmarkStringCompare100-4          9.481 ns/op
BenchmarkStringCompareIntern1-4      1.830 ns/op
BenchmarkStringCompareIntern10-4     1.868 ns/op
BenchmarkStringCompareIntern100-4    1.965 ns/op
Example: code/internment
Interning scales incredibly well.
Good usecases:
Bad usecases:
$ GOMEMLIMIT=2000M go run app.go
Linux only supports setting a maximum amount of memory that a process (or cgroup) may consume. If the limit is exceeded, the process (or cgroup) is killed. This makes the limit a hard limit, which is seldom useful.
What is more useful is a soft limit that makes the application attempt to free memory before it reaches the limit. As the garbage collector normally has a backlog of short-lived objects (i.e. memory on the heap that gets freed regularly), it could peak over a hard limit (6G in the diagram) for a short moment. By setting GOMEMLIMIT we can tell the GC to run more aggressively as memory usage approaches this soft limit.
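The same soft limit can also be set from inside the program via runtime/debug; a small sketch (the 2000 MiB value just mirrors the command above):

// imports: "runtime/debug"
// Equivalent to GOMEMLIMIT=2000M, but set programmatically:
debug.SetMemoryLimit(2000 << 20) // soft limit in bytes (~2000 MiB)

// Optionally rely purely on the limit instead of the GOGC target:
// debug.SetGCPercent(-1)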
More Info: https://weaviate.io/blog/gomemlimit-a-game-changer-for-high-memory-applications
type Item struct {
    Key  int64
    Blob []byte
}

func (i Item) Copy() Item {
    blob := make([]byte, len(i.Blob))
    copy(blob, i.Blob)
    return Item{Key: i.Key, Blob: blob}
}

type Items []Item

func (items Items) Copy() Items {
    copyItems := Items{}
    for _, item := range items {
        copyItems = append(copyItems, item.Copy())
    }
    return copyItems
}
Steps:
Example taken from timeq: https://github.com/sahib/timeq/blob/main/item/item.go#L80
Let's talk about the elephant in the room: the address of a value is not the address in physical memory. How can we prove it?
Wait, those addresses I saw earlier... are those the addresses in RAM? Hopefully not, because otherwise you could somehow find out where the OpenSSH server lives in memory and steal its keys. For security reasons it must look to each process like it's completely alone on the system. What you saw above are virtual memory addresses and they stay very similar on each run.
The concept of how this is achieved is called "virtual memory" and it's probably one of the more clever things we did in computer science.
$ cat /proc/<pid>/maps
55eab7237000-55eab7258000 rw-p   [heap]
...
7f54a1c18000-7f54a1c3a000 r--p   /usr/lib/libc.so.6
...
7ffe78a26000-7ffe78a47000 rw-p   [stack]
Each process has a »Page Table« mapping virtual to physical memory.
On process start this table is filled with a few default kilobytes of mapped pages (the first few pages are not mapped, so dereferencing a NULL pointer will always crash)
When the program first accesses those addresses, the CPU will generate a page fault, indicating that there is no such mapping. The OS receives this, finds a free physical page, maps it and retries execution. If the fault cannot be resolved (e.g. the address does not belong to any valid mapping), the OS will kill the process with SIGSEGV.
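A small sketch in the spirit of code/virtualmem to observe this (watch VIRT vs. RES in htop while it runs; the sizes are arbitrary):

// imports: "fmt", "time"
func main() {
    // Reserve ~4 GiB of virtual memory. Almost no physical pages are
    // mapped yet, so VIRT jumps up while RES stays small.
    buf := make([]byte, 4<<30)

    // Touch one byte per 4 KiB page: the first write to each page
    // triggers a page fault and the kernel maps a physical page.
    for i := 0; i < len(buf); i += 4096 {
        buf[i] = 1
        if i%(256<<20) == 0 {
            fmt.Printf("touched %d MiB\n", i>>20)
            time.Sleep(time.Second)
        }
    }
}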
The picture above shows htop on my rather old laptop with a normal workload. The amount of virtual memory for some programs like signal-desktop is HUGE and only a tiny portion is actually used.
Fun fact: The program I was actively using was gimp, but the actual performance hogs were all browser-based applications. Brave new world.
If you want to flex: Use btop for even prettier screens.
# Create some space for swapping:
$ dd if=/dev/zero of=swapfile count=1024 bs=1M
$ mkswap ./swapfile
$ swapon ./swapfile

# Check how eager the system is to use the swap,
# with a value between 0-100. This is the percentage
# of RAM that is left before swapping starts.
$ cat /proc/sys/vm/swappiness
# 0   = only swap if OOM would hit otherwise.
# 100 = swap everything not actively used.
# 60  = default for most desktops.
# <10 = good setting for database servers
Linux can use swap space as second-priority memory if main memory runs low. Swap is already used before memory runs low, though: inactive processes and stale IO pages get put to swap so that memory management can make use of that space to provide less fragmented memory regions.
How aggressive this happens can be set using vm.swappiness. A value between 0 and 100.
Rules:
# Show the peak resident memory usage:
$ /usr/bin/time -v <command>
...
Maximum resident set size (kbytes): 16192
...
Example: code/virtualmem
Start ./virt and observe in htop how the virtual memory is immediately there and the resident memory slowly increases second by second. The program will crash if you wait long enough.
Start with '/usr/bin/time -v ./virt' and interrupt at any time.
Note that time only samples in a certain interval. If you're unlucky you might miss the actual peak. So don't rely on values to be exact here.
Works similar to the CPU profile and gives us a good overview. The little cubes mean "x memory was allocated in y size batches".
The pprof output is also available as flamegraph if you prefer this kind of presentation.
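One common way to get such a heap profile out of a running service is the net/http/pprof handler; a minimal sketch (port 6060 is just a convention):

// imports: "log", "net/http" and the blank import _ "net/http/pprof"
func main() {
    // The blank import registers the /debug/pprof/* handlers
    // on the default HTTP mux.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... run the actual application here ...
}

# Fetch and inspect the current heap profile in the browser:
$ go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap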
No way around it. Profiling and benchmarking leave a gap: long running applications where you do not expect performance issues. In that case you should always monitor resource usage so you can check when and how fast memory usage increased (and maybe correlate with load)
When you notice issues you can do profiling via pprof.
Especially long-running memory leaks are hard to debug (e.g. when memory accumulates over the course of several days).
Here it can help to combine monitoring and profiling, which is therefore sometimes called "continuous profiling". Pyroscope is one of those tools.
A short article on how to integrate this with Go services: https://grafana.com/blog/2023/04/19/how-to-troubleshoot-memory-leaks-in-go-with-grafana-pyroscope/
Demo for Go: https://demo.pyroscope.io/?query=rideshare-app-golang.cpu%7B%7D&from=1682450314&until=1682450316
Another way to enable continuous profiling is to use eBPF: small programs that run in kernel space. That's more low-level, but it is pretty much the most powerful approach, and I expect many tools that make use of it to emerge in the next years.
Alternatives:
Userspace daemons that monitor memory usage and kill processes in a very configurable way. Well suited for server systems.
🏁