I/O & Syscalls
Speaking with the kernel 🐧
Examples:

| | Low latency | High latency |
|---|---|---|
| Low throughput | SD cards | SSHFS |
| High throughput | SSD | HDD |
Fun fact: An extreme example of high latency with high throughput is IPoAC (IP over Avian Carriers), i.e. strapping a USB stick to a homing pigeon. This was even (jokingly) standardized: https://en.wikipedia.org/wiki/IP_over_Avian_Carriers
Big advantage of HDDs: you could debug seek-heavy workloads by ear!
Write software for SSDs. There used to be crazy tricks like FIEMAP that let applications re-order their reads to match the physical layout on disk (a huge speedup on HDDs, a small one on SSDs), but such tricks will become more and more pointless.
Source: http://databasearchitects.blogspot.com/2021/06/what-every-programmer-should-know-about.html?m=1
SSDs are divided into blocks (e.g. 512 KB), which are divided into pages (often 4 KB). Pages can be read and written, but not overwritten in place: only whole blocks can be erased. An update to a page is therefore written to a fresh page in another block. If space runs out, old blocks with many stale pages are erased and can be re-used. The number of physical writes is therefore higher than the number of logical writes. The more space is in use, the higher this write amplification factor becomes.
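As a worst-case illustration (the numbers here are made up for the example): write amplification relates physical to logical writes, \(\text{WAF} = \frac{\text{bytes physically written}}{\text{bytes logically written}}\). If updating a single 4 KB page forces the SSD to relocate the 127 live pages (508 KB) sharing its block before erasing it, the device writes 512 KB to serve a 4 KB update: a WAF of 128.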
What we can do about it: buy bigger SSDs than you need, and avoid rewriting pages where possible. Secret: SSDs keep some spare space they don't tell you about (over-provisioning) to keep working.
Also enable TRIM support if your OS did not already; nowadays it is almost always enabled. TRIM makes it possible for the OS to tell the SSD which blocks are not needed anymore.
Let's be honest: I/O is one of the cases where it's easiest to kill the problem by throwing a lot of hardware at it. The easiest way to increase the available bandwidth is a RAID0, i.e. coupling several disks into one logical unit. Depending on your use case you can of course use other RAID levels:
https://en.wikipedia.org/wiki/Standard_RAID_levels
But that's not the point of this workshop. The point is how you can increase the throughput of your applications so you're able to reach this bandwidth (and maybe how you can defer having to buy more hard disks).
Even memory is a file: /dev/mem
Or a complete USB stick: /dev/sda
Or randomness: /dev/urandom
Below device drivers sit the hardware controllers - beyond the scope of this talk. They can also re-order writes and are mostly concerned with durability: an SSD controller, for example, will distribute writes over the blocks it uses so that all blocks see a similar number of write cycles (wear leveling).
```c
// Example: writing to a file
// as documented in glibc:
// ssize_t write(
//     int fd,           // file descriptor
//     const void buf[], // data
//     size_t count      // size of data
// );
write(1, "Hello world!\n", 13);
```
Compiled:
```asm
; use the `write` system call (1)
mov rax, 1
; write to stdout (fd 1) - 1st arg
mov rdi, 1
; address of the string "Hello world!\n" - 2nd arg
; (0x1234 is the addr of the string here)
mov rsi, 0x1234
; write 13 characters - 3rd arg
mov rdx, 13
; make the system call via a special instruction
syscall
; The return code is now in the RAX register.
```
Disclaimer: The 'syscall' instruction is not the only such instruction: 32-bit x86 used `int 0x80` and `sysenter` instead, and other architectures have their own. But the mechanism is similar enough everywhere to explain it this way.
All available syscalls and their ids are here: https://filippo.io/linux-syscall-table/
Syscalls are the only way for userspace to talk to the kernel. How to invoke them is ISA-specific.
The syscall instruction performs a context switch: the current state of the process (i.e. the state of all CPU registers) is saved away so it can be restored later. The kernel then sets the registers to its needs and does whatever is required to serve the system call. When finished, the process state is restored and execution continues.
Context switches also happen when you're not calling any syscalls: simply when the scheduler decides this process has run long enough for now.
There is a syscall for every single thing that userspace cannot do without the kernel's help.
Luckily for us, glibc and Go provide nice names and interfaces for making those system calls. They are usually thin wrappers that also do some basic error checking. Watch out: fread() does its buffering in userspace!
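To see how thin these wrappers are, here is a hedged sketch of calling write without the libc wrapper; glibc's write() does essentially this, plus errno handling:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    // Invoke the syscall by its number directly;
    // the write() wrapper is little more than this.
    syscall(SYS_write, 1, "Hello world!\n", 13);
    return 0;
}
```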
Can anyone think of another syscall not in the list above? exit! chdir ... (There are about 300 of them)
Also, which things are not syscalls? Math, random numbers, cryptography, ... i.e. everything that can be done without side effects or hardware.
$ man 2 read
Every man page in section 2 refers to a system call.
»Reduce the number of syscalls and thou shalt be blessed!«
```c
char buf[1024];
int fd = open("/some/path", O_RDONLY);
ssize_t bytes_read = 0;
while ((bytes_read = read(fd, buf, sizeof(buf))) > 0) {
    /* do something with buf[:bytes_read] */
}
close(fd);
```
There are two costs here: Copying the data and context switching.
Looks fairly straightforward, and most of you have probably written something like this already, maybe even for sockets or other streams. BUT here's the thing: every read() needs one syscall, and all bytes from the file are copied into a userspace-supplied buffer. This model is flexible, but it costs performance. With mmap() and io_uring we will see options that can, sometimes, work with zero copies.
Sidenote: Always be nice and close your file descriptors. There are two reasons for that:
```c
char buf[1024];
size_t bytes_in_buf = 0;
int fd = open("/some/path", O_CREAT|O_WRONLY|O_TRUNC, 0644);
do {
    /* fill buf somehow with data you'd like to write,
     * set bytes_in_buf accordingly. */
} while (write(fd, buf, bytes_in_buf) >= 0);
fsync(fd);
close(fd);
```
Q1: Does this mean that the data is available to read() once write() returned?
Q2: Is the data saved on disk after write() returns?
A1: This is not spelled out as a hard guarantee everywhere, but you should mostly be able to assume this.
A2: Sadly not guaranteed, depending on the storage driver and hardware. (The kernel has to rely on the hardware to acknowledge received data.)
---
There is a bug here though:
write() returns the number of written bytes. It might be less than bytes_in_buf, and that is not counted as an error: the write call might simply have been interrupted, and we are expected to call it again with the remaining data. This only happens if your program uses POSIX signals that were not registered with the SA_RESTART flag (see man 7 signal). Since SA_RESTART is the default, it's mostly not an issue in C.
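A minimal sketch of a loop that handles both short writes and interruption; the helper name write_all is made up:

```c
#include <errno.h>
#include <unistd.h>

// Hypothetical helper: keep writing until all `count` bytes
// went out, retrying on short writes and EINTR.
int write_all(int fd, const char *buf, size_t count) {
    size_t off = 0;
    while (off < count) {
        ssize_t n = write(fd, buf + off, count - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;  /* interrupted, just retry */
            return -1;     /* a real error */
        }
        off += (size_t)n;  /* short write: advance and retry */
    }
    return 0;
}
```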
Go hides this edge case for you in normal calls like fd.Write() or io.ReadAll(). However, the Go runtime uses plenty of signals, and if you use the syscall package directly for some reason, you might be hit by this kind of bug. This affects not only write() but also read() and many other syscalls.
Also please note: There is some error handling missing here.
```go
// Don't: No pre-allocation possible
func ReadEntry() ([]byte, error) {
    // allocate buffer, fill and return it.
}
```

```go
// Better: buf can be pre-allocated.
func ReadEntry(buf []byte) error {
    // use buf, append to it.
}
```

```go
// Do: Open the reader only once to
// reduce number of syscalls
func ReadEntry(r io.Reader, buf []byte) error {
    // use buf, append to it.
}
```
This is a reminder of the last session. Many Read()-like functions get a buffer passed in instead of allocating one. This is good practice, as it allows calling ReadEntry() in a loop while re-using the same buffer. Even better is of course not copying the data at all, but that's a different story.
Usecases:
Otherwise: Prefer the simpler version.
Userspace buffered functions (stdio's FILE API). No real advantage, but a limiting and confusing API. It has some extra features like printf-style formatting. Since it imposes another copy from its internal buffer to your buffer, and since it uses dynamic allocation for the FILE structure, I tend to avoid it.
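A small sketch of what that buffering means in practice (the path is a placeholder): strace would show only a handful of read() syscalls despite the byte-wise loop, but every byte is copied twice, kernel to FILE buffer, FILE buffer to ch:

```c
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/some/path", "r");
    char ch;
    // Reads one byte at a time from stdio's internal buffer;
    // the buffer itself is refilled by occasional read() calls.
    while (fread(&ch, 1, 1, f) == 1) {
        /* ... use ch ... */
    }
    fclose(f);
    return 0;
}
```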
In Go the normal Read/Write uses the syscall directly; bufio is roughly equivalent to fread/fwrite etc. fsync() is a syscall and not part of stdio, even though it starts with "f".
```
$ dd if=/dev/urandom of=./x bs=1M count=1024
$ dd if=x of=/dev/null bs=1b
4,07281 s, 264 MB/s
$ dd if=x of=/dev/null bs=32b
0,255229 s, 4,2 GB/s
$ dd if=x of=/dev/null bs=1024b
0,136717 s, 7,9 GB/s
$ dd if=x of=/dev/null bs=32M
0,206027 s, 5,2 GB/s
```
Good buffer sizes: \(1k - 32k\)
Each syscall needs to store away the state of all CPU registers and restore it after it finishes. This is called a "context switch".
Many syscalls vs a few big ones.
Try to reduce the number of syscalls, but buffers that are too big hurt too.
```
# (Unimportant output skipped)
$ strace ls -l /tmp
openat(AT_FDCWD, "/tmp", ...) = 4
getdents64(4, /* 47 entries */, 32768) = 2256
...
statx(AT_FDCWD, "/tmp/file", ...) = 0
getxattr("/tmp/file", ...) = -1 ENODATA
...
write(1, "r-- 8 sahib /tmp/file", ...)
```
An insanely useful tool to debug hanging tools or tools that crash without a proper error message. Usually the last syscall they made gives a hint.
Important options:
-C: count syscalls and show statistics at the end.
-f: follow also subprocesses.
-e: Trace only specific syscalls.
Good overview and more details here: https://biriukov.dev/docs/page-cache/2-essential-page-cache-theory/
```
# wait for ALL buffers to be flushed:
$ sync
# pending data is now safely stored.
```
```c
// wait for a specific file to be flushed:
if (fsync(fd) < 0) {
    // error handling
}
// pending data is now safely stored.
```
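Closely related, and sometimes cheaper, is fdatasync(); a sketch with the same fd as above:

```c
#include <unistd.h>

// fdatasync() flushes the file's data like fsync(), but may
// skip metadata such as timestamps - often one disk write less.
if (fdatasync(fd) < 0) {
    // error handling
}
```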
That's why we run the sync command before the drop_caches command.
For I/O benchmarks always clear caches:
```
# 1: Clear page cache only.
# 2: Clear inodes/dentries cache.
# 3: Clear both.
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
```
Example: code/io_cache
```
# Move is atomic (within one filesystem)!
$ cp /src/bigfile /dst/bigfile.tmp
$ mv /dst/bigfile.tmp /dst/bigfile
```
Obviously this only works if you're not constantly updating the file, i.e. for files that are written just once.
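The same trick from C might look like this sketch (the paths are placeholders); rename() is the syscall that makes the final step atomic:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Write the new content to a temporary file on the SAME
// filesystem, make it durable, then swap it into place:
int fd = open("/dst/bigfile.tmp", O_CREAT|O_WRONLY|O_TRUNC, 0644);
/* ... write all data to fd ... */
fsync(fd);
close(fd);
rename("/dst/bigfile.tmp", "/dst/bigfile");  /* atomic swap */
```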
Defines layout of files on disk:
Do you know which filesystems you use? Which filesystems do you know?
The actual implementation of read/write/etc. for a single filesystem like FAT, ext4 or btrfs. There are different ways to lay out and maintain data on disk, depending on your use case.
Syscalls all work the same, but some filesystems have better performance for writes/reads/syncs, or are targeted more at large files or at many small files.
Most differences are admin-related (integrity, backups, snapshots etc.) and not so much performance-related. But if you need things like snapshots and don't want external tools, then btrfs or ZFS are a great fit.
What OS do you think of when you hear "defragmentation"? Right, Windows. Why? Because NTFS used to suffer from it quite heavily. FAT suffered even more from this.
Fragmentation means that the content of a file is not stored as one continuous block, but in several blocks that might be scattered all over the place, possibly even out of order (block B before block A). With rotational disks this was an issue, since the reading head had to jump all over the place to read a single file. This caused noticeable pauses.
Thing is: Linux filesystems rarely require defragmentation, and if you are in need of defragmentation you are probably using an exotic enough setup that you know why.
Most Linux filesystems have strategies to actively defragment files (i.e. bring the parts of a file closer together) during writes to that file. In practice, fragmentation hardly matters anymore today.
Performance is not linear: the fuller the FS is, the more it will be busy with background processes cleaning things up.
Stacking filesystems (e.g. when using encryption) can slow things down. Often there is no alternative though. Only for RAID do you have the option of choosing hardware RAID instead.
Journaling filesystems like ext4 use something like a WAL (write-ahead log): they write the metadata and/or data to a log before integrating it into the actual data structure (which is more complex and takes longer to commit). Data is therefore written twice, with the advantage of being recoverable after a crash or power loss. Disabling the journal speeds things up at the risk of data loss (which might be okay on some servers).
Examples of FUSE filesystems:
FUSE gives you fairly decent performance, since caching and the VFS layer still run in kernel space; only the filesystem logic itself lives in a userspace process.
```c
// Handle files like arrays:
int fd = open("/var/tmp/file1.db", O_RDWR);
char *map = mmap(
    NULL,                  // addr
    1024,                  // map size
    PROT_READ|PROT_WRITE,  // access flags
    MAP_SHARED,            // private or shared
    fd,                    // file descriptor
    0                      // offset
);

// copy a string into the file at an offset:
map[20] = 'H';
map[21] = 'e';
map[22] = 'l';
map[23] = 'l';
map[24] = 'o';
map[25] = 'W';
map[26] = 'o';
map[27] = 'r';
map[28] = 'l';
map[29] = 'd';
```
Example: code/mmap
Benchmarking I/O is especially hard: often you just benchmark the speed of your page cache for reading/writing. Always clear your caches and use fsync() extensively during benchmarking!
Maybe one of the most mysterious and powerful features we have on Linux.
Typical open/read/write/close APIs see files as streams. They are awkward to use if you need to jump around a lot within the file (like some databases do).
With mmap() we can handle files as arrays and let the kernel magically manage reading/writing the required data for us on access. See map[20] above: it does not require explicitly reading the respective part of the file.
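One hedged addition to the example above: writes to a MAP_SHARED mapping reach the file eventually, but to force them out (the mmap equivalent of fsync()) there is msync():

```c
#include <sys/mman.h>

// Flush the dirty pages of the mapping back to the file,
// blocking until the data has been written:
msync(map, 1024, MS_SYNC);
munmap(map, 1024);
```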
Good mmap use cases:
Image source:
https://biriukov.dev/docs/page-cache/5-more-about-mmap-file-access/
https://unixism.net/loti/async_intro.html
The setup in the image below can be achieved using special system calls like epoll(), poll() or select(): they "multiplex" between several files. They basically all work the same: you give them a list of file descriptors, and once invoked, epoll() waits until one of the files is ready to be read from. This minimizes polling on the userspace side and keeps the wait between I/O operations as low as possible.
This is however only possible for network I/O - regular files cannot be polled. The details are beyond the scope of this talk.
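A minimal epoll sketch, assuming sock is an already-connected, non-blocking socket and omitting error handling:

```c
#include <sys/epoll.h>

int ep = epoll_create1(0);

// register interest in "sock became readable":
struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock };
epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);

struct epoll_event ready[16];
for (;;) {
    // sleep until at least one registered fd is ready:
    int n = epoll_wait(ep, ready, 16, -1);
    for (int i = 0; i < n; i++) {
        /* read() from ready[i].data.fd; it won't block */
    }
}
```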
A technique to introduce such mechanisms for regular files too, and to benefit from them.
SQ (Submission Queue): commands like "read file 123 at offset 42".
CQ (Completion Queue): results like "here is the data of file 123 at offset 42".
Advantage: syscalls are only needed during the setup of the interface, not during operation, since the data transfer happens via a memory mapping established in the setup phase.
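A hedged sketch using the liburing helper library (fd and buf are assumed to exist already; error handling omitted):

```c
#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(8, &ring, 0);

// SQ: enqueue "read 1024 bytes of fd at offset 42":
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, 1024, 42);
io_uring_submit(&ring);

// CQ: wait for and reap the completion:
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
/* cqe->res holds the number of bytes read (or -errno) */
io_uring_cqe_seen(&ring, cqe);

io_uring_queue_exit(&ring);
```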
```c
// Skip the page cache; see `man 2 open`
int fd = open("/some/file", O_DIRECT|O_RDONLY);

// No use of the page cache here. Note that O_DIRECT
// requires buffer, size and offset to be aligned
// (typically to the 512B/4K block size):
char *buf;
posix_memalign((void **)&buf, 4096, 4096);
read(fd, buf, 4096);
```
This flag can be passed to the open() call. It disables the page cache for this specific file handle.
Some people on the internet claim this is faster, but 90% of the time that is wrong. There are two main use cases where O_DIRECT earns its keep:
Re-orders read and write requests for performance.
In the age of SSDs we can use dumber schedulers. In the age of HDDs schedulers were vital.
```
# Default level is 4. Lower means higher priority.
$ ionice -c 2 -n 0 -p <some-pid>
```
Well, you can probably guess what it does.
Example: code/fadvise
Example: code/madvise
fadvise() and madvise() can be used to give the page cache hints about which pages are going to be used next and in what order. This can make a big difference for complex use cases like rsync or tar, where the program knows it will read a bunch of files in a certain order. The advice can then be given to the kernel well before the program starts reading the files.
The linked examples try to simulate this by clearing the cache, issuing the advice, waiting a bit and then reading the file in a specific order.
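For reference, a minimal sketch of the fadvise side (the path is a placeholder):

```c
#include <fcntl.h>

int fd = open("/some/file", O_RDONLY);

// Announce sequential access and ask the kernel to start
// fetching the pages; len = 0 means "the whole file":
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

/* ... come back later and read the file ... */
```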
The examples also contain some notable things:
```go
package main

import (
    "io"
    "os"
)

// Very simple `cp` in Go:
func main() {
    src, _ := os.Open(os.Args[1])
    dst, _ := os.Create(os.Args[2])
    io.Copy(dst, src)
}
```
cp is not faster because it copies data faster, but because it avoids copies to user space by using specialized syscalls like copy_file_range() or sendfile().
Find out yourself using strace cp src dst. If no such trick is possible, it falls back to normal buffered read/write.
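A sketch of such a call (src_fd, dst_fd and len are assumed to exist): the kernel moves the bytes directly between the two files, without a round trip through a userspace buffer:

```c
#define _GNU_SOURCE
#include <unistd.h>

// Copy up to len bytes from src_fd to dst_fd in-kernel;
// the offsets are advanced by the amount actually copied.
off64_t in_off = 0, out_off = 0;
ssize_t n = copy_file_range(src_fd, &in_off,
                            dst_fd, &out_off, len, 0);
```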
```
# Show programs with most throughput:
$ iotop
```
Finding max throughput:
```
# Write:
$ dd if=/dev/zero of=./file bs=32k count=10000
# Read:
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches && \
    dd if=./file of=/dev/null bs=32k
```
NOTE: dd can nicely be used to benchmark the throughput of your disk! Just dd from /dev/zero for write performance and to /dev/null for read performance. But you have to use conv=fdatasync when measuring writes, and clear the page cache (see above) before measuring reads.
Avoid copying: use mmap or io_uring. If you use the read() file API, try to minimize copying inside your application.
```go
type ReaderFrom interface {
    ReadFrom(r Reader) (n int64, err error)
}

type WriterTo interface {
    WriteTo(w Writer) (n int64, err error)
}
```
You might have heard that abstractions are costly from a performance point of view, and this is partly true. Please do not take this as an excuse for not adding any abstractions to your code in fear of performance hits.
Most of the bad rap abstractions get comes from interfaces that are not general enough and cannot be extended when performance needs arise.
Example: io.Reader/io.Writer/io.Seeker are very general and hardly specific. From a performance point of view they tend to introduce some extra allocations and some extra copying that a more specialized implementation could get rid of if it knew how it was used.
For example, an io.Reader that reads a compressed stream needs to read big chunks of compressed data, since compression formats work block-oriented. Even if the caller only needs a single byte, the reader still has to decompress a whole block. If the API user then needs another byte a few KB away, the reader might have to throw away the current block and allocate space for a new one while seeking in the underlying stream. This is costly.
Luckily, special cases can be optimized. What if the reader knows that the whole stream is read in one go, basically like FADV_SEQUENTIAL? This is what WriteTo() is for: an io.Reader can implement this method to dump its complete content to the writer specified by w. The knowledge that no seeking is required allows the decompression reader to make some optimizations, i.e. use one big buffer with no need to re-allocate, parallelize reading/decompression and avoid seek calls.
So remember: Keep your abstractions general, check if there are specific patterns on how your API is called and offer optimizations for that.
🏁