
Day 4: Foiled by a Napkin

Today

CPU Caching (code)

I was able to get a piece of code running which I think tests something very specific related to CPU caching.

fn run(stride_length: usize) {
    const COUNT: usize = 10000;
    // One byte is read every `stride_length` bytes, so the buffer always
    // spans COUNT strides regardless of the stride length.
    let data = vec![1u8; COUNT * stride_length];

    for _ in 0..1000 {
        for i in 0..COUNT {
            let datum: &u8 = &data[i * stride_length];

            // Volatile read so the compiler can't optimize the loads away.
            unsafe {
                let _ = std::ptr::read_volatile(datum);
            }
        }
    }
}
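
For reference, here is a minimal sketch of how the stride sweep could be driven; this is illustrative, not the exact harness behind the numbers below:

use std::time::Instant;

fn main() {
    // Remember to build with --release, otherwise nothing is optimized anyway.
    // Time `run` at a range of stride lengths and print each result.
    for stride_length in [1, 2, 4, 8, 16, 32, 64, 96, 128, 192, 256] {
        let start = Instant::now();
        run(stride_length);
        println!("stride {:>4}: {:?}", stride_length, start.elapsed());
    }
}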

This time:

  • There is a fixed number of iterations
  • We are reading a single byte per iteration

My hypothesis for this code is that performance will be better when the stride length is small than when it is large. In particular, once the stride length becomes greater than CACHE_LINE_SIZE / 2 - 1, each byte we read could potentially be a cache miss, and we would need to bring that cache line in from memory.

Runtime vs Stride Length

For once, reality agreed with my hypothesis. Runtime increases linearly for strides of length < 128 bytes (the cache line size on an M3 chip). After 128 bytes it bounces around but stays at roughly the same level (5-7.5ms), because the stride is large enough that every single read incurs at least one cache miss and requires the CPU to read from memory. I’m not quite sure why the runtime is so noisy past 128, but I would bet it has something to do with alignment. For example, stepping by 192 bytes at a time (1.5 times the size of a cache line) takes approximately 5.7ms, which is much lower than the 8ms runtime with a stride length of 174.

What is cool to see here is that we can do a bit of reverse Napkin Math to make sure that the numbers we are seeing line up with what we should have expected (reminder to try to do some of these calculations before running the experiments next time!).

Never mind: now that I’ve done the Napkin Math, I’m getting surprising results. For 1-byte strides, we expect 127 cache hits + 1 read from main memory per 128 strides, since 128 bytes is the cache line size on an M3. From Latency Numbers Every Programmer Should Know: (0.5ns L1 hit latency * 127 hits + 100ns cache miss) per 128 strides * (10000 bytes of data / 128-byte cache line) * 1000 iterations ≈ 12.8ms. But we’re getting about 2ms, so we’re off by roughly 6x, almost an order of magnitude!
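
Spelling that estimate out (the latency constants are the generic napkin numbers, not anything measured on the M3):

fn main() {
    // Napkin-math estimate for 1-byte strides.
    // These latency numbers are generic assumptions, not M3 measurements.
    let l1_hit_ns = 0.5_f64;     // L1 cache hit
    let mem_ref_ns = 100.0_f64;  // main memory reference (cache miss)
    let cache_line = 128.0_f64;  // M3 cache line size, in bytes
    let data_bytes = 10_000.0_f64;
    let iterations = 1_000.0_f64;

    // Each 128-byte line costs 127 L1 hits plus 1 miss.
    let per_line_ns = (cache_line - 1.0) * l1_hit_ns + mem_ref_ns;
    let lines_touched = data_bytes / cache_line * iterations;
    println!("expected: ~{:.1} ms", per_line_ns * lines_touched / 1e6); // ~12.8 ms
}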

I spent a lot of the rest of the day trying to figure out where the delta between my empirical results and the napkin math is coming from. I even tried provisioning a bare metal server from Vultr and running my code on that to see if something special is happening on my M3 machine. But this took up most of my day and I’m not sure I gained any additional understanding.

To summarize what I’ve learned so far:

  • When doing these sorts of performance experiments, you have to make sure the compiler isn’t optimizing away something you think should impact your results (see the sketch after this list)
  • Don’t assume cache lines are 64 bytes
  • Don’t assume any particular caching architecture. Different CPUs can do things very differently, e.g. M-series chips have P/E cores
  • Running this code on shared VMs in the public cloud will not give reliable results, because of the noisy neighbour problem. You need bare metal servers
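
On the first point: a loop whose result is never observed can be deleted outright by the optimizer in release builds. read_volatile (as in the code above) is one guard against this; std::hint::black_box is another. A rough sketch of the difference:

use std::hint::black_box;
use std::time::Instant;

fn main() {
    let data = vec![1u8; 10_000_000];

    // Nothing observes `sum`, so in release mode the compiler is free to
    // delete this loop entirely, making it look suspiciously fast.
    let start = Instant::now();
    let mut sum = 0u64;
    for &b in &data {
        sum += b as u64;
    }
    let _ = sum;
    println!("unobserved loop: {:?}", start.elapsed());

    // black_box tells the compiler to assume the value is used, so the
    // reads actually have to happen.
    let start = Instant::now();
    let mut sum = 0u64;
    for &b in &data {
        sum += b as u64;
    }
    black_box(sum);
    println!("observed loop:   {:?}", start.elapsed());
}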

None of these were really what I set out to learn (impact of data structure alignment) but nonetheless I’m happy to get started learning about this topic.

Summary

I aimed to get started on the PyTorch ARENA material today (which I didn’t touch) and to continue with some CPU caching experiments (which I spent most of my day on).

I’m not super upbeat about my progress today, since it felt like I just spun my wheels trying to understand why my napkin math didn’t line up with reality. I’ve been in this place before, where I’m unable to reconcile theoretical calculations with empirical measurements. I get frustrated because I end up randomly changing my code, trying to get the numbers to line up. The better thing to do would be to form a hypothesis and test it specifically, but it’s hard to make hypotheses without a better understanding of the fundamentals. I’m contemplating going through nand2tetris so I can start developing a more fundamental understanding of this material. I think at that point it will become easier to know what questions to ask.

Tomorrow

There are two calls I’m excited about tomorrow: one on building our volitional muscles and the other on AI safety. Given that I don’t feel like I accumulated much momentum today, I think it would be best tomorrow to actually make progress on finishing my bytecode interpreter.