Making Deep Learning Go Brrrr From First Principles
Website link - (https://horace.io/brrr_intro.html)
- Overfitting: training loss is way lower than test loss
- Regularisation is a waste if training loss is identical to your validation loss - why? Because regularisation trades training fit for generalisation; if there is no gap between training and validation loss, the model isn't overfitting, so there's nothing for regularisation to fix
- Three regimes
- Compute-bound regime
- Memory-bandwidth-bound regime
- Overhead-bound regime
Just like with training ML models, knowing what regime you're in allows you to narrow in on the optimizations that matter. For example, if you're spending all of your time doing memory transfers (i.e. you are in a memory-bandwidth-bound regime), then increasing the FLOPS of your GPU won't help. On the other hand, if you're spending all of your time performing big chonky matmuls (i.e. a compute-bound regime), then rewriting your model logic into C++ to reduce overhead won't help.
Compute-bound regime
- Performance is limited by the processor's ability to perform calculations, not by data transfer rates
- Rewriting model logic in C++ to reduce overhead won't help because the bottleneck is the computation itself, not the overhead from higher-level languages.
- To optimize a compute-bound DL system, aim to use the full computational throughput of the GPU
- Minimize time spent on non-computational tasks (memory transfers, framework overhead) so that processing time is dedicated to intensive calculations like matrix multiplication
- Maximizing compute utilization is the right target because the number of operations is fixed by the algorithm: memory traffic and overhead can be optimized away, but the FLOPs required for a given matmul can't be reduced without changing the algorithm itself
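As a rough sanity check on "using the full computational power," one can measure how many FLOPs per second a matmul actually achieves and compare against the hardware's peak. A minimal CPU sketch with NumPy (the `achieved_gflops` helper and the sizes are my own illustrative choices, not from the article):

```python
import time
import numpy as np

def achieved_gflops(n: int = 1024, repeats: int = 5) -> float:
    """Estimate achieved GFLOP/s for an (n x n) @ (n x n) matmul."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up so allocation/first-call costs don't pollute the timing
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = time.perf_counter() - start
    # Conventional FLOP count for matmul: 2 * n^3 (one multiply + one add
    # per inner-product term), times the number of repetitions
    flops = 2 * n**3 * repeats
    return flops / elapsed / 1e9

print(f"~{achieved_gflops():.1f} GFLOP/s achieved")
```

The gap between this number and the chip's advertised peak is the budget that memory-bandwidth and overhead optimizations can recover; the `2*n^3` itself is irreducible.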
Memory Bandwidth Bound
If we are in a memory-bandwidth-bound regime, then increasing FLOPS won't help... why?
- In a memory-bandwidth-bound regime, the bottleneck is the rate at which data moves between global memory and the compute units, not the computation speed
- The time to fetch data dominates: the compute units sit idle waiting for operands, so extra FLOPS just means more idle silicon
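One way to quantify which regime an op falls into is arithmetic intensity: FLOPs performed per byte moved. The helpers below are a hypothetical sketch (assuming float32 and no caching, with illustrative FLOP counts of my own); an elementwise op like `cos` does far less work per byte than a matmul, so it saturates bandwidth long before it saturates compute:

```python
def elementwise_intensity(n: int, bytes_per_elem: int = 4) -> float:
    # One FLOP per element (counting cos as ~1 op for illustration);
    # each element is read once and written once
    flops = n
    bytes_moved = 2 * n * bytes_per_elem
    return flops / bytes_moved

def matmul_intensity(n: int, bytes_per_elem: int = 4) -> float:
    # (n x n) @ (n x n): 2*n^3 FLOPs; two matrices read, one written
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved

print(elementwise_intensity(1 << 20))  # 0.125 FLOPs/byte: bandwidth-bound
print(matmul_intensity(4096))          # ~683 FLOPs/byte: compute-bound
```

The elementwise intensity is constant no matter how large the tensor gets, which is why no amount of extra FLOPS rescues it; matmul intensity grows with `n`, which is why big matmuls can keep the compute units fed.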
Operator fusion is simple:

```python
# Unfused: two kernels, two round trips to global memory
x1 = x.cos()  # Read x from global memory, write x1
x2 = x1.cos() # Read x1 from global memory, write x2

# Fused: one kernel, one read and one write
x2 = x.cos().cos()  # Read x from global memory, write x2
```
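To make the saving concrete, here's a back-of-the-envelope count of global-memory traffic for the unfused vs. fused versions (a sketch assuming float32 tensors and nothing staying cached between kernel launches; `traffic_bytes` is my own helper, not a PyTorch API):

```python
def traffic_bytes(n_elems: int, reads: int, writes: int, bytes_per_elem: int = 4) -> int:
    """Total bytes moved through global memory for an elementwise op."""
    return (reads + writes) * n_elems * bytes_per_elem

n = 1 << 20  # one million elements (illustrative size)
unfused = traffic_bytes(n, reads=2, writes=2)  # two kernels: x -> x1, x1 -> x2
fused = traffic_bytes(n, reads=1, writes=1)    # one kernel: x -> x2
print(unfused // fused)  # → 2: fusion halves the memory traffic
```

Since a bandwidth-bound op's runtime is roughly proportional to bytes moved, halving the traffic roughly halves the runtime, with no change to the math performed.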
Min-cut optimal recomputation (i.e. activation checkpointing) with AOTAutograd - also worth reading
Reasoning about Memory-Bandwidth Costs
Overhead
- Time spent on everything other than transferring or computing tensors: the Python interpreter, framework dispatch, launching kernels. This is the regime where rewriting model logic in C++ (or tracing/compiling it) actually pays off
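A quick way to feel overhead: issue the same total amount of elementwise work as many tiny calls vs. one big call. In this NumPy sketch (sizes and the `time_adds` helper name are my own illustrative choices), the small-call version pays per-call interpreter and dispatch cost thousands of times:

```python
import time
import numpy as np

def time_adds(n_calls: int, elems_per_call: int) -> float:
    """Time n_calls elementwise adds of elems_per_call elements each."""
    a = np.ones(elems_per_call, dtype=np.float32)
    start = time.perf_counter()
    for _ in range(n_calls):
        a + a  # each call pays fixed interpreter/dispatch overhead
    return time.perf_counter() - start

# Same 100k element-adds in total, split very differently across calls
many_small = time_adds(n_calls=10_000, elems_per_call=10)
one_big = time_adds(n_calls=1, elems_per_call=100_000)
print(f"10k small calls: {many_small*1e3:.2f} ms, one big call: {one_big*1e3:.3f} ms")
```

The big gap between the two timings is pure overhead, which is why small ops in an eager framework are overhead-bound regardless of how fast the hardware is.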
@article{he2022brrrrfromfirstprinciples,
  author = {Horace He},
  title = {Making Deep Learning Go Brrrr From First Principles},
  year = {2022},
  url = {https://horace.io/brrr_intro.html},
}