Making Deep Learning Go Brrrr From First Principles

Website link: https://horace.io/brrr_intro.html

Just like with training ML models, knowing what regime you're in allows you to narrow in on the optimizations that matter. For example, if you're spending all of your time doing memory transfers (i.e., you are in a memory-bandwidth-bound regime), then increasing the FLOPS of your GPU won't help. On the other hand, if you're spending all of your time performing big chonky matmuls (i.e., a compute-bound regime), then rewriting your model logic in C++ to reduce overhead won't help.
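The regime comparison can be sketched as back-of-envelope arithmetic: estimate how long the compute would take at peak FLOPS versus how long the memory traffic would take at peak bandwidth, and whichever is larger bounds the op. The peak numbers below are illustrative assumptions (roughly A100-like), not measured values.

```python
# Hedged sketch: classify an op as compute-bound or memory-bandwidth-bound
# from its FLOP count and bytes moved. Peak numbers are assumed, A100-ish.
PEAK_FLOPS = 312e12   # FLOP/s (assumed fp16 tensor-core peak)
PEAK_BW = 1.5e12      # bytes/s (assumed HBM bandwidth)

def regime(flops, bytes_moved):
    """Whichever takes longer at peak rates bounds the op."""
    t_compute = flops / PEAK_FLOPS
    t_memory = bytes_moved / PEAK_BW
    return "compute-bound" if t_compute > t_memory else "memory-bandwidth-bound"

# Elementwise cos on N fp16 values: ~1 FLOP each, 2 bytes read + 2 bytes written.
N = 1 << 24
print(regime(N, 4 * N))                 # memory-bandwidth-bound

# Square fp16 matmul of size n: 2*n^3 FLOPs, ~3*n^2 elements moved.
n = 4096
print(regime(2 * n**3, 3 * n**2 * 2))   # compute-bound
```

The asymmetry is why elementwise ops almost never saturate compute: they do one FLOP per few bytes moved, while a matmul does thousands of FLOPs per byte.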

Compute-Bound Regime

Memory-Bandwidth-Bound Regime

x1 = x.cos()       # Read from x in global memory, write x1 back out
x2 = x1.cos()      # Read from x1 in global memory, write x2 back out

x2 = x.cos().cos() # Fused: one read of x from global memory, one write of x2

Min-cut optimal recomputation (i.e., activation checkpointing) with AOTAutograd.
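The trade-off that recomputation exploits can be sketched on the `x.cos().cos()` example: the backward pass needs both `x` and the intermediate `x1 = cos(x)`, but instead of saving `x1` (and paying the bandwidth to write and re-read it), we can save only `x` and recompute the cheap `cos` during backward. This is a hand-written pure-Python illustration of that trade-off, not the AOTAutograd min-cut algorithm itself.

```python
# Hedged sketch: forward/backward for x2 = cos(cos(x)), saving only x and
# recomputing x1 = cos(x) in the backward pass. Trades a cheap recomputed
# cos for the memory traffic of saving/loading the intermediate x1.
import math

def forward(x):
    x1 = [math.cos(v) for v in x]
    x2 = [math.cos(v) for v in x1]
    return x2, x                    # save only x for backward, not x1

def backward(grad_out, saved_x):
    x1 = [math.cos(v) for v in saved_x]       # recompute x1 instead of loading it
    dx1 = [-math.sin(v) * g for v, g in zip(x1, grad_out)]   # d x2 / d x1
    dx = [-math.sin(v) * g for v, g in zip(saved_x, dx1)]    # chain through cos
    return dx
```

Since pointwise ops are memory-bandwidth-bound, recomputing them is often faster than the round trip to global memory their saved activations would cost.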

Reasoning about Memory-Bandwidth Costs

Overhead


@article{he2022brrrrfromfirstprinciples,
  author={Horace He},
  title={Making Deep Learning Go Brrrr From First Principles},
  year={2022},
  url={https://horace.io/brrr_intro.html},
}