Neural Networks Zero to Hero

This is the playlist:

Exercises:

The spelled-out intro to neural networks and backpropagation: building micrograd

You should now be able to complete the following Google Colab. Good luck!
https://colab.research.google.com/drive/1FPTx1RXtBfc4MaTkf7viZZD4U2F9gtKN?usp=sharing

Exercises:

The spelled-out intro to language modeling: building makemore

Useful links for practice:

Python + NumPy tutorial from CS231n: https://cs231n.github.io/python-numpy... We use torch.tensor instead of numpy.array in this video. Their designs (e.g. broadcasting, data types, etc.) are so similar that practicing one is basically practicing the other. Just be careful with the APIs: how various functions are named, what arguments they take, and similar details can vary.
PyTorch tutorial on Tensor https://pytorch.org/tutorials/beginne...
Another PyTorch intro to Tensor https://pytorch.org/tutorials/beginne...
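
As a small illustration of the note above about NumPy and PyTorch being similar but not identical, here is a minimal sketch that normalizes the rows of a matrix in both libraries. The array values are made up for the example. Broadcasting behaves the same way in both; the conventional argument names differ (axis/keepdims in NumPy, dim/keepdim in PyTorch).

```python
import numpy as np
import torch

# The same 2x3 matrix in both libraries (values are arbitrary, for illustration).
a_np = np.arange(6, dtype=np.float32).reshape(2, 3) + 1
a_t = torch.arange(6, dtype=torch.float32).reshape(2, 3) + 1

# Row sums, kept as a column so they broadcast against the (2, 3) matrix.
row_sums_np = a_np.sum(axis=1, keepdims=True)  # NumPy naming: axis, keepdims
row_sums_t = a_t.sum(dim=1, keepdim=True)      # PyTorch naming: dim, keepdim

# (2, 3) / (2, 1) broadcasts the same way in both: each row is divided by its own sum.
p_np = a_np / row_sums_np
p_t = a_t / row_sums_t

print(np.allclose(p_np, p_t.numpy()))
```

If keepdims/keepdim is dropped, the sums have shape (2,) and the division no longer lines up with the rows, so it is worth checking shapes whenever you rely on broadcasting.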

Exercises:
E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?
E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?
E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model, i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best smoothing setting and evaluate on the test set once, at the end. How good of a loss do you achieve?
E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
E06: meta-exercise! Think of a fun/interesting exercise and complete it.
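
A quick way to convince yourself of the claim in E04 is to check it numerically. This is only a sketch, not the video's code: the vocabulary size of 27, the random W, and the example indices in xs are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
vocab_size = 27                            # assumed: 26 letters + 1 special token
W = torch.randn((vocab_size, vocab_size), generator=g)
xs = torch.tensor([0, 5, 13, 13, 1])       # hypothetical input character indices

# Explicit one-hot encoding followed by a matrix multiply.
xenc = F.one_hot(xs, num_classes=vocab_size).float()
logits_onehot = xenc @ W

# Plain row indexing: picks out the same rows without building xenc.
logits_index = W[xs]

same = torch.allclose(logits_onehot, logits_index)
print(same)
```

Indexing avoids allocating the one-hot matrix and the mostly-zero multiply, which is the wastefulness the exercise points at.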

Exercises:

Building makemore Part 2: MLP

Useful links:

PyTorch internals ref: http://blog.ezyang.com/2019/05/pytorc...

E01: Tune the hyperparameters of the training to beat my best validation loss of 2.2
E02: I was not careful with the initialization of the network in this video. (1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve? (2) Can you tune the initialization to get a starting loss that is much more similar to (1)?
E03: Read the Bengio et al. 2003 paper (linked in the Part 3 useful links below), implement and try any idea from the paper. Did it work?
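
For part (1) of E02, the reference number can be worked out without training anything: under a perfectly uniform prediction, every character gets probability 1/N, so the average negative log likelihood is -log(1/N). The sketch below assumes a vocabulary of 27 characters (26 letters plus one special token); substitute your own vocabulary size if it differs.

```python
import math

vocab_size = 27  # assumption: 26 letters + 1 special token


def uniform_nll(num_classes: int) -> float:
    """Negative log likelihood when every class is predicted with probability 1/num_classes."""
    return -math.log(1.0 / num_classes)


expected_initial_loss = uniform_nll(vocab_size)
print(f"{expected_initial_loss:.4f}")  # prints 3.2958
```

An initial loss far above this value suggests the network starts out confidently wrong, which is what part (2) asks you to fix.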

Exercises:

Building makemore Part 3: Activations & Gradients, BatchNorm

Useful links:

"Kaiming init" paper: https://arxiv.org/abs/1502.01852
BatchNorm paper: https://arxiv.org/abs/1502.03167
Bengio et al. 2003 MLP language model paper (pdf): https://www.jmlr.org/papers/volume3/b...
Good paper illustrating some of the problems with batchnorm in practice: https://arxiv.org/abs/2105.07576

E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that (1) the network trains just fine or (2) the network doesn't train at all, but actually it is (3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.
E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc., has the big advantage that after training, the batchnorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W, b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference. In other words, the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! Pretty cool.
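
The algebra behind the fold in E02 can be checked on a single Linear + BatchNorm pair before tackling the full 3-layer MLP. This is a minimal sketch under stated assumptions, not a solution to the exercise: the "trained" parameters are just random tensors, the layer sizes are arbitrary, inference is assumed to use fixed running statistics (running_mean, running_var), and the linear layer is written as x @ W + b rather than through nn.Linear.

```python
import torch

torch.manual_seed(0)
n_in, n_out, batch = 10, 8, 32   # arbitrary sizes for illustration
eps = 1e-5

# Stand-ins for trained parameters (random here, purely hypothetical).
W = torch.randn(n_in, n_out)
b = torch.randn(n_out)
gamma = torch.rand(n_out) + 0.5          # batchnorm gain
beta = torch.randn(n_out)                # batchnorm bias
running_mean = torch.randn(n_out)        # fixed statistics used at inference
running_var = torch.rand(n_out) + 0.5

x = torch.randn(batch, n_in)

# Inference-time forward pass: linear layer, then batchnorm with running stats.
h = x @ W + b
y_bn = gamma * (h - running_mean) / torch.sqrt(running_var + eps) + beta

# Fold: batchnorm at inference is a per-feature affine map, so it can be
# absorbed into the weights and bias of the linear layer before it.
scale = gamma / torch.sqrt(running_var + eps)
W2 = W * scale                           # scales each output column of W
b2 = (b - running_mean) * scale + beta
y_folded = x @ W2 + b2

max_abs_diff = (y_bn - y_folded).abs().max().item()
print(torch.allclose(y_bn, y_folded, atol=1e-4))
```

The two outputs should agree up to floating-point rounding. Note that the fold needs the running mean and variance as well as gamma and beta, and it only applies to the inference-time behavior, where those statistics are constants.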