Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Batch Normalization: Summary and Applications

Core Concept

Batch Normalization (BN) is a technique used in deep learning to standardize the inputs to a layer for each mini-batch. This reduces internal covariate shift and generally makes training faster and more stable.

Key Steps

  1. Compute mean and variance across the mini-batch
  2. Normalize inputs using these statistics
  3. Scale and shift using learnable parameters (γ and β)

Mathematical Formulation

For a layer with d-dimensional input x = (x⁽¹⁾, …, x⁽ᵈ⁾), we normalize each dimension independently over the m samples of the mini-batch:

  1. Mini-batch mean: μᵦ = (1/m) ∑ᵢ₌₁ᵐ xᵢ
  2. Mini-batch variance: σ²ᵦ = (1/m) ∑ᵢ₌₁ᵐ (xᵢ - μᵦ)²
  3. Normalize: x̂ᵢ = (xᵢ - μᵦ) / √(σ²ᵦ + ε)
  4. Scale and shift: yᵢ = γx̂ᵢ + β

Where:

  m: the number of samples in the mini-batch
  μᵦ, σ²ᵦ: the mini-batch mean and variance
  ε: a small constant added for numerical stability
  γ, β: learnable scale and shift parameters that restore the layer's representational capacity
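
To make the four steps concrete, here is a minimal NumPy sketch of the training-time forward pass (the function name and example values are illustrative, not taken from the original paper):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (m, d): m samples, d features.
    mu = x.mean(axis=0)                     # steps 1-2: per-feature mean and variance
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # step 3: normalize
    return gamma * x_hat + beta             # step 4: scale and shift

x = np.array([[2.0, 4.0, 6.0],
              [4.0, 6.0, 8.0]])
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))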

Benefits

  1. Allows higher learning rates and faster convergence
  2. Reduces sensitivity to weight initialization
  3. Acts as a mild regularizer, in some cases reducing the need for Dropout
  4. Keeps the distribution of layer inputs stable as earlier layers change

Standard Process vs. Adaptive Approaches

Standard Batch Normalization

Applies the same learnable γ and β to every sample, with μᵦ and σ²ᵦ computed from the current mini-batch during training.

Adaptive/Modified Approaches

Let the normalization parameters depend on additional information, for example class-conditional γ and β (see Exercise 4 below) or γ and β computed as functions of the input statistics (see the Bonus Challenge).

Considerations for Specific Tasks (e.g., Medical Imaging)

Challenges

  1. Class imbalance
  2. Importance of preserving original feature distributions
  3. Diverse input data (e.g., different types of scans)

Potential Solutions

  1. Conditional normalization with class-specific γ and β (Exercise 4)
  2. Adaptive schemes where γ and β depend on input statistics (Bonus Challenge)
  3. Computing normalization statistics per modality or scan type rather than across the full batch

Key Takeaways

  1. Batch Normalization is powerful but not one-size-fits-all
  2. Consider the nature of your data and task when applying normalization
  3. In tasks where preserving original distributions is crucial, explore adaptive techniques
  4. Balance between standardization benefits and preserving important feature characteristics

Remember: The goal is to find the right balance between the benefits of normalization and the preservation of task-specific important information in the data.


Batch Normalization: Mathematical Exercises

Exercise 1: Basic Batch Normalization Calculation

Given a mini-batch of 4 samples with 3 features each:

X = [
    [2, 4, 6],
    [4, 6, 8],
    [6, 8, 10],
    [8, 10, 12]
]

Calculate:
a) The mini-batch mean (μᵦ) for each feature
b) The mini-batch variance (σ²ᵦ) for each feature
c) The normalized values (x̂) for each sample, assuming ε = 0.01
d) The final output (y) for each sample, assuming γ = 2 and β = 1
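
A quick NumPy check of parts (a)–(d); the per-feature means come out to [5, 7, 9] and the variances to [5, 5, 5]:

import numpy as np

X = np.array([[2, 4, 6],
              [4, 6, 8],
              [6, 8, 10],
              [8, 10, 12]], dtype=float)

mu = X.mean(axis=0)                        # (a) μᵦ = [5, 7, 9]
var = X.var(axis=0)                        # (b) σ²ᵦ = [5, 5, 5]
x_hat = (X - mu) / np.sqrt(var + 0.01)     # (c) normalized values, ε = 0.01
y = 2 * x_hat + 1                          # (d) output with γ = 2, β = 1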

Exercise 2: Backpropagation through Batch Normalization

Consider a single feature x in a mini-batch of size m = 3:

x = [1, 2, 3]

The normalized value x̂ is calculated as: x̂ = (x - μ) / √(σ² + ε)

Given the upstream gradients ∂L/∂x̂ᵢ with respect to each normalized value:

Calculate:
a) ∂L/∂x (gradient of loss with respect to input x)
b) ∂L/∂μ (gradient of loss with respect to mean)
c) ∂L/∂σ² (gradient of loss with respect to variance)
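
The standard chain-rule expressions from the BN paper give all three gradients. Because the exercise's given values for ∂L/∂x̂ were lost from these notes, the sketch below uses made-up upstream gradients purely to show the computation:

import numpy as np

def batch_norm_backward(x, dl_dxhat, eps=0.01):
    m = x.shape[0]
    mu, var = x.mean(), x.var()
    std_inv = 1.0 / np.sqrt(var + eps)
    x_mu = x - mu
    # (c) ∂L/∂σ² = Σᵢ ∂L/∂x̂ᵢ · (xᵢ - μ) · (-1/2)(σ² + ε)^(-3/2)
    dl_dvar = np.sum(dl_dxhat * x_mu) * -0.5 * std_inv**3
    # (b) ∂L/∂μ = -Σᵢ ∂L/∂x̂ᵢ / √(σ² + ε) + ∂L/∂σ² · (1/m) Σᵢ -2(xᵢ - μ)
    dl_dmu = -np.sum(dl_dxhat) * std_inv + dl_dvar * np.mean(-2.0 * x_mu)
    # (a) ∂L/∂xᵢ = ∂L/∂x̂ᵢ / √(σ² + ε) + ∂L/∂σ² · 2(xᵢ - μ)/m + ∂L/∂μ / m
    dl_dx = dl_dxhat * std_inv + dl_dvar * 2.0 * x_mu / m + dl_dmu / m
    return dl_dx, dl_dmu, dl_dvar

x = np.array([1.0, 2.0, 3.0])
dl_dxhat = np.array([0.1, -0.2, 0.3])  # hypothetical upstream gradients
dl_dx, dl_dmu, dl_dvar = batch_norm_backward(x, dl_dxhat)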

Exercise 3: Effect of Batch Size on Normalization

Compare the normalization results for the following data when treated as:
a) A single batch of 6 samples
b) Two mini-batches of 3 samples each

Data:

X = [1, 2, 3, 4, 5, 6]

Calculate the normalized values (x̂) for both cases, assuming ε = 0.01.
How does the batch size affect the normalization results?
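
A small script to compare the two cases (the helper name is ours):

import numpy as np

def normalize(x, eps=0.01):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

x_hat_full = normalize(X)                       # (a) one batch of 6
x_hat_mini = np.concatenate([normalize(X[:3]),  # (b) two mini-batches of 3
                             normalize(X[3:])])

Note what happens: each mini-batch is centered on its own mean (2 and 5), so [1, 2, 3] and [4, 5, 6] map to identical normalized values, while the single batch of 6 preserves their relative positions around the global mean of 3.5. Smaller batches generally yield noisier, more local statistics.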

Exercise 4: Conditional Batch Normalization

In a binary classification task, you decide to use different normalization parameters for each class. Given:

Class 0 samples: [1, 2, 3]
Class 1 samples: [4, 5, 6]

γ₀ = 1, β₀ = 0 (for Class 0)
γ₁ = 2, β₁ = 1 (for Class 1)

Calculate the final output (y) for each sample using conditional batch normalization.
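
The exercise leaves open whether the statistics, as well as γ and β, are computed per class; the sketch below takes the fully class-conditional reading, so treat it as one possible interpretation rather than the standard formulation:

import numpy as np

def conditional_bn(x, labels, gamma, beta, eps=0.01):
    # Normalize each class with its own statistics, then apply
    # the class-specific scale and shift.
    y = np.empty_like(x)
    for c in np.unique(labels):
        idx = labels == c
        x_hat = (x[idx] - x[idx].mean()) / np.sqrt(x[idx].var() + eps)
        y[idx] = gamma[c] * x_hat + beta[c]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])
y = conditional_bn(x, labels, gamma={0: 1.0, 1: 2.0}, beta={0: 0.0, 1: 1.0})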

Bonus Challenge: Adaptive Normalization

Design a simple adaptive normalization scheme for a 1D input where the normalization parameters (γ and β) are functions of the input mean. Provide the mathematical formulation and explain your reasoning.
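
One possible answer (a design of ours, not a canonical scheme): make γ grow with how far the batch mean sits from zero, and let β reintroduce part of that mean, so inputs with a strong offset keep some of it after normalization:

import numpy as np

def adaptive_norm(x, alpha=0.5, kappa=0.5, eps=0.01):
    mu = x.mean()
    x_hat = (x - mu) / np.sqrt(x.var() + eps)
    gamma = 1.0 + alpha * np.tanh(abs(mu))  # scale grows with |mean|, bounded above
    beta = kappa * mu                       # shift restores part of the original offset
    return gamma * x_hat + beta

The tanh keeps γ bounded in [1, 1 + alpha), and kappa controls how much of the original distribution's location is preserved, which speaks directly to the "preserving original feature distributions" concern raised above.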


Batch Normalization: Exercise Solutions