RBF Kernel

RBF Kernel: Mathematical Deep Dive and Implementation

1. Mathematical Definition

The RBF Kernel is defined as:

$k(x_1, x_2) = \exp\left(-\frac{1}{2} (x_1 - x_2)^\top \Theta^{-2} (x_1 - x_2)\right)$

Where:

$x_1, x_2 \in \mathbb{R}^d$ (d-dimensional real vector space)
$\Theta$ is the lengthscale parameter (a scalar, or a vector with one lengthscale per input dimension for ARD, in which case it acts as a diagonal matrix)
$\exp (\cdot)$ is the exponential function
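The definition above can be evaluated directly for a single pair of points. The following is a minimal NumPy sketch, not the library's code; the function name `rbf_kernel` and the example values are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale):
    """Evaluate exp(-0.5 * (x1 - x2)^T Theta^{-2} (x1 - x2)).

    `lengthscale` is a scalar or a length-d vector (the diagonal of Theta).
    """
    diff = (np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)) / lengthscale
    return float(np.exp(-0.5 * np.dot(diff, diff)))

x1 = np.array([0.0, 0.0])
x2 = np.array([1.0, 2.0])

k_same = rbf_kernel(x1, x1, lengthscale=1.0)  # identical inputs give exactly 1.0
k_far = rbf_kernel(x1, x2, lengthscale=1.0)   # squared distance 5, so exp(-2.5)
```

The kernel value always lies in (0, 1] and equals 1 only when the two inputs coincide.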

2. Key Properties

2.1 Positive Definiteness

The RBF Kernel is positive definite: for any finite set of distinct inputs, the kernel matrix is symmetric positive definite. This is what makes it a valid covariance function for Gaussian Processes.
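Positive definiteness can be checked numerically on a small example. This sketch (the helper name `rbf_gram` and the grid of inputs are assumptions chosen for illustration) builds the kernel matrix for six distinct points and inspects its eigenvalues.

```python
import numpy as np

def rbf_gram(X, lengthscale=1.0):
    """Kernel matrix K[i, j] = k(X[i], X[j]) for inputs X of shape (n, d)."""
    Z = np.asarray(X, dtype=float) / lengthscale
    sq_dist = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist)

# Six distinct one-dimensional inputs: 0, 1, ..., 5.
X = np.arange(6.0)[:, None]
K = rbf_gram(X, lengthscale=1.0)

# All eigenvalues of a symmetric positive definite matrix are positive.
min_eig = np.linalg.eigvalsh(K).min()

# A Cholesky factorization exists only for positive definite matrices.
L = np.linalg.cholesky(K)
```

In practice, closely spaced inputs or large lengthscales make the matrix nearly singular, which is why GP code commonly adds a small jitter to the diagonal before factorizing.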

2.2 Stationarity

It is a stationary kernel, meaning it depends on the inputs only through their difference: $k(x_1, x_2) = k(x_1 - x_2)$.
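Stationarity implies that translating both inputs by the same vector leaves the kernel value unchanged. A short sketch, with illustrative names and values:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    diff = (np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)) / lengthscale
    return float(np.exp(-0.5 * np.dot(diff, diff)))

x1 = np.array([0.3, -1.2])
x2 = np.array([1.0, 0.5])
shift = np.array([10.0, -7.0])

k_original = rbf_kernel(x1, x2)
# Shifting both points keeps x1 - x2 the same, so the kernel value is the same.
k_shifted = rbf_kernel(x1 + shift, x2 + shift)
```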

2.3 Isotropy / Anisotropy

Isotropic when $\Theta$ is a scalar (the same lengthscale in every direction)
Anisotropic when $\Theta$ is a vector (a different lengthscale for each dimension)
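The difference between the two cases shows up when the displacement between the inputs is rotated. In this sketch (names and values are illustrative), a unit step along the first axis is compared with a unit step along the second axis.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale):
    diff = (np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)) / lengthscale
    return float(np.exp(-0.5 * np.dot(diff, diff)))

origin = np.array([0.0, 0.0])
step_axis0 = np.array([1.0, 0.0])
step_axis1 = np.array([0.0, 1.0])  # the same step rotated by 90 degrees

# Scalar lengthscale: only the length of the displacement matters.
iso_0 = rbf_kernel(origin, step_axis0, lengthscale=2.0)  # exp(-0.125)
iso_1 = rbf_kernel(origin, step_axis1, lengthscale=2.0)  # exp(-0.125)

# Vector lengthscale: the direction of the displacement matters too.
ard = np.array([0.5, 2.0])
aniso_0 = rbf_kernel(origin, step_axis0, lengthscale=ard)  # exp(-2.0)
aniso_1 = rbf_kernel(origin, step_axis1, lengthscale=ard)  # exp(-0.125)
```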

3. Lengthscale Parameter (Θ)

Controls how quickly the function varies with its inputs (loosely, its smoothness)
Larger $\Theta$: more slowly varying functions, longer-range correlations
Smaller $\Theta$: more rapidly varying functions, shorter-range correlations
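To make the effect concrete, the following sketch evaluates the kernel for two points a fixed distance of 1 apart under three lengthscales (the values are arbitrary choices for illustration).

```python
import numpy as np

distance = 1.0
lengthscales = np.array([0.5, 1.0, 2.0])

# For a scalar lengthscale the kernel reduces to exp(-0.5 * (distance / lengthscale)^2).
correlations = np.exp(-0.5 * (distance / lengthscales) ** 2)
# lengthscale 0.5 -> exp(-2.0), 1.0 -> exp(-0.5), 2.0 -> exp(-0.125)
```

The correlation between the two function values grows with the lengthscale, so functions drawn from the GP change less over the same distance.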

4. Implementation Details

4.1 Efficient Computation

The code uses an efficient computation strategy:

Divide inputs by lengthscale: $x' = \frac{x}{\Theta}$
Compute squared distances: $d^2 = \| x'_1 - x'_2 \|^2$
Apply RBF function: $k = \exp\left(-\frac{1}{2} d^2\right)$
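The three steps above can be sketched in NumPy for whole batches of inputs. This is an illustration of the strategy under the assumption that squared distances are expanded as $\|a\|^2 + \|b\|^2 - 2 a^\top b$; it is not the library's actual code, and the function name `rbf_kernel_matrix` is hypothetical.

```python
import numpy as np

def rbf_kernel_matrix(X1, X2, lengthscale):
    """Kernel matrix between rows of X1 (n, d) and rows of X2 (m, d)."""
    # Step 1: divide inputs by the lengthscale (scalar or length-d vector).
    X1s = np.asarray(X1, dtype=float) / lengthscale
    X2s = np.asarray(X2, dtype=float) / lengthscale

    # Step 2: squared distances via ||a||^2 + ||b||^2 - 2 a.b, using one matrix product.
    sq_norms1 = np.sum(X1s ** 2, axis=1)[:, None]
    sq_norms2 = np.sum(X2s ** 2, axis=1)[None, :]
    sq_dist = sq_norms1 + sq_norms2 - 2.0 * (X1s @ X2s.T)
    # Round-off can produce tiny negative values, so clamp at zero.
    sq_dist = np.maximum(sq_dist, 0.0)

    # Step 3: apply the RBF function.
    return np.exp(-0.5 * sq_dist)

rng = np.random.default_rng(0)
X1 = rng.normal(size=(5, 3))
X2 = rng.normal(size=(4, 3))
K = rbf_kernel_matrix(X1, X2, lengthscale=1.5)

# Direct evaluation of one entry for comparison.
k_00 = np.exp(-0.5 * np.sum(((X1[0] - X2[0]) / 1.5) ** 2))
```

Scaling the inputs first means the same distance routine serves both the scalar and the per-dimension lengthscale cases.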

4.2 Automatic Relevance Determination (ARD)

When ard_num_dims > 1, each input dimension gets its own lengthscale, allowing the model to determine which features are most relevant.
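The relevance interpretation can be seen in a small NumPy sketch (illustrative names and values, not the library's API): a dimension with a very large lengthscale barely affects the kernel, so the model effectively ignores it.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale):
    diff = (np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)) / lengthscale
    return float(np.exp(-0.5 * np.dot(diff, diff)))

# One lengthscale per input dimension; dimension 1 has a very large lengthscale.
lengthscale = np.array([1.0, 100.0])

base = np.array([0.0, 0.0])
k_move_dim0 = rbf_kernel(base, np.array([1.0, 0.0]), lengthscale)  # exp(-0.5)
k_move_dim1 = rbf_kernel(base, np.array([0.0, 1.0]), lengthscale)  # very close to 1
```

Moving along dimension 0 changes the correlation noticeably, while the same move along dimension 1 leaves it almost at 1. After fitting, small learned lengthscales therefore indicate dimensions the function is sensitive to.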

4.3 Handling Diagonal Computations

The diag parameter allows efficient computation when only the diagonal of the kernel matrix is needed.
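A diagonal-only computation pairs each row of the first input with the same row of the second, avoiding the full matrix. The sketch below is an assumption about how such a shortcut can look in NumPy; the function name `rbf_kernel_diag` is hypothetical.

```python
import numpy as np

def rbf_kernel_diag(X1, X2, lengthscale):
    """Return k(X1[i], X2[i]) for each row i, without forming the full matrix."""
    diff = (np.asarray(X1, dtype=float) - np.asarray(X2, dtype=float)) / lengthscale
    return np.exp(-0.5 * np.sum(diff ** 2, axis=1))

X = np.arange(12.0).reshape(4, 3)

# With identical inputs every diagonal entry is k(x, x) = 1.
diag_same = rbf_kernel_diag(X, X, lengthscale=2.0)

# Each row shifted by 1 in all 3 dimensions: squared scaled distance 3 * 0.25 = 0.75.
diag_shifted = rbf_kernel_diag(X, X + 1.0, lengthscale=2.0)  # exp(-0.375) in every entry
```

This costs O(n d) time and memory instead of O(n^2 d) for the full matrix.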

4.4 Gradient Considerations

The implementation checks for cases where gradients are required (e.g., when inputs require gradients or using ARD) and uses a more general computation in these cases.

5. Mathematical Breakdown of the Implementation

5.1 Distance Computation

The covar_dist method computes squared distances:

$d^2 = \left\| \frac{x_1}{\Theta} - \frac{x_2}{\Theta} \right\|^2$

5.2 RBF Function

The postprocess_rbf function applies the RBF transformation:

$k = \exp\left(-\frac{1}{2} d^2\right)$

5.3 Optimized Computation

For cases without gradients or ARD, it uses RBFCovariance.apply, which likely implements a more efficient, possibly GPU-optimized version of the computation.

6. Relation to Other Concepts

6.1 Fourier Transform

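The Fourier transform of the RBF kernel is another Gaussian: its spectral density is Gaussian, with a width inversely proportional to the lengthscale. This makes the kernel useful for spectral methods such as random Fourier features.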
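One spectral method that relies on this property is random Fourier features, which approximates the kernel by sampling frequencies from its Gaussian spectral density. The sketch below is a minimal illustration for a scalar lengthscale; the function name, feature count, and seed are assumptions.

```python
import numpy as np

def random_fourier_features(X, lengthscale, num_features, rng):
    """Map X (n, d) to features Z such that Z @ Z.T approximates the RBF kernel."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    # Frequencies drawn from the spectral density N(0, lengthscale^-2 * I).
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
lengthscale = 1.5

Z = random_fourier_features(X, lengthscale, num_features=20000, rng=rng)
K_approx = Z @ Z.T

# Exact kernel matrix for comparison.
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-0.5 * sq_dist / lengthscale ** 2)

max_error = np.max(np.abs(K_approx - K_exact))
```

The approximation error shrinks roughly as one over the square root of the number of features.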

6.2 Connection to Normal Distribution

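As a function of $x_1 - x_2$, the RBF kernel has the shape of an unnormalized Gaussian density with covariance $\Theta^2$. Because $k(x, x) = 1$, the value $k(x_1, x_2)$ is also the correlation between the function values $f(x_1)$ and $f(x_2)$ when the latent function is modeled as a Gaussian process with this kernel.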

7. Practical Considerations

No explicit outputscale parameter (use with ScaleKernel for scaling)
Handles batch computations
Allows for active dimension selection
Supports priors and constraints on the lengthscale parameter

8. Derivatives

The derivative with respect to the first input $x_1$ is:

$\frac{\partial k}{\partial x_1} = -k(x_1, x_2) \, \Theta^{-2} (x_1 - x_2)$

By symmetry, the derivative with respect to $x_2$ has the opposite sign.

This is useful for optimization and certain GP techniques like gradient matching.
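The gradient formula can be checked against central finite differences. The sketch below takes the derivative with respect to the first input and uses a per-dimension lengthscale; all names and values are illustrative.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale):
    diff = (x1 - x2) / lengthscale
    return np.exp(-0.5 * np.dot(diff, diff))

def rbf_grad_x1(x1, x2, lengthscale):
    """Analytic gradient with respect to x1: -k(x1, x2) * Theta^{-2} (x1 - x2)."""
    return -rbf_kernel(x1, x2, lengthscale) * (x1 - x2) / lengthscale ** 2

x1 = np.array([0.4, -0.3, 1.1])
x2 = np.array([-0.2, 0.8, 0.5])
lengthscale = np.array([0.7, 1.3, 2.0])

analytic = rbf_grad_x1(x1, x2, lengthscale)

# Central finite differences, one coordinate of x1 at a time.
eps = 1e-6
numeric = np.zeros_like(x1)
for i in range(x1.size):
    step = np.zeros_like(x1)
    step[i] = eps
    k_plus = rbf_kernel(x1 + step, x2, lengthscale)
    k_minus = rbf_kernel(x1 - step, x2, lengthscale)
    numeric[i] = (k_plus - k_minus) / (2.0 * eps)
```

The gradient points from $x_1$ toward $x_2$: moving the first input closer to the second increases the kernel value.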