Linear Kernel: Mathematical Deep Dive

1. Basic Definition

The Linear Kernel is defined as:

$k(x_1, x_2) = v \, x_1^\top x_2$

Where:

$x_1, x_2 \in \mathbb{R}^d$ (d-dimensional real vector space)
$v \in \mathbb{R}^+$ (positive real number)

2. Matrix Formulation

For a set of n input vectors $x_1, \ldots, x_n$, we can form a design matrix $X \in \mathbb{R}^{n \times d}$ with one input per row:

$X = [x_1^\top; x_2^\top; \ldots; x_n^\top]$

The kernel matrix $K \in \mathbb{R}^{n \times n}$ is then:

$K = v \, X X^\top$
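The matrix form can be checked against the entry-wise definition. The following is a minimal NumPy sketch; the function names `linear_kernel` and `linear_kernel_matrix` and the example data are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def linear_kernel(x1, x2, v=1.0):
    """Scalar linear kernel k(x1, x2) = v * x1^T x2 (hypothetical helper)."""
    return v * float(np.dot(x1, x2))

def linear_kernel_matrix(X, v=1.0):
    """Kernel matrix K = v * X X^T for X of shape (n, d), one input per row."""
    X = np.asarray(X, dtype=float)
    return v * (X @ X.T)

# Illustrative data: n = 3 inputs with d = 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
v = 0.5

K = linear_kernel_matrix(X, v)

# Entry (i, j) of K should equal the scalar kernel applied to rows i and j.
K_loop = np.array([[linear_kernel(xi, xj, v) for xj in X] for xi in X])
assert np.allclose(K, K_loop)
```

The vectorized form avoids the double loop and is usually what one would use for anything beyond a handful of points.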

3. Properties

3.1 Positive Semi-Definiteness

The Linear Kernel is positive semi-definite (PSD), which is a crucial property for kernel functions. To prove this:

For any vector $\alpha \in \mathbb{R}^n$:

$\alpha^\top K \alpha = v \, \alpha^\top (X X^\top) \alpha = v \, (X^\top \alpha)^\top (X^\top \alpha) = v \, \|X^\top \alpha\|^2 \geq 0$

This holds because v > 0 and the squared norm is always non-negative.
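The argument can also be checked numerically. This sketch assumes random Gaussian data and an arbitrary choice of v = 2; it compares the quadratic form with the squared-norm expression and inspects the eigenvalues, which should be non-negative up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, v = 6, 3, 2.0          # illustrative sizes and variance
X = rng.normal(size=(n, d))  # design matrix, one input per row
K = v * (X @ X.T)

alpha = rng.normal(size=n)

# Left-hand side: the quadratic form alpha^T K alpha.
quad_form = alpha @ K @ alpha

# Right-hand side: v * ||X^T alpha||^2.
norm_form = v * np.linalg.norm(X.T @ alpha) ** 2

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
eigvals = np.linalg.eigvalsh(K)
```

A single random `alpha` is only a spot check; the eigenvalue test covers all directions at once.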

3.2 Linearity in Feature Space

The Linear Kernel corresponds to the identity feature map, so any model built on it is a linear function of the original inputs. With $\phi(x) = x$ as the feature map:

$k(x_1, x_2) = v \, \langle \phi(x_1), \phi(x_2) \rangle$

Where $\langle \cdot, \cdot \rangle$ denotes the inner product.

4. Connection to Linear Regression

The Linear Kernel is closely related to linear regression. In the context of Gaussian Processes:

$f(x) \sim \mathcal{GP}(0, k(x, x'))$

With a Linear Kernel, this is equivalent to Bayesian Linear Regression:

$f(x) = w^\top x$, where $w \sim \mathcal{N}(0, vI)$
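
The equivalence can be checked from the first two moments. A short derivation, assuming the weights have zero mean and covariance $vI$, and that $f(x) = w^\top x$ has no intercept or noise term:

$\mathbb{E}[f(x)] = \mathbb{E}[w]^\top x = 0$

$\operatorname{Cov}(f(x), f(x')) = \mathbb{E}[(w^\top x)(w^\top x')] = x^\top \mathbb{E}[w w^\top] x' = v \, x^\top x' = k(x, x')$

Because $f$ is a linear function of a Gaussian vector, any finite collection of its values is jointly Gaussian, so matching the mean and covariance is enough to match the GP prior.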

5. Eigendecomposition

The eigendecomposition of K can provide insights:

$K = U \Lambda U^\top$

Where:

$U$ is the matrix of eigenvectors
$\Lambda$ is a diagonal matrix of eigenvalues

For the Linear Kernel:

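The rank of K is at most min(n, d), so at most min(n, d) eigenvalues are non-zero
The non-zero eigenvalues of K equal those of $v \, X^\top X$; when the data are centered, they are proportional to the variances along the principal directions of the data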
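
Both properties can be illustrated numerically. This sketch assumes random Gaussian data with n = 8 and d = 3, so the rank bound is 3; it uses NumPy's `linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, v = 8, 3, 1.5          # illustrative sizes and variance
X = rng.normal(size=(n, d))
K = v * (X @ X.T)            # n x n kernel matrix

# Eigendecomposition K = U diag(eigvals) U^T.
eigvals, U = np.linalg.eigh(K)

# Numerical rank of K; bounded above by min(n, d).
rank = np.linalg.matrix_rank(K)

# The d x d matrix v * X^T X shares its eigenvalues with the
# non-zero part of K's spectrum.
small_eigvals = np.linalg.eigvalsh(v * (X.T @ X))
```

When d is much smaller than n, working with the d x d matrix is the cheaper route to the same non-zero spectrum.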

6. Gram Matrix Computation

In practice, we often need the Gram matrix between two sets of inputs. For inputs $X = [x_1^\top; \ldots; x_n^\top]$ and $Y = [y_1^\top; \ldots; y_m^\top]$, each stacked with one vector per row:

$K_{XY} = v \, X Y^\top$

Elements: $(K_{XY})_{ij} = v \, x_i^\top y_j$
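
A rectangular Gram matrix of this kind might be computed as below. The function name `linear_gram` and the dimension check are assumptions made for illustration; both inputs are taken to hold one vector per row.

```python
import numpy as np

def linear_gram(X, Y, v=1.0):
    """Cross Gram matrix with entries v * x_i^T y_j (hypothetical helper).

    X has shape (n, d) and Y has shape (m, d); the result has shape (n, m).
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    if X.shape[1] != Y.shape[1]:
        raise ValueError("X and Y must have the same number of columns")
    return v * (X @ Y.T)

# Illustrative data: n = 3 and m = 2 points in d = 2 dimensions.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Y = np.array([[2.0, 3.0],
              [4.0, 5.0]])

K_XY = linear_gram(X, Y, v=2.0)
```

Passing the same array for both arguments recovers the square kernel matrix.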

7. Derivative

The derivative of the kernel with respect to its inputs is useful for optimization:

$\frac{\partial k (x_{1}, x_{2})}{\partial x_{1}} = v \cdot x_{2}$

$\frac{\partial k (x_{1}, x_{2})}{\partial x_{2}} = v \cdot x_{1}$
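
These gradients can be sanity-checked with central finite differences. The helper `numerical_grad`, the step size, and the example vectors are assumptions chosen for illustration.

```python
import numpy as np

def linear_kernel(x1, x2, v=1.0):
    """Scalar linear kernel k(x1, x2) = v * x1^T x2."""
    return v * float(np.dot(x1, x2))

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

v = 0.7
x1 = np.array([1.0, -2.0, 0.5])
x2 = np.array([3.0, 0.0, -1.0])

# Gradient with respect to the first argument; analytic value is v * x2.
grad_x1 = numerical_grad(lambda x: linear_kernel(x, x2, v), x1)

# Gradient with respect to the second argument; analytic value is v * x1.
grad_x2 = numerical_grad(lambda x: linear_kernel(x1, x, v), x2)
```

Because the kernel is linear in each argument, the finite-difference estimate agrees with the analytic gradient up to rounding error.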

8. Variance Parameter

The variance parameter v scales the kernel linearly, so the derivative with respect to v is the unscaled inner product:

$\frac{\partial k(x_1, x_2)}{\partial v} = x_1^\top x_2$

This gradient is used when learning v from data.
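
At the level of the whole kernel matrix, the same derivative is $X X^\top$, which equals $K / v$. A brief sketch under assumed random data; the name `kernel_matrix` and the step `h` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))  # illustrative design matrix
v = 1.3

def kernel_matrix(X, v):
    """K = v * X X^T, with one input per row of X."""
    return v * (X @ X.T)

# Analytic derivative of K with respect to v.
dK_dv = X @ X.T

# Forward difference; K is linear in v, so any step size recovers
# the derivative up to rounding error.
h = 0.1
dK_dv_fd = (kernel_matrix(X, v + h) - kernel_matrix(X, v)) / h
```

An optimizer learning v would combine this matrix with the gradient of its objective with respect to K.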

9. Relation to Distance Metrics

The Linear Kernel is related to the Euclidean distance:

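$\|x_1 - x_2\|^2 = (x_1 - x_2)^\top (x_1 - x_2) = x_1^\top x_1 + x_2^\top x_2 - 2 \, x_1^\top x_2$

If $x_1$ and $x_2$ are normalized ($\|x_1\| = \|x_2\| = 1$), then $x_1^\top x_2 = 1 - \tfrac{1}{2} \|x_1 - x_2\|^2$, and so:

$k(x_1, x_2) = v \left(1 - \tfrac{1}{2} \|x_1 - x_2\|^2\right) \propto 1 - \tfrac{1}{2} \|x_1 - x_2\|^2$

For unit-norm inputs, a larger kernel value therefore means a smaller Euclidean distance.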
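
The identity for unit-norm inputs is easy to confirm numerically. This sketch assumes two random vectors that are normalized explicitly and an arbitrary v = 2.

```python
import numpy as np

rng = np.random.default_rng(3)
v = 2.0                      # illustrative variance

# Draw two random vectors and scale them to unit Euclidean norm.
x1 = rng.normal(size=4)
x2 = rng.normal(size=4)
x1 = x1 / np.linalg.norm(x1)
x2 = x2 / np.linalg.norm(x2)

# Kernel value computed directly from the inner product.
k_direct = v * (x1 @ x2)

# Kernel value recovered from the squared Euclidean distance.
sq_dist = np.sum((x1 - x2) ** 2)
k_from_distance = v * (1.0 - 0.5 * sq_dist)
```

Without the normalization step the two quantities generally differ, since the norms $x_1^\top x_1$ and $x_2^\top x_2$ no longer equal one.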