
The neural tangent kernel

The neural tangent kernel was one of the few deep-learning theory ideas that were useful before they became a buzzword. It doesn't solve generalization, but it makes a very stubborn object analyzable. Jacot, Gabriel, and Hongler [1] showed that if you take a network to infinite width under the right scaling, gradient descent on the parameters becomes kernel gradient descent in function space. The kernel isn't chosen by hand; it's induced by the network at initialization. For a model $f_\theta$, the tangent kernel is $K_\theta(x, x') = \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x')$. In finite networks this kernel changes as you train. In the infinite-width limit under the NTK parameterization, it converges to a deterministic kernel $K_\infty$, and the deviation $\|K_{\theta_t} - K_\infty\|$ is $O(1/\sqrt{n})$ in the width $n$.
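
To make the definition concrete, here is a minimal sketch of the empirical tangent kernel for a toy two-layer network in JAX. The model, the sizes, and the helper names (`init_params`, `empirical_ntk`) are illustrative placeholders, not anything from [1].

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_params(key, d_in=4, width=512):
    # NTK-style initialization: standard-normal weights, explicit 1/sqrt(fan_in) factors in f
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (width, d_in)),
            "w2": jax.random.normal(k2, (width,))}

def f(params, x):
    h = jnp.tanh(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return params["w2"] @ h / jnp.sqrt(h.shape[0])

def empirical_ntk(params, X):
    flat, unravel = ravel_pytree(params)
    grad_f = jax.grad(lambda theta, x: f(unravel(theta), x))   # gradient w.r.t. parameters
    J = jax.vmap(grad_f, in_axes=(None, 0))(flat, X)           # one row of gradients per input
    return J @ J.T                                             # K[i, j] = <grad f(x_i), grad f(x_j)>

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (8, 4))
K = empirical_ntk(init_params(key), X)                         # (8, 8), symmetric PSD
```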

Function-space gradient descent under the squared loss then satisfies $\dot{f}_t = -K_\infty (f_t - y)$ on the training set and integrates to $f_t = y + e^{-K_\infty t}(f_0 - y)$. Parameter-space non-convexity stops mattering: the function-space dynamics are linear and driven by a positive semidefinite kernel.
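
That closed form is easy to evaluate directly: eigendecompose the kernel matrix on the training set and let each mode decay at its own rate. A minimal sketch, assuming squared loss and gradient flow; `f_at_time` is a made-up helper name.

```python
import jax.numpy as jnp

def f_at_time(K, f0, y, t):
    """f_t = y + exp(-K t) (f0 - y), via the eigendecomposition of the symmetric PSD kernel K."""
    lam, U = jnp.linalg.eigh(K)                  # K = U diag(lam) U^T
    decay = jnp.exp(-lam * t)                    # each eigenmode relaxes at its own rate
    return y + U @ (decay * (U.T @ (f0 - y)))    # rotate into the eigenbasis, decay, rotate back
```

With a fixed kernel, this traces the entire training trajectory on the training points without touching the parameters at all.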

NTK as baseline

The first thing the NTK explained is why very wide networks optimize so easily. If the kernel is well conditioned on the training data, gradient descent has an easy road to interpolation. The complicated nonconvex path is, at leading order, kernel regression with a specific architecture-induced kernel. Du et al. [4] and Lee et al. [2] pushed this picture further and showed that wide networks of any depth evolve like their first-order Taylor expansion around initialization.

Figure 2 of Lee et al. [2]: predictions from the linearized infinite-width model closely track the trajectory of the actual wide finite network during gradient-descent training.
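
Operationally, "evolve like their first-order Taylor expansion" means you can swap the network for its linearization around $\theta_0$ and train that instead. A rough sketch of that linearized model in JAX; `linearize` is an illustrative helper, not Lee et al.'s code.

```python
import jax
from jax.flatten_util import ravel_pytree

def linearize(f, params0):
    """Return flattened theta_0 and f_lin(theta, x) = f(theta_0, x) + grad_theta f(theta_0, x) . (theta - theta_0)."""
    flat0, unravel = ravel_pytree(params0)
    def f_lin(flat, x):
        out0, jvp = jax.jvp(lambda theta: f(unravel(theta), x), (flat0,), (flat - flat0,))
        return out0 + jvp                      # first-order Taylor expansion around initialization
    return flat0, f_lin
```

Training `f_lin` with the same optimizer and comparing its outputs to the real network over time is exactly the comparison in the figure above.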

Du et al. [4] extend the same machinery to a clean proof that overparameterized networks reach zero training loss, with a polynomial-width requirement and a global-convergence guarantee that the nonconvex landscape never gave you. But the spectrum of $K_\infty$ does more than set the speed of convergence. Its eigendecomposition $K_\infty = \sum_k \lambda_k \phi_k \phi_k^\top$ implies that the residual along eigenmode $\phi_k$ shrinks like $e^{-\lambda_k t}$, so large-eigenvalue modes are learned quickly and small-eigenvalue modes slowly, or not at all under early stopping. Early stopping, the frequency principle, and the spectral bias of MLPs all become statements about $\{\lambda_k\}$.
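
A sketch of how the spectrum turns into a concrete prediction: project the initial residual onto the eigenmodes of the kernel, and each coefficient shrinks as $e^{-\lambda_k t}$, so stopping at time $T$ acts roughly as a threshold at $\lambda \approx 1/T$. The helper name `mode_residuals` is made up for illustration.

```python
import jax.numpy as jnp

def mode_residuals(K, f0, y, ts):
    """Residual magnitude along each eigenmode of K at the times in ts."""
    lam, U = jnp.linalg.eigh(K)          # eigenvalues in ascending order
    c0 = U.T @ (f0 - y)                  # initial residual expressed in the eigenbasis
    ts = jnp.asarray(ts)
    # row i gives |c_k(ts[i])| = |c_k(0)| * exp(-lam_k * ts[i]) for every mode k
    return jnp.abs(c0)[None, :] * jnp.exp(-lam[None, :] * ts[:, None])

# Early stopping at time T leaves modes with lam_k << 1/T essentially unlearned,
# while modes with lam_k >> 1/T have already been fit.
```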

Figure 1 of Cao et al. (arXiv:1912.01198), on the spectral bias of deep learning: projection lengths along the lowest few NTK eigenmodes as a function of training step; low-frequency (small $k$) components of the target are fit long before higher-frequency components, in the order predicted by the NTK spectrum.

Arora et al. write down an exact algorithm for computing $K_\infty$ for fully connected and convolutional networks of arbitrary depth, making these spectral predictions empirically testable on real datasets.
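
In practice the open-source neural-tangents library packages these closed-form infinite-width kernels. A minimal sketch of computing $K_\infty$ and its spectrum for a small fully connected network; the API details here are recalled from memory and worth checking against the library's documentation.

```python
import jax.numpy as jnp
from neural_tangents import stax

# The width argument only matters for the finite-width apply_fn;
# kernel_fn is the corresponding infinite-width kernel.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x = jnp.linspace(-1.0, 1.0, 32).reshape(-1, 1)
K_inf = kernel_fn(x, x, "ntk")            # infinite-width NTK Gram matrix on these inputs
spectrum = jnp.linalg.eigvalsh(K_inf)     # the eigenvalues the spectral predictions are about
```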

Figure 2 of Bordelon, Canatar, and Pehlevan (arXiv:2002.02561): per-mode learning curves $E_k(p)/E_k(0)$ versus number of training samples $p$ for varying eigenmode index, input dimension, and depth, all approaching the predicted $1/p^\alpha$ envelope; generalization on each NTK eigenmode follows a spectrum-dependent power law in sample count.

The lazy-training caveat

The catch is that the same condition that makes the theory clean also removes one of the main things deep networks seem to be doing. In the NTK limit, features do not move. Individual weights drift by only $O(1/\sqrt{n})$ in the width $n$ while the function changes by $O(1)$, so the network is producing its outputs by reweighting an almost fixed collection of random features. Chizat, Oyallon, and Bach [3] call this the lazy-training regime and emphasize that it is a property of the scaling, not a universal description of neural networks.
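
A crude way to check where a given finite run sits is to measure both drifts directly: how far the parameters moved relative to their initialization, and how much the empirical tangent kernel moved. A sketch, with `laziness_metrics` as a made-up diagnostic that takes the parameters and empirical kernels at the start and end of training.

```python
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def laziness_metrics(params0, params_t, K0, Kt):
    """Relative parameter drift and relative kernel movement over a training run."""
    theta0, _ = ravel_pytree(params0)
    thetat, _ = ravel_pytree(params_t)
    rel_param_move = jnp.linalg.norm(thetat - theta0) / jnp.linalg.norm(theta0)
    rel_kernel_move = jnp.linalg.norm(Kt - K0) / jnp.linalg.norm(K0)
    return rel_param_move, rel_kernel_move

# In the lazy regime both ratios shrink as width grows;
# in a feature-learning run the kernel ratio stays order one.
```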

Figure 1 of Chizat, Oyallon, and Bach [3]: lazy-regime versus feature-learning trajectories on a 2D classification problem; the lazy regime stays near initialization while the feature-learning regime moves substantially. The NTK limit is powerful because it freezes feature movement; that is also what it cannot explain.

That caveat matters: a convolutional network trained in the lazy regime can optimize while failing to learn the representations that make convolutional networks useful. A transformer that looks like a fixed random-feature model is not the object that in-context learning, induction-head formation, and abstraction discussions are pointing at. The NTK gives a rigorous theory of one limit. The question is whether that limit keeps the right phenomena. Geiger et al. [5] report a sharp empirical separation: at moderate width and standard initialization scale, networks operate near the lazy regime; at lower initialization scale (or with explicit feature-learning parameterizations) the same architecture enters a regime where features evolve and test error improves.

The transition is controlled by initialization scale and width, not by anything intrinsic to the architecture. That is a slightly disappointing answer if you wanted neural networks to be feature learners by default.

The NTK is not wrong; it is a baseline. A phenomenon that already appears in the NTK limit can be attributed to width, interpolation, and fixed random features, with no representation learning needed. A phenomenon that disappears in the NTK limit is the work of feature learning, finite-width fluctuations, architecture-specific structure, or nonlinearity in the optimization. That makes the NTK a useful negative control for theoretical claims about deep learning, and the simplest question to put to such a claim is: would it still hold if the features were frozen? For most optimization claims, yes. For most generalization and capability claims, no. The Distill circuits thread is the non-theorem-shaped version of what feature learning looks like when someone manages to pry a model open. Greg Yang's $\mu$P writeup is the practical entry into Tensor Programs, and the microsoft/mup repo is what to grab when the goal is hyperparameter transfer and not the theory.

Feature learning is the missing term

The frontier after the NTK was to build infinite-width limits in which features actually move. Mean-field limits treat each unit as a particle in a measure and study gradient flow on that measure; in this scaling features evolve and the kernel is no longer constant. Tensor-program analyses catalogue the parameterizations that produce sensible infinite-width limits at all. The maximal-update parameterization $\mu$P, introduced by Yang and Hu in Tensor Programs IV (arXiv:2011.14522), is the one that keeps both feature learning and stable optimization in the limit. The follow-up Tensor Programs V (arXiv:2203.03466) derives the $\mu$Transfer hyperparameter-transfer rules from that analysis: tune at small width, scale to large width, and the tuned learning rate transfers without retuning.
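
As a very rough illustration of where the parameterizations differ, and nothing like the full $\mu$P recipe (which also prescribes per-layer initialization variances and learning rates): one of the knobs is how the readout layer scales with width. Under the NTK scaling the output is divided by $\sqrt{\text{width}}$; one common presentation of the $\mu$P readout uses a $1/\text{width}$ multiplier instead, which keeps logits $O(1)$ while the hidden features are allowed to move by $O(1)$. The sketch below contrasts the two under that simplified presentation; for the actual rules, use the papers or microsoft/mup.

```python
import jax.numpy as jnp

def readout_ntk(w, h):
    # NTK / lazy scaling: output = w . h / sqrt(width)
    return w @ h / jnp.sqrt(h.shape[0])

def readout_mup(w, h):
    # muP-style scaling (simplified): output = w . h / width,
    # so logits stay O(1) even when the features h move by O(1)
    return w @ h / h.shape[0]
```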

Feature learning means the tangent kernel is moving substantively: $K_{\theta_t} - K_{\theta_0}$ is a structured rotation of the features toward the data, not a small perturbation. The parameter gradients at the end of training are not the same object as at initialization, and the network has in effect changed the basis it works in. That change is exactly what the pure NTK limit suppresses. Fort et al. ran one of the clearest empirical comparisons: kernel learning matches a finite network early in training, but the two diverge later, and the divergence is the gap between lazy convergence to a fixed kernel and feature-driven reshaping of it.

I use the NTK as a falsifier, not a model. If a proposed mechanism for a deep-learning phenomenon already appears, trivially, in the NTK regime, then "the network is wide and the features are random" suffices, and the explanation has not earned the depth of its hypothesis. The interesting predictions are the ones that disagree with the kernel: where width, lazy init, and architecture-induced spectra are not enough, and where representation change must be doing the work. That includes in-context learning, induction-head formation, and the parts of scaling laws that depend on where compute is spent.

Further reading

References