<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Notes by Mouhssine Rifaki</title>
    <link>https://rifaki.me/notes/</link>
    <atom:link href="https://rifaki.me/feed.xml" rel="self" type="application/rss+xml"/>
    <description>Essays on deep learning theory, reinforcement learning, and mathematical statistics.</description>
    <language>en-us</language>
    <copyright>Copyright 2024-2026 Mouhssine Rifaki</copyright>
    <managingEditor>mouhssine@rifaki.me (Mouhssine Rifaki)</managingEditor>
    <webMaster>mouhssine@rifaki.me (Mouhssine Rifaki)</webMaster>
    <lastBuildDate>Wed, 29 Apr 2026 07:22:23 +0000</lastBuildDate>
    <pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate>
    <ttl>1440</ttl>
    <image>
      <url>https://rifaki.me/portrait.jpg</url>
      <title>Notes by Mouhssine Rifaki</title>
      <link>https://rifaki.me/notes/</link>
    </image>

    <item>
      <title>Score matching and diffusion</title>
      <link>https://rifaki.me/notes/diffusion-models/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/diffusion-models/</guid>
      <pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Denoising score matching, the SDE view, Karras&#x27;s design-space disentanglement, and where diffusion sits relative to flow matching and consistency models.</p><p><img src="https://rifaki.me/notes/img/diffusion-forward-reverse.png" alt="Score matching and diffusion"/></p>]]></description>
      <content:encoded><![CDATA[<p>The setup goes back to Sohl-Dickstein et al. [<a href="https://arxiv.org/abs/1503.03585">1</a>]. The forward chain takes data x<sub>0</sub> and adds Gaussian noise according to a fixed schedule ᾱ<sub>t</sub>, producing intermediate samples x<sub>1</sub>, …, x<sub>T</sub> via <img src="https://rifaki.me/notes/img/math/fcdfd763851db7dc.svg" alt="$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$$" class="math-display" width="223" height="26"/> with ε ∼ 𝒩(0, I). The schedule is chosen so that x<sub>T</sub> is essentially a standard Gaussian. The reverse chain undoes the corruption: sampling from the data distribution reduces to learning the conditional p(x<sub>t-1</sub> | x<sub>t</sub>) at each step.</p>
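        <p>A minimal numpy sketch of that forward corruption, assuming a precomputed cumulative schedule <code>alpha_bar</code> (any of the standard β-schedules fits this shape once cumulative products are taken); the function name and signature are illustrative, not from any particular codebase:</p>

        <pre><code>import numpy as np

def forward_noise(x0, t, alpha_bar, rng=None):
    """Sample x_t ~ q(x_t | x_0) for the DDPM forward chain.

    x0        : clean data, shape (batch, dim)
    t         : integer timestep index into the schedule
    alpha_bar : cumulative products of (1 - beta_s), shape (T,)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is the regression target for a noise-prediction network
</code></pre>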

        <h2>Score matching</h2>

        <p>The DDPM forward chain has a clean dual under score matching, and once the two are placed side by side they are not separate ideas. The score function s(x,t) = ∇<sub>x</sub> log p<sub>t</sub>(x) is the gradient of the log-density of the noisy distribution at noise level t evaluated at x; it points in the direction along which the log-density rises most steeply. Hyvärinen's original score-matching objective is the square norm of the score plus the trace of the Hessian of the log-density. The Hessian-trace term is expensive in high dimensions. Vincent's denoising score matching gets around this: with Gaussian noise of variance σ<sup>2</sup>, the optimal MMSE denoiser D<sup>*</sup>(x̃) = 𝔼[x | x̃] satisfies <img src="https://rifaki.me/notes/img/math/925a6ce0278dacdd.svg" alt="$$\nabla_{\widetilde{x}} \log p_\sigma(\widetilde{x}) = \frac{D^*(\widetilde{x}) - \widetilde{x}}{\sigma^2},$$" class="math-display" width="239" height="45"/> so predicting the noise (equivalently, predicting the clean image) is the same problem as estimating the score of the noisy distribution, with mean-squared error as the loss.</p>
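        <p>As a sketch of that equivalence, the denoising score-matching loss at a single noise level is a regression onto the analytically known score of the Gaussian corruption; <code>score_fn</code> stands in for whatever network is being trained and is an assumption of this sketch:</p>

        <pre><code>import numpy as np

def dsm_loss(score_fn, x0, sigma, rng=None):
    """Denoising score matching at one noise level: regress onto the score of q(x_noisy | x0)."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    x_noisy = x0 + sigma * eps
    # Score of the Gaussian conditional: (x0 - x_noisy) / sigma^2 = -eps / sigma
    target = -eps / sigma
    pred = score_fn(x_noisy, sigma)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))
</code></pre>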

        <p>Song and Ermon [<a href="https://arxiv.org/abs/1907.05600">2</a>] estimated these scores at a sweep of noise levels and used annealed Langevin dynamics to draw samples from the estimates. Ho, Jain, and Abbeel showed that DDPMs trained with a weighted variational bound reduce, in a particular weighting limit, to denoising score matching at multiple noise levels. The point shared by both: learning to predict the noise that was added at a given level is enough to recover the score of the noisy distribution at that level, and sampling is then iterated denoising along the noise schedule.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/diffusion-score-field.png" alt="Sliced score matching loss trajectories on a toy problem: the unconstrained estimator diverges to large negative values while the noise-conditioned variant stays bounded over training iterations.">
          <figcaption>Training-stability comparison from the score-matching literature (Song et al. [<a href="https://arxiv.org/abs/1907.05600">2</a>] and surrounding work). The score-matching loss is unbounded below without noise conditioning; injecting noise restores stable training.</figcaption>
        </figure>

        <h2>SDE view</h2>

        <p>Song et al. unified the discrete and score-based views inside stochastic differential equations. The forward process is an Itô SDE <img src="https://rifaki.me/notes/img/math/52821e31dbf26a0e.svg" alt="$$dx_t=f\left( x_t,t\right)\, dt+g\left( t\right)\, dw_t$$" class="math-display" width="240" height="20"/> that continuously corrupts data into noise. Anderson's reverse-time formula gives a backward SDE driven by the score <img src="https://rifaki.me/notes/img/math/fd3608f73d73a0b4.svg" alt="$$dx_t=\left[ f\left( x_t,t\right)-g^{2}\left( t\right)\nabla _{x}\log p_{t}\left( x_{t}\right) \right] dt+g\left( t\right)\, d\bar {w}_{t},$$" class="math-display" width="444" height="25"/> and a deterministic probability-flow ODE with the same marginals <img src="https://rifaki.me/notes/img/math/fd23102ac17f70a0.svg" alt="$$dx_{t}=\left[f\left( x_{t},t\right)-\frac{1}{2}g^{2}\left( t\right)\nabla _{x}\log p_{t}\left( x_{t}\right)\right] dt.$$" class="math-display" width="371" height="49"/> The ODE is useful for likelihood evaluation and fast sampling; the SDE is what most early samplers used. DDPMs, score-based models, and probability-flow ODE samplers are different discretizations of the same underlying dynamics.</p>
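        <p>A minimal Euler integration of the probability-flow ODE for the simplest variance-exploding case (f = 0, corruption x<sub>σ</sub> = x<sub>0</sub> + σε), where the ODE reduces to dx/dσ = -σ ∇<sub>x</sub> log p<sub>σ</sub>(x); <code>score_fn</code> is assumed to be a trained score estimator:</p>

        <pre><code>import numpy as np

def pf_ode_euler(score_fn, x_init, sigmas):
    """Euler steps on dx/dsigma = -sigma * score(x, sigma), from high noise down to low.

    x_init : sample from the prior, roughly N(0, sigmas[0]^2 I)
    sigmas : decreasing noise levels, e.g. np.geomspace(80.0, 0.002, 50)
    """
    x = x_init
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        drift = -sigma * score_fn(x, sigma)     # ODE drift at the current noise level
        x = x + (sigma_next - sigma) * drift    # sigma_next is smaller, so this is a denoising step
    return x
</code></pre>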

        <p>The SDE view also separates modeling from numerics. The noise schedule, solver, parameterization, and preconditioner can all be changed without touching the underlying problem of learning the score.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/diffusion-sde-ode-map.png" alt="Figure 1 of Song et al. 2020. Forward SDE turning data into noise (top) and the score-driven reverse SDE turning noise back into data (bottom), with intermediate sample crops at increasing noise levels.">
          <figcaption>Figure 1 of Song et al. [<a href="https://arxiv.org/abs/2011.13456">4</a>]. The learned score is the object shared by the stochastic reverse process and (in the same paper) the deterministic probability-flow ODE.</figcaption>
        </figure>

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">There is a lot of great writing on flow matching out there all of a sudden! This post clarifies the connection with diffusion models -- they are essentially two different ways to describe the same class of models. <a href="https://t.co/lLokMmxxdz">https://t.co/lLokMmxxdz</a></p>&mdash; Sander Dieleman (@sedielem) <a href="https://twitter.com/sedielem/status/1863661809355362538?ref_src=twsrc%5Etfw">December 2, 2024</a></blockquote>
        </figure>

        <figure>
          <img src="https://rifaki.me/notes/img/diffusion-forward-reverse.png" alt="Figure 2 of Ho, Jain, Abbeel 2020 (DDPM). Directed graphical model of the forward and reverse chains between x_T and x_0.">
          <figcaption>Figure 2 of Ho et al. [<a href="https://arxiv.org/abs/2006.11239">3</a>]. The forward chain corrupts data into noise; the learned reverse chain inverts it step by step.</figcaption>
        </figure>

        <p>Lu et al. [<a href="https://arxiv.org/abs/2206.00927">6</a>] built DPM-Solver out of the semi-linear structure of the probability-flow ODE and cut sampling from thousands of steps to tens, with no retraining required.</p>

        <h2>Karras et al.</h2>

        <p>Karras et al. [<a href="https://arxiv.org/abs/2206.00364">5</a>] argued that diffusion practice had become unnecessarily entangled: sampling schedule, loss weighting, noise parameterization, network preconditioning, and solver choice were all bundled together. Pulling each design decision apart showed that most of the published performance gains came from untangling the design space, not from a new generative principle. Their EDM recipe (continuous noise levels indexed by σ, σ-conditioned network preconditioning, and a second-order Heun ODE solver) has since become a common baseline.</p>
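        <p>One of the pieces the paper pulls apart is the placement of noise levels. A sketch of the ρ-warped grid between σ<sub>max</sub> and σ<sub>min</sub>; the constants here are the commonly quoted EDM defaults and should be read as assumptions of this sketch rather than a quotation:</p>

        <pre><code>import numpy as np

def edm_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Rho-warped noise-level grid from sigma_max down to sigma_min, ending at exactly zero."""
    i = np.arange(n_steps)
    sigmas = (sigma_max ** (1 / rho)
              + i / (n_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(sigmas, 0.0)
</code></pre>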

        <figure>
          <img src="https://rifaki.me/notes/img/diffusion-design-space.png" alt="Table 1 of Karras et al. 2022 (EDM). Explicit design-space tabulation of noise schedules, prediction targets, preconditioning, and samplers across DDPM, NCSN, EDM, and related variants.">
          <figcaption>Table 1 of Karras et al. [<a href="https://arxiv.org/abs/2206.00364">5</a>]. The score-estimation problem stays fixed while schedules, targets, preconditioning, and solvers move around it.</figcaption>
        </figure>

        <p>The implementation history runs through three papers. Dhariwal and Nichol showed that diffusion models beat GANs on ImageNet by combining classifier guidance with architecture and training changes. Ho and Salimans then dropped the auxiliary classifier in favor of classifier-free guidance, training the conditional and unconditional scores jointly and combining them at sampling time. Rombach et al. moved the diffusion process inside the latent space of a pre-trained autoencoder, which is what made high-resolution diffusion feasible at academic compute and what Stable Diffusion is built on.</p>
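        <p>The sampling-time combination in classifier-free guidance is a single linear operation on the two predictions; a sketch assuming a noise-prediction network <code>eps_model</code> that accepts an optional conditioning argument (with <code>None</code> standing in for the unconditional branch), a common convention rather than a fixed API:</p>

        <pre><code>def cfg_eps(eps_model, x_t, t, cond, guidance_scale):
    """Classifier-free guidance: push the conditional prediction away from the unconditional one.

    guidance_scale = 0 recovers the unconditional model, 1 the plain conditional model,
    and values above 1 amplify the conditioning signal.
    """
    eps_uncond = eps_model(x_t, t, cond=None)   # branch trained with conditioning dropped out
    eps_cond = eps_model(x_t, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
</code></pre>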

        <p>Natural data sits near low-dimensional manifolds where direct density modeling is hard. Adding white noise thickens the manifold: at high noise levels the distribution is smooth, at low noise levels it is detailed but local. Diffusion replaces a single full-density estimation problem with a sequence of denoising problems indexed by noise level. Compare with autoregressive models, which generate sequentially conditioned on prior tokens, and with flows, which accept restrictions on the transforms they can express in exchange for tractable likelihoods. Diffusion's training objective is stable in a way GAN training is not, while paying for it with iterative sampling that DPM-Solver, consistency models, and distillation have largely clawed back.</p>

        <p>Semantic abstraction inside the model is still not well understood theoretically. The score tells you how to move from a noisy sample toward a clean one. It does not say why text prompts, latent-space guidance, or multimodal conditioning organize concepts the way they do. Those mechanisms are built on top of a fixed score-matching core, but the core does not force any of them.</p>

        <p>Whether diffusion is the right parameterization is also unclear to me. Flow matching frames generation as learning a vector field that transports a simple prior to the data distribution along possibly non-straight paths, with a regression objective that does not require an SDE. Rectified flow constrains the paths to be nearly straight, which keeps few-step sampling accurate. Consistency models compress the iterative denoising sampler into a single-step generator while preserving sample quality. None of these compete with score-based modeling; they are alternative parameterizations of the same vector-field-across-noise-levels problem.</p>
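        <p>As a sketch of how close the parameterizations sit, the conditional flow-matching objective with straight (rectified-flow) paths is a plain regression onto the displacement x<sub>1</sub> - x<sub>0</sub>; <code>v_model</code> is a placeholder for the learned vector field:</p>

        <pre><code>import numpy as np

def rectified_flow_loss(v_model, x0, x1, rng=None):
    """Flow matching with linear paths x_t = (1 - t) x0 + t x1 and constant target velocity x1 - x0.

    x0 : samples from the prior (e.g. standard Gaussian noise)
    x1 : samples from the data distribution
    """
    if rng is None:
        rng = np.random.default_rng(0)
    t = rng.uniform(size=(x0.shape[0], 1))
    xt = (1.0 - t) * x0 + t * x1
    pred = v_model(xt, t)
    return np.mean(np.sum((pred - (x1 - x0)) ** 2, axis=-1))
</code></pre>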
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/2105.05233">P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. arxiv 2105.05233, 2021</a></li>
          <li><a href="https://arxiv.org/abs/2207.12598">J. Ho and T. Salimans. Classifier-free diffusion guidance. arxiv 2207.12598, 2022</a></li>
          <li><a href="https://arxiv.org/abs/2112.10752">R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. arxiv 2112.10752, 2021</a></li>
          <li><a href="https://arxiv.org/abs/2210.02747">Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arxiv 2210.02747, 2022</a></li>
          <li><a href="https://arxiv.org/abs/2209.03003">X. Liu, C. Gong, and Q. Liu. Flow straight and fast: learning to generate and transfer data with rectified flow. arxiv 2209.03003, 2022</a></li>
          <li><a href="https://arxiv.org/abs/2303.01469">Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. arxiv 2303.01469, 2023</a></li>
          <li><a href="https://www.jmlr.org/papers/v6/hyvarinen05a.html">A. Hyv&auml;rinen. Estimation of non-normalized statistical models by score matching</a></li>
          <li><a href="https://www.iro.umontreal.ca/~vincentp/Publications/smdae_techreport.pdf">P. Vincent. A connection between score matching and denoising autoencoders</a></li>
        </ul>

        

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/1503.03585">1</a>] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arxiv 1503.03585, 2015.</li>
          <li>[<a href="https://arxiv.org/abs/1907.05600">2</a>] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. arxiv 1907.05600, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/2006.11239">3</a>] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. arxiv 2006.11239, 2020.</li>
          <li>[<a href="https://arxiv.org/abs/2011.13456">4</a>] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. arxiv 2011.13456, 2020.</li>
          <li>[<a href="https://arxiv.org/abs/2206.00364">5</a>] T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. arxiv 2206.00364, 2022.</li>
          <li>[<a href="https://arxiv.org/abs/2206.00927">6</a>] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arxiv 2206.00927, 2022.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/diffusion-forward-reverse.png"/>
    </item>
    <item>
      <title>Adversarial examples</title>
      <link>https://rifaki.me/notes/adversarial-examples/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/adversarial-examples/</guid>
      <pubDate>Sat, 07 Feb 2026 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Adversarial examples, FGSM, PGD, Madry&#x27;s saddle-point formulation, the robust-features view, and the accuracy trade-off.</p><p><img src="https://rifaki.me/notes/img/adversarial-fgsm.png" alt="Adversarial examples"/></p>]]></description>
      <content:encoded><![CDATA[<p>Adversarial examples initially seemed an oddity. Szegedy et al. [<a href="https://arxiv.org/abs/1312.6199">1</a>] demonstrated that a minuscule perturbation, meaningless to human eyes, could confidently flip a neural net's prediction. My first instinct on reading it was to blame weird nonlinearities or overfitting. It turned out to be neither. Goodfellow et al. [<a href="https://arxiv.org/abs/1412.6572">2</a>] gave a much simpler explanation. A linear model with weight w ∈ ℝ<sup>d</sup> will have logit shift w<sup>⊤</sup> δ under perturbation δ, and the worst-case ℓ<sub>∞</sub> behavior inside of ‖δ‖<sub>∞</sub> ≤ ε is ε ‖w‖<sub>1</sub>, growing like εd for dense weights, a tiny per-pixel perturbation accumulated across a high-dimensional input.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/adversarial-fgsm.png" alt="Figure 1 of Goodfellow, Shlens, Szegedy 2015. Panda image plus epsilon times sign of the gradient produces an imperceptible perturbation that flips the classifier's prediction to gibbon.">
          <figcaption>Figure 1 of Goodfellow et al. [<a href="https://arxiv.org/abs/1412.6572">2</a>]. The perturbation is small under the threat model, but it is aligned with the classifier's loss gradient.</figcaption>
        </figure>

        <p>In a thousand-dimensional input space, even an imperceptible δ can produce a logit shift that flips the prediction. Adversarial perturbations are then not a symptom of nonlinear extrema; they are a generic property of high-dimensional linear decision rules. FGSM is a one-step linearized attack on the inner maximum max<sub>‖δ‖<sub>∞</sub> ≤ ε</sub> L(θ, x+δ, y). The fact that one step works at all is the diagnostic: the model is sensitive to a direction the data distribution does not mark as human-meaningful. Carlini and Wagner later showed that more carefully tuned attack objectives produce much smaller-norm perturbations than FGSM, breaking many defenses whose only evaluation had been against single-step attacks.</p>
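        <p>A minimal FGSM sketch in PyTorch; <code>model</code> and <code>loss_fn</code> are stand-ins for the classifier and loss under attack, and inputs are assumed to live in [0, 1]:</p>

        <pre><code>import torch

def fgsm(model, loss_fn, x, y, eps):
    """One-step linearized attack: move each coordinate by eps along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()
</code></pre>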

        <p>Madry et al. wrote down the saddle-point formulation: define a defense by the worst-case inner-max loss it can resist within a fixed threat model, and treat any defense that fails a stronger attack inside that threat model as broken. The split separates two questions that were previously tangled together: the inner problem defines what the attacker can do, and the outer problem defines what the model has to optimize against. Projected gradient descent then becomes both the canonical attack and the canonical training procedure. It is not perfect, but it made robustness measurable enough that defenses could be compared honestly. Many proposed defenses then turned out not to be robust; they merely defeated weak attacks. Athalye, Carlini, and Wagner cataloged the failure modes under one heading - obfuscated gradients. Stochastic preprocessing, non-differentiable transforms, exploding or vanishing gradients, and gradient shattering each make gradient-based attacks fail without producing classifiers that survive a stronger attack.</p>
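        <p>PGD iterates the same linearized step and projects back onto the ℓ<sub>∞</sub> ball after each move; a sketch under the same assumptions as above (inputs in [0, 1], <code>model</code> and <code>loss_fn</code> hypothetical):</p>

        <pre><code>import torch

def pgd_attack(model, loss_fn, x, y, eps, alpha, steps):
    """Inner maximization of the saddle-point objective by projected gradient ascent on the loss."""
    x_adv = (x + eps * (2 * torch.rand_like(x) - 1)).clamp(0.0, 1.0)   # random start in the ball
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)          # project onto the l_inf ball
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
</code></pre>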

        <figure>
          <img src="https://rifaki.me/notes/img/adversarial-minmax-loop.png" alt="Figure 1 of Madry et al. 2018. PGD attack-loss curves over inner-maximization iterations on standard- and adversarially-trained MNIST and CIFAR10 networks: the standard models reach high attack loss easily, the robust models cap the attainable loss.">
          <figcaption>Figure 1 of Madry et al. [<a href="https://arxiv.org/abs/1706.06083">3</a>]. PGD finds many high-loss perturbations on standard networks; on the adversarially-trained networks it caps out near a small bounded value.</figcaption>
        </figure>

        <p>Athalye, Carlini, and Wagner broke six of the nine ICLR-2018 defenses completely and a seventh partially, just by replacing the attack with a stronger one inside the same threat model. AutoAttack later turned that lesson into a parameter-free ensemble: a single attack you can run against a defense without per-defense tuning, which exposes inflated robustness numbers automatically. The other branch of progress is certified rather than empirical robustness. Cohen, Rosenfeld, and Kolter produce randomized-smoothing certificates: convolve the classifier with isotropic Gaussian noise of variance σ<sup>2</sup> and the smoothed classifier g(x) = argmax<sub>c</sub> ℙ<sub>η ∼ 𝒩(0,σ<sup>2</sup> I)</sub>[f(x+η) = c] is provably robust within an ℓ<sub>2</sub> ball of radius σΦ<sup>-1</sup>(p<sub>A</sub>), where p<sub>A</sub> is the lower confidence bound on the top-class probability.</p>
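        <p>A Monte-Carlo sketch of the smoothed prediction and the certificate radius, simplified relative to the paper's two-stage procedure (separate selection and estimation samples plus a one-sided confidence bound on p<sub>A</sub>); <code>classifier</code> is assumed to map a single input to class logits:</p>

        <pre><code>import torch
from scipy.stats import norm

def smoothed_predict(classifier, x, sigma, n_samples, num_classes):
    """Majority vote under Gaussian noise, plus the l2 radius sigma * Phi^{-1}(p_A)."""
    counts = torch.zeros(num_classes)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)
        counts[int(classifier(noisy).argmax())] += 1
    top_class = int(counts.argmax())
    p_hat = counts[top_class].item() / n_samples
    # The paper replaces p_hat with a one-sided lower confidence bound; the plug-in
    # estimate is used here only to keep the sketch short.
    radius = sigma * norm.ppf(min(p_hat, 1 - 1e-6)) if p_hat > 0.5 else 0.0
    return top_class, radius
</code></pre>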

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">The definition of &quot;adversarial examples&quot; I prefer these days is &quot;Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake&quot; <a href="https://t.co/GiXiQBCp5L">https://t.co/GiXiQBCp5L</a></p>&mdash; Ian Goodfellow (@goodfellow_ian) <a href="https://twitter.com/goodfellow_ian/status/984518755546906624?ref_src=twsrc%5Etfw">April 12, 2018</a></blockquote>
        </figure>

        <p>The certificate is provable. The cost is that the radius is small in practice and only natural in ℓ<sub>2</sub>. For a less paper-indexed entry point, the Gradient Science adversarial robustness page is still one of the better maps of attacks, defenses, and the evaluation traps. RobustBench is the practical scoreboard, once the question becomes: does this defense survive standard attacks?</p>

        <h2>The accuracy trade-off</h2>

        <p>Tsipras et al. [<a href="https://arxiv.org/abs/1805.12152">4</a>] made the uncomfortable point that robustness can conflict with standard accuracy. In their construction, the standard classifier uses weak but highly predictive features that are not robust. The robust classifier has to ignore them and therefore loses ordinary accuracy. The empirical version of this trade-off is more complicated, but the conceptual point survived: robustness is not just accuracy with additional caution.</p>

        <p>A robust classifier may have to learn different features altogether. Ilyas et al. [<a href="https://arxiv.org/abs/1905.02175">5</a>] put it bluntly: adversarial examples are not bugs, they are features. Their claim was not that every attack direction is semantically relevant to humans. It was that standard datasets contain predictive signal that models can use and that humans do not recognize as robust evidence. Adversarial perturbations take advantage of those signals. Robust training suppresses them. Schmidt et al. [<a href="https://arxiv.org/abs/1804.11285">6</a>] quantified the price in statistical terms: the sample complexity of robust learning can be polynomially bigger than the sample complexity of standard learning, an information-theoretic gap that holds irrespective of the training algorithm or the model family. In their Gaussian-mixture model, standard generalization needs only constant sample complexity while robust generalization at ℓ<sub>∞</sub> radius ε requires a polynomial-in-d number of samples.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/adversarial-feature-taxonomy.png" alt="Figure 1 of Ilyas et al. 2019. Robust versus non-robust feature decomposition: standard models exploit non-robust features that are predictive but human-imperceptible.">
          <figcaption>Figure 1 of Ilyas et al. [<a href="https://arxiv.org/abs/1905.02175">5</a>]. A feature can be genuinely predictive and still fail the invariance demanded by the threat model.</figcaption>
        </figure>

        <p>The gap comes from the structure of the learning problem, not the algorithm. Robustness is paying for invariance, and invariance costs in terms of sample complexity. Engstrom et al. then ran the experiment in the other direction: representations from robust classifiers transfer better than standard ones on a range of downstream tasks, look more semantically aligned in feature visualization, and yield gradients that resemble human-perceptible objects. Robustness, in this reading, is also a representation-learning prior. Whether the prior helps or hurts depends on the downstream task, but it is not free of structure.</p>

        <p>The adversarial-examples literature forced a distinction between predictive validity and human-aligned validity. A feature can be statistically real, useful for test accuracy, and still unacceptable under a robustness constraint. That is a deeper issue than security: the supervised-learning objective does not fully specify the invariances I care about. To the extent that human perception is itself a strong inductive bias, models that do more representation learning end up closer to my geometry; robust models produce gradients and saliency maps that line up with what a person would identify as the object. In medical imaging, robotics, or safety-critical perception that trade-off is worth the cost. In low-stakes classification it often is not.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/adversarial-robust-optimization.png" alt="Adversarial-training loss curves of Madry et al. 2018: PGD-adversarial training loss decays from the initial saddle-point value over 100k MNIST iterations and 75k CIFAR10 iterations.">
          <figcaption>Adversarial-training convergence in Madry et al. [<a href="https://arxiv.org/abs/1706.06083">3</a>]. The robust optimization objective is solvable: the inner-max-then-outer-min loss curve descends to a bounded plateau under PGD adversarial training.</figcaption>
        </figure>

        <p>The "non-robust feature" label renames an older statistical fact. Predictive validity in distribution is not causal structure, and a model that maximizes the former will exploit signals the latter does not endorse. Adversarial training folds a robustness constraint into the objective. The cleaner long-term fix is on the data side: collect or augment so that the equivalence classes the human cares about are the equivalence classes the dataset enforces.</p>
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/2003.01690">F. Croce and M. Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. arxiv 2003.01690, 2020</a></li>
          <li><a href="https://arxiv.org/abs/1608.04644">N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. arxiv 1608.04644, 2016</a></li>
          <li><a href="https://arxiv.org/abs/1802.00420">A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arxiv 1802.00420, 2018</a></li>
          <li><a href="https://arxiv.org/abs/1902.02918">J. M. Cohen, E. Rosenfeld, and J. Z. Kolter. Certified adversarial robustness via randomized smoothing. arxiv 1902.02918, 2019</a></li>
          <li><a href="https://arxiv.org/abs/1906.00945">L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, B. Tran, and A. Madry. Adversarial robustness as a prior for learned representations. arxiv 1906.00945, 2019</a></li>
        </ul>

        

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/1312.6199">1</a>] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arxiv 1312.6199, 2013.</li>
          <li>[<a href="https://arxiv.org/abs/1412.6572">2</a>] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arxiv 1412.6572, 2014.</li>
          <li>[<a href="https://arxiv.org/abs/1706.06083">3</a>] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arxiv 1706.06083, 2017.</li>
          <li>[<a href="https://arxiv.org/abs/1805.12152">4</a>] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. arxiv 1805.12152, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1905.02175">5</a>] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features. arxiv 1905.02175, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/1804.11285">6</a>] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. arxiv 1804.11285, 2018.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/adversarial-fgsm.png"/>
    </item>
    <item>
      <title>Mode connectivity</title>
      <link>https://rifaki.me/notes/mode-connectivity/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/mode-connectivity/</guid>
      <pubDate>Tue, 04 Nov 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Garipov and Draxler&#x27;s curved low-loss paths, the permutation turn, linear mode connectivity, and what the geometry does and does not prove.</p><p><img src="https://rifaki.me/notes/img/mode-connectivity-paths.png" alt="Mode connectivity"/></p>]]></description>
      <content:encoded><![CDATA[<p>The old picture of the loss surface as many isolated basins, one per initialization, has not held up. Freeman and Bruna [<a href="https://arxiv.org/abs/1611.01540">6</a>] suggested early on that low-loss level sets stay connected. Garipov et al. [<a href="https://arxiv.org/abs/1802.10026">1</a>] and Draxler et al. [<a href="https://arxiv.org/abs/1803.00885">2</a>] then made it concrete: two independently trained models can be joined by a smooth low-loss path. The minima are not isolated points; they are reachable from each other through a connected high-dimensional region.</p>

 <p>The experiment is direct. Train two models θ<sub>1</sub> and θ<sub>2</sub> to low loss, and interpolate linearly between them to define a one-parameter family <img src="https://rifaki.me/notes/img/math/a7e21049c68bd0c3.svg" alt="$$\theta(\alpha) = (1-\alpha)\,\theta_1 + \alpha\,\theta_2$$" class="math-display" width="211" height="20"/> for α ∈ [0,1]. The straight segment usually crosses a high-loss barrier <img src="https://rifaki.me/notes/img/math/29403b3b89c65182.svg" alt="$$B = \max_{\alpha \in [0,1]} L(\theta(\alpha)) - \tfrac{1}{2}\big(L(\theta_1) + L(\theta_2)\big).$$" class="math-display" width="346" height="38"/> In other words, the loss jumps up substantially as soon as one steps off either endpoint, and the peak in the middle is typically far above either endpoint loss. Garipov and Draxler's contribution was to show that this barrier exists only along the straight line: if you allow curved paths, you can connect θ<sub>1</sub> and θ<sub>2</sub> with a path that stays at low loss throughout. The endpoints are not separated by an insurmountable barrier; the straight line is just the wrong path through parameter space.</p>
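 <p>A sketch of the measurement, with <code>loss_on_data</code> standing in for an evaluation of the training loss at a given (flattened) parameter vector:</p>

 <pre><code>import numpy as np

def linear_barrier(loss_on_data, theta1, theta2, n_points=25):
    """Height of the loss barrier along the straight segment between two trained models."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path_losses = np.array([loss_on_data((1 - a) * theta1 + a * theta2) for a in alphas])
    return path_losses.max() - 0.5 * (path_losses[0] + path_losses[-1])
</code></pre>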

 <h2>Curves before lines</h2>

 <p>The first demonstrations of mode connectivity used non-linear paths. Garipov et al. parametrized the path as polygonal chains and Bézier curves; Draxler et al. used continuous non-linear paths produced by the Nudged Elastic Band method. Both findings weakened the previous picture in which SGD ends up in sharply-separated basins: if a low-loss path exists between two trained models, the connected low-loss region they both lie in is substantially larger than what a local Hessian analysis at either endpoint would suggest.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/mode-connectivity-paths.png" alt="Figure 1 of Garipov et al. 2018. Loss-landscape view of a low-loss curve connecting two independently trained SGD minima, while the straight segment between them rises through a high-loss barrier.">
 <figcaption>Figure 1 of Garipov et al. [<a href="https://arxiv.org/abs/1802.10026">1</a>]. The straight segment between two minima crosses a barrier; the learned curve stays in low-loss territory.</figcaption>
 </figure>

 <p>Linear mode connectivity is strictly stronger than the curve-based version: it asks the loss to stay low along the straight segment between two minima, not along an arbitrary path. For independently-trained networks from distinct initializations this typically fails. The linearly connected case is essentially confined to one setup. Frankle, Dziugaite, Roy, and Carbin [<a href="https://arxiv.org/abs/1912.05671">3</a>] formalized it as "spawning": train a model θ<sub>0</sub>, fork two copies after k steps using different SGD noise, and train each to convergence. Once k crosses a stability threshold (around 1000-2000 iterations on standard CIFAR networks, and a few percent of training on ImageNet), the two descendants end up linearly connected. They argue this is exactly the lottery-ticket basin: the connected region is the one that the matching sparse sub-network corresponds to, so linear-mode-connectivity becomes a practical test for whether two runs landed in the same effective basin and can therefore be merged without loss.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/mode-connectivity-spawning.png" alt="Figure 3 of Frankle, Dziugaite, Roy, Carbin 2020. Linear interpolation curves for spawn-then-fork SGD pairs at varying late-rewinding points: late enough forks remain linearly connected.">
 <figcaption>Figure 3 of Frankle et al. [<a href="https://arxiv.org/abs/1912.05671">3</a>]. Fork late enough and the two descendants stay linearly connected.</figcaption>
 </figure>

 <h2>The permutation turn</h2>

 <p>Entezari et al. [<a href="https://arxiv.org/abs/2110.06296">4</a>] reframed the geometry. Neural networks have permutation symmetries: swapping units in a hidden layer and unswapping them in the next layer leaves the function unchanged. Two models that look far apart in raw parameter coordinates might just be using different unit orderings of the same function. Quotient by those permutations and many independently trained models turn out to be connected by a simple low-loss curve.</p>

 <p>Git Re-Basin turns the idea into a concrete algorithm. Given two trained models θ<sub>1</sub>, θ<sub>2</sub> with hidden-layer widths {n<sub>ℓ</sub>}, it searches over permutation matrices P<sub>ℓ</sub> ∈ S<sub>n<sub>ℓ</sub></sub> for the alignment that minimizes ‖ θ<sub>1</sub> - P · θ<sub>2</sub> ‖ under a weights or activations metric, then checks whether the aligned models can be merged in weight space. Singh and Jaggi give an optimal-transport version of the same idea: a soft assignment between units that reduces to permutation matching when the widths agree, and that handles mismatched widths when they don't. Neither paper proves that all minima sit in one basin, but together they make a strong empirical case that raw parameter-space interpolation overstates the separation. A meaningful fraction of the apparent barrier is just a bad coordinate system.</p>
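 <p>A sketch of the weight-matching step for a single hidden layer, using the Hungarian algorithm from scipy; a real network needs the per-layer permutations solved jointly (Git Re-Basin alternates over layers), so this is only the one-layer inner step, and the variable names are illustrative:</p>

 <pre><code>import numpy as np
from scipy.optimize import linear_sum_assignment

def match_one_layer(w1_in, w2_in, w1_out, w2_out):
    """Permutation of model 2's hidden units that best aligns them with model 1's.

    w*_in  : (n_hidden, n_in) incoming weights of the layer
    w*_out : (n_out, n_hidden) outgoing weights of the next layer
    """
    # Similarity between hidden unit i of model 1 and unit j of model 2,
    # measured on both incoming and outgoing weights.
    cost = -(w1_in @ w2_in.T + w1_out.T @ w2_out)
    _, perm = linear_sum_assignment(cost)
    return perm  # apply as w2_in[perm] and w2_out[:, perm] before interpolating
</code></pre>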

 <figure class="tweet-embed">
 <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Say you train Model A. <br><br>Independently, your friend trains Model B, possibly on different data. <br><br>With Git Re-Basin, you can merge models A+B in weight space at _no cost to the loss_</p>&mdash; Samuel &quot;curry-howard fanboi&quot; Ainsworth (@SamuelAinsworth) <a href="https://twitter.com/SamuelAinsworth/status/1569719499263471616?ref_src=twsrc%5Etfw">September 13, 2022</a></blockquote>
 </figure>

 <figure>
 <img src="https://rifaki.me/notes/img/mode-connectivity-simplex.png" alt="Figure 1 of Ainsworth, Hayase, Srinivasa 2023 (Git Re-Basin). Linear-interpolation barrier between two independently trained networks before and after permutation alignment.">
 <figcaption>Figure 1 of Ainsworth et al. [<a href="https://arxiv.org/abs/2209.04836">5</a>]. Aligning hidden units by permutation collapses most of the apparent linear-interpolation barrier.</figcaption>
 </figure>

 <h2>Empirical evidence ties together alignment and merging</h2>

 <p>Tatro et al. [<a href="https://arxiv.org/abs/2009.02439">7</a>] showed empirically that aligning models before fitting a connecting curve produces shorter curves with lower loss along them, which is the consistency check the permutation story predicts: correcting for symmetries should give a simpler geometry than the raw view. Benton et al. [<a href="https://arxiv.org/abs/2102.13042">8</a>] extended this beyond curves to higher-dimensional simplexes of solutions: once symmetries are corrected, low-loss volumes contain many independently trained checkpoints.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/mode-connectivity-rebasin.png" alt="Figure 2 of Ainsworth, Hayase, Srinivasa 2023. Linear interpolation barriers between two independently trained networks across MNIST/CIFAR-10/ImageNet under naive, activation-matching, weight-matching, and STE-matching alignment schemes.">
 <figcaption>Figure 2 of Ainsworth et al. [<a href="https://arxiv.org/abs/2209.04836">5</a>]. Aligning hidden units before interpolation collapses most of the apparent barrier across architectures and datasets.</figcaption>
 </figure>

 <h2>Practical applications</h2>

 <p>Model merging is what gets built on top. SWA averages late-training checkpoints and works because the trajectory it averages over stays inside one connected low-loss region. Model Soups average independently fine-tuned models from a shared pretraining initialization, which puts every fine-tune inside the same connected component and close to the others. Git Re-Basin generalizes this further by aligning unit permutations so models with no shared initialization can be merged at all. Across architectures, weight space behaves like a shared workspace in which related models can be merged without losing function, once symmetries and shared training histories have been accounted for.</p>
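 <p>The merge itself is nothing more than a parameter-wise mean; a sketch over PyTorch state dicts, assuming the checkpoints share an architecture (and, for soups, a pretraining initialization):</p>

 <pre><code>import torch

def average_state_dicts(state_dicts):
    """Uniform weight-space average of checkpoints, the merge behind SWA and model soups."""
    avg = {}
    for key, ref in state_dicts[0].items():
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0).to(ref.dtype)
    # Batch-norm running statistics are usually re-estimated with a forward pass
    # over training data after averaging.
    return avg
</code></pre>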

 <p>Mode connectivity does not explain generalization. A connected low-training-loss region can contain many bad predictors on held-out data, and showing that two solutions are connected says nothing about how either performs on unseen examples. It also does not guarantee that every architecture, dataset, or training recipe lives in a single basin; the broader single-basin claims have counter-examples. I read "one wide basin" as rhetorically appealing but over-reaching the evidence. The established claim is weaker: solutions reachable from a fixed initialization, or from independent runs once permutations are aligned, lie in a single connected low-loss region. That is enough to explain why SWA, model soups, and weight-space ensembling work. I would not push the geometry harder than that.</p>
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/1910.05653">S. P. Singh and M. Jaggi. Model fusion via optimal transport. arxiv 1910.05653, 2020</a></li>
          <li><a href="https://arxiv.org/abs/1803.05407">P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads to wider optima and better generalization. arxiv 1803.05407, 2018</a></li>
          <li><a href="https://arxiv.org/abs/2203.05482">M. Wortsman et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. arxiv 2203.05482, 2022</a></li>
        </ul>

        

<h2>References</h2>

 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1802.10026">1</a>] T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, and A. G. Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. arxiv 1802.10026, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1803.00885">2</a>] F. Draxler, K. Veschgini, M. Salmhofer, and F. A. Hamprecht. Essentially no barriers in neural network energy landscape. arxiv 1803.00885, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1912.05671">3</a>] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Linear mode connectivity and the lottery ticket hypothesis. arxiv 1912.05671, 2019.</li>
 <li>[<a href="https://arxiv.org/abs/2110.06296">4</a>] R. Entezari, H. Sedghi, O. Saukh, and B. Neyshabur. The role of permutation invariance in linear mode connectivity of neural networks. arxiv 2110.06296, 2021.</li>
 <li>[<a href="https://arxiv.org/abs/2209.04836">5</a>] S. K. Ainsworth, J. Hayase, and S. Srinivasa. Git Re-Basin: merging models modulo permutation symmetries. arxiv 2209.04836, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/1611.01540">6</a>] C. D. Freeman and J. Bruna. Topology and geometry of half-rectified network optimization. arxiv 1611.01540, 2016.</li>
 <li>[<a href="https://arxiv.org/abs/2009.02439">7</a>] N. Tatro, P.-Y. Chen, P. Das, I. Melnyk, P. Sattigeri, and R. Lai. Optimizing mode connectivity via neuron alignment. arxiv 2009.02439, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/2102.13042">8</a>] G. Benton, W. J. Maddox, S. Lotfi, and A. G. Wilson. Loss surface simplexes for mode connecting volumes and fast ensembling. arxiv 2102.13042, 2021.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/mode-connectivity-paths.png"/>
    </item>
    <item>
      <title>The neural tangent kernel</title>
      <link>https://rifaki.me/notes/neural-tangent-kernel/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/neural-tangent-kernel/</guid>
      <pubDate>Sat, 09 Aug 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Jacot&#x27;s infinite-width limit, the lazy-training regime, and what the NTK explains and what feature learning leaves on the table.</p><p><img src="https://rifaki.me/notes/img/ntk-linearization.png" alt="The neural tangent kernel"/></p>]]></description>
      <content:encoded><![CDATA[<p>The neural tangent kernel was one of the few deep-learning theory ideas that were useful before they became a concept. It doesn't solve generalization, but it makes a very stubborn object analyzable. Jacot, Gabriel, and Hongler [<a href="https://arxiv.org/abs/1806.07572">1</a>] found that if you take a network to infinite width under the right scaling, gradient descent on the parameters becomes kernel gradient descent in function space. The kernel isn't chosen by hand, it's induced by the network at initialization. For a model f<sub>θ</sub>, the tangent kernel is K<sub>θ</sub>(x, x') = ∇<sub>θ</sub> f<sub>θ</sub>(x)<sup>⊤</sup> ∇<sub>θ</sub> f<sub>θ</sub>(x'). In finite networks this kernel changes as you train. In the infinite width limit under the standard parameterization, it converges to some deterministic kernel K<sub>∞</sub> and ‖K<sub>θ<sub>t</sub></sub> - K<sub>∞</sub>‖ approaches O(1/√n) in width n.</p>

        <p>Function-space gradient descent then satisfies ḟ<sub>t</sub> = -K<sub>∞</sub> (f<sub>t</sub> - y) on the training set and integrates to f<sub>t</sub> = y + e<sup>-K<sub>∞</sub> t</sup>(f<sub>0</sub> - y). Parameter-space non-convexity stops mattering: the function-space dynamics are linear and driven by a positive semidefinite kernel.</p>

        <h2>NTK as baseline</h2>

        <p>The first thing the NTK explained is why very wide networks optimize so easily. If the kernel is well conditioned on the training data, gradient descent has an easy road to interpolation. The complicated nonconvex path is, at leading order, kernel regression with some specific architecture induced kernel. Du et al. [<a href="https://arxiv.org/abs/1810.02054">4</a>] and Lee et al. [<a href="https://arxiv.org/abs/1902.06720">2</a>] pushed this picture further and showed that wide networks of any depth evolve like their first-order Taylor expansion around initialization.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/ntk-linearization.png" alt="Figure 2 of Lee et al. 2019. Predictions from the linearized infinite-width model match the trajectory of the actual wide finite network during gradient-descent training.">
          <figcaption>Figure 2 of Lee et al. [<a href="https://arxiv.org/abs/1902.06720">2</a>]. The linearization at initialization tracks the wide-network trajectory closely under gradient descent.</figcaption>
        </figure>

        <p>Du et al. extend the same machinery to a clean proof that overparameterized networks reach zero training loss with a polynomial-width requirement and a global-convergence guarantee that the nonconvex landscape never gave you. But the spectrum of K<sub>∞</sub> does more than set the speed of convergence. Its eigendecomposition K<sub>∞</sub> = ∑<sub>k</sub> λ<sub>k</sub> φ<sub>k</sub> φ<sub>k</sub><sup>⊤</sup> implies that the residual along eigenmode φ<sub>k</sub> shrinks like e<sup>-λ<sub>k</sub> t</sup>, so large-eigenvalue modes are learned quickly and small-eigenvalue modes slowly, or not at all under early stopping. Early stopping, the frequency principle, and the spectral bias of MLPs all become statements about {λ<sub>k</sub>}.</p>
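        <p>The spectral claim is easy to check numerically: diagonalize the kernel Gram matrix and watch the residual projected onto each eigenmode decay at its own exponential rate. A small numpy sketch of the linearized function-space dynamics:</p>

        <pre><code>import numpy as np

def ntk_mode_residuals(K, f0, y, times):
    """Per-eigenmode residuals of f_t = y + exp(-K t)(f_0 - y) for a PSD kernel Gram matrix K."""
    eigvals, eigvecs = np.linalg.eigh(K)
    r0 = eigvecs.T @ (f0 - y)                 # initial residual in the eigenbasis
    # The residual along mode k decays as exp(-lambda_k * t): fast modes first, slow modes last.
    return np.array([r0 * np.exp(-eigvals * t) for t in times])
</code></pre>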

        <figure>
          <img src="https://rifaki.me/notes/img/ntk-spectrum.png" alt="Figure 1 of Cao et al. 2019 (spectral bias of deep learning). Projection lengths along the lowest few eigenmodes of the NTK as a function of training step: low-frequency (small k) components are fit much faster than higher-frequency components.">
          <figcaption>Figure 1 of Cao et al. [<a href="https://arxiv.org/abs/1912.01198">arxiv 1912.01198</a>]. Low-frequency components of the target are absorbed by the network long before higher-frequency components, in the order predicted by the NTK spectrum.</figcaption>
        </figure>

        <p>Arora et al. write down an exact algorithm for computing K<sub>∞</sub> for fully connected and convolutional networks of arbitrary depth, making these spectral predictions empirically testable on real datasets.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/ntk-spectrum-timeline.png" alt="Figure 2 of Bordelon, Canatar, Pehlevan 2020. Spectrum-dependent generalization-error scaling: per-mode learning curves $E_k(p)/E_k(0)$ versus number of training samples for varying eigenmode index, input dimension, and depth, all approaching the predicted $1/p^\alpha$ envelope.">
          <figcaption>Figure 2 of Bordelon et al. [<a href="https://arxiv.org/abs/2002.02561">arxiv 2002.02561</a>]. Generalization on each NTK eigenmode follows a spectrum-dependent power law in sample count.</figcaption>
        </figure>

        <h2>The lazy-training caveat</h2>

        <p>The catch is that the same condition that makes the theory clean also removes one of the main things deep networks seem to be doing. In the NTK limit, features do not move. Parameters drift by ‖θ<sub>t</sub> - θ<sub>0</sub>‖ = O(1/√n) in width n while the function changes by O(1), so the network is producing its outputs by reweighting an almost fixed collection of random features. Chizat, Oyallon, and Bach call this the lazy-training regime and emphasize that it is a property of the scaling, not a universal description of neural networks.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/ntk-lazy-vs-feature-learning.png" alt="Figure 1 of Chizat, Oyallon, Bach 2019. Lazy regime versus feature-learning regime trajectories on a 2D classification problem: the lazy regime stays near initialization while feature learning moves substantially.">
          <figcaption>Figure 1 of Chizat, Oyallon, and Bach [<a href="https://arxiv.org/abs/1812.07956">3</a>]. The NTK limit is powerful because it freezes feature movement; that is also what it cannot explain.</figcaption>
        </figure>

        <p>That caveat matters; a convolutional network trained in the lazy regime can optimize while failing to learn the representations that make convolutional networks useful. A transformer that looks like a fixed random-feature model is not the object that in-context learning, induction-head formation, and abstraction discussions are pointing at. The NTK gives a rigorous theory of one limit. The question is whether that limit keeps the right phenomena. Geiger et al. [<a href="https://arxiv.org/abs/1906.08034">5</a>] report a sharp empirical separation: at moderate width and standard initialization scale, networks operate near the lazy regime; at lower initialization scale (or with explicit feature-learning parameterizations) the same architecture enters a regime where features evolve and test error improves.</p>

        <p>The transition is controlled by initialization scale and width, not by anything intrinsic to the architecture. That is a slightly disappointing answer if you wanted neural networks to be feature learners by default.</p>

        <p>The NTK is not wrong; it is a baseline. A phenomenon that already appears in the NTK limit can be attributed to width, interpolation, and fixed random features, with no representation learning needed. A phenomenon that disappears in the NTK limit is the work of feature learning, finite-width fluctuation, architecture-specific structure, or nonlinearity in the optimization. That makes the NTK a useful negative control for theoretical claims about deep learning, and the simpler question to put to such a claim is: would it still hold if the features were frozen? For most optimization claims, yes. For most generalization and capability claims, no. The Distill circuits thread is the non-theorem-shaped version of what feature learning looks like when someone manages to pry a model open. Greg Yang's μP writeup is the practical entry into Tensor Programs, and the microsoft/mup repo is what to grab when the goal is hyperparameter transfer and not the theory.</p>

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Excited to share our new <a href="https://twitter.com/hashtag/neurips2020?src=hash&amp;ref_src=twsrc%5Etfw">#neurips2020</a> paper /Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel/ (<a href="https://t.co/4jmrNfOE6H">https://t.co/4jmrNfOE6H</a>) with @KDziugaite, Mansheej, <a href="https://twitter.com/SKharaghani?ref_src=twsrc%5Etfw">@SKharaghani</a>, <a href="https://twitter.com/roydanroy?ref_src=twsrc%5Etfw">@roydanroy</a>, <a href="https://twitter.com/SuryaGanguli?ref_src=twsrc%5Etfw">@SuryaGanguli</a> 1/6 <a href="https://t.co/iPPP3HmNgm">pic.twitter.com/iPPP3HmNgm</a></p>&mdash; Stanislav Fort (@stanislavfort) <a href="https://twitter.com/stanislavfort/status/1322246600320757760?ref_src=twsrc%5Etfw">October 30, 2020</a></blockquote>
        </figure>

        <h2>Feature learning is the missing term</h2>

        <p>The frontier after the NTK was to build infinite-width limits in which features actually move. Mean-field limits treat each unit as a particle in a measure and study gradient flow on that measure; in this scaling features evolve and the kernel is no longer constant. Tensor-program analyses catalogue the parameterizations that produce sensible infinite-width limits at all. The maximal-update parameterization μP, introduced by Yang and Hu in Tensor Programs IV (<a href="https://arxiv.org/abs/2011.14522">arxiv 2011.14522</a>), is the one that keeps both feature learning and stable optimization in the limit. The follow-up Tensor Programs V (<a href="https://arxiv.org/abs/2203.03466">arxiv 2203.03466</a>) derives the μTransfer hyperparameter-transfer rules from that analysis: tune at small width, scale to large width, and the learning-rate schedule transfers without retuning.</p>

        <p>Feature learning means the tangent kernel is moving substantively: K<sub>θ<sub>t</sub></sub> - K<sub>θ<sub>0</sub></sub> is a structured rotation of the features toward the data, not a small perturbation. The parameter-gradients at the end of training are not the same object as at initialization, and the network has in effect changed the basis it works in. That change is exactly what the pure NTK limit suppresses. Fort et al. ran one of the clearest empirical comparisons: kernel learning matches a finite network early in training but the two diverge later, and the divergence is the gap between lazy convergence to a fixed kernel and feature-driven re-shaping of it.</p>

        <p>I use the NTK as a falsifier, not a model. If a proposed mechanism for a deep-learning phenomenon is already trivially in the NTK regime, then "the network is wide and the features are random" suffices, and the explanation has not earned the depth of its hypothesis. The interesting predictions are the ones that disagree with the kernel: where width, lazy init, and architecture-induced spectra are not enough, and where representation change must be doing the work. That includes in-context learning, induction-head formation, and the parts of scaling laws that depend on where compute is spent.</p>
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/1904.11955">S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. arxiv 1904.11955, 2019</a></li>
          <li><a href="https://arxiv.org/abs/1812.11118">M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. arxiv 1812.11118, 2018</a></li>
          <li><a href="https://arxiv.org/abs/1804.06561">S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. arxiv 1804.06561, 2018</a></li>
          <li><a href="https://arxiv.org/abs/2011.14522">G. Yang and E. J. Hu. Tensor Programs IV: feature learning in infinite-width neural networks. arxiv 2011.14522, 2020</a></li>
          <li><a href="https://arxiv.org/abs/1912.01198">Y. Cao, Z. Fang, Y. Wu, D.-X. Zhou, and Q. Gu. Towards understanding the spectral bias of deep learning. arxiv 1912.01198, 2019</a></li>
          <li><a href="https://arxiv.org/abs/2002.02561">B. Bordelon, A. Canatar, and C. Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. arxiv 2002.02561, 2020</a></li>
        </ul>

        

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/1806.07572">1</a>] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: convergence and generalization in neural networks. arxiv 1806.07572, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1902.06720">2</a>] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arxiv 1902.06720, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/1812.07956">3</a>] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. arxiv 1812.07956, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1810.02054">4</a>] S. S. Du, X. Zhai, B. P&oacute;czos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arxiv 1810.02054, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1906.08034">5</a>] M. Geiger, S. Spigler, A. Jacot, and M. Wyart. Disentangling feature and lazy training in deep neural networks. arxiv 1906.08034, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/2010.15110">6</a>] S. Fort, G. K. Dziugaite, M. Paul, S. Kharaghani, D. M. Roy, and S. Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. arxiv 2010.15110, 2020.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/ntk-linearization.png"/>
    </item>
    <item>
      <title>The implicit-bias program</title>
      <link>https://rifaki.me/notes/implicit-bias/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/implicit-bias/</guid>
      <pubDate>Wed, 14 May 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Soudry&#x27;s max-margin result for linear models, the geometry of optimizer choice, and how cleanly the linear case does and does not transfer to deep networks.</p><p><img src="https://rifaki.me/notes/img/implicit-bias-margin.png" alt="The implicit-bias program"/></p>]]></description>
      <content:encoded><![CDATA[<p>Why does training a model without an explicit regularizer, with the loss driven nearly to zero, still produce a solution that generalizes? The classical answer is that the objective has to carry the regularizer somewhere. The implicit-bias answer is more subtle: even when the objective has many minima, gradient descent does not choose among them neutrally; the algorithm itself selects a particular kind of solution.</p>

 <p>Zhang et al. [<a href="https://arxiv.org/abs/1611.03530">2</a>] sharpened the puzzle that most of this literature now opens with. Standard image classifiers fit random labels as easily as they fit true ones, which kills the simplest capacity-based explanation of generalization: the hypothesis class is large enough to memorize anything. Whatever is producing the generalization, then, must come from the optimizer, the data, or the parameterization, not from the loss function itself.</p>

 <p>The clearest version of this story is not actually about neural networks. It is about logistic regression on linearly separable data. Once the classifier separates the data, the empirical classification error is already 0% and the logistic loss continues to decrease as the norm of the weights grows. There is no finite minimizer. What is more surprising is that the direction of the weights still converges. And it converges to the maximum-margin SVM solution.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/implicit-bias-margin.png" alt="Figure 1 of Soudry et al. 2018. Five-panel layout: (A) 2D separable data with the converged separator, (B) normalized weight norm growing logarithmically, (C) logistic loss decaying, (D) angle gap to max-margin direction shrinking, (E) margin gap closing.">
 <figcaption>Figure 1 of Soudry et al. [<a href="https://arxiv.org/abs/1710.10345">1</a>]. The loss has no finite minimizer on separable data, the norm grows without bound, and the normalized direction converges to the hard-margin separator.</figcaption>
 </figure>

 <h2>The linear theorem</h2>

 <p>Let x<sub>i</sub> ∈ ℝ<sup>d</sup> denote an input vector with binary label y<sub>i</sub> ∈ {-1,+1}, and let w<sub>t</sub> ∈ ℝ<sup>d</sup> denote the weight vector at iteration t. For linearly separable data {(x<sub>i</sub>, y<sub>i</sub>)}<sub>i=1</sub><sup>n</sup>, standard gradient descent on the logistic loss sends ‖ w<sub>t</sub> ‖ → ∞, but the normalized direction w<sub>t</sub> / ‖ w<sub>t</sub> ‖ converges to the L2 max-margin direction ŵ / ‖ ŵ ‖, where
 <img src="https://rifaki.me/notes/img/math/f4a525e8c1ec0762.svg" alt="$$\hat{w} \;=\; \arg\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\quad y_i\,w^{\top} x_i \ge 1 \quad \forall i \in [n].$$" class="math-display" width="449" height="35"/>
 </p>

 <p>In words: once the classifier has separated the data, gradient descent keeps reducing the logistic loss by scaling the separator up, so unlike a traditional SVM it grows the norm of the weights without bound. The norm grows only logarithmically with t, and the angle gap between w<sub>t</sub>/‖w<sub>t</sub>‖ and ŵ/‖ŵ‖ closes at a comparably slow logarithmic rate. This is why the asymptotic regime takes so many iterations to become visible.</p>

 <p>A few things follow from the theorem. Gradient descent behaves as if it had been regularized toward the Euclidean max-margin classifier without anyone writing that regularizer down. It also explains why early stopping helps: because the weight norm diverges over time, stopping early caps the implicit penalty before the iterate gets too far. Ji and Telgarsky extended the result to the non-separable case, showing the iterate still tracks a unique ray defined by the data when no separating hyperplane exists. Nacson et al. [<a href="https://arxiv.org/abs/1803.01905">4</a>] showed that aggressive learning-rate schedules accelerate convergence to the max-margin direction by polynomial factors over plain GD. The norm divergence itself is robust to dataset size and dimension: as long as the data is linearly separable, the iterate keeps moving away from the origin and never lands at a finite minimum.</p>
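 <p>The theorem is cheap to see numerically. A minimal sketch (my own toy setup, not the paper's experiments): run full-batch gradient descent on the logistic loss over a small separable dataset and watch the norm keep growing while the normalized direction stops moving; the stabilized direction can then be checked against any hard-margin SVM solver.</p>

 <pre><code># Sketch: full-batch gradient descent on the logistic loss over a separable
# toy dataset. The weight norm keeps growing (roughly like log t) while the
# normalized direction stabilizes. Illustrative only, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 2
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
X += 0.3 * y[:, None]                      # push the classes apart: strictly separable

def logistic_grad(w):
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

w = np.zeros(d)
eta = 0.5
for t in range(1, 200001):
    w -= eta * logistic_grad(w)
    if t in (1000, 10000, 100000, 200000):
        print(t, "norm:", round(float(np.linalg.norm(w)), 2),
              "direction:", np.round(w / np.linalg.norm(w), 4))
</code></pre>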

 <figure>
 <img src="https://rifaki.me/notes/img/implicit-bias-margin-flow.png" alt="Figure 2 of Soudry et al. 2018. Three panels on a real classification dataset: training/validation objective loss, classification error, and L2 norm of the final layer growing as training progresses.">
 <figcaption>Figure 2 of Soudry et al. [<a href="https://arxiv.org/abs/1710.10345">1</a>]. On real data, classification error plateaus near zero while the L2 norm of the final layer keeps growing - the asymptotic regime described in the linear theorem.</figcaption>
 </figure>

 <h2>The geometry enters</h2>

 <p>Gunasekar et al. [<a href="https://arxiv.org/abs/1802.08246">3</a>] generalized the result: different optimization geometries select different implicit regularizers. Steepest descent with respect to the ℓ<sub>p</sub> norm converges in direction to the maximum ℓ<sub>p</sub>-margin separator rather than the ℓ<sub>2</sub> one. Mirror descent with respect to a convex potential Φ selects the interpolant that minimizes Φ. Natural gradient and adaptive methods land at interpolants determined by the geometry of their step.</p>

 <p>The linear-convolutional-network result is the cautionary case. For fully connected linear predictors, gradient descent picks out the familiar ℓ<sub>2</sub> margin geometry. For full-width linear convolutional networks of depth L, Gunasekar et al. show that gradient descent instead selects the predictor minimizing the 2/L-bridge penalty in the discrete Fourier domain. The architecture changes which parameters are being optimized, and that change shifts both the trajectory and the preferred solution. "Gradient descent likes simple solutions" is too vague to be a theorem. The more honest statement is that gradient descent likes simple solutions in whichever coordinate system the architecture imposes.</p>
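 <p>A toy version of the selection-rule point, in the regression setting rather than the classification one: on an underdetermined least-squares problem, gradient descent converges to whichever interpolant is closest to its starting point in the Euclidean metric, so changing the starting point, or the metric in which "closest" is measured, changes the solution the optimizer hands back. A sketch (illustrative, not from the papers above):</p>

 <pre><code># Sketch: the optimizer geometry as a selection rule over interpolants.
# An underdetermined least-squares problem has infinitely many exact solutions;
# full-batch GD converges to the one closest to its starting point in L2.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def run_gd(w0, steps=20000, eta=1e-3):
    w = w0.copy()
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - y)
    return w

w_from_zero = run_gd(np.zeros(p))
w0 = rng.normal(size=p)
w_from_w0 = run_gd(w0)

w_minnorm = np.linalg.pinv(X) @ y                  # minimum-L2-norm interpolant
w_closest = w0 + np.linalg.pinv(X) @ (y - X @ w0)  # interpolant closest to w0

print(np.allclose(w_from_zero, w_minnorm, atol=1e-4))  # True: GD from 0 selects min-norm
print(np.allclose(w_from_w0, w_closest, atol=1e-4))    # True: GD from w0 selects a different one
print(float(np.linalg.norm(w_from_zero)), float(np.linalg.norm(w_from_w0)))
</code></pre>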

 <figure>
 <img src="https://rifaki.me/notes/img/implicit-bias-optimizer-geometry.png" alt="Three-panel figure from Gunasekar, Lee, Soudry, Srebro 2018: (a) mirror descent with primal momentum, (b) natural gradient descent at varying step sizes, (c) steepest descent under the 4/3 norm. Each optimizer trajectory lands on a different interpolating solution along the same zero-loss line.">
 <figcaption>From Gunasekar et al. [<a href="https://arxiv.org/abs/1802.08246">3</a>]. Implicit bias is a selection rule over interpolants determined by the geometry of the optimizer, not a single universal preference.</figcaption>
 </figure>

 <h2>Margin in homogeneous networks</h2>

 <p>Lyu and Li push the result past linear predictors. If f<sub>θ</sub> is positively homogeneous in θ with order L (which holds for ReLU networks without bias, with L equal to depth), then gradient flow on exponential or logistic loss drives θ<sub>t</sub> / ‖ θ<sub>t</sub> ‖ to a KKT point of the parameter-space margin program max<sub>‖ θ ‖ ≤ 1</sub> min<sub>i</sub> y<sub>i</sub> f<sub>θ</sub>(x<sub>i</sub>). The norm still diverges; the normalized direction still converges; the new content is that even on a non-convex parameter landscape, gradient flow lands on points satisfying first-order optimality conditions for the margin program.</p>
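 <p>To make the homogeneity condition concrete: a bias-free two-layer ReLU network is positively homogeneous of order L = 2 in its parameters, so the raw margin can be inflated by rescaling and the meaningful object is the normalized margin min<sub>i</sub> y<sub>i</sub> f<sub>θ</sub>(x<sub>i</sub>) / ‖θ‖<sup>L</sup>. A minimal numerical check of both facts (my own sketch, not from the paper):</p>

 <pre><code># Sketch: order-2 positive homogeneity of a bias-free two-layer ReLU net, and
# the normalized margin tracked by Lyu and Li. Scaling the parameters by c
# scales the output by c**2, so the raw margin is meaningless under rescaling
# and the normalized margin is the right invariant. Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
d, h, n = 5, 16, 30
W1 = rng.normal(size=(h, d)) / np.sqrt(d)   # first layer, no bias
w2 = rng.normal(size=h) / np.sqrt(h)        # output layer, no bias
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def f(W1, w2, X):
    return np.maximum(X @ W1.T, 0.0) @ w2   # ReLU hidden layer, linear output

def normalized_margin(W1, w2):
    theta_norm_sq = np.sum(W1**2) + np.sum(w2**2)
    return np.min(y * f(W1, w2, X)) / theta_norm_sq   # divide by ||theta||**L with L = 2

c = 3.0
print(np.allclose(f(c * W1, c * w2, X), c**2 * f(W1, w2, X)))  # True: homogeneity of order 2
print(normalized_margin(W1, w2), normalized_margin(c * W1, c * w2))  # equal under rescaling
</code></pre>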

 <p>Chizat and Bach prove a parallel mean-field result for two-layer networks with vanishing initializations. The implicit bias there is F<sub>1</sub>-norm minimization in function space, which is a different object from parameter-space margin maximization and interacts differently with the data. The state of the field is that implicit regularization in deep networks has several reasonable descriptions, none of which generalize cleanly past shallow or homogeneous models.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/implicit-bias-dynamics.png" alt="Figure 1 of Lyu and Li 2020. Training loss and normalized margin trajectories for homogeneous networks under fixed and loss-based learning rates: the loss collapses while the normalized margin keeps rising toward a KKT point of the parameter-space margin program.">
 <figcaption>Figure 1 of Lyu and Li [<a href="https://arxiv.org/abs/1906.05890">arxiv 1906.05890</a>]. The loss keeps shrinking, the weight norm grows, and the useful object is the normalized direction.</figcaption>
 </figure>

 <figure class="tweet-embed">
 <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">It is widely thought that neural networks generalize because of implicit regularization of gradient descent. Today at <a href="https://twitter.com/hashtag/ICLR2023?src=hash&amp;ref_src=twsrc%5Etfw">#ICLR2023</a> we show new evidence to the contrary. We train with gradient-free optimizers and observe generalization competitive with SGD.<a href="https://t.co/8Vo9rFI9FY">https://t.co/8Vo9rFI9FY</a></p>&mdash; Tom Goldstein (@tomgoldsteincs) <a href="https://twitter.com/tomgoldsteincs/status/1653284772314005505?ref_src=twsrc%5Etfw">May 2, 2023</a></blockquote>
 </figure>

 <p>My own reading is that the implicit-bias program is one of the few cases I can point to of a research direction being vindicated and outgrown at the same time. Soudry et al. [<a href="https://arxiv.org/abs/1710.10345">1</a>] is true; the mechanism is real; the linear case is the only setting where I can prove anything I trust. What is unclear is whether the same phenomenon is the dominant explanation for why large feature-learning networks generalize, or whether at scale the data distribution and the architecture have already done so much of the work that the optimizer's preference is a small correction. I currently believe the second, but I do not have a falsifier I trust, which is exactly the position the field is in.</p>
 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://arxiv.org/abs/1806.00468">S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. arxiv 1806.00468, 2018</a></li>
 <li><a href="https://arxiv.org/abs/1803.07300">Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arxiv 1803.07300, 2018</a></li>
 <li><a href="https://arxiv.org/abs/1906.05890">K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. arxiv 1906.05890, 2019</a></li>
 <li><a href="https://arxiv.org/abs/2002.04486">L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arxiv 2002.04486, 2020</a></li>
 <li><a href="https://arxiv.org/abs/2106.09524">S. Pesme, L. Pillaud-Vivien, and N. Flammarion. Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity. arxiv 2106.09524, 2021</a></li>
 <li><a href="https://arxiv.org/abs/2007.06738">E. Moroshko, S. Gunasekar, B. Woodworth, J. D. Lee, N. Srebro, and D. Soudry. Implicit bias in deep linear classification: initialization scale vs training accuracy. arxiv 2007.06738, 2020</a></li>
 </ul>

 

<h2>References</h2>

 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1710.10345">1</a>] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. arxiv 1710.10345, 2017.</li>
 <li>[<a href="https://arxiv.org/abs/1611.03530">2</a>] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arxiv 1611.03530, 2016.</li>
 <li>[<a href="https://arxiv.org/abs/1802.08246">3</a>] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. arxiv 1802.08246, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1803.01905">4</a>] M. S. Nacson, J. Lee, S. Gunasekar, P. H. P. Savarese, N. Srebro, and D. Soudry. Convergence of gradient descent on separable data. arxiv 1803.01905, 2019.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/implicit-bias-margin.png"/>
    </item>
    <item>
      <title>The edge of stability</title>
      <link>https://rifaki.me/notes/edge-of-stability/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/edge-of-stability/</guid>
      <pubDate>Thu, 27 Feb 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Cohen&#x27;s edge-of-stability finding, Arora&#x27;s analysis, and what it does to the older flat-minima story.</p><p><img src="https://rifaki.me/notes/img/cohen-fig1.png" alt="The edge of stability"/></p>]]></description>
      <content:encoded><![CDATA[<p>Cohen et al. [<a href="https://arxiv.org/abs/2103.00065">1</a>] observed that gradient descent on neural networks spends most of training in a regime where the top Hessian eigenvalue λ<sub>max</sub> is above the classical stability threshold 2/η. The step is formally unstable (η λ<sub>max</sub> > 2), but the loss does not diverge: the trajectory oscillates along the unstable direction while continuing to make progress on the rest. They called this the edge of stability, and the point is that it is not an edge case but the ordinary regime of neural-network training.</p>

        <h2>Progressive sharpening</h2>

        <p>The first phase Cohen identified is progressive sharpening. From random initialization, gradient descent reliably drives the sharpness (λ<sub>max</sub>(H)) of the loss landscape upward during training.</p>

        <p>That direction is already counterintuitive. Classical optimization theory says you want to avoid sharp regions, since larger curvature forces a smaller stable step size. Neural-network training does the opposite: it walks into sharper regions until the classical step size is no longer stable. Progressive sharpening itself is not well understood from first principles. Damian et al. [<a href="https://arxiv.org/abs/2209.15594">2</a>] give a self-stabilization argument for what happens after the threshold is reached, and Ahn et al. [<a href="https://arxiv.org/abs/2204.01050">4</a>] analyze the unstable-convergence regime directly, but neither predicts the sharpening from initialization. The observational picture is clean; the theoretical one is not.</p>

        <h2>Above the threshold</h2>

        <p>Once sharpness crosses 2/η, the textbook prediction is divergence; instead the trajectory oscillates along the top eigendirection of the Hessian. The component of the iterate along that direction swings back and forth, and the loss still falls because optimization keeps making progress on the better-conditioned directions. The net result is a trajectory that reduces loss while sitting in a region of the landscape that classical theory says it should not occupy. The first theoretical account of why this need not be a pathology comes from Arora et al. [<a href="https://arxiv.org/abs/2205.09745">3</a>]. Their analysis works on a smoothed version of the loss, where the Hessian is treated as locally fixed, and in effect tracks the trajectory you would follow from a given starting point under that fixed Hessian.</p>
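        <p>The quantity being tracked, λ<sub>max</sub> of the training-loss Hessian, is in practice estimated with power iteration on Hessian-vector products rather than by forming the Hessian. A minimal sketch of that measurement against the 2/η threshold; <code>model</code>, <code>loss_fn</code>, and the batch are placeholders for the reader's own setup, not names from the paper's code.</p>

        <pre><code># Sketch: estimate the sharpness lambda_max(H) of the training loss with power
# iteration on Hessian-vector products, and compare it to the 2/eta threshold.
# `model`, `loss_fn`, and the batch (x, targets) are placeholders.
import torch

def sharpness(model, loss_fn, x, targets, iters=20):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    lam = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))      # scalar g.v, differentiable
        hv = torch.autograd.grad(gv, params, retain_graph=True)  # Hessian-vector product Hv
        lam = sum((h * vi).sum() for h, vi in zip(hv, v)).item() # Rayleigh quotient v.Hv
        v = [h.detach() for h in hv]                             # power-iteration update
    return lam

# eta = optimizer.param_groups[0]["lr"]
# print("sharpness:", sharpness(model, loss_fn, x, targets), "threshold 2/eta:", 2 / eta)
</code></pre>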

        <figure>
          <img src="https://rifaki.me/notes/img/cohen-fig1.png" alt="Figure 1 of Cohen et al. 2021. Train loss (top row) and Hessian sharpness (bottom row) over training steps for a fully-connected net on a CIFAR-10 5k subset, VGG on CIFAR-10, and ResNet on CIFAR-10. In every case, sharpness rises until it hits the $2/\eta$ threshold (dashed) and oscillates along it.">
          <figcaption>Figure 1 of Cohen et al. [<a href="https://arxiv.org/abs/2103.00065">1</a>]. Across architectures, the Hessian's top eigenvalue rises during a progressive-sharpening phase and then sits near the 2/η stability threshold for the remainder of training.</figcaption>
        </figure>

        <p>The mechanism is that the oscillation Δ θ<sub>t</sub> across the unstable direction averages to zero, so the effective dynamics is slower and looks like gradient descent on a loss with the steepest direction clipped. The account is formal enough to be checked against real training runs, and in most common settings it holds up.</p>

        <h2>And flat minima</h2>

        <p>The earlier flat-minima story started with Hochreiter and Schmidhuber and continued with Keskar et al. [<a href="https://arxiv.org/abs/1609.04836">7</a>] and the later sharpness-aware minimization literature. Broadly, the flat-minima story ran as follows: SGD with a small batch size produces gradient estimates with some noise ξ.</p>

        <p>That noise looks like a random walk and tends to leave sharp minima more often than flat ones, so SGD ends up biased toward flat minima, and that bias was meant to be why neural networks generalize. Edge-of-stability does not contradict the story, but it reshapes it. The learning rate itself caps how sharp a reachable minimum can be: anything with λ<sub>max</sub> > 2/η is unstable for GD, so the trajectory cannot stay there. Gradient descent finds flat minima not because it has noise, but because sharp minima are unstable fixed points under its own dynamics. The original explanation identified the phenomenon and pinned it to the wrong cause.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/cohen-fig3.png" alt="Figure 3 of Cohen et al. 2103.00065. Progressive sharpening isolated; sharpness rises before reaching the 2/eta threshold." width="1600" height="249">
          <figcaption>Figure 3 of Cohen et al. [<a href="https://arxiv.org/abs/2103.00065">1</a>]. Progressive sharpening isolated: Hessian sharpness rises monotonically during the early phase, long before the 2/η threshold is reached.</figcaption>
        </figure>

        <p>The reframing lives mostly outside the papers themselves. Off Convex has a few posts on implicit bias, trajectory analysis, and why the classical descent lemma is genuinely misleading for neural networks instead of merely approximate. Ben Recht's ArgMin is the complementary skeptical take for once you have left convex optimization theory and its two-sided curvature bounds μ I ⪯ ∇<sup>2</sup> L ⪯ LI. Clare Lyle's tutorial walks through the 2/η arithmetic and ties the phenomenon to warmup (rising η(t)) and catapult (loss spike then decay) in one frame.</p>

        <p>Andreyev and Beneventano (<a href="https://arxiv.org/abs/2412.20553">arxiv 2412.20553</a>) extended the story to the mini-batch setting Cohen did not analyze, introducing an "edge of stochastic stability" where the quantity that pins at 2/η is the expected directional curvature of mini-batch Hessians, not the top eigenvalue of the full-batch Hessian.</p>

        <p>A few previously folklore-level phenomena become intelligible from this picture. Warmup schedules η(t), which start small and increase η over several thousand iterations, let the network settle into the edge-of-stability regime before η reaches its final value; without warmup, the early transient at the full η would hit a too-sharp region and diverge. Decreasing-η schedules at the end of training raise 2/η, so the trajectory can fine-tune in sharper local minima inside the broader flat region already reached, which empirically pushes training loss down further.</p>
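        <p>The schedule-level reading is easy to state in code: at each step the reachable sharpness is capped at 2/η(t), so warmup lowers that ceiling gradually as η rises and end-of-training decay raises it again. A small sketch of a linear-warmup plus cosine-decay schedule and the cap it implies (the schedule shape is a common default, not something prescribed by these papers):</p>

        <pre><code># Sketch: a linear-warmup + cosine-decay schedule and the sharpness cap 2/eta(t)
# it implies at each step. The shape is a common default, not a prescription
# from the edge-of-stability papers.
import math

def lr_schedule(step, total_steps=10000, warmup_steps=1000, peak_lr=0.1, final_lr=0.001):
    if step >= warmup_steps:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
    return peak_lr * step / warmup_steps   # linear warmup

for step in (1, 500, 1000, 5000, 10000):
    eta = lr_schedule(step)
    print(step, round(eta, 5), "reachable sharpness capped at 2/eta =", round(2 / eta, 1))
</code></pre>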

        <figure>
          <img src="https://rifaki.me/notes/img/arora-fig1.png" alt="Figure 1 of Arora et al. 2205.09745. Smoothed-loss analysis of the edge-of-stability oscillations." width="1600" height="525">
          <figcaption>Figure 1 of Arora et al. [<a href="https://arxiv.org/abs/2205.09745">3</a>]. The smoothed-loss analysis makes explicit why the oscillations across the unstable direction do not destroy progress; averaging over a few steps yields an effective slow dynamics on a clipped loss.</figcaption>
        </figure>

        <p>Both schedules had been used empirically for years before any principled account existed. Lewkowycz et al.'s catapult mechanism [<a href="https://arxiv.org/abs/2003.02218">5</a>], where an initial loss spike sometimes precedes a better final solution, is the same dynamics at a larger scale: a large learning rate pushes the trajectory through a briefly very sharp region, the loss spikes, and the trajectory then settles into a different basin from the one it would have reached at a smaller step size.</p>

        <h2>The generalization gap is still open</h2>

        <p>Edge-of-stability gives a clean account of why SGD ends up in flat minima, but it says nothing about why flat minima generalize. Those are distinct questions, and the second one is still open.</p>

        <p>Dinh et al. [<a href="https://arxiv.org/abs/1703.04933">6</a>] showed that the Hessian-based notion of sharpness is not reparameterization invariant, so sharpness in that form cannot directly control generalization.</p>

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.*<br><br>With <a href="https://twitter.com/alex_damian_?ref_src=twsrc%5Etfw">@alex_damian_</a>, we introduce &quot;central flows&quot;: a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs. <a href="https://t.co/pvvfwoQcOy">pic.twitter.com/pvvfwoQcOy</a></p>&mdash; Jeremy Cohen (@deepcohen) <a href="https://twitter.com/deepcohen/status/1973191790602887544?ref_src=twsrc%5Etfw">October 1, 2025</a></blockquote>
          <figcaption>Jeremy Cohen, lead author of [<a href="https://arxiv.org/abs/2103.00065">1</a>], announcing the <a href="https://arxiv.org/abs/2410.24206">central-flows follow-up</a> that makes the 2021 observation a quantitative prediction tool.</figcaption>
        </figure>

        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://clarelyle.com/posts/2023-10-15-edge.html">deep dive into the edge of stability</a></li>
          <li><a href="https://arxiv.org/abs/1912.05671">J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Linear mode connectivity and the lottery ticket hypothesis. arxiv 1912.05671, 2019</a></li>
          <li><a href="https://jmlr.org/papers/v25/23-1285.html">P. M. Long and P. L. Bartlett. Sharpness-aware minimization and the edge of stability. Journal of Machine Learning Research, 2024</a></li>
          <li><a href="https://arxiv.org/abs/2410.24206">J. M. Cohen, A. Damian, A. Talwalkar, J. Z. Kolter, and J. D. Lee. Understanding optimization in deep learning with central flows. arxiv 2410.24206, 2024</a></li>
          <li><a href="https://direct.mit.edu/neco/article/9/1/1/6027/Flat-Minima">S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997</a></li>
        </ul>

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/2103.00065">1</a>] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arxiv 2103.00065, 2021.</li>
          <li>[<a href="https://arxiv.org/abs/2209.15594">2</a>] A. Damian, E. Nichani, and J. D. Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arxiv 2209.15594, 2022.</li>
          <li>[<a href="https://arxiv.org/abs/2205.09745">3</a>] S. Arora, Z. Li, and A. Panigrahi. Understanding gradient descent on edge of stability in deep learning. arxiv 2205.09745, 2022.</li>
          <li>[<a href="https://arxiv.org/abs/2204.01050">4</a>] K. Ahn, J. Zhang, and S. Sra. Understanding the unstable convergence of gradient descent. arxiv 2204.01050, 2022.</li>
          <li>[<a href="https://arxiv.org/abs/2003.02218">5</a>] A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: The catapult mechanism. arxiv 2003.02218, 2020.</li>
          <li>[<a href="https://arxiv.org/abs/1703.04933">6</a>] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. arxiv 1703.04933, 2017.</li>
          <li>[<a href="https://arxiv.org/abs/1609.04836">7</a>] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: generalization gap and sharp minima. arxiv 1609.04836, 2016.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/cohen-fig1.png"/>
    </item>
    <item>
      <title>Kaplan, Chinchilla, and broken laws</title>
      <link>https://rifaki.me/notes/neural-scaling-laws/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/neural-scaling-laws/</guid>
      <pubDate>Mon, 06 Jan 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Kaplan&#x27;s original fit, Chinchilla&#x27;s correction, Caballero&#x27;s broken-power-law alternative, and what predictive ability the labs actually use.</p><p><img src="https://rifaki.me/notes/img/kaplan-fig1.png" alt="Kaplan, Chinchilla, and broken laws"/></p>]]></description>
      <content:encoded><![CDATA[<p>Kaplan et al. [<a href="https://arxiv.org/abs/2001.08361">1</a>] set the baseline picture: test loss falls as a power law in each of model size N, dataset size D, and compute C, e.g. L(C) = (C/C<sub>0</sub>)<sup>-α<sub>C</sub></sup>, with clean exponents α that hold over many orders of magnitude. They also claimed that for a given compute budget, there is an optimal allocation between N and D, and specifically that N should grow faster than D as C grows.</p>

 <p>That paper shaped how labs designed pre-training experiments for the next two years, and the eventual "Chinchilla" effort grew out of trying to reproduce and extend its recommendations. It also turned out to be wrong about the optimal D/N ratio, which is the part the field then had to revise.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/kaplan-fig1.png" alt="Figure 1 of arxiv 2001.08361. Test loss plotted against compute on log-log axes, showing a power-law fit over many orders of magnitude." width="1600" height="493">
 <figcaption>Figure 1 of Kaplan et al. [<a href="https://arxiv.org/abs/2001.08361">1</a>]. Log-log plot of test loss against compute. The power-law fit is tight over seven orders of magnitude, which is what made the result so persuasive.</figcaption>
 </figure>

 <h2>Chinchilla</h2>

 <p>Hoffmann et al. [<a href="https://arxiv.org/abs/2203.15556">2</a>] re-asked Kaplan's question with a larger, better-controlled experiment: over 400 language models from 70 million to over 16 billion parameters, trained on 5 to 500 billion tokens, across a sweep of compute budgets. Fitting a joint regression over model size and tokens gave them per-budget optima N<sup>⋆</sup>(C), D<sup>⋆</sup>(C) that were not where Kaplan put them. Their headline rule is that for each doubling of model size, dataset size should also double, which lands at roughly 20 tokens per parameter at the compute-optimal point.</p>

 <p>The shift in conclusions came from a methodological gap. Hoffmann et al. argue that Kaplan's runs were too short: they ended well before each model had seen enough tokens to bottom out its loss. Kaplan's reported optima were therefore extrapolations from incomplete training curves, while Hoffmann's were drawn from runs that continued long enough to actually locate the per-size minimum. The two sets of optima ended up substantially different.</p>
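 <p>The bookkeeping behind "compute-optimal" is short enough to write down. Assuming the standard C ≈ 6ND approximation for training FLOPs and taking the roughly-20-tokens-per-parameter ratio at face value (one point estimate, as discussed below, not a universal constant), the compute-optimal configuration at a budget C is two lines of algebra:</p>

 <pre><code># Sketch: the arithmetic behind the Chinchilla headline rule. With the standard
# C ~ 6*N*D training-FLOPs approximation and a fixed tokens-per-parameter ratio,
#   D = r * N  and  C = 6 * N * D   imply   N* = sqrt(C / (6 * r)),  D* = r * N*.
# r = 20 is the Chinchilla point estimate, not a universal constant.

def chinchilla_optimal(C, tokens_per_param=20.0):
    N_star = (C / (6.0 * tokens_per_param)) ** 0.5
    D_star = tokens_per_param * N_star
    return N_star, D_star

for C in (1e21, 1e23, 5.76e23, 1e25):
    N, D = chinchilla_optimal(C)
    print(f"C={C:.2e} FLOPs  ->  N*={N:.2e} params, D*={D:.2e} tokens")
# At the Gopher budget of ~5.76e23 FLOPs this lands near 70B parameters and
# 1.4T tokens, i.e. the Chinchilla configuration.
</code></pre>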

 <h2>The effects of correcting Kaplan</h2>

 <p>GPT-3 and similarly Kaplan-trained models were trained on substantially too little data for their size. Chinchilla shows that for a fixed compute budget, a smaller model trained on more tokens reaches a lower loss than a bigger model trained on fewer. Concretely, Chinchilla-70B is roughly 2.5x smaller than GPT-3 (175 billion parameters) and outperforms it on nearly every benchmark in the original paper. Many factors contribute to that gap, but the dominant one is that Chinchilla-70B is configured at the Chinchilla-optimal point for its compute budget while GPT-3 sits at the Kaplan-optimal point. Subsequent scaling-law work that calibrates against the Chinchilla target accordingly emphasizes data scaling, longer training, and less aggressive model-size growth.</p>

 <p>Llama 2 was trained on 2 trillion tokens, well past the Chinchilla-optimal token count for its size and into a regime where the relevant trade-off is no longer training compute but inference cost. Once it became clear that smaller models are much cheaper to run at inference, scaling-law work shifted its target from training-compute optimality to deployment optimality.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/hoffmann-fig3.png" alt="Figure 3 of Hoffmann et al. 2022 (Chinchilla). Left: training loss versus parameter count for fixed FLOP budgets from 6e18 up to 3e21, each forming a U-shape with a clear minimum. Middle: optimal parameters versus FLOPs, extrapolating to ~63B parameters at ~1e23 FLOPs. Right: optimal training tokens versus FLOPs, extrapolating to ~1.4T tokens.">
 <figcaption>Figure 3 of Hoffmann et al. [<a href="https://arxiv.org/abs/2203.15556">2</a>]. The IsoFLOPs decomposition: each line is a fixed compute budget, and the locus of minima defines the Chinchilla parameters/tokens scaling law.</figcaption>
 </figure>

 <h2>Broken laws</h2>

 <p>Caballero et al. [<a href="https://arxiv.org/abs/2210.14891">3</a>] argue that the single-power-law narrative is wrong. Loss versus compute, in their fit, is a smoothly broken power law: continuous everywhere, with several breaks where the slope changes. Kaplan and Chinchilla's single-exponent fits are then averaging across regimes with different exponents and producing a number that matches none of them.</p>
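 <p>For concreteness, a single-break version of the functional form, as I read it from Caballero et al.: on log-log axes the slope transitions smoothly from one exponent to another around a break scale, with an extra parameter controlling how sharp the transition is. The constants below are made up for illustration.</p>

 <pre><code># Sketch: a single-break smoothly broken power law, my reading of the BNSL
# functional form in Caballero et al. On log-log axes the slope moves smoothly
# from -c0 before the break at scale d1 to -(c0 + c1) after it, with f1
# controlling how sharp the transition is. Constants are made up.
import numpy as np

def broken_power_law(x, a, b, c0, c1, d1, f1):
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

x = np.logspace(0, 8, 9)
y = broken_power_law(x, a=0.1, b=5.0, c0=0.05, c1=0.3, d1=1e4, f1=0.5)
slopes = np.diff(np.log(y - 0.1)) / np.diff(np.log(x))   # local log-log slope of the scaling term
print(np.round(slopes, 3))   # drifts from about -0.05 toward about -0.35 past the break
</code></pre>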

 <p>Whether the broken-laws view is useful depends on what you want the fit for. For coarse extrapolation across orders of magnitude in compute, a single power law still works fine. For predicting at what scale a particular capability shows up, it does not, and the Caballero breakpoints line up with the "emergent" capabilities Wei et al. [<a href="https://arxiv.org/abs/2206.07682">4</a>] documented (abilities that effectively switch on once loss crosses a threshold, 𝟙[L < L<sub>threshold</sub>]), which Schaeffer et al. [<a href="https://arxiv.org/abs/2304.15004">5</a>] then argued are largely an artifact of how the underlying metric is discretized.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/caballero-fig1.png" alt="Figure 1 of Caballero et al. 2022. Annotated example of a Broken Neural Scaling Law (BNSL) functional form, marking three break points and four slope regimes between them as the performance metric is plotted against the quantity being scaled (log-log)." style="max-width: 75%;">
 <figcaption>Figure 1 of Caballero et al. [<a href="https://arxiv.org/abs/2210.14891">3</a>]. The BNSL form is a piecewise power law with explicit breaks; a single power law averages across these regimes and misses inflections the data actually shows.</figcaption>
 </figure>

 <figure>
 <img src="https://rifaki.me/notes/img/caballero-fig2.png" alt="Figure 2 of Caballero et al. 2022. Two real-task BNSL fits: top panel ImageNet 25-shot test error versus training-dataset size; bottom panel TriviaQA few-shot test accuracy versus number of model parameters. Red curve is the BNSL fit; green points extend the fit beyond the training range." style="max-width: 75%;">
 <figcaption>Figure 2 of Caballero et al. [<a href="https://arxiv.org/abs/2210.14891">3</a>]. Two real-task examples: ImageNet error versus dataset size (top) and TriviaQA accuracy versus parameter count (bottom). The BNSL form tracks the data through visible breaks where a single power law would not.</figcaption>
 </figure>

 <p>Gwern's Scaling Hypotheses essay and the Revisited follow-up are the strongest non-specialist treatments of the underlying premise that capabilities come out of scale. Jacob Steinhardt's Bounded Regret is the blog I send people to when they want a careful read of what scaling laws actually let you predict. Beyond Chinchilla-Optimal is the most direct argument that the 20-tokens-per-parameter ratio is not a universal constant.</p>

 <p>Kaplan, Chinchilla, and Caballero all show that with a reasonably behaved architecture, a reasonable data mixture, and enough compute to get past the early warmup phase of pre-training, you can extrapolate loss from small runs to larger ones with usable accuracy. Error bars widen as the extrapolation gets more aggressive but not catastrophically. That predictive ability is what labs use to decide whether an expensive training run is worth doing.</p>

 <p>None of the papers here support any particular exponent or ratio as universal. Exponents depend on architecture, data mixture, and optimizer. The 20-tokens-per-parameter Chinchilla figure is one point estimate for one such combination, not a law of physics.</p>

 <p>Almost the entire scaling-law literature addresses test loss on the training distribution, and nothing else. It provides zero guidance on what data mixture to choose, what architecture will surpass dense transformers, what capabilities will appear at what scale, whether the resulting model will be safe, or how optimization will interact with the learning-rate schedule η(t). All of those are properties of individual training runs and live outside the fitted relationship. Treating scaling laws as though they answered them is a common failure mode, and much of the broken-laws literature is about that failure. Scaling laws tell you how much loss you will incur. Almost everything interesting about a model is left undetermined by that single number.</p>

 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://simons.berkeley.edu/talks/sasha-rush-cornell-university-hugging-face-2023-08-15">Scaling Data-Constrained Language Models</a></li>
 <li><a href="https://simons.berkeley.edu/talks/when-scale-enough">When is Scale Enough?</a></li>
 <li><a href="https://simons.berkeley.edu/talks/yasaman-bahri-google-deepmind-2023-08-15">Simons talk on the theoretical side of scaling</a></li>
 <li><a href="https://arxiv.org/abs/2010.14701">T. Henighan et al. Scaling laws for autoregressive generative modeling. arxiv 2010.14701, 2021</a></li>
 <li><a href="https://arxiv.org/abs/2406.12907">T. Pearce and J. Song. Reconciling Kaplan and Chinchilla scaling laws. arxiv 2406.12907, 2024</a></li>
 <li><a href="https://proceedings.mlr.press/v235/sardana24a.html">N. Sardana, J. Portes, S. Doubov, and J. Frankle. Beyond Chinchilla-Optimal: Accounting for inference in language model scaling laws. ICML, 2024</a></li>
 <li><a href="https://arxiv.org/abs/2602.07488">Deriving neural scaling laws from the statistics of natural language. arxiv 2602.07488, 2026</a></li>
 </ul>

<h2>References</h2>

 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/2001.08361">1</a>] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arxiv 2001.08361, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/2203.15556">2</a>] J. Hoffmann et al. Training compute-optimal large language models. arxiv 2203.15556, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/2210.14891">3</a>] E. Caballero, K. Gupta, I. Rish, and D. Krueger. Broken neural scaling laws. arxiv 2210.14891, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/2206.07682">4</a>] J. Wei et al. Emergent abilities of large language models. arxiv 2206.07682, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/2304.15004">5</a>] R. Schaeffer, B. Miranda, and S. Koyejo. Are emergent abilities of large language models a mirage? arxiv 2304.15004, 2023.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/kaplan-fig1.png"/>
    </item>
    <item>
      <title>Lottery ticket hypothesis</title>
      <link>https://rifaki.me/notes/lottery-tickets/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/lottery-tickets/</guid>
      <pubDate>Sat, 16 Nov 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Frankle and Carbin&#x27;s original procedure, Liu&#x27;s rebuttal, the rewinding fix, and what holds up after the Frankle-to-Liu exchange.</p><p><img src="https://rifaki.me/notes/img/frankle-fig3.png" alt="Lottery ticket hypothesis"/></p>]]></description>
      <content:encoded><![CDATA[<p>The lottery-ticket hypothesis of Frankle and Carbin [<a href="https://arxiv.org/abs/1803.03635">1</a>] proposes that a randomly initialized dense network already contains a much sparser subnetwork (the "winning ticket") which, trained in isolation from its original initialization, matches the dense network's accuracy. If true, this is a strong claim about how deep networks represent functions: optimization would be selecting structure that was already present at initialization, not creating new structure. The pruning literature had circled this idea before; Frankle and Carbin's contribution was an actionable procedure for finding these subnetworks.</p>

 <h2>In practice, what Frankle and Carbin actually did</h2>

 <p>Frankle and Carbin defined a simple process to find these "winning" tickets. They called it iterative magnitude pruning. First train the full model until it converges. Prune the lowest-magnitude weights, masking them to zero for the rest of the procedure. Reset the remaining weights to their original initialization values. Train again. Continue this process several times, removing a portion of the surviving weights each time, until you cannot remove any more without losing performance relative to the full model. The resulting subnetwork is considered to be the "winning" ticket for that particular initial condition and dataset. The "rewinding" variant, resetting to weights from a few steps after initialization rather than to initialization itself, came later in Frankle's follow-up [<a href="https://arxiv.org/abs/1903.01611">8</a>].</p>
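 <p>The procedure is compact enough to write out. A toy sketch of a few IMP rounds (my own illustrative setup, not the paper's experiments); the rewinding variant discussed below corresponds to saving the reset point a few steps into training rather than at step zero.</p>

 <pre><code># Sketch: a few rounds of iterative magnitude pruning (IMP) on a toy problem.
# Each round trains, prunes 20% of the surviving weights globally by magnitude,
# and resets the survivors to their initial values (rewinding would reset to an
# early-training checkpoint instead). Illustrative, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, :5].sum(dim=1) > 0).long()                 # synthetic binary labels

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
theta0 = {k: p.detach().clone() for k, p in model.named_parameters()}   # saved init
weights = {k: p for k, p in model.named_parameters() if p.dim() > 1}    # prune weight matrices only
masks = {k: torch.ones_like(p) for k, p in weights.items()}

def train(steps=300):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
        with torch.no_grad():                        # keep pruned weights at zero
            for k, m in masks.items():
                weights[k].mul_(m)
    return loss.item()

for r in range(5):
    final_loss = train()
    with torch.no_grad():
        surviving = torch.cat([weights[k][m.bool()].abs().flatten() for k, m in masks.items()])
        threshold = torch.quantile(surviving, 0.2)   # prune bottom 20% of survivors
        for k, m in masks.items():
            m *= (weights[k].abs() > threshold).float()
        model.load_state_dict(theta0)                # reset survivors to initialization
        for k, m in masks.items():
            weights[k].mul_(m)
    density = float(sum(m.sum() for m in masks.values())) / sum(m.numel() for m in masks.values())
    print(f"round {r}: final loss {final_loss:.3f}, weights remaining {density:.1%}")
</code></pre>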

 <p>On MNIST and small CIFAR architectures, the results are clean. Sparse subnetworks with only a few percent of the original weights remaining match the dense network's accuracy. The winning tickets are also tied to a specific initialization: a ticket found from one random init does not transfer to a different one, which is why the procedure is read as discovering structure already present at initialization rather than creating it during training.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/frankle-fig3.png" alt="Figure 3 of Frankle and Carbin 2019. Test accuracy versus training iterations on Lenet-MNIST for lottery tickets at sparsity levels 100%, 51.3%, 21.1%, 7.0%, 3.6%, 1.9% remaining weights, plus reinitialized 51.3% and 21.1% baselines. Winning tickets reach the dense baseline; randomly-reinitialized counterparts plateau lower.">
 <figcaption>Figure 3 of Frankle and Carbin [<a href="https://arxiv.org/abs/1803.03635">1</a>]. Lottery-ticket subnetworks recover the dense baseline down to a few percent of the original weights; the same masks with random reinitialization do not.</figcaption>
 </figure>

 <h2>The rebuttal of Liu et al. [<a href="https://arxiv.org/abs/1810.05270">2</a>]</h2>

 <p>Liu et al. ran the same procedure at larger scale and found that the gap between a Frankle winning ticket and a fresh random initialization of the same architecture closes once the network is big enough. At ImageNet scale, the winning-ticket effect basically disappears. Read narrowly this refutes Frankle and Carbin's strongest claim, but read alongside the original paper it is more usefully a scaling result: the winning-ticket structure is real on small networks and dissolves as scale grows.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/liu-fig2.png" alt="Figure 2 of Liu et al. 2019. Schematic distinguishing predefined pruning (uniform x% per layer) from automatic pruning (per-layer percentages a%, b%, c%, d% chosen by the algorithm) on a 4-layer model." style="max-width: 60%;">
 <figcaption>Figure 2 of Liu et al. [<a href="https://arxiv.org/abs/1810.05270">2</a>]. The two pruning regimes the paper distinguishes: predefined per-layer ratios versus automatically-discovered per-layer ratios. The "rethinking" results separate which regime the lottery-ticket conclusion survives in.</figcaption>
 </figure>

 <h2>Rewinding fixes</h2>

 <p>Frankle's response to Liu was a follow-up paper [<a href="https://arxiv.org/abs/1903.01611">8</a>] introducing a small change that brought the result back at scale: instead of resetting the surviving weights to their original initialization, reset them to the values they had a short distance into training (a few steps of w<sub>t+1</sub> = w<sub>t</sub> - η∇L). The rewind distance is a few hundred iterations on small models and a few epochs on ImageNet (around epoch 4 of 90 for ResNet-50, about 20k iterations). With rewinding, IMP works at ImageNet scale; rewind to a point much earlier than that and the matching subnetwork stops appearing. Renda et al. [<a href="https://arxiv.org/abs/2003.02389">3</a>] tightened the recipe by comparing rewinding to plain fine-tuning across architectures and showing that rewinding only the learning-rate schedule matches or beats fine-tuning at fixed sparsity.</p>

 <h2>Softening the claims</h2>

 <p>The revised hypothesis is weaker than the original. The earlier claim was that the winning ticket exists at random initialization. The follow-up softens this: a short window of training (a few hundred iterations on small models, a few epochs on ImageNet) is enough to locate one. By the end of that window, something about the loss landscape L(θ) has been fixed that determines the rest of training, and from that point on a sparse subnetwork pulled out of the surviving weights matches the dense network.</p>

 <h2>Why rewinding works</h2>

 <p>Frankle's companion paper offers an explanation. Fork two runs from the weights at initialization, letting them differ only in SGD noise (data order and augmentation), and they end up in different basins ℬ, so linear interpolation between the two endpoints crosses a high-loss barrier. Fork them instead from a checkpoint taken a small fraction of the way into training (about 1000-2000 iterations on CIFAR-scale networks, roughly the first 3% of the schedule) and they end up in the same basin, with linear interpolation between them staying at low loss throughout. The crossover point is where rewinding starts to work.</p>
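 <p>The operational test behind that claim is linear interpolation: evaluate the loss at θ(α) = (1 - α)θ<sub>A</sub> + αθ<sub>B</sub> for α between 0 and 1 and look for a barrier above the endpoints. A minimal sketch, with the model constructor, the two trained state dicts, and the evaluation function left as placeholders for the reader's own setup:</p>

 <pre><code># Sketch: the linear-interpolation test for "same basin". Given two trained
# parameter sets for the same architecture, evaluate the loss along
# theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b and report the barrier
# above the endpoints. `make_model`, the state dicts, and `eval_loss` are
# placeholders for the reader's own training setup.
import torch

def interpolation_barrier(make_model, theta_a, theta_b, eval_loss, num_points=11):
    losses = []
    for alpha in torch.linspace(0.0, 1.0, num_points):
        model = make_model()
        mixed = {k: (1.0 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}
        model.load_state_dict(mixed)
        model.eval()
        with torch.no_grad():
            losses.append(float(eval_loss(model)))
    return max(losses) - max(losses[0], losses[-1]), losses   # barrier height, full profile

# Usage (placeholders): runs forked before the stability point typically show a
# large barrier; runs forked after it interpolate at low loss throughout.
#   barrier, profile = interpolation_barrier(lambda: resnet20(), sd_run_a, sd_run_b, eval_fn)
</code></pre>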

 <p>The simplest framing of the lottery-ticket findings, given what is now understood about optimization and loss landscapes, is this: once a training run commits to a specific basin ℬ, there is a sparse sub-network within that basin that matches the dense network's performance. Frankle and Carbin's strongest claims fail at larger scales, but their weaker claims have so far held up under every replication that has tested them. This places lottery-ticket results in close alignment with the mode-connectivity literature [<a href="https://arxiv.org/abs/1802.10026">6</a>], particularly its linear-mode-connectivity refinement [<a href="https://arxiv.org/abs/1912.05671">4</a>] and the later permutation-based alignment work [<a href="https://arxiv.org/abs/2209.04836">7</a>].</p>

 <figure>
 <img src="https://rifaki.me/notes/img/frankle-lmc-fig3.png" alt="Figure 3 of Frankle, Dziugaite, Roy, Carbin 2020. Linear-interpolation instability versus fork step k across LeNet (MNIST), ResNet-20 (CIFAR-10), VGG-16 (CIFAR-10), ResNet-50 (ImageNet), Inception-v3 (ImageNet). Instability collapses once k passes a small threshold.">
 <figcaption>Figure 3 of Frankle et al. [<a href="https://arxiv.org/abs/1912.05671">4</a>]. Pairs of runs forked from a shared pre-rewinding checkpoint stay linearly connected; pairs forked from initialization do not. Linear mode connectivity is the operational test for "same effective basin".</figcaption>
 </figure>

 <p>Davis Blalock's 2020 MLSys retrospective on pruning, together with the ShrinkBench benchmark he built, is the survey I keep returning to on what holds up after the Frankle-to-Liu exchange. Blalock separates the stronger and weaker forms of the hypothesis and argues that "checkpoint pruning" is a better name than "lottery ticket pruning" for the late-rewinding procedures that actually work at scale. Google's <a href="https://research.google/pubs/the-state-of-sparsity-in-deep-neural-networks/">State of Sparsity</a> and <a href="https://research.google/pubs/rigging-the-lottery-making-all-tickets-winners/">Rigging the Lottery</a> are the other two retrospectives I keep going back to.</p>

 <p>The piece I keep coming back to is that mainstream theory has not picked up the rewinding point as an object in its own right. If someone could pin down precisely when basin-membership becomes determined, that would identify a structural feature of the loss landscape current frameworks do not explain. Lewkowycz's catapult-phase work [<a href="https://arxiv.org/abs/2003.02218">5</a>] and the edge-of-stability literature look like they are circling the same phenomenon from different directions, and the lottery-ticket case is the most direct entry point for tying those threads together.</p>

 <figure class="tweet-embed">
 <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">How do the lottery ticket hypothesis and the loss landscape relate? Winning lottery tickets always find the same, linearly-connected optimum. Check out our (@KDziugaite, <a href="https://twitter.com/roydanroy?ref_src=twsrc%5Etfw">@roydanroy</a>, <a href="https://twitter.com/mcarbin?ref_src=twsrc%5Etfw">@mcarbin</a>) poster at the SEDL workshop (West 121) and our new paper <a href="https://t.co/V9yKTSrNnh">https://t.co/V9yKTSrNnh</a> <a href="https://t.co/uPwQKifo1W">pic.twitter.com/uPwQKifo1W</a></p>&mdash; Jonathan Frankle (@jefrankle) <a href="https://twitter.com/jefrankle/status/1205902384112848899?ref_src=twsrc%5Etfw">December 14, 2019</a></blockquote>
 <figcaption>Jonathan Frankle in 2019 pointing at the bridge between the LTH and mode connectivity. The mode-connectivity reading is the softer form of the hypothesis that holds at scale.</figcaption>
 </figure>

 <h2>Further reading</h2>
        <ul class="further">
 <li><a href="https://research.google/pubs/the-state-of-sparsity-in-deep-neural-networks/">The State of Sparsity in DNNs</a></li>
 <li><a href="https://research.google/pubs/rigging-the-lottery-making-all-tickets-winners/">Rigging the Lottery</a></li>
          <li><a href="https://arxiv.org/abs/2009.08576">J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Pruning neural networks at initialization: Why are we missing the mark? arxiv 2009.08576, 2020</a></li>
          <li><a href="https://openreview.net/forum?id=Uzb45nolTb">T. Kumar, K. Luo, and M. Sellke. No free prune: information-theoretic barriers to pruning at initialization. ICML, 2024</a></li>
        </ul>

<h2>References</h2>

 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1803.03635">1</a>] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arxiv 1803.03635, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1810.05270">2</a>] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. arxiv 1810.05270, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/2003.02389">3</a>] A. Renda, J. Frankle, and M. Carbin. Comparing rewinding and fine-tuning in neural network pruning. arxiv 2003.02389, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/1912.05671">4</a>] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Linear mode connectivity and the lottery ticket hypothesis. arxiv 1912.05671, 2019.</li>
 <li>[<a href="https://arxiv.org/abs/2003.02218">5</a>] A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: The catapult mechanism. arxiv 2003.02218, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/1802.10026">6</a>] T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, and A. G. Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. arxiv 1802.10026, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/2209.04836">7</a>] S. K. Ainsworth, J. Hayase, and S. Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arxiv 2209.04836, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/1903.01611">8</a>] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Stabilizing the lottery ticket hypothesis. arxiv 1903.01611, 2019.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/frankle-fig3.png"/>
    </item>
    <item>
      <title>Double descent</title>
      <link>https://rifaki.me/notes/double-descent/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/double-descent/</guid>
      <pubDate>Sun, 25 Aug 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Belkin&#x27;s bias-variance picture, Nakkiran&#x27;s three axes, the label-noise caveat, and what the phenomenon does and does not say at scale.</p><p><img src="https://rifaki.me/notes/img/belkin-fig1.png" alt="Double descent"/></p>]]></description>
      <content:encoded><![CDATA[<p>The classical U becomes a W: Belkin et al. [<a href="https://arxiv.org/abs/1812.11118">1</a>] argued that test risk, plotted against model capacity, descends a second time in the overparameterized (p > n) regime, and that second descent often goes below the first minimum.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/belkin-fig1.png" alt="Figure 1 of Belkin et al. 2018. Schematic of test risk as a function of model capacity: the classical U-shape to the left of the interpolation threshold and a second descending branch to its right.">
          <figcaption>Figure 1 of Belkin et al. [<a href="https://arxiv.org/abs/1812.11118">1</a>]. Classical bias-variance to the left of the interpolation threshold; a second descent in the overparameterized regime on the right.</figcaption>
        </figure>

        <h2>Nakkiran et al.</h2>

        <p>Nakkiran et al. [<a href="https://arxiv.org/abs/1912.02292">2</a>] made the picture concrete by showing that the W-shape appears along three different axes: model size p, training time, and dataset size n. Model-wise double descent varies the width k of a ResNet f<sub>θ</sub>; epoch-wise double descent varies the number of training steps; sample-wise double descent varies dataset size with everything else held fixed. The shape is the same each time: a test-error peak near the interpolation threshold, then a descent once you push past it.</p>

        <p>Epoch-wise is the most surprising of the three. Within one run, test error gets worse before it gets better. The worst test error sits roughly at the iteration where training loss first hits zero; train past it and the test error drops again.</p>

        <h2>Label noise</h2>

        <p>The sharpest versions of the double descent peak in these papers come with label noise. Nakkiran's headline plots use ten to twenty percent corrupted labels. Without label noise, the peak is much weaker and sometimes absent. Label noise inflates the variance contribution of the model at the interpolation threshold because the model is being asked to memorize random labels at exactly the capacity where memorization is possible but not easy.</p>

        <p>Past the threshold, extra capacity absorbs the noise into higher-frequency components without disturbing the underlying signal. This is the hinge that connects the toy phenomenon to actual deep learning. Belkin's linear-regression result holds at all noise levels but the gap is small without noise; Nakkiran's dramatic curves require label noise to be visible. At modern language-model scale, with clean labels and large models, test loss is close to monotone in parameter count and scaling-law papers fit clean L ∝ C<sup>-α</sup> decay with no visible interpolation peak. The effect is real and the classical bias-variance picture is wrong in the overparameterized regime, but the large peak that gives double descent its name is specific to the label-noise case.</p>

        <p>Boaz Barak's Windows on Theory post makes a version of this argument: the interesting part of double descent is to the right of the peak, not the peak itself. OpenAI's Deep Double Descent post took Nakkiran to a much wider audience and posed the sharper question: given this effect, what kind of complexity control (if any) actually predicts generalization? Google Research's "A new lens on understanding generalization in deep learning" recast double descent in terms of an effective-capacity measure that tracks the empirical curves better than parameter count does. Misha Belkin's Simons Institute talks are the video account I keep sending people who want to see how the picture has changed since 2018.</p>

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">So, &quot;double descent&quot; is happening b/c DF isn&#39;t really the right quantity for the the x-axis: like, the fact that we are choosing the minimum norm least squares fit actually means that the spline with 36 DF is **less** flexible than the spline with 20 DF. <br><br>Crazy, huh?<br><br>19/</p>&mdash; Daniela Witten (@daniela_witten) <a href="https://twitter.com/daniela_witten/status/1292293122752262145?ref_src=twsrc%5Etfw">August 9, 2020</a></blockquote>
        </figure>

        <figure>
          <img src="https://rifaki.me/notes/img/nakkiran-fig1.png" alt="Figure 1 of Nakkiran et al. 2019. Test error and train error versus ResNet18 width parameter under varying label-noise levels (0%, 5%, 10%, 15%, 20%). Test error peaks near the interpolation threshold and decreases again as width grows.">
          <figcaption>Figure 1 of Nakkiran et al. [<a href="https://arxiv.org/abs/1912.02292">2</a>]. Model-wise double descent grows visibly with label-noise level: a peak at the interpolation threshold, then a second descent in the overparameterized regime.</figcaption>
        </figure>

        <p>I'm closer to Barak's reading than to what filtered down to practitioner intros.</p>

        <p>After the label-noise caveat, two results from this line of work hold up. Classical capacity measures like VC dimension do not extend cleanly to the overparameterized regime and cannot be expected to predict generalization there. And overparameterized networks with astronomically large VC dimensions can sit well below smaller networks in test error on the same task, which suggests the selection is happening elsewhere: from a vast pool of interpolating solutions, the optimizer and the loss landscape pick out a small subset that generalizes.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/nakkiran-fig4.png" alt="Figure 4 of Nakkiran et al. 2019. Left panel labels the classical (under-parameterized) and modern (over-parameterized) regimes around the interpolation threshold; right panel overlays test error across many epochs (color = epochs 1 to 1000) versus ResNet18 width, with an optimal early-stopping envelope.">
          <figcaption>Figure 4 of Nakkiran et al. [<a href="https://arxiv.org/abs/1912.02292">2</a>]. Pulling apart the canonical double-descent shape: the peak sits at the interpolation threshold, and the epoch-coloured family on the right makes epoch-wise double descent visible alongside model-wise.</figcaption>
        </figure>

        <p>What practitioners did with this was simpler than what the theory suggested: once you are in the overparameterized regime and you have compute to spend, bigger is usually better. The second descent has no obvious endpoint, which is why Kaplan-style scaling laws can fit clean power-law decay in compute — they are sitting entirely on the right-hand, log-log-linear side of the W-curve. The dramatic 2019 reading of double descent was that the bias-variance tradeoff is fiction and overfitting no longer exists. The second half of that is trivially untrue (overfitting is easy to produce in any small-data regime). The first half is more delicate: above the interpolation threshold, with implicit min-norm regularization (θ<sup>⋆</sup> = argmin<sub>f<sub>θ</sub>(X)=y</sub> ‖θ‖), larger models tend to generalize better rather than worse. That is a statement about a regime, not a law.</p>

        <figure>
          <div style="padding: 20px 16px; background: #E5DFCB; border: 1px solid #eee; border-radius: 4px; text-align:center;">
            <img src="https://rifaki.me/notes/img/math/0b374448521c2c5d.svg" alt="$$R(p) = \sigma^2 \cdot \frac{n}{|p-n-1|} + \|\beta\|^2 \cdot \max\!\left(0,\, 1 - \frac{n}{p}\right)$$" class="math-display" width="434" height="49"/>
          </div>
          <figcaption>
            Expected test risk of the min-norm ridgeless interpolant with n samples and p features under isotropic covariates, from Hastie et al. [<a href="https://arxiv.org/abs/1903.08560">3</a>]. The first term diverges at p=n and is the interpolation peak. The second shrinks as p → ∞ and is the second descent. Double descent is not a deep-learning phenomenon in any strict sense since it falls out of the min-norm solution to an overparameterized least squares problem and holds only because the optimizer is selecting a specific well-behaved interpolant out of the many that fit the data.
          </figcaption>
        </figure>
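        <p>The same curve is cheap to reproduce directly in that setting. A small simulation of min-norm (pseudoinverse) least squares with isotropic Gaussian features, sweeping the number of features p past n, with the noise level σ exposed so the label-noise dependence of the peak is visible (my own sketch, not from the papers above):</p>

        <pre><code># Sketch: double descent in min-norm (ridgeless) least squares. Test risk peaks
# near p = n and descends again as p grows; the height of the peak is driven by
# the noise level sigma, mirroring the label-noise dependence discussed above.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d_true, sigma = 50, 2000, 200, 0.5
beta = rng.normal(size=d_true) / np.sqrt(d_true)      # true signal spread over all 200 features

def risk_at(p, trials=20):
    errs = []
    for _ in range(trials):
        Z = rng.normal(size=(n + n_test, d_true))
        labels = Z @ beta + sigma * rng.normal(size=n + n_test)
        X_tr, y_tr = Z[:n, :p], labels[:n]            # the model only sees the first p features
        X_te, y_te = Z[n:, :p], labels[n:]
        w = np.linalg.pinv(X_tr) @ y_tr               # min-L2-norm fit (OLS when p is at most n)
        errs.append(np.mean((X_te @ w - y_te) ** 2))
    return float(np.mean(errs))

for p in (5, 20, 40, 48, 50, 52, 60, 100, 200):
    print(f"p={p:4d}   test risk {risk_at(p):8.2f}")
</code></pre>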

        <p>Another claim that does not survive the empirical record is that the peak is always exactly at the interpolation threshold. In practice, the exact location of the peak in Nakkiran's ResNet experiments depends on the effective number of parameters under whatever implicit regularization is in use, not on the total parameter count. The peak does not occur precisely at the width at which training error (L<sub>train</sub>) first reaches zero; it sits slightly past that point, where the network can memorize noisy labels without disrupting the underlying signal.</p>
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/2001.08361">J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arxiv 2001.08361, 2020</a></li>
          <li><a href="https://arxiv.org/abs/2003.01054">S. d'Ascoli, M. Refinetti, G. Biroli, and F. Krzakala. Double trouble in double descent: Bias and variance(s) in the lazy regime. arxiv 2003.01054, 2020</a></li>
          <li><a href="https://arxiv.org/abs/2003.01897">P. Nakkiran, P. Venkat, S. Kakade, and T. Ma. Optimal regularization can mitigate double descent. arxiv 2003.01897, 2020</a></li>
          <li><a href="https://arxiv.org/abs/2203.03466">G. Yang, E. J. Hu, I. Babuschkin, S. Sidor, X. Liu, D. Farhi, N. Ryder, J. Pachocki, W. Chen, and J. Gao. Tensor Programs V: tuning large neural networks via zero-shot hyperparameter transfer. arxiv 2203.03466, 2022</a></li>
          <li><a href="https://research.google/pubs/understanding-double-descent-requires-a-fine-grained-bias-variance-decomposition/">B. Adlam and J. Pennington. Understanding double descent requires a fine-grained bias-variance decomposition. NeurIPS, 2020</a></li>
          <li><a href="https://iclr-blogposts.github.io/2024/blog/double-descent-demystified/">R. Schaeffer et al. Double descent demystified. ICLR Blogposts, 2024</a></li>
        </ul>

        

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/1812.11118">1</a>] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. arxiv 1812.11118, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1912.02292">2</a>] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double descent: Where bigger models and more data hurt. arxiv 1912.02292, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/1903.08560">3</a>] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arxiv 1903.08560, 2019.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/belkin-fig1.png"/>
    </item>
    <item>
      <title>Reading Tishby&#x27;s information bottleneck</title>
      <link>https://rifaki.me/notes/information-bottleneck/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/information-bottleneck/</guid>
      <pubDate>Tue, 09 Jul 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>The original Tishby claim, Saxe&#x27;s reply on activation choice, Goldfeld&#x27;s estimator critique, and what survives.</p><p><img src="https://rifaki.me/notes/img/tishby-fig2.png" alt="Reading Tishby&#x27;s information bottleneck"/></p>]]></description>
      <content:encoded><![CDATA[<p>Tishby and Zaslavsky's 2015 paper was, until fairly recently, one of the most-cited papers in deep-learning theory. They described training as proceeding in two distinct phases. In the first, the "fitting" phase, the mutual information between a hidden representation T = f<sub>θ</sub>(X) and the input X rises. In the second, the "compression" phase, the network discards the parts of the input information that are not useful for the prediction.</p>

 <p>The two strongest objections to that picture come from Saxe et al. [<a href="https://openreview.net/forum?id=ry_WPG-A-">2</a>], who argue the compression phase is an artifact of the activation function and the mutual-information estimator, and from Goldfeld et al. [<a href="https://arxiv.org/abs/1810.05728">3</a>], who formalize the estimator critique and re-run the analysis with noisy networks where mutual information is well-defined. The point of this post is to reread Tishby with both of those objections in hand and ask what remains.</p>

 <p>Both objections are worth reading in full; the summaries below are my best attempt to render them faithfully.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/tishby-fig2.png" alt="Figure 2 of Tishby and Zaslavsky 2015. Qualitative information plane: optimal IB limit (black), suboptimal bifurcations (blue), finite-sample distortion bound (red), and a possible path of the layers in a typical DNN (green), with shaded regions marking the compression gap and generalization gap." style="max-width: 75%;">
 <figcaption>Figure 2 of Tishby and Zaslavsky [<a href="https://arxiv.org/abs/1503.02406">1</a>]. The IB rate-distortion bound with deep-network layers placed on it: each layer trades compression of X against retention of information about Y.</figcaption>
 </figure>

 <h2>The actual claim of the original paper</h2>

 <p>The claim, in plain language: early in training, a deep network builds up features that are useful for predicting the target Y, which shows up as I(T; Y) increasing. During the same period I(T; X) also increases, because the hidden layer is just retaining more about the input. Then a second phase kicks in: the network drops the parts of the input information that do not help with the prediction. That second phase is the "compression phase."</p>

 <p>The empirical support was a small tanh network whose information-plane plot showed a clean fitting-then-compression trajectory.</p>

 <h2>Saxe et al. [<a href="https://openreview.net/forum?id=ry_WPG-A-">2</a>] examine the type of activation function used</h2>

 <p>Saxe and coauthors re-ran the experiments with ReLU instead of tanh. The compression phase did not appear: Î(X;T) stayed roughly flat across training. Their explanation: tanh saturation pushes each unit's activations into a small number of values, the binning estimator is sensitive to that quantization, and the combination produces a curve that looks like compression but is really an estimator artifact. In a network without saturating units (or with a less binning-sensitive estimator), the curve is gone.</p>
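
 <p>The estimator point is easy to reproduce at toy scale. The sketch below is mine, not Saxe's code: because T = f(X) is deterministic and the inputs are almost surely distinct, the binning estimate of I(X;T) reduces to the entropy of the binned hidden activations. Growing the weight scale stands in for weight growth over training; with tanh the binned activations collapse onto a few saturated patterns and the estimate falls (apparent "compression"), while with ReLU it stays near log<sub>2</sub> of the sample count. All sizes, the bin count, and the scales are arbitrary demo choices.</p>

 <pre><code>import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_hidden, n_bins = 4096, 10, 8, 30
X = rng.normal(size=(N, d_in))
W = rng.normal(size=(d_in, d_hidden)) / np.sqrt(d_in)

def binned_entropy_bits(T, lo, hi):
    """Entropy of the discretized hidden vector; for a deterministic layer with
    distinct inputs this is exactly what the binning estimate of I(X;T) returns."""
    edges = np.linspace(lo, hi, n_bins + 1)
    codes = np.digitize(T, edges)                          # per-unit bin indices
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

for scale in [0.5, 2.0, 8.0, 32.0]:                        # stand-in for weight growth during training
    Z = scale * (X @ W)
    i_tanh = binned_entropy_bits(np.tanh(Z), -1.0, 1.0)
    i_relu = binned_entropy_bits(np.maximum(Z, 0.0), 0.0, np.abs(Z).max())
    print(f"scale {scale:5.1f}   I_hat tanh = {i_tanh:5.2f} bits   I_hat relu = {i_relu:5.2f} bits")
# The tanh estimate drops as saturation sets in; the ReLU estimate stays near
# log2(N) = 12 bits.  Same network family, same estimator, opposite story.
</code></pre>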

 <p>That is a serious problem for the original story. The theoretical pull of Tishby's framing was that compression looked universal, a property of deep learning itself. If it only shows up for one activation function with one estimator, the universality claim is much weaker.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/saxe-fig1.png" alt="Figure 1 of Saxe et al. ICLR 2018. Four information-plane panels (A, B, C, D). Top row uses a small toy network with the binning estimator: (A) tanh nonlinearity reproduces the Shwartz-Ziv & Tishby fitting-then-compression trajectory; (B) ReLU nonlinearity shows no compression phase. Bottom row uses a 784-1024-20-20-20-10 MNIST network with the Kolchinsky-Tracey KDE estimator: (C) tanh, no compression observed except in the final sigmoidal classification layer; (D) extension under the same KDE setup.">
 <figcaption>Figure 1 of Saxe et al., <a href="https://openreview.net/forum?id=ry_WPG-A-">ICLR 2018</a>. Swapping tanh for ReLU removes the compression phase (A vs B), and re-running with a KDE estimator at MNIST scale (C, D) also fails to reproduce it. The two-phase information-plane story is contingent on the nonlinearity and the estimator, not a property of training.</figcaption>
 </figure>

 <h2>Goldfeld et al. formally quantify issues with estimator selection</h2>

 <p>Goldfeld and coauthors formalized what Saxe had observed. For continuous inputs and a deterministic map T = f(X) from inputs to representations, the mutual information I(T;X) = H(T) - H(T|X) does not measure anything about training: the conditional term degenerates, so I(T;X) is infinite (or, for discrete inputs, a constant equal to H(X)), and the finite numbers showing up in published plots come entirely from the noise injected by the estimator (binning, added Gaussian noise of variance σ<sup>2</sup>, or KDE). Different choices give different numbers, so the published information-plane trajectories were tracking properties of the estimator at least as much as properties of the network.</p>
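
 <p>The degeneracy is just as easy to see. In the toy sketch below (again mine), the layer is fixed and only the bin width changes; the "mutual information" climbs toward log<sub>2</sub>(N) for N samples as the bins shrink, rather than converging to anything intrinsic about the network.</p>

 <pre><code>import numpy as np

rng = np.random.default_rng(0)
N = 4096
X = rng.normal(size=(N, 10))
W = rng.normal(size=(10, 3)) / np.sqrt(10)
T = np.tanh(2.0 * (X @ W))                   # one fixed, deterministic hidden layer

for bins in [4, 16, 64, 256, 1024]:
    edges = np.linspace(-1.0, 1.0, bins + 1)
    codes = np.digitize(T, edges)            # discretize each of the 3 units
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    print(f"{bins:4d} bins   I_hat(X;T) = {-(p * np.log2(p)).sum():5.2f} bits")
# The estimate keeps rising toward log2(N) = 12 bits as the bins shrink: the
# quantity being estimated is not finite for a deterministic map, so the number
# on the plot is a property of the bin width, not of the network.
</code></pre>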

 <p>One reason Tishby's paper still has value despite the empirical claims being discredited is that it offered a third lens on generalization at a time when the dominant lenses were capacity-based (how restrictive or broad the hypothesis class is) and geometry-based (how smooth or rough the loss landscape is around a minimum). Tishby's lens was sufficient-statistics: representations should retain only the information that matters for the prediction task. The terminology stuck even though the original empirical observation did not, and modern self-supervised methods like infoNCE are essentially information-bottleneck objectives in everything but name.</p>

 <p>The paper got most of its public reach through Natalie Wolchover's 2017 Quanta piece, "New Theory Cracks Open the Black Box of Deep Learning," which presented Tishby's claims in their strongest form. Reading that piece today is mostly useful as a reminder of how far ahead of the evidence the rhetoric got.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/goldfeld-fig1.png" alt="Figure 1 of Goldfeld et al. 2019. Estimated $I(X; \mathrm{Bin}(T_\ell))$ over training epochs for layers 1-5 at four binning resolutions (bin size 0.0001, 0.001, 0.01, 0.1). The apparent compression phase appears or disappears depending on the bin size.">
 <figcaption>Figure 1 of Goldfeld et al. [<a href="https://arxiv.org/abs/1810.05728">3</a>]. The same training run produces qualitatively different "information-plane trajectories" depending on the bin size used to estimate mutual information - the compression phase is partly an estimator artefact.</figcaption>
 </figure>

 <p>For readers who want non-paper summaries of this debate, Adrian Colyer's three-part Morning Paper series on Tishby's IB theory and Saxe's reply is the most accessible walkthrough.</p>

 <p>A weaker version of the original claim does survive: networks trained with SGD often end up with representations that are sufficient for the labels and roughly insensitive to label-irrelevant input variation. "Information bottleneck" is a fine descriptive label for that. The strong version, in which training proceeds through two cleanly separated phases divided by a phase transition in I(T;X), has no empirical support, and the further claim that SGD is implicitly minimizing an information-bottleneck objective remains unproven.</p>

 <p>Some incorrect papers end up more useful to a field than correct ones, because they hand it vocabulary it did not have.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/goldfeld-fig2.png" alt="Figure 2 of Goldfeld et al. 2019. Architectural diagram of the noisy DNN: $T_{\ell-1}$ feeds through $\sigma(W_\ell^{(k)} T_{\ell-1} + b_\ell^{(k)})$ to produce a pre-noise hidden $S_\ell(k)$, to which Gaussian noise $Z_\ell(k) \sim \mathcal{N}(0,\beta^2)$ is added to yield the next-layer hidden $T_\ell(k)$." style="max-width: 55%;">
 <figcaption>Figure 2 of Goldfeld et al. [<a href="https://arxiv.org/abs/1810.05728">3</a>]. The noisy-network construction: adding Gaussian noise after each layer makes mutual information well-defined and lets the analysis distinguish genuine compression dynamics from estimator artefacts.</figcaption>
 </figure>

 <p>Saxe's argument alone is not fatal: a noisy version of the network has well-defined mutual information and can be analyzed directly, which gets you out of the estimator trap. Goldfeld et al. did exactly that and found the two-phase trajectory does not hold up across estimator choices once the quantities being plotted are well-defined. After that, the empirical case for Tishby's strong claims is essentially gone.</p>

 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://arxiv.org/abs/1703.00810">R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arxiv 1703.00810, 2017</a></li>
 <li><a href="https://arxiv.org/abs/1612.00410">A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arxiv 1612.00410, 2017</a></li>
 <li><a href="https://arxiv.org/abs/1807.03748">A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arxiv 1807.03748, 2018</a></li>
 <li><a href="https://arxiv.org/abs/2305.18887">K. Kawaguchi, Z. Deng, X. Ji, and J. Huang. How does information bottleneck help deep learning? arxiv 2305.18887, 2023</a></li>
 </ul>

<h2>References</h2>
 
 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1503.02406">1</a>] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. arxiv 1503.02406, 2015.</li>
 <li>[<a href="https://openreview.net/forum?id=ry_WPG-A-">2</a>] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the information bottleneck theory of deep learning. ICLR, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1810.05728">3</a>] Z. Goldfeld, E. van den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating information flow in deep neural networks. arxiv 1810.05728, 2019.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/tishby-fig2.png"/>
    </item>
    <item>
      <title>On flat minima</title>
      <link>https://rifaki.me/notes/flat-minima/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/flat-minima/</guid>
      <pubDate>Sun, 07 Apr 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Hochreiter and Schmidhuber, Keskar&#x27;s small-batch result, Dinh&#x27;s reparameterization objection, SAM, and where the flatness-generalization debate currently sits.</p><p><img src="https://rifaki.me/notes/img/keskar-fig1.png" alt="On flat minima"/></p>]]></description>
      <content:encoded><![CDATA[<p>Whether flat minima generalize better than sharp ones has been an open question for about seven years. The debate seems to close every year and reopen a year later. Most readers entering the field encounter it as a settled topic in some textbook chapter, in one direction or the other, when in fact it isn't. This is where I think it actually stands.</p>

 <h2>Hochreiter, Schmidhuber, and the original intuition</h2>

 <p>Hochreiter and Schmidhuber introduced the idea in 1997: a minimum that sits in a broad, low-curvature valley should generalize better than one in a sharp valley, on roughly MDL grounds; the broad solution requires fewer bits to specify and is correspondingly less tied to the noise in any one training set. The intuition lay mostly dormant for the next two decades.</p>

 <p>Keskar et al. [<a href="https://arxiv.org/abs/1609.04836">1</a>] reignited the topic by reporting that large-batch SGD with θ<sub>t+1</sub> = θ<sub>t</sub> - η g<sub>t</sub> converges to sharper minima than small-batch SGD on a range of standard benchmarks, with a corresponding gap in test accuracy. The 1D schematic that came with that paper has done a lot of work since: it is the picture more or less every later flat-minima discussion is implicitly arguing about.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/keskar-fig1.png" alt="Figure 1 of Keskar et al. [1]. A 1D schematic of a wide basin around one minimum and a narrow basin around another." width="1600" height="643">
 <figcaption>Figure 1 of Keskar et al. [<a href="https://arxiv.org/abs/1609.04836">1</a>]. The picture that started the modern argument.</figcaption>
 </figure>

 <h2>Dinh et al.'s objection, which should have ended the debate</h2>

 <p>Dinh et al. raised the objection that, on its face, should have ended the debate: sharpness, measured as λ<sub>max</sub>(H) with H = ∇<sup>2</sup> L, is a property of the parameterization, not of the function the network represents. They show explicitly that for any minimum one can find a reparameterization θ → ψ(θ) that scales the Hessian eigenvalues λ<sub>i</sub>(H) to arbitrary values without changing the input-output map. The clean response to this would have been to drop the flat-minima paradigm. The community instead salvaged it by looking for sharpness measures that are invariant under reparameterization. The simplest of these comes from Dziugaite and Roy, who use a PAC-Bayes lens: define sharpness through the largest noise scale σ at which a random weight perturbation ξ ∼ 𝒩(0, σ<sup>2</sup>I) keeps the training loss L(θ + ξ) small in expectation. That measure is reparameterization-invariant by construction and correlates with generalization in their experiments.</p>
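
 <p>Dinh's construction can be checked numerically in a few lines. The sketch below is my own toy (a random parameter point of a 2-4-1 ReLU network rather than a trained minimum, which is all the rescaling argument needs): multiply the first layer by a, divide the second by a, and the network computes exactly the same function while the largest eigenvalue of a finite-difference Hessian of the loss moves by orders of magnitude.</p>

 <pre><code>import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))                 # toy inputs
y = rng.normal(size=16)                      # toy targets

def unpack(theta):
    return theta[:8].reshape(4, 2), theta[8:].reshape(1, 4)

def forward(theta, X):
    W1, W2 = unpack(theta)
    return (np.maximum(X @ W1.T, 0.0) @ W2.T).ravel()    # 2-4-1 ReLU network

def loss(theta):
    return 0.5 * np.mean((forward(theta, X) - y) ** 2)

def hessian_max_eig(theta, eps=1e-3):
    """Largest eigenvalue of a finite-difference Hessian of the loss."""
    d = theta.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (loss(theta + ei + ej) - loss(theta + ei)
                       - loss(theta + ej) + loss(theta)) / eps**2
    return np.linalg.eigvalsh(0.5 * (H + H.T)).max()

theta = 0.5 * rng.normal(size=12)

# Dinh-style rescaling: W1 -> a*W1, W2 -> W2/a.  For positive a and ReLU
# activations this leaves the input-output map, and hence the loss, unchanged.
a = 10.0
W1, W2 = unpack(theta)
theta_rescaled = np.concatenate([(a * W1).ravel(), (W2 / a).ravel()])

print(np.allclose(forward(theta, X), forward(theta_rescaled, X)))   # True: same function
print(hessian_max_eig(theta), hessian_max_eig(theta_rescaled))      # wildly different "sharpness"
</code></pre>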

 <figure>
 <img src="https://rifaki.me/notes/img/dinh-fig1.png" alt="Figure 1 of Dinh et al. 2017. Schematic of an $\epsilon$-flat minimum: a parabolic loss curve in $(\theta, L)$ with a horizontal cutoff at level $\epsilon$ above the minimum, shading the connected set of parameters whose loss stays within $\epsilon$ of the optimum.">
 <figcaption>Figure 1 of Dinh et al. [<a href="https://arxiv.org/abs/1703.04933">2</a>]. The width of the shaded ε-flat region is the geometric quantity flatness intuitions are pointing at - but its size depends on the parameterization, which is the heart of the reparameterization argument.</figcaption>
 </figure>

 <h2>SAM and where flatness wins</h2>

 <p>Building on Dziugaite and Roy's framing, Foret et al. [<a href="https://arxiv.org/abs/2010.01412">3</a>] turned a reparameterization-aware sharpness measure into a training objective: minimize the worst-case loss in an ℓ<sub>2</sub> ball around the current weights. They called the procedure SAM, and it does deliver consistent test-accuracy gains, particularly on architectures without strong built-in inductive biases (vanilla MLPs, plain ViTs without strong augmentation). Behnam Neyshabur, a co-author, has remained one of the more consistent public advocates for SAM as a generalization tool.</p>
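
 <p>For concreteness, one SAM update looks roughly like this, stripped of framework machinery. This is my own minimal sketch of the two gradient evaluations, not the authors' implementation; lr and rho are illustrative values.</p>

 <pre><code>import numpy as np

def sam_step(theta, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware step: ascend to the (first-order) worst-case point in an
    l2 ball of radius rho, take the gradient there, apply it at the original weights."""
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # closed-form inner maximizer, to first order
    g_worst = grad_fn(theta + eps)                # gradient at the perturbed weights
    return theta - lr * g_worst

# Toy usage on a quadratic bowl, just to show the call pattern.
grad = lambda t: 2.0 * t
theta = np.array([3.0, -2.0])
for _ in range(50):
    theta = sam_step(theta, grad)
print(theta)   # hovers very close to the minimum at the origin
</code></pre>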

 <p>On the side that flatness improves generalization at scale, the main public voices are Behnam Neyshabur and collaborators across several papers and talks, and Boaz Barak's posts at Windows on Theory. Ferenc Huszár's inFERENCe is the other blog I keep coming back to on this; he writes carefully about flatness, generalization, and the Bayesian readings sitting under both. The Off-convex blog has good coverage of mode connectivity that puts flatness inside a larger geometric story, which I find more useful than treating it as an independent explanation. From the continuous-time view, the stochastic-diffusion picture of SGD is still the most direct way to see why noisy iterates concentrate near flatter minima. Huszár's adjacent essay "Everything that Works Works Because It Is Bayesian" is the prior-based reading I find most useful.</p>

 <p>Stop the timeline here and the flat-minima view looks basically validated: the naive Hessian definition was broken, but a reparameterization-invariant version of the phenomenon is real and SAM is a way to act on it. Kaddour et al. [<a href="https://arxiv.org/abs/2202.00661">4</a>] complicate that picture. They sweep SAM and SWA against vanilla Adam across architectures and dataset scales, and report that the SAM gain shrinks as either model size or dataset size grows. In the regimes where generalization is most useful to improve, the advantage over a tuned Adam baseline narrows substantially.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/foret-fig1.png" alt="Figure 1 of Foret et al. 2020 (SAM). Left: percent error reduction from SAM across CIFAR10, CIFAR100, ImageNet, finetuning, SVHN, F-MNIST, and noisy CIFAR. Right: 3D loss landscape of a SAM-trained network (smooth blue valley) compared with a sharp, jagged surface from standard training.">
 <figcaption>Figure 1 of Foret et al. [<a href="https://arxiv.org/abs/2010.01412">3</a>]. The empirical headline (error reduction across tasks) plus the loss-landscape contrast that motivates the worst-case-in-a-ball SAM objective.</figcaption>
 </figure>

 <p>I am left with an uneven picture. The naive Hessian definition of sharpness has no causal link to generalization (by Dinh). A reparameterization-invariant version does correlate with generalization at small-to-medium scale, and SAM produces real test-accuracy gains on architectures with weak inductive biases. At very large scale the correlation weakens and the SAM gain fades. I do not have a clean account of why. My guess is that with rich enough data and architectures, the optimizer's trajectory and the data distribution dominate whatever local geometry the final minimum has, and the landscape framing stops being the right description. That is speculation, and I would not put weight on it beyond that.</p>
 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://arxiv.org/abs/1703.11008">G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arxiv 1703.11008, 2017</a></li>
 <li><a href="https://arxiv.org/abs/2106.01548">X. Chen, C.-J. Hsieh, and B. Gong. When vision transformers outperform ResNets without pre-training or strong data augmentations. arxiv 2106.01548, 2021</a></li>
 <li><a href="https://arxiv.org/abs/1803.05407">P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads to wider optima and better generalization. arxiv 1803.05407, 2018</a></li>
 <li><a href="https://arxiv.org/abs/2203.05482">M. Wortsman et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. arxiv 2203.05482, 2022</a></li>
 <li><a href="https://direct.mit.edu/neco/article/9/1/1/6027/Flat-Minima">S. Hochreiter and J. Schmidhuber. Flat minima</a></li>
 </ul>

 

<h2>References</h2>
 
 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1609.04836">1</a>] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arxiv 1609.04836, 2016.</li>
 <li>[<a href="https://arxiv.org/abs/1703.04933">2</a>] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. arxiv 1703.04933, 2017.</li>
 <li>[<a href="https://arxiv.org/abs/2010.01412">3</a>] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arxiv 2010.01412, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/2202.00661">4</a>] J. Kaddour, L. Liu, R. Silva, and M. J. Kusner. When do flat minima optimizers work? arxiv 2202.00661, 2022.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/keskar-fig1.png"/>
    </item>
    <item>
      <title>Four explanations for Grokking</title>
      <link>https://rifaki.me/notes/grokking/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/grokking/</guid>
      <pubDate>Sat, 24 Feb 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Power&#x27;s modular-arithmetic finding, Nanda&#x27;s circuit-level analysis, and the four candidate explanations that may all be the same mechanism.</p><p><img src="https://rifaki.me/notes/img/grokking-fig1.png" alt="Four explanations for Grokking"/></p>]]></description>
      <content:encoded><![CDATA[<p>The network does generalize, but only long after it has already fit the training data. The paper is Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets by Power et al. [<a href="https://arxiv.org/abs/2201.02177">1</a>]. I came across it maybe a week after it was posted and didn't really know what to do with it for a while.</p>

 <h2>Why this is a puzzle</h2>

 <p>Standard stories about generalization do not predict a long delay. The VC and Rademacher story ties the generalization gap to the capacity of the hypothesis class relative to the sample size; nothing in it depends on how long you train. The implicit-bias stories say SGD finds a minimum that generalizes, but that minimum should be reached at roughly the same time the training loss converges, not thousands of steps later.</p>

 <p>Grokking says training has a second stage that is not driven by the training loss: something else is moving the weights (θ) during the long stretch where the training loss is already near zero.</p>

 <h2>Four explanations</h2>

 <p>The first explanation - and the one that the original paper kind of points toward - is that weight decay (λ ‖θ‖<sub>2</sub><sup>2</sup>) is a slow regularizer. The paper notes that grokking only happens with weight decay turned on. Once training loss is zero, the weight-decay term keeps pulling the norm of the weights down even though the loss gradient has essentially vanished. The result is slow drift toward a lower-norm solution, which may be the one that generalizes.</p>
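
 <p>A toy version of that drift (my sketch, not the paper's setup): start gradient descent from an interpolating solution of an overparameterized linear problem, so the data-fit gradient is essentially zero, and turn on weight decay. Training error stays tiny throughout while the weight norm keeps falling toward the small-norm interpolant. The learning rate and decay strength are arbitrary demo values.</p>

 <pre><code>import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                                   # overparameterized: many exact interpolants
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta_min_norm = np.linalg.pinv(X) @ y           # the minimum-norm interpolant
v = rng.normal(size=p)
v -= np.linalg.pinv(X) @ (X @ v)                 # project v onto the null space of X
theta = theta_min_norm + 3.0 * v                 # another exact interpolant, much larger norm

lr, wd = 0.01, 0.05
for step in range(2001):
    grad = X.T @ (X @ theta - y) / n + wd * theta          # data-fit gradient + weight decay
    theta -= lr * grad
    if step % 500 == 0:
        mse = np.mean((X @ theta - y) ** 2)
        print(f"step {step:4d}   train MSE {mse:.1e}   "
              f"norm {np.linalg.norm(theta):6.2f}   "
              f"dist to min-norm {np.linalg.norm(theta - theta_min_norm):6.2f}")
# Training error stays near zero the whole time; the decay term alone walks
# theta toward the small-norm solution (strictly the ridge solution, which
# approaches the min-norm interpolant as wd goes to 0).
</code></pre>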

 <figure>
 <img src="https://rifaki.me/notes/img/grokking-fig1.png" alt="Figure 1 of arxiv 2201.02177. Training and validation accuracy on modular arithmetic as a function of optimization step on a log scale. The validation curve stays at chance while the training curve saturates, then jumps to one hundred percent much later." width="1600" height="432">
 <figcaption>Figure 1 of Power et al. [<a href="https://arxiv.org/abs/2201.02177">1</a>]. Training accuracy saturates early; validation accuracy stays flat for orders of magnitude of additional steps and then jumps.</figcaption>
 </figure>

 <p>The second explanation comes via mechanistic interpretability. Neel Nanda and collaborators identified specific circuits inside the small grokking networks that implement modular arithmetic via a Fourier (f̂(ω)) decomposition. The grokking transition is the point at which those circuits finish being assembled: before it, the network has to memorize by brute force; after it, it can actually compute. The circuits-level framing this work builds on is laid out in Elhage et al., which is where I'd send anyone who wants to understand what it could mean to talk about a 'circuit' inside a transformer.</p>
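
 <p>The Fourier algorithm itself fits in a few lines. Here is a toy rendition, mine rather than anything extracted from the trained network: embed each residue as a cosine/sine pair at a single frequency, combine the two embeddings with the product identities to get the cosine and sine of the sum, and score each candidate answer c by cos(2πk(a+b-c)/P). With P prime and the frequency coprime to P, the score is maximized exactly at c = (a+b) mod P; the trained networks spread the same computation over a handful of frequencies.</p>

 <pre><code>import numpy as np

P = 113                       # modulus used in the grokking experiments
k = 14                        # one frequency; the trained network uses several
a, b = 47, 92                 # example input pair

def embed(x):
    """(cos, sin) at frequency k -- the basis the DFT of the embeddings concentrates on."""
    angle = 2 * np.pi * k * x / P
    return np.cos(angle), np.sin(angle)

ca, sa = embed(a)
cb, sb = embed(b)

# Product identities turn the two embeddings into cos/sin of the *sum*:
cos_sum = ca * cb - sa * sb            # cos(2*pi*k*(a+b)/P)
sin_sum = sa * cb + ca * sb            # sin(2*pi*k*(a+b)/P)

# Logit for candidate answer c: cos(2*pi*k*(a+b-c)/P), maximal at c = (a+b) mod P.
c = np.arange(P)
logits = cos_sum * np.cos(2 * np.pi * k * c / P) + sin_sum * np.sin(2 * np.pi * k * c / P)
print(np.argmax(logits), (a + b) % P)  # both print 26
</code></pre>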

 <p>The third explanation, due to Liu, Michaud, and Tegmark in <a href="https://arxiv.org/abs/2210.01117">Omnigrok</a>, is geometric: the generalizing solution lies in a narrow "Goldilocks zone" of weight norms, and grokking is what you see when the optimizer has been started outside that zone and is slowly being walked into it. Weight decay is what does the walking, which is consistent with explanation one. The fourth angle is to step back and read all of this as a single process viewed at different resolutions: grokking is double descent, but resolved over training time rather than over model or dataset size.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/nanda-fig2.png" alt="Figure 2 of Nanda et al. 2023. Left: histogram of fraction-of-variance-explained by degree-2 polynomials over neurons. Right: heatmap of components of $W_L$ corresponding to frequency-14 neurons, showing weight concentrated at the sin/cos basis pair for that frequency.">
 <figcaption>Figure 2 of Nanda et al. [<a href="https://arxiv.org/abs/2301.05217">2</a>]. The grokked network's neurons are well-explained by degree-2 polynomials (left), and individual neurons read off specific Fourier-basis pairs from the embedding (right) - the Fourier circuit is mechanistically visible.</figcaption>
 </figure>

 <p>In this reading, grokking is double descent unfolding in time. The most lucid public articulations of this "double descent over time" reading come from Preetum Nakkiran's writing, and OpenAI's Deep Double Descent writeup paints the picture visually. On the mechanistic side, Neel Nanda wrote an intuitive walkthrough of the Fourier circuit story and maintains a corresponding paper page. Google PAIR's <a href="https://pair.withgoogle.com/explorables/grokking/">Do Machine Learning Models Memorize or Generalize?</a> poses the same question in nearby visual vocabulary.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/nanda-fig5.png" alt="Figure 5 of arxiv 2301.05217. The Discrete Fourier Transform of the grokked network's input embeddings, showing concentration on a small set of frequencies." width="1600" height="435">
 <figcaption>Figure 5 of Nanda et al. [<a href="https://arxiv.org/abs/2301.05217">2</a>]. The DFT of the network's learned embeddings concentrates in a small number of frequencies after the transition: the Fourier-based modular arithmetic circuit made visible.</figcaption>
 </figure>

 <p>These four views are looking at the same puzzle from different angles, and together they read as one story at different levels of abstraction. Weight decay is the optimization pressure: it selects a minimum-norm interpolant, and in modular arithmetic that interpolant admits the Fourier circuit because Fourier captures the low-rank (rank(W) ≪ d) structure of the task. The broader pattern of fast memorization followed by slow compression is the shape double descent takes when you resolve it over time instead of over model size.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/nanda-fig3.png" alt="Figure 3 of Nanda et al. 2023. Average train accuracy (saturates near 1.0 within ~1k epochs), average test accuracy (stays at chance for ~5k epochs then jumps), and corresponding average train/test log-loss curves over epochs. Faded background lines show individual seeds.">
 <figcaption>Figure 3 of Nanda et al. [<a href="https://arxiv.org/abs/2301.05217">2</a>]. The grokking pattern made averaged: training accuracy saturates fast, test accuracy lags by orders of magnitude before its sudden rise.</figcaption>
 </figure>

 <p>The one thing none of these explanations cleanly accounts for is the abruptness of the transition. Smoothly shrinking the norm and smoothly assembling circuits should give smoothly rising validation accuracy, not a near-vertical jump. My read is that the sharpness is largely a measurement artifact: accuracy for a softmax classifier, σ(z)<sub>j</sub> = e<sup>z<sub>j</sub></sup>/∑<sub>k</sub> e<sup>z<sub>k</sub></sup>, is computed through a top-1 argmax, so logits that evolve continuously map onto a piecewise-constant accuracy curve that flips once the right logit crosses its competitor.</p>
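
 <p>A toy illustration of that argmax point (my own, not from any of the papers): let the correct logit overtake its competitor at a constant rate, as a stand-in for logits evolving continuously during training. The cross-entropy falls smoothly through the crossing; the accuracy is a hard step.</p>

 <pre><code>import numpy as np

steps = np.arange(200)
logit_correct = -1.0 + 0.01 * steps            # grows smoothly with "training"
logit_wrong = np.zeros_like(steps, dtype=float)

gap = logit_correct - logit_wrong
loss = np.log1p(np.exp(-gap))                  # cross-entropy: smooth in the gap
accuracy = (gap > 0).astype(float)             # top-1 accuracy: flips at the crossing

print(loss[95:105].round(3))                   # changes smoothly through step 100
print(accuracy[95:105])                        # 0 ... 0 then 1 ... 1: a hard step
</code></pre>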

 <figure class="tweet-embed">
 <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">So what&#39;s behind grokking?<br>Three phases of training:<br>1 Memorization<br>2 Circuit formation: It smoothly TRANSITIONS from memorising to generalising<br>3 Cleanup: Removing the memorised solution<br><br>Test performance needs a general circuit AND no memorisation so Grokking occurs at cleanup! <a href="https://t.co/zLnP92RXKV">pic.twitter.com/zLnP92RXKV</a></p>&mdash; Neel Nanda (@NeelNanda5) <a href="https://twitter.com/NeelNanda5/status/1616590960066203648?ref_src=twsrc%5Etfw">January 21, 2023</a></blockquote>
 <figcaption>Neel Nanda summarizing the three-phase mechanistic account of grokking from [<a href="https://arxiv.org/abs/2301.05217">2</a>]. The transition to generalization happens during cleanup, not during circuit formation, which is why it looks sudden.</figcaption>
 </figure>

 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://www.neelnanda.io/mechanistic-interpretability/modular-addition-walkthrough">accessible walkthrough</a></li>
 <li><a href="https://www.neelnanda.io/grokking-paper">dedicated paper page</a></li>
 <li><a href="https://transformer-circuits.pub/2021/framework/index.html">N. Elhage et al. A mathematical framework for transformer circuits. transformer-circuits.pub/2021/framework, 2021</a></li>
 <li><a href="https://arxiv.org/abs/2210.01117">Z. Liu, E. Michaud, and M. Tegmark. Omnigrok: Grokking beyond algorithmic data. arxiv 2210.01117, 2022</a></li>
 <li><a href="https://arxiv.org/abs/2206.04817">V. Thilak, E. Littwin, S. Zhai, O. Saremi, R. Paiss, and J. Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arxiv 2206.04817, 2022</a></li>
 <li><a href="https://arxiv.org/abs/1812.11118">M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. arxiv 1812.11118, 2018</a></li>
 <li><a href="https://arxiv.org/abs/1912.02292">P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double descent: Where bigger models and more data hurt. arxiv 1912.02292, 2019</a></li>
 <li><a href="https://arxiv.org/abs/2303.06173">X. Davies, L. Langosco, and D. Krueger. Unifying grokking and double descent. arxiv 2303.06173, 2023</a></li>
 <li><a href="https://arxiv.org/abs/2501.04697">L. Prieto, M. Barsbey, P. Mediano, and T. Birdal. Grokking at the edge of numerical stability. arxiv 2501.04697, 2025</a></li>
 <li><a href="https://arxiv.org/abs/2311.18817">K. Lyu, J. Jin, Z. Li, S. S. Du, J. D. Lee, and W. Hu. Dichotomy of early and late phase implicit biases can provably induce grokking. arxiv 2311.18817, 2023</a></li>
 </ul>

<h2>References</h2>
 
 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/2201.02177">1</a>] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: generalization beyond overfitting on small algorithmic datasets. arxiv 2201.02177, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/2301.05217">2</a>] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. arxiv 2301.05217, 2023.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/grokking-fig1.png"/>
    </item>

  </channel>
</rss>
