<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Notes by Mouhssine Rifaki</title>
    <link>https://rifaki.me/notes/</link>
    <atom:link href="https://rifaki.me/feed.xml" rel="self" type="application/rss+xml"/>
    <description>Essays on deep learning theory, reinforcement learning, and mathematical statistics.</description>
    <language>en-us</language>
    <copyright>Copyright 2024-2026 Mouhssine Rifaki</copyright>
    <managingEditor>mouhssine@rifaki.me (Mouhssine Rifaki)</managingEditor>
    <webMaster>mouhssine@rifaki.me (Mouhssine Rifaki)</webMaster>
    <lastBuildDate>Wed, 29 Apr 2026 07:22:23 +0000</lastBuildDate>
    <pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate>
    <ttl>1440</ttl>
    <image>
      <url>https://rifaki.me/portrait.jpg</url>
      <title>Notes by Mouhssine Rifaki</title>
      <link>https://rifaki.me/notes/</link>
    </image>

    <item>
      <title>Score matching and diffusion</title>
      <link>https://rifaki.me/notes/diffusion-models/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/diffusion-models/</guid>
      <pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Denoising score matching, the SDE view, Karras&#x27;s design-space disentanglement, and where diffusion sits relative to flow matching and consistency models.</p><p><img src="https://rifaki.me/notes/img/diffusion-forward-reverse.png" alt="Score matching and diffusion"/></p>]]></description>
      <content:encoded><![CDATA[<p>The setup goes back to Sohl-Dickstein et al. [<a href="https://arxiv.org/abs/1503.03585">1</a>]. The forward chain takes data x<sub>0</sub> and adds Gaussian noise according to a fixed schedule ᾱ<sub>t</sub>, producing intermediate samples x<sub>1</sub>, …, x<sub>T</sub> via <img src="https://rifaki.me/notes/img/math/fcdfd763851db7dc.svg" alt="$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$$" class="math-display" width="223" height="26"/> with ε ∼ 𝒩(0, I). The schedule is chosen so that x<sub>T</sub> is essentially a standard Gaussian. The reverse chain undoes the corruption: sampling from the data distribution reduces to learning the conditional p(x<sub>t-1</sub> | x<sub>t</sub>) at each step.</p>
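        <p>A minimal numpy sketch of that forward corruption, assuming a precomputed cumulative schedule <code>alpha_bar</code> (any of the standard β-schedules fits this shape once cumulative products are taken); the function name and signature are illustrative, not from any particular codebase:</p>

        <pre><code>import numpy as np

def forward_noise(x0, t, alpha_bar, rng=None):
    """Sample x_t ~ q(x_t | x_0) for the DDPM forward chain.

    x0        : clean data, shape (batch, dim)
    t         : integer timestep index into the schedule
    alpha_bar : cumulative products of (1 - beta_s), shape (T,)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is the regression target for a noise-prediction network
</code></pre>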

        <h2>Score matching</h2>

        <p>The DDPM forward chain has a clean dual under score matching, and once the two are placed side by side they are not separate ideas. The score function s(x,t) = ∇<sub>x</sub> log p<sub>t</sub>(x) is the gradient of the log-density of the noisy distribution at noise level t evaluated at x; it points in the direction along which the log-density rises most steeply. Hyvärinen's original score-matching objective is the square norm of the score plus the trace of the Hessian of the log-density. The Hessian-trace term is expensive in high dimensions. Vincent's denoising score matching gets around this: with Gaussian noise of variance σ<sup>2</sup>, the optimal MMSE denoiser D<sup>*</sup>(x̃) = 𝔼[x | x̃] satisfies <img src="https://rifaki.me/notes/img/math/925a6ce0278dacdd.svg" alt="$$\nabla_{\widetilde{x}} \log p_\sigma(\widetilde{x}) = \frac{D^*(\widetilde{x}) - \widetilde{x}}{\sigma^2},$$" class="math-display" width="239" height="45"/> so predicting the noise (equivalently, predicting the clean image) is the same problem as estimating the score of the noisy distribution, with mean-squared error as the loss.</p>
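        <p>As a sketch of that equivalence, the denoising score-matching loss at a single noise level is a regression onto the analytically known score of the Gaussian corruption; <code>score_fn</code> stands in for whatever network is being trained and is an assumption of this sketch:</p>

        <pre><code>import numpy as np

def dsm_loss(score_fn, x0, sigma, rng=None):
    """Denoising score matching at one noise level: regress onto the score of q(x_noisy | x0)."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    x_noisy = x0 + sigma * eps
    # Score of the Gaussian conditional: (x0 - x_noisy) / sigma^2 = -eps / sigma
    target = -eps / sigma
    pred = score_fn(x_noisy, sigma)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))
</code></pre>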

        <p>Song and Ermon [<a href="https://arxiv.org/abs/1907.05600">2</a>] estimated these scores at a sweep of noise levels and used annealed Langevin dynamics to draw samples from the estimates. Ho, Jain, and Abbeel showed that DDPMs trained with a weighted variational bound reduce, in a particular weighting limit, to denoising score matching at multiple noise levels. The point shared by both: learning to predict the noise that was added at a given level is enough to recover the score of the noisy distribution at that level, and sampling is then iterated denoising along the noise schedule.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/diffusion-score-field.png" alt="Sliced score matching loss trajectories on a toy problem: the unconstrained estimator diverges to large negative values while the noise-conditioned variant stays bounded over training iterations.">
          <figcaption>Training-stability comparison from the score-matching literature (Song et al. [<a href="https://arxiv.org/abs/1907.05600">2</a>] and surrounding work). The score-matching loss is unbounded below without noise conditioning; injecting noise restores stable training.</figcaption>
        </figure>

        <h2>SDE view</h2>

        <p>Song et al. unified the discrete and score-based views inside stochastic differential equations. The forward process is an Itô SDE <img src="https://rifaki.me/notes/img/math/52821e31dbf26a0e.svg" alt="$$dx_t=f\left( x_t,t\right)\, dt+g\left( t\right)\, dw_t$$" class="math-display" width="240" height="20"/> that continuously corrupts data into noise. Anderson's reverse-time formula gives a backward SDE driven by the score <img src="https://rifaki.me/notes/img/math/fd3608f73d73a0b4.svg" alt="$$dx_t=\left[ f\left( x_t,t\right)-g^{2}\left( t\right)\nabla _{x}\log p_{t}\left( x_{t}\right) \right] dt+g\left( t\right)\, d\bar {w}_{t},$$" class="math-display" width="444" height="25"/> and a deterministic probability-flow ODE with the same marginals <img src="https://rifaki.me/notes/img/math/fd23102ac17f70a0.svg" alt="$$dx_{t}=\left[f\left( x_{t},t\right)-\frac{1}{2}g^{2}\left( t\right)\nabla _{x}\log p_{t}\left( x_{t}\right)\right] dt.$$" class="math-display" width="371" height="49"/> The ODE is useful for likelihood evaluation and fast sampling; the SDE is what most early samplers used. DDPMs, score-based models, and probability-flow ODE samplers are different discretizations of the same underlying dynamics.</p>
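        <p>A minimal Euler integration of the probability-flow ODE for the simplest variance-exploding case (f = 0, corruption x<sub>σ</sub> = x<sub>0</sub> + σε), where the ODE reduces to dx/dσ = -σ ∇<sub>x</sub> log p<sub>σ</sub>(x); <code>score_fn</code> is assumed to be a trained score estimator:</p>

        <pre><code>import numpy as np

def pf_ode_euler(score_fn, x_init, sigmas):
    """Euler steps on dx/dsigma = -sigma * score(x, sigma), from high noise down to low.

    x_init : sample from the prior, roughly N(0, sigmas[0]^2 I)
    sigmas : decreasing noise levels, e.g. np.geomspace(80.0, 0.002, 50)
    """
    x = x_init
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        drift = -sigma * score_fn(x, sigma)     # ODE drift at the current noise level
        x = x + (sigma_next - sigma) * drift    # sigma_next is smaller, so this is a denoising step
    return x
</code></pre>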

        <p>The SDE view also separates modeling from numerics. The noise schedule, solver, parameterization, and preconditioner can all be changed without touching the underlying problem of learning the score.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/diffusion-sde-ode-map.png" alt="Figure 1 of Song et al. 2020. Forward SDE turning data into noise (top) and the score-driven reverse SDE turning noise back into data (bottom), with intermediate sample crops at increasing noise levels.">
          <figcaption>Figure 1 of Song et al. [<a href="https://arxiv.org/abs/2011.13456">4</a>]. The learned score is the object shared by the stochastic reverse process and (in the same paper) the deterministic probability-flow ODE.</figcaption>
        </figure>

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">There is a lot of great writing on flow matching out there all of a sudden! This post clarifies the connection with diffusion models -- they are essentially two different ways to describe the same class of models. <a href="https://t.co/lLokMmxxdz">https://t.co/lLokMmxxdz</a></p>&mdash; Sander Dieleman (@sedielem) <a href="https://twitter.com/sedielem/status/1863661809355362538?ref_src=twsrc%5Etfw">December 2, 2024</a></blockquote>
        </figure>

        <figure>
          <img src="https://rifaki.me/notes/img/diffusion-forward-reverse.png" alt="Figure 2 of Ho, Jain, Abbeel 2020 (DDPM). Directed graphical model of the forward and reverse chains between x_T and x_0.">
          <figcaption>Figure 2 of Ho et al. [<a href="https://arxiv.org/abs/2006.11239">3</a>]. The forward chain corrupts data into noise; the learned reverse chain inverts it step by step.</figcaption>
        </figure>

        <p>Lu et al. [<a href="https://arxiv.org/abs/2206.00927">6</a>] built DPM-Solver out of the semi-linear structure of the probability-flow ODE and cut sampling from thousands of steps to tens, with no retraining required.</p>

        <h2>Karras et al.</h2>

        <p>Karras et al. [<a href="https://arxiv.org/abs/2206.00364">5</a>] argued that diffusion practice had become unnecessarily entangled: sampling schedule, loss weighting, noise parameterization, network preconditioning, and solver choice were all bundled together. Pulling each design decision apart showed that most of the published performance gains came from untangling the design space, not from a new generative principle. Their EDM recipe (continuous noise levels indexed by σ, σ-conditioned network preconditioning, and a second-order Heun ODE solver) has since become a common baseline.</p>
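        <p>One of the pieces the paper pulls apart is the placement of noise levels. A sketch of the ρ-warped grid between σ<sub>max</sub> and σ<sub>min</sub>; the constants here are the commonly quoted EDM defaults and should be read as assumptions of this sketch rather than a quotation:</p>

        <pre><code>import numpy as np

def edm_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Rho-warped noise-level grid from sigma_max down to sigma_min, ending at exactly zero."""
    i = np.arange(n_steps)
    sigmas = (sigma_max ** (1 / rho)
              + i / (n_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(sigmas, 0.0)
</code></pre>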

        <figure>
          <img src="https://rifaki.me/notes/img/diffusion-design-space.png" alt="Table 1 of Karras et al. 2022 (EDM). Explicit design-space tabulation of noise schedules, prediction targets, preconditioning, and samplers across DDPM, NCSN, EDM, and related variants.">
          <figcaption>Table 1 of Karras et al. [<a href="https://arxiv.org/abs/2206.00364">5</a>]. The score-estimation problem stays fixed while schedules, targets, preconditioning, and solvers move around it.</figcaption>
        </figure>

        <p>The implementation history runs through three papers. Dhariwal and Nichol showed that diffusion models beat GANs on ImageNet by combining classifier guidance with architecture and training changes. Ho and Salimans then dropped the auxiliary classifier in favor of classifier-free guidance, training the conditional and unconditional scores jointly and combining them at sampling time. Rombach et al. moved the diffusion process inside the latent space of a pre-trained autoencoder, which is what made high-resolution diffusion feasible at academic compute and what Stable Diffusion is built on.</p>
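        <p>The sampling-time combination in classifier-free guidance is a single linear operation on the two predictions; a sketch assuming a noise-prediction network <code>eps_model</code> that accepts an optional conditioning argument (with <code>None</code> standing in for the unconditional branch), a common convention rather than a fixed API:</p>

        <pre><code>def cfg_eps(eps_model, x_t, t, cond, guidance_scale):
    """Classifier-free guidance: push the conditional prediction away from the unconditional one.

    guidance_scale = 0 recovers the unconditional model, 1 the plain conditional model,
    and values above 1 amplify the conditioning signal.
    """
    eps_uncond = eps_model(x_t, t, cond=None)   # branch trained with conditioning dropped out
    eps_cond = eps_model(x_t, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
</code></pre>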

        <p>Natural data sits near low-dimensional manifolds where direct density modeling is hard. Adding white noise thickens the manifold: at high noise levels the distribution is smooth, at low noise levels it is detailed but local. Diffusion replaces a single full-density estimation problem with a sequence of denoising problems indexed by noise level. Compare with autoregressive models, which generate sequentially conditioned on prior tokens, and with flows, which accept restrictions on the transforms they can express in exchange for tractable likelihoods. Diffusion's training objective is stable in a way GAN training is not, while paying for it with iterative sampling that DPM-Solver, consistency models, and distillation have largely clawed back.</p>

        <p>Semantic abstraction inside the model is still not well understood theoretically. The score tells you how to move from a noisy sample toward a clean one. It does not say why text prompts, latent-space guidance, or multimodal conditioning organize concepts the way they do. Those mechanisms are built on top of a fixed score-matching core, but the core does not force any of them.</p>

        <p>Whether diffusion is the right parameterization is also unclear to me. Flow matching frames generation as learning a vector field that transports a simple prior to the data distribution along possibly non-straight paths, with a regression objective that does not require an SDE. Rectified flow constrains the paths to be nearly straight, which keeps few-step sampling accurate. Consistency models compress the iterative denoising sampler into a single-step generator while preserving sample quality. None of these compete with score-based modeling; they are alternative parameterizations of the same vector-field-across-noise-levels problem.</p>
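        <p>As a sketch of how close the parameterizations sit, the conditional flow-matching objective with straight (rectified-flow) paths is a plain regression onto the displacement x<sub>1</sub> - x<sub>0</sub>; <code>v_model</code> is a placeholder for the learned vector field:</p>

        <pre><code>import numpy as np

def rectified_flow_loss(v_model, x0, x1, rng=None):
    """Flow matching with linear paths x_t = (1 - t) x0 + t x1 and constant target velocity x1 - x0.

    x0 : samples from the prior (e.g. standard Gaussian noise)
    x1 : samples from the data distribution
    """
    if rng is None:
        rng = np.random.default_rng(0)
    t = rng.uniform(size=(x0.shape[0], 1))
    xt = (1.0 - t) * x0 + t * x1
    pred = v_model(xt, t)
    return np.mean(np.sum((pred - (x1 - x0)) ** 2, axis=-1))
</code></pre>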
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/2105.05233">P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. arxiv 2105.05233, 2021</a></li>
          <li><a href="https://arxiv.org/abs/2207.12598">J. Ho and T. Salimans. Classifier-free diffusion guidance. arxiv 2207.12598, 2022</a></li>
          <li><a href="https://arxiv.org/abs/2112.10752">R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. arxiv 2112.10752, 2021</a></li>
          <li><a href="https://arxiv.org/abs/2210.02747">Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arxiv 2210.02747, 2022</a></li>
          <li><a href="https://arxiv.org/abs/2209.03003">X. Liu, C. Gong, and Q. Liu. Flow straight and fast: learning to generate and transfer data with rectified flow. arxiv 2209.03003, 2022</a></li>
          <li><a href="https://arxiv.org/abs/2303.01469">Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. arxiv 2303.01469, 2023</a></li>
          <li><a href="https://www.jmlr.org/papers/v6/hyvarinen05a.html">A. Hyv&auml;rinen. Estimation of non-normalized statistical models by score matching</a></li>
          <li><a href="https://www.iro.umontreal.ca/~vincentp/Publications/smdae_techreport.pdf">P. Vincent. A connection between score matching and denoising autoencoders</a></li>
        </ul>

        

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/1503.03585">1</a>] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arxiv 1503.03585, 2015.</li>
          <li>[<a href="https://arxiv.org/abs/1907.05600">2</a>] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. arxiv 1907.05600, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/2006.11239">3</a>] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. arxiv 2006.11239, 2020.</li>
          <li>[<a href="https://arxiv.org/abs/2011.13456">4</a>] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. arxiv 2011.13456, 2020.</li>
          <li>[<a href="https://arxiv.org/abs/2206.00364">5</a>] T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. arxiv 2206.00364, 2022.</li>
          <li>[<a href="https://arxiv.org/abs/2206.00927">6</a>] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arxiv 2206.00927, 2022.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/diffusion-forward-reverse.png"/>
    </item>
    <item>
      <title>Adversarial examples</title>
      <link>https://rifaki.me/notes/adversarial-examples/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/adversarial-examples/</guid>
      <pubDate>Sat, 07 Feb 2026 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Adversarial examples, FGSM, PGD, Madry&#x27;s saddle-point formulation, the robust-features view, and the accuracy trade-off.</p><p><img src="https://rifaki.me/notes/img/adversarial-fgsm.png" alt="Adversarial examples"/></p>]]></description>
      <content:encoded><![CDATA[<p>Adversarial examples initially seemed an oddity. Szegedy et al. [<a href="https://arxiv.org/abs/1312.6199">1</a>] demonstrated that a minuscule perturbation, meaningless to human eyes, could confidently flip a neural net's prediction. My first instinct on reading it was to blame weird nonlinearities or overfitting. It turned out to be neither. Goodfellow et al. [<a href="https://arxiv.org/abs/1412.6572">2</a>] gave a much simpler explanation. A linear model with weight w ∈ ℝ<sup>d</sup> will have logit shift w<sup>⊤</sup> δ under perturbation δ, and the worst-case ℓ<sub>∞</sub> behavior inside of ‖δ‖<sub>∞</sub> ≤ ε is ε ‖w‖<sub>1</sub>, growing like εd for dense weights, a tiny per-pixel perturbation accumulated across a high-dimensional input.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/adversarial-fgsm.png" alt="Figure 1 of Goodfellow, Shlens, Szegedy 2015. Panda image plus epsilon times sign of the gradient produces an imperceptible perturbation that flips the classifier's prediction to gibbon.">
          <figcaption>Figure 1 of Goodfellow et al. [<a href="https://arxiv.org/abs/1412.6572">2</a>]. The perturbation is small under the threat model, but it is aligned with the classifier's loss gradient.</figcaption>
        </figure>

        <p>In a thousand-dimensional input space, even an imperceptible δ can produce a logit shift that flips the prediction. Adversarial perturbations are then not a symptom of nonlinear extrema; they are a generic property of high-dimensional linear decision rules. FGSM is a one-step linearized attack on the inner maximum max<sub>‖δ‖<sub>∞</sub> ≤ ε</sub> L(θ, x+δ, y). The fact that one step works at all is the diagnostic: the model is sensitive to a direction the data distribution does not mark as human-meaningful. Carlini and Wagner later showed that more carefully tuned attack objectives produce much smaller-norm perturbations than FGSM, breaking many defenses whose only evaluation had been against single-step attacks.</p>
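        <p>A minimal FGSM sketch in PyTorch; <code>model</code> and <code>loss_fn</code> are stand-ins for the classifier and loss under attack, and inputs are assumed to live in [0, 1]:</p>

        <pre><code>import torch

def fgsm(model, loss_fn, x, y, eps):
    """One-step linearized attack: move each coordinate by eps along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()
</code></pre>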

        <p>Madry et al. wrote down the saddle-point formulation: define a defense by the worst-case inner-max loss it can resist within a fixed threat model, and treat any defense that fails a stronger attack inside that threat model as broken. The split separates two questions that were previously tangled together: the inner problem defines what the attacker can do, and the outer problem defines what the model has to optimize against. Projected gradient descent then becomes both the canonical attack and the canonical training procedure. It is not perfect, but it made robustness measurable enough that defenses could be compared honestly. Many proposed defenses then turned out not to be robust; they merely defeated weak attacks. Athalye, Carlini, and Wagner cataloged the failure modes under one heading - obfuscated gradients. Stochastic preprocessing, non-differentiable transforms, exploding or vanishing gradients, and gradient shattering each make gradient-based attacks fail without producing classifiers that survive a stronger attack.</p>
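        <p>PGD iterates the same linearized step and projects back onto the ℓ<sub>∞</sub> ball after each move; a sketch under the same assumptions as above (inputs in [0, 1], <code>model</code> and <code>loss_fn</code> hypothetical):</p>

        <pre><code>import torch

def pgd_attack(model, loss_fn, x, y, eps, alpha, steps):
    """Inner maximization of the saddle-point objective by projected gradient ascent on the loss."""
    x_adv = (x + eps * (2 * torch.rand_like(x) - 1)).clamp(0.0, 1.0)   # random start in the ball
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)          # project onto the l_inf ball
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
</code></pre>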

        <figure>
          <img src="https://rifaki.me/notes/img/adversarial-minmax-loop.png" alt="Figure 1 of Madry et al. 2018. PGD attack-loss curves over inner-maximization iterations on standard- and adversarially-trained MNIST and CIFAR10 networks: the standard models reach high attack loss easily, the robust models cap the attainable loss.">
          <figcaption>Figure 1 of Madry et al. [<a href="https://arxiv.org/abs/1706.06083">3</a>]. PGD finds many high-loss perturbations on standard networks; on the adversarially-trained networks it caps out near a small bounded value.</figcaption>
        </figure>

        <p>Athalye, Carlini, and Wagner broke six of the nine ICLR-2018 defenses completely and a seventh partially, just by replacing the attack with a stronger one inside the same threat model. AutoAttack later turned that lesson into a parameter-free ensemble: a single attack you can run against a defense without per-defense tuning, which exposes inflated robustness numbers automatically. The other branch of progress is certified rather than empirical robustness. Cohen, Rosenfeld, and Kolter produce randomized-smoothing certificates: convolve the classifier with isotropic Gaussian noise of variance σ<sup>2</sup> and the smoothed classifier g(x) = argmax<sub>c</sub> ℙ<sub>η ∼ 𝒩(0,σ<sup>2</sup> I)</sub>[f(x+η) = c] is provably robust within an ℓ<sub>2</sub> ball of radius σΦ<sup>-1</sup>(p<sub>A</sub>), where p<sub>A</sub> is the lower confidence bound on the top-class probability.</p>
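        <p>A Monte-Carlo sketch of the smoothed prediction and the certificate radius, simplified relative to the paper's two-stage procedure (separate selection and estimation samples plus a one-sided confidence bound on p<sub>A</sub>); <code>classifier</code> is assumed to map a single input to class logits:</p>

        <pre><code>import torch
from scipy.stats import norm

def smoothed_predict(classifier, x, sigma, n_samples, num_classes):
    """Majority vote under Gaussian noise, plus the l2 radius sigma * Phi^{-1}(p_A)."""
    counts = torch.zeros(num_classes)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)
        counts[int(classifier(noisy).argmax())] += 1
    top_class = int(counts.argmax())
    p_hat = counts[top_class].item() / n_samples
    # The paper replaces p_hat with a one-sided lower confidence bound; the plug-in
    # estimate is used here only to keep the sketch short.
    radius = sigma * norm.ppf(min(p_hat, 1 - 1e-6)) if p_hat > 0.5 else 0.0
    return top_class, radius
</code></pre>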

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">The definition of &quot;adversarial examples&quot; I prefer these days is &quot;Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake&quot; <a href="https://t.co/GiXiQBCp5L">https://t.co/GiXiQBCp5L</a></p>&mdash; Ian Goodfellow (@goodfellow_ian) <a href="https://twitter.com/goodfellow_ian/status/984518755546906624?ref_src=twsrc%5Etfw">April 12, 2018</a></blockquote>
        </figure>

        <p>The certificate is provable. The cost is that the radius is small in practice and only natural in ℓ<sub>2</sub>. For a less paper-indexed entry point, the Gradient Science adversarial robustness page is still one of the better maps of attacks, defenses, and the evaluation traps. RobustBench is the practical scoreboard, once the question becomes: does this defense survive standard attacks?</p>

        <h2>The accuracy trade-off</h2>

        <p>Tsipras et al. [<a href="https://arxiv.org/abs/1805.12152">4</a>] made the uncomfortable point that robustness can conflict with standard accuracy. In their construction, the standard classifier uses weak but highly predictive features that are not robust. The robust classifier has to ignore them and therefore loses ordinary accuracy. The empirical version of this trade-off is more complicated, but the conceptual point survived: robustness is not just accuracy with additional caution.</p>

        <p>A robust classifier may have to learn different features altogether. Ilyas et al. [<a href="https://arxiv.org/abs/1905.02175">5</a>] put it bluntly: adversarial examples are not bugs, they are features. Their claim was not that every attack direction is semantically relevant to humans. It was that standard datasets contain predictive signal that models can use and that humans do not recognize as robust evidence. Adversarial perturbations take advantage of those signals. Robust training suppresses them. Schmidt et al. [<a href="https://arxiv.org/abs/1804.11285">6</a>] quantified the price in statistical terms: the sample complexity of robust learning can be polynomially bigger than the sample complexity of standard learning, an information-theoretic gap that holds irrespective of the training algorithm or the model family. In their Gaussian-mixture model, standard generalization needs only constant sample complexity while robust generalization at ℓ<sub>∞</sub> radius ε requires a polynomial-in-d number of samples.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/adversarial-feature-taxonomy.png" alt="Figure 1 of Ilyas et al. 2019. Robust versus non-robust feature decomposition: standard models exploit non-robust features that are predictive but human-imperceptible.">
          <figcaption>Figure 1 of Ilyas et al. [<a href="https://arxiv.org/abs/1905.02175">5</a>]. A feature can be genuinely predictive and still fail the invariance demanded by the threat model.</figcaption>
        </figure>

        <p>The gap comes from the structure of the learning problem, not the algorithm. Robustness is paying for invariance, and invariance costs in terms of sample complexity. Engstrom et al. then ran the experiment in the other direction: representations from robust classifiers transfer better than standard ones on a range of downstream tasks, look more semantically aligned in feature visualization, and yield gradients that resemble human-perceptible objects. Robustness, in this reading, is also a representation-learning prior. Whether the prior helps or hurts depends on the downstream task, but it is not free of structure.</p>

        <p>The adversarial-examples literature forced a distinction between predictive validity and human-aligned validity. A feature can be statistically real, useful for test accuracy, and still unacceptable under a robustness constraint. That is a deeper issue than security: the supervised-learning objective does not fully specify the invariances I care about. To the extent that human perception is itself a strong inductive bias, models that do more representation learning end up closer to my geometry; robust models produce gradients and saliency maps that line up with what a person would identify as the object. In medical imaging, robotics, or safety-critical perception that trade-off is worth the cost. In low-stakes classification it often is not.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/adversarial-robust-optimization.png" alt="Adversarial-training loss curves of Madry et al. 2018: PGD-adversarial training loss decays from the initial saddle-point value over 100k MNIST iterations and 75k CIFAR10 iterations.">
          <figcaption>Adversarial-training convergence in Madry et al. [<a href="https://arxiv.org/abs/1706.06083">3</a>]. The robust optimization objective is solvable: the inner-max-then-outer-min loss curve descends to a bounded plateau under PGD adversarial training.</figcaption>
        </figure>

        <p>The "non-robust feature" label renames an older statistical fact. Predictive validity in distribution is not causal structure, and a model that maximizes the former will exploit signals the latter does not endorse. Adversarial training folds a robustness constraint into the objective. The cleaner long-term fix is on the data side: collect or augment so that the equivalence classes the human cares about are the equivalence classes the dataset enforces.</p>
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/2003.01690">F. Croce and M. Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. arxiv 2003.01690, 2020</a></li>
          <li><a href="https://arxiv.org/abs/1608.04644">N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. arxiv 1608.04644, 2016</a></li>
          <li><a href="https://arxiv.org/abs/1802.00420">A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arxiv 1802.00420, 2018</a></li>
          <li><a href="https://arxiv.org/abs/1902.02918">J. M. Cohen, E. Rosenfeld, and J. Z. Kolter. Certified adversarial robustness via randomized smoothing. arxiv 1902.02918, 2019</a></li>
          <li><a href="https://arxiv.org/abs/1906.00945">L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, B. Tran, and A. Madry. Adversarial robustness as a prior for learned representations. arxiv 1906.00945, 2019</a></li>
        </ul>

        

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/1312.6199">1</a>] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arxiv 1312.6199, 2013.</li>
          <li>[<a href="https://arxiv.org/abs/1412.6572">2</a>] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arxiv 1412.6572, 2014.</li>
          <li>[<a href="https://arxiv.org/abs/1706.06083">3</a>] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arxiv 1706.06083, 2017.</li>
          <li>[<a href="https://arxiv.org/abs/1805.12152">4</a>] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. arxiv 1805.12152, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1905.02175">5</a>] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features. arxiv 1905.02175, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/1804.11285">6</a>] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. arxiv 1804.11285, 2018.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/adversarial-fgsm.png"/>
    </item>
    <item>
      <title>Mode connectivity</title>
      <link>https://rifaki.me/notes/mode-connectivity/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/mode-connectivity/</guid>
      <pubDate>Tue, 04 Nov 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Garipov and Draxler&#x27;s curved low-loss paths, the permutation turn, linear mode connectivity, and what the geometry does and does not prove.</p><p><img src="https://rifaki.me/notes/img/mode-connectivity-paths.png" alt="Mode connectivity"/></p>]]></description>
      <content:encoded><![CDATA[<p>The old picture of the loss surface as many isolated basins, one per initialization, has not held up. Freeman and Bruna [<a href="https://arxiv.org/abs/1611.01540">6</a>] suggested early on that low-loss level sets stay connected. Garipov et al. [<a href="https://arxiv.org/abs/1802.10026">1</a>] and Draxler et al. [<a href="https://arxiv.org/abs/1803.00885">2</a>] then made it concrete: two independently trained models can be joined by a smooth low-loss path. The minima are not isolated points; they are reachable from each other through a connected high-dimensional region.</p>

 <p>The experiment is direct. Train two models θ<sub>1</sub> and θ<sub>2</sub> to low loss, and interpolate linearly between them to define a one-parameter family <img src="https://rifaki.me/notes/img/math/a7e21049c68bd0c3.svg" alt="$$\theta(\alpha) = (1-\alpha)\,\theta_1 + \alpha\,\theta_2$$" class="math-display" width="211" height="20"/> for α ∈ [0,1]. The straight segment usually crosses a high-loss barrier <img src="https://rifaki.me/notes/img/math/29403b3b89c65182.svg" alt="$$B = \max_{\alpha \in [0,1]} L(\theta(\alpha)) - \tfrac{1}{2}\big(L(\theta_1) + L(\theta_2)\big).$$" class="math-display" width="346" height="38"/> In other words, the loss jumps up substantially as soon as one steps off either endpoint, and the peak in the middle is typically far above either endpoint loss. Garipov and Draxler's contribution was to show that this barrier exists only along the straight line: if you allow curved paths, you can connect θ<sub>1</sub> and θ<sub>2</sub> with a path that stays at low loss throughout. The endpoints are not separated by an insurmountable barrier; the straight line is just the wrong path through parameter space.</p>
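 <p>A sketch of the measurement, with <code>loss_on_data</code> standing in for an evaluation of the training loss at a given (flattened) parameter vector:</p>

 <pre><code>import numpy as np

def linear_barrier(loss_on_data, theta1, theta2, n_points=25):
    """Height of the loss barrier along the straight segment between two trained models."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path_losses = np.array([loss_on_data((1 - a) * theta1 + a * theta2) for a in alphas])
    return path_losses.max() - 0.5 * (path_losses[0] + path_losses[-1])
</code></pre>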

 <h2>Curves before lines</h2>

 <p>The first demonstrations of mode connectivity used non-linear paths. Garipov et al. parametrized the path as polygonal chains and Bézier curves; Draxler et al. used continuous non-linear paths produced by the Nudged Elastic Band method. Both findings weakened the previous picture in which SGD ends up in sharply-separated basins: if a low-loss path exists between two trained models, the connected low-loss region they both lie in is substantially larger than what a local Hessian analysis at either endpoint would suggest.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/mode-connectivity-paths.png" alt="Figure 1 of Garipov et al. 2018. Loss-landscape view of a low-loss curve connecting two independently trained SGD minima, while the straight segment between them rises through a high-loss barrier.">
 <figcaption>Figure 1 of Garipov et al. [<a href="https://arxiv.org/abs/1802.10026">1</a>]. The straight segment between two minima crosses a barrier; the learned curve stays in low-loss territory.</figcaption>
 </figure>

 <p>Linear mode connectivity is strictly stronger than the curve-based version: it asks the loss to stay low along the straight segment between two minima, not along an arbitrary path. For independently-trained networks from distinct initializations this typically fails. The linearly connected case is essentially confined to one setup. Frankle, Dziugaite, Roy, and Carbin [<a href="https://arxiv.org/abs/1912.05671">3</a>] formalized it as "spawning": train a model θ<sub>0</sub>, fork two copies after k steps using different SGD noise, and train each to convergence. Once k crosses a stability threshold (around 1000-2000 iterations on standard CIFAR networks, and a few percent of training on ImageNet), the two descendants end up linearly connected. They argue this is exactly the lottery-ticket basin: the connected region is the one that the matching sparse sub-network corresponds to, so linear-mode-connectivity becomes a practical test for whether two runs landed in the same effective basin and can therefore be merged without loss.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/mode-connectivity-spawning.png" alt="Figure 3 of Frankle, Dziugaite, Roy, Carbin 2020. Linear interpolation curves for spawn-then-fork SGD pairs at varying late-rewinding points: late enough forks remain linearly connected.">
 <figcaption>Figure 3 of Frankle et al. [<a href="https://arxiv.org/abs/1912.05671">3</a>]. Fork late enough and the two descendants stay linearly connected.</figcaption>
 </figure>

 <h2>The permutation turn</h2>

 <p>Entezari et al. [<a href="https://arxiv.org/abs/2110.06296">4</a>] reframed the geometry. Neural networks have permutation symmetries: swapping units in a hidden layer and unswapping them in the next layer leaves the function unchanged. Two models that look far apart in raw parameter coordinates might just be using different unit orderings of the same function. Quotient by those permutations and many independently trained models turn out to be connected by a simple low-loss curve.</p>

 <p>Git Re-Basin turns the idea into a concrete algorithm. Given two trained models θ<sub>1</sub>, θ<sub>2</sub> with hidden-layer widths {n<sub>ℓ</sub>}, it searches over permutation matrices P<sub>ℓ</sub> ∈ S<sub>n<sub>ℓ</sub></sub> for the alignment that minimizes ‖ θ<sub>1</sub> - P · θ<sub>2</sub> ‖ under a weights or activations metric, then checks whether the aligned models can be merged in weight space. Singh and Jaggi give an optimal-transport version of the same idea: a soft assignment between units that reduces to permutation matching when the widths agree, and that handles mismatched widths when they don't. Neither paper proves that all minima sit in one basin, but together they make a strong empirical case that raw parameter-space interpolation overstates the separation. A meaningful fraction of the apparent barrier is just a bad coordinate system.</p>
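 <p>A sketch of the weight-matching step for a single hidden layer, using the Hungarian algorithm from scipy; a real network needs the per-layer permutations solved jointly (Git Re-Basin alternates over layers), so this is only the one-layer inner step, and the variable names are illustrative:</p>

 <pre><code>import numpy as np
from scipy.optimize import linear_sum_assignment

def match_one_layer(w1_in, w2_in, w1_out, w2_out):
    """Permutation of model 2's hidden units that best aligns them with model 1's.

    w*_in  : (n_hidden, n_in) incoming weights of the layer
    w*_out : (n_out, n_hidden) outgoing weights of the next layer
    """
    # Similarity between hidden unit i of model 1 and unit j of model 2,
    # measured on both incoming and outgoing weights.
    cost = -(w1_in @ w2_in.T + w1_out.T @ w2_out)
    _, perm = linear_sum_assignment(cost)
    return perm  # apply as w2_in[perm] and w2_out[:, perm] before interpolating
</code></pre>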

 <figure class="tweet-embed">
 <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Say you train Model A. <br><br>Independently, your friend trains Model B, possibly on different data. <br><br>With Git Re-Basin, you can merge models A+B in weight space at _no cost to the loss_</p>&mdash; Samuel &quot;curry-howard fanboi&quot; Ainsworth (@SamuelAinsworth) <a href="https://twitter.com/SamuelAinsworth/status/1569719499263471616?ref_src=twsrc%5Etfw">September 13, 2022</a></blockquote>
 </figure>

 <figure>
 <img src="https://rifaki.me/notes/img/mode-connectivity-simplex.png" alt="Figure 1 of Ainsworth, Hayase, Srinivasa 2023 (Git Re-Basin). Linear-interpolation barrier between two independently trained networks before and after permutation alignment.">
 <figcaption>Figure 1 of Ainsworth et al. [<a href="https://arxiv.org/abs/2209.04836">5</a>]. Aligning hidden units by permutation collapses most of the apparent linear-interpolation barrier.</figcaption>
 </figure>

 <h2>Empirical evidence ties together alignment and merging</h2>

 <p>Tatro et al. [<a href="https://arxiv.org/abs/2009.02439">7</a>] showed empirically that aligning models before fitting a connecting curve produces shorter curves with lower loss along them, which is the consistency check the permutation story predicts: correcting for symmetries should give a simpler geometry than the raw view. Benton et al. [<a href="https://arxiv.org/abs/2102.13042">8</a>] extended this beyond curves to higher-dimensional simplexes of solutions: once symmetries are corrected, low-loss volumes contain many independently trained checkpoints.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/mode-connectivity-rebasin.png" alt="Figure 2 of Ainsworth, Hayase, Srinivasa 2023. Linear interpolation barriers between two independently trained networks across MNIST/CIFAR-10/ImageNet under naive, activation-matching, weight-matching, and STE-matching alignment schemes.">
 <figcaption>Figure 2 of Ainsworth et al. [<a href="https://arxiv.org/abs/2209.04836">5</a>]. Aligning hidden units before interpolation collapses most of the apparent barrier across architectures and datasets.</figcaption>
 </figure>

 <h2>Practical applications</h2>

 <p>Model merging is what gets built on top. SWA averages late-training checkpoints and works because the trajectory it averages over stays inside one connected low-loss region. Model Soups average independently fine-tuned models from a shared pretraining initialization, which puts every fine-tune inside the same connected component and close to the others. Git Re-Basin generalizes this further by aligning unit permutations so models with no shared initialization can be merged at all. Across architectures, weight space behaves like a shared workspace in which related models can be merged without losing function, once symmetries and shared training histories have been accounted for.</p>
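 <p>The merge itself is nothing more than a parameter-wise mean; a sketch over PyTorch state dicts, assuming the checkpoints share an architecture (and, for soups, a pretraining initialization):</p>

 <pre><code>import torch

def average_state_dicts(state_dicts):
    """Uniform weight-space average of checkpoints, the merge behind SWA and model soups."""
    avg = {}
    for key, ref in state_dicts[0].items():
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0).to(ref.dtype)
    # Batch-norm running statistics are usually re-estimated with a forward pass
    # over training data after averaging.
    return avg
</code></pre>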

 <p>Mode connectivity does not explain generalization. A connected low-training-loss region can contain many bad predictors on held-out data, and showing that two solutions are connected says nothing about how either performs on unseen examples. It also does not guarantee that every architecture, dataset, or training recipe lives in a single basin; the broader single-basin claims have counter-examples. I read "one wide basin" as rhetorically appealing but over-reaching the evidence. The established claim is weaker: solutions reachable from a fixed initialization, or from independent runs once permutations are aligned, lie in a single connected low-loss region. That is enough to explain why SWA, model soups, and weight-space ensembling work. I would not push the geometry harder than that.</p>
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/1910.05653">S. P. Singh and M. Jaggi. Model fusion via optimal transport. arxiv 1910.05653, 2020</a></li>
          <li><a href="https://arxiv.org/abs/1803.05407">P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads to wider optima and better generalization. arxiv 1803.05407, 2018</a></li>
          <li><a href="https://arxiv.org/abs/2203.05482">M. Wortsman et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. arxiv 2203.05482, 2022</a></li>
        </ul>

        

<h2>References</h2>

 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1802.10026">1</a>] T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, and A. G. Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. arxiv 1802.10026, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1803.00885">2</a>] F. Draxler, K. Veschgini, M. Salmhofer, and F. A. Hamprecht. Essentially no barriers in neural network energy landscape. arxiv 1803.00885, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1912.05671">3</a>] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Linear mode connectivity and the lottery ticket hypothesis. arxiv 1912.05671, 2019.</li>
 <li>[<a href="https://arxiv.org/abs/2110.06296">4</a>] R. Entezari, H. Sedghi, O. Saukh, and B. Neyshabur. The role of permutation invariance in linear mode connectivity of neural networks. arxiv 2110.06296, 2021.</li>
 <li>[<a href="https://arxiv.org/abs/2209.04836">5</a>] S. K. Ainsworth, J. Hayase, and S. Srinivasa. Git Re-Basin: merging models modulo permutation symmetries. arxiv 2209.04836, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/1611.01540">6</a>] C. D. Freeman and J. Bruna. Topology and geometry of half-rectified network optimization. arxiv 1611.01540, 2016.</li>
 <li>[<a href="https://arxiv.org/abs/2009.02439">7</a>] N. Tatro, P.-Y. Chen, P. Das, I. Melnyk, P. Sattigeri, and R. Lai. Optimizing mode connectivity via neuron alignment. arxiv 2009.02439, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/2102.13042">8</a>] G. Benton, W. J. Maddox, S. Lotfi, and A. G. Wilson. Loss surface simplexes for mode connecting volumes and fast ensembling. arxiv 2102.13042, 2021.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/mode-connectivity-paths.png"/>
    </item>
    <item>
      <title>The neural tangent kernel</title>
      <link>https://rifaki.me/notes/neural-tangent-kernel/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/neural-tangent-kernel/</guid>
      <pubDate>Sat, 09 Aug 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Jacot&#x27;s infinite-width limit, the lazy-training regime, and what the NTK explains and what feature learning leaves on the table.</p><p><img src="https://rifaki.me/notes/img/ntk-linearization.png" alt="The neural tangent kernel"/></p>]]></description>
      <content:encoded><![CDATA[<p>The neural tangent kernel was one of the few deep-learning theory ideas that were useful before they became a concept. It doesn't solve generalization, but it makes a very stubborn object analyzable. Jacot, Gabriel, and Hongler [<a href="https://arxiv.org/abs/1806.07572">1</a>] found that if you take a network to infinite width under the right scaling, gradient descent on the parameters becomes kernel gradient descent in function space. The kernel isn't chosen by hand, it's induced by the network at initialization. For a model f<sub>θ</sub>, the tangent kernel is K<sub>θ</sub>(x, x') = ∇<sub>θ</sub> f<sub>θ</sub>(x)<sup>⊤</sup> ∇<sub>θ</sub> f<sub>θ</sub>(x'). In finite networks this kernel changes as you train. In the infinite width limit under the standard parameterization, it converges to some deterministic kernel K<sub>∞</sub> and ‖K<sub>θ<sub>t</sub></sub> - K<sub>∞</sub>‖ approaches O(1/√n) in width n.</p>

        <p>Function-space gradient descent then satisfies ḟ<sub>t</sub> = -K<sub>∞</sub> (f<sub>t</sub> - y) on the training set and integrates to f<sub>t</sub> = y + e<sup>-K<sub>∞</sub> t</sup>(f<sub>0</sub> - y). Parameter-space non-convexity stops mattering: the function-space dynamics are linear and driven by a positive semidefinite kernel.</p>

        <h2>NTK as baseline</h2>

        <p>The first thing the NTK explained is why very wide networks optimize so easily. If the kernel is well conditioned on the training data, gradient descent has an easy road to interpolation. The complicated nonconvex path is, at leading order, kernel regression with some specific architecture induced kernel. Du et al. [<a href="https://arxiv.org/abs/1810.02054">4</a>] and Lee et al. [<a href="https://arxiv.org/abs/1902.06720">2</a>] pushed this picture further and showed that wide networks of any depth evolve like their first-order Taylor expansion around initialization.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/ntk-linearization.png" alt="Figure 2 of Lee et al. 2019. Predictions from the linearized infinite-width model match the trajectory of the actual wide finite network during gradient-descent training.">
          <figcaption>Figure 2 of Lee et al. [<a href="https://arxiv.org/abs/1902.06720">2</a>]. The linearization at initialization tracks the wide-network trajectory closely under gradient descent.</figcaption>
        </figure>

        <p>Du et al. extend the same machinery to a clean proof that overparameterized networks reach zero training loss with a polynomial-width requirement and a global-convergence guarantee that the nonconvex landscape never gave you. But the spectrum of K<sub>∞</sub> does more than set the speed of convergence. Its eigendecomposition K<sub>∞</sub> = ∑<sub>k</sub> λ<sub>k</sub> φ<sub>k</sub> φ<sub>k</sub><sup>⊤</sup> implies that the residual along eigenmode φ<sub>k</sub> shrinks like e<sup>-λ<sub>k</sub> t</sup>, so large-eigenvalue modes are learned quickly and small-eigenvalue modes slowly, or not at all under early stopping. Early stopping, the frequency principle, and the spectral bias of MLPs all become statements about {λ<sub>k</sub>}.</p>
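        <p>The spectral claim is easy to check numerically: diagonalize the kernel Gram matrix and watch the residual projected onto each eigenmode decay at its own exponential rate. A small numpy sketch of the linearized function-space dynamics:</p>

        <pre><code>import numpy as np

def ntk_mode_residuals(K, f0, y, times):
    """Per-eigenmode residuals of f_t = y + exp(-K t)(f_0 - y) for a PSD kernel Gram matrix K."""
    eigvals, eigvecs = np.linalg.eigh(K)
    r0 = eigvecs.T @ (f0 - y)                 # initial residual in the eigenbasis
    # The residual along mode k decays as exp(-lambda_k * t): fast modes first, slow modes last.
    return np.array([r0 * np.exp(-eigvals * t) for t in times])
</code></pre>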

        <figure>
          <img src="https://rifaki.me/notes/img/ntk-spectrum.png" alt="Figure 1 of Cao et al. 2019 (spectral bias of deep learning). Projection lengths along the lowest few eigenmodes of the NTK as a function of training step: low-frequency (small k) components are fit much faster than higher-frequency components.">
          <figcaption>Figure 1 of Cao et al. [<a href="https://arxiv.org/abs/1912.01198">arxiv 1912.01198</a>]. Low-frequency components of the target are absorbed by the network long before higher-frequency components, in the order predicted by the NTK spectrum.</figcaption>
        </figure>

        <p>Arora et al. write down an exact algorithm for computing K<sub>∞</sub> for fully connected and convolutional networks of arbitrary depth, making these spectral predictions empirically testable on real datasets.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/ntk-spectrum-timeline.png" alt="Figure 2 of Bordelon, Canatar, Pehlevan 2020. Spectrum-dependent generalization-error scaling: per-mode learning curves $E_k(p)/E_k(0)$ versus number of training samples for varying eigenmode index, input dimension, and depth, all approaching the predicted $1/p^\alpha$ envelope.">
          <figcaption>Figure 2 of Bordelon et al. [<a href="https://arxiv.org/abs/2002.02561">arxiv 2002.02561</a>]. Generalization on each NTK eigenmode follows a spectrum-dependent power law in sample count.</figcaption>
        </figure>

        <h2>The lazy-training caveat</h2>

        <p>The catch is that the same condition that makes the theory clean also removes one of the main things deep networks seem to be doing. In the NTK limit, features do not move. Parameters drift by ‖θ<sub>t</sub> - θ<sub>0</sub>‖ = O(1/√n) in width n while the function changes by O(1), so the network is producing its outputs by reweighting an almost fixed collection of random features. Chizat, Oyallon, and Bach call this the lazy-training regime and emphasize that it is a property of the scaling, not a universal description of neural networks.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/ntk-lazy-vs-feature-learning.png" alt="Figure 1 of Chizat, Oyallon, Bach 2019. Lazy regime versus feature-learning regime trajectories on a 2D classification problem: the lazy regime stays near initialization while feature learning moves substantially.">
          <figcaption>Figure 1 of Chizat, Oyallon, and Bach [<a href="https://arxiv.org/abs/1812.07956">3</a>]. The NTK limit is powerful because it freezes feature movement; that is also what it cannot explain.</figcaption>
        </figure>

        <p>That caveat matters; a convolutional network trained in the lazy regime can optimize while failing to learn the representations that make convolutional networks useful. A transformer that looks like a fixed random-feature model is not the object that in-context learning, induction-head formation, and abstraction discussions are pointing at. The NTK gives a rigorous theory of one limit. The question is whether that limit keeps the right phenomena. Geiger et al. [<a href="https://arxiv.org/abs/1906.08034">5</a>] report a sharp empirical separation: at moderate width and standard initialization scale, networks operate near the lazy regime; at lower initialization scale (or with explicit feature-learning parameterizations) the same architecture enters a regime where features evolve and test error improves.</p>

        <p>The transition is controlled by initialization scale and width, not by anything intrinsic to the architecture. That is a slightly disappointing answer if you wanted neural networks to be feature learners by default.</p>

        <p>The NTK is not wrong; it is a baseline. A phenomenon that already appears in the NTK limit can be attributed to width, interpolation, and fixed random features, with no representation learning needed. A phenomenon that disappears in the NTK limit is the work of feature learning, finite-width fluctuation, architecture-specific structure, or nonlinearity in the optimization. That makes the NTK a useful negative control for theoretical claims about deep learning, and the simpler question to put to such a claim is: would it still hold if the features were frozen? For most optimization claims, yes. For most generalization and capability claims, no. The Distill circuits thread is the non-theorem-shaped version of what feature learning looks like when someone manages to pry a model open. Greg Yang's μP writeup is the practical entry into Tensor Programs, and the microsoft/mup repo is what to grab when the goal is hyperparameter transfer and not the theory.</p>

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Excited to share our new <a href="https://twitter.com/hashtag/neurips2020?src=hash&amp;ref_src=twsrc%5Etfw">#neurips2020</a> paper /Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel/ (<a href="https://t.co/4jmrNfOE6H">https://t.co/4jmrNfOE6H</a>) with @KDziugaite, Mansheej, <a href="https://twitter.com/SKharaghani?ref_src=twsrc%5Etfw">@SKharaghani</a>, <a href="https://twitter.com/roydanroy?ref_src=twsrc%5Etfw">@roydanroy</a>, <a href="https://twitter.com/SuryaGanguli?ref_src=twsrc%5Etfw">@SuryaGanguli</a> 1/6 <a href="https://t.co/iPPP3HmNgm">pic.twitter.com/iPPP3HmNgm</a></p>&mdash; Stanislav Fort (@stanislavfort) <a href="https://twitter.com/stanislavfort/status/1322246600320757760?ref_src=twsrc%5Etfw">October 30, 2020</a></blockquote>
        </figure>

        <h2>Feature learning is the missing term</h2>

        <p>The frontier after the NTK was to build infinite-width limits in which features actually move. Mean-field limits treat each unit as a particle in a measure and study gradient flow on that measure; in this scaling features evolve and the kernel is no longer constant. Tensor-program analyses catalogue the parameterizations that produce sensible infinite-width limits at all. The maximal-update parameterization μP, introduced by Yang and Hu in Tensor Programs IV (<a href="https://arxiv.org/abs/2011.14522">arxiv 2011.14522</a>), is the one that keeps both feature learning and stable optimization in the limit. The follow-up Tensor Programs V (<a href="https://arxiv.org/abs/2203.03466">arxiv 2203.03466</a>) derives the μTransfer hyperparameter-transfer rules from that analysis: tune at small width, scale to large width, and the learning-rate schedule transfers without retuning.</p>

        <p>Feature learning means the tangent kernel is moving substantively: K<sub>θ<sub>t</sub></sub> - K<sub>θ<sub>0</sub></sub> is a structured rotation of the features toward the data, not a small perturbation. The parameter-gradients at the end of training are not the same object as at initialization, and the network has in effect changed the basis it works in. That change is exactly what the pure NTK limit suppresses. Fort et al. ran one of the clearest empirical comparisons: kernel learning matches a finite network early in training but the two diverge later, and the divergence is the gap between lazy convergence to a fixed kernel and feature-driven re-shaping of it.</p>

        <p>I use the NTK as a falsifier, not a model. If a proposed mechanism for a deep-learning phenomenon is already trivially in the NTK regime, then "the network is wide and the features are random" suffices, and the explanation has not earned the depth of its hypothesis. The interesting predictions are the ones that disagree with the kernel: where width, lazy init, and architecture-induced spectra are not enough, and where representation change must be doing the work. That includes in-context learning, induction-head formation, and the parts of scaling laws that depend on where compute is spent.</p>
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/1904.11955">S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. arxiv 1904.11955, 2019</a></li>
          <li><a href="https://arxiv.org/abs/1812.11118">M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. arxiv 1812.11118, 2018</a></li>
          <li><a href="https://arxiv.org/abs/1804.06561">S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. arxiv 1804.06561, 2018</a></li>
          <li><a href="https://arxiv.org/abs/2011.14522">G. Yang and E. J. Hu. Tensor Programs IV: feature learning in infinite-width neural networks. arxiv 2011.14522, 2020</a></li>
          <li><a href="https://arxiv.org/abs/1912.01198">Y. Cao, Z. Fang, Y. Wu, D.-X. Zhou, and Q. Gu. Towards understanding the spectral bias of deep learning. arxiv 1912.01198, 2019</a></li>
          <li><a href="https://arxiv.org/abs/2002.02561">B. Bordelon, A. Canatar, and C. Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. arxiv 2002.02561, 2020</a></li>
        </ul>

        

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/1806.07572">1</a>] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: convergence and generalization in neural networks. arxiv 1806.07572, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1902.06720">2</a>] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arxiv 1902.06720, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/1812.07956">3</a>] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. arxiv 1812.07956, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1810.02054">4</a>] S. S. Du, X. Zhai, B. P&oacute;czos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arxiv 1810.02054, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1906.08034">5</a>] M. Geiger, S. Spigler, A. Jacot, and M. Wyart. Disentangling feature and lazy training in deep neural networks. arxiv 1906.08034, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/2010.15110">6</a>] S. Fort, G. K. Dziugaite, M. Paul, S. Kharaghani, D. M. Roy, and S. Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. arxiv 2010.15110, 2020.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/ntk-linearization.png"/>
    </item>
    <item>
      <title>The implicit-bias program</title>
      <link>https://rifaki.me/notes/implicit-bias/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/implicit-bias/</guid>
      <pubDate>Wed, 14 May 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Soudry&#x27;s max-margin result for linear models, the geometry of optimizer choice, and how cleanly the linear case does and does not transfer to deep networks.</p><p><img src="https://rifaki.me/notes/img/implicit-bias-margin.png" alt="The implicit-bias program"/></p>]]></description>
      <content:encoded><![CDATA[<p>Why does training a model without an explicit regularizer, with the loss driven nearly to zero, still produce a solution that generalizes? The classical answer is that the objective has to carry the regularizer somewhere. The implicit-bias answer is more subtle: even when the objective has many minima, gradient descent does not choose among them neutrally; the algorithm itself selects a particular kind of solution.</p>

 <p>Zhang et al. [<a href="https://arxiv.org/abs/1611.03530">2</a>] sharpened the puzzle that most of this literature now opens with. Standard image classifiers fit random labels as easily as they fit true ones, which kills the simplest capacity-based explanation of generalization: the hypothesis class is large enough to memorize anything. Whatever is producing the generalization, then, must come from the optimizer, the data, or the parameterization, not from the loss function itself.</p>

 <p>The clearest version of this story is not actually about neural networks. It is about logistic regression on linearly separable data. Once the classifier separates the data, the empirical classification error is already 0% and the logistic loss continues to decrease as the norm of the weights grows. There is no finite minimizer. What is more surprising is that the direction of the weights still converges. And it converges to the maximum-margin SVM solution.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/implicit-bias-margin.png" alt="Figure 1 of Soudry et al. 2018. Five-panel layout: (A) 2D separable data with the converged separator, (B) normalized weight norm growing logarithmically, (C) logistic loss decaying, (D) angle gap to max-margin direction shrinking, (E) margin gap closing.">
 <figcaption>Figure 1 of Soudry et al. [<a href="https://arxiv.org/abs/1710.10345">1</a>]. The loss has no finite minimizer on separable data, the norm grows without bound, and the normalized direction converges to the hard-margin separator.</figcaption>
 </figure>

 <h2>The linear theorem</h2>

 <p>Let x<sub>i</sub> ∈ ℝ<sup>d</sup> denote an input vector with binary label y<sub>i</sub> ∈ {-1,+1}, and let w<sub>t</sub> ∈ ℝ<sup>d</sup> denote the weight vector at iteration t. For linearly separable data {(x<sub>i</sub>, y<sub>i</sub>)}<sub>i=1</sub><sup>n</sup>, standard gradient descent on the logistic loss sends ‖ w<sub>t</sub> ‖ → ∞, but the normalized direction w<sub>t</sub> / ‖ w<sub>t</sub> ‖ converges to the L2 max-margin direction ŵ / ‖ ŵ ‖, where
 <img src="https://rifaki.me/notes/img/math/f4a525e8c1ec0762.svg" alt="$$\hat{w} \;=\; \arg\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\quad y_i\,w^{\top} x_i \ge 1 \quad \forall i \in [n].$$" class="math-display" width="449" height="35"/>
 </p>

 <p>In words: once the classifier has separated the data, gradient descent keeps reducing the logistic loss by scaling the separator up, so unlike a traditional SVM it grows the norm of the weights without bound. The norm grows only logarithmically with t, and the angle gap between w<sub>t</sub>/‖w<sub>t</sub>‖ and ŵ/‖ŵ‖ closes at a comparably slow logarithmic rate. This is why the asymptotic regime takes so many iterations to become visible.</p>

 <p>A few things follow from the theorem. Gradient descent behaves as if it had been regularized toward the Euclidean max-margin classifier without anyone writing that regularizer down. It also explains why early stopping helps: because the weight norm diverges over time, stopping early caps the implicit penalty before the iterate gets too far. Ji and Telgarsky extended the result to the non-separable case, showing the iterate still tracks a unique ray defined by the data when no separating hyperplane exists. Nacson et al. [<a href="https://arxiv.org/abs/1803.01905">4</a>] showed that aggressive learning-rate schedules accelerate convergence to the max-margin direction by polynomial factors over plain GD. The norm divergence itself is robust to dataset size and dimension: as long as the data is linearly separable, the iterate keeps moving away from the origin and never lands at a finite minimum.</p>
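 <p>The theorem is cheap to see numerically. A minimal sketch (my own toy setup, not the paper's experiments): run full-batch gradient descent on the logistic loss over a small separable dataset and watch the norm keep growing while the normalized direction stops moving; the stabilized direction can then be checked against any hard-margin SVM solver.</p>

 <pre><code># Sketch: full-batch gradient descent on the logistic loss over a separable
# toy dataset. The weight norm keeps growing (roughly like log t) while the
# normalized direction stabilizes. Illustrative only, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 2
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
X += 0.3 * y[:, None]                      # push the classes apart: strictly separable

def logistic_grad(w):
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

w = np.zeros(d)
eta = 0.5
for t in range(1, 200001):
    w -= eta * logistic_grad(w)
    if t in (1000, 10000, 100000, 200000):
        print(t, "norm:", round(float(np.linalg.norm(w)), 2),
              "direction:", np.round(w / np.linalg.norm(w), 4))
</code></pre>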

 <figure>
 <img src="https://rifaki.me/notes/img/implicit-bias-margin-flow.png" alt="Figure 2 of Soudry et al. 2018. Three panels on a real classification dataset: training/validation objective loss, classification error, and L2 norm of the final layer growing as training progresses.">
 <figcaption>Figure 2 of Soudry et al. [<a href="https://arxiv.org/abs/1710.10345">1</a>]. On real data, classification error plateaus near zero while the L2 norm of the final layer keeps growing - the asymptotic regime described in the linear theorem.</figcaption>
 </figure>

 <h2>The geometry enters</h2>

 <p>Gunasekar et al. [<a href="https://arxiv.org/abs/1802.08246">3</a>] generalized the result: different optimization geometries select different implicit regularizers. Steepest descent with respect to the ℓ<sub>p</sub> norm converges in direction to the maximum ℓ<sub>p</sub>-margin separator rather than the ℓ<sub>2</sub> one. Mirror descent with respect to a convex potential Φ selects the interpolant that minimizes Φ. Natural gradient and adaptive methods land at interpolants determined by the geometry of their step.</p>

 <p>The linear-convolutional-network result is the cautionary case. For fully connected linear predictors, gradient descent picks out the familiar ℓ<sub>2</sub> margin geometry. For full-width linear convolutional networks of depth L, Gunasekar et al. show that gradient descent instead selects the predictor minimizing the 2/L-bridge penalty in the discrete Fourier domain. The architecture changes which parameters are being optimized, and that change shifts both the trajectory and the preferred solution. "Gradient descent likes simple solutions" is too vague to be a theorem. The more honest statement is that gradient descent likes simple solutions in whichever coordinate system the architecture imposes.</p>
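 <p>A toy version of the selection-rule point, in the regression setting rather than the classification one: on an underdetermined least-squares problem, gradient descent converges to whichever interpolant is closest to its starting point in the Euclidean metric, so changing the starting point, or the metric in which "closest" is measured, changes the solution the optimizer hands back. A sketch (illustrative, not from the papers above):</p>

 <pre><code># Sketch: the optimizer geometry as a selection rule over interpolants.
# An underdetermined least-squares problem has infinitely many exact solutions;
# full-batch GD converges to the one closest to its starting point in L2.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def run_gd(w0, steps=20000, eta=1e-3):
    w = w0.copy()
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - y)
    return w

w_from_zero = run_gd(np.zeros(p))
w0 = rng.normal(size=p)
w_from_w0 = run_gd(w0)

w_minnorm = np.linalg.pinv(X) @ y                  # minimum-L2-norm interpolant
w_closest = w0 + np.linalg.pinv(X) @ (y - X @ w0)  # interpolant closest to w0

print(np.allclose(w_from_zero, w_minnorm, atol=1e-4))  # True: GD from 0 selects min-norm
print(np.allclose(w_from_w0, w_closest, atol=1e-4))    # True: GD from w0 selects a different one
print(float(np.linalg.norm(w_from_zero)), float(np.linalg.norm(w_from_w0)))
</code></pre>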

 <figure>
 <img src="https://rifaki.me/notes/img/implicit-bias-optimizer-geometry.png" alt="Three-panel figure from Gunasekar, Lee, Soudry, Srebro 2018: (a) mirror descent with primal momentum, (b) natural gradient descent at varying step sizes, (c) steepest descent under the 4/3 norm. Each optimizer trajectory lands on a different interpolating solution along the same zero-loss line.">
 <figcaption>From Gunasekar et al. [<a href="https://arxiv.org/abs/1802.08246">3</a>]. Implicit bias is a selection rule over interpolants determined by the geometry of the optimizer, not a single universal preference.</figcaption>
 </figure>

 <h2>Margin in homogeneous networks</h2>

 <p>Lyu and Li push the result past linear predictors. If f<sub>θ</sub> is positively homogeneous in θ with order L (which holds for ReLU networks without bias, with L equal to depth), then gradient flow on exponential or logistic loss drives θ<sub>t</sub> / ‖ θ<sub>t</sub> ‖ to a KKT point of the parameter-space margin program max<sub>‖ θ ‖ ≤ 1</sub> min<sub>i</sub> y<sub>i</sub> f<sub>θ</sub>(x<sub>i</sub>). The norm still diverges; the normalized direction still converges; the new content is that even on a non-convex parameter landscape, gradient flow lands on points satisfying first-order optimality conditions for the margin program.</p>
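 <p>To make the homogeneity condition concrete: a bias-free two-layer ReLU network is positively homogeneous of order L = 2 in its parameters, so the raw margin can be inflated by rescaling and the meaningful object is the normalized margin min<sub>i</sub> y<sub>i</sub> f<sub>θ</sub>(x<sub>i</sub>) / ‖θ‖<sup>L</sup>. A minimal numerical check of both facts (my own sketch, not from the paper):</p>

 <pre><code># Sketch: order-2 positive homogeneity of a bias-free two-layer ReLU net, and
# the normalized margin tracked by Lyu and Li. Scaling the parameters by c
# scales the output by c**2, so the raw margin is meaningless under rescaling
# and the normalized margin is the right invariant. Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
d, h, n = 5, 16, 30
W1 = rng.normal(size=(h, d)) / np.sqrt(d)   # first layer, no bias
w2 = rng.normal(size=h) / np.sqrt(h)        # output layer, no bias
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def f(W1, w2, X):
    return np.maximum(X @ W1.T, 0.0) @ w2   # ReLU hidden layer, linear output

def normalized_margin(W1, w2):
    theta_norm_sq = np.sum(W1**2) + np.sum(w2**2)
    return np.min(y * f(W1, w2, X)) / theta_norm_sq   # divide by ||theta||**L with L = 2

c = 3.0
print(np.allclose(f(c * W1, c * w2, X), c**2 * f(W1, w2, X)))  # True: homogeneity of order 2
print(normalized_margin(W1, w2), normalized_margin(c * W1, c * w2))  # equal under rescaling
</code></pre>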

 <p>Chizat and Bach prove a parallel mean-field result for two-layer networks with vanishing initializations. The implicit bias there is F<sub>1</sub>-norm minimization in function space, which is a different object from parameter-space margin maximization and interacts differently with the data. The state of the field is that implicit regularization in deep networks has several reasonable descriptions, none of which generalize cleanly past shallow or homogeneous models.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/implicit-bias-dynamics.png" alt="Figure 1 of Lyu and Li 2020. Training loss and normalized margin trajectories for homogeneous networks under fixed and loss-based learning rates: the loss collapses while the normalized margin keeps rising toward a KKT point of the parameter-space margin program.">
 <figcaption>Figure 1 of Lyu and Li [<a href="https://arxiv.org/abs/1906.05890">arxiv 1906.05890</a>]. The loss keeps shrinking, the weight norm grows, and the useful object is the normalized direction.</figcaption>
 </figure>

 <figure class="tweet-embed">
 <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">It is widely thought that neural networks generalize because of implicit regularization of gradient descent. Today at <a href="https://twitter.com/hashtag/ICLR2023?src=hash&amp;ref_src=twsrc%5Etfw">#ICLR2023</a> we show new evidence to the contrary. We train with gradient-free optimizers and observe generalization competitive with SGD.<a href="https://t.co/8Vo9rFI9FY">https://t.co/8Vo9rFI9FY</a></p>&mdash; Tom Goldstein (@tomgoldsteincs) <a href="https://twitter.com/tomgoldsteincs/status/1653284772314005505?ref_src=twsrc%5Etfw">May 2, 2023</a></blockquote>
 </figure>

 <p>My own reading is that the implicit-bias program is one of the few cases I can point to of a research direction being vindicated and outgrown at the same time. Soudry et al. [<a href="https://arxiv.org/abs/1710.10345">1</a>] is true; the mechanism is real; the linear case is the only setting where I can prove anything I trust. What is unclear is whether the same phenomenon is the dominant explanation for why large feature-learning networks generalize, or whether at scale the data distribution and the architecture have already done so much of the work that the optimizer's preference is a small correction. I currently believe the second, but I do not have a falsifier I trust, which is exactly the position the field is in.</p>
 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://arxiv.org/abs/1806.00468">S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. arxiv 1806.00468, 2018</a></li>
 <li><a href="https://arxiv.org/abs/1803.07300">Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arxiv 1803.07300, 2018</a></li>
 <li><a href="https://arxiv.org/abs/1906.05890">K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. arxiv 1906.05890, 2019</a></li>
 <li><a href="https://arxiv.org/abs/2002.04486">L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arxiv 2002.04486, 2020</a></li>
 <li><a href="https://arxiv.org/abs/2106.09524">S. Pesme, L. Pillaud-Vivien, and N. Flammarion. Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity. arxiv 2106.09524, 2021</a></li>
 <li><a href="https://arxiv.org/abs/2007.06738">E. Moroshko, S. Gunasekar, B. Woodworth, J. D. Lee, N. Srebro, and D. Soudry. Implicit bias in deep linear classification: initialization scale vs training accuracy. arxiv 2007.06738, 2020</a></li>
 </ul>

 

<h2>References</h2>

 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1710.10345">1</a>] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. arxiv 1710.10345, 2017.</li>
 <li>[<a href="https://arxiv.org/abs/1611.03530">2</a>] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arxiv 1611.03530, 2016.</li>
 <li>[<a href="https://arxiv.org/abs/1802.08246">3</a>] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. arxiv 1802.08246, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1803.01905">4</a>] M. S. Nacson, J. Lee, S. Gunasekar, P. H. P. Savarese, N. Srebro, and D. Soudry. Convergence of gradient descent on separable data. arxiv 1803.01905, 2019.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/implicit-bias-margin.png"/>
    </item>
    <item>
      <title>The edge of stability</title>
      <link>https://rifaki.me/notes/edge-of-stability/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/edge-of-stability/</guid>
      <pubDate>Thu, 27 Feb 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Cohen&#x27;s edge-of-stability finding, Arora&#x27;s analysis, and what it does to the older flat-minima story.</p><p><img src="https://rifaki.me/notes/img/cohen-fig1.png" alt="The edge of stability"/></p>]]></description>
      <content:encoded><![CDATA[<p>Cohen et al. [<a href="https://arxiv.org/abs/2103.00065">1</a>] observed that gradient descent on neural networks spends most of training in a regime where the top Hessian eigenvalue λ<sub>max</sub> is above the classical stability threshold 2/η. The step is formally unstable (η λ<sub>max</sub> > 2), but the loss does not diverge: the trajectory oscillates along the unstable direction while continuing to make progress on the rest. They called this the edge of stability, and the point is that it is not an edge case but the ordinary regime of neural-network training.</p>

        <h2>Progressive sharpening</h2>

        <p>The first phase Cohen identified is progressive sharpening. From random initialization, gradient descent reliably drives the sharpness (λ<sub>max</sub>(H)) of the loss landscape upward during training.</p>

        <p>That direction is already counterintuitive. Classical optimization theory says you want to avoid sharp regions, since larger curvature forces a smaller stable step size. Neural-network training does the opposite: it walks into sharper regions until the classical step size is no longer stable. Progressive sharpening itself is not well understood from first principles. Damian et al. [<a href="https://arxiv.org/abs/2209.15594">2</a>] give a self-stabilization argument for what happens after the threshold is reached, and Ahn et al. [<a href="https://arxiv.org/abs/2204.01050">4</a>] analyze the unstable-convergence regime directly, but neither predicts the sharpening from initialization. The observational picture is clean; the theoretical one is not.</p>

        <h2>Above the threshold</h2>

        <p>Once sharpness crosses 2/η, the textbook prediction is divergence; instead the trajectory oscillates along the top eigendirection of the Hessian. The component of the iterate along that direction swings back and forth, and the loss still falls because optimization keeps making progress on the better-conditioned directions. The net result is a trajectory that reduces loss while sitting in a region of the landscape that classical theory says it should not occupy. The first theoretical account of why this need not be a pathology comes from Arora et al. [<a href="https://arxiv.org/abs/2205.09745">3</a>]. Their analysis works on a smoothed version of the loss, where the Hessian is treated as locally fixed, and in effect tracks the trajectory you would follow from a given starting point under that fixed Hessian.</p>
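        <p>The quantity being tracked, λ<sub>max</sub> of the training-loss Hessian, is in practice estimated with power iteration on Hessian-vector products rather than by forming the Hessian. A minimal sketch of that measurement against the 2/η threshold; <code>model</code>, <code>loss_fn</code>, and the batch are placeholders for the reader's own setup, not names from the paper's code.</p>

        <pre><code># Sketch: estimate the sharpness lambda_max(H) of the training loss with power
# iteration on Hessian-vector products, and compare it to the 2/eta threshold.
# `model`, `loss_fn`, and the batch (x, targets) are placeholders.
import torch

def sharpness(model, loss_fn, x, targets, iters=20):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    lam = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))      # scalar g.v, differentiable
        hv = torch.autograd.grad(gv, params, retain_graph=True)  # Hessian-vector product Hv
        lam = sum((h * vi).sum() for h, vi in zip(hv, v)).item() # Rayleigh quotient v.Hv
        v = [h.detach() for h in hv]                             # power-iteration update
    return lam

# eta = optimizer.param_groups[0]["lr"]
# print("sharpness:", sharpness(model, loss_fn, x, targets), "threshold 2/eta:", 2 / eta)
</code></pre>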

        <figure>
          <img src="https://rifaki.me/notes/img/cohen-fig1.png" alt="Figure 1 of Cohen et al. 2021. Train loss (top row) and Hessian sharpness (bottom row) over training steps for a fully-connected net on a CIFAR-10 5k subset, VGG on CIFAR-10, and ResNet on CIFAR-10. In every case, sharpness rises until it hits the $2/\eta$ threshold (dashed) and oscillates along it.">
          <figcaption>Figure 1 of Cohen et al. [<a href="https://arxiv.org/abs/2103.00065">1</a>]. Across architectures, the Hessian's top eigenvalue rises during a progressive-sharpening phase and then sits near the 2/η stability threshold for the remainder of training.</figcaption>
        </figure>

        <p>The mechanism is that the oscillation Δ θ<sub>t</sub> across the unstable direction averages to zero, so the effective dynamics is slower and looks like gradient descent on a loss with the steepest direction clipped. The account is formal enough to be checked against real training runs, and in most common settings it holds up.</p>

        <h2>And flat minima</h2>

        <p>The earlier flat-minima story started with Hochreiter and Schmidhuber and continued with Keskar et al. [<a href="https://arxiv.org/abs/1609.04836">7</a>] and the later sharpness-aware minimization literature. Broadly, the flat-minima story ran as follows: SGD with a small batch size produces gradient estimates with some noise ξ.</p>

        <p>That noise looks like a random walk and tends to leave sharp minima more often than flat ones, so SGD ends up biased toward flat minima, and that bias was meant to be why neural networks generalize. Edge-of-stability does not contradict the story, but it reshapes it. The learning rate itself caps how sharp a reachable minimum can be: anything with λ<sub>max</sub> > 2/η is unstable for GD, so the trajectory cannot stay there. Gradient descent finds flat minima not because it has noise, but because sharp minima are unstable fixed points under its own dynamics. The original explanation identified the phenomenon and pinned it to the wrong cause.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/cohen-fig3.png" alt="Figure 3 of Cohen et al. 2103.00065. Progressive sharpening isolated; sharpness rises before reaching the 2/eta threshold." width="1600" height="249">
          <figcaption>Figure 3 of Cohen et al. [<a href="https://arxiv.org/abs/2103.00065">1</a>]. Progressive sharpening isolated: Hessian sharpness rises monotonically during the early phase, long before the 2/η threshold is reached.</figcaption>
        </figure>

        <p>The reframing lives mostly outside the papers themselves. Off Convex has a few posts on implicit bias, trajectory analysis, and why the classical descent lemma is genuinely misleading for neural networks instead of merely approximate. Ben Recht's ArgMin is the complementary skeptical take for once you have left convex optimization theory and its two-sided curvature bounds μ I ⪯ ∇<sup>2</sup> L ⪯ LI. Clare Lyle's tutorial walks through the 2/η arithmetic and ties the phenomenon to warmup (rising η(t)) and catapult (loss spike then decay) in one frame.</p>

        <p>Andreyev and Beneventano (<a href="https://arxiv.org/abs/2412.20553">arxiv 2412.20553</a>) extended the story to the mini-batch setting Cohen did not analyze, introducing an "edge of stochastic stability" where the quantity that pins at 2/η is the expected directional curvature of mini-batch Hessians, not the top eigenvalue of the full-batch Hessian.</p>

        <p>A few previously folklore-level phenomena become intelligible from this picture. Warmup schedules η(t), which start small and increase η over several thousand iterations, let the network settle into the edge-of-stability regime before η reaches its final value; without warmup, the early transient at the full η would hit a too-sharp region and diverge. Decreasing-η schedules at the end of training raise 2/η, so the trajectory can fine-tune in sharper local minima inside the broader flat region already reached, which empirically pushes training loss down further.</p>
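        <p>The schedule-level reading is easy to state in code: at each step the reachable sharpness is capped at 2/η(t), so warmup lowers that ceiling gradually as η rises and end-of-training decay raises it again. A small sketch of a linear-warmup plus cosine-decay schedule and the cap it implies (the schedule shape is a common default, not something prescribed by these papers):</p>

        <pre><code># Sketch: a linear-warmup + cosine-decay schedule and the sharpness cap 2/eta(t)
# it implies at each step. The shape is a common default, not a prescription
# from the edge-of-stability papers.
import math

def lr_schedule(step, total_steps=10000, warmup_steps=1000, peak_lr=0.1, final_lr=0.001):
    if step >= warmup_steps:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
    return peak_lr * step / warmup_steps   # linear warmup

for step in (1, 500, 1000, 5000, 10000):
    eta = lr_schedule(step)
    print(step, round(eta, 5), "reachable sharpness capped at 2/eta =", round(2 / eta, 1))
</code></pre>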

        <figure>
          <img src="https://rifaki.me/notes/img/arora-fig1.png" alt="Figure 1 of Arora et al. 2205.09745. Smoothed-loss analysis of the edge-of-stability oscillations." width="1600" height="525">
          <figcaption>Figure 1 of Arora et al. [<a href="https://arxiv.org/abs/2205.09745">3</a>]. The smoothed-loss analysis makes explicit why the oscillations across the unstable direction do not destroy progress; averaging over a few steps yields an effective slow dynamics on a clipped loss.</figcaption>
        </figure>

        <p>Both schedules had been used empirically for years before any principled account existed. Lewkowycz et al.'s catapult mechanism [<a href="https://arxiv.org/abs/2003.02218">5</a>], where an initial loss spike sometimes precedes a better final solution, is the same dynamics at a larger scale: a large learning rate pushes the trajectory through a briefly very sharp region, the loss spikes, and the trajectory then settles into a different basin from the one it would have reached at a smaller step size.</p>

        <h2>The generalization gap is still open</h2>

        <p>Edge-of-stability gives a clean account of why SGD ends up in flat minima, but it says nothing about why flat minima generalize. Those are distinct questions, and the second one is still open.</p>

        <p>Dinh et al. [<a href="https://arxiv.org/abs/1703.04933">6</a>] showed that the Hessian-based notion of sharpness is not reparameterization invariant, so sharpness in that form cannot directly control generalization.</p>

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.*<br><br>With <a href="https://twitter.com/alex_damian_?ref_src=twsrc%5Etfw">@alex_damian_</a>, we introduce &quot;central flows&quot;: a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs. <a href="https://t.co/pvvfwoQcOy">pic.twitter.com/pvvfwoQcOy</a></p>&mdash; Jeremy Cohen (@deepcohen) <a href="https://twitter.com/deepcohen/status/1973191790602887544?ref_src=twsrc%5Etfw">October 1, 2025</a></blockquote>
          <figcaption>Jeremy Cohen, lead author of [<a href="https://arxiv.org/abs/2103.00065">1</a>], announcing the <a href="https://arxiv.org/abs/2410.24206">central-flows follow-up</a> that makes the 2021 observation a quantitative prediction tool.</figcaption>
        </figure>

        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://clarelyle.com/posts/2023-10-15-edge.html">deep dive into the edge of stability</a></li>
          <li><a href="https://arxiv.org/abs/1912.05671">J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Linear mode connectivity and the lottery ticket hypothesis. arxiv 1912.05671, 2019</a></li>
          <li><a href="https://jmlr.org/papers/v25/23-1285.html">P. M. Long and P. L. Bartlett. Sharpness-aware minimization and the edge of stability. Journal of Machine Learning Research, 2024</a></li>
          <li><a href="https://arxiv.org/abs/2410.24206">J. M. Cohen, A. Damian, A. Talwalkar, J. Z. Kolter, and J. D. Lee. Understanding optimization in deep learning with central flows. arxiv 2410.24206, 2024</a></li>
          <li><a href="https://direct.mit.edu/neco/article/9/1/1/6027/Flat-Minima">S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997</a></li>
        </ul>

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/2103.00065">1</a>] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arxiv 2103.00065, 2021.</li>
          <li>[<a href="https://arxiv.org/abs/2209.15594">2</a>] A. Damian, E. Nichani, and J. D. Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arxiv 2209.15594, 2022.</li>
          <li>[<a href="https://arxiv.org/abs/2205.09745">3</a>] S. Arora, Z. Li, and A. Panigrahi. Understanding gradient descent on edge of stability in deep learning. arxiv 2205.09745, 2022.</li>
          <li>[<a href="https://arxiv.org/abs/2204.01050">4</a>] K. Ahn, J. Zhang, and S. Sra. Understanding the unstable convergence of gradient descent. arxiv 2204.01050, 2022.</li>
          <li>[<a href="https://arxiv.org/abs/2003.02218">5</a>] A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: The catapult mechanism. arxiv 2003.02218, 2020.</li>
          <li>[<a href="https://arxiv.org/abs/1703.04933">6</a>] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. arxiv 1703.04933, 2017.</li>
          <li>[<a href="https://arxiv.org/abs/1609.04836">7</a>] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: generalization gap and sharp minima. arxiv 1609.04836, 2016.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/cohen-fig1.png"/>
    </item>
    <item>
      <title>Kaplan, Chinchilla, and broken laws</title>
      <link>https://rifaki.me/notes/neural-scaling-laws/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/neural-scaling-laws/</guid>
      <pubDate>Mon, 06 Jan 2025 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Kaplan&#x27;s original fit, Chinchilla&#x27;s correction, Caballero&#x27;s broken-power-law alternative, and what predictive ability the labs actually use.</p><p><img src="https://rifaki.me/notes/img/kaplan-fig1.png" alt="Kaplan, Chinchilla, and broken laws"/></p>]]></description>
      <content:encoded><![CDATA[<p>Kaplan et al. [<a href="https://arxiv.org/abs/2001.08361">1</a>] set the baseline picture: test loss falls as a power law in each of model size N, dataset size D, and compute C, e.g. L(C) = (C/C<sub>0</sub>)<sup>-α<sub>C</sub></sup>, with clean exponents α that hold over many orders of magnitude. They also claimed that for a given compute budget, there is an optimal allocation between N and D, and specifically that N should grow faster than D as C grows.</p>

 <p>That paper shaped how labs designed pre-training experiments for the next two years, and the eventual "Chinchilla" effort grew out of trying to reproduce and extend its recommendations. It also turned out to be wrong about the optimal D/N ratio, which is the part the field then had to revise.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/kaplan-fig1.png" alt="Figure 1 of arxiv 2001.08361. Test loss plotted against compute on log-log axes, showing a power-law fit over many orders of magnitude." width="1600" height="493">
 <figcaption>Figure 1 of Kaplan et al. [<a href="https://arxiv.org/abs/2001.08361">1</a>]. Log-log plot of test loss against compute. The power-law fit is tight over seven orders of magnitude, which is what made the result so persuasive.</figcaption>
 </figure>

 <h2>Chinchilla</h2>

 <p>Hoffmann et al. [<a href="https://arxiv.org/abs/2203.15556">2</a>] re-asked Kaplan's question with a larger, better-controlled experiment: over 400 language models from 70 million to over 16 billion parameters, trained on 5 to 500 billion tokens, across a sweep of compute budgets. Fitting a joint regression over model size and tokens gave them per-budget optima N<sup>⋆</sup>(C), D<sup>⋆</sup>(C) that were not where Kaplan put them. Their headline rule is that for each doubling of model size, dataset size should also double, which lands at roughly 20 tokens per parameter at the compute-optimal point.</p>

 <p>The shift in conclusions came from a methodological gap. Hoffmann et al. argue that Kaplan's runs were too short: they ended well before each model had seen enough tokens to bottom out its loss. Kaplan's reported optima were therefore extrapolations from incomplete training curves, while Hoffmann's were drawn from runs that continued long enough to actually locate the per-size minimum. The two sets of optima ended up substantially different.</p>
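 <p>The bookkeeping behind "compute-optimal" is short enough to write down. Assuming the standard C ≈ 6ND approximation for training FLOPs and taking the roughly-20-tokens-per-parameter ratio at face value (one point estimate, as discussed below, not a universal constant), the compute-optimal configuration at a budget C is two lines of algebra:</p>

 <pre><code># Sketch: the arithmetic behind the Chinchilla headline rule. With the standard
# C ~ 6*N*D training-FLOPs approximation and a fixed tokens-per-parameter ratio,
#   D = r * N  and  C = 6 * N * D   imply   N* = sqrt(C / (6 * r)),  D* = r * N*.
# r = 20 is the Chinchilla point estimate, not a universal constant.

def chinchilla_optimal(C, tokens_per_param=20.0):
    N_star = (C / (6.0 * tokens_per_param)) ** 0.5
    D_star = tokens_per_param * N_star
    return N_star, D_star

for C in (1e21, 1e23, 5.76e23, 1e25):
    N, D = chinchilla_optimal(C)
    print(f"C={C:.2e} FLOPs  ->  N*={N:.2e} params, D*={D:.2e} tokens")
# At the Gopher budget of ~5.76e23 FLOPs this lands near 70B parameters and
# 1.4T tokens, i.e. the Chinchilla configuration.
</code></pre>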

 <h2>The effects of correcting Kaplan</h2>

 <p>GPT-3 and similarly Kaplan-trained models were trained on substantially too little data for their size. Chinchilla shows that for a fixed compute budget, a smaller model trained on more tokens reaches a lower loss than a bigger model trained on fewer. Concretely, Chinchilla-70B is roughly 2.5x smaller than GPT-3 (175 billion parameters) and outperforms it on nearly every benchmark in the original paper. Many factors contribute to that gap, but the dominant one is that Chinchilla-70B is configured at the Chinchilla-optimal point for its compute budget while GPT-3 sits at the Kaplan-optimal point. Subsequent scaling-law work that calibrates against the Chinchilla target accordingly emphasizes data scaling, longer training, and less aggressive model-size growth.</p>

 <p>Llama 2 was trained on 2 trillion tokens, well past the Chinchilla-optimal token count for its size and into a regime where the relevant trade-off is no longer training compute but inference cost. Once it became clear that smaller models are much cheaper to run at inference, scaling-law work shifted its target from training-compute optimality to deployment optimality.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/hoffmann-fig3.png" alt="Figure 3 of Hoffmann et al. 2022 (Chinchilla). Left: training loss versus parameter count for fixed FLOP budgets from 6e18 up to 3e21, each forming a U-shape with a clear minimum. Middle: optimal parameters versus FLOPs, extrapolating to ~63B parameters at ~1e23 FLOPs. Right: optimal training tokens versus FLOPs, extrapolating to ~1.4T tokens.">
 <figcaption>Figure 3 of Hoffmann et al. [<a href="https://arxiv.org/abs/2203.15556">2</a>]. The IsoFLOPs decomposition: each line is a fixed compute budget, and the locus of minima defines the Chinchilla parameters/tokens scaling law.</figcaption>
 </figure>

 <h2>Broken laws</h2>

 <p>Caballero et al. [<a href="https://arxiv.org/abs/2210.14891">3</a>] argue that the single-power-law narrative is wrong. Loss versus compute, in their fit, is a smoothly broken power law: continuous everywhere, with several breaks where the slope changes. Kaplan and Chinchilla's single-exponent fits are then averaging across regimes with different exponents and producing a number that matches none of them.</p>
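 <p>For concreteness, a single-break version of the functional form, as I read it from Caballero et al.: on log-log axes the slope transitions smoothly from one exponent to another around a break scale, with an extra parameter controlling how sharp the transition is. The constants below are made up for illustration.</p>

 <pre><code># Sketch: a single-break smoothly broken power law, my reading of the BNSL
# functional form in Caballero et al. On log-log axes the slope moves smoothly
# from -c0 before the break at scale d1 to -(c0 + c1) after it, with f1
# controlling how sharp the transition is. Constants are made up.
import numpy as np

def broken_power_law(x, a, b, c0, c1, d1, f1):
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

x = np.logspace(0, 8, 9)
y = broken_power_law(x, a=0.1, b=5.0, c0=0.05, c1=0.3, d1=1e4, f1=0.5)
slopes = np.diff(np.log(y - 0.1)) / np.diff(np.log(x))   # local log-log slope of the scaling term
print(np.round(slopes, 3))   # drifts from about -0.05 toward about -0.35 past the break
</code></pre>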

 <p>Whether the broken-laws view is useful depends on what you want the fit for. For coarse extrapolation across orders of magnitude in compute, a single power law still works fine. For predicting at what scale a particular capability shows up, it does not, and the Caballero breakpoints line up with the "emergent" capabilities Wei et al. [<a href="https://arxiv.org/abs/2206.07682">4</a>] documented (abilities that effectively switch on once loss crosses a threshold, 𝟙[L < L<sub>threshold</sub>]), which Schaeffer et al. [<a href="https://arxiv.org/abs/2304.15004">5</a>] then argued are largely an artifact of how the underlying metric is discretized.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/caballero-fig1.png" alt="Figure 1 of Caballero et al. 2022. Annotated example of a Broken Neural Scaling Law (BNSL) functional form, marking three break points and four slope regimes between them as the performance metric is plotted against the quantity being scaled (log-log)." style="max-width: 75%;">
 <figcaption>Figure 1 of Caballero et al. [<a href="https://arxiv.org/abs/2210.14891">3</a>]. The BNSL form is a piecewise power law with explicit breaks; a single power law averages across these regimes and misses inflections the data actually shows.</figcaption>
 </figure>

 <figure>
 <img src="https://rifaki.me/notes/img/caballero-fig2.png" alt="Figure 2 of Caballero et al. 2022. Two real-task BNSL fits: top panel ImageNet 25-shot test error versus training-dataset size; bottom panel TriviaQA few-shot test accuracy versus number of model parameters. Red curve is the BNSL fit; green points extend the fit beyond the training range." style="max-width: 75%;">
 <figcaption>Figure 2 of Caballero et al. [<a href="https://arxiv.org/abs/2210.14891">3</a>]. Two real-task examples: ImageNet error versus dataset size (top) and TriviaQA accuracy versus parameter count (bottom). The BNSL form tracks the data through visible breaks where a single power law would not.</figcaption>
 </figure>

 <p>Gwern's Scaling Hypotheses essay and the Revisited follow-up are the strongest non-specialist treatments of the underlying premise that capabilities come out of scale. Jacob Steinhardt's Bounded Regret is the blog I send people to when they want a careful read of what scaling laws actually let you predict. Beyond Chinchilla-Optimal is the most direct argument that the 20-tokens-per-parameter ratio is not a universal constant.</p>

 <p>Kaplan, Chinchilla, and Caballero all show that with a reasonably behaved architecture, a reasonable data mixture, and enough compute to get past the early warmup phase of pre-training, you can extrapolate loss from small runs to larger ones with usable accuracy. Error bars widen as the extrapolation gets more aggressive but not catastrophically. That predictive ability is what labs use to decide whether an expensive training run is worth doing.</p>

 <p>None of the papers here support any particular exponent or ratio as universal. Exponents depend on architecture, data mixture, and optimizer. The 20-tokens-per-parameter Chinchilla figure is one point estimate for one such combination, not a law of physics.</p>

 <p>Almost the entire scaling-law literature addresses test loss on the training distribution, and nothing else. It provides zero guidance on what data mixture to choose, what architecture will surpass dense transformers, what capabilities will appear at what scale, whether the resulting model will be safe, or how optimization will interact with the learning-rate schedule η(t). All of those are properties of individual training runs and live outside the fitted relationship. Treating scaling laws as though they answered them is a common failure mode, and much of the broken-laws literature is about that failure. Scaling laws tell you how much loss you will incur. Almost everything interesting about a model is left undetermined by that single number.</p>

 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://simons.berkeley.edu/talks/sasha-rush-cornell-university-hugging-face-2023-08-15">Scaling Data-Constrained Language Models</a></li>
 <li><a href="https://simons.berkeley.edu/talks/when-scale-enough">When is Scale Enough?</a></li>
 <li><a href="https://simons.berkeley.edu/talks/yasaman-bahri-google-deepmind-2023-08-15">Simons talk on the theoretical side of scaling</a></li>
 <li><a href="https://arxiv.org/abs/2010.14701">T. Henighan et al. Scaling laws for autoregressive generative modeling. arxiv 2010.14701, 2021</a></li>
 <li><a href="https://arxiv.org/abs/2406.12907">T. Pearce and J. Song. Reconciling Kaplan and Chinchilla scaling laws. arxiv 2406.12907, 2024</a></li>
 <li><a href="https://proceedings.mlr.press/v235/sardana24a.html">N. Sardana, J. Portes, S. Doubov, and J. Frankle. Beyond Chinchilla-Optimal: Accounting for inference in language model scaling laws. ICML, 2024</a></li>
 <li><a href="https://arxiv.org/abs/2602.07488">Deriving neural scaling laws from the statistics of natural language. arxiv 2602.07488, 2026</a></li>
 </ul>

<h2>References</h2>

 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/2001.08361">1</a>] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arxiv 2001.08361, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/2203.15556">2</a>] J. Hoffmann et al. Training compute-optimal large language models. arxiv 2203.15556, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/2210.14891">3</a>] E. Caballero, K. Gupta, I. Rish, and D. Krueger. Broken neural scaling laws. arxiv 2210.14891, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/2206.07682">4</a>] J. Wei et al. Emergent abilities of large language models. arxiv 2206.07682, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/2304.15004">5</a>] R. Schaeffer, B. Miranda, and S. Koyejo. Are emergent abilities of large language models a mirage? arxiv 2304.15004, 2023.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/kaplan-fig1.png"/>
    </item>
    <item>
      <title>Lottery ticket hypothesis</title>
      <link>https://rifaki.me/notes/lottery-tickets/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/lottery-tickets/</guid>
      <pubDate>Sat, 16 Nov 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Frankle and Carbin&#x27;s original procedure, Liu&#x27;s rebuttal, the rewinding fix, and what holds up after the Frankle-to-Liu exchange.</p><p><img src="https://rifaki.me/notes/img/frankle-fig3.png" alt="Lottery ticket hypothesis"/></p>]]></description>
      <content:encoded><![CDATA[<p>The lottery-ticket hypothesis of Frankle and Carbin [<a href="https://arxiv.org/abs/1803.03635">1</a>] proposes that a randomly initialized dense network already contains a much sparser subnetwork (the "winning ticket") which, trained in isolation from its original initialization, matches the dense network's accuracy. If true, this is a strong claim about how deep networks represent functions: optimization would be selecting structure that was already present at initialization, not creating new structure. The pruning literature had circled this idea before; Frankle and Carbin's contribution was an actionable procedure for finding these subnetworks.</p>

 <h2>In practice, what Frankle and Carbin actually did</h2>

 <p>Frankle and Carbin defined a simple process to find these "winning" tickets. They called it iterative magnitude pruning. First train the full model until it converges. Prune the lowest-magnitude weights, masking them to zero for the rest of the procedure. Reset the remaining weights to their original initialization values. Train again. Continue this process several times, removing a portion of the surviving weights each time, until you cannot remove any more without losing performance relative to the full model. The resulting subnetwork is considered to be the "winning" ticket for that particular initial condition and dataset. The "rewinding" variant, resetting to weights from a few steps after initialization rather than to initialization itself, came later in Frankle's follow-up [<a href="https://arxiv.org/abs/1903.01611">8</a>].</p>
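 <p>The procedure is compact enough to write out. A toy sketch of a few IMP rounds (my own illustrative setup, not the paper's experiments); the rewinding variant discussed below corresponds to saving the reset point a few steps into training rather than at step zero.</p>

 <pre><code># Sketch: a few rounds of iterative magnitude pruning (IMP) on a toy problem.
# Each round trains, prunes 20% of the surviving weights globally by magnitude,
# and resets the survivors to their initial values (rewinding would reset to an
# early-training checkpoint instead). Illustrative, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, :5].sum(dim=1) > 0).long()                 # synthetic binary labels

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
theta0 = {k: p.detach().clone() for k, p in model.named_parameters()}   # saved init
weights = {k: p for k, p in model.named_parameters() if p.dim() > 1}    # prune weight matrices only
masks = {k: torch.ones_like(p) for k, p in weights.items()}

def train(steps=300):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
        with torch.no_grad():                        # keep pruned weights at zero
            for k, m in masks.items():
                weights[k].mul_(m)
    return loss.item()

for r in range(5):
    final_loss = train()
    with torch.no_grad():
        surviving = torch.cat([weights[k][m.bool()].abs().flatten() for k, m in masks.items()])
        threshold = torch.quantile(surviving, 0.2)   # prune bottom 20% of survivors
        for k, m in masks.items():
            m *= (weights[k].abs() > threshold).float()
        model.load_state_dict(theta0)                # reset survivors to initialization
        for k, m in masks.items():
            weights[k].mul_(m)
    density = float(sum(m.sum() for m in masks.values())) / sum(m.numel() for m in masks.values())
    print(f"round {r}: final loss {final_loss:.3f}, weights remaining {density:.1%}")
</code></pre>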

 <p>On MNIST and small CIFAR architectures, the results are clean. Sparse subnetworks with only a few percent of the original weights remaining match the dense network's accuracy. The winning tickets are also tied to a specific initialization: a ticket found from one random init does not transfer to a different one, which is why the procedure is read as discovering structure already present at initialization rather than creating it during training.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/frankle-fig3.png" alt="Figure 3 of Frankle and Carbin 2019. Test accuracy versus training iterations on Lenet-MNIST for lottery tickets at sparsity levels 100%, 51.3%, 21.1%, 7.0%, 3.6%, 1.9% remaining weights, plus reinitialized 51.3% and 21.1% baselines. Winning tickets reach the dense baseline; randomly-reinitialized counterparts plateau lower.">
 <figcaption>Figure 3 of Frankle and Carbin [<a href="https://arxiv.org/abs/1803.03635">1</a>]. Lottery-ticket subnetworks recover the dense baseline down to a few percent of the original weights; the same masks with random reinitialization do not.</figcaption>
 </figure>

 <h2>The rebuttal of Liu et al. [<a href="https://arxiv.org/abs/1810.05270">2</a>]</h2>

 <p>Liu et al. ran the same procedure at larger scale and found that the gap between a Frankle winning ticket and a fresh random initialization of the same architecture closes once the network is big enough. At ImageNet scale, the winning-ticket effect basically disappears. Read narrowly this refutes Frankle and Carbin's strongest claim, but read alongside the original paper it is more usefully a scaling result: the winning-ticket structure is real on small networks and dissolves as scale grows.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/liu-fig2.png" alt="Figure 2 of Liu et al. 2019. Schematic distinguishing predefined pruning (uniform x% per layer) from automatic pruning (per-layer percentages a%, b%, c%, d% chosen by the algorithm) on a 4-layer model." style="max-width: 60%;">
 <figcaption>Figure 2 of Liu et al. [<a href="https://arxiv.org/abs/1810.05270">2</a>]. The two pruning regimes the paper distinguishes: predefined per-layer ratios versus automatically-discovered per-layer ratios. The "rethinking" results separate which regime the lottery-ticket conclusion survives in.</figcaption>
 </figure>

 <h2>Rewinding fixes</h2>

 <p>Frankle's response to Liu was a follow-up paper [<a href="https://arxiv.org/abs/1903.01611">8</a>] introducing a small change that brought the result back at scale: instead of resetting the surviving weights to their original initialization, reset them to the values they had a short distance into training (a few steps of w<sub>t+1</sub> = w<sub>t</sub> - η∇L). The rewind distance is a few hundred iterations on small models and a few epochs on ImageNet (around epoch 4 of 90 for ResNet-50, about 20k iterations). With rewinding, IMP works at ImageNet scale; rewind to a point much earlier than that and the matching subnetwork stops appearing. Renda et al. [<a href="https://arxiv.org/abs/2003.02389">3</a>] tightened the recipe by comparing rewinding to plain fine-tuning across architectures and showing that rewinding only the learning-rate schedule matches or beats fine-tuning at fixed sparsity.</p>

 <h2>Softening the claims</h2>

 <p>The revised hypothesis is weaker than the original. The earlier claim was that the winning ticket exists at random initialization. The follow-up softens this: a short window of training (a few hundred iterations on small models, a few epochs on ImageNet) is enough to locate one. By the end of that window, something about the loss landscape L(θ) has been fixed that determines the rest of training, and from that point on a sparse subnetwork pulled out of the surviving weights matches the dense network.</p>

 <h2>Why rewinding works</h2>

 <p>Frankle's companion paper offers an explanation. Fork two runs from the weights at initialization, letting them differ only in SGD noise (data order and augmentation), and they end up in different basins ℬ, so linear interpolation between the two endpoints crosses a high-loss barrier. Fork them instead from a checkpoint taken a small fraction of the way into training (about 1000-2000 iterations on CIFAR-scale networks, roughly the first 3% of the schedule) and they end up in the same basin, with linear interpolation between them staying at low loss throughout. The crossover point is where rewinding starts to work.</p>
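 <p>The operational test behind that claim is linear interpolation: evaluate the loss at θ(α) = (1 - α)θ<sub>A</sub> + αθ<sub>B</sub> for α between 0 and 1 and look for a barrier above the endpoints. A minimal sketch, with the model constructor, the two trained state dicts, and the evaluation function left as placeholders for the reader's own setup:</p>

 <pre><code># Sketch: the linear-interpolation test for "same basin". Given two trained
# parameter sets for the same architecture, evaluate the loss along
# theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b and report the barrier
# above the endpoints. `make_model`, the state dicts, and `eval_loss` are
# placeholders for the reader's own training setup.
import torch

def interpolation_barrier(make_model, theta_a, theta_b, eval_loss, num_points=11):
    losses = []
    for alpha in torch.linspace(0.0, 1.0, num_points):
        model = make_model()
        mixed = {k: (1.0 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}
        model.load_state_dict(mixed)
        model.eval()
        with torch.no_grad():
            losses.append(float(eval_loss(model)))
    return max(losses) - max(losses[0], losses[-1]), losses   # barrier height, full profile

# Usage (placeholders): runs forked before the stability point typically show a
# large barrier; runs forked after it interpolate at low loss throughout.
#   barrier, profile = interpolation_barrier(lambda: resnet20(), sd_run_a, sd_run_b, eval_fn)
</code></pre>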

 <p>The simplest framing of the lottery-ticket findings, given what is now understood about optimization and loss landscapes, is this: once a training run commits to a specific basin ℬ, there is a sparse sub-network within that basin that matches the dense network's performance. Frankle and Carbin's strongest claims fail at larger scales, but their weaker claims have so far held up under every replication that has tested them. This places lottery-ticket results in close alignment with the mode-connectivity literature [<a href="https://arxiv.org/abs/1802.10026">6</a>], particularly its linear-mode-connectivity refinement [<a href="https://arxiv.org/abs/1912.05671">4</a>] and the later permutation-based alignment work [<a href="https://arxiv.org/abs/2209.04836">7</a>].</p>

 <figure>
 <img src="https://rifaki.me/notes/img/frankle-lmc-fig3.png" alt="Figure 3 of Frankle, Dziugaite, Roy, Carbin 2020. Linear-interpolation instability versus fork step k across LeNet (MNIST), ResNet-20 (CIFAR-10), VGG-16 (CIFAR-10), ResNet-50 (ImageNet), Inception-v3 (ImageNet). Instability collapses once k passes a small threshold.">
 <figcaption>Figure 3 of Frankle et al. [<a href="https://arxiv.org/abs/1912.05671">4</a>]. Pairs of runs forked from a shared pre-rewinding checkpoint stay linearly connected; pairs forked from initialization do not. Linear mode connectivity is the operational test for "same effective basin".</figcaption>
 </figure>

 <p>Davis Blalock's 2020 MLSys retrospective on pruning, together with the ShrinkBench benchmark he built, is the survey I keep returning to on what holds up after the Frankle-to-Liu exchange. Blalock separates the stronger and weaker forms of the hypothesis and argues that "checkpoint pruning" is a better name than "lottery ticket pruning" for the late-rewinding procedures that actually work at scale. Google's <a href="https://research.google/pubs/the-state-of-sparsity-in-deep-neural-networks/">State of Sparsity</a> and <a href="https://research.google/pubs/rigging-the-lottery-making-all-tickets-winners/">Rigging the Lottery</a> are the other two retrospectives I keep going back to.</p>

 <p>The piece I keep coming back to is that mainstream theory has not picked up the rewinding point as an object in its own right. If someone could pin down precisely when basin-membership becomes determined, that would identify a structural feature of the loss landscape current frameworks do not explain. Lewkowycz's catapult-phase work [<a href="https://arxiv.org/abs/2003.02218">5</a>] and the edge-of-stability literature look like they are circling the same phenomenon from different directions, and the lottery-ticket case is the most direct entry point for tying those threads together.</p>

 <figure class="tweet-embed">
 <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">How do the lottery ticket hypothesis and the loss landscape relate? Winning lottery tickets always find the same, linearly-connected optimum. Check out our (@KDziugaite, <a href="https://twitter.com/roydanroy?ref_src=twsrc%5Etfw">@roydanroy</a>, <a href="https://twitter.com/mcarbin?ref_src=twsrc%5Etfw">@mcarbin</a>) poster at the SEDL workshop (West 121) and our new paper <a href="https://t.co/V9yKTSrNnh">https://t.co/V9yKTSrNnh</a> <a href="https://t.co/uPwQKifo1W">pic.twitter.com/uPwQKifo1W</a></p>&mdash; Jonathan Frankle (@jefrankle) <a href="https://twitter.com/jefrankle/status/1205902384112848899?ref_src=twsrc%5Etfw">December 14, 2019</a></blockquote>
 <figcaption>Jonathan Frankle in 2019 pointing at the bridge between the LTH and mode connectivity. The mode-connectivity reading is the softer form of the hypothesis that holds at scale.</figcaption>
 </figure>

 <h2>Further reading</h2>
        <ul class="further">
 <li><a href="https://research.google/pubs/the-state-of-sparsity-in-deep-neural-networks/">The State of Sparsity in DNNs</a></li>
 <li><a href="https://research.google/pubs/rigging-the-lottery-making-all-tickets-winners/">Rigging the Lottery</a></li>
          <li><a href="https://arxiv.org/abs/2009.08576">J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Pruning neural networks at initialization: Why are we missing the mark? arxiv 2009.08576, 2020</a></li>
          <li><a href="https://openreview.net/forum?id=Uzb45nolTb">T. Kumar, K. Luo, and M. Sellke. No free prune: information-theoretic barriers to pruning at initialization. ICML, 2024</a></li>
        </ul>

<h2>References</h2>

 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1803.03635">1</a>] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arxiv 1803.03635, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1810.05270">2</a>] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. arxiv 1810.05270, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/2003.02389">3</a>] A. Renda, J. Frankle, and M. Carbin. Comparing rewinding and fine-tuning in neural network pruning. arxiv 2003.02389, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/1912.05671">4</a>] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Linear mode connectivity and the lottery ticket hypothesis. arxiv 1912.05671, 2019.</li>
 <li>[<a href="https://arxiv.org/abs/2003.02218">5</a>] A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: The catapult mechanism. arxiv 2003.02218, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/1802.10026">6</a>] T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, and A. G. Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. arxiv 1802.10026, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/2209.04836">7</a>] S. K. Ainsworth, J. Hayase, and S. Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arxiv 2209.04836, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/1903.01611">8</a>] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Stabilizing the lottery ticket hypothesis. arxiv 1903.01611, 2019.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/frankle-fig3.png"/>
    </item>
    <item>
      <title>Double descent</title>
      <link>https://rifaki.me/notes/double-descent/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/double-descent/</guid>
      <pubDate>Sun, 25 Aug 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Belkin&#x27;s bias-variance picture, Nakkiran&#x27;s three axes, the label-noise caveat, and what the phenomenon does and does not say at scale.</p><p><img src="https://rifaki.me/notes/img/belkin-fig1.png" alt="Double descent"/></p>]]></description>
      <content:encoded><![CDATA[<p>The classical U becomes a W: Belkin et al. [<a href="https://arxiv.org/abs/1812.11118">1</a>] argued that test risk, plotted against model capacity, descends a second time in the overparameterized (p > n) regime, and that second descent often goes below the first minimum.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/belkin-fig1.png" alt="Figure 1 of Belkin et al. 2018. Schematic of test risk as a function of model capacity: the classical U-shape to the left of the interpolation threshold and a second descending branch to its right.">
          <figcaption>Figure 1 of Belkin et al. [<a href="https://arxiv.org/abs/1812.11118">1</a>]. Classical bias-variance to the left of the interpolation threshold; a second descent in the overparameterized regime on the right.</figcaption>
        </figure>

        <h2>Nakkiran et al.</h2>

        <p>Nakkiran et al. [<a href="https://arxiv.org/abs/1912.02292">2</a>] made the picture concrete by showing that the W-shape appears along three different axes: model size p, training time, and dataset size n. Model-wise double descent varies the width k of a ResNet f<sub>θ</sub>; epoch-wise double descent varies the number of training steps; sample-wise double descent varies dataset size with everything else held fixed. The shape is the same each time: a test-error peak near the interpolation threshold, then a descent once you push past it.</p>

        <p>Epoch-wise is the most surprising of the three. Within one run, test error gets worse before it gets better. The worst test error sits roughly at the iteration where training loss first hits zero; train past it and the test error drops again.</p>

        <h2>Label noise</h2>

        <p>The sharpest versions of the double descent peak in these papers come with label noise. Nakkiran's headline plots use ten to twenty percent corrupted labels. Without label noise, the peak is much weaker and sometimes absent. Label noise inflates the variance contribution of the model at the interpolation threshold because the model is being asked to memorize random labels at exactly the capacity where memorization is possible but not easy.</p>

        <p>Past the threshold, extra capacity absorbs the noise into higher-frequency components without disturbing the underlying signal. This is the hinge that connects the toy phenomenon to actual deep learning. Belkin's linear-regression result holds at all noise levels but the gap is small without noise; Nakkiran's dramatic curves require label noise to be visible. At modern language-model scale, with clean labels and large models, test loss is close to monotone in parameter count and scaling-law papers fit clean L ∝ C<sup>-α</sup> decay with no visible interpolation peak. The effect is real and the classical bias-variance picture is wrong in the overparameterized regime, but the large peak that gives double descent its name is specific to the label-noise case.</p>

        <p>Boaz Barak's Windows on Theory post makes a version of this argument: the interesting part of double descent is to the right of the peak, not the peak itself. OpenAI's Deep Double Descent post took Nakkiran to a much wider audience and posed the sharper question: given this effect, what kind of complexity control (if any) actually predicts generalization? Google Research's "A new lens on understanding generalization in deep learning" recast double descent in terms of an effective-capacity measure that tracks the empirical curves better than parameter count does. Misha Belkin's Simons Institute talks are the video account I keep sending people who want to see how the picture has changed since 2018.</p>

        <figure class="tweet-embed">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">So, &quot;double descent&quot; is happening b/c DF isn&#39;t really the right quantity for the the x-axis: like, the fact that we are choosing the minimum norm least squares fit actually means that the spline with 36 DF is **less** flexible than the spline with 20 DF. <br><br>Crazy, huh?<br><br>19/</p>&mdash; Daniela Witten (@daniela_witten) <a href="https://twitter.com/daniela_witten/status/1292293122752262145?ref_src=twsrc%5Etfw">August 9, 2020</a></blockquote>
        </figure>

        <figure>
          <img src="https://rifaki.me/notes/img/nakkiran-fig1.png" alt="Figure 1 of Nakkiran et al. 2019. Test error and train error versus ResNet18 width parameter under varying label-noise levels (0%, 5%, 10%, 15%, 20%). Test error peaks near the interpolation threshold and decreases again as width grows.">
          <figcaption>Figure 1 of Nakkiran et al. [<a href="https://arxiv.org/abs/1912.02292">2</a>]. Model-wise double descent grows visibly with label-noise level: a peak at the interpolation threshold, then a second descent in the overparameterized regime.</figcaption>
        </figure>

        <p>I'm closer to Barak's reading than to what filtered down to practitioner intros.</p>

        <p>After the label-noise caveat, two results from this line of work hold up. Classical capacity measures like VC dimension do not extend cleanly to the overparameterized regime and cannot be expected to predict generalization there. And overparameterized networks with astronomically large VC dimensions can sit well below smaller networks in test error on the same task, which suggests the selection is happening elsewhere: from a vast pool of interpolating solutions, the optimizer and the loss landscape pick out a small subset that generalizes.</p>

        <figure>
          <img src="https://rifaki.me/notes/img/nakkiran-fig4.png" alt="Figure 4 of Nakkiran et al. 2019. Left panel labels the classical (under-parameterized) and modern (over-parameterized) regimes around the interpolation threshold; right panel overlays test error across many epochs (color = epochs 1 to 1000) versus ResNet18 width, with an optimal early-stopping envelope.">
          <figcaption>Figure 4 of Nakkiran et al. [<a href="https://arxiv.org/abs/1912.02292">2</a>]. Pulling apart the canonical double-descent shape: the peak sits at the interpolation threshold, and the epoch-coloured family on the right makes epoch-wise double descent visible alongside model-wise.</figcaption>
        </figure>

        <p>What practitioners did with this was simpler than what the theory suggested: once you are in the overparameterized regime and you have compute to spend, bigger is usually better. The second descent has no obvious endpoint, which is why Kaplan-style scaling laws can fit clean power-law decay in compute — they are sitting entirely on the right-hand, log-log-linear side of the W-curve. The dramatic 2019 reading of double descent was that the bias-variance tradeoff is fiction and overfitting no longer exists. The second half of that is trivially untrue (overfitting is easy to produce in any small-data regime). The first half is more delicate: above the interpolation threshold, with implicit min-norm regularization (θ<sup>⋆</sup> = argmin<sub>f<sub>θ</sub>(X)=y</sub> ‖θ‖), larger models tend to generalize better rather than worse. That is a statement about a regime, not a law.</p>

        <figure>
          <div style="padding: 20px 16px; background: #E5DFCB; border: 1px solid #eee; border-radius: 4px; text-align:center;">
            <img src="https://rifaki.me/notes/img/math/0b374448521c2c5d.svg" alt="$$R(p) = \sigma^2 \cdot \frac{n}{|p-n-1|} + \|\beta\|^2 \cdot \max\!\left(0,\, 1 - \frac{n}{p}\right)$$" class="math-display" width="434" height="49"/>
          </div>
          <figcaption>
            Expected test risk of the min-norm ridgeless interpolant with n samples and p features under isotropic covariates, from Hastie et al. [<a href="https://arxiv.org/abs/1903.08560">3</a>]. The first term diverges at p=n and is the interpolation peak. The second shrinks as p → ∞ and is the second descent. Double descent is not a deep-learning phenomenon in any strict sense since it falls out of the min-norm solution to an overparameterized least squares problem and holds only because the optimizer is selecting a specific well-behaved interpolant out of the many that fit the data.
          </figcaption>
        </figure>
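        <p>The same curve is cheap to reproduce directly in that setting. A small simulation of min-norm (pseudoinverse) least squares with isotropic Gaussian features, sweeping the number of features p past n, with the noise level σ exposed so the label-noise dependence of the peak is visible (my own sketch, not from the papers above):</p>

        <pre><code># Sketch: double descent in min-norm (ridgeless) least squares. Test risk peaks
# near p = n and descends again as p grows; the height of the peak is driven by
# the noise level sigma, mirroring the label-noise dependence discussed above.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d_true, sigma = 50, 2000, 200, 0.5
beta = rng.normal(size=d_true) / np.sqrt(d_true)      # true signal spread over all 200 features

def risk_at(p, trials=20):
    errs = []
    for _ in range(trials):
        Z = rng.normal(size=(n + n_test, d_true))
        labels = Z @ beta + sigma * rng.normal(size=n + n_test)
        X_tr, y_tr = Z[:n, :p], labels[:n]            # the model only sees the first p features
        X_te, y_te = Z[n:, :p], labels[n:]
        w = np.linalg.pinv(X_tr) @ y_tr               # min-L2-norm fit (OLS when p is at most n)
        errs.append(np.mean((X_te @ w - y_te) ** 2))
    return float(np.mean(errs))

for p in (5, 20, 40, 48, 50, 52, 60, 100, 200):
    print(f"p={p:4d}   test risk {risk_at(p):8.2f}")
</code></pre>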

        <p>Another claim that does not survive the empirical record is that the peak is always exactly at the interpolation threshold. In practice, the exact location of the peak in Nakkiran's ResNet experiments depends on the effective number of parameters under whatever implicit regularization is in use, not on the total parameter count. The peak does not occur precisely at the width at which training error (L<sub>train</sub>) first reaches zero; it sits slightly past that point, where the network can memorize noisy labels without disrupting the underlying signal.</p>
        <h2>Further reading</h2>
        <ul class="further">
          <li><a href="https://arxiv.org/abs/2001.08361">J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arxiv 2001.08361, 2020</a></li>
          <li><a href="https://arxiv.org/abs/2003.01054">S. d'Ascoli, M. Refinetti, G. Biroli, and F. Krzakala. Double trouble in double descent: Bias and variance(s) in the lazy regime. arxiv 2003.01054, 2020</a></li>
          <li><a href="https://arxiv.org/abs/2003.01897">P. Nakkiran, P. Venkat, S. Kakade, and T. Ma. Optimal regularization can mitigate double descent. arxiv 2003.01897, 2020</a></li>
          <li><a href="https://arxiv.org/abs/2203.03466">G. Yang, E. J. Hu, I. Babuschkin, S. Sidor, X. Liu, D. Farhi, N. Ryder, J. Pachocki, W. Chen, and J. Gao. Tensor Programs V: tuning large neural networks via zero-shot hyperparameter transfer. arxiv 2203.03466, 2022</a></li>
          <li><a href="https://research.google/pubs/understanding-double-descent-requires-a-fine-grained-bias-variance-decomposition/">B. Adlam and J. Pennington. Understanding double descent requires a fine-grained bias-variance decomposition. NeurIPS, 2020</a></li>
          <li><a href="https://iclr-blogposts.github.io/2024/blog/double-descent-demystified/">R. Schaeffer et al. Double descent demystified. ICLR Blogposts, 2024</a></li>
        </ul>

        

<h2>References</h2>

        <ul class="refs">
          <li>[<a href="https://arxiv.org/abs/1812.11118">1</a>] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. arxiv 1812.11118, 2018.</li>
          <li>[<a href="https://arxiv.org/abs/1912.02292">2</a>] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double descent: Where bigger models and more data hurt. arxiv 1912.02292, 2019.</li>
          <li>[<a href="https://arxiv.org/abs/1903.08560">3</a>] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arxiv 1903.08560, 2019.</li>
        </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/belkin-fig1.png"/>
    </item>
    <item>
      <title>Reading Tishby&#x27;s information bottleneck</title>
      <link>https://rifaki.me/notes/information-bottleneck/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/information-bottleneck/</guid>
      <pubDate>Tue, 09 Jul 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>The original Tishby claim, Saxe&#x27;s reply on activation choice, Goldfeld&#x27;s estimator critique, and what survives.</p><p><img src="https://rifaki.me/notes/img/tishby-fig2.png" alt="Reading Tishby&#x27;s information bottleneck"/></p>]]></description>
      <content:encoded><![CDATA[<p>Tishby and Zaslavsky's 2015 paper was, until fairly recently, one of the most-cited papers in deep-learning theory. They described training as proceeding in two distinct phases. In the first, the "fitting" phase, the mutual information between a hidden representation T = f<sub>θ</sub>(X) and the input X rises. In the second, the "compression" phase, the network discards the parts of the input information that are not useful for the prediction.</p>

 <p>The two strongest objections to that picture come from Saxe et al. [<a href="https://openreview.net/forum?id=ry_WPG-A-">2</a>], who argue the compression phase is an artifact of the activation function and the mutual-information estimator, and from Goldfeld et al. [<a href="https://arxiv.org/abs/1810.05728">3</a>], who formalize the estimator critique and re-run the analysis with noisy networks where mutual information is well-defined. The point of this post is to reread Tishby with both of those objections in hand and ask what remains.</p>

 <p>Both objections are worth reading in full; the summaries below are my best attempt to render them faithfully.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/tishby-fig2.png" alt="Figure 2 of Tishby and Zaslavsky 2015. Qualitative information plane: optimal IB limit (black), suboptimal bifurcations (blue), finite-sample distortion bound (red), and a possible path of the layers in a typical DNN (green), with shaded regions marking the compression gap and generalization gap." style="max-width: 75%;">
 <figcaption>Figure 2 of Tishby and Zaslavsky [<a href="https://arxiv.org/abs/1503.02406">1</a>]. The IB rate-distortion bound with deep-network layers placed on it: each layer trades compression of X against retention of information about Y.</figcaption>
 </figure>

 <h2>The actual claim of the original paper</h2>

 <p>The claim, in plain language: early in training, a deep network builds up features that are useful for predicting the target Y, which shows up as I(T; Y) increasing. During the same period I(T; X) also increases, because the hidden layer is just retaining more about the input. Then a second phase kicks in: the network drops the parts of the input information that do not help with the prediction. That second phase is the "compression phase."</p>

 <p>The empirical support was a small tanh network whose information-plane plot showed a clean fitting-then-compression trajectory.</p>

 <h2>Saxe et al. [<a href="https://openreview.net/forum?id=ry_WPG-A-">2</a>] examine the type of activation function used</h2>

 <p>Saxe and coauthors re-ran the experiments with ReLU instead of tanh. The compression phase did not appear: Î(X;T) stayed roughly flat across training. Their explanation: tanh saturation pushes each unit's activations into a small number of values, the binning estimator is sensitive to that quantization, and the combination produces a curve that looks like compression but is really an estimator artifact. In a network without saturating units (or with a less binning-sensitive estimator), the curve is gone.</p>
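
 <p>The estimator point is easy to reproduce at toy scale. The sketch below is mine, not Saxe's code: because T = f(X) is deterministic and the inputs are almost surely distinct, the binning estimate of I(X;T) reduces to the entropy of the binned hidden activations. Growing the weight scale stands in for weight growth over training; with tanh the binned activations collapse onto a few saturated patterns and the estimate falls (apparent "compression"), while with ReLU it stays near log<sub>2</sub> of the sample count. All sizes, the bin count, and the scales are arbitrary demo choices.</p>

 <pre><code>import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_hidden, n_bins = 4096, 10, 8, 30
X = rng.normal(size=(N, d_in))
W = rng.normal(size=(d_in, d_hidden)) / np.sqrt(d_in)

def binned_entropy_bits(T, lo, hi):
    """Entropy of the discretized hidden vector; for a deterministic layer with
    distinct inputs this is exactly what the binning estimate of I(X;T) returns."""
    edges = np.linspace(lo, hi, n_bins + 1)
    codes = np.digitize(T, edges)                          # per-unit bin indices
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

for scale in [0.5, 2.0, 8.0, 32.0]:                        # stand-in for weight growth during training
    Z = scale * (X @ W)
    i_tanh = binned_entropy_bits(np.tanh(Z), -1.0, 1.0)
    i_relu = binned_entropy_bits(np.maximum(Z, 0.0), 0.0, np.abs(Z).max())
    print(f"scale {scale:5.1f}   I_hat tanh = {i_tanh:5.2f} bits   I_hat relu = {i_relu:5.2f} bits")
# The tanh estimate drops as saturation sets in; the ReLU estimate stays near
# log2(N) = 12 bits.  Same network family, same estimator, opposite story.
</code></pre>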

 <p>That is a serious problem for the original story. The theoretical pull of Tishby's framing was that compression looked universal, a property of deep learning itself. If it only shows up for one activation function with one estimator, the universality claim is much weaker.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/saxe-fig1.png" alt="Figure 1 of Saxe et al. ICLR 2018. Four information-plane panels (A, B, C, D). Top row uses a small toy network with the binning estimator: (A) tanh nonlinearity reproduces the Shwartz-Ziv & Tishby fitting-then-compression trajectory; (B) ReLU nonlinearity shows no compression phase. Bottom row uses a 784-1024-20-20-20-10 MNIST network with the Kolchinsky-Tracey KDE estimator: (C) tanh, no compression observed except in the final sigmoidal classification layer; (D) extension under the same KDE setup.">
 <figcaption>Figure 1 of Saxe et al., <a href="https://openreview.net/forum?id=ry_WPG-A-">ICLR 2018</a>. Swapping tanh for ReLU removes the compression phase (A vs B), and re-running with a KDE estimator at MNIST scale (C, D) also fails to reproduce it. The two-phase information-plane story is contingent on the nonlinearity and the estimator, not a property of training.</figcaption>
 </figure>

 <h2>Goldfeld et al. formally quantify issues with estimator selection</h2>

 <p>Goldfeld and coauthors formalized what Saxe had observed. For continuous inputs and a deterministic map T = f(X) from inputs to representations, the mutual information I(T;X) = H(T) - H(T|X) does not measure anything about training: the conditional term degenerates, so I(T;X) is infinite (or, for discrete inputs, a constant equal to H(X)), and the finite numbers showing up in published plots come entirely from the noise injected by the estimator (binning, added Gaussian noise of variance σ<sup>2</sup>, or KDE). Different choices give different numbers, so the published information-plane trajectories were tracking properties of the estimator at least as much as properties of the network.</p>
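
 <p>The degeneracy is just as easy to see. In the toy sketch below (again mine), the layer is fixed and only the bin width changes; the "mutual information" climbs toward log<sub>2</sub>(N) for N samples as the bins shrink, rather than converging to anything intrinsic about the network.</p>

 <pre><code>import numpy as np

rng = np.random.default_rng(0)
N = 4096
X = rng.normal(size=(N, 10))
W = rng.normal(size=(10, 3)) / np.sqrt(10)
T = np.tanh(2.0 * (X @ W))                   # one fixed, deterministic hidden layer

for bins in [4, 16, 64, 256, 1024]:
    edges = np.linspace(-1.0, 1.0, bins + 1)
    codes = np.digitize(T, edges)            # discretize each of the 3 units
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    print(f"{bins:4d} bins   I_hat(X;T) = {-(p * np.log2(p)).sum():5.2f} bits")
# The estimate keeps rising toward log2(N) = 12 bits as the bins shrink: the
# quantity being estimated is not finite for a deterministic map, so the number
# on the plot is a property of the bin width, not of the network.
</code></pre>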

 <p>One reason Tishby's paper still has value despite the empirical claims being discredited is that it offered a third lens on generalization at a time when the dominant lenses were capacity-based (how restrictive or broad the hypothesis class is) and geometry-based (how smooth or rough the loss landscape is around a minimum). Tishby's lens was sufficient-statistics: representations should retain only the information that matters for the prediction task. The terminology stuck even though the original empirical observation did not, and modern self-supervised methods like infoNCE are essentially information-bottleneck objectives in everything but name.</p>

 <p>The paper got most of its public reach through Natalie Wolchover's 2017 Quanta piece, "New Theory Cracks Open the Black Box of Deep Learning," which presented Tishby's claims in their strongest form. Reading that piece today is mostly useful as a reminder of how far ahead of the evidence the rhetoric got.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/goldfeld-fig1.png" alt="Figure 1 of Goldfeld et al. 2019. Estimated $I(X; \mathrm{Bin}(T_\ell))$ over training epochs for layers 1-5 at four binning resolutions (bin size 0.0001, 0.001, 0.01, 0.1). The apparent compression phase appears or disappears depending on the bin size.">
 <figcaption>Figure 1 of Goldfeld et al. [<a href="https://arxiv.org/abs/1810.05728">3</a>]. The same training run produces qualitatively different "information-plane trajectories" depending on the bin size used to estimate mutual information - the compression phase is partly an estimator artefact.</figcaption>
 </figure>

 <p>For readers who want non-paper summaries of this debate, Adrian Colyer's three-part Morning Paper series on Tishby's IB theory and Saxe's reply is the most accessible walkthrough.</p>

 <p>A weaker version of the original claim does survive: networks trained with SGD often end up with representations that are sufficient for the labels and roughly insensitive to label-irrelevant input variation. "Information bottleneck" is a fine descriptive label for that. The strong version, in which training proceeds through two cleanly separated phases divided by a phase transition in I(T;X), has no empirical support, and the further claim that SGD is implicitly minimizing an information-bottleneck objective remains unproven.</p>

 <p>Some incorrect papers end up more useful to a field than correct ones, because they hand it vocabulary it did not have.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/goldfeld-fig2.png" alt="Figure 2 of Goldfeld et al. 2019. Architectural diagram of the noisy DNN: $T_{\ell-1}$ feeds through $\sigma(W_\ell^{(k)} T_{\ell-1} + b_\ell^{(k)})$ to produce a pre-noise hidden $S_\ell(k)$, to which Gaussian noise $Z_\ell(k) \sim \mathcal{N}(0,\beta^2)$ is added to yield the next-layer hidden $T_\ell(k)$." style="max-width: 55%;">
 <figcaption>Figure 2 of Goldfeld et al. [<a href="https://arxiv.org/abs/1810.05728">3</a>]. The noisy-network construction: adding Gaussian noise after each layer makes mutual information well-defined and lets the analysis distinguish genuine compression dynamics from estimator artefacts.</figcaption>
 </figure>

 <p>Saxe's argument alone is not fatal: a noisy version of the network has well-defined mutual information and can be analyzed directly, which gets you out of the estimator trap. Goldfeld et al. did exactly that and found the two-phase trajectory does not hold up across estimator choices once the quantities being plotted are well-defined. After that, the empirical case for Tishby's strong claims is essentially gone.</p>

 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://arxiv.org/abs/1703.00810">R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arxiv 1703.00810, 2017</a></li>
 <li><a href="https://arxiv.org/abs/1612.00410">A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arxiv 1612.00410, 2017</a></li>
 <li><a href="https://arxiv.org/abs/1807.03748">A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arxiv 1807.03748, 2018</a></li>
 <li><a href="https://arxiv.org/abs/2305.18887">K. Kawaguchi, Z. Deng, X. Ji, and J. Huang. How does information bottleneck help deep learning? arxiv 2305.18887, 2023</a></li>
 </ul>

<h2>References</h2>
 
 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1503.02406">1</a>] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. arxiv 1503.02406, 2015.</li>
 <li>[<a href="https://openreview.net/forum?id=ry_WPG-A-">2</a>] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the information bottleneck theory of deep learning. ICLR, 2018.</li>
 <li>[<a href="https://arxiv.org/abs/1810.05728">3</a>] Z. Goldfeld, E. van den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating information flow in deep neural networks. arxiv 1810.05728, 2019.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/tishby-fig2.png"/>
    </item>
    <item>
      <title>On flat minima</title>
      <link>https://rifaki.me/notes/flat-minima/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/flat-minima/</guid>
      <pubDate>Sun, 07 Apr 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Hochreiter and Schmidhuber, Keskar&#x27;s small-batch result, Dinh&#x27;s reparameterization objection, SAM, and where the flatness-generalization debate currently sits.</p><p><img src="https://rifaki.me/notes/img/keskar-fig1.png" alt="On flat minima"/></p>]]></description>
      <content:encoded><![CDATA[<p>Whether flat minima generalize better than sharp ones has been an open question for about seven years. The debate seems to close every year and reopen a year later. Most readers entering the field encounter it as a settled topic in some textbook chapter, in one direction or the other, when in fact it isn't. This is where I think it actually stands.</p>

 <h2>Hochreiter, Schmidhuber, and the original intuition</h2>

 <p>Hochreiter and Schmidhuber introduced the idea in 1997: a minimum that sits in a broad, low-curvature valley should generalize better than one in a sharp valley, on roughly MDL grounds; the broad solution requires fewer bits to specify and is correspondingly less tied to the noise in any one training set. The intuition lay mostly dormant for the next two decades.</p>

 <p>Keskar et al. [<a href="https://arxiv.org/abs/1609.04836">1</a>] reignited the topic by reporting that large-batch SGD with θ<sub>t+1</sub> = θ<sub>t</sub> - η g<sub>t</sub> converges to sharper minima than small-batch SGD on a range of standard benchmarks, with a corresponding gap in test accuracy. The 1D schematic that came with that paper has done a lot of work since: it is the picture more or less every later flat-minima discussion is implicitly arguing about.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/keskar-fig1.png" alt="Figure 1 of Keskar et al. [1]. A 1D schematic of a wide basin around one minimum and a narrow basin around another." width="1600" height="643">
 <figcaption>Figure 1 of Keskar et al. [<a href="https://arxiv.org/abs/1609.04836">1</a>]. The picture that started the modern argument.</figcaption>
 </figure>

 <h2>Dinh et al.'s objection, which should have ended the debate</h2>

 <p>Dinh et al. raised the objection that, on its face, should have ended the debate: sharpness, measured as λ<sub>max</sub>(H) with H = ∇<sup>2</sup> L, is a property of the parameterization, not of the function the network represents. They show explicitly that for any minimum one can find a reparameterization θ → ψ(θ) that scales the Hessian eigenvalues λ<sub>i</sub>(H) to arbitrary values without changing the input-output map. The clean response to this would have been to drop the flat-minima paradigm. The community instead salvaged it by looking for sharpness measures that are invariant under reparameterization. The simplest of these comes from Dziugaite and Roy, who use a PAC-Bayes lens: define sharpness through the largest noise scale σ at which a random weight perturbation ξ ∼ 𝒩(0, σ<sup>2</sup>I) keeps the training loss L(θ + ξ) small in expectation. That measure is reparameterization-invariant by construction and correlates with generalization in their experiments.</p>
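
 <p>Dinh's construction can be checked numerically in a few lines. The sketch below is my own toy (a random parameter point of a 2-4-1 ReLU network rather than a trained minimum, which is all the rescaling argument needs): multiply the first layer by a, divide the second by a, and the network computes exactly the same function while the largest eigenvalue of a finite-difference Hessian of the loss moves by orders of magnitude.</p>

 <pre><code>import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))                 # toy inputs
y = rng.normal(size=16)                      # toy targets

def unpack(theta):
    return theta[:8].reshape(4, 2), theta[8:].reshape(1, 4)

def forward(theta, X):
    W1, W2 = unpack(theta)
    return (np.maximum(X @ W1.T, 0.0) @ W2.T).ravel()    # 2-4-1 ReLU network

def loss(theta):
    return 0.5 * np.mean((forward(theta, X) - y) ** 2)

def hessian_max_eig(theta, eps=1e-3):
    """Largest eigenvalue of a finite-difference Hessian of the loss."""
    d = theta.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (loss(theta + ei + ej) - loss(theta + ei)
                       - loss(theta + ej) + loss(theta)) / eps**2
    return np.linalg.eigvalsh(0.5 * (H + H.T)).max()

theta = 0.5 * rng.normal(size=12)

# Dinh-style rescaling: W1 -> a*W1, W2 -> W2/a.  For positive a and ReLU
# activations this leaves the input-output map, and hence the loss, unchanged.
a = 10.0
W1, W2 = unpack(theta)
theta_rescaled = np.concatenate([(a * W1).ravel(), (W2 / a).ravel()])

print(np.allclose(forward(theta, X), forward(theta_rescaled, X)))   # True: same function
print(hessian_max_eig(theta), hessian_max_eig(theta_rescaled))      # wildly different "sharpness"
</code></pre>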

 <figure>
 <img src="https://rifaki.me/notes/img/dinh-fig1.png" alt="Figure 1 of Dinh et al. 2017. Schematic of an $\epsilon$-flat minimum: a parabolic loss curve in $(\theta, L)$ with a horizontal cutoff at level $\epsilon$ above the minimum, shading the connected set of parameters whose loss stays within $\epsilon$ of the optimum.">
 <figcaption>Figure 1 of Dinh et al. [<a href="https://arxiv.org/abs/1703.04933">2</a>]. The width of the shaded ε-flat region is the geometric quantity flatness intuitions are pointing at - but its size depends on the parameterization, which is the heart of the reparameterization argument.</figcaption>
 </figure>

 <h2>SAM and where flatness wins</h2>

 <p>Building on Dziugaite and Roy's framing, Foret et al. [<a href="https://arxiv.org/abs/2010.01412">3</a>] turned a reparameterization-aware sharpness measure into a training objective: minimize the worst-case loss in an ℓ<sub>2</sub> ball around the current weights. They called the procedure SAM, and it does deliver consistent test-accuracy gains, particularly on architectures without strong built-in inductive biases (vanilla MLPs, plain ViTs without strong augmentation). Behnam Neyshabur, a co-author, has remained one of the more consistent public advocates for SAM as a generalization tool.</p>
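
 <p>For concreteness, one SAM update looks roughly like this, stripped of framework machinery. This is my own minimal sketch of the two gradient evaluations, not the authors' implementation; lr and rho are illustrative values.</p>

 <pre><code>import numpy as np

def sam_step(theta, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware step: ascend to the (first-order) worst-case point in an
    l2 ball of radius rho, take the gradient there, apply it at the original weights."""
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # closed-form inner maximizer, to first order
    g_worst = grad_fn(theta + eps)                # gradient at the perturbed weights
    return theta - lr * g_worst

# Toy usage on a quadratic bowl, just to show the call pattern.
grad = lambda t: 2.0 * t
theta = np.array([3.0, -2.0])
for _ in range(50):
    theta = sam_step(theta, grad)
print(theta)   # hovers very close to the minimum at the origin
</code></pre>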

 <p>On the side that flatness improves generalization at scale, the main public voices are Behnam Neyshabur and collaborators across several papers and talks, and Boaz Barak's posts at Windows on Theory. Ferenc Huszár's inFERENCe is the other blog I keep coming back to on this; he writes carefully about flatness, generalization, and the Bayesian readings sitting under both. The Off-convex blog has good coverage of mode connectivity that puts flatness inside a larger geometric story, which I find more useful than treating it as an independent explanation. From the continuous-time view, the stochastic-diffusion picture of SGD is still the most direct way to see why noisy iterates concentrate near flatter minima. Huszár's adjacent essay "Everything that Works Works Because It Is Bayesian" is the prior-based reading I find most useful.</p>

 <p>Stop the timeline here and the flat-minima view looks basically validated: the naive Hessian definition was broken, but a reparameterization-invariant version of the phenomenon is real and SAM is a way to act on it. Kaddour et al. [<a href="https://arxiv.org/abs/2202.00661">4</a>] complicate that picture. They sweep SAM and SWA against vanilla Adam across architectures and dataset scales, and report that the SAM gain shrinks as either model size or dataset size grows. In the regimes where generalization is most useful to improve, the advantage over a tuned Adam baseline narrows substantially.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/foret-fig1.png" alt="Figure 1 of Foret et al. 2020 (SAM). Left: percent error reduction from SAM across CIFAR10, CIFAR100, ImageNet, finetuning, SVHN, F-MNIST, and noisy CIFAR. Right: 3D loss landscape of a SAM-trained network (smooth blue valley) compared with a sharp, jagged surface from standard training.">
 <figcaption>Figure 1 of Foret et al. [<a href="https://arxiv.org/abs/2010.01412">3</a>]. The empirical headline (error reduction across tasks) plus the loss-landscape contrast that motivates the worst-case-in-a-ball SAM objective.</figcaption>
 </figure>

 <p>I am left with an uneven picture. The naive Hessian definition of sharpness has no causal link to generalization (by Dinh). A reparameterization-invariant version does correlate with generalization at small-to-medium scale, and SAM produces real test-accuracy gains on architectures with weak inductive biases. At very large scale the correlation weakens and the SAM gain fades. I do not have a clean account of why. My guess is that with rich enough data and architectures, the optimizer's trajectory and the data distribution dominate whatever local geometry the final minimum has, and the landscape framing stops being the right description. That is speculation, and I would not put weight on it beyond that.</p>
 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://arxiv.org/abs/1703.11008">G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arxiv 1703.11008, 2017</a></li>
 <li><a href="https://arxiv.org/abs/2106.01548">X. Chen, C.-J. Hsieh, and B. Gong. When vision transformers outperform ResNets without pre-training or strong data augmentations. arxiv 2106.01548, 2021</a></li>
 <li><a href="https://arxiv.org/abs/1803.05407">P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads to wider optima and better generalization. arxiv 1803.05407, 2018</a></li>
 <li><a href="https://arxiv.org/abs/2203.05482">M. Wortsman et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. arxiv 2203.05482, 2022</a></li>
 <li><a href="https://direct.mit.edu/neco/article/9/1/1/6027/Flat-Minima">S. Hochreiter and J. Schmidhuber. Flat minima</a></li>
 </ul>

 

<h2>References</h2>
 
 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/1609.04836">1</a>] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arxiv 1609.04836, 2016.</li>
 <li>[<a href="https://arxiv.org/abs/1703.04933">2</a>] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. arxiv 1703.04933, 2017.</li>
 <li>[<a href="https://arxiv.org/abs/2010.01412">3</a>] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arxiv 2010.01412, 2020.</li>
 <li>[<a href="https://arxiv.org/abs/2202.00661">4</a>] J. Kaddour, L. Liu, R. Silva, and M. J. Kusner. When do flat minima optimizers work? arxiv 2202.00661, 2022.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/keskar-fig1.png"/>
    </item>
    <item>
      <title>Four explanations for Grokking</title>
      <link>https://rifaki.me/notes/grokking/</link>
      <guid isPermaLink="true">https://rifaki.me/notes/grokking/</guid>
      <pubDate>Sat, 24 Feb 2024 12:00:00 +0000</pubDate>
      <dc:creator>Mouhssine Rifaki</dc:creator>
      <description><![CDATA[<p>Power&#x27;s modular-arithmetic finding, Nanda&#x27;s circuit-level analysis, and the four candidate explanations that may all be the same mechanism.</p><p><img src="https://rifaki.me/notes/img/grokking-fig1.png" alt="Four explanations for Grokking"/></p>]]></description>
      <content:encoded><![CDATA[<p>The network does generalize, but only long after it has already fit the training data. The paper is Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets by Power et al. [<a href="https://arxiv.org/abs/2201.02177">1</a>]. I came across it maybe a week after it was posted and didn't really know what to do with it for a while.</p>

 <h2>Why this is a puzzle</h2>

 <p>Standard stories about generalization do not predict a long delay. The VC and Rademacher story ties the generalization gap to the capacity of the hypothesis class relative to the sample size; nothing in it depends on how long you train. The implicit-bias stories say SGD finds a minimum that generalizes, but that minimum should be reached at roughly the same time the training loss converges, not thousands of steps later.</p>

 <p>Grokking says training has a second stage that is not driven by the training loss: something else is moving the weights (θ) during the long stretch where the training loss is already near zero.</p>

 <h2>Four explanations</h2>

 <p>The first explanation - and the one that the original paper kind of points toward - is that weight decay (λ ‖θ‖<sub>2</sub><sup>2</sup>) is a slow regularizer. The paper notes that grokking only happens with weight decay turned on. Once training loss is zero, the weight-decay term keeps pulling the norm of the weights down even though the loss gradient has essentially vanished. The result is slow drift toward a lower-norm solution, which may be the one that generalizes.</p>
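
 <p>A toy version of that drift (my sketch, not the paper's setup): start gradient descent from an interpolating solution of an overparameterized linear problem, so the data-fit gradient is essentially zero, and turn on weight decay. Training error stays tiny throughout while the weight norm keeps falling toward the small-norm interpolant. The learning rate and decay strength are arbitrary demo values.</p>

 <pre><code>import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                                   # overparameterized: many exact interpolants
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta_min_norm = np.linalg.pinv(X) @ y           # the minimum-norm interpolant
v = rng.normal(size=p)
v -= np.linalg.pinv(X) @ (X @ v)                 # project v onto the null space of X
theta = theta_min_norm + 3.0 * v                 # another exact interpolant, much larger norm

lr, wd = 0.01, 0.05
for step in range(2001):
    grad = X.T @ (X @ theta - y) / n + wd * theta          # data-fit gradient + weight decay
    theta -= lr * grad
    if step % 500 == 0:
        mse = np.mean((X @ theta - y) ** 2)
        print(f"step {step:4d}   train MSE {mse:.1e}   "
              f"norm {np.linalg.norm(theta):6.2f}   "
              f"dist to min-norm {np.linalg.norm(theta - theta_min_norm):6.2f}")
# Training error stays near zero the whole time; the decay term alone walks
# theta toward the small-norm solution (strictly the ridge solution, which
# approaches the min-norm interpolant as wd goes to 0).
</code></pre>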

 <figure>
 <img src="https://rifaki.me/notes/img/grokking-fig1.png" alt="Figure 1 of arxiv 2201.02177. Training and validation accuracy on modular arithmetic as a function of optimization step on a log scale. The validation curve stays at chance while the training curve saturates, then jumps to one hundred percent much later." width="1600" height="432">
 <figcaption>Figure 1 of Power et al. [<a href="https://arxiv.org/abs/2201.02177">1</a>]. Training accuracy saturates early; validation accuracy stays flat for orders of magnitude of additional steps and then jumps.</figcaption>
 </figure>

 <p>The second explanation comes via mechanistic interpretability. Neel Nanda and collaborators identified specific circuits inside the small grokking networks that implement modular arithmetic via a Fourier (f̂(ω)) decomposition. The grokking transition is the point at which those circuits finish being assembled: before it, the network has to memorize by brute force; after it, it can actually compute. The circuits-level framing this work builds on is laid out in Elhage et al., which is where I'd send anyone who wants to understand what it could mean to talk about a 'circuit' inside a transformer.</p>
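
 <p>The Fourier algorithm itself fits in a few lines. Here is a toy rendition, mine rather than anything extracted from the trained network: embed each residue as a cosine/sine pair at a single frequency, combine the two embeddings with the product identities to get the cosine and sine of the sum, and score each candidate answer c by cos(2πk(a+b-c)/P). With P prime and the frequency coprime to P, the score is maximized exactly at c = (a+b) mod P; the trained networks spread the same computation over a handful of frequencies.</p>

 <pre><code>import numpy as np

P = 113                       # modulus used in the grokking experiments
k = 14                        # one frequency; the trained network uses several
a, b = 47, 92                 # example input pair

def embed(x):
    """(cos, sin) at frequency k -- the basis the DFT of the embeddings concentrates on."""
    angle = 2 * np.pi * k * x / P
    return np.cos(angle), np.sin(angle)

ca, sa = embed(a)
cb, sb = embed(b)

# Product identities turn the two embeddings into cos/sin of the *sum*:
cos_sum = ca * cb - sa * sb            # cos(2*pi*k*(a+b)/P)
sin_sum = sa * cb + ca * sb            # sin(2*pi*k*(a+b)/P)

# Logit for candidate answer c: cos(2*pi*k*(a+b-c)/P), maximal at c = (a+b) mod P.
c = np.arange(P)
logits = cos_sum * np.cos(2 * np.pi * k * c / P) + sin_sum * np.sin(2 * np.pi * k * c / P)
print(np.argmax(logits), (a + b) % P)  # both print 26
</code></pre>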

 <p>The third explanation, due to Liu, Michaud, and Tegmark in <a href="https://arxiv.org/abs/2210.01117">Omnigrok</a>, is geometric: the generalizing solution lies in a narrow "Goldilocks zone" of weight norms, and grokking is what you see when the optimizer has been started outside that zone and is slowly being walked into it. Weight decay is what does the walking, which is consistent with explanation one. The fourth angle is to step back and read all of this as a single process viewed at different resolutions: grokking is double descent, but resolved over training time rather than over model or dataset size.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/nanda-fig2.png" alt="Figure 2 of Nanda et al. 2023. Left: histogram of fraction-of-variance-explained by degree-2 polynomials over neurons. Right: heatmap of components of $W_L$ corresponding to frequency-14 neurons, showing weight concentrated at the sin/cos basis pair for that frequency.">
 <figcaption>Figure 2 of Nanda et al. [<a href="https://arxiv.org/abs/2301.05217">2</a>]. The grokked network's neurons are well-explained by degree-2 polynomials (left), and individual neurons read off specific Fourier-basis pairs from the embedding (right) - the Fourier circuit is mechanistically visible.</figcaption>
 </figure>

 <p>In this reading, grokking is double descent unfolding in time. The most lucid public articulations of this "double descent over time" reading come from Preetum Nakkiran's writing, and OpenAI's Deep Double Descent writeup paints the picture visually. On the mechanistic side, Neel Nanda wrote an intuitive walkthrough of the Fourier circuit story and maintains a corresponding paper page. Google PAIR's <a href="https://pair.withgoogle.com/explorables/grokking/">Do Machine Learning Models Memorize or Generalize?</a> poses the same question in nearby visual vocabulary.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/nanda-fig5.png" alt="Figure 5 of arxiv 2301.05217. The Discrete Fourier Transform of the grokked network's input embeddings, showing concentration on a small set of frequencies." width="1600" height="435">
 <figcaption>Figure 5 of Nanda et al. [<a href="https://arxiv.org/abs/2301.05217">2</a>]. The DFT of the network's learned embeddings concentrates in a small number of frequencies after the transition: the Fourier-based modular arithmetic circuit made visible.</figcaption>
 </figure>

 <p>These four views are looking at the same puzzle from different angles, and together they read as one story at different levels of abstraction. Weight decay is the optimization pressure: it selects a minimum-norm interpolant, and in modular arithmetic that interpolant admits the Fourier circuit because Fourier captures the low-rank (rank(W) ≪ d) structure of the task. The broader pattern of fast memorization followed by slow compression is the shape double descent takes when you resolve it over time instead of over model size.</p>

 <figure>
 <img src="https://rifaki.me/notes/img/nanda-fig3.png" alt="Figure 3 of Nanda et al. 2023. Average train accuracy (saturates near 1.0 within ~1k epochs), average test accuracy (stays at chance for ~5k epochs then jumps), and corresponding average train/test log-loss curves over epochs. Faded background lines show individual seeds.">
 <figcaption>Figure 3 of Nanda et al. [<a href="https://arxiv.org/abs/2301.05217">2</a>]. The grokking pattern made averaged: training accuracy saturates fast, test accuracy lags by orders of magnitude before its sudden rise.</figcaption>
 </figure>

 <p>The one thing none of these explanations cleanly accounts for is the abruptness of the transition. Smoothly shrinking the norm and smoothly assembling circuits should give smoothly rising validation accuracy, not a near-vertical jump. My read is that the sharpness is largely a measurement artifact: accuracy for a softmax classifier, σ(z)<sub>j</sub> = e<sup>z<sub>j</sub></sup>/∑<sub>k</sub> e<sup>z<sub>k</sub></sup>, is computed through a top-1 argmax, so logits that evolve continuously map onto a piecewise-constant accuracy curve that flips once the right logit crosses its competitor.</p>
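
 <p>A toy illustration of that argmax point (my own, not from any of the papers): let the correct logit overtake its competitor at a constant rate, as a stand-in for logits evolving continuously during training. The cross-entropy falls smoothly through the crossing; the accuracy is a hard step.</p>

 <pre><code>import numpy as np

steps = np.arange(200)
logit_correct = -1.0 + 0.01 * steps            # grows smoothly with "training"
logit_wrong = np.zeros_like(steps, dtype=float)

gap = logit_correct - logit_wrong
loss = np.log1p(np.exp(-gap))                  # cross-entropy: smooth in the gap
accuracy = (gap > 0).astype(float)             # top-1 accuracy: flips at the crossing

print(loss[95:105].round(3))                   # changes smoothly through step 100
print(accuracy[95:105])                        # 0 ... 0 then 1 ... 1: a hard step
</code></pre>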

 <figure class="tweet-embed">
 <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">So what&#39;s behind grokking?<br>Three phases of training:<br>1 Memorization<br>2 Circuit formation: It smoothly TRANSITIONS from memorising to generalising<br>3 Cleanup: Removing the memorised solution<br><br>Test performance needs a general circuit AND no memorisation so Grokking occurs at cleanup! <a href="https://t.co/zLnP92RXKV">pic.twitter.com/zLnP92RXKV</a></p>&mdash; Neel Nanda (@NeelNanda5) <a href="https://twitter.com/NeelNanda5/status/1616590960066203648?ref_src=twsrc%5Etfw">January 21, 2023</a></blockquote>
 <figcaption>Neel Nanda summarizing the three-phase mechanistic account of grokking from [<a href="https://arxiv.org/abs/2301.05217">2</a>]. The transition to generalization happens during cleanup, not during circuit formation, which is why it looks sudden.</figcaption>
 </figure>

 <h2>Further reading</h2>
 <ul class="further">
 <li><a href="https://www.neelnanda.io/mechanistic-interpretability/modular-addition-walkthrough">accessible walkthrough</a></li>
 <li><a href="https://www.neelnanda.io/grokking-paper">dedicated paper page</a></li>
 <li><a href="https://transformer-circuits.pub/2021/framework/index.html">N. Elhage et al. A mathematical framework for transformer circuits. transformer-circuits.pub/2021/framework, 2021</a></li>
 <li><a href="https://arxiv.org/abs/2210.01117">Z. Liu, E. Michaud, and M. Tegmark. Omnigrok: Grokking beyond algorithmic data. arxiv 2210.01117, 2022</a></li>
 <li><a href="https://arxiv.org/abs/2206.04817">V. Thilak, E. Littwin, S. Zhai, O. Saremi, R. Paiss, and J. Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arxiv 2206.04817, 2022</a></li>
 <li><a href="https://arxiv.org/abs/1812.11118">M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. arxiv 1812.11118, 2018</a></li>
 <li><a href="https://arxiv.org/abs/1912.02292">P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double descent: Where bigger models and more data hurt. arxiv 1912.02292, 2019</a></li>
 <li><a href="https://arxiv.org/abs/2303.06173">X. Davies, L. Langosco, and D. Krueger. Unifying grokking and double descent. arxiv 2303.06173, 2023</a></li>
 <li><a href="https://arxiv.org/abs/2501.04697">L. Prieto, M. Barsbey, P. Mediano, and T. Birdal. Grokking at the edge of numerical stability. arxiv 2501.04697, 2025</a></li>
 <li><a href="https://arxiv.org/abs/2311.18817">K. Lyu, J. Jin, Z. Li, S. S. Du, J. D. Lee, and W. Hu. Dichotomy of early and late phase implicit biases can provably induce grokking. arxiv 2311.18817, 2023</a></li>
 </ul>

<h2>References</h2>
 
 <ul class="refs">
 <li>[<a href="https://arxiv.org/abs/2201.02177">1</a>] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: generalization beyond overfitting on small algorithmic datasets. arxiv 2201.02177, 2022.</li>
 <li>[<a href="https://arxiv.org/abs/2301.05217">2</a>] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. arxiv 2301.05217, 2023.</li>
 </ul>]]></content:encoded>
      <media:thumbnail url="https://rifaki.me/notes/img/grokking-fig1.png"/>
    </item>

  </channel>
</rss>
