The edge of stability
Cohen et al. [1] observed that gradient descent on neural networks spends most of training in a regime where the top Hessian eigenvalue $\lambda_{\max}$ is above the classical stability threshold $2/\eta$. The step is formally unstable ($\eta \lambda_{\max} > 2$), but the loss does not diverge: the trajectory oscillates along the unstable direction while continuing to make progress on the rest. They called this the edge of stability, and the point is that it is not an edge case but the ordinary regime of neural-network training.
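The threshold itself is just the classical analysis of gradient descent on a quadratic: the update contracts the iterate by a factor $|1 - \eta\lambda|$ per step, which is below one exactly when $\eta\lambda < 2$. A minimal sketch of that arithmetic (my numbers, not from the paper):

```python
# Gradient descent on a 1-D quadratic L(w) = (lam / 2) * w**2.
# The update is w <- (1 - eta * lam) * w, which contracts iff eta * lam < 2.
def gd_on_quadratic(lam, eta, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= eta * lam * w          # gradient of (lam / 2) * w**2 is lam * w
    return w

eta = 0.1                            # stability threshold is 2 / eta = 20
for lam in [15.0, 19.9, 20.1]:
    w_final = gd_on_quadratic(lam, eta)
    print(f"lambda={lam:5.1f}  eta*lambda={eta * lam:5.2f}  |w_50|={abs(w_final):.2e}")
```

Just below the threshold the iterate shrinks; just above it, it grows every step. The surprise in Cohen et al. is that real networks spend most of training on the wrong side of this line without diverging.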
Progressive sharpening
The first phase Cohen et al. identified is progressive sharpening. From random initialization, gradient descent reliably drives the sharpness ($\lambda_{\max}(H)$) of the loss landscape upward during training.
That direction is already counterintuitive. Classical optimization theory says you want to avoid sharp regions, since the stable step size shrinks as curvature grows. Neural-network training does the opposite: it walks into sharper regions until the classical step size is no longer stable. Progressive sharpening itself is not well understood from first principles. Damian et al. [2] give a self-stabilization argument for what happens after the threshold is reached, and Ahn et al. [4] analyze convergence in the same unstable regime, but neither predicts the sharpening from initialization. The observational picture is clean; the theoretical one is not.
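Sharpness in these plots is a measured quantity, not a theoretical construct. A sketch of the standard way to estimate it, power iteration on Hessian-vector products so the Hessian is never formed explicitly; the model, data, and iteration count below are placeholders:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
X, y = torch.randn(256, 10), torch.randn(256, 1)
params = [p for p in model.parameters() if p.requires_grad]

def top_hessian_eigenvalue(n_iters=50):
    loss = nn.functional.mse_loss(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(n_iters):
        # Hessian-vector product: differentiate <grad, v> with respect to the parameters.
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        Hv = torch.autograd.grad(gv, params, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in Hv))
        v = [h / norm for h in Hv]
    return norm.item()   # after convergence, ||Hv|| for unit v approximates |lambda_max|

print(f"estimated sharpness lambda_max ~= {top_hessian_eigenvalue():.3f}")
```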
Above the threshold
Once sharpness crosses $2/\eta$, the textbook prediction is divergence; instead the trajectory oscillates along the top eigendirection of the Hessian. The component of the iterate along that direction swings back and forth, and the loss still falls because optimization keeps making progress on the better-conditioned directions. The net result is a trajectory that reduces loss while sitting in a region of the landscape that classical theory says it should not occupy. The first theoretical account of why this need not be a pathology comes from Arora et al. [3]. Their analysis works on a smoothed version of the loss, where the Hessian is treated as locally fixed, and in effect tracks the trajectory you would follow from a given starting point under that fixed Hessian.

The mechanism is that the oscillation $\Delta \theta_t$ along the unstable direction averages to zero, so the effective dynamics are slower and look like gradient descent on a loss with the steepest direction clipped. The account is formal enough to be checked against real training runs, and in most common settings it holds up.
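A toy version of the picture, illustrative rather than taken from any of the papers: a two-dimensional quadratic with eigenvalues $21$ and $1$, trained at $\eta = 0.1$, so the sharp direction sits just above $2/\eta = 20$. The sharp coordinate flips sign every step and slowly grows, the flat coordinate converges, and the total loss still falls for a while because the flat coordinate dominates it at the start. On a pure quadratic the oscillation eventually wins and the loss diverges; the content of the self-stabilization analyses is that the real loss is not quadratic, and the growing oscillation feeds back to push the sharpness back down.

```python
import numpy as np

eta = 0.1
lams = np.array([21.0, 1.0])           # sharp direction just above 2 / eta = 20
w = np.array([0.01, 3.0])              # small component along the sharp axis

for t in range(30):
    w = w - eta * lams * w             # GD on L(w) = 0.5 * sum(lams * w**2)
    if t % 5 == 0:
        loss = 0.5 * np.sum(lams * w ** 2)
        print(f"t={t:2d}  sharp-coord={w[0]:+.4f}  flat-coord={w[1]:.4f}  loss={loss:.4f}")
```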
And flat minima
The earlier flat-minima story started with Hochreiter and Schmidhuber and continued with Keskar et al. [7] and the later sharpness-aware minimization literature. Broadly, it ran as follows: SGD with a small batch size produces gradient estimates with some noise $\xi$.
That noise makes the trajectory behave like a random walk, which escapes sharp minima more readily than flat ones, so SGD ends up biased toward flat minima, and that bias was meant to be why neural networks generalize. Edge-of-stability does not contradict the story, but it reshapes it. The learning rate itself caps how sharp a reachable minimum can be: anything with $\lambda_{\max} > 2/\eta$ is unstable for GD, so the trajectory cannot stay there. Gradient descent finds flat minima not because it has noise, but because sharp minima are unstable fixed points under its own dynamics. The original explanation identified the phenomenon and pinned it to the wrong cause.

The reframing lives mostly outside the papers themselves. Off Convex has a few posts on implicit bias, trajectory analysis, and why the classical descent lemma is genuinely misleading for neural networks rather than merely approximate. Ben Recht's ArgMin is the complementary skeptical take for what happens once you leave convex optimization theory and its sandwich assumption $\mu I \preceq H \preceq L\,I$ on the Hessian. Clare Lyle's tutorial walks through the $2/\eta$ arithmetic and ties the phenomenon to warmup (rising $\eta(t)$) and the catapult effect (a loss spike followed by decay) in one frame.
Andreyev and Beneventano (arxiv 2412.20553) extended the story to the mini-batch setting Cohen et al. did not analyze, introducing an "edge of stochastic stability" where the quantity that hovers at $2/\eta$ is the expected directional curvature of mini-batch Hessians, not the top eigenvalue of the full-batch Hessian.
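A hedged sketch of one reasonable reading of that quantity, the curvature of each mini-batch loss along that mini-batch's own gradient, averaged over batches; the model and data below are placeholders, and the paper should be consulted for the exact definition:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
X, y = torch.randn(512, 10), torch.randn(512, 1)
params = [p for p in model.parameters()]

def batch_directional_curvature(xb, yb):
    """Curvature of the mini-batch loss along its own gradient: g^T H g / ||g||^2."""
    loss = nn.functional.mse_loss(model(xb), yb)
    g = torch.autograd.grad(loss, params, create_graph=True)
    g_norm_sq = sum((gi ** 2).sum() for gi in g)
    # H g is the gradient of ||g||^2 / 2 with respect to the parameters.
    Hg = torch.autograd.grad(g_norm_sq / 2, params)
    gHg = sum((gi * hi).sum() for gi, hi in zip(g, Hg))
    return (gHg / g_norm_sq).item()

batch_size = 32
curvatures = [
    batch_directional_curvature(X[i:i + batch_size], y[i:i + batch_size])
    for i in range(0, len(X), batch_size)
]
print(f"mean directional curvature over mini-batches: {sum(curvatures) / len(curvatures):.3f}")
```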
A few previously folklore-level phenomena become intelligible from this picture. Warmup schedules $\eta(t)$, which start small and increase $\eta$ over several thousand iterations, let the network settle into the edge-of-stability regime before $\eta$ reaches its final value; without warmup, the early transient at the full $\eta$ would hit a region too sharp to be stable and diverge. Decreasing-$\eta$ schedules at the end of training raise the threshold $2/\eta$, so sharper minima inside the broad flat region already reached become stable and the trajectory can fine-tune there, which empirically pushes training loss down further.
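The arithmetic behind both schedules fits in a few lines (illustrative numbers, not from any paper): the ceiling on reachable sharpness is $2/\eta(t)$, so warmup lowers that ceiling gradually rather than all at once, and end-of-training decay raises it again.

```python
import math

def eta(t, warmup_steps=2_000, total_steps=20_000, eta_max=0.01):
    """Linear warmup to eta_max, then cosine decay toward zero (illustrative schedule)."""
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_max * 0.5 * (1 + math.cos(math.pi * progress))

for t in [0, 500, 2_000, 10_000, 19_000]:
    e = eta(t)
    print(f"t={t:6d}  eta={e:.6f}  sharpness ceiling 2/eta={2 / e:10.1f}")
```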

Both schedules had been used empirically for years before any principled account existed. Lewkowycz et al.'s catapult mechanism [5], where an initial loss spike sometimes precedes a better final solution, is the same dynamics at a larger scale: a large learning rate pushes the trajectory through a briefly very sharp region, the loss spikes, and the trajectory then settles into a different basin from the one it would have reached at a smaller step size.
The generalization gap is still open
Edge-of-stability gives a clean account of why SGD ends up in flat minima, but it says nothing about why flat minima generalize. Those are distinct questions, and the second one is still open.
Dinh et al. [6] showed that the Hessian-based notion of sharpness is not reparameterization invariant, so sharpness in that form cannot directly control generalization.
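Their construction is easy to reproduce numerically. For a two-parameter linear "network" $f(x) = abx$, rescaling $(a, b) \to (ca, b/c)$ leaves the function and the loss untouched but changes the Hessian; a minimal check (my numbers, not the paper's):

```python
import torch

x, y = torch.tensor(1.0), torch.tensor(2.0)

def loss(params):
    a, b = params
    return (a * b * x - y) ** 2          # same function value for any rescaling (c*a, b/c)

def top_hessian_eig(params):
    H = torch.autograd.functional.hessian(loss, params)
    return torch.linalg.eigvalsh(H).max().item()

theta = torch.tensor([1.0, 2.0])          # a * b = 2 = y, so this is a global minimum
for c in [1.0, 10.0]:
    rescaled = torch.stack([c * theta[0], theta[1] / c])
    print(f"c={c:5.1f}  loss={loss(rescaled).item():.1e}  lambda_max={top_hessian_eig(rescaled):.1f}")
```

Both points are exact minimizers of the same function, yet the measured sharpness differs by an order of magnitude, which is exactly why Hessian sharpness alone cannot be the whole generalization story.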
Cohen's own framing, from the thread announcing the central-flows follow-up (listed under further reading below):

> Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability*. With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
>
> — Jeremy Cohen (@deepcohen), October 1, 2025
Further reading
- deep dive into the edge of stability
- J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. Linear mode connectivity and the lottery ticket hypothesis. arxiv 1912.05671, 2019
- P. M. Long and P. L. Bartlett. Sharpness-aware minimization and the edge of stability. Journal of Machine Learning Research, 2024
- J. M. Cohen, A. Damian, A. Talwalkar, J. Z. Kolter, and J. D. Lee. Understanding optimization in deep learning with central flows. arxiv 2410.24206, 2024
- S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997
References
- [1] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arxiv 2103.00065, 2021.
- [2] A. Damian, E. Nichani, and J. D. Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arxiv 2209.15594, 2022.
- [3] S. Arora, Z. Li, and A. Panigrahi. Understanding gradient descent on edge of stability in deep learning. arxiv 2205.09745, 2022.
- [4] K. Ahn, J. Zhang, and S. Sra. Understanding the unstable convergence of gradient descent. arxiv 2204.01050, 2022.
- [5] A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: The catapult mechanism. arxiv 2003.02218, 2020.
- [6] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. arxiv 1703.04933, 2017.
- [7] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: generalization gap and sharp minima. arxiv 1609.04836, 2016.