
Double descent

The classical U becomes a W: there is a second descent in the overparameterized ($p > n$) regime, and that second descent often goes below the first minimum.

Figure 1 of Belkin et al. [1]. Schematic of test risk as a function of model capacity: the classical U-shaped bias-variance curve to the left of the interpolation threshold and a second descending branch in the overparameterized regime to its right.

Nakkiran et al.

Nakkiran et al. [2] made the picture concrete by showing that the W-shape appears along three different axes: model size $p$, training time, and dataset size $n$. Model-wise double descent varies the width $k$ of a ResNet $f_\theta$; epoch-wise double descent varies the number of training steps; sample-wise double descent varies dataset size with everything else held fixed. The shape is the same each time: a test-error peak near the interpolation threshold, then a descent once you push past it.

Epoch-wise is the most surprising of the three. Within one run, test error gets worse before it gets better. The worst test error sits roughly at the iteration where training loss first hits zero; train past it and the test error drops again.

Label noise

The sharpest versions of the double descent peak in these papers come with label noise. Nakkiran's headline plots use ten to twenty percent corrupted labels. Without label noise, the peak is much weaker and sometimes absent. Label noise inflates the variance contribution of the model at the interpolation threshold because the model is being asked to memorize random labels at exactly the capacity where memorization is possible but not easy.
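For concreteness, the corruption itself is simple: replace a random fraction of the training labels with uniformly drawn classes. A minimal sketch (the helper name and the uniform-over-all-classes choice are mine; under this scheme a corrupted label can land back on its true class by chance, and the exact protocol in [2] may differ in that detail):

```python
import numpy as np

def corrupt_labels(y, noise_frac, num_classes, seed=0):
    """Replace a random noise_frac of labels with uniformly random classes."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(noise_frac * len(y)), replace=False)
    y_noisy[idx] = rng.integers(0, num_classes, size=len(idx))
    return y_noisy

# e.g. 15% label noise on a 10-class problem
labels = np.random.default_rng(1).integers(0, 10, size=50_000)
noisy_labels = corrupt_labels(labels, noise_frac=0.15, num_classes=10)
```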

Past the threshold, extra capacity absorbs the noise into higher-frequency components without disturbing the underlying signal. This is the hinge that connects the toy phenomenon to actual deep learning. Belkin's linear-regression result holds at all noise levels but the gap is small without noise; Nakkiran's dramatic curves require label noise to be visible. At modern language-model scale, with clean labels and large models, test loss is close to monotone in parameter count and scaling-law papers fit clean $L \propto C^{-\alpha}$ decay with no visible second peak. The effect is real and the classical bias-variance picture is wrong in the overparameterized regime, but the large peak that gives double descent its name is specific to the label-noise case.
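As an aside on what fitting clean $L \propto C^{-\alpha}$ decay means operationally: a power law is a straight line in log-log space, so $\alpha$ is just minus the fitted slope. The compute and loss values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical (compute, loss) points; not real measurements.
C = np.array([1e18, 1e19, 1e20, 1e21])
L = np.array([3.1, 2.5, 2.0, 1.6])

# L = A * C**(-alpha)  =>  log L = log A - alpha * log C
slope, log_A = np.polyfit(np.log(C), np.log(L), 1)
print(f"alpha ~ {-slope:.3f}")
```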

Boaz Barak's Windows on Theory post makes a version of this argument: the interesting part of double descent is to the right of the peak, not the peak itself. OpenAI's Deep Double Descent post took Nakkiran to a much wider audience and posed the sharper question: given this effect, what kind of complexity control (if any) actually predicts generalization? Google Research's "A new lens on understanding generalization in deep learning" recast double descent in terms of an effective-capacity measure that tracks the empirical curves better than parameter count does. Misha Belkin's Simons Institute talks are the video account I keep sending people who want to see how the picture has changed since 2018.

Figure 1 of Nakkiran et al. [2]. Test and train error versus ResNet18 width at label-noise levels of 0%, 5%, 10%, 15%, and 20%: the test-error peak near the interpolation threshold grows with label noise and decreases again as width grows.

I'm closer to Barak's reading than to what filtered down to practitioner intros.

After the label-noise caveat, two results from this line of work hold up. Classical capacity measures like VC dimension do not extend cleanly to the overparameterized regime and cannot be expected to predict generalization there. And overparameterized networks with astronomically large VC dimensions can sit well below smaller networks in test error on the same task — which implies the loss landscape is doing the selection: from a vast pool of interpolating solutions, the optimizer is picking a small subset that generalizes.

Figure 4 of Nakkiran et al. [2]. Left panel labels the classical (under-parameterized) and modern (over-parameterized) regimes around the interpolation threshold; right panel overlays test error versus ResNet18 width across epochs 1 to 1000 with an optimal early-stopping envelope, making epoch-wise double descent visible alongside model-wise.

What practitioners did with this was simpler than what the theory suggested: once you are in the overparameterized regime and you have compute to spend, bigger is usually better. The second descent has no obvious endpoint, which is why Kaplan-style scaling laws can fit clean power-law decay in compute — they are sitting entirely on the right-hand, log-log-linear side of the W-curve. The dramatic 2019 reading of double descent was that the bias-variance tradeoff is fiction and overfitting no longer exists. The second half of that is trivially untrue (overfitting is easy to produce in any small-data regime). The first half is more delicate: above the interpolation threshold, with implicit min-norm regularization ($\theta^\star = \arg\min_{f_\theta(X)=y} \|\theta\|$), larger models tend to generalize better rather than worse. That is a statement about a regime, not a law.
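In the linear special case that min-norm statement is concrete enough to compute: the pseudoinverse returns the minimum-norm interpolant, which is also the solution gradient descent from zero converges to for least squares. A small sketch with arbitrary sizes (plain random features, not a ResNet):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                      # overparameterized: p > n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Among all theta with X @ theta == y, the pseudoinverse picks the one of
# minimum Euclidean norm.
theta_star = np.linalg.pinv(X) @ y

print(np.allclose(X @ theta_star, y))   # True: the solution interpolates
print(np.linalg.norm(theta_star))       # and it is the smallest such theta
```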

$$R(p) = \sigma^2 \cdot \frac{n}{|p-n-1|} + \|\beta\|^2 \cdot \max\!\left(0,\, 1 - \frac{n}{p}\right)$$
Expected test risk of the min-norm ridgeless interpolant with $n$ samples and $p$ features under isotropic covariates, from Hastie et al. [3]. The first term blows up near $p = n$ and is the interpolation peak; the second shrinks as $p \to \infty$ and is the second descent. Double descent is not a deep-learning phenomenon in any strict sense: it falls out of the min-norm solution to an overparameterized least-squares problem, and it holds only because the optimizer is selecting a specific well-behaved interpolant out of the many that fit the data.
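A quick numerical read of the expression above, with $\sigma^2 = \|\beta\|^2 = 1$ and $n = 100$ just to see the shape (the `risk` helper is mine; the expression is undefined at $p = n + 1$, so that point is skipped):

```python
def risk(p, n=100, sigma2=1.0, beta_norm2=1.0):
    # Evaluates the displayed expression; undefined at p = n + 1.
    return sigma2 * n / abs(p - n - 1) + beta_norm2 * max(0.0, 1.0 - n / p)

for p in [20, 50, 90, 99, 100, 110, 150, 500, 5000]:
    print(f"p={p:5d}  R={risk(p):8.2f}")
# Risk spikes near p = n (the interpolation peak) and falls again as p grows.
```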

Another claim that does not survive the empirical record is that the peak is always exactly at the interpolation threshold. In practice, the location of the peak in Nakkiran's ResNet experiments tracks the effective number of parameters under whatever implicit regularization is in use, not the total parameter count. The peak does not occur precisely at the width at which training error ($L_{\text{train}}$) first reaches zero; it sits slightly past that point, and the second descent only begins once the network has enough spare capacity to memorize the noisy labels without disrupting the underlying signal.

Further reading

References