The implicit-bias program
Why does training a model without an explicit regularizer, with the loss driven nearly to zero, still produce a solution that generalizes? The classical answer is that the regularization has to be written into the objective somewhere. The implicit-bias answer is more subtle: even when the objective has many minima, gradient descent does not choose among them neutrally; the algorithm itself selects a particular kind of solution.
Zhang et al. [2] sharpened the puzzle that most of this literature now opens with. Standard image classifiers fit random labels as easily as they fit true ones, which kills the simplest capacity-based explanation of generalization: the hypothesis class is large enough to memorize anything. Whatever is producing the generalization, then, must come from the optimizer, the data, or the parameterization, not from the loss function itself.
The clearest version of this story is not actually about neural networks. It is about logistic regression on linearly separable data. Once the classifier separates the data, the empirical classification error is already 0% and the logistic loss continues to decrease as the norm of the weights grows. There is no finite minimizer. What is more surprising is that the direction of the weights still converges. And it converges to the maximum-margin SVM solution.
The linear theorem
Let \(x_i \in \mathbb{R}^d\) denote an input vector with binary label \(y_i \in \{-1,+1\}\), and let \(w_t \in \mathbb{R}^d\) denote the weight vector at iteration \(t\). For linearly separable data \(\{(x_i, y_i)\}_{i=1}^{n}\), standard gradient descent on the logistic loss sends \(\lVert w_t \rVert \to \infty\), but the normalized direction \(w_t / \lVert w_t \rVert\) converges to the L2 max-margin direction \(\hat{w} / \lVert \hat{w} \rVert\), where \[\hat{w} \;=\; \arg\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\quad y_i\,w^{\top} x_i \ge 1 \quad \forall i \in [n].\]
In words: once the classifier has separated the data, the only way for gradient descent to keep reducing the logistic loss is to scale up the weights while preserving the separation, so unlike a hard-margin SVM it never stops at a finite solution. The norm grows only logarithmically with \(t\), and the angle between \(w_t/\lVert w_t\rVert\) and \(\hat w/\lVert\hat w\rVert\) closes only at an \(O(1/\log t)\) rate. This is why the asymptotic regime takes so many iterations to become visible.
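To see this in miniature, here is a small numpy sketch (my own toy construction, not code from [1]). The data are four separable points chosen so that the hard-margin constraint is tight only at \(\pm(1,1)\), which makes the \(\ell_2\) max-margin direction \((1,1)/\sqrt{2}\) by construction; plain gradient descent on the logistic loss then sends the norm up like \(\log t\) while the direction slowly rotates onto that ray.

```python
import numpy as np

# Toy separable dataset (hypothetical). The constraint y_i w.x_i >= 1 is tight
# only at the points +/-(1, 1), so the l2 max-margin direction is (1, 1)/sqrt(2).
X = np.array([[1.0, 1.0], [3.0, 0.5], [-1.0, -1.0], [-3.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)      # known max-margin direction

def grad_logistic(w):
    # Gradient of sum_i log(1 + exp(-y_i w.x_i)); sigmoid(-m_i) computed stably.
    margins = y * (X @ w)
    sig = np.exp(-np.logaddexp(0.0, margins))
    return -(X.T @ (y * sig))

w = np.zeros(2)
for t in range(1, 1_000_001):
    w -= 0.1 * grad_logistic(w)
    if t in (100, 1_000, 10_000, 100_000, 1_000_000):
        direction = w / np.linalg.norm(w)
        print(f"t={t:>9,}  ||w||={np.linalg.norm(w):7.3f}  "
              f"cos(angle to max-margin)={direction @ w_star:.6f}")
```

The norm column should climb by roughly the same increment each time \(t\) grows by a factor of ten, while the cosine creeps toward 1; that slow creep is the \(O(1/\log t)\) rate in practice.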
A few things follow from the theorem. Gradient descent behaves as if it had been regularized toward the Euclidean max-margin classifier without anyone writing that regularizer down. It also gives one account of why early stopping acts like regularization: the weight norm diverges over time, so stopping early caps the norm, imposing a constraint that appears nowhere in the objective. Ji and Telgarsky extended the result to the non-separable case, showing the iterate still tracks a unique ray defined by the data when no separating hyperplane exists. Nacson et al. [4] showed that more aggressive learning-rate schedules close the gap to the max-margin direction at a polynomially fast rate in \(t\), rather than the \(O(1/\log t)\) rate of plain GD. The norm divergence itself is robust to dataset size and dimension: as long as the data is linearly separable, the iterate keeps moving away from the origin and never lands at a finite minimum.
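The flavor of those aggressive schedules is easy to demo, though the sketch below is only in their spirit and not the exact schedule analyzed in [4]: dividing the step size by the current loss keeps the update from shrinking as the margins grow, and on the same toy data as above the angle gap to the max-margin direction closes far faster than with a fixed step.

```python
import numpy as np

# Same toy data as above; the l2 max-margin direction is (1, 1)/sqrt(2).
X = np.array([[1.0, 1.0], [3.0, 0.5], [-1.0, -1.0], [-3.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)

def loss_and_grad(w):
    margins = y * (X @ w)
    loss = np.logaddexp(0.0, -margins).sum()          # sum_i log(1 + exp(-m_i))
    sig = np.exp(-np.logaddexp(0.0, margins))         # sigmoid(-m_i), stable
    return loss, -(X.T @ (y * sig))

def angle_gap(steps, loss_normalized):
    w = np.zeros(2)
    for _ in range(steps):
        loss, grad = loss_and_grad(w)
        # Aggressive schedule (illustrative only): step size inversely
        # proportional to the current loss, with a floor to avoid dividing by 0.
        lr = 0.1 / max(loss, 1e-12) if loss_normalized else 0.1
        w -= lr * grad
    return 1.0 - (w / np.linalg.norm(w)) @ w_star     # 0 means perfectly aligned

for steps in (100, 1_000, 10_000):
    print(f"t={steps:>6}  fixed step gap={angle_gap(steps, False):.1e}  "
          f"loss-normalized gap={angle_gap(steps, True):.1e}")
```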
The geometry enters
Gunasekar et al. [3] generalized the result: different optimization geometries select different implicit regularizers. Steepest descent with respect to the \(\ell_p\) norm converges to the max-\(\ell_p\)-margin direction rather than the max-\(\ell_2\)-margin one. Mirror descent with respect to a convex potential \(\Phi\) selects the minimum-\(\Phi\) interpolant. Natural gradient and adaptive methods land at solutions determined by the geometry of their own steps rather than the Euclidean one.
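A toy way to feel the geometry dependence (my own construction, not an example from [3]): put all the data on the ray \(\pm(2,1,1,1)\), so that the \(\ell_2\) max-margin direction is proportional to \((2,1,1,1)\) while the \(\ell_1\) max-margin solution puts all its weight on the first coordinate. Plain gradient descent tracks the former; greedy coordinate descent, one standard instantiation of steepest descent with respect to the \(\ell_1\) norm, tracks the latter.

```python
import numpy as np

# Two separable points on the ray +/-(2, 1, 1, 1) (hypothetical toy data).
# By construction the l2 max-margin direction is (2, 1, 1, 1)/sqrt(7),
# while the l1 max-margin solution puts all its weight on coordinate 0.
X = np.array([[2.0, 1.0, 1.0, 1.0], [-2.0, -1.0, -1.0, -1.0]])
y = np.array([1.0, -1.0])

def grad(w):
    margins = y * (X @ w)
    sig = np.exp(-np.logaddexp(0.0, margins))   # sigmoid(-m_i), computed stably
    return -(X.T @ (y * sig))

def run(steps, geometry):
    w = np.zeros(4)
    for _ in range(steps):
        g = grad(w)
        if geometry == "l2":            # plain gradient descent
            w -= 0.1 * g
        else:                           # greedy coordinate descent ~ l1 steepest descent
            i = np.argmax(np.abs(g))
            w[i] -= 0.1 * g[i]
    return w / np.linalg.norm(w)

print("gradient descent direction:  ", np.round(run(10_000, "l2"), 3))
print("coordinate descent direction:", np.round(run(10_000, "l1"), 3))
# Expected: roughly (0.756, 0.378, 0.378, 0.378) versus (1, 0, 0, 0).
```

Same loss, same data, same zero initialization; the only thing that changed is which norm defines a "steepest" step, and that is enough to change the limit.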
The linear-convolutional-network result is the cautionary case. For fully connected linear predictors, gradient descent picks out the familiar \(\ell_2\) margin geometry. For full-width linear convolutional networks of depth \(L\), Gunasekar et al. show that gradient descent instead selects the predictor minimizing the \(2/L\)-bridge penalty in the discrete Fourier domain. The architecture changes which parameters are being optimized, and that change shifts both the trajectory and the preferred solution. "Gradient descent likes simple solutions" is too vague to be a theorem. The more honest statement is that gradient descent likes simple solutions in whichever coordinate system the architecture imposes.
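For concreteness, the limit object can be written in terms of the end-to-end linear predictor \(\beta\) that the convolutional network computes (my paraphrase of the result; for depth \(L > 2\) the guarantee is convergence in direction to a first-order stationary point of this program rather than to its global minimum): \[\hat{\beta} \;=\; \arg\min_{\beta}\; \big\lVert \mathrm{DFT}(\beta) \big\rVert_{2/L} \quad \text{s.t.}\quad y_i\,\beta^{\top} x_i \ge 1 \quad \forall i \in [n],\] where \(\mathrm{DFT}(\beta)\) is the discrete Fourier transform of \(\beta\). For \(L = 1\) this recovers the \(\ell_2\) margin by Parseval; for \(L = 2\) it is an \(\ell_1\) penalty on the Fourier coefficients; as \(L\) grows, the \(2/L\) quasi-norm pushes ever harder toward sparsity in the frequency domain.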
Margin in homogeneous networks
Lyu and Li push the result past linear predictors. If \(f_\theta\) is positively homogeneous in \(\theta\) with order \(L\) (which holds for ReLU networks without bias, with \(L\) equal to depth), then gradient flow on exponential or logistic loss drives \(\theta_t / \lVert \theta_t \rVert\) to a KKT point of the parameter-space margin program \(\max_{\lVert \theta \rVert \leq 1}\, \min_i\, y_i f_\theta(x_i)\). The norm still diverges; the normalized direction still converges; the new content is that even on a non-convex parameter landscape, gradient flow lands on points satisfying first-order optimality conditions for the margin program.
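Here is a small numpy sanity check of the objects in that statement (my own toy; the theorem is about gradient flow and a smoothed margin, so a discrete-step run on four points is only an illustration). A bias-free two-layer ReLU network is 2-homogeneous, and once it fits an XOR-style dataset, the normalized margin \(\min_i y_i f_\theta(x_i) / \lVert\theta\rVert^2\) typically keeps creeping upward long after the loss is tiny.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR-style toy data (hypothetical): positive labels on the diagonal quadrants,
# negative labels on the anti-diagonal ones.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

h = 32                                    # hidden width
W = 0.5 * rng.standard_normal((h, 2))     # no biases, so f(theta; x) is 2-homogeneous
a = 0.5 * rng.standard_normal(h)

lr = 0.05
for t in range(1, 200_001):
    Z = X @ W.T                                          # pre-activations, shape (n, h)
    f = np.maximum(Z, 0.0) @ a                           # network outputs f_theta(x_i)
    margins = y * f
    dl_df = -y * np.exp(-np.logaddexp(0.0, margins))     # -y * sigmoid(-y f), stable
    grad_a = np.maximum(Z, 0.0).T @ dl_df
    grad_W = ((dl_df[:, None] * (Z > 0.0)) * a).T @ X    # chain rule through the ReLU
    a -= lr * grad_a
    W -= lr * grad_W
    if t in (1_000, 10_000, 50_000, 200_000):
        loss = np.logaddexp(0.0, -margins).sum()
        norm_sq = np.sum(W * W) + np.sum(a * a)          # ||theta||^2, since L = 2 here
        print(f"t={t:>7,}  loss={loss:.5f}  "
              f"normalized margin={margins.min() / norm_sq:.4f}")
```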
Chizat and Bach prove a parallel result for two-layer networks in the mean-field (infinite-width) regime. The implicit bias there is \(F_1\)-norm minimization in function space, which is a different object from parameter-space margin maximization and interacts differently with the data. The state of the field is that implicit regularization in deep networks has several reasonable descriptions, none of which generalize cleanly past shallow or homogeneous models.
It is widely thought that neural networks generalize because of implicit regularization of gradient descent. Today at #ICLR2023 we show new evidence to the contrary. We train with gradient-free optimizers and observe generalization competitive with SGD. https://t.co/8Vo9rFI9FY
— Tom Goldstein (@tomgoldsteincs) May 2, 2023
My own reading is that the implicit-bias program is one of the few cases I can point to of a research direction being vindicated and outgrown at the same time. The theorem of Soudry et al. [1] is true; the mechanism is real; the linear case is the only setting where I can prove anything I trust. What is unclear is whether the same phenomenon is the dominant explanation for why large feature-learning networks generalize, or whether at scale the data distribution and the architecture have already done so much of the work that the optimizer's preference is a small correction. I currently believe the second, but I do not have a falsifier I trust, which is exactly the position the field is in.
Further reading
- S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv:1806.00468, 2018.
- Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arXiv:1803.07300, 2018.
- K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv:1906.05890, 2019.
- L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv:2002.04486, 2020.
- S. Pesme, L. Pillaud-Vivien, and N. Flammarion. Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity. arXiv:2106.09524, 2021.
- E. Moroshko, S. Gunasekar, B. Woodworth, J. D. Lee, N. Srebro, and D. Soudry. Implicit bias in deep linear classification: initialization scale vs training accuracy. arXiv:2007.06738, 2020.
References
- [1] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. arXiv:1710.10345, 2017.
- [2] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv:1611.03530, 2016.
- [3] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv:1802.08246, 2018.
- [4] M. S. Nacson, J. Lee, S. Gunasekar, P. H. P. Savarese, N. Srebro, and D. Soudry. Convergence of gradient descent on separable data. arXiv:1803.01905, 2019.