
Reading Tishby's information bottleneck

Tishby and Zaslavsky's 2015 paper [1] was, until fairly recently, one of the most-cited papers in deep-learning theory. It described training as proceeding in two distinct phases. In the first, the "fitting" phase, the mutual information between a hidden representation $T = f_\theta(X)$ and the input $X$ rises. In the second, the "compression" phase, the network discards the parts of that input information that are not useful for the prediction, and $I(T;X)$ falls.
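For reference, the objective behind this picture is the information-bottleneck Lagrangian (Tishby, Pereira, and Bialek's formulation, which the 2015 paper carries over): find a representation that is as compressed as possible while staying predictive, with $\beta$ setting the trade-off.

$$\min_{p(t \mid x)} \; I(X; T) \;-\; \beta\, I(T; Y)$$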

The two strongest objections to that picture come from Saxe et al. [2], who argue the compression phase is an artifact of the activation function and the mutual-information estimator, and from Goldfeld et al. [3], who formalize the estimator critique and re-run the analysis with noisy networks where mutual information is well-defined. The point of this post is to reread Tishby with both of those objections in hand and ask what remains.

Both objections are worth reading in full; the summaries below are my best attempt to render them faithfully.

Figure 2 of Tishby and Zaslavsky [1]: the qualitative information plane, showing the optimal IB limit (black), suboptimal bifurcations (blue), the finite-sample distortion bound (red), and a possible path of the layers of a typical DNN (green), with shaded regions marking the compression gap and the generalization gap. Each layer trades compression of $X$ against retention of information about $Y$.

The actual claim of the original paper

The claim, in plain language: early in training, a deep network builds up features that are useful for predicting the target $Y$, which shows up as $I(T; Y)$ increasing. During the same period $I(T; X)$ also increases, because the hidden layer is just retaining more about the input. Then a second phase kicks in: the network drops the parts of the input information that do not help with the prediction. That second phase is the "compression phase."
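To make that concrete, here is a minimal sketch of the discretize-and-count estimator this literature relies on, shown for a single scalar activation. The bin count, equal-width binning, and toy data are all illustrative; the real experiments bin every hidden unit and treat the vector of bin indices as one discrete variable.

```python
import numpy as np

def binned_mutual_information(x, t, n_bins=30):
    """Plug-in estimate of I(X; T) from paired samples (x_i, t_i).

    Discretize both variables into equal-width bins, then compute the
    discrete mutual information of the empirical joint distribution.
    """
    joint, _, _ = np.histogram2d(x, t, bins=n_bins)
    joint /= joint.sum()                      # empirical p(x, t)
    px = joint.sum(axis=1, keepdims=True)     # empirical p(x)
    pt = joint.sum(axis=0, keepdims=True)     # empirical p(t)
    mask = joint > 0                          # convention: 0 log 0 = 0
    return float((joint[mask] * np.log(joint[mask] / (px @ pt)[mask])).sum())

# Toy usage: a saturating "layer" of a standard-normal input.
x = np.random.default_rng(0).standard_normal(50_000)
print(binned_mutual_information(x, np.tanh(3.0 * x)))  # finite, even though T = f(X)
```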

The empirical support came in a 2017 follow-up by Shwartz-Ziv and Tishby: a small tanh network whose information-plane plot showed a clean fitting-then-compression trajectory.

Saxe et al. [2] examine the role of the activation function

Saxe and coauthors re-ran the experiments with ReLU instead of tanh. The compression phase did not appear: $\hat{I}(X;T)$ stayed roughly flat across training. Their explanation: tanh saturation pushes each unit's activations into a small number of values, the binning estimator is sensitive to that quantization, and the combination produces a curve that looks like compression but is really an estimator artifact. In a network without saturating units (or with a less binning-sensitive estimator), the curve is gone.
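The mechanism is easy to reproduce under toy assumptions (a single scalar unit, standard-normal input, a growing weight as a stand-in for training). In the original experiments $X$ is discrete and $T$ is deterministic given $X$, so the binned $\hat{I}(X;T)$ reduces to the entropy of the binned activations; that entropy falls as tanh saturates, but for ReLU it is scale-invariant and stays flat.

```python
import numpy as np

def binned_entropy(t, n_bins=30):
    """Entropy in nats of activations after equal-width binning."""
    counts = np.histogram(t, bins=n_bins)[0].astype(float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

x = np.random.default_rng(0).standard_normal(100_000)

# Growing |w| is a stand-in for weights growing during training.
for w in [0.5, 1.0, 2.0, 4.0, 8.0]:
    print(f"w={w:3.1f}  "
          f"tanh: {binned_entropy(np.tanh(w * x)):.3f}  "       # falls: saturation
          f"relu: {binned_entropy(np.maximum(w * x, 0)):.3f}")  # flat: scale-invariant
```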

That is a serious problem for the original story. The theoretical pull of Tishby's framing was that compression looked universal, a property of deep learning itself. If it only shows up for one activation function with one estimator, the universality claim is much weaker.

Figure 1 of Saxe et al. [2], ICLR 2018: four information-plane panels. Top row, a small toy network with the binning estimator: (A) tanh reproduces the Shwartz-Ziv and Tishby fitting-then-compression trajectory; (B) ReLU shows no compression phase. Bottom row, a 784-1024-20-20-20-10 MNIST network with the Kolchinsky-Tracey KDE estimator: (C) tanh shows no compression except in the final sigmoidal classification layer; (D) an extension under the same KDE setup. Swapping tanh for ReLU removes the compression phase, and the KDE re-run at MNIST scale also fails to reproduce it: the two-phase information-plane story is contingent on the nonlinearity and the estimator, not a property of training.

Goldfeld et al. [3] formally quantify issues with estimator selection

Goldfeld and coauthors formalized what Saxe had observed. For a continuous input $X$ and a deterministic map $T = f(X)$ from inputs to representations, the mutual information $I(T;X) = h(T) - h(T|X)$ is not even finite: the conditional differential entropy $h(T|X)$ is $-\infty$ for a deterministic map, so $I(T;X)$ is infinite, and the finite numbers showing up in published plots come entirely from the noise injected by the estimator (binning, added Gaussian noise of variance $\sigma^2$, or a KDE). Different choices give different numbers, so the published information-plane trajectories were tracking properties of the estimator at least as much as properties of the network.
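A small numerical illustration of the point (not Goldfeld et al.'s code; the tanh map and the bin sizes are just chosen to mirror their Figure 1): the same deterministic layer yields whatever number the bin size dictates, growing without bound as the bins shrink.

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(100_000)
t = np.tanh(2.0 * x)  # deterministic layer: the true I(X; T) is infinite

for bin_size in [0.1, 0.01, 0.001, 0.0001]:
    # With T deterministic given X, the binned estimate is just H(bin(T)).
    edges = np.arange(-1.0, 1.0 + bin_size, bin_size)
    counts = np.histogram(t, bins=edges)[0].astype(float)
    p = counts[counts > 0] / counts.sum()
    print(f"bin size {bin_size:6.4f}: I-hat = {-(p * np.log(p)).sum():5.2f} nats")
```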

One reason Tishby's paper still has value despite the empirical claims being discredited is that it offered a third lens on generalization at a time when the dominant lenses were capacity-based (how restrictive or broad the hypothesis class is) and geometry-based (how smooth or rough the loss landscape is around a minimum). Tishby's lens was sufficient statistics: representations should retain only the information that matters for the prediction task. The terminology stuck even though the original empirical observation did not, and modern self-supervised objectives like InfoNCE are essentially information-bottleneck objectives in everything but name.

The paper got most of its public reach through Natalie Wolchover's 2017 Quanta piece, "New Theory Cracks Open the Black Box of Deep Learning," which presented Tishby's claims in their strongest form. Reading that piece today is mostly useful as a reminder of how far ahead of the evidence the rhetoric got.

Figure 1 of Goldfeld et al. [3]: estimated $I(X; \mathrm{Bin}(T_\ell))$ over training epochs for layers 1-5 at four bin sizes (0.0001, 0.001, 0.01, 0.1). The same training run produces qualitatively different information-plane trajectories depending on the bin size used to estimate mutual information; the apparent compression phase appears or disappears with the estimator.

For readers who want non-paper summaries of this debate, Adrian Colyer's three-part Morning Paper series on Tishby's IB theory and Saxe's reply is the most accessible walkthrough.

A weaker version of the original claim does survive: networks trained with SGD often end up with representations that are sufficient for the labels and roughly insensitive to label-irrelevant input variation. "Information bottleneck" is a fine descriptive label for that. The strong version, in which training proceeds through two cleanly separated phases divided by a phase transition in $I(T;X)$, has no empirical support, and the further claim that SGD is implicitly minimizing an information-bottleneck objective remains unproven.

Some incorrect papers end up more useful to a field than correct ones, because they hand it vocabulary it did not have.

Figure 2 of Goldfeld et al. [3]: the noisy-DNN construction. Each layer computes a pre-noise hidden $S_\ell = \sigma(W_\ell T_{\ell-1} + b_\ell)$ and adds Gaussian noise $Z_\ell \sim \mathcal{N}(0, \beta^2 I)$ to produce the next-layer hidden $T_\ell = S_\ell + Z_\ell$. With the noise, mutual information is well-defined, which lets the analysis distinguish genuine compression dynamics from estimator artifacts.

Saxe's argument alone is not fatal: a noisy version of the network has well-defined mutual information and can be analyzed directly, which gets you out of the estimator trap. Goldfeld et al. did exactly that and found that the clean two-phase trajectory does not reappear once the quantities being plotted are well-defined. After that, the empirical case for Tishby's strong claims is essentially gone.
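A minimal sketch of that construction, following the caption above (the tanh nonlinearity and the noise scale are illustrative; Goldfeld et al. treat the variance $\beta^2$ as a design parameter):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_layer(t_prev, W, b, beta=0.1):
    """One layer of a noisy DNN: T_l = sigma(W @ T_{l-1} + b) + Z_l,
    with Z_l ~ N(0, beta^2 I) as in Goldfeld et al. [3].

    The additive noise makes X -> T_l a genuine noisy channel, so
    I(X; T_l) is finite and depends on the learned weights, unlike
    the deterministic case above.
    """
    s = np.tanh(t_prev @ W + b)              # pre-noise hidden S_l
    z = beta * rng.standard_normal(s.shape)  # Gaussian noise Z_l
    return s + z                             # next-layer hidden T_l

# Toy usage: a batch of 4 inputs through a 3 -> 2 noisy layer.
t0 = rng.standard_normal((4, 3))
print(noisy_layer(t0, W=rng.standard_normal((3, 2)), b=np.zeros(2)))
```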

Further reading

References

[1] N. Tishby and N. Zaslavsky. Deep Learning and the Information Bottleneck Principle. IEEE Information Theory Workshop (ITW), 2015.
[2] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the Information Bottleneck Theory of Deep Learning. ICLR, 2018.
[3] Z. Goldfeld, E. van den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating Information Flow in Deep Neural Networks. ICML, 2019.