FlowNIB: An Information Bottleneck Analysis of Bidirectional vs. Unidirectional Language Models

Md Kowsher, Nusrat Jahan Prottasha, Shiyun Xu, Shetu Mohanto, Ozlem Garibay, Niloofar Yousefi, Chen Chen

TL;DR

The paper pairs information theory with empirical NLP analysis to explain why bidirectional language models excel at understanding context. It introduces FlowNIB, a dynamic mutual-information estimator with a schedule that unifies $I(X;Z)$ and $I(Z;Y)$ into a single trajectory per layer, and defines the Optimal Information Coordinate (OIC) to compare representations. Theoretical results show bidirectional representations retain more mutual information about inputs and targets and possess higher effective dimensionality, while FlowNIB enables practical estimation via variational MI bounds and normalization by generalized effective dimensionality. Experiments across 16 NLP datasets demonstrate consistent MI advantages for bidirectional models, with masking-based predictions delivering notable gains; the approach also reveals that smaller bidirectional models can outperform larger unidirectional ones under comparable compute. Overall, FlowNIB provides a principled explanation for bidirectional efficacy and a scalable tool for analyzing information flow in deep language models.

Abstract

Bidirectional language models have better context understanding and perform better than unidirectional models on natural language understanding tasks, yet the theoretical reasons behind this advantage remain unclear. In this work, we investigate this disparity through the lens of the Information Bottleneck (IB) principle, which formalizes a trade-off between compressing input information and preserving task-relevant content. We propose FlowNIB, a dynamic and scalable method for estimating mutual information during training that addresses key limitations of classical IB approaches, including computational intractability and fixed trade-off schedules. Theoretically, we show that bidirectional models retain more mutual information and exhibit higher effective dimensionality than unidirectional models. To support this, we present a generalized framework for measuring representational complexity and prove that bidirectional representations are strictly more informative under mild conditions. We further validate our findings through extensive experiments across multiple models and tasks using FlowNIB, revealing how information is encoded and compressed throughout training. Together, our work provides a principled explanation for the effectiveness of bidirectional architectures and introduces a practical tool for analyzing information flow in deep language models.

Paper Structure

This paper contains 29 sections, 10 theorems, 73 equations, 10 figures, 33 tables, 1 algorithm.

Key Result

Theorem 2.1

Bidirectional representations preserve more mutual information about the input and the output: $I(X; Z_\ell^{\leftrightarrow}) \;\ge\; I(X; Z_\ell^{\rightarrow}) \text{ and } I(Z_\ell^{\leftrightarrow}; Y) \;\ge\; I(Z_\ell^{\rightarrow}; Y).$
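
The two inequalities in Theorem 2.1 can be checked by hand on a toy case. The sketch below is not the paper's proof or estimator: it assumes a hypothetical two-bit input $X = (x_1, x_2)$, a target $Y = x_1 \oplus x_2$, a "unidirectional" representation at position 1 that sees only $x_1$, and a "bidirectional" one that sees both bits. The names `mutual_information`, `z_uni`, `z_bi`, and `label` are illustrative and do not come from the source.

```python
from collections import defaultdict
from itertools import product
from math import log2


def mutual_information(pairs):
    """Exact I(A;B) in bits, treating the listed (a, b) pairs as equally likely."""
    n = len(pairs)
    p_ab = defaultdict(float)
    p_a = defaultdict(float)
    p_b = defaultdict(float)
    for a, b in pairs:
        p_ab[(a, b)] += 1 / n
        p_a[a] += 1 / n
        p_b[b] += 1 / n
    return sum(p * log2(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items())


# Hypothetical toy setup (an assumption, not taken from the paper):
# X is two independent fair bits and the label is their XOR.
inputs = list(product([0, 1], repeat=2))


def label(x):
    return x[0] ^ x[1]


def z_uni(x):
    # Representation at position 1 with left context only: sees x1.
    return (x[0],)


def z_bi(x):
    # Representation at position 1 with both contexts: sees x1 and x2.
    return x


i_x_uni = mutual_information([(x, z_uni(x)) for x in inputs])
i_x_bi = mutual_information([(x, z_bi(x)) for x in inputs])
i_y_uni = mutual_information([(z_uni(x), label(x)) for x in inputs])
i_y_bi = mutual_information([(z_bi(x), label(x)) for x in inputs])

print(f"I(X;Z): uni={i_x_uni:.1f} bits, bi={i_x_bi:.1f} bits")
print(f"I(Z;Y): uni={i_y_uni:.1f} bits, bi={i_y_bi:.1f} bits")
```

In this toy case the left-context representation carries no information about the XOR label, while the two-sided one determines it. Real representations are continuous and high-dimensional, which is why the paper relies on variational bounds rather than exact enumeration.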

Figures (10)

  • Figure 1: Information-plane trajectories under FlowNIB training for (left) DeBERTaV3-Base and (right) MobileLLM-350M on MRPC. Each curve shows mutual information $I(Z;Y)$ versus $I(X;Z)$ over training epochs, colored by epoch progression. A constant offset of $+0.05$ is added to $I(X;Z)$ for each successive layer to visually separate the layerwise trajectories. The green line represents the Optimal Information Coordinate (OIC) across layers.
  • Figure 2: Illustration of representation extraction methods: (a) prediction from CLS-token (bidirectional), (b) prediction from pooled embedding (unidirectional), (c) prediction from masked token (bidirectional), and (d) prediction from next-token generation (unidirectional).
  • Figure 3: Average OIC $I(X;Z)$ (top) and $I(Z;Y)$ (bottom) across all layers for unidirectional and bidirectional LMs over multiple datasets. Bars show dataset-wise and average values, comparing information flow differences between architectures.
  • Figure 4: Mutual information flow comparison between bidirectional (top) and unidirectional (bottom) models across three datasets. The first column shows results on the SICK dataset using DeBERTa-base and MobileLLM-350M. The second column shows SST-2 results using RoBERTa-base and MobileLLM-350M. The third column presents results on the CoLA dataset using DeBERTa-v3-Large and MobileLLM-600M.
  • Figure 5: (Left) Information-plane trajectories under varying step sizes $\delta$ for $\alpha(t)$ in FlowNIB. Each curve shows the progression of mutual information $I(X;Z)$ and $I(Z;Y)$ across 2000 training epochs. (Right) Effective dimensionality $d_{\mathrm{eff}}(Z)$ across layers for different models on MRPC and SST-2. Bidirectional models show higher $d_{\mathrm{eff}}(Z)$ than unidirectional models at every layer.
  • ...and 5 more figures

Theorems & Definitions (22)

  • Definition 1.1: A valid information plane (post hoc)
  • Remark 1.2: Dynamics
  • Definition 1.3: Optimal Information Coordinate (OIC)
  • Theorem 2.1: Full version in the Appendix (label thm:bidirectional_mi)
  • Definition 2.2: Generalized Effective Dimensionality
  • Lemma 2.3: Bidirectional Representations Exhibit Higher Spectral Complexity
  • Theorem A.1: Conditioning Reduces Entropy
  • proof
  • Theorem A.2: Monotonicity of Conditional Entropy
  • proof
  • ...and 12 more
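
Definition 2.2 and Lemma 2.3 concern the spectral complexity of representations, but this page does not reproduce the formula for generalized effective dimensionality. As a rough illustration only, the sketch below uses the participation ratio $(\operatorname{tr} C)^2 / \operatorname{tr}(C^2)$ of the representation covariance $C$, a common spectral measure chosen here as a stand-in; it should not be read as the paper's definition. The function name `participation_ratio` and the sample data are hypothetical.

```python
def participation_ratio(rows):
    """Participation ratio (tr C)^2 / tr(C^2) of the covariance C of `rows`.

    Stand-in for effective dimensionality (an assumption, not the paper's
    Definition 2.2). Equals (sum of eigenvalues)^2 / (sum of squared
    eigenvalues), computed without an eigendecomposition because C is
    symmetric, so tr(C^2) is the sum of its squared entries.
    """
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / n for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / trace_sq


# Hypothetical 2-D representations: one spread over both axes, one on a line.
spread = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
collapsed = [(1.0, 1.0), (-1.0, -1.0), (2.0, 2.0), (-2.0, -2.0)]

pr_spread = participation_ratio(spread)
pr_collapsed = participation_ratio(collapsed)
print(pr_spread, pr_collapsed)
```

Under this stand-in, variance spread evenly over two directions gives a value near 2 and variance confined to a single direction gives a value near 1, which matches the intuition behind comparing $d_{\mathrm{eff}}(Z)$ across architectures.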