Representation Learning with Conditional Information Flow Maximization

Dou Hu; Lingwei Wei; Wei Zhou; Songlin Hu

TL;DR

We study noisy, potentially redundant information in representations learned for inputs $X$ and targets $Y$ and address this with CIFM, an information-theoretic framework combining Information Flow Maximization (IFM) and Conditional Information Minimization (CIM). CIFM jointly maximizes $I(Y;Z)$ and $I(X;Z)$ to yield informative, uniformly distributed representations while adversarially minimizing $I(X;Z_{\delta}|Y)$ to remove negative redundancies, yielding noise-invariant representations. Empirical results across 13 NLP benchmarks show CIFM improves RoBERTa/BERT performance on classification and regression, with strong generalization in out-of-distribution and data-constrained settings and better transferability. The work provides a principled approach to information flow in neural representations, achieving more sufficient, robust, and transferable features for pre-trained language models.

Abstract

This paper proposes an information-theoretic representation learning framework, named conditional information flow maximization, to extract noise-invariant sufficient representations for the input data and target task. It encourages the learned representations to have good feature uniformity and sufficient predictive ability, which can enhance the generalization of pre-trained language models (PLMs) on the target task. First, an information flow maximization principle is proposed to learn more sufficient representations for the input and target by simultaneously maximizing both input-representation and representation-label mutual information. Unlike the information bottleneck, which minimizes the input-representation information, we maximize it to avoid over-compressing the latent representations. In addition, to mitigate the negative effect of potentially redundant features from the input, we design a conditional information minimization principle that eliminates negative redundant features while preserving noise-invariant features. Experiments on 13 language understanding benchmarks demonstrate that our method effectively improves the performance of PLMs on classification and regression. Further experiments show that the learned representations are more sufficient, robust, and transferable.
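
The contrast with the information bottleneck can be made concrete by writing the objectives side by side. The block below is a sketch rather than the paper's exact formulation: the trade-off weight $\beta$ and the simple additive combination in the last line are assumptions introduced here for illustration, and $Z_{\delta}$ is the representation that appears in the conditional term $I(X;Z_{\delta}|Y)$ of the TL;DR.

```latex
\begin{align}
  % Information bottleneck (standard form, shown for contrast):
  % predict Y while compressing the information Z keeps about X.
  \text{IB:}   \quad & \max_{Z}\; I(Y;Z) - \beta\, I(X;Z) \\
  % Information flow maximization: both terms are maximized.
  \text{IFM:}  \quad & \max_{Z}\; I(X;Z) + I(Y;Z) \\
  % Combined sketch: IFM plus a penalty on input information
  % that is redundant once the label Y is known.
  % (\beta and the additive form are illustrative assumptions.)
  \text{CIFM:} \quad & \max_{Z}\; I(X;Z) + I(Y;Z) - \beta\, I(X;Z_{\delta} \mid Y)
\end{align}
```

Read this way, the only sign that changes between IB and IFM is the one on $I(X;Z)$; the conditional term then plays the role of removing input information that does not help predict $Y$.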

Paper Structure (32 sections, 7 equations, 6 figures, 10 tables)

Introduction
Methodology
Information Flow Maximization
Implementation of IFM
Conditional Information Minimization
Implementation of CIM
CIFM Framework
Experiments
Experimental Setups
Downstream Tasks and Datasets
Comparison Methods
Evaluation Metrics
Implementation Details
Overall Results
Ablation Study
...and 17 more sections

Figures (6)

Figure 1: Venn information diagram comparison of our CIFM with existing principles. The representations learned by each principle are circled by the red dashed line.
Figure 2: Comparison results of CIFM with different MI Estimators and the CE baseline on classification tasks. RoBERTa is the default backbone model.
Figure 3: Results of different methods against different sizes of training set with RoBERTa backbone.
Figure 4: Robust scores against different random perturbation strengths. RoBERTa is the default backbone.
Figure 5: Robust scores against different adversarial perturbation strengths. RoBERTa is the default backbone.
...and 1 more figure
