A Mutual Information Maximization Perspective of Language Representation Learning

Lingpeng Kong; Cyprien de Masson d'Autume; Wang Ling; Lei Yu; Zihang Dai; Dani Yogatama

TL;DR

The paper reframes language representation learning as mutual information maximization, using the InfoNCE lower bound to derive a unifying objective that encompasses Skip-gram, BERT, and XLNet. It introduces InfoWord, a principled self-supervised framework that combines a discounted mutual-information term over sentence-wide representations with a DIM-inspired objective over $n$-grams, optimized via negative sampling. Empirical results on GLUE and SQuAD show that InfoWord yields stronger representations than prior MLM-based approaches, especially with limited data, and highlight the value of span-based and higher-order views. The work provides a bridging perspective across NLP, computer vision, and related domains, suggesting flexible future directions for incorporating richer views and priors into mutual-information-based pretraining.

Abstract

We show that state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspiration from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods that can help transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).
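
To make the lower bound on mutual information concrete, here is a minimal sketch of an InfoNCE-style estimate computed from a matrix of pairwise scores. It is an illustration under stated assumptions, not the paper's implementation: the function name `info_nce_bound`, the plain-list score matrix, and the use of the other items in the batch as negatives are all hypothetical choices made for this example. Entry `scores[i][j]` stands in for a learned critic $f_\theta(a_i, b_j)$, with the matching pair on the diagonal.

```python
import math
from typing import List


def info_nce_bound(scores: List[List[float]]) -> float:
    """Estimate an InfoNCE-style lower bound on I(A; B) in nats.

    scores[i][j] is an assumed critic value f(a_i, b_j); the diagonal
    entry scores[i][i] is the positive pair, and the other entries in
    row i act as negatives (an assumption made for this sketch).
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over all candidates in the row.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        # Log-probability of picking the positive among the n candidates.
        total += row[i] - log_sum_exp
    # Average over the batch and add log(n), the size of the candidate set.
    return total / n + math.log(n)


if __name__ == "__main__":
    uninformative = [[0.0] * 4 for _ in range(4)]
    sharp = [[50.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    print(info_nce_bound(uninformative))
    print(info_nce_bound(sharp))
```

With uninformative scores the estimate is zero, and it can never exceed $\log n$ for $n$ candidates, which is why this family of bounds benefits from larger sets of negatives.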
