An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding

Dou Hu, Lingwei Wei, Wei Zhou, Songlin Hu

TL;DR

InfoMTL tackles the challenge of learning multi-task language representations that are both sufficient for all tasks and robust to noise and data scarcity. It introduces two information-theoretic principles: SIMax, which maximizes shared input relevance and cross-task target relevance, and TIMin, which compresses task-specific redundancy in the output representations. By integrating these into a single framework, InfoMTL achieves superior performance across six NLP benchmarks and shows clear gains in data-constrained and noisy settings, outperforming a wide range of baselines and even GPT-3.5 in some setups. The work demonstrates that carefully balancing information preservation and compression at both shared and task-specific levels yields more accurate, efficient, and robust multi-task representations for natural language understanding.

Abstract

This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language models (PLMs) under the multi-task paradigm. Firstly, a shared information maximization principle is proposed to learn more sufficient shared representations for all target tasks. It can avoid the insufficiency issue arising from representation compression in the multi-task paradigm. Secondly, a task-specific information minimization principle is designed to mitigate the negative effect of potential redundant features in the input for each task. It can compress task-irrelevant redundant information and preserve necessary information relevant to the target for multi-task prediction. Experiments on six classification benchmarks show that our method outperforms 12 comparative multi-task methods under the same multi-task settings, especially in data-constrained and noisy scenarios. Extensive experiments demonstrate that the learned representations are more sufficient, data-efficient, and robust.
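
The two principles in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes an InfoNCE-style contrastive term as a stand-in for shared information maximization and a Gaussian KL term (as in a variational information bottleneck) as a stand-in for task-specific information minimization. The function names and the weights `alpha` and `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np

def _log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss for paired embeddings (row i of z_a pairs with row i of z_b).

    Minimizing it raises a lower bound on the mutual information between the
    two views: I >= log(batch_size) - loss.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    log_prob = _log_softmax(z_a @ z_b.T / temperature)
    return -np.mean(np.diag(log_prob))

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), averaged over the batch.

    Assumed compression penalty on a stochastic task-specific representation.
    """
    kl = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return kl.sum(axis=1).mean()

def cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels."""
    log_prob = _log_softmax(logits)
    return -np.mean(log_prob[np.arange(len(labels)), labels])

def multitask_objective(logits_per_task, labels_per_task, z, z_view,
                        mu_per_task, log_var_per_task, alpha=0.1, beta=0.01):
    """Hypothetical combined loss: task losses + shared term + compression term.

    alpha and beta are illustrative weights, not values from the paper.
    """
    total = alpha * info_nce(z, z_view)  # keep shared representations informative
    for logits, labels, mu, log_var in zip(
            logits_per_task, labels_per_task, mu_per_task, log_var_per_task):
        total += cross_entropy(logits, labels)      # fit each task's targets
        total += beta * gaussian_kl(mu, log_var)    # compress each task's output
    return total
```

In this sketch, `z` and `z_view` would be shared representations of the same inputs under two views (for example, two dropout passes), and `mu_per_task` / `log_var_per_task` would parameterize each task's stochastic output representation. Both choices are assumptions of the example.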

Paper Structure

This paper contains 40 sections, 6 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Comparison of different learning principles under Markov constraints in the MTL paradigm. Given the input variable $X$, shared representations $Z$, task-specific output representations $Z_t$, and the prediction variable $\hat{Y}_t$, the Markov chain for each task $t$ is $Y_t \rightarrow X \rightarrow Z \rightarrow Z_t \rightarrow \hat{Y}_t$.
  • Figure 2: Mutual information analysis results. The x-axis shows the mutual information between the shared representations $Z$ and the input $X$, i.e., $I(X;Z)$. The y-axis shows the mutual information between the shared and output representations, i.e., $I(Z;Z_t)$. Each number on the line is the training epoch, and the optimal epochs are marked with dashed lines.
  • Figure 3: Robust scores (%) against adversarial perturbation strengths. RoBERTa is the default backbone.
  • Figure 4: Robust scores (%) against random perturbation strengths. RoBERTa is the default backbone.
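
The Markov chain in the Figure 1 caption has a direct consequence worth spelling out. By the data processing inequality, information about the target $Y_t$ can only be lost, never gained, at each step along the chain. The inequalities below are a standard result rather than a formula quoted from the paper:

$$
I(Y_t; X) \;\ge\; I(Y_t; Z) \;\ge\; I(Y_t; Z_t) \;\ge\; I(Y_t; \hat{Y}_t),
\qquad
I(X; Z) \;\ge\; I(X; Z_t).
$$

Under the usual definition of sufficiency (assumed here), the shared representation $Z$ is sufficient for task $t$ when the first inequality holds with equality, $I(Y_t; Z) = I(Y_t; X)$. Compressing $Z$ too aggressively makes that inequality strict, which is the insufficiency issue the abstract refers to.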