Statistical Deficiency for Task Inclusion Estimation

Loïc Fosse, Frédéric Béchet, Benoît Favre, Géraldine Damnati, Gwénolé Lecorvé, Maxime Darrin, Philippe Formont, Pablo Piantanida

TL;DR

The paper defines tasks as joint probability measures and introduces a relaxed, information-theoretic notion of task inclusion grounded in statistical deficiency. It proposes Information Sufficiency (IS) as a tractable proxy to estimate how well solving one task $U$ informs solving another task $V$ via embeddings produced by fine-tuned models. Validation on synthetic HMM data and an OntoNotes-based NLP pipeline with LoRA-tuned Mistral 7B and Llama 3 8B shows that IS recovers a plausible partial order among linguistic tasks (e.g., SYN ⊂ SRL ⊂ NER, with SUM and COR presenting more challenging roles) and enables a compact summary of inter-task informativeness via Predictive Power (PP). The framework offers a data-efficient, activation-space-based lens to study task structure, suggest data-mixing strategies, and guide orthogonal benchmark design, while acknowledging IS as an approximate proxy and proposing future work to broaden task spaces and improve deficiency estimation.

Abstract

Tasks are central in machine learning, as they are the most natural objects to assess the capabilities of current models. The trend is to build general models able to address any task. Even though transfer learning and multitask learning try to leverage the underlying task space, no well-founded tools are available to study its structure. This study proposes a theoretically grounded setup to define the notion of task and to compute the inclusion between two tasks from a statistical deficiency point of view. We propose information sufficiency as a tractable proxy to estimate the degree of inclusion between tasks, show its soundness on synthetic data, and use it to empirically reconstruct the classic NLP pipeline.

Paper Structure

This paper contains 57 sections, 4 theorems, 54 equations, 14 figures, 12 tables.

Key Result

Theorem 1

Figures (14)

  • Figure 1: Illustration of proposed task comparison framework. $X$ is textual input, $Y$ is reference output, $\hat{Y}$ is system output, $Z$ represents embeddings, ($U$, $V$) is a pair of tasks. $\delta()$ is statistical deficiency and $\mathcal{I}_S$ is the information sufficiency proxy.
  • Figure 2: Layerwise information sufficiency between the Mistral 7B base model and the same model fine-tuned, averaged over the NLP pipeline tasks.
  • Figure 3: Average of $\mathcal{I}_S(\text{row} \rightarrow \text{col})$ across models.
  • Figure 4: Illustration of the Markov chain used for data generation. The quantities $\mathbb{P}_{i, \mathcal{A}}$ refer to the emission probabilities of each state.
  • Figure 5: HMM forward likelihood vs. empirical likelihood of the transformer-based model.
  • ...and 9 more figures

Theorems & Definitions (24)

  • Definition 1: Task
  • Remark 1
  • Definition 2: Lenient-inclusion
  • Definition 3: Deficiency (Le Cam, 1964)
  • Theorem 1: $0$-deficiency
  • Theorem 2: $\varepsilon$-deficiency (Le Cam, 1964)
  • Definition 4: Total variation distance
  • Definition 5: Markov composition operation
  • proof
  • Example 1: Difference
  • ...and 14 more
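
The list above names deficiency, total variation distance, and Markov composition without showing their statements. The sketch below illustrates how these three notions fit together for finite, discrete experiments. It uses the standard textbook forms (total variation as half the L1 distance, composition as a matrix product with a row-stochastic kernel), which are assumed here and may differ in detail from the paper's own definitions. The function names `total_variation`, `compose`, and `deficiency_upper_bound` are hypothetical, and the last one only evaluates a single candidate kernel, so it gives an upper bound on the deficiency rather than the deficiency itself.

```python
import numpy as np


def total_variation(p, q):
    """Total variation distance between two discrete distributions
    (assumed textbook form: half the L1 distance)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()


def compose(experiment, kernel):
    """Push every distribution of an experiment through a Markov kernel.

    experiment: array (n_params, n_x), each row a distribution over x.
    kernel:     array (n_x, n_y), each row a distribution over y given x.
    """
    return np.asarray(experiment, dtype=float) @ np.asarray(kernel, dtype=float)


def deficiency_upper_bound(source, target, kernel):
    """Worst-case (over parameters) total variation between the
    kernel-transformed source experiment and the target experiment.

    The deficiency takes an infimum over all kernels, so the value for
    one fixed kernel can only over-estimate it.
    """
    transformed = compose(source, kernel)
    return max(total_variation(a, b) for a, b in zip(transformed, target))


# Toy source experiment: two parameter values, three outcomes.
source = np.array([[0.5, 0.3, 0.2],
                   [0.1, 0.6, 0.3]])

# Deterministic kernel that merges outcomes 1 and 2 into one outcome.
merge = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 1.0]])

# The target is a coarsening of the source, so the source is sufficient for it.
target = compose(source, merge)

# A kernel that ignores its input and always outputs a fair coin.
uniform = np.full((3, 2), 0.5)

bound_with_merge = deficiency_upper_bound(source, target, merge)
bound_with_uniform = deficiency_upper_bound(source, target, uniform)
```

In this toy case the merging kernel reproduces the target exactly, so its bound is zero, while the uninformative kernel leaves a strictly positive gap. Estimating the actual deficiency would require searching over kernels, which is the step a tractable proxy such as information sufficiency is meant to avoid.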