Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding

Yunchang Zhu, Liang Pang, Kangxi Wu, Yanyan Lan, Huawei Shen, Xueqi Cheng

TL;DR

Extensive experiments on 14 datasets from three distinct NLU tasks, based on five widely used pre-trained language models, demonstrate the universal effectiveness of comparative loss, which proves particularly superior for models with few parameters or long input.

Abstract

Current natural language understanding (NLU) models have been continuously scaling up, both in terms of model size and input context, introducing more hidden and input neurons. While this generally improves performance on average, the extra neurons do not yield a consistent improvement for all instances. This is because some hidden neurons are redundant, and the noise mixed in input neurons tends to distract the model. Previous work mainly focuses on extrinsically reducing low-utility neurons by additional post- or pre-processing, such as network pruning and context selection, to avoid this problem. Beyond that, can we make the model reduce redundant parameters and suppress input noise by intrinsically enhancing the utility of each neuron? If a model can efficiently utilize neurons, no matter which neurons are ablated (disabled), the ablated submodel should perform no better than the original full model. Based on such a comparison principle between models, we propose a cross-model comparative loss for a broad range of tasks. Comparative loss is essentially a ranking loss on top of the task-specific losses of the full and ablated models, with the expectation that the task-specific loss of the full model is minimal. We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks based on 5 widely used pretrained language models and find it particularly superior for models with few parameters or long input.
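
The abstract describes comparative loss as a ranking loss placed on top of the task-specific losses of the full and ablated models. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: it assumes the losses are listed in order of progressive ablation (index 0 is the full model), and the names `pairwise_hinge`, `comparative_loss_sketch`, `weight`, and `margin`, as well as the choice to average the task losses, are illustrative assumptions.

```python
from typing import Sequence


def pairwise_hinge(losses: Sequence[float], margin: float = 0.0) -> float:
    """Sum hinge penalties over all ordered pairs (i < j).

    losses[0] is the full model's task-specific loss; losses[i] belongs to
    the i-th progressively ablated model. Each later model is assumed to be
    a submodel of every earlier one, so its loss is expected to be no smaller.
    """
    penalty = 0.0
    for i in range(len(losses)):
        for j in range(i + 1, len(losses)):
            # Penalize only when the less-ablated model i does worse than j.
            penalty += max(0.0, losses[i] - losses[j] + margin)
    return penalty


def comparative_loss_sketch(
    losses: Sequence[float], weight: float = 1.0, margin: float = 0.0
) -> float:
    """Hypothetical combination: mean task loss plus weighted ranking penalty."""
    task_term = sum(losses) / len(losses)
    return task_term + weight * pairwise_hinge(losses, margin)


# Losses already in the expected order: no ranking penalty is added.
ordered = [0.2, 0.5, 0.9]
# The full model does worse than its ablated models: every pair is penalized.
violated = [0.9, 0.5, 0.2]
```

With `ordered`, the ranking term vanishes and only the task term remains; with `violated`, each pair contributes the gap between its two losses. In actual training the losses would be differentiable tensors rather than floats.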

Paper Structure

This paper contains 36 sections, 1 theorem, 11 equations, 6 figures, 8 tables, and 1 algorithm.

Key Result

Corollary 1

Suppose $f(x^{(0)}; \bm{\theta}^{(0)})$ is a hereditarily efficient neural model for the input $x^{(0)}$ with respect to the parameter space $\mathbb{R}^{|\bm{\theta}^{(0)}|}$, let $\{f(x^{(i)}; \bm{\theta}^{(i)})\}_{i=1}^{c}$ be its multiple progressively ablated models, where $x^{(i)} \sqsubset x

Figures (6)

  • Figure 1: An illustration of a full neural model (a) and its ablated models (b, c, and d), where a hidden neuron is ablated in (b), an input neuron is ablated in (c), and (d) additionally ablates another input neuron on top of (b). According to the comparison principle, if the full model (a) is an efficient model, the comparative relation between the task-specific losses obtained by these models should be (a) $\le$ (b), (c), (d). If the ablated model (b) is also efficient in its parameter space, then their comparative relation can be further written as (a) $\le$ (b) $\le$ (d). Note that (b, c) and (c, d) are two non-comparable model pairs, because the ablated model (c) is not a submodel of (b) or (d), and vice versa.
  • Figure 2: A Venn diagram of some of the concepts in this paper. Empirical risk minimization (ERM) refers to the minimization of Eq. \ref{eq:emp}; ERM models form a subset of the parameter-efficient models (those satisfying Eq. \ref{eq:pe}). An efficient model in the sense of the comparison principle (the intersecting purple region) must be input-efficient (satisfying Eq. \ref{eq:ie}) in addition to being parameter-efficient. A hereditarily efficient model requires not only the full model but also every one of its ablated models to be efficient, i.e., to satisfy Eq. \ref{eq:ee} in Corollary \ref{corollary:cmp}. The training objective of the comparative loss, Eq. \ref{eq:cmp}, targets models that are both hereditarily efficient and ERM, i.e., the central overlapping grid region.
  • Figure 3: The overview of comparative loss (best viewed in color). Given a data sample $(x, y)$, conventional training typically feeds the input context $x$ into the neural model to obtain the prediction $y^{(0)}$ and then just minimizes the task-specific loss $l^{(0)}$. In contrast, comparative loss not only progressively ablates the original model to minimize multiple task-specific losses $\{l^{(i)}\}_{i=0}^{c}$, but also constrains their comparative relation with a pairwise hinge loss.
  • Figure 4: Average results on eight GLUE datasets as the number of ablation steps changes.
  • Figure 5: Performance curves using different context sizes. (a) PRF models on MARCO Dev; the horizontal dotted line represents the base retrieval model. (b) RC models on HotpotQA Dev.
  • ...and 1 more figure
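
The comparability relation described in the Figure 1 caption can be sketched by representing each model as the set of neurons it ablates. This is only an illustration under stated assumptions: the neuron labels `h1`, `x1`, and `x2` and the helper names are hypothetical, and (d) is assumed to ablate a different input neuron from (c), as the caption's non-comparability of (c, d) implies.

```python
from itertools import combinations

# Each model is identified by the set of neurons it ablates (labels are hypothetical).
ablated = {
    "a": frozenset(),              # full model
    "b": frozenset({"h1"}),        # one hidden neuron ablated
    "c": frozenset({"x1"}),        # one input neuron ablated
    "d": frozenset({"h1", "x2"}),  # (b) plus another input neuron
}


def is_submodel(sub: str, full: str) -> bool:
    """`sub` is a submodel of `full` if it ablates everything `full` ablates."""
    return ablated[full] <= ablated[sub]


def comparable(m: str, n: str) -> bool:
    """Two models are comparable when one is a submodel of the other."""
    return is_submodel(m, n) or is_submodel(n, m)


pairs = list(combinations(sorted(ablated), 2))
comparable_pairs = [p for p in pairs if comparable(*p)]
non_comparable_pairs = [p for p in pairs if not comparable(*p)]
```

Under these assumptions, `non_comparable_pairs` contains exactly (b, c) and (c, d), matching the caption, while the comparable pairs are the ones to which a ranking constraint on task-specific losses could apply.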

Theorems & Definitions (1)

  • Corollary 1