Learning to Few-Shot Learn Across Diverse Natural Language Classification Tasks

Trapit Bansal, Rishikesh Jha, Andrew McCallum

TL;DR

The paper tackles the challenge of few-shot learning across diverse NLP classification tasks with varying numbers of labels. It introduces LEOPARD, an optimization-based meta-learning framework that combines a shared Transformer encoder with a task-conditioned softmax parameter generator and a MAML-style adaptation process, distinguishing task-agnostic and task-specific parameters. Trained on GLUE-style tasks and evaluated on 17 unseen NLP tasks, LEOPARD achieves substantial improvements over strong baselines, including notable gains with as few as 4 examples per label and robust cross-domain transfer. These results demonstrate that meta-learning can yield more generalizable initialization for rapid adaptation to new NLP tasks, paving the way for more flexible and data-efficient language understanding systems.

Abstract

Self-supervised pre-training of transformer models has shown enormous success in improving performance on a number of downstream tasks. However, fine-tuning on a new task still requires large amounts of task-specific labelled data to achieve good performance. We consider this problem of learning to generalize to new tasks with few examples as a meta-learning problem. While meta-learning has shown tremendous progress in recent years, its application is still limited to simulated problems or problems with limited diversity across tasks. We develop a novel method, LEOPARD, which enables optimization-based meta-learning across tasks with different numbers of classes, and evaluate different methods on generalization to diverse NLP classification tasks. LEOPARD is trained with the state-of-the-art transformer architecture and shows better generalization to tasks not seen at all during training, with as few as 4 examples per label. Across 17 NLP tasks, including diverse domains of entity typing, natural language inference, sentiment analysis, and several other text classification tasks, we show that LEOPARD learns better initial parameters for few-shot learning than self-supervised pre-training or multi-task training. It outperforms many strong baselines, for example yielding a 14.5% average relative gain in accuracy on unseen tasks with only 4 examples per label.

Paper Structure

This paper contains 21 sections, 6 equations, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: The proposed LEOPARD model. Input is first encoded using the Transformer. The first batch from the support set is passed through the parameter generator, which learns a per-class set representation that is used to generate the initial softmax parameters. Subsequently, the support batches are used to adapt the generated parameters as well as the encoder parameters. The pink dashed outline marks modules that are adapted in the inner loop, whereas modules in blue boxes are optimized in the outer loop.
  • Figure 2: Analyzing target task performance as a function of training tasks (best viewed in color). Each column represents one held-out training task (name on $x$-axis) and each row corresponds to one target task (name on $y$-axis). Each cell is the relative change in performance on the target task when the corresponding training task is held out, compared to training on all the training tasks. Dark blue indicates a large drop, dark red indicates a large increase, and grey indicates close to no change in performance. In general, LEOPARD's performance is more consistent than MT-BERT's, indicating that meta-training learns more generalized initial parameters than multi-task training.
  • Figure 3: Analyzing target task performance as a function of training tasks (best viewed in color). Heatmaps on the left are for LEOPARD and on the right are for MT-BERT. Each column represents one held-out training task (name on $x$-axis) and each row corresponds to one target task (name on $y$-axis). Each cell is the relative change in performance on the target task when the corresponding training task is held out, compared to training on all the training tasks. Dark blue indicates a large drop, dark red indicates a large increase, and grey indicates close to no change in performance. In general, LEOPARD's performance is more consistent than MT-BERT's, indicating that meta-training learns more generalized initial parameters than multi-task training.
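
The parameter-generation step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for the Transformer outputs, the generator is assumed to be a small two-layer MLP whose per-class mean output is split into a weight vector and a bias, and all names (`generate_softmax_params`, `predict_proba`, `demo`, the projection `P`) and dimensions are hypothetical. Inner-loop adaptation and outer-loop training are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes: encoder output dim, generator hidden dim, classifier feature dim.
D, H, L = 8, 16, 4

# Hypothetical generator parameters (these would be learned in the outer loop).
W1, b1 = rng.normal(size=(D, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, L + 1)), np.zeros(L + 1)
# Hypothetical projection from encoder outputs to classifier features.
P = rng.normal(size=(D, L))


def generate_softmax_params(encoded, labels):
    """Build one softmax weight vector and bias per class from a support batch.

    encoded: (batch, D) stand-in encoder outputs; labels: (batch,) integer class ids.
    """
    hidden = np.tanh(encoded @ W1 + b1)   # (batch, H)
    out = hidden @ W2 + b2                # (batch, L + 1)
    classes = np.unique(labels)
    # Per-class set representation: mean of generator outputs over that class.
    per_class = np.stack([out[labels == c].mean(axis=0) for c in classes])
    weights = per_class[:, :-1]           # (num_classes, L)
    biases = per_class[:, -1]             # (num_classes,)
    return weights, biases


def predict_proba(encoded, weights, biases):
    """Softmax over classes using the generated parameters."""
    logits = (encoded @ P) @ weights.T + biases
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def demo(num_classes, shots=4):
    """Run the sketch on a synthetic task with `shots` examples per label."""
    labels = np.repeat(np.arange(num_classes), shots)
    encoded = rng.normal(size=(labels.size, D))  # stand-in for Transformer outputs
    weights, biases = generate_softmax_params(encoded, labels)
    return predict_proba(encoded, weights, biases)
```

Calling `demo(2)` and `demo(3)` reuses the same generator parameters for a 2-class and a 3-class task, which is the property that lets one model handle tasks with different numbers of labels: the classifier's size is determined by the support set rather than fixed in advance.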