UniICL: An Efficient Unified Framework Unifying Compression, Selection, and Generation

Jun Gao, Qi Lv, Zili Wang, Tianxiang Wu, Ziqiang Cao, Wenjie Li

TL;DR

UniICL tackles the core inefficiencies of in-context learning by unifying demonstration compression, selection, and generation within a parameter-efficient framework that freezes a large LLM while learning a small adapter and Memory Slot-based representations. Demonstration compression produces Memory Tokens via a learnable projection, and Demonstration Bank caching prevents repeated compression, enabling scaling to $64$-shot ICL within $24$ GB of VRAM. A two-phase training regime combines a language-model objective with a contrastive selection loss to improve demonstration relevance, achieving strong out-of-domain results on tasks like CoLA, SST-2, IMDb, Arxiv, MS MARCO, and MMLU, and competitive MS MARCO passage ranking after fine-tuning. The approach offers practical gains in efficiency and scalability for real-world ICL applications, reducing hardware burden while preserving or improving performance across diverse benchmarks.

Abstract

In-context learning (ICL) enhances the reasoning abilities of Large Language Models (LLMs) by prepending a few demonstrations. It motivates researchers to introduce more examples to provide additional contextual information for the generation. However, existing methods show a significant limitation due to the problem of excessive growth in context length, which causes a large hardware burden. In addition, shallow-relevant examples selected by off-the-shelf tools hinder LLMs from capturing useful contextual information for generation. In this paper, we propose UniICL, a novel Unified ICL framework that unifies demonstration compression, demonstration selection, and final response generation. Furthermore, to boost inference efficiency, we design a tailored compression strategy that allows UniICL to cache compression results into Demonstration Bank (DB), which avoids repeated compression of the same demonstration. Extensive out-of-domain evaluations prove the advantages of UniICL in both effectiveness and efficiency.
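
The caching idea behind the Demonstration Bank can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `DemonstrationBank`, the hash-based key, and the stand-in `toy_compress` function are all assumptions made for illustration, and a real compressor would run the frozen LLM with its learned projection instead of hashing text.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]
MemoryTokens = List[Vector]  # k vectors per demonstration


class DemonstrationBank:
    """Hypothetical cache from demonstration text to its compressed form."""

    def __init__(self, compress: Callable[[str], MemoryTokens]) -> None:
        self._compress = compress
        self._cache: Dict[str, MemoryTokens] = {}
        self.compress_calls = 0  # counts how often the compressor actually runs

    @staticmethod
    def _key(demonstration: str) -> str:
        # Assumption: identical text means identical demonstration.
        return hashlib.sha256(demonstration.encode("utf-8")).hexdigest()

    def get(self, demonstration: str) -> MemoryTokens:
        key = self._key(demonstration)
        if key not in self._cache:
            # Cache miss: compress once and store the result.
            self.compress_calls += 1
            self._cache[key] = self._compress(demonstration)
        return self._cache[key]


def toy_compress(text: str, k: int = 2, dim: int = 4) -> MemoryTokens:
    """Stand-in compressor: deterministic pseudo-vectors derived from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [[digest[i * dim + j] / 255.0 for j in range(dim)] for i in range(k)]


bank = DemonstrationBank(toy_compress)
first = bank.get("Review: great film. Sentiment: positive")
second = bank.get("Review: great film. Sentiment: positive")  # served from cache
```

In this sketch the second lookup returns the stored object without calling the compressor again, which is the property the abstract attributes to the Demonstration Bank.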

Paper Structure

This paper contains 30 sections, 9 equations, 13 figures, and 13 tables.

Figures (13)

  • Figure 1: (a) Prompt compression methods that indiscriminately compress both demonstrations and queries. (b) Retrieval-based demonstration selection methods select lexical demonstrations. (c) UniICL discriminately compresses demonstrations and performs selection upon the compression results.
  • Figure 2: The workflow of Demonstration Bank.
  • Figure 3: Demonstration compression. $k$ Memory Slots are attached behind each demonstration.
  • Figure 4: Demonstration selection.
  • Figure 5: In-context generation. The Memory Tokens from different demonstrations are concatenated horizontally at the input end of Vicuna.
  • ...and 8 more figures
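
The selection and generation steps named in the captions above (selection performed on compression results, then Memory Tokens concatenated at the model input) can be sketched roughly as follows. Everything specific here is an assumption rather than the paper's method: mean pooling, cosine similarity as the ranking score, the existence of a query representation of the same shape, and the function name `select_and_concat` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
MemoryTokens = List[Vector]  # k vectors per demonstration


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average the vectors of one demonstration into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_and_concat(
    query: MemoryTokens, candidates: List[MemoryTokens], m: int
) -> Tuple[List[int], MemoryTokens]:
    """Rank candidates by similarity to the query and join the top m."""
    q = mean_pool(query)
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(q, mean_pool(candidates[i])),
        reverse=True,
    )
    chosen = ranked[:m]
    # Concatenate along the sequence axis: m demonstrations give m * k vectors.
    prefix = [token for i in chosen for token in candidates[i]]
    return chosen, prefix


query = [[1.0, 0.0], [0.8, 0.2]]
candidates = [
    [[0.0, 1.0], [0.1, 0.9]],  # points away from the query
    [[0.9, 0.1], [1.0, 0.0]],  # closest to the query
    [[0.5, 0.5], [0.6, 0.4]],  # in between
]
chosen, prefix = select_and_concat(query, candidates, m=2)
print(chosen)       # [1, 2]
print(len(prefix))  # 4
```

With $k$ vectors per demonstration, the prefix handed to the generator grows by $k$ per selected demonstration rather than by the demonstration's full token count, which is the intuition behind the reduced context length.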