Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou

TL;DR

A comprehensive analysis of collaborative methodologies between large and small language models, elucidating their common properties and shedding light on the differential knowledge capabilities of System 2 versus System 1 through the FS-GEN framework.

Abstract

Large Language Models (LLMs) exhibit impressive capabilities across various applications but encounter substantial challenges such as high inference latency, considerable training costs, and the generation of hallucinations. Collaborative decoding between large and small language models (SLMs) presents a promising strategy to mitigate these issues through methods including speculative decoding, contrastive decoding, and emulator or proxy fine-tuning. However, the specifics of such collaborations, particularly from a unified perspective, remain largely unexplored. Inspired by dual-process cognitive theory, we propose a unified framework in this paper, termed Fast and Slow Generating (FS-GEN). Within this framework, LLMs (sometimes along with SLMs) are categorized as System 2 (slow and deliberate), while independent SLMs are designated as System 1 (fast and intuitive). We provide a comprehensive analysis of these collaborative methodologies, elucidating their common properties and shedding light on the differential knowledge capabilities of System 2 versus System 1 through the FS-GEN framework. Our findings indicate that only a small proportion of collaborative interactions (fewer than roughly 20% in most instances) is necessary across various methods. These interactions between System 1 and System 2 conform to a scaling law related to the parameter ratios, enabling predictable collaboration. Furthermore, we explore the specific conditions under which collaboration proves most effective, particularly from an uncertainty perspective, offering novel insights that may guide future optimization efforts. Our research underscores that the fundamental distinction between System 1 and System 2 lies in the uncertainty of next-token predictions, where interventions by System 2 are crucial to support System 1. Code for Reproduction: https://github.com/TsinghuaC3I/FS-GEN
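
To make the idea of uncertainty-triggered collaboration concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each model's next-token distribution is already available per position as a plain dictionary (real decoding would recompute distributions from the generated prefix), and that System 1 defers to System 2 whenever the entropy of its own distribution exceeds a threshold. The function names, the entropy criterion, the threshold value, and the toy distributions are all hypothetical choices made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution given as {token: prob}."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def argmax_token(probs):
    """Return the most probable token (greedy choice)."""
    return max(probs, key=probs.get)


def collaborative_decode(small_steps, large_steps, threshold):
    """Choose one token per position, deferring to the large model when the
    small model is uncertain.

    small_steps / large_steps: lists of {token: prob} dicts, one per position
    (an assumption of this sketch; they are treated as precomputed).
    Returns (tokens, collaboration_frequency), where the frequency is the
    fraction of positions at which the large model intervened.
    """
    if len(small_steps) != len(large_steps):
        raise ValueError("both models must provide one distribution per position")
    tokens = []
    interventions = 0
    for small_probs, large_probs in zip(small_steps, large_steps):
        if entropy(small_probs) > threshold:
            # System 1 is uncertain: System 2 supplies the token.
            tokens.append(argmax_token(large_probs))
            interventions += 1
        else:
            # System 1 is confident: keep its greedy token.
            tokens.append(argmax_token(small_probs))
    frequency = interventions / len(tokens) if tokens else 0.0
    return tokens, frequency


# Toy distributions, invented purely to exercise the code.
small_steps = [
    {"the": 0.9, "a": 0.1},
    {"cat": 0.4, "dog": 0.35, "car": 0.25},  # high entropy: triggers deferral
    {"sat": 0.95, "ran": 0.05},
    {"down": 0.5, "up": 0.5},
    {".": 0.99, "!": 0.01},
]
large_steps = [
    {"the": 0.95, "a": 0.05},
    {"dog": 0.8, "cat": 0.2},
    {"sat": 0.9, "ran": 0.1},
    {"down": 0.7, "up": 0.3},
    {".": 0.99, "!": 0.01},
]

tokens, frequency = collaborative_decode(small_steps, large_steps, threshold=0.8)
```

With these toy inputs only the second position exceeds the threshold, so the large model intervenes once in five positions. Entropy is only one possible uncertainty signal; a max-probability or margin criterion could be substituted in the same loop.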

Paper Structure (23 sections, 2 equations, 23 figures, 1 table)

Figures (23)

  • Figure 1: In the fast-and-slow generation framework, we conceptualize small language models as System 1 and a mixed-scale compound system of large and small language models as System 2. This compound system incorporates various collaborative decoding methods, such as speculative decoding, contrastive decoding, and emulator tuning. In this paper, we investigate the optimal conditions and contexts for applying System 1 and System 2 at the token-level decoding stage.
  • Figure 2: Collaboration frequency between models across various tasks (Qwen Series).
  • Figure 3: Collaboration frequency between models across various tasks (Pythia Series).
  • Figure 4: The relationship between the models' parameter ratios and the lower bounds of collaboration frequency (Qwen Series).
  • Figure 5: The relationship between the models' parameter ratios and the lower bounds of collaboration frequency (Pythia Series).
  • ...and 18 more figures
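
Figures 4 and 5 relate parameter ratios to the lower bounds of collaboration frequency, and the abstract describes this relationship as a scaling law. The sketch below shows one way such a relationship could be fitted, under the assumption (not confirmed by the text above) that it takes a power-law form f = a * r^b, where r is the ratio of large-model to small-model parameters. The function name and the synthetic data points are hypothetical; the points are generated from a known power law only to check that the fit recovers its coefficients, and they are not results from the paper.

```python
import math


def fit_power_law(ratios, frequencies):
    """Least-squares fit of log(f) = log(a) + b * log(r); returns (a, b).

    Assumes a power-law relationship f = a * r**b, which is an
    illustrative assumption rather than the paper's stated form.
    """
    if len(ratios) != len(frequencies):
        raise ValueError("ratios and frequencies must have the same length")
    if len(ratios) < 2:
        raise ValueError("need at least two points to fit a line")
    xs = [math.log(r) for r in ratios]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("ratios must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                          # slope in log-log space
    a = math.exp(mean_y - b * mean_x)      # intercept mapped back to linear space
    return a, b


# Synthetic points generated from f = 0.05 * r**0.5, purely to exercise the fit.
ratios = [2.0, 4.0, 8.0, 16.0]
frequencies = [0.05 * r ** 0.5 for r in ratios]

a, b = fit_power_law(ratios, frequencies)
```

Fitting in log-log space turns the power law into a straight line, so ordinary least squares suffices. With measured collaboration frequencies, the residuals of this fit would indicate whether a power law is an adequate description at all.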