Rethinking Model Efficiency: Multi-Agent Inference with Large Models

Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian

Abstract

Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can become the bottleneck of the end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiments show that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. An empirical study on diverse real-world benchmarks confirms that a large model can achieve better than or comparable performance to a small model while producing significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps the large model's responses short but transfers the key reasoning tokens from the small model when necessary. Comparisons on benchmark tasks demonstrate that reusing the reasoning tokens from the small model helps the large model approach the performance it attains with its own reasoning, which confirms the effectiveness of our proposal.
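To make the framework concrete, the sketch below gives one plausible reading of the reasoning-transfer and mutual-verification steps described above and in Figure 2. It is a minimal sketch, not the paper's implementation: the `small_model` and `large_model` objects, their `generate` interface, the `extract_answer` convention, the prompt format, and the token budgets are all illustrative assumptions.

# A minimal sketch, assuming hypothetical `small_model` / `large_model`
# objects with a `generate(question, image, max_new_tokens=...)` method
# that returns text; prompt format and token budgets are illustrative.

def extract_answer(text):
    # Assumed convention: the final line of a response carries the answer.
    return text.strip().splitlines()[-1]

def multi_agent_inference(small_model, large_model, question, image):
    # Small agent: cheap per token, but relies on a long reasoning chain.
    small_reasoning = small_model.generate(question, image, max_new_tokens=512)
    small_answer = extract_answer(small_reasoning)

    # Large agent: expensive per token, so its response is kept short.
    large_answer = extract_answer(
        large_model.generate(question, image, max_new_tokens=32))

    # Mutual verification (Figure 2c): if the two agents already agree,
    # return immediately and skip any further large-model call.
    if small_answer == large_answer:
        return large_answer

    # Reasoning transfer (Figure 2b): reuse the small model's reasoning
    # tokens so the large model can revise its still-short response.
    transferred_prompt = f"{question}\n{small_reasoning}"
    return extract_answer(
        large_model.generate(transferred_prompt, image, max_new_tokens=32))

The intended trade-off is that the expensive model stays on a short token budget in the common case, and a second large-model call is issued only when the two agents disagree.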

Paper Structure

This paper contains 44 sections, 2 theorems, 23 equations, 5 figures, 16 tables.

Key Result

Corollary 1

Let $X'_l$ and $X'_s$ be subsets of $X_l$ and $X_s$, respectively, whose attention scores satisfy $\sum_{j:j\in X'_l} q_l^j\geq \delta$ and $\sum_{j:j\in X'_s} q_s^j\geq \delta$, where $\delta$ is a positive constant in $[0.8,1]$. Assuming the norm of tokens is bounded, i.e., $\forall i,\ \|X_l^i\|_2\leq c,\ \|X_s^i\|_2\leq c$, …

Figures (5)

  • Figure 1: Efficiency emerges with scale. (a) Latency grows almost linearly with the number of output tokens, and larger models have a higher per-token cost. (b)-(d) However, smaller models (2B/4B) require far more tokens to achieve performance comparable to larger models (8B).
  • Figure 2: Illustration of the proposed multi-agent inference framework. (a) shows our empirical observation that a large model with a short response can achieve performance similar to the small model with additional reasoning tokens. (b) demonstrates the proposed reasoning-transfer strategy, which reuses the reasoning tokens output by the small model to improve the large model's performance. (c) Our final proposal adopts mutual verification to further reduce the number of expensive model calls for efficient inference.
  • Figure 3: Illustration of the total attention weight on all reasoning tokens and the sparsity within reasoning tokens, averaged over 32 heads. Sparsity is measured as the ratio of reasoning tokens that contribute 80% of the total attention weight on reasoning tokens (a computational sketch follows this list).
  • Figure 4: Head-wise total attention weights of reasoning tokens across different layers.
  • Figure 5: Head-wise sparsity within reasoning tokens across different layers.
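Since the sparsity measure in Figure 3 is fully specified by the caption, a short computation makes it concrete. The function below is an assumed formulation for a single query token and a single attention head (the paper averages over 32 heads); the name `reasoning_attention_stats`, the array layout, and the toy numbers are mine, not the paper's.

import numpy as np

def reasoning_attention_stats(attn, reasoning_idx, threshold=0.8):
    # attn          : 1-D array of attention weights from one query token over
    #                 all context tokens (sums to 1 after softmax).
    # reasoning_idx : indices of the transferred reasoning tokens in the context.
    # threshold     : coverage level used to measure sparsity (80% in Figure 3).
    reasoning_attn = attn[reasoning_idx]

    # Total attention weight assigned to reasoning tokens.
    total_weight = reasoning_attn.sum()

    # Sparsity: fraction of reasoning tokens needed to cover `threshold`
    # of the total attention weight placed on reasoning tokens.
    sorted_weights = np.sort(reasoning_attn)[::-1]
    cumulative = np.cumsum(sorted_weights)
    n_needed = int(np.searchsorted(cumulative, threshold * total_weight) + 1)
    sparsity = n_needed / len(reasoning_idx)

    return total_weight, sparsity

# Toy usage: 10 context tokens, 4 of which are reasoning tokens.
attn = np.array([0.05, 0.30, 0.02, 0.10, 0.03, 0.25, 0.05, 0.05, 0.10, 0.05])
total, sparsity = reasoning_attention_stats(attn, reasoning_idx=[1, 3, 5, 8])
print(total, sparsity)  # 0.75 total weight; 3 of 4 tokens cover 80% of it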

Theorems & Definitions (4)

  • Corollary 1
  • Proposition 1
  • Proof
  • Proof