Efficient Multi-modal Long Context Learning for Training-free Adaptation

Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian

TL;DR

This work tackles the challenge of adapting multimodal large language models to downstream tasks without fine-tuning, especially under very long input contexts. It introduces Efficient Multi-modal Long Context Learning (EMLoC), a training-free approach that compresses long multimodal demonstrations into a compact memory M via chunk-wise processing and layer-wise adaptive pruning guided by Jensen-Shannon divergence constraints. The authors prove a theoretical bound on information loss and demonstrate empirical gains across six vision-language benchmarks, achieving dramatic reductions in context length and inference cost while maintaining or improving accuracy. The proposed framework enables scalable, resource-efficient deployment of multimodal models in real-world settings, with public code for reproduction.

Abstract

Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Code is publicly available at https://github.com/Zehong-Ma/EMLoC.
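
For reference, the Jensen-Shannon divergence used by the pruning constraint has the standard definition below. The notation is assumed here for illustration and is not taken from the paper: $P$ stands for the model's output distribution with the full context, $\hat{P}$ for the output distribution after pruning, $\bar{P}$ for their mixture, and $\delta$ for the tolerance on the divergence.

```latex
\mathrm{JS}(P \,\|\, \hat{P})
  = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, \bar{P})
  + \tfrac{1}{2}\,\mathrm{KL}(\hat{P} \,\|\, \bar{P}),
\qquad \bar{P} = \tfrac{1}{2}\,(P + \hat{P}),
\qquad \mathrm{JS}(P \,\|\, \hat{P}) \le \delta
```

The Jensen-Shannon divergence is symmetric and, with natural logarithms, bounded above by $\ln 2$, so a small $\delta$ can be read as a fixed fraction of the largest possible disagreement between the two distributions.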

Paper Structure

This paper contains 26 sections, 16 equations, 6 figures, 16 tables, 1 algorithm.

Figures (6)

  • Figure 1: The comparison between EMLoC and MLoC on ImageNet100 with varying numbers of demonstration examples. With 200 examples, EMLoC achieves 4.4× context compression over vanilla MLoC without performance loss. It significantly outperforms MLoC with 50 examples using a similar context length.
  • Figure 2: (a) The overall framework of efficient multi-modal long-context learning. (b) Chunk-wise compression with layer-adaptive pruning, where pruning steps iteratively update output probabilities and are validated using a JS divergence check. Gray squares indicate pruned tokens, with red and green arrows representing failed and successful pruning steps, respectively.
  • Figure 3: Performance and context length trends of EMLoC on ImageNet100 with 200 examples across different $\delta$ values.
  • Figure 4: Remaining token number of EMLoC and PyramidKV in ImageNet100 with 200 demonstrations and MME-RW with 20 demonstrations. The corresponding JS divergence after pruning is also illustrated to demonstrate the advantage of EMLoC.
  • Figure 5: Distribution of pruned and reserved tokens.
  • ...and 1 more figures
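
The validated pruning loop described in the Figure 2(b) caption can be sketched roughly as follows. This is a minimal toy illustration under stated assumptions, not the authors' implementation: `prune_layerwise`, `toy_predict`, the `IMPORTANCE` scores, the candidate `ratios`, and the layer visiting order are all hypothetical. It also assumes that each trial is compared against the unpruned output distribution. A pruning step is accepted only if the Jensen-Shannon divergence stays within `delta`; otherwise a gentler retention ratio is tried.

```python
import math
from typing import Callable, List, Sequence

KeptTokens = List[List[int]]  # kept token indices, one list per layer


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def prune_layerwise(
    importance: Sequence[Sequence[float]],          # hypothetical per-layer token scores
    predict: Callable[[KeptTokens], List[float]],   # hypothetical model stand-in
    ratios: Sequence[float],                        # candidate retention ratios
    delta: float,                                   # JS divergence tolerance
) -> KeptTokens:
    kept = [list(range(len(scores))) for scores in importance]
    reference = predict(kept)  # output distribution with the full context
    for layer, scores in enumerate(importance):
        ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
        for ratio in sorted(ratios):  # most aggressive ratio first
            k = max(1, int(len(scores) * ratio))
            trial = kept[:layer] + [sorted(ranked[:k])] + kept[layer + 1:]
            if js_divergence(reference, predict(trial)) <= delta:
                kept = trial  # successful pruning step: keep it and move on
                break
            # failed pruning step: retry this layer with a gentler ratio
    return kept


# Toy data: 2 layers, 6 context tokens each (values are made up).
IMPORTANCE = [
    [0.9, 0.1, 0.8, 0.05, 0.7, 0.02],
    [0.6, 0.3, 0.5, 0.2, 0.4, 0.1],
]


def toy_predict(kept: KeptTokens) -> List[float]:
    # Hypothetical stand-in for a model: each kept token adds its score
    # to one of two class logits, followed by a softmax.
    logits = [0.0, 0.0]
    for layer, tokens in enumerate(kept):
        for t in tokens:
            logits[t % 2] += IMPORTANCE[layer][t]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


kept = prune_layerwise(IMPORTANCE, toy_predict, ratios=[0.25, 0.5, 1.0], delta=0.005)
full = toy_predict([list(range(6)), list(range(6))])
print([len(tokens) for tokens in kept], js_divergence(full, toy_predict(kept)))
```

Because a trial is accepted only after passing the divergence check, the returned token sets always satisfy the tolerance by construction. The number of tokens kept can differ from layer to layer, which is what makes the pruning layer-adaptive in this sketch.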