FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training

Anjia Cao, Xing Wei, Zhiheng Ma

TL;DR

FLAME challenges the convention that frozen text encoders hinder performance by using frozen large language models to process long, multilingual captions. It introduces multifaceted prompt distillation to extract diverse semantic facets and a facet-decoupled attention mechanism with offline embedding to enable efficient, single-pass training. Empirical results show FLAME achieves state-of-the-art data efficiency, strong long-context and multilingual retrieval, and competitive zero-shot classification across datasets, even when trained on substantially smaller corpora. The approach offers a practical path toward scalable, multilingual, long-context vision-language pre-training with measurable improvements and interpretable semantic mappings.

Abstract

Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders. While prevailing methods attempt to address these issues through data augmentation and architecture modifications, they continue to struggle with processing long-form text inputs, and the inherent limitations of traditional CLIP text encoders lead to suboptimal downstream generalization. In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, which better aligns with the multifaceted nature of images, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy, to ensure efficient computation. Extensive empirical evaluations demonstrate FLAME's superior performance. When trained on CC3M, FLAME surpasses the previous state-of-the-art by 4.9% in ImageNet top-1 accuracy. On YFCC15M, FLAME surpasses the WIT-400M-trained CLIP by 44.4% in average image-to-text recall@1 across 36 languages, and by 34.6% in text-to-image recall@1 for long-context retrieval on Urban-1k. Code is available at https://github.com/MIV-XJTU/FLAME.

Paper Structure

This paper contains 40 sections, 2 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Conceptual comparison of text streams. FLAME leverages frozen large language models (LLMs) to directly process long captions. With multifaceted prompts, this framework extracts diverse semantic embeddings, achieving data efficiency. Preserving LLMs' inherent capabilities enables multilingual generalization.
  • Figure 2: FLAME overview. This framework harnesses the sophisticated long-text comprehension capabilities of large language models to conduct language-image pre-training directly on long captions. Based on multifaceted prompts, it extracts a diverse array of representations embedded within the long caption, thereby enhancing semantic alignment.
  • Figure 3: Facet-decoupled attention. By streamlining all prompts with a shared prefix and applying this facet-decoupled attention mask, the overhead of feature extraction is greatly reduced. The positions in red indicate the features to be extracted.
  • Figure 4: Multilingual zero-shot retrieval recall@1 results on Crossmodal-3600 (text-to-image retrieval). Despite being trained solely on English datasets, FLAME achieves outstanding average performance across all 36 languages, surpassing mSigLIP, which is trained on the multilingual WebLI dataset with 100 languages. The image-to-text results are provided in the supplementary materials.
  • Figure 5: Semantic interpretability. Based on vocabulary mapping, FLAME achieves patch-to-word translation with competent interpretability of language-image alignment. We apply average pooling to reduce the number of words for a clearer presentation.
  • ...and 3 more figures
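
The Figure 3 caption describes prompts that share a prefix and an attention mask that keeps the facets separate, so one forward pass can serve all prompts. The sketch below is a minimal illustration of that idea, not the authors' implementation. It assumes the sequence is laid out as the shared prefix followed by the facet prompts one after another, that attention is causal, and that the feature for each facet is read from that facet's last token. The function name `facet_decoupled_mask` and its parameters are hypothetical.

```python
from typing import List, Tuple


def facet_decoupled_mask(
    prefix_len: int, facet_lens: List[int]
) -> Tuple[List[List[bool]], List[int]]:
    """Build a boolean attention mask (True = query may attend to key).

    Assumed layout (hypothetical): [shared prefix | facet 1 | facet 2 | ...].
    Every token attends causally to the shared prefix and to its own
    facet segment, but never to tokens of a different facet.
    """
    total = prefix_len + sum(facet_lens)

    # Segment id per position: 0 for the shared prefix, k >= 1 for facet k.
    segment = [0] * prefix_len
    for k, n in enumerate(facet_lens, start=1):
        segment += [k] * n

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: keys at or before the query
            if segment[k] == 0 or segment[k] == segment[q]:
                mask[q][k] = True

    # Assumption: the feature of each facet is taken at its last token.
    extract_positions = []
    end = prefix_len
    for n in facet_lens:
        end += n
        extract_positions.append(end - 1)

    return mask, extract_positions


# Example: a 3-token shared prefix followed by two 2-token facet prompts.
mask, extract_positions = facet_decoupled_mask(3, [2, 2])
```

In this example the second facet (positions 5 and 6) can see the prefix (positions 0 to 2) and itself, but not the first facet (positions 3 and 4), so each extracted feature matches what a separate pass over prefix plus that one prompt would see under the same assumptions. A real model would also need per-facet position ids that restart after the prefix, which this sketch does not cover.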