FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin

TL;DR

FG-CLIP 2 targets the gap in bilingual fine-grained vision-language understanding with a two-stage training paradigm that combines global alignment with region-text and text-only discriminative objectives, including the novel Textual Intra-modal Contrastive (TIC) loss. Built on a SigLIP 2–based dual encoder, it extends the supported text length, adds region-text supervision, and uses a Cross-modal Rank Loss with Global Threshold Synchronization to stabilize training, yielding strong performance in both English and Chinese. The authors also contribute a new Chinese multimodal benchmark suite covering long-caption retrieval and region-level classification, and report state-of-the-art results on 29 datasets across 8 tasks, including open-vocabulary detection and dense segmentation. The model, code, and benchmark are released to advance bilingual fine-grained vision-language understanding; suggested future work includes longer textual inputs and relational reasoning between objects.

Abstract

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.

Paper Structure

This paper contains 31 sections, 4 equations, 3 figures, and 12 tables.

Figures (3)

  • Figure 1: Overview of FG-CLIP 2.
  • Figure A: Visualization of FG-CLIP 2's dense feature maps and semantic alignment capability in bilingual scenarios.
  • Figure B: Examples from BoxClass-CN.