VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions

Ziteng Wang, Siqi Yang, Limeng Qiao, Lin Ma

TL;DR

This work tackles CLIP's challenge of fine-grained visual understanding by introducing CLIP-IN, which fuses instruction-editing hard negatives and long descriptive captions. A two-stage training pipeline first adapts the text encoder to long inputs via Rotary Positional Embeddings (RoPE) with knowledge distillation, then jointly trains on instruction editing data and long captions using a symmetric hard negative loss and standard contrastive loss. The approach yields substantial gains on MMVP and other fine-grained perception benchmarks, while preserving strong zero-shot performance and enhancing multimodal reasoning in Multimodal Large Language Models. The results demonstrate that targeted, instruction-based contrastive learning together with rich descriptive supervision can significantly elevate fine-grained vision-language understanding with relatively modest data scales, offering practical improvements for downstream multimodal systems.

Abstract

Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN's visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.
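The rotary positional encodings mentioned above can be illustrated with a minimal NumPy sketch of the standard RoPE formulation. It is not CLIP-IN's text-encoder code: the function name `rope_rotate`, the frequency base of 10000, and the interleaved pairing of channels are assumptions made for illustration. The property it demonstrates is that, after rotation, the dot product between a query and a key depends only on their relative offset, which is one reason the scheme is commonly used to extend models to longer inputs.

```python
import numpy as np


def rope_rotate(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    Each channel pair (2i, 2i+1) at position p is rotated by the angle
    p * base**(-2i / dim). The base value and the interleaved pairing
    are assumptions of this sketch, not details taken from the paper.
    """
    x = np.asarray(x, dtype=np.float64)
    seq_len, dim = x.shape
    if dim % 2 != 0:
        raise ValueError("dim must be even so channels can be paired")

    # One rotation frequency per channel pair.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)           # (dim/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = np.tile(rng.normal(size=(1, 8)), (16, 1))  # same query at every position
    k = np.tile(rng.normal(size=(1, 8)), (16, 1))  # same key at every position
    rq, rk = rope_rotate(q), rope_rotate(k)
    # Both pairs are 3 positions apart, so the two scores should match.
    print(rq[2] @ rk[5], rq[10] @ rk[13])
```

Because the positional signal is a rotation rather than a learned table indexed by absolute position, nothing in the function is tied to a fixed maximum sequence length.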

Paper Structure

This paper contains 22 sections, 8 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Instruction Editing Data as Hard Negatives. (a) We illustrate how instruction editing data provides challenging negative examples for CLIP. Given a source image and caption, an editing instruction leads to a target image and caption with subtle, fine-grained changes. The (source_image, target_caption) and (target_image, source_caption) pairs serve as hard negatives in (b), requiring the model to distinguish these nuanced visual-semantic differences. (c) We propose a symmetric hard negative contrastive loss to explicitly train the model to discern these subtle visual-semantic differences from both image-to-text and text-to-image perspectives.
  • Figure 2: CLIP-IN Framework Overview. Stage 1: We adapt the CLIP text encoder to process long captions using Rotary Positional Embeddings (RoPE) via knowledge distillation. Stage 2: Our proposed framework, CLIP-IN, leverages two complementary data sources: instruction editing data and long descriptive captions. Instruction editing data excels at teaching the "where" and "how" of subtle visual details, while long captions provide the broader "what" and "why" of the scene, capturing complex relationships and contextual information.
  • Figure 3: Examples of feature visualization.
  • Figure 4: Examples of instruction editing hard image-text pairs.
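The pairing described in the Figure 1 caption can be sketched as a small NumPy loss. This is a hedged illustration, not the paper's equation: the exact form of the symmetric hard negative contrastive loss is not reproduced in this summary, so the two-way softmax per direction, the temperature of 0.07, and all function and argument names below are assumptions. Each image is scored against its own caption (positive) and the caption of its edited counterpart (hard negative), and each caption is scored against its own image and the edited image, covering both the image-to-text and text-to-image perspectives.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_hard_negative_loss(src_img, tgt_img, src_txt, tgt_txt,
                                 temperature=0.07):
    """Illustrative symmetric hard negative loss over editing pairs.

    All inputs are (batch, dim) embedding arrays. The temperature and the
    two-way softmax are assumptions of this sketch.
    """
    src_img, tgt_img, src_txt, tgt_txt = map(
        l2_normalize, (src_img, tgt_img, src_txt, tgt_txt))

    # Per-sample similarities for the four image-caption combinations.
    s_ss = (src_img * src_txt).sum(-1) / temperature  # positive
    s_tt = (tgt_img * tgt_txt).sum(-1) / temperature  # positive
    s_st = (src_img * tgt_txt).sum(-1) / temperature  # hard negative
    s_ts = (tgt_img * src_txt).sum(-1) / temperature  # hard negative

    # Image-to-text: each image picks between its caption and the edited one.
    i2t_src = -log_softmax(np.stack([s_ss, s_st], axis=-1))[:, 0]
    i2t_tgt = -log_softmax(np.stack([s_tt, s_ts], axis=-1))[:, 0]
    # Text-to-image: each caption picks between its image and the edited one.
    t2i_src = -log_softmax(np.stack([s_ss, s_ts], axis=-1))[:, 0]
    t2i_tgt = -log_softmax(np.stack([s_tt, s_st], axis=-1))[:, 0]

    return float(np.mean((i2t_src + i2t_tgt + t2i_src + t2i_tgt) / 4.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(4, 16))
    tgt = src + 0.1 * rng.normal(size=(4, 16))  # edited images stay close
    print("aligned:", symmetric_hard_negative_loss(src, tgt, src, tgt))
    print("swapped:", symmetric_hard_negative_loss(src, tgt, tgt, src))
```

In a full training setup this term would presumably be added to the standard batch-level contrastive loss mentioned in the TL;DR; how the two are weighted is not stated in this summary.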