FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

Darshan Singh S, Zeeshan Khan, Makarand Tapaswi

TL;DR

FiGCLIP addresses CLIP's weakness in fine-grained and compositional reasoning by post-pretraining CLIP on the densely annotated VidSitu dataset using SRL-based prompts and LoRA adapters. A Video Contextualizer aggregates frame-level features into event- and video-level representations, and task-specific prompts guide cross-modal alignment across multiple losses. The approach reports state-of-the-art results or strong gains on five tasks: video situation recognition, zero-shot text-to-video retrieval, zero-shot action recognition, dense captioning, and compositional reasoning benchmarks such as ARO and SugarCrepe, all while training on a compact, high-quality dataset. This suggests that small, richly annotated datasets can meaningfully enhance CLIP's fine-grained and syntactic capabilities without sacrificing its semantic strengths, offering a practical path toward high-quality vision-language (VL) models.
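As a rough illustration of the LoRA idea mentioned above, the sketch below adds a trainable low-rank update to a frozen weight matrix. It is a generic NumPy sketch of LoRA, not the authors' implementation: the layer sizes, the rank, the scaling factor `alpha`, and the function name `lora_forward` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the actual CLIP layer shapes and LoRA rank are not given here.
d_in, d_out, rank, alpha = 64, 64, 4, 8.0

W = rng.standard_normal((d_out, d_in))         # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialised


def lora_forward(x, W, A, B, alpha, rank):
    """Frozen linear map plus a scaled low-rank update: x W^T + (alpha / rank) * x A^T B^T."""
    base = x @ W.T
    update = (x @ A.T) @ B.T
    return base + (alpha / rank) * update


x = rng.standard_normal((2, d_in))
y = lora_forward(x, W, A, B, alpha, rank)

# Only A and B would be updated during adaptation; W stays fixed.
trainable = A.size + B.size
frozen = W.size
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, which is one reason low-rank adapters are a natural fit when the goal is to preserve the base model's behaviour while adding new capabilities.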

Abstract

While contrastive language-image pretraining (CLIP) has exhibited impressive performance by learning highly semantic and generalized representations, recent works have exposed a fundamental drawback in its syntactic properties, which include interpreting fine-grained attributes, actions, spatial relations, states, and details that require compositional reasoning. One reason for this is that natural captions often do not capture all the visual details of a scene. This leads to unaddressed visual concepts being misattributed to the wrong words. Moreover, the pooled image and text features end up acting as a bag of words, losing the syntactic information. In this work, we ask: Is it possible to enhance CLIP's fine-grained and syntactic abilities without compromising its semantic properties? We show that this is possible by adapting CLIP efficiently on a high-quality, comprehensive, and relatively small dataset. We demonstrate our adaptation strategy on VidSitu, a video situation recognition dataset annotated with verbs and rich semantic role labels (SRL). We use the SRL and verb information to create rule-based detailed captions, making sure they capture most of the visual concepts. Combined with hard negatives and hierarchical losses, these annotations allow us to learn a powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that preserves semantic understanding while being detail-oriented. We evaluate on five diverse vision-language tasks in both fine-tuning and zero-shot settings, achieving consistent improvements over the base CLIP model.
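To make the idea of rule-based captions and hard negatives concrete, here is a minimal sketch that turns one event's verb and semantic roles into a caption and builds a role-swapped negative. The role names, their ordering, the example event, and the helper names `srl_to_caption` and `swap_roles_negative` are assumptions for illustration; the paper's actual templates and negative-mining rules may differ.

```python
from typing import Dict, List, Optional

# Hypothetical role order for the caption template (an assumption, not the paper's rule set).
ROLE_ORDER: List[str] = ["Arg0", "verb", "Arg1", "Arg2", "AMnr", "ALoc"]


def srl_to_caption(event: Dict[str, str]) -> str:
    """Join the non-empty roles of one event, in template order, into a caption."""
    parts = [event[r].strip() for r in ROLE_ORDER if event.get(r, "").strip()]
    return " ".join(parts)


def swap_roles_negative(event: Dict[str, str],
                        a: str = "Arg0", b: str = "Arg1") -> Optional[str]:
    """Build a hard negative by swapping two roles: same words, different meaning."""
    if a not in event or b not in event:
        return None
    neg = dict(event)
    neg[a], neg[b] = event[b], event[a]
    return srl_to_caption(neg)


# Made-up event annotation, for illustration only.
event = {
    "verb": "pushes",
    "Arg0": "man in a black jacket",
    "Arg1": "woman with blonde hair",
    "ALoc": "in a parking lot",
}

caption = srl_to_caption(event)
negative = swap_roles_negative(event)
print(caption)
print(negative)
```

The positive and negative captions contain exactly the same words, so a model that treats text as a bag of words cannot tell them apart; separating them requires the syntactic sensitivity the abstract describes.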

Paper Structure (62 sections, 4 equations, 8 figures, 18 tables)

Figures (8)

  • Figure 1: We illustrate the qualitative performance of FiGCLIP, a fine-grained adaptation of the popular CLIP model, across multiple datasets. Left: In video situation recognition on VidSitu, we highlight a couple of example events showing the input frames, corresponding attention maps, and the event-level predictions. FiGCLIP has more focused attention than CLIP in localizing the driver of the car (top) and the man (bottom). Middle: In text-to-video retrieval on MSRVTT, we observe that FiGCLIP outperforms CLIP, especially in cases where compositional reasoning is required. CLIP performs worse on queries with attributes such as red dress or blonde hair, and on multi-shot events such as woman and man talking on the phone. Right: On SugarCrepe, FiGCLIP is able to pick the correct caption between two descriptions differing only in one aspect.
  • Figure 2: We visualize an overview of our CLIP adaptation strategy. In the left panel, the visual encoder, consisting of the CLIP backbone and the video contextualizer, is applied to a single video with $P$ events. The middle panel shows the frozen CLIP text encoder extracting event-level text representations. Finally, in the right panel, we highlight how the four losses are computed by putting together different tokens.
  • Figure 3: Video Situation Recognition on 5 videos. FiGCLIP performs much better than CLIP in picking the right attribute of an entity. The last row shows a failure case where the semantic role labels predicted by FiGCLIP deviate from the ground truth (GT).
  • Figure 4: Zero-shot text-to-video retrieval on the MSRVTT dataset. We show three frames of the top-1 retrieved video for each query. We can see that FiGCLIP outperforms CLIP, especially when compositional reasoning is required. The last row shows a failure case: although FiGCLIP retrieves a video in which a man is talking, potentially with a more appropriate background, he is not talking about hiking.
  • Figure 5: Zero-shot text-to-video retrieval on the LSMDC dataset. We show three frames of the top-1 retrieved video for each query. We can again notice that FiGCLIP performs better than CLIP when compositional reasoning is needed. The last row shows a failure case.
  • ...and 3 more figures