FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
Darshan Singh S, Zeeshan Khan, Makarand Tapaswi
TL;DR
FiGCLIP addresses CLIP's weakness in fine-grained and compositional reasoning by post-pretraining CLIP on the densely annotated VidSitu dataset using SRL-based prompts and LoRA adapters. A Video Contextualizer aggregates frame-level features into event- and video-level representations, and task-specific prompts guide cross-modal alignment across multiple losses. The approach achieves state-of-the-art results or strong gains on five tasks: video situation recognition, zero-shot text-to-video retrieval, zero-shot action recognition, dense captioning, and compositional reasoning benchmarks such as ARO and SugarCrepe, all while training on a compact, high-quality dataset. This suggests that small, richly annotated datasets can meaningfully improve CLIP's fine-grained and syntactic capabilities without sacrificing its semantic strengths, offering a practical path toward scalable, high-quality vision-language models.
Abstract
While contrastive language-image pretraining (CLIP) has exhibited impressive performance by learning highly semantic and generalized representations, recent works have exposed a fundamental drawback in its syntactic properties, which include interpreting fine-grained attributes, actions, spatial relations, states, and details that require compositional reasoning. One reason for this is that natural captions often do not capture all the visual details of a scene. As a result, visual concepts that the caption leaves unaddressed are misattributed to the wrong words, and the pooled image and text features end up acting as a bag of words, losing the syntactic information. In this work, we ask: Is it possible to enhance CLIP's fine-grained and syntactic abilities without compromising its semantic properties? We show that this is possible by adapting CLIP efficiently on a high-quality, comprehensive, and relatively small dataset. We demonstrate our adaptation strategy on VidSitu, a video situation recognition dataset annotated with verbs and rich semantic role labels (SRL). We use the SRL and verb information to create rule-based detailed captions, making sure they capture most of the visual concepts. Combined with hard negatives and hierarchical losses, these annotations allow us to learn a powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that preserves semantic understanding while being detail-oriented. We evaluate on five diverse vision-language tasks in both fine-tuning and zero-shot settings, achieving consistent improvements over the base CLIP model.
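To make the idea of rule-based captions and hard negatives concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each event is a dictionary mapping role names to text spans, concatenates the roles in a fixed order to form a caption, and builds a hard negative by swapping two roles so that the words stay the same but the meaning changes. The role names, the ordering in `ROLE_ORDER`, the helper names, and the example event are all illustrative assumptions; the abstract does not specify the actual rules.

```python
from typing import Dict, Optional

# Hypothetical role order used to linearize an event into a caption.
# The real caption rules are not given in the abstract.
ROLE_ORDER = ["Arg0", "Verb", "Arg1", "Arg2", "AMnr", "AScn"]


def srl_to_caption(event: Dict[str, str]) -> str:
    """Join the non-empty role fillers of an event in a fixed order."""
    parts = [event[role] for role in ROLE_ORDER if event.get(role)]
    return " ".join(parts)


def swap_roles_negative(
    event: Dict[str, str], role_a: str = "Arg0", role_b: str = "Arg1"
) -> Optional[str]:
    """Build a hard negative caption by exchanging two role fillers.

    Returns None when either role is missing or both fillers are identical,
    since the swap would then not change the caption.
    """
    a, b = event.get(role_a), event.get(role_b)
    if not a or not b or a == b:
        return None
    swapped = dict(event)  # copy so the original event is left untouched
    swapped[role_a], swapped[role_b] = b, a
    return srl_to_caption(swapped)


# Illustrative event (made up for this sketch, not taken from VidSitu).
event = {
    "Verb": "pushes",
    "Arg0": "a man in a black coat",
    "Arg1": "a woman",
    "AMnr": "forcefully",
    "AScn": "in a parking lot",
}

caption = srl_to_caption(event)
negative = swap_roles_negative(event)
print(caption)
print(negative)
```

A bag-of-words text encoder would score the caption and its swapped negative almost identically, because both contain the same words. Training against such pairs is one way to push the model toward using word order and role structure.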
