ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis

Advik Sinha, Saurabh Atreya, Aashutosh A, Sk Aziz Ali, Abhijit Das

TL;DR

ScenarioCLIP introduces a relation-aware, multi-level vision-language pretraining framework that explicitly models global scene context, objects, and grounded inter-object relations. It combines three visual and three textual encoders with EMA-based intra-modal knowledge distillation and cross-modal contrastive alignment. The model is trained on the Action-Genome dataset, a large corpus of actions, objects, and relation triplets with relation-focused regions. The approach yields consistent gains in zero-shot retrieval, linear-probe classification, and object detection, and the paper includes extensive ablations and visualizations to validate the relation-grounded representations. Together, the Action-Genome dataset and the proposed pipeline enable robust scenario-level understanding, supporting more precise action understanding and relational reasoning in complex scenes.

Abstract

Until recently, CLIP-type foundation models have mostly been studied on two kinds of tasks: retrieving short descriptions, and classifying objects in a scene framed as a single-object image classification task. The same holds for image retrieval, where an image embedding is retrieved given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. These improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. To leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions that highlight the relations. The model is pretrained on curated scenario data and fine-tuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding. To address the lack of domain-specific datasets, we build a novel dataset by extending image-text pairs from existing, publicly available indoor and outdoor scenario datasets. We use a pipeline of existing language models to ground actions, objects, and relations, combined with manual and automatic curation. We establish a comprehensive benchmark for several scenario-based tasks and compare ScenarioCLIP with many baseline methods. ScenarioCLIP demonstrates robust zero-shot and fine-tuned performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP
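The cross-modal contrastive alignment and EMA teachers mentioned in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a CLIP-style symmetric InfoNCE loss applied to one level (for example, global image vs. action caption), a temperature of 0.07, and a plain exponential moving average over a flat parameter list. The function names `symmetric_contrastive_loss` and `ema_update` are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style symmetric InfoNCE over matched (image, text) pairs.

    Assumption: pair i in image_embs matches pair i in text_embs, and the
    temperature value is illustrative, not taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: row i should select column i.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move each teacher parameter a small step toward the student."""
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_params, student_params)
    ]


# Toy check: matched pairs should give a lower loss than swapped pairs.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_contrastive_loss(embeddings, embeddings)
loss_swapped = symmetric_contrastive_loss(embeddings, embeddings[::-1])
teacher = ema_update([0.0, 0.0], [1.0, 1.0], momentum=0.9)
```

In a multi-level setup like the one described, the same loss would presumably be evaluated once per level (global, object, relation) and the results combined. How the paper weights those terms is not stated here.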

Paper Structure

This paper contains 28 sections, 8 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: ScenarioCLIP can not only detect actions and objects but also localize the relations between objects in both single-relation (a) and multi-relation (b) scenes.
  • Figure 2: Focused regions and scenario-centric grounding. (a) Our pipeline generates relation-focused regions for each triplet (e.g., woman wearing glasses, vessel on stove for food). (b) A VLM produces scenario-centric metadata, which GroundingDINO [liu2023grounding] and SAM [kirillov2023segment] then ground with bounding boxes and segmentation masks that we convert into the focused regions used by ScenarioCLIP.
  • Figure 3: Data generation pipeline for the Action-Genome Dataset. Given a raw image, a vision-language model produces a global action caption, object list, and relation triplets (left). An object grounding model (GroundingDINO [liu2023grounding]) then predicts bounding boxes for the mentioned objects (middle). Finally, a segmentation model (SAM [kirillov2023segment]) with RBF-based weighting constructs relation-focused regions that highlight the spatial context of each $(\text{object}_1,\text{relation},\text{object}_2)$ triplet (right).
  • Figure 4: Overview of ScenarioCLIP. Global, object, and relation encoders extract visual features from the full image, object crops, and focused regions, respectively, while the corresponding text encoders embed the action caption, object names, and relation triplets. Contrastive losses align image-action, object-object, and relation-relation pairs, and EMA teachers provide knowledge-distillation targets.
  • Figure 5: t-SNE visualization of object-level semantic embeddings of 10,000 objects from the Action-Genome Dataset. The large points represent text features and the smaller points represent image features.
  • ...and 4 more figures
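
The RBF-based weighting named in the Figure 3 caption can be pictured with a small sketch. The caption does not give the exact formulation, so the following is only an assumed variant: it uses the two object bounding boxes instead of SAM masks, assigns weight 1 inside either box, and lets the weight decay as a Gaussian radial basis function of the distance to the nearest box. The names `rbf_focus_map` and `sigma` are hypothetical.

```python
import math


def distance_to_box(x, y, box):
    """Euclidean distance from pixel (x, y) to an axis-aligned box.

    The box is (x0, y0, x1, y1) with inclusive corners; the distance is 0
    for pixels inside the box.
    """
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0, x - x1)
    dy = max(y0 - y, 0, y - y1)
    return math.hypot(dx, dy)


def rbf_focus_map(width, height, box_a, box_b, sigma=2.0):
    """Build a soft weight map that highlights two related objects.

    Assumption: weights follow exp(-d^2 / (2 * sigma^2)), where d is the
    distance to the nearer of the two boxes. The paper's actual weighting
    (which operates on segmentation masks) may differ.
    """
    weights = []
    for y in range(height):
        row = []
        for x in range(width):
            d = min(
                distance_to_box(x, y, box_a),
                distance_to_box(x, y, box_b),
            )
            row.append(math.exp(-(d * d) / (2.0 * sigma * sigma)))
        weights.append(row)
    return weights


# Toy 8x6 image with one small box per object of a relation triplet.
focus = rbf_focus_map(8, 6, box_a=(1, 1, 2, 2), box_b=(5, 3, 6, 4))
```

Multiplying an image by such a map keeps both objects and the area near them at close to full intensity while suppressing distant background, which is one plausible way to form a relation-focused region.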