ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis

Advik Sinha; Saurabh Atreya; Aashutosh A; Sk Aziz Ali; Abhijit Das

TL;DR

ScenarioCLIP introduces a relation-aware, multi-level vision-language pretraining framework that explicitly models global scene context, objects, and grounded inter-object relations. It combines three visual and three textual encoders with EMA-based intra-modal knowledge distillation and cross-modal contrastive alignment, trained on the Action-Genome dataset—a large corpus of actions, objects, and relation triplets with relation-focused regions. The approach yields consistent gains in zero-shot retrieval, linear-probe classification, and object detection, and includes extensive ablations and visualizations to validate relation-grounded representations. The Action-Genome dataset and the proposed pipeline enable robust scenario-level understanding, facilitating more precise action understanding and relational reasoning in complex scenes.
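The two training signals named above, EMA-based distillation and cross-modal contrastive alignment, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the momentum of 0.99, the temperature of 0.07, and the equal weighting of the three levels are all assumptions chosen for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """EMA teacher update: new_teacher = m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity matrix scaled by the temperature (rows: images).
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # rows: texts
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def multi_level_loss(levels, temperature=0.07):
    """Sum the contrastive loss over levels (assumption: equal weights).

    `levels` maps a level name such as "global", "object", or "relation"
    to an (image_embs, text_embs) pair.
    """
    return sum(contrastive_loss(i, t, temperature) for i, t in levels.values())
```

In this sketch, matched image-text pairs at each level sit on the diagonal of the similarity matrix, so the loss falls as each image embedding moves toward its own caption and away from the others in the batch. The EMA teacher moves slowly toward the student, which is the usual way to provide a stable distillation target.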

Abstract

Until recently, CLIP-type foundation models have mostly been explored either for retrieving short descriptions or for classifying objects in a scene, treated as a single-object image classification task. The same holds for retrieving an image embedding given a text prompt (the image retrieval task). However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. Recent CLIP-based methods improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. These improvements nevertheless remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. To exploit this relational structure for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting relations. The model is pretrained on curated scenario data and finetuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing, publicly available datasets of diverse indoor and outdoor scenarios. We use a pipeline of existing language models to ground actions, objects, and relations, combined with manual and automatic curation. We establish a comprehensive benchmark for several scenario-based tasks and compare ScenarioCLIP against many baseline methods. ScenarioCLIP demonstrates robust zero-shot and finetuned performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP
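To make the model inputs described in the abstract concrete (input text, grounded relations, the image, and focused regions highlighting relations), one training sample could be organized as in the sketch below. The class names, field names, and box convention are hypothetical and chosen only for illustration; they are not the schema of the released dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]

@dataclass
class Relation:
    """A grounded relation triplet plus the region that highlights it."""
    subject: str
    predicate: str
    obj: str
    region: Box

    def as_text(self) -> str:
        # Text form that a relation-level text encoder could consume.
        return f"{self.subject} {self.predicate} {self.obj}"

@dataclass
class ScenarioSample:
    """One image with its action label, objects, and grounded relations."""
    image_path: str
    action: str
    objects: List[str] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def text_prompts(self) -> Dict[str, object]:
        # One text input per level: global scene, objects, relations.
        return {
            "global": self.action,
            "objects": list(self.objects),
            "relations": [r.as_text() for r in self.relations],
        }

# Illustrative values only; not taken from the dataset.
sample = ScenarioSample(
    image_path="example.jpg",
    action="horse riding",
    objects=["person", "horse"],
    relations=[Relation("person", "riding", "horse", (40.0, 30.0, 300.0, 280.0))],
)
prompts = sample.text_prompts()
```

Grouping the text by level mirrors the multi-level design: the global prompt would pair with the full image, object names with object crops, and relation strings with the relation-focused regions.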
