PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi

Abstract

A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). To our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric, and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding, where accuracy improves by 36.4%. Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism
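
For scale, the stated frame rate and counts imply roughly 820 hours of video and about 2.7K supervision tokens per sample. This is a back-of-the-envelope reading of the numbers in the abstract, not figures reported by the authors:

$$\frac{11.8\text{M frames}}{4\ \text{fps}} \approx 2.95\text{M s} \approx 820\ \text{hours},\qquad \frac{730\text{M tokens}}{270\text{K samples}} \approx 2{,}700\ \text{tokens per sample}.$$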

Paper Structure

This paper contains 46 sections, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: PRISM captures multi-view retail video from four synchronized modalities - egocentric, exocentric, 360° panoramic and depth - and structures 270K video SFT samples across 20+ task types organized into four capability dimensions. All capabilities feed into a model-agnostic fine-tuning format compatible with any VLM or VLA, producing embodied agents for real-world retail deployment.
  • Figure 2: PRISM pipeline overview. PRISM is built on egocentric and exocentric videos from real-world retail stores. Four annotation strategies - metadata extraction, LLM generation (Gemini 2.5 Flash), depth analysis (DepthCrafter), and self-supervised transformations - produce twenty tasks (the full list of tasks is given in the Method section) across four capability domains, totaling 270K instruction-tuning samples. Solid boxes denote egocentric tasks; dashed boxes denote exocentric (exo) tasks. Samples from PRISM are used for fine-tuning Cosmos-Reason2-2B via BF16 LoRA; a minimal configuration sketch is given after this figure list.
  • Figure 3: PRISM capability probe examples (Part 1 of 3). Each row shows a representative video frame alongside the full question and model answer. CoT tasks include the chain-of-thought in ⟨think⟩ tags (shown in italics) before the final answer.
  • Figure 4: PRISM capability probe examples (Part 2 of 3). Continued from above.
  • Figure 5: PRISM capability probe examples (Part 3 of 3). CS-R-4 and SP tasks test spatial understanding. IP-1 CoT variants demonstrate physics-grounded reasoning about temporal direction. IP-2 evaluates object permanence. MCQ Overlay converts open-ended tasks into four-choice format.
  • ...and 1 more figure
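
The BF16 LoRA fine-tuning referenced in the Figure 2 caption could look roughly like the following sketch, built on Hugging Face transformers and peft. Everything here is an illustrative assumption (model identifier, target modules, rank, and other hyperparameters) rather than the authors' actual training recipe.

```python
# Hypothetical BF16 LoRA fine-tuning setup (illustrative sketch only).
# Model identifier, target modules, and hyperparameters are assumptions,
# not values reported in the paper; a vision-language model may also
# require a different Auto class and an image/video processor in practice.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

MODEL_ID = "nvidia/Cosmos-Reason2-2B"  # assumed identifier; check the official release

# Load the base model in bfloat16, matching the BF16 setting in the Figure 2 caption.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# LoRA adapter: only low-rank update matrices on the attention projections are trained.
lora_config = LoraConfig(
    r=16,                      # adapter rank (placeholder)
    lora_alpha=32,             # scaling factor (placeholder)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapter weights are a small fraction of the base model
```

The adapter-only update keeps the fine-tuning footprint small, which is what makes a 2B-parameter base model practical to adapt on a 270K-sample SFT corpus; the trainer, data collator, and video preprocessing are omitted here.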