
SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos

Joshua Li, Fernando Jose Pena Cantu, Emily Yu, Alexander Wong, Yuchen Cui, Yuhao Chen

TL;DR

The paper tackles zero-shot video scene graph generation (VidSGG) in egocentric kitchen videos, where maintaining stable object identities across frames is challenging. It introduces SAMJAM, a 5-stage pipeline that fuses Gemini's open-vocabulary frame-level scene graphs with SAM2's robust temporal segmentation and tracking to produce temporally-consistent graphs. A base-frame object-to-mask matching via IoU and a mask-propagation mechanism form temporally coherent representations, while subsequent frames refine them with new masks and overlaps. Empirical results on EPIC-KITCHENS and EPIC-KITCHENS-100 show SAMJAM achieving a mean recall of $39.66\%$, an $8.33\%$ improvement over Gemini alone, demonstrating improved identity stability and bounding-box grounding. This approach enhances zero-shot VidSGG applicability for egocentric cooking tasks and has implications for downstream QA and robotic assistance in dynamic environments.
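As a rough illustration of the metric quoted above, the sketch below computes mean recall as it is commonly defined in the scene graph literature: recall is computed separately for each predicate class and then averaged, so rare predicates count as much as frequent ones. Whether the paper uses exactly this definition (for example, with a top-K cutoff or a bounding-box overlap requirement) is an assumption, and the function name `mean_recall` and the toy triplets are hypothetical.

```python
from collections import defaultdict

def mean_recall(gt_triplets, pred_triplets):
    """Average of per-predicate recall over (subject, predicate, object) triplets."""
    predicted = set(pred_triplets)
    hits = defaultdict(int)    # predicate -> ground-truth triplets that were recovered
    totals = defaultdict(int)  # predicate -> ground-truth triplets overall
    for triplet in gt_triplets:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in predicted:
            hits[predicate] += 1
    if not totals:
        return 0.0
    return sum(hits[p] / totals[p] for p in totals) / len(totals)

# Toy example (hypothetical data): "holding" is fully recalled, "on" is half recalled.
gt = [("hand", "holding", "mug"), ("mug", "on", "counter"), ("knife", "on", "counter")]
pred = [("hand", "holding", "mug"), ("mug", "on", "counter")]
score = mean_recall(gt, pred)
print(f"mean recall = {score:.2f}")
```

Averaging per predicate rather than over all triplets keeps a model from scoring well by predicting only the most common relationships.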

Abstract

Video Scene Graph Generation (VidSGG) is an important topic in understanding dynamic kitchen environments. Current models for VidSGG require extensive training to produce scene graphs. Recently, Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have demonstrated impressive zero-shot capabilities in a variety of tasks. However, VLMs like Gemini struggle with the dynamics of VidSGG, failing to maintain stable object identities across frames. To overcome this limitation, we propose SAMJAM, a zero-shot pipeline that combines SAM2's temporal tracking with Gemini's semantic understanding. SAM2 also improves upon Gemini's object grounding by producing more accurate bounding boxes. In our method, we first prompt Gemini to generate a frame-level scene graph. Then, we employ a matching algorithm to map each object in the scene graph to a SAM2-generated or SAM2-propagated mask, producing a temporally-consistent scene graph in dynamic environments. Finally, we repeat this process in each of the following frames. We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.
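The matching step described in the abstract can be pictured with a minimal sketch. It assumes each Gemini object and each SAM2 mask has been reduced to an axis-aligned box `(x1, y1, x2, y2)`, and it pairs them greedily by intersection-over-union (IoU). The greedy strategy, the `min_iou` threshold of 0.5, and all names here are illustrative assumptions, not the authors' implementation.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_objects_to_masks(object_boxes, mask_boxes, min_iou=0.5):
    """Greedily assign each object name to at most one mask ID, best IoU first."""
    candidates = [
        (box_iou(obj_box, mask_box), name, mask_id)
        for name, obj_box in object_boxes.items()
        for mask_id, mask_box in mask_boxes.items()
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    matches, used_masks = {}, set()
    for iou, name, mask_id in candidates:
        if iou < min_iou:
            break  # remaining pairs overlap too little to be trusted
        if name in matches or mask_id in used_masks:
            continue
        matches[name] = mask_id
        used_masks.add(mask_id)
    return matches

# Hypothetical boxes: two Gemini objects and three SAM2 masks (as bounding boxes).
gemini_objects = {"mug": (10, 10, 50, 50), "knife": (100, 100, 160, 120)}
sam2_masks = {1: (12, 12, 52, 52), 2: (98, 100, 158, 121), 3: (300, 300, 320, 320)}
matches = match_objects_to_masks(gemini_objects, sam2_masks)
print(matches)
```

Because a matched mask ID can be carried forward by propagation, an object that keeps matching the same mask keeps the same identity, which is the property the pipeline relies on for temporal consistency.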

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: A video scene graph captures evolving relationships between objects in a dynamic environment. Video Scene Graph Generation (VidSGG) involves generating many frame-level scene graphs, each containing a set of objects and relationships. Objects that share a common ID across multiple frames are the same object temporally.
  • Figure 2: SAMJAM is a 5-stage pipeline applied at every frame. Given matched masks from earlier frames, SAM2 propagates masks to the current frame in stage 1. In stage 2, SAM2 generates a set of new masks and combines them with the propagated masks, filtering out any overlap. In stage 3, Gemini independently produces a frame-level scene graph. We employ a matching algorithm in stage 4 that maps each Gemini object to a SAM2 mask, and finally synthesize a temporally-consistent scene graph in stage 5. To illustrate the transition from Gemini to SAM2, we also zoom in on two scene graphs produced along the pipeline. See the paper's sections on the first frame and on the following frames for details.
  • Figure 3: Qualitative results. We evaluate VidSGG models using a brief video clip taken from EPIC-KITCHENS (Damen et al., 2018) that shows a mug being moved. Illustrated above are the trimmed scene graph outputs on 4 frames from the clip, with bounding boxes for the mug highlighted in red. For Gemini (Video), object grounding of the mug completely fails. For all other methods, we display the object IDs assigned to the mug at each frame. Notably, SAMJAM is the only method that produces a consistent object ID across all 4 frames.
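The captions for Figures 1 and 3 describe a video scene graph as a sequence of frame-level graphs in which a shared object ID marks the same object over time. The sketch below shows one way to represent that and to check the ID consistency that Figure 3 highlights for the mug. The `FrameSceneGraph` class, its field names, and the example IDs are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class FrameSceneGraph:
    frame_index: int
    objects: dict = field(default_factory=dict)    # object ID -> label
    relations: list = field(default_factory=list)  # (subject ID, predicate, object ID)

def ids_for_label(graphs, label):
    """Object IDs carrying the given label, listed per frame."""
    return [
        sorted(oid for oid, name in g.objects.items() if name == label)
        for g in graphs
    ]

def has_consistent_id(graphs, label):
    """True if every frame has exactly one object with this label and the ID never changes."""
    per_frame = ids_for_label(graphs, label)
    return all(len(ids) == 1 and ids == per_frame[0] for ids in per_frame)

# Hypothetical 4-frame clips: one keeps the mug's ID stable, the other reassigns it.
stable = [
    FrameSceneGraph(i, {3: "mug", 5: "hand"}, [(5, "holding", 3)]) for i in range(4)
]
drifting = [
    FrameSceneGraph(i, {mug_id: "mug", 5: "hand"}, [(5, "holding", mug_id)])
    for i, mug_id in enumerate([3, 7, 3, 9])
]
print(has_consistent_id(stable, "mug"), has_consistent_id(drifting, "mug"))
```

Keying objects by ID rather than by label is what lets relationships in different frames refer to the same physical object, even when two objects share a label.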