SceneDiff: A Benchmark and Method for Multiview Object Change Detection

Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem

TL;DR

The paper introduces SceneDiff, a training-free framework for multiview object change detection, and SceneDiff Benchmark, the first dataset with dense instance-level annotations across diverse scenes and viewpoints. It leverages pretrained 3D reconstruction (π³), segmentation (SAM), and semantic features (DINOv3) to align temporal captures in 3D and detect changes via region-level scoring and cross-frame instance association. The approach yields large improvements over existing baselines on both multiview and two-view benchmarks and is demonstrated in a robotic tidying application. Limitations include ambiguity in cluttered scenes, sensitivity to geometry reconstruction, and a focus on object-level changes, with future work aiming to handle semantic state changes and deformable changes.

Abstract

We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
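The "relative AP improvement" figures quoted above can be read with the usual definition of relative gain, (AP_ours - AP_baseline) / AP_baseline. The sketch below assumes that definition, which the abstract does not spell out. The function name and the example AP values are hypothetical and are not results from the paper.

```python
def relative_improvement(ap_ours: float, ap_baseline: float) -> float:
    """Relative gain of ap_ours over ap_baseline, as a fraction (0.94 means 94%).

    Assumes the conventional definition (ours - baseline) / baseline;
    the abstract does not state which definition the authors use.
    """
    if ap_baseline <= 0:
        raise ValueError("baseline AP must be positive")
    return (ap_ours - ap_baseline) / ap_baseline


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline AP of 0.200 rising to 0.388 is a 94% relative improvement.
example_gain = relative_improvement(0.388, 0.200)
```

Under this reading, a relative gain says nothing about the absolute AP values, so the two quoted percentages are not directly comparable across benchmarks.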

Paper Structure

This paper contains 41 sections, 6 equations, 24 figures, and 6 tables.

Figures (24)

  • Figure 1: Multiview change detection. We identify the changed objects (Removed, Added, and Moved) given two videos capturing the same scene at different times. The right panel shows a projected 3D visualization of our 2D predictions, with object boundaries manually overlaid. Dashed lines indicate occluded changed objects.
  • Figure 2: Dataset Examples. We visualize video pairs before and after changes. Changed objects are color-masked by change type: Removed, Added, and Moved. The background is masked white. The first example is from SD-V, and the second is from SD-K.
  • Figure 3: Dataset Statistics. Distribution of object properties, changed object counts, and sequence lengths in the SceneDiff Benchmark. Object size categorization is based on the average pixel size across all frames. SD-V contains larger objects and longer sequences, while SD-K contains more deformable objects.
  • Figure 4: SceneDiff Method. Top (Overall Pipeline): Our pipeline jointly regresses geometry from the before and after sequences, selects paired views with high co-visibility, and computes region-level change scores for each pair. We then threshold these scores to detect changed regions, merge regions within each sequence into object-level changes, and match objects across sequences to classify the change type (Added, Removed, or Moved). Bottom (Region-Level Change Scoring): For each paired view, we extract geometry, appearance features, and instance regions. Geometry and appearance consistency scores are computed via depth and feature reprojection. Region matching scores are generated by mean-pooling features within regions and comparing them across images using feature similarity. These three scores are combined and mean-pooled over regions to produce unified region-level change scores.
  • Figure 5: Qualitative comparison on the SceneDiff benchmark. Ground-truth changed objects are labeled with identification numbers. Object-level predictions are annotated with their matched ground-truth ID irrespective of change type, or with a unique ID if unmatched. Although our method misses the bread in one view, it correctly predicts all changed objects overall. 3DGS-CD produces some correct per-view detections but struggles to associate them into consistent objects, and therefore fails to match any ground-truth objects. The VLM baseline generates reasonable text descriptions ("Removed: basket, orange, snack bag, snack box; Added: pineapple, sandwich, bread") but fails to consistently localize the corresponding objects. Color map: Removed, Added, and Moved.
  • ...and 19 more figures
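
The region-level change scoring described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the per-pixel geometry and appearance inconsistency maps have already been computed by reprojection and scaled to [0, 1]. It assumes cosine similarity as the feature similarity. It also assumes the three scores are combined with an equal-weight average, which the caption does not specify. All function and variable names are hypothetical.

```python
import numpy as np


def pool_region_features(features, regions):
    """Mean-pool (H, W, C) features inside each region id.

    Returns the sorted region ids and L2-normalised (R, C) descriptors.
    """
    ids = np.unique(regions)
    pooled = np.stack([features[regions == i].mean(axis=0) for i in ids])
    pooled = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8)
    return ids, pooled


def region_change_scores(geo_incons, app_incons,
                         feats_a, regions_a, feats_b, regions_b):
    """Score every region of view A against its paired view B.

    geo_incons, app_incons: (H, W) per-pixel inconsistency maps in [0, 1],
        assumed precomputed via depth and feature reprojection.
    feats_*: (H, W, C) appearance features; regions_*: (H, W) integer ids.
    Returns {region_id: change_score}; higher means more likely changed.
    """
    ids_a, desc_a = pool_region_features(feats_a, regions_a)
    _, desc_b = pool_region_features(feats_b, regions_b)

    # Region matching: cosine similarity to the best-matching region in B.
    similarity = desc_a @ desc_b.T                      # shape (Ra, Rb)
    match_incons = np.clip(1.0 - similarity.max(axis=1), 0.0, 1.0)

    scores = {}
    for rid, match in zip(ids_a, match_incons):
        mask = regions_a == rid
        geo = geo_incons[mask].mean()                   # mean-pool over region
        app = app_incons[mask].mean()
        # Assumption: equal-weight average of the three cues.
        scores[int(rid)] = float((geo + app + match) / 3.0)
    return scores
```

Per the caption, the resulting scores would then be thresholded to select changed regions before merging them into object-level changes. The threshold value is not given in this excerpt, so it is left out of the sketch.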