Lookalike3D: Seeing Double in 3D

Chandan Yeshwanth, Angela Dai

Abstract

3D object understanding and generation methods produce impressive results, yet they often overlook a pervasive source of information in real-world scenes: repeated objects. We introduce the task of lookalike object detection in indoor scenes, which leverages redundant and complementary cues from identical and near-identical object pairs. Given multiview images of a scene, the task is to classify pairs of objects as identical, similar or different. To address this, we present Lookalike3D, a multiview image transformer that distinguishes such object pairs by harnessing strong semantic priors from large image foundation models. To support this task, we collected the 3DTwins dataset, containing 76k manually annotated identical, similar and different pairs of objects based on ScanNet++, and show a 104% improvement in IoU over baselines. We demonstrate how our method benefits downstream tasks such as joint 3D object reconstruction and part co-segmentation, turning repeated and lookalike objects into a powerful cue for consistent, high-quality 3D perception. Our code, dataset and models will be made publicly available.
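As a concrete illustration of this three-way labeling, a scalar similarity score from such a model could be mapped to the class names with fixed thresholds. The sketch below is a hypothetical decision rule; the function name and threshold values are placeholders, not taken from the paper:

```python
# Hypothetical decision rule mapping a pairwise similarity score to the
# three classes defined above. Thresholds are illustrative placeholders.
def classify_pair(similarity: float,
                  t_identical: float = 0.8,
                  t_similar: float = 0.5) -> str:
    if similarity >= t_identical:
        return "identical"
    if similarity >= t_similar:
        return "similar"
    return "different"
```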

Figures (8)

  • Figure 1: Many real-world scenes naturally contain repeated objects. Lookalike3D proposes to identify such groups of identical or near-identical (similar) objects across multiview images of an indoor scene. This enables object-focused 3D reconstruction and perception methods to exploit holistic scene reasoning. In particular, we demonstrate joint optimization under these global scene constraints using state-of-the-art, pre-trained 3D object reconstruction and part segmentation models.
  • Figure 2: Overview of the Lookalike3D model. Lookalike3D takes as input a single image or multiple posed images of each object in a pair, passes these through a shared frozen DINOv2 backbone to extract image patch features, and then encodes these patch features with three kinds of attention layers: single-view, multiview and global. The resulting patch features are aggregated and compared to obtain a single similarity measure for the two objects. The model is trained with a triplet loss to separate identical, similar and different pairs, and an alignment loss to align the similarities to the classification thresholds. (A minimal code sketch of this pipeline appears after the figure list.)
  • Figure 3: Examples of identical, similar and different pairs of objects from the 3DTwins dataset. Identical objects have the exact same shape and appearance, differing only in scene context and occlusions. Similar objects differ slightly in shape or appearance, and different objects have large differences in shape or appearance.
  • Figure 4: Identical object groups identified by Lookalike3D and baselines. Each group is indicated by a unique color, randomized across scenes and methods. Correct and incorrect predictions are circled in green and red, respectively. Unlike baselines that frequently misclassify or over-group instances, Lookalike3D leverages multiview images and fine-grained appearance cues for accurate object grouping.
  • Figure 5: Joint 3D reconstruction using SAM 3D on Lookalike3D outputs. While individual reconstructions are plausible, they vary significantly in shape and size, often omitting parts due to limited context. In contrast, joint prediction leverages redundant and complementary cues across instances to produce a single, consistent reconstruction. Insets show objects from their original viewpoints.
  • ...and 3 more figures
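The pipeline in the Figure 2 caption can be summarized in a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the class and function names, the use of one transformer layer per attention scope, mean-pooled aggregation, cosine similarity, and the margin value are all assumptions, and patch features are assumed to be precomputed by a frozen DINOv2 backbone (width 384 for ViT-S/14) and passed in directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairSimilarityEncoder(nn.Module):
    """Sketch of a Lookalike3D-style pair encoder; all names hypothetical."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        # One transformer layer per attention scope from the Figure 2 caption.
        self.single_view = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.multi_view = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.global_attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def encode(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_views, num_patches, dim) patch features of one object,
        # assumed precomputed by a frozen DINOv2 backbone. Single-view
        # attention treats each view as a batch element; multiview attention
        # then mixes all patches across views of the same object.
        v, p, d = feats.shape
        x = self.single_view(feats)
        return self.multi_view(x.reshape(1, v * p, d))  # (1, v*p, dim)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        a, b = self.encode(feats_a), self.encode(feats_b)
        # Global attention over the concatenated tokens of both objects.
        joint = self.global_attn(torch.cat([a, b], dim=1))
        n_a = a.shape[1]
        emb_a = joint[:, :n_a].mean(dim=1)  # aggregate patches to one vector
        emb_b = joint[:, n_a:].mean(dim=1)
        return F.cosine_similarity(emb_a, emb_b)  # similarity in [-1, 1]

def ranking_loss(sim_identical, sim_similar, sim_different, margin=0.2):
    # Triplet-style objective: identical pairs should score above similar
    # pairs, and similar pairs above different pairs, each by a margin.
    return (F.relu(sim_similar - sim_identical + margin)
            + F.relu(sim_different - sim_similar + margin)).mean()
```

For example, `PairSimilarityEncoder()(torch.randn(3, 256, 384), torch.randn(2, 256, 384))` (three and two views of 16×16 patches each) returns a one-element similarity tensor; the alignment loss from the caption would additionally tie these scores to the classification thresholds.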