Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting
Anand Bhattad, Konpat Preechakul, Alexei A. Efros
TL;DR
Visual Jenga is a scene understanding task in which objects are removed from a single image one at a time while the scene stays coherent. The paper tackles it with a training-free framework based on counterfactual inpainting: it exploits asymmetries in object relationships, approximates conditional probabilities with the diversity of inpainted results (scored with CLIP and DINO), and ranks the removal order accordingly. Evaluations on NYU-v2 and HardParse, together with full-scene decompositions, show strong pairwise accuracy and plausible removal sequences that outperform simple heuristics, and they offer practical insights for scene understanding and manipulation. The work highlights both the promise of large generative models for counterfactual reasoning and the need for end-to-end, physically grounded approaches that capture richer dependencies in real-world scenes.
Abstract
This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both a physical and a geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and to employ a large inpainting model to generate a set of counterfactuals that quantify this asymmetry.
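The ranking idea can be illustrated with a minimal sketch. It assumes that each object's original crop and its inpainted counterfactuals have already been turned into feature vectors; plain NumPy arrays stand in here for the CLIP and DINO features the paper uses. The function names, the "1 minus mean cosine similarity" diversity score, and the choice to remove the most diverse object first are assumptions made for illustration, not the authors' exact formulation.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def diversity_score(original_feat, counterfactual_feats):
    """Assumed score: 1 - mean similarity between the original crop and
    the inpainted counterfactuals. A high value means the inpainter rarely
    reproduces the object, i.e. the rest of the scene does not require it."""
    sims = [cosine_similarity(original_feat, f) for f in counterfactual_feats]
    return 1.0 - float(np.mean(sims))


def removal_order(features):
    """Rank objects for removal, most diverse (least depended upon) first.

    `features` maps an object name to (original_feat, counterfactual_feats).
    """
    scores = {
        name: diversity_score(orig, cfs) for name, (orig, cfs) in features.items()
    }
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores


# Toy, hand-made features (hypothetical, not real CLIP/DINO outputs):
# inpainting the cup's region yields unrelated content, while inpainting
# the table's region keeps producing something table-like.
toy_features = {
    "cup": ([1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]]),
    "table": ([1.0, 0.0], [[1.0, 0.0], [1.0, 0.1]]),
}

order, scores = removal_order(toy_features)
print(order)
```

In this toy setup the cup is ranked ahead of the table, which matches the intuition that a supported object can be removed before its support. With real images, the feature vectors would come from an image encoder applied to the original and inpainted crops.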
