Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting

Anand Bhattad, Konpat Preechakul, Alexei A. Efros

TL;DR

Visual Jenga introduces a training-free, counterfactual inpainting-based framework to uncover object dependencies in single images by sequentially removing objects while preserving scene coherence. The method leverages asymmetries in object relationships, approximating conditional probabilities with inpainting diversity scored by CLIP and DINO, and ranks removal order accordingly. Evaluations on NYU-v2 and HardParse, plus full-scene decompositions, demonstrate strong pairwise accuracy and plausible sequential removals, outperforming simple heuristics and revealing practical insights for scene understanding and manipulation. The work highlights both the promise of large generative models for counterfactual reasoning and the need for end-to-end, physically grounded approaches to capture richer dependencies in real-world scenes.

Abstract

This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both a physical and a geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to exploit the asymmetry in the pairwise relationships between objects within a scene, and to quantify that asymmetry by using a large inpainting model to generate a set of counterfactuals.
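
The principle above can be illustrated with a minimal sketch. It is not the authors' implementation: the random vectors below are placeholders for CLIP/DINO features, the function names are hypothetical, and this page does not say how the paper combines the two similarities. The sketch only shows the ranking idea: an object whose masked region gets filled with diverse, dissimilar content (low mean similarity to the original) is treated as more dependent and is removed first.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_similarity(original_emb, inpainting_embs):
    """Average similarity between the original object and its N inpaintings."""
    return float(np.mean([cosine_similarity(original_emb, e)
                          for e in inpainting_embs]))

def removal_order(scores):
    """Object names sorted so the lowest mean similarity comes first."""
    return sorted(scores, key=scores.get)

# Placeholder embeddings (assumption): real ones would come from image encoders.
rng = np.random.default_rng(0)
dim, n = 64, 8
cat_emb = rng.normal(size=dim)
table_emb = rng.normal(size=dim)

# Masking the cat: the inpainter fills the hole with unrelated content.
cat_inpaintings = rng.normal(size=(n, dim))
# Masking the table: the inpainter keeps regenerating something table-like.
table_inpaintings = table_emb + 0.1 * rng.normal(size=(n, dim))

scores = {
    "cat": mean_similarity(cat_emb, cat_inpaintings),
    "table": mean_similarity(table_emb, table_inpaintings),
}
order = removal_order(scores)
print(order)
```

In this toy setup the cat scores far lower than the table, so it is ranked first for removal, matching the intuition that the table supports the cat.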

Paper Structure

This paper contains 28 sections, 2 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Counterfactual Inpainting. Given a pair of objects in an image (here, cat and table), we want to determine which of the two objects is more dependent on the other. We do this by removing each object in turn (masked images) and using a large inpainting model to generate $N$ possible inpaintings for the masked regions (top and bottom rows). The number below each inpainting result is a pairwise cosine similarity (derived from CLIP and DINO) between it and the object in the original image. The average images (for illustration only) show that the cat could be replaced by many different objects, while the table remains largely unchanged. This suggests that the table is supporting the cat.
  • Figure 2: Asymmetric Relationships in Real-World Images. Consider performing two internet image searches: "cup" (left) and "table" (middle). Notice that almost all the cups are depicted on top of a table, whereas images of tables rarely contain cups. That is, $P(\text{Table} \mid \text{Cup}) \gg P(\text{Cup} \mid \text{Table})$. The Venn diagram (right) illustrates how object A (cup) depends on object B (table) for structural support: observing a table does not guarantee a cup, but observing a cup strongly implies a table. By leveraging these asymmetric relationships, we can infer object dependencies in a scene from the distributions $P(A \mid B)$ and $P(B \mid A)$ learned from large-scale data.
  • Figure 3: Our Pipeline. Given only an input image, (a) we first run Molmo [Deitke2024-jt], which places a point on each object in the image. (b) These points then serve as prompts for the Segment-Anything (SAM 2) [Ravi2025-qf] model to obtain a segmentation map for each object. (c) Given the object masks, we run our Counterfactual Inpainting method on all object candidates to determine their removal order via a ranking strategy (illustrated in Fig. 1). (d) Finally, we use Firefly [Adobe-IncUnknown-mb] to remove the objects in the resulting ranked order.
  • Figure 4: Results on diverse images with increasing numbers of objects (top to bottom). Our method produces plausible removal sequences for both simple stacked setups (first and last rows) and complex indoor scenes. For example, in the second row, the cat is removed first, followed by the laptop, with the table removed last. In the fifth row, the napkin is removed after the serving spoons, with the tray removed last. In the sixth row, the removal order for a dinner plate is: hard-wheat rolls (baati), lentil soup, sauce, spoon, and finally the plate (note that our method even removes the lentil soup). In the second-to-last row, one of the three glasses is removed before the last book, which is also correct and yields a physically plausible sequence. For ease of visualization, yellow markers highlight the object that is removed next.
  • Figure 5: Removal sequence of a breakfast table on a balcony generated by our pipeline. Our method accurately ranks partially occluded objects and handles a busy breakfast table setup, sequentially removing all items. Note that new objects, such as a plate behind the basket that was previously occluded, may also be introduced after an object is removed; our pipeline treats them like any other object.
  • ...and 14 more figures
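
The asymmetry described in the Figure 2 caption can be checked with a small worked example. The counts below are hypothetical toy numbers chosen for illustration, not data from the paper, and the helper name is an assumption. The sketch estimates both conditionals from co-occurrence counts and confirms the identity that follows from Bayes' rule, $P(\text{Table} \mid \text{Cup}) / P(\text{Cup} \mid \text{Table}) = P(\text{Table}) / P(\text{Cup})$.

```python
import math

def conditional(joint_count, condition_count):
    """Estimate P(X | Y) as count(X and Y) / count(Y)."""
    return joint_count / condition_count

# Hypothetical image counts (toy numbers, not from the paper).
images_with_cup = 1_000
images_with_table = 10_000
images_with_both = 900

p_table_given_cup = conditional(images_with_both, images_with_cup)
p_cup_given_table = conditional(images_with_both, images_with_table)

# Bayes' rule implies this ratio equals count(table) / count(cup).
ratio = p_table_given_cup / p_cup_given_table
expected_ratio = images_with_table / images_with_cup

# The object whose presence more strongly implies the other is the dependent one.
dependent = "cup" if p_table_given_cup > p_cup_given_table else "table"

print(p_table_given_cup, p_cup_given_table, dependent)
assert math.isclose(ratio, expected_ratio)
```

With these counts, seeing a cup implies a table far more strongly than the reverse, so the cup is flagged as the dependent object.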