Common Inpainted Objects In-N-Out of Context

Tianze Yang, Tyson Jordan, Ruitong Sun, Ninghao Liu, Jin Sun

Abstract

We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is verified and categorized as in- or out-of-context through Large Vision-Language Model assessments. We demonstrate three key tasks enabled by COinCO: (1) a fine-grained context reasoning approach that classifies objects as in- or out-of-context based on three criteria; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance- and clique-level semantics; and (3) context-enhanced fake detection that improves state-of-the-art detectors without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision, including image forensics. Code and dataset are available at https://co-in-co.github.io/.

Paper Structure

This paper contains 15 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: COinCO contains a rich set of inpainted out-of-context objects (top two rows) and in-context objects (bottom row).
  • Figure 2: COinCO pipeline. Left: Building COinCO. We start with COCO images and annotations. For each image, we randomly select an object and replace it using Stable Diffusion inpainting. We verify the inpainted object using object detection and a vision-language model. Successfully inpainted images are added to the dataset, while failed cases are regenerated and retested. Finally, each inpainted object is labeled as in-context or out-of-context based on three criteria: location, size, and co-occurrence. Right: COinCO Tasks. We demonstrate three downstream applications enabled by COinCO. Sec. 5.1: Fine-grained context classification evaluates objects on each criterion using vision-language models. Sec. 5.2: Objects-from-Context prediction identifies which object categories fit a given context at both instance- and clique-level semantics, and can recover the original object with an image generation model. Sec. 5.3: Context-enhanced fake detection improves pretrained fake detectors by enhancing predictions in out-of-context regions.
  • Figure 3: Context reasoning prompt for LVLMs.
  • Figure 4: Inpainting, fake detection, and objects-from-context results. Context reasoning responses are color-coded by location, size, and co-occurrence. Original objects are in red. Inpainted objects: kite, cow, horse, cup, sports ball, refrigerator.
  • Figure 5: Objects-from-Context prediction. A red box in each image indicates the query region. The top row shows three examples (two inpainted, one original COCO). The bottom row lists instance-level and clique-level predictions with their probabilities (P(%)). Objects in red are the top predictions.
  • ...and 2 more figures
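
The dataset-building loop described in the Figure 2 caption (select an object, inpaint a replacement, verify, regenerate on failure, then label by location, size, and co-occurrence) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_record`, `InpaintRecord`, and the `inpaint`, `verify`, and `judge` callables are hypothetical names standing in for the Stable Diffusion inpainting step, the detector plus vision-language check, and the context assessment. The sketch also assumes that the replacement category is drawn uniformly from the other categories, that a failed case is regenerated with the same category up to `max_attempts` times, and that an object is in-context only if all three criteria pass. None of these details are stated in the caption.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence

# The three criteria named in the Figure 2 caption.
CRITERIA = ("location", "size", "co-occurrence")


@dataclass
class InpaintRecord:
    image_id: str
    original_category: str
    inpainted_category: str
    criteria_ok: Dict[str, bool]  # one verdict per criterion
    attempts: int                 # generations needed before verification passed

    @property
    def in_context(self) -> bool:
        # Assumption: in-context only if every criterion passes.
        return all(self.criteria_ok[name] for name in CRITERIA)


def build_record(
    image_id: str,
    object_categories: Sequence[str],   # categories annotated in this image
    all_categories: Sequence[str],      # full category vocabulary
    inpaint: Callable[[str, str, str], object],      # hypothetical inpainting step
    verify: Callable[[object, str], bool],           # hypothetical detector + VLM check
    judge: Callable[[object, str], Dict[str, bool]], # hypothetical context assessment
    rng: random.Random,
    max_attempts: int = 3,              # assumed retry budget
) -> Optional[InpaintRecord]:
    # Randomly select one annotated object to replace.
    original = rng.choice(list(object_categories))
    # Assumption: the replacement is drawn uniformly from the other categories.
    candidates = [c for c in all_categories if c != original]
    replacement = rng.choice(candidates)

    for attempt in range(1, max_attempts + 1):
        image = inpaint(image_id, original, replacement)
        if verify(image, replacement):
            # Verified: label the inpainted object on each criterion.
            verdicts = dict(judge(image, replacement))
            return InpaintRecord(image_id, original, replacement, verdicts, attempt)
        # Verification failed: regenerate and retest on the next iteration.
    return None  # never verified within the retry budget
```

Passing the three stages in as callables keeps the control flow testable with simple stubs; real models would be wrapped behind the same signatures.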