ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes

Hashmat Shadab Malik, Muhammad Huzaifa, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

TL;DR

This work evaluates the resilience of current vision-based models against diverse object-to-background context variations and harnesses the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes.

Abstract

Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. Most robustness evaluation methods either introduce synthetic datasets to induce changes to object characteristics (viewpoint, scale, color) or apply image transformation techniques (adversarial changes, common corruptions) to real images to simulate distribution shifts. Recent works have explored leveraging large language models and diffusion models to generate changes in the background. However, these methods either offer little control over the changes to be made or distort the object semantics, making them unsuitable for the task. Our method, on the other hand, can induce diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embeddings of text-to-image models. We produce various versions of standard vision datasets (ImageNet, COCO), either incorporating diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct extensive experiments to analyze the robustness of vision-based models against object-to-background context variations across diverse tasks. Code: https://github.com/Muhammad-Huzaifaa/ObjectCompose

Paper Structure (33 sections, 7 equations, 36 figures, 23 tables, 1 algorithm)

This paper contains 33 sections, 7 equations, 36 figures, 23 tables, 1 algorithm.

Figures (36)

  • Figure 1: Image-to-background variations generated by our method, with each column representing a specific background based on the prompt below.
  • Figure 2: ObjectCompose uses an inpainting-based diffusion model to generate counterfactual backgrounds. The object mask is obtained from SAM using the class label as a prompt. The segmentation mask and original image caption (from BLIP-2) are fed into the diffusion model. For adversarial examples, both the latent and conditional embeddings are optimized during denoising.
  • Figure 3: The loss surfaces (flipped) of the ViT-S depicted on ImageNet-B. Significant distribution shifts result in narrow and shallow surfaces at convergence.
  • Figure 4: Qualitative comparison of our method (bottom row) with previous related work (top row). Our method enables diversity and controlled background edits.
  • Figure 5: Evaluating LANCE on $\texttt{ImageNet-B}_{1000}$ dataset with masked background.
  • ...and 31 more figures
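
The three stages named in the Figure 2 caption (object mask, image caption, inpainting-based generation) can be illustrated with a minimal, hedged sketch. This is not the authors' implementation: `compose_background`, the three callables it takes, and the toy stand-ins below are hypothetical names introduced only for illustration, and the explicit pixel-level merge at the end is an assumption of this sketch about how object appearance could be preserved. In a real pipeline the callables would wrap SAM, BLIP-2, and an inpainting diffusion model.

```python
from typing import Callable, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values
Mask = List[List[int]]   # 1 = object pixel, 0 = background pixel


def compose_background(
    image: Image,
    class_label: str,
    segment: Callable[[Image, str], Mask],
    caption: Callable[[Image], str],
    inpaint: Callable[[Image, Mask, str], Image],
    background_prompt: str = "",
) -> Image:
    """Hypothetical orchestration of the stages described in Figure 2."""
    # Image-to-segment stage: locate the object using the class label.
    mask = segment(image, class_label)
    # Image-to-text stage: fall back to the image caption when no
    # explicit background prompt is supplied.
    prompt = background_prompt or caption(image)
    # Text-to-image stage: generate a new background conditioned on the prompt.
    generated = inpaint(image, mask, prompt)
    # Assumption of this sketch: copy object pixels back from the original
    # so that only the background can change.
    return [
        [orig if m == 1 else gen for orig, gen, m in zip(row_o, row_g, row_m)]
        for row_o, row_g, row_m in zip(image, generated, mask)
    ]


# Toy stand-ins so the sketch runs without any model weights.
def toy_segment(image: Image, label: str) -> Mask:
    # Treat bright pixels as the object (purely illustrative).
    return [[1 if px > 100 else 0 for px in row] for row in image]


def toy_caption(image: Image) -> str:
    return "a photo of an object"


def toy_inpaint(image: Image, mask: Mask, prompt: str) -> Image:
    # Fill every pixel with a value derived from the prompt length.
    fill = len(prompt) % 256
    return [[fill for _ in row] for row in image]


if __name__ == "__main__":
    img = [[200, 10], [20, 150]]
    print(compose_background(img, "dog", toy_segment, toy_caption, toy_inpaint,
                             background_prompt="snow"))
```

Passing the stages as callables keeps the sketch independent of any particular model library: changing `background_prompt` corresponds to the prompt-based (natural) background changes described in the abstract, while the adversarial variant, which optimizes latents and text embeddings during denoising, is not covered here.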