Robust Context-Aware Object Recognition

Klara Janouskova, Cristian Gavrus, Jiri Matas

TL;DR

This work tackles the instability of visual recognition models caused by over-reliance on background context by proposing RCOR, Robust Context-Aware Object Recognition. RCOR jointly models foreground object features and contextual information by decoupling FG and full representations via class-agnostic localization and then fusing them with a robust, non-parametric rule. The approach yields robustness to background distribution shifts while preserving in-domain accuracy, demonstrated across both supervised models and vision-language models on ImageNet-1k–like benchmarks and several fine-grained datasets. The findings highlight that localization quality is the main limiting factor, suggesting ample room for gains from improved FG localization, and show RCOR’s practical potential for real-world robustness without requiring extensive fine-tuning.

Abstract

In visual recognition, both the object of interest (referred to as foreground, FG, for simplicity) and its surrounding context (background, BG) play an important role. However, standard supervised learning often leads to unintended over-reliance on the BG, known as shortcut learning of spurious correlations, which limits model robustness in real-world deployment. In the literature, the problem is mainly addressed by suppressing the BG, sacrificing context information for improved generalization. We propose RCOR -- Robust Context-Aware Object Recognition -- the first approach that jointly achieves robustness and context-awareness without compromising either. RCOR treats localization as an integral part of recognition to decouple object-centric and context-aware modelling, followed by a robust, non-parametric fusion. It improves the performance of both supervised models and VLMs on datasets with both in-domain and out-of-domain BG, even without fine-tuning. The results confirm that localization before recognition is now possible even in complex scenes such as those in ImageNet-1k.
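The decouple-then-fuse idea can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the one used here (keep whichever of the FG and full-image predictions has the higher top-1 probability) is purely an assumption for illustration, and the names `fuse_max_confidence` and `predict` are hypothetical. The fallback to the full image when no FG crop is available mirrors the role the full image plays when detection fails.

```python
from typing import List, Optional, Sequence


def fuse_max_confidence(
    p_full: Sequence[float], p_fg: Optional[Sequence[float]]
) -> List[float]:
    """Hypothetical non-parametric fusion (an assumption, not the paper's rule).

    Returns the class distribution whose top-1 probability is higher.
    If no FG crop exists (detection failed), falls back to the full image.
    """
    if p_fg is None:
        return list(p_full)
    if len(p_fg) != len(p_full):
        raise ValueError("FG and full predictions must cover the same classes")
    # Ties go to the context-aware full-image prediction.
    return list(p_fg) if max(p_fg) > max(p_full) else list(p_full)


def predict(p: Sequence[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(p)), key=lambda k: p[k])


# Full image is misled by the background; the FG crop is confident.
p_full = [0.5, 0.3, 0.2]
p_fg = [0.1, 0.85, 0.05]
fused = fuse_max_confidence(p_full, p_fg)
no_detection = fuse_max_confidence(p_full, None)
```

Because the rule has no learned parameters, it can be applied to frozen models, which is consistent with the claim that gains are possible without fine-tuning.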

Paper Structure

This paper contains 26 sections, 2 equations, 7 figures, 14 tables, 1 algorithm.

Figures (7)

  • Figure 1: The complementarity of object (fg) and context (bg). The standard approach, bg suppression, makes correct identification in (a) nearly impossible, and difficult in (b); the spectacled bear is the most herbivorous of all bear species, but its facial marks are partially occluded. In generated content (d), any fg can appear on any bg as in ChatGPT 4o's response to "a dolphin on the moon". Rare, even adversarial bgs with possibly huge diversity hurt classification -- (e) shows a cheetah after a snowfall in South Africa, not a snow leopard.
  • Figure 2: VLM (CLIP-B) -- zero-shot recognition with ground truth prompts and selected distractors. In the top example, recognition fails on the foreground (left, crop of a tight object bounding box). In the bottom, it fails on the full image (right). The proposed robust fusion, RCOR, is correct both times.
  • Figure 3: The proposed approach to robust context-aware recognition proceeds in three stages: (1) decomposition of image $x$ into fg and bg by zero-shot class-agnostic detection, (2) independent modelling of the fg and of the context-aware full image (the original image), which also serves as a fallback option when detection fails, and (3) fusion that robustly combines the representations from stage (2) to form the output prediction $p(k|x)$.
  • Figure 4: Localisation --- the role of objectness. Blue crops (maximising weighted confidence) lead to correct predictions, red crops (maximising unweighted confidence) lead to incorrect predictions, representing incomplete, unfocused or over-zoomed regions.
  • Figure 5: Text (red) vs image query (blue) localisation for the 'hard-disk' ImageNet-1k class (592) using OWLv2.
  • ...and 2 more figures
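The contrast in Figure 4 between weighted and unweighted confidence can be sketched as a crop-selection step. The caption does not say how the weighting is computed, so multiplying the classifier's top-1 confidence by the detector's objectness score is an assumption made here; `Candidate` and `select_crop` are hypothetical names, and the scores are made-up illustrative values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    objectness: float  # class-agnostic detector score in [0, 1]
    confidence: float  # classifier top-1 probability on this crop


def select_crop(cands: List[Candidate], weighted: bool = True) -> Candidate:
    """Pick the crop with the highest score.

    weighted=True  -> objectness * confidence (assumed weighting)
    weighted=False -> confidence only
    """
    if not cands:
        raise ValueError("no candidate crops")

    def score(c: Candidate) -> float:
        return c.objectness * c.confidence if weighted else c.confidence

    return max(cands, key=score)


cands = [
    Candidate(box=(10, 10, 200, 180), objectness=0.9, confidence=0.70),
    # Over-zoomed patch: the classifier is confident, the detector is not.
    Candidate(box=(90, 80, 120, 110), objectness=0.2, confidence=0.95),
]
blue = select_crop(cands, weighted=True)   # whole-object crop
red = select_crop(cands, weighted=False)   # over-zoomed crop
```

In this toy case the unweighted rule is drawn to a small patch on which the classifier happens to be confident, while the objectness weighting keeps the selection on a region the detector considers a complete object.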