Resilience through Scene Context in Visual Referring Expression Generation

Simeon Junker, Sina Zarrieß

TL;DR

This work tackles Referring Expression Generation (REG) under imperfect visual input by injecting noise into target representations while supplying scene context in visual and symbolic forms. It demonstrates that scene context acts as a resilience resource, enabling accurate referent-type identification even when the target is occluded, across two Transformer-based REG architectures (TRF and CC) and multiple input configurations. Automatic metrics and human judgments converge to show that context improves robustness, with symbolic context sometimes outperforming visual cues and attention analyses revealing context-driven copying tendencies. The findings highlight the central role of scene understanding in REG and motivate future research on richer scene-context representations and more diverse datasets to better separate human-like from model-driven use of context.

Abstract

Scene context is well known to facilitate humans' perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models' visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.

Paper Structure (29 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Example from RefCOCO (displayed with noise level $0.5$) with generated expressions and human judgments. Visual or symbolic scene context makes it possible to identify even fully occluded targets (noise $1.0$).
  • Figure 2: Relative CIDEr scores with respect to noise $0.0$ for RefCOCO testA and testB. For both TRF and CC, model variants with access to context are more robust against noise, especially for testB.
  • Figure 3: Examples from RefCOCO with generated expressions and human judgments (targets are marked red).
  • Figure 4: Examples from RefCOCO with expressions generated by CC variants and human judgments (targets are marked red).