Resilience through Scene Context in Visual Referring Expression Generation
Simeon Junker, Sina Zarrieß
TL;DR
This work studies Referring Expression Generation (REG) under imperfect visual input: noise is added to target representations while scene context is supplied in visual or symbolic form. It shows that scene context acts as a resilience resource. Across two Transformer-based REG architectures (TRF and CC) and multiple input configurations, models can still identify the referent type even when the target is fully obscured. Automatic metrics and human judgments agree that context improves robustness; symbolic context sometimes outperforms visual context, and attention analyses reveal context-driven copying tendencies. The findings highlight the central role of scene understanding in REG and motivate future research on richer scene-context representations and more diverse datasets, so that human-like use of context can be better separated from model-driven use.
Abstract
Scene context is well known to facilitate humans' perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models' visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.
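To make the idea of obscuring a target "to varying degrees" concrete, the following is a minimal sketch of one possible perturbation scheme: a chosen fraction of the pixels inside the target bounding box is overwritten with random values, while the surrounding scene is left intact. The function name `obscure_target`, the uniform noise distribution, the `(x0, y0, x1, y1)` box format, and the H x W x C `uint8` image layout are all assumptions made for illustration; the abstract does not specify the exact procedure used in the paper.

```python
import numpy as np


def obscure_target(image, box, noise_level, rng=None):
    """Overwrite a fraction of the pixels inside `box` with uniform noise.

    Hypothetical helper, not the paper's procedure.
    image:       H x W x C uint8 array (assumed layout)
    box:         (x0, y0, x1, y1) target bounding box, in pixels
    noise_level: fraction of target pixels to replace, in [0, 1]
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng

    x0, y0, x1, y1 = box
    out = image.copy()                 # leave the input image untouched
    region = out[y0:y1, x0:x1]         # a view into `out`, not a copy
    h, w = region.shape[:2]

    # Pick the pixel positions to overwrite, without repetition.
    n_noisy = int(round(noise_level * h * w))
    flat_idx = rng.choice(h * w, size=n_noisy, replace=False)
    ys, xs = np.unravel_index(flat_idx, (h, w))

    # Draw one random value per channel for each selected pixel.
    noise = rng.integers(0, 256, size=(n_noisy, image.shape[2]), dtype=np.uint8)
    region[ys, xs] = noise
    return out
```

With `noise_level=0.0` the image is returned unchanged, and with `noise_level=1.0` every pixel of the target is replaced, which corresponds to the condition where visual information about the target is completely missing and only the scene context remains informative.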
