SPOLRE: Semantic Preserving Object Layout Reconstruction for Image Captioning System Testing

Yi Liu; Guanyu Wang; Xinyi Zheng; Gelei Deng; Kailong Wang; Yang Liu; Haoyu Wang

TL;DR

SPOLRE tests image captioning (IC) systems by reconstructing object layouts that vary in arrangement while preserving the image's semantics. It combines semantic segmentation, inpainting-based mask refinement, metamorphic layout edits (translation, rotation, scaling, mirroring), and diffusion-based mask-to-image translation to generate diverse, realistic test cases without manual annotation. Across seven IC systems and 200 seed images, SPOLRE produces more realistic images and achieves higher error-detection precision (91.62% on average) than state-of-the-art baselines, uncovering 31,544 erroneous captions, including 6,236 in Azure. The results indicate that semantic-preserving layout transformations can substantially broaden IC test coverage; the artifacts and datasets are open source to support further research.

Abstract

Image captioning (IC) systems, such as Microsoft Azure Cognitive Service, translate image content into descriptive language but can generate inaccurate captions that lead to misinterpretation. Advanced testing techniques like MetaIC and ROME aim to address these issues but face significant challenges. These methods require intensive manual labor for detailed annotations and often produce unrealistic images, either by adding unrelated objects or by failing to remove existing ones. Additionally, they generate limited test suites, with MetaIC restricted to inserting specific objects and ROME limited to a narrow range of variations. We introduce SPOLRE, a novel automated tool for semantic-preserving object layout reconstruction in IC system testing. SPOLRE leverages four transformation techniques to modify object layouts without altering the image's semantics. This automated approach eliminates the need for manual annotations and creates realistic, varied test suites. In our survey, over 75% of respondents find SPOLRE-generated images more realistic than those from state-of-the-art methods. SPOLRE excels in identifying caption errors, detecting 31,544 incorrect captions across seven IC systems with an average precision of 91.62%, surpassing other methods, which average 85.65% accuracy and identify 17,160 incorrect captions. Notably, SPOLRE identified 6,236 unique issues within Azure, demonstrating its effectiveness against one of the most advanced IC systems.
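The four layout transformations named in the abstract (translation, rotation, scaling, mirroring) can be illustrated on a toy binary object mask. The following is a minimal sketch under stated assumptions, not SPOLRE's implementation: the function names, the list-of-lists mask representation, the restriction of rotation to 90-degree steps, and nearest-neighbour scaling about the canvas centre are all hypothetical choices made to keep the example dependency-free.

```python
# Toy layout edits on a binary object mask (rows of 0/1 integers).
# All names and design choices here are illustrative assumptions.

def translate(mask, dy, dx):
    """Shift the object by (dy, dx); pixels leaving the canvas are dropped."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if mask[y][x] and 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = mask[y][x]
    return out

def mirror(mask):
    """Flip the mask horizontally (left-right)."""
    return [row[::-1] for row in mask]

def rotate90(mask):
    """Rotate the mask 90 degrees clockwise (assumed simplification:
    arbitrary angles would need interpolation)."""
    return [list(col) for col in zip(*mask[::-1])]

def scale(mask, factor):
    """Resize the object about the canvas centre on a fixed-size canvas,
    using nearest-neighbour sampling."""
    h, w = len(mask), len(mask[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map each output pixel back to its source pixel.
            sy = round(cy + (y - cy) / factor)
            sx = round(cx + (x - cx) / factor)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = mask[sy][sx]
    return out

def area(mask):
    """Number of object pixels in the mask."""
    return sum(map(sum, mask))
```

Translation, mirroring, and 90-degree rotation preserve the object's pixel count as long as it stays inside the canvas, which gives a cheap sanity check on an edited mask before it is passed to a mask-to-image step.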

Paper Structure (26 sections, 3 equations, 8 figures, 2 tables, 3 algorithms)

Introduction
Background
Image Captioning
Testing for IC Systems
Image-to-Image Translation
Motivation
Methodology And Implementation
Semantic Segmentation
Mask Extractor
Layout Editor
Mask-to-Image Translation
Caption Parser
Error Detection
Evaluation
RQ1: What is the level of realism and diversity in the images generated by SPOLRE?
...and 11 more sections

Figures (8)

Figure 1: Examples of three types of common errors.
Figure 2: Unrealistic examples generated by existing metamorphic testing frameworks for IC systems. The left is the seed image. The middle and right come from MetaIC and ROME.
Figure 3: SPOLRE overview with two major parts: Image Processing and Text Processing.
Figure 4: The impact of inpainting image on generating new images when extracting masks.
Figure 5: Research results on the authenticity and diversity of generated images.
...and 3 more figures

Theorems & Definitions (1)

Definition 1
