How Multi-Modal LLMs Reshape Visual Deep Learning Testing? A Comprehensive Study Through the Lens of Image Mutation

Liwen Wang; Yuanyuan Yuan; Ao Sun; Zongjie Li; Pingchuan Ma; Daoyuan Wu; Shuai Wang

TL;DR

The paper investigates how multi-modal LLMs enable instruction-driven image mutations for visual deep learning testing, evaluating semantic validity, prompt alignment, and faithfulness through a large-scale human study. It shows that MLLMs cannot reliably edit existing semantics like traditional mutations but can generate powerful semantic-replacement mutations that expand the testing landscape; two specialized pipelines (unidirectional fine-tuning and bidirectional post-hoc) offer complementary strengths. The findings underscore the complementary role of MLLM-based mutations with traditional mutations, highlight limitations of existing validation metrics for such mutations, and advocate an integrated testing approach that leverages the best of both worlds for robust VDL testing. Practically, this work informs how to design mutation pipelines, select appropriate datasets, and develop better evaluation metrics to harness MLLMs for comprehensive, reliable VDL testing.

Abstract

Visual deep learning (VDL) systems have shown significant success in real-world applications like image recognition, object detection, and autonomous driving. To evaluate the reliability of VDL, a mainstream approach is software testing, which requires diverse mutations over image semantics. The rapid development of multi-modal large language models (MLLMs) has introduced revolutionary image mutation potential through instruction-driven methods. Users can now freely describe desired mutations and let MLLMs generate the mutated images. Hence, parallel to the recent success of large language models (LLMs) in traditional software fuzzing, one may also expect MLLMs to be promising for VDL testing in terms of offering unified, diverse, and complex image mutations. However, the quality and applicability of MLLM-based mutations in VDL testing remain largely unexplored. We present the first study of this question, assessing MLLMs' adequacy in terms of 1) the semantic validity of MLLM-mutated images, 2) the alignment of MLLM-mutated images with their text instructions (prompts), and 3) the faithfulness of mutations, i.e., how well different mutations preserve semantics that ought to remain unchanged. With large-scale human studies and quantitative evaluations, we identify MLLMs' promising potential in expanding the semantics covered by image mutations. Notably, while state-of-the-art MLLMs (e.g., GPT-4V) fail to support, or perform worse in, editing existing semantics in images (as in traditional mutations like rotation), they generate high-quality test inputs using "semantic-replacement" mutations (e.g., "dress a dog with clothes"), which bring extra semantics to images; these were infeasible for past approaches. Hence, we view MLLM-based mutations as a vital complement to traditional mutations, and advocate that future VDL testing combine MLLM-based methods and traditional image mutations for comprehensive and reliable testing.
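To make the testing setup in the abstract concrete, the following is a minimal sketch of mutation-based testing with traditional mutations, assuming a toy setting that does not come from the paper: images are small grayscale grids, `toy_classifier` is a hypothetical stand-in for a VDL model, and `rotate90` and `brighten` are illustrative mutations. A mutation that should preserve the label but changes the model's prediction is reported as a violation.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # grayscale image as rows of pixel intensities (0-255)


def rotate90(img: Image) -> Image:
    """Traditional mutation: rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def brighten(img: Image, delta: int = 40) -> Image:
    """Traditional mutation: add a constant to every pixel, clipped at 255."""
    return [[min(255, p + delta) for p in row] for row in img]


def toy_classifier(img: Image) -> str:
    """Hypothetical stand-in for a VDL model: labels by mean intensity."""
    pixels = [p for row in img for p in row]
    return "bright" if sum(pixels) / len(pixels) >= 128 else "dark"


def find_violations(
    model: Callable[[Image], str],
    img: Image,
    mutations: Dict[str, Callable[[Image], Image]],
) -> List[str]:
    """Return the names of mutations that change the model's prediction."""
    expected = model(img)
    return [name for name, mutate in mutations.items()
            if model(mutate(img)) != expected]


seed = [[100, 110], [120, 130]]
violations = find_violations(
    toy_classifier, seed, {"rotate90": rotate90, "brighten": brighten}
)
print(violations)
```

An instruction-driven MLLM mutation would slot into the same harness as another entry in `mutations`, mapping an image (plus a fixed prompt) to a mutated image; the study's questions about validity, alignment, and faithfulness concern whether such outputs are usable test inputs in the first place.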

Paper Structure (20 sections, 1 equation, 4 figures, 7 tables)

Introduction
Background: VDL Testing
Image Mutations and Validation
Explicit & Mathematical Transformations
Implicit & Data Exploration
MLLM-Based Mutations
Research Motivations
Research Study Setup
Evaluated Aspects
MLLMs and Their Pipelines
Datasets and Mutations
Human Studies Setup
Results and Findings
RQ1: Unifying and Generalizing Mutations
RQ2: Expanding Mutated Semantics
...and 5 more sections

Figures (4)

Figure 1: Decomposition of image semantics and their corresponding mutation schemes.
Figure 2: Pipelines of MLLM-based mutations. The figure shows the most straightforward dialog-based pipeline, along with two pipelines specifically optimized for image mutations. The unidirectional pipeline requires fine-tuning an MLLM and should be used along with this tailored MLLM. The bidirectional pipeline is post-hoc and supports incorporating different MLLMs.
Figure 3: Textual descriptions generated in the bidirectional & post-hoc pipeline of MLLM-based mutations.
Figure 4: Mutation examples achieved via traditional method and different MLLMs.
