Visual Transformation Telling

Wanqing Cui; Xin Hong; Yanyan Lan; Liang Pang; Jiafeng Guo; Xueqi Cheng

TL;DR

Visual Transformation Telling (VTT) defines a real-world visual reasoning task to describe transformations between adjacent image states, aiming to uncover underlying actions. The authors construct a 13,547-sample dataset from CrossTask and COIN and benchmark traditional visual storytelling models and multimodal large language models, revealing substantial room for improvement and four common error types: bias, misidentification, hallucination, and illogicality. They propose TTNet, a transformation-aware model with difference-sensitive encoding, masked transformation modeling, and auxiliary learning, and show ablations that emphasize the importance of semantic differences and context. The work highlights the need for richer transformation supervision and data to enable robust transformation reasoning, with potential applications in video generation, procedural planning, and embodied AI.

Abstract

Humans can naturally reason from superficial state differences (e.g. ground wetness) to transformation descriptions (e.g. raining) according to their life experience. In this paper, we propose a new visual reasoning task to test this transformation reasoning ability in real-world scenarios, called Visual Transformation Telling (VTT). Given a series of states (i.e. images), VTT requires describing the transformation that occurs between every two adjacent states. Unlike existing visual reasoning tasks that focus on surface state reasoning, VTT captures the underlying causes, e.g. actions or events, behind the differences among states. To support the study of transformation reasoning, we collect a novel dataset of 13,547 samples from two existing instructional video datasets, CrossTask and COIN. Each sample consists of the key state images along with their transformation descriptions. Our dataset covers diverse real-world activities, providing a rich resource for training and evaluation. To construct an initial benchmark for VTT, we test several models, including traditional visual storytelling methods (CST, GLACNet, Densecap) and advanced multimodal large language models (LLaVA v1.5-7B, Qwen-VL-chat, Gemini Pro Vision, GPT-4o, and GPT-4). Experimental results reveal that even state-of-the-art models still face challenges in VTT, highlighting substantial room for improvement.
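
The input/output shape of the task described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' data format: the class name `VTTSample`, its fields, the file names, and the example descriptions are all hypothetical. The only property taken from the abstract is that each pair of adjacent states gets one transformation description, so a sample with N states carries N - 1 descriptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VTTSample:
    """Hypothetical container for one VTT sample (illustrative, not the paper's schema)."""

    states: List[str]           # identifiers of the key state images, in temporal order
    transformations: List[str]  # one description per pair of adjacent states

    def __post_init__(self) -> None:
        # A transformation needs a "before" and an "after" state.
        if len(self.states) < 2:
            raise ValueError("a sample needs at least two states")
        # N states give N - 1 adjacent pairs, hence N - 1 descriptions.
        if len(self.transformations) != len(self.states) - 1:
            raise ValueError("expected exactly one transformation per adjacent state pair")

    def adjacent_pairs(self) -> List[Tuple[str, str, str]]:
        """Return (state_before, state_after, description) triples in order."""
        return list(zip(self.states, self.states[1:], self.transformations))


# Made-up example: three states, hence two transformations.
sample = VTTSample(
    states=["state_0.jpg", "state_1.jpg", "state_2.jpg"],
    transformations=["peel the fruit", "cut the fruit"],
)

for before, after, description in sample.adjacent_pairs():
    print(f"{before} -> {after}: {description}")
```

A model for the task would receive only `states` and be asked to produce `transformations`; the length check makes the one-description-per-adjacent-pair constraint explicit.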

Paper Structure (30 sections, 2 equations, 11 figures, 13 tables)


Introduction
Related Works
Visual Transformation Telling Dataset
Task Definition
VTT Dataset Construction
Benchmark on VTT
Model Selection
Evaluation Protocol
Experimental Results and Analysis
Comparison of Baseline Models
Qualitative Analysis and Common Error Types
Further Exploration
Conclusion and Discussion
Limitation.
Dataset Scale Discussion
...and 15 more sections

Figures (11)

Figure 1: Visual Transformation Telling (VTT). Given states, which are images extracted from videos, the goal is to reason and describe transformations between every two adjacent states.
Figure 2: Distributions of VTT samples. (a) Category. (b) Words. (c) Transformation length (top), sentence length (bottom). (d) Topic.
Figure 3: Performance of models under different data: (a) The SPICE values with respect to the number of transformation items. (b) The SPICE values with respect to different categories of data.
Figure 4: Qualitative comparison on the VTT test data. Above: cut mango. Below: wear contact lenses. Different error types are marked with different colors: bias (red), misidentification (green), hallucination (orange), and illogicality (blue).
Figure 5: The web interface of human evaluation on VTT.
...and 6 more figures
