Visual Transformation Telling

Wanqing Cui, Xin Hong, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng

TL;DR

Visual Transformation Telling (VTT) defines a real-world visual reasoning task to describe transformations between adjacent image states, aiming to uncover underlying actions. The authors construct a 13,547-sample dataset from CrossTask and COIN and benchmark traditional visual storytelling models and multimodal large language models, revealing substantial room for improvement and four common error types: bias, misidentification, hallucination, and illogicality. They propose TTNet, a transformation-aware model with difference-sensitive encoding, masked transformation modeling, and auxiliary learning, and show ablations that emphasize the importance of semantic differences and context. The work highlights the need for richer transformation supervision and data to enable robust transformation reasoning, with potential applications in video generation, procedural planning, and embodied AI.

Abstract

Humans can naturally reason from superficial state differences (e.g. wet ground) to descriptions of the underlying transformation (e.g. rain), drawing on their life experience. In this paper, we propose a new visual reasoning task, Visual Transformation Telling (VTT), to test this transformation reasoning ability in real-world scenarios. Given a series of states (i.e. images), VTT requires describing the transformation that occurs between every two adjacent states. Unlike existing visual reasoning tasks, which focus on reasoning about surface states, VTT captures the underlying causes, e.g. actions or events, behind the differences among states. To support the study of transformation reasoning, we collect a novel dataset of 13,547 samples from two existing instructional video datasets, CrossTask and COIN. Each sample consists of key state images along with their transformation descriptions. The dataset covers diverse real-world activities, providing a rich resource for training and evaluation. To construct an initial benchmark for VTT, we test several models, including traditional visual storytelling methods (CST, GLACNet, Densecap) and advanced multimodal large language models (LLaVA v1.5-7B, Qwen-VL-chat, Gemini Pro Vision, GPT-4o, and GPT-4). Experimental results reveal that even state-of-the-art models still struggle with VTT, leaving substantial room for improvement.
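
To make the task format concrete, here is a minimal Python sketch of how one VTT sample might be represented. It assumes only what the abstract states: a sample is an ordered series of state images with one transformation description per pair of adjacent states. The class name `VTTSample`, its field names, the file names, and the example descriptions are all hypothetical and are not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VTTSample:
    """Hypothetical container for one VTT sample (names are assumptions)."""

    states: List[str]           # key state images, in temporal order
    transformations: List[str]  # one description per adjacent pair of states

    def __post_init__(self) -> None:
        # N states have N - 1 adjacent pairs, hence N - 1 descriptions.
        if len(self.states) < 2:
            raise ValueError("a sample needs at least two states")
        if len(self.transformations) != len(self.states) - 1:
            raise ValueError("expected one transformation per adjacent state pair")

    def adjacent_pairs(self) -> List[Tuple[str, str, str]]:
        """Return (state_before, state_after, description) triples."""
        return list(zip(self.states, self.states[1:], self.transformations))


# Made-up example: three states, therefore two transformations.
sample = VTTSample(
    states=["state_0.jpg", "state_1.jpg", "state_2.jpg"],
    transformations=["peel the mango", "cut the mango"],
)

for before, after, description in sample.adjacent_pairs():
    print(f"{before} -> {after}: {description}")
```

At inference time a model would receive only `states` and would be expected to produce the `transformations` list.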

Paper Structure (30 sections, 2 equations, 11 figures, 13 tables)

Figures (11)

  • Figure 1: Visual Transformation Telling (VTT). Given states, which are images extracted from videos, the goal is to reason and describe transformations between every two adjacent states.
  • Figure 2: Distributions of VTT samples. (a) Category. (b) Words. (c) Transformation length (top), sentence length (bottom). (d) Topic.
  • Figure 3: Performance of models under different data: (a) The SPICE values with respect to the number of transformation items. (b) The SPICE values with respect to different categories of data.
  • Figure 4: Qualitative comparison on the VTT test data. Above: cut mango. Below: wear contact lenses. Different error types are marked with different colors: bias (red), misidentification (green), hallucination (orange), and illogicality (blue).
  • Figure 5: The web interface of human evaluation on VTT.
  • ...and 6 more figures