MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

Chenhao Zhang; Yazhe Niu; Hongsheng Li

TL;DR

MetaphorStar tackles image implication, a non-literal visual reasoning challenge that extends beyond standard VQA. It introduces TFQ-GRPO, an end-to-end visual reinforcement learning pipeline trained on the fine-grained TFQ-Data dataset and evaluated on TFQ-Bench, with a structured prompting scheme: Image Description → Implication Analysis → Final Answer. The approach yields state-of-the-art results on True-False Question (TFQ), Multiple-Choice Question (MCQ), and Open-Style Question (OSQ) benchmarks, generalizes well to broader visual reasoning tasks, and shows that end-to-end RL avoids the detrimental effects of supervised fine-tuning (the 'SFT Curse'). All models, datasets, and code are open-sourced to support further study of non-literal visual understanding. Overall, training on image implication strengthens core visual reasoning capabilities and offers a principled path to richer multimodal cognition.

Abstract

Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) results on the Multiple-Choice Question (MCQ) and Open-Style Question (OSQ) tasks and significantly outperforms the top closed-source model, Gemini-3.0-pro, on the True-False Question (TFQ) task. Crucially, our experiments reveal that learning image implication tasks improves general understanding ability, especially complex visual reasoning. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open-source all model weights, datasets, and method code at https://metaphorstar.github.io.
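
As a rough illustration of how a true-false question can drive a GRPO-style update, the sketch below scores a group of sampled responses and normalizes the rewards within the group. It is not the paper's TFQ-GRPO implementation: the `Final Answer:` marker, the binary reward, the zero reward for unparseable output, and the function names `tfq_reward` and `group_relative_advantages` are all assumptions made for this example. Only the within-group normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation.

```python
import re
import statistics

# Hypothetical answer marker; the paper's actual output format may differ.
ANSWER_PATTERN = re.compile(r"Final Answer:\s*(True|False)", re.IGNORECASE)


def tfq_reward(response: str, label: bool) -> float:
    """Binary reward (assumed): 1.0 if the parsed answer matches the label."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        # Assumption: responses without a parseable answer earn no reward.
        return 0.0
    predicted = match.group(1).lower() == "true"
    return 1.0 if predicted == label else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one true-false statement whose label is True.
    group = [
        "Image Description: ...\nImplication Analysis: ...\nFinal Answer: True",
        "Image Description: ...\nImplication Analysis: ...\nFinal Answer: False",
        "Image Description: ...\nImplication Analysis: ...\nFinal Answer: True",
        "Image Description: ... (no final answer given)",
    ]
    rewards = [tfq_reward(response, label=True) for response in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

When every response in a group receives the same reward, all advantages are zero, so such a group contributes no learning signal under this normalization.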

Paper Structure (32 sections, 3 equations, 4 figures, 11 tables)

Introduction
Related Work
Image Implication
Vision-language Reasoning
Method
True-False Question (TFQ) For Image Implication Understanding
TFQ-Data & TFQ-Bench
Data Generation
Dataset and Benchmark Splits
TFQ-GRPO
MetaphorStar Family
Training Setup
Analyzing Token Entropy in Reasoning
Experiment
Main Experiment
...and 17 more sections

Figures (4)

Figure 1: A picture is worth a thousand words: While MLLMs excel at literal object recognition ("See things as they are"), they often miss the deeper implication. Humans and our MetaphorStar model instead interpret the world in the spirit of "See things as we are", grasping the complex implications that lie behind simple factual descriptions.
Figure 2: The visualization of token entropy for MetaphorStar-7B on TFQ, MCQ, and OSQ. High entropy (red) indicates high uncertainty, while low entropy (blue) indicates high confidence.
Figure 3: The model parameter scaling law.
Figure 4: Entropy loss of models with different strategies.
