Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Minheng Ni; Yutao Fan; Lei Zhang; Wangmeng Zuo

TL;DR

The paper addresses ambiguity in natural-language instructions for multi-modal tasks. It introduces Visual-O1, a multi-modal multi-turn chain-of-thought framework that builds either instance-specific (instantial) or general (empirical) disambiguation experience. Guided by prompts, the model alternates between reasoning and reflection over the instruction and its visual context, then produces the final answer either directly from the reasoning history (instantial experience) or after rewriting the instruction into a clearer form (empirical experience). Visual-O1 improves the understanding of ambiguous instructions on referring image segmentation (RIS) and visual question answering (VQA), also improves performance on general datasets, and generalizes across models of different intelligence levels and to tasks beyond RIS and VQA. Empirical experience comes from a one-time optimization phase for generally intelligent models, while instantial experience is built at inference time for highly intelligent models. The authors report low computational overhead and state that they will release their data and code. The work points toward AI systems that handle real-world ambiguity by reasoning over visual context much as humans do.

Abstract

As large-scale models evolve, language instructions are increasingly used in multi-modal tasks. Because of human language habits, these instructions are often ambiguous in real-world scenarios and require visual context or common sense to interpret accurately. However, even highly intelligent large models perform poorly on ambiguous instructions, where weak disambiguation reasoning can lead to catastrophic errors. To address this issue, this paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework. It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models or empirical experience for generally intelligent models to understand ambiguous instructions. Unlike traditional methods that require models to possess high intelligence to understand long texts or perform lengthy complex reasoning, our framework does not significantly increase computational overhead and is more general and effective, even for generally intelligent models. Experiments show that our method not only significantly enhances the performance of models of different intelligence levels on ambiguous instructions but also improves their performance on general datasets. Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity. We will release our data and code.
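
The two modes described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Model` interface, the bracketed prompt tags, the `turns` parameter, the plain-string `experience`, and the `stub_model` stand-in are all assumptions introduced here for illustration, and the image is assumed to be bound inside the model callable.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a callable that maps a text prompt to a text reply.
# The visual context is assumed to be attached inside the callable.
Model = Callable[[str], str]


def instantial_disambiguation(
    model: Model, instruction: str, turns: int = 3
) -> Tuple[str, List[str]]:
    """Alternate reasoning and reflection, then answer from the history."""
    history: List[str] = []
    for _ in range(turns):
        # Reasoning step: interpret the instruction given the history so far.
        context = "\n".join(history)
        reasoning = model(f"[reason]\nInstruction: {instruction}\n{context}")
        history.append(f"Reasoning: {reasoning}")
        # Reflection step: critique the reasoning that was just produced.
        context = "\n".join(history)
        reflection = model(f"[reflect]\nInstruction: {instruction}\n{context}")
        history.append(f"Reflection: {reflection}")
    # Response synthesis: answer directly from the accumulated history.
    context = "\n".join(history)
    answer = model(f"[answer]\nInstruction: {instruction}\n{context}")
    return answer, history


def empirical_disambiguation(
    model: Model, instruction: str, experience: str
) -> str:
    """Rewrite the instruction using fixed experience, then answer."""
    # `experience` is assumed to come from a separate one-time optimization.
    clear = model(f"[rewrite]\nExperience: {experience}\nInstruction: {instruction}")
    return model(f"[answer]\nInstruction: {clear}")


def stub_model(prompt: str) -> str:
    # Stand-in for a real multi-modal model: echoes the prompt's tag line.
    return "stub:" + prompt.splitlines()[0]


if __name__ == "__main__":
    answer, history = instantial_disambiguation(stub_model, "pick the left one", turns=2)
    print(answer, len(history))
```

In this sketch the instantial path costs two model calls per turn plus one for the answer, while the empirical path costs two calls per instruction once the experience string exists.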

Paper Structure (41 sections, 8 equations, 6 figures, 13 tables)

Introduction
Related Work
Language Instruction Understanding in Multi-modal Tasks
Complex Reasoning with Large Multi-modal Models
Visual-O1: Multi-modal Multi-turn Chain-of-thoughts Reasoning Framework
Overview of Proposed Framework
Reasoning and Reflection
Reasoning and Reflection for Instantial Experience
Reasoning and Reflection for Empirical Experience
Response Synthesis
Response Synthesis by Instantial Experience
Response Synthesis by Empirical Experience
Implementation Details
Experiments
Experimental Setup
...and 26 more sections

Figures (6)

Figure 1: Understanding ambiguous instructions. An AI model may fail to execute an instruction correctly when the instruction is ambiguous. Humans, however, can usually work out the intended meaning of an ambiguous instruction by combining it with visual context. Based on this, we propose Visual-O1, which simulates human multi-modal multi-turn reasoning to gain instantial (for highly intelligent models) or empirical (for generally intelligent models) experience in order to understand ambiguous instructions.
Figure 2: Overview of Visual-O1. Visual-O1 introduces multi-modal multi-turn chain-of-thoughts to understand ambiguity with (a) instantial experience, which lets highly intelligent models generate the correct answer directly, and (b) empirical experience, which lets generally intelligent models transform ambiguous instructions into clear instructions and then generate the correct answer. Instantial experience develops during inference, while empirical experience develops during a one-time optimization stage.
Figure 3: Case studies on RIS. Our approach aids the model in understanding ambiguous instructions by incorporating Visual-O1, which significantly improves the accuracy of instructions, thus enabling more effective segmentation of the target.
Figure D: Cases of Visual-O1 on VLN.
Figure E: Cases of Visual-O1 on Image Synthesis.
...and 1 more figure
