Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

Jialiang Zhang; Junlong Tong; Junyan Lin; Hao Wu; Yirong Sun; Yunpu Ma; Xiaoyu Shen

TL;DR

TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning for efficient and responsive video understanding in LVLMs.

Abstract

Large Vision-Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS.
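The abstract mentions streaming attention masks that keep reasoning aligned with the frames received so far. The sketch below is one plausible reading of that idea, not the authors' implementation. It assumes that a reasoning token tagged with time step t may attend only to frame tokens with time step at most t plus earlier reasoning tokens, and that frame tokens attend only to earlier frame tokens, so that visual encoding stays independent of the text. The (kind, time step) token layout and the name streaming_mask are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical token layout: (kind, time step), kind is "frame" or "reason".
Token = Tuple[str, int]


def streaming_mask(tokens: List[Token]) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed rules (an illustration, not the paper's exact definition):
      * frame query   -> earlier-or-same frame tokens only
      * reason query  -> frames whose time step is <= the query's time step,
                         plus earlier-or-same reasoning tokens
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, (q_kind, q_t) in enumerate(tokens):
        for j, (k_kind, k_t) in enumerate(tokens):
            if q_kind == "frame":
                # Visual stream never looks at reasoning tokens.
                mask[i][j] = k_kind == "frame" and j <= i
            elif k_kind == "frame":
                # Reasoning sees only frames that have already arrived.
                mask[i][j] = k_t <= q_t
            else:
                # Ordinary causal attention among reasoning tokens.
                mask[i][j] = j <= i
    return mask


# Three frames followed by two reasoning tokens aligned to steps 0 and 1.
tokens: List[Token] = [
    ("frame", 0), ("frame", 1), ("frame", 2),
    ("reason", 0), ("reason", 1),
]
mask = streaming_mask(tokens)

for (kind, t), row in zip(tokens, mask):
    print(f"{kind:>6}@{t}", "".join("1" if allowed else "0" for allowed in row))
```

Under a plain causal mask, the reasoning token at step 0 would see all three frames because they precede it in the sequence. Under the assumed streaming rule it sees only frame 0, which is the property the batch paradigm lacks.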

Paper Structure (45 sections, 17 equations, 12 figures, 4 tables, 2 algorithms)

Introduction
Contributions.
Related Work
Multimodal Chain-of-Thought Reasoning.
Streaming and Memory-Based Video Understanding.
Methodology
Task Definition and Preliminaries
Streaming Video CoT vs. Offline Video CoT.
Design Principles.
Streaming Video CoT Generation
Frame ID Alignment.
Structured Trajectory Construction.
Quality Control.
Naive Streaming Paradigm
Parallel Streaming Paradigm
...and 30 more sections

Figures (12)

Figure 1: Conventional LVLM reasoning adheres to the batch thinking paradigm, deferring inference until the entire input is received. This approach often leads to high latency and uneven attention allocation across inputs. In contrast, our proposed streaming thinking paradigm enables LVLMs to reason concurrently with input reception, thereby reducing latency and ensuring consistency between attention and input order.
Figure 2: Overview of the two-step process for generating Streaming Video CoT. Step 1: adjust the frame IDs while maintaining frame-caption alignment. Step 2: generate a progressive, frame-aware trajectory using the original annotations.
Figure 3: Overview of the streaming reasoning framework. (a) Parallel video reasoning KV caches enable concurrent visual encoding and reasoning generation via dynamic merge and split operations. (b) The streaming attention mask enforces causal alignment between frames and reasoning steps. (c) During inference, parallel information flow reduces attention path length and alleviates sequential blocking compared with interleaved paradigms.
Figure 4: Case study comparing TaYS with the Interleaved paradigm. TaYS produces temporally aligned reasoning, whereas the Interleaved model generates less accurate, fragmented descriptions.
Figure 5: (a) Latency comparison across paradigms. (b) Latency breakdown of TaYS. The parallel KV-cache design yields the lowest TTFT and a stable delay.
...and 7 more figures
