How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

Yingqian Cui; Zhenwei Dai; Bing He; Zhan Shi; Hui Liu; Rui Sun; Zhiji Liu; Yue Xing; Jiliang Tang; Benoit Dumoulin

TL;DR

A comprehensive analysis of latent reasoning methods reveals a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of increased shortcut behavior.

Abstract

Latent reasoning has recently been proposed as a reasoning paradigm in which multi-step reasoning is carried out by generating steps in a continuous latent space rather than in text. This enables reasoning beyond discrete language tokens. Although numerous studies have focused on improving the performance of latent reasoning, its internal mechanisms have not been fully investigated. In this work, we conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representations in the reasoning process. We identify two key issues across latent reasoning methods with different levels of supervision. First, we observe pervasive shortcut behavior, where models achieve high accuracy without relying on latent reasoning. Second, we examine the hypothesis that latent reasoning supports BFS-like (breadth-first search) exploration in latent space, and find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search; instead, it exhibits implicit pruning and compression. Finally, our findings reveal a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of increased shortcut behavior.
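To make the contrast between reasoning in text and reasoning in latent space concrete, the following is a minimal toy sketch and not the implementation of any method analyzed in the paper. It assumes the common formulation in which a latent step feeds the model's last hidden state back in as the next input embedding, whereas a text step first collapses that state to a single discrete token. The names `W_step`, `E`, `step`, `text_step`, and `latent_step` are hypothetical, and the random matrices merely stand in for a trained model.

```python
import math
import random

random.seed(0)
DIM, VOCAB = 4, 6  # toy sizes, chosen only for illustration

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W_step = rand_matrix(DIM, DIM)  # hypothetical stand-in for the model body
E = rand_matrix(VOCAB, DIM)     # hypothetical token embedding table

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def step(x):
    """One toy forward pass: input embedding -> last hidden state."""
    return [math.tanh(h) for h in matvec(W_step, x)]

def text_step(x):
    """Text-space step: collapse the hidden state to one discrete token."""
    h = step(x)
    logits = matvec(E, h)  # tied output head (an assumption of this sketch)
    token = max(range(VOCAB), key=lambda i: logits[i])
    return E[token], token  # the next input is that token's embedding

def latent_step(x):
    """Latent step: feed the continuous hidden state straight back in."""
    return step(x)

def run_text(x, n_steps):
    tokens = []
    for _ in range(n_steps):
        x, tok = text_step(x)
        tokens.append(tok)
    return x, tokens

def run_latent(x, n_steps):
    for _ in range(n_steps):
        x = latent_step(x)
    return x
```

In this sketch, `run_text` always ends on a row of `E`, so each step commits to exactly one token, while `run_latent` carries an unconstrained continuous vector from step to step. That unconstrained vector is what could, in principle, encode several candidate hypotheses at once.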

Paper Structure (26 sections, 3 equations, 3 figures, 6 tables)

Introduction
Background
Latent Reasoning Mechanism.
Weak/Strong Supervision Training Scheme.
Other Related Works
Additional implicit reasoning methods
Analytical work for latent reasoning methods
Shortcut Behavior of Latent Reasoning
Experimental Settings
Latent Reasoning Methods.
Base models and Benchmarks.
Influence of the latent length
Interventional Analysis
Investigation with Attention Score
Investigation on the BFS Mechanism
...and 11 more sections

Figures (3)

Figure 1: Performance of different methods across different numbers of latent steps
Figure 2: Tokens with Top-10 Attention Score for an Example in ProsQA Dataset
Figure 3: Tokens with Top-10 Attention Score for an Example in GSM8K Dataset
