Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

Rujiao Long; Yang Li; Xingyao Zhang; Weixun Wang; Tianqianjin Lin; Xi Zhao; Yuchi Xu; Wenbo Su; Junchi Yan; Bo Zheng

TL;DR

This work tackles limited exploration diversity in RL for large (vision-)language models by injecting structured latent context that steers internal planning before generation. It introduces Reasoning Palette, a latent-modulation framework that learns a VAE over mean-pooled QA embeddings to produce prefix tokens that condition generation; a brief SFT warm-up aligns the base model to latent conditioning, and RL uses latent sampling to enable diverse reasoning strategies, improving exploration efficiency. Across math benchmarks and vision-language tasks, the method yields interpretable, domain-aware control over reasoning and consistently outperforms standard RL baselines, demonstrating the practicality and impact of structured latent modulation on complex multimodal reasoning.

Abstract

Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-)language models, yet stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable modulation of the (vision-)language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
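
The inference-time path described in the abstract (sample a latent, decode it into prefix embeddings, prepend them to the prompt) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the standard-normal prior, the single linear decoder, the toy dimensions, and every function name (`mean_pool`, `sample_latent`, `decode_prefix`, `prepend_prefix`) are hypothetical choices made here for clarity. Mean pooling is included only to show the kind of summary vector the VAE encoder is said to consume.

```python
import random


def mean_pool(token_embeddings):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def sample_latent(latent_dim, rng):
    """Draw z ~ N(0, I). A standard-normal prior is an assumption here."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


def decode_prefix(z, weights, num_prefix, embed_dim):
    """Hypothetical linear decoder: map z to num_prefix vectors of size embed_dim."""
    flat = [sum(w * zi for w, zi in zip(row, z)) for row in weights]
    return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(num_prefix)]


def prepend_prefix(prefix, prompt_embeddings):
    """Condition the prompt by placing the decoded prefix vectors in front of it."""
    return prefix + prompt_embeddings


# Toy sizes, chosen only for illustration.
EMBED_DIM, LATENT_DIM, NUM_PREFIX, PROMPT_LEN = 4, 2, 3, 5
rng = random.Random(0)

# Stand-ins for learned decoder weights and for the prompt's token embeddings.
weights = [[rng.uniform(-1, 1) for _ in range(LATENT_DIM)]
           for _ in range(NUM_PREFIX * EMBED_DIM)]
prompt = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
          for _ in range(PROMPT_LEN)]

pooled = mean_pool(prompt)            # summary vector of the kind an encoder would take
z = sample_latent(LATENT_DIM, rng)    # one sampled "reasoning context"
prefix = decode_prefix(z, weights, NUM_PREFIX, EMBED_DIM)
conditioned = prepend_prefix(prefix, prompt)
```

Drawing a different `z` yields a different prefix for the same prompt, which is the mechanism the abstract credits for diversity at the level of reasoning strategy rather than individual tokens.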
