Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

Rujiao Long, Yang Li, Xingyao Zhang, Weixun Wang, Tianqianjin Lin, Xi Zhao, Yuchi Xu, Wenbo Su, Junchi Yan, Bo Zheng

TL;DR

This work tackles limited exploration diversity in RL for large (vision-)language models by injecting structured latent context that steers internal planning before generation. It introduces Reasoning Palette, a latent-modulation framework that learns a VAE over mean-pooled QA embeddings to produce prefix tokens that condition generation; a brief SFT warm-up aligns the base model to latent conditioning, and RL uses latent sampling to enable diverse reasoning strategies, improving exploration efficiency. Across math benchmarks and vision-language tasks, the method yields interpretable, domain-aware control over reasoning and consistently outperforms standard RL baselines, demonstrating the practicality and impact of structured latent modulation on complex multimodal reasoning.

Abstract

Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-)language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable steering of the (vision-)language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
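The latent-conditioning pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoder and decoder are randomly initialised linear maps standing in for the trained VAE networks (written $E_\phi$ and $D_\psi$ in the Figure 3 caption), the sizes `D_MODEL`, `D_LATENT`, and `N_PREFIX` are hypothetical, and sampling the inference-time latent from a standard normal prior is an assumption rather than something the abstract states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
D_MODEL, D_LATENT, N_PREFIX = 16, 4, 2

# Random linear maps stand in for the trained encoder and decoder (assumption).
W_mu = rng.normal(scale=0.1, size=(D_MODEL, D_LATENT))
W_logvar = rng.normal(scale=0.1, size=(D_MODEL, D_LATENT))
W_dec = rng.normal(scale=0.1, size=(D_LATENT, N_PREFIX * D_MODEL))


def encode(qa_token_embeddings):
    """Mean-pool the token embeddings of a QA pair, then map to (mu, logvar)."""
    pooled = qa_token_embeddings.mean(axis=0)
    return pooled @ W_mu, pooled @ W_logvar


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decode_prefix(z):
    """Decode a latent vector into N_PREFIX prefix embeddings."""
    return (z @ W_dec).reshape(N_PREFIX, D_MODEL)


def condition_prompt(prompt_embeddings, z):
    """Prepend the decoded prefix embeddings to the prompt embeddings."""
    return np.concatenate([decode_prefix(z), prompt_embeddings], axis=0)


# VAE view: infer a latent from the embeddings of a (question; answer) pair.
qa = rng.normal(size=(10, D_MODEL))
mu, logvar = encode(qa)
z_posterior = reparameterize(mu, logvar, rng)

# Inference view (assumed): draw the latent from a standard normal prior.
prompt = rng.normal(size=(6, D_MODEL))
z_prior = rng.standard_normal(D_LATENT)
conditioned = condition_prompt(prompt, z_prior)
print(conditioned.shape)  # (8, 16)
```

Drawing several values of `z_prior` for the same prompt gives several differently conditioned inputs, which is the sense in which sampling happens over reasoning contexts before any output token is generated.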

Paper Structure

This paper contains 11 sections, 14 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Motivation: Injecting a Gaussian noise token embedding before the prompt embeddings of Qwen-4B-Base yields substantial gains in pass@k accuracy merely by sampling different noise vectors from the Gaussian, even though each candidate is decoded greedily.
  • Figure 2: Overview of the Reasoning Palette framework: a latent-modulation system that enables strategic, diverse reasoning in LLMs/VLMs by sampling and decoding contextual latent variables to guide internal planning before token generation.
  • Figure 3: Visualization of the learned latent space and generated prefix embeddings via PCA and t-SNE. Left two panels: projections of decoded prefix embeddings $D_\psi(\mathbf{z})$ (colored by domains). Right two panels: projections of the corresponding latent vectors $\mathbf{z} = E_\phi(\text{mean-pool}(\mathbf{q};\mathbf{o}))$. Clear clustering by reasoning domain in both spaces confirms that the VAE disentangles high-level reasoning strategies into distinct regions of the latent space.
  • Figure 4: Pass@32 curves on the RefCOCO datasets.
  • Figure 5: Qualitative results on the RefCOCO dataset. From left to right: input image with the ground-truth bounding box, prediction from Qwen2.5VL-3B (greedy decoding), and prediction from our method, Qwen2.5VL-3B (greedy decoding) with a randomly sampled latent. The referring expressions for the top and bottom rows are "train closest to the bottom" and "a zebra standing behind two other zebras, with only its mane and rear showing", respectively.
  • ...and 1 more figure