Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective

Liyuan Mao; Le Yu; Jing Zhou; Chujie Zheng; Bowen Yu; Chang Gao; Shixuan Liu; An Yang; Weinan Zhang; JunYang Lin

TL;DR

This work proposes Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize the chameleon-like behavioral plasticity of LLMs, transforming transient inference-time adaptations into stable and learnable behavioral patterns.

Abstract

In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity, akin to chameleons adapting their coloration to environmental cues, that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Based on this insight, we propose Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable and learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation and keeps enhancing exploitation, enabling the emergence of appropriate behaviors. Extensive experiments show that ToCoRL enables precise behavioral control without capability degradation. Notably, we show that large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, a capability previously hindered by their step-by-step reasoning patterns.

Paper Structure (30 sections, 3 theorems, 31 equations, 6 figures, 7 tables, 1 algorithm)

Introduction
Preliminaries
Exposing Behavioral Plasticity via Token-Conditional Generation
ToCoRL: Internalize Behavior Adaptation with Guided Exploration
Incorporate Guidance from Token-Conditional Generation via KL-Divergence
Practical Implementation
Algorithmic Instantiation: Evolving Reasoning Behavior for Factual Problem Solving
Experiments
Effective Performance Improvement and Emergent Reasoning Behavior (RQ1)
Underlying Mechanisms of Performance and Behavioral Emergence in ToCoRL (RQ2)
Robustness of ToCoRL (RQ3)
Transferring Emergent Behavior for Effective LRM Training (RQ4)
Related Works
Conclusion & Discussion
Performance Difference Between Thinking and Instruct Models (Qwen3 Open-sourced Series)
...and 15 more sections

Key Result

Theorem 4.1

Given that for every state $s$, there exists an action $a$ such that $A^{\pi_{\textnormal{TC}}}(s, a)\neq 0$, the policy $\tilde{\pi}_{\textnormal{TC}}(a|s)$ is well-defined.

Figures (6)

Figure 1: Like chameleons changing their coloration in response to external stimuli (top), language models can adapt their behavior according to the token prefix (bottom). $\oplus$ stands for concatenation.
Figure 2: Top: Conditioning response generation on a direct-answer prefix enables LRM to switch from step-by-step reasoning to direct knowledge retrieval, revealing a new perspective on language model behavioral plasticity. Middle: ToCoRL leverages token-conditional generation to guide exploration during RL training, inducing the emergence of new factual answering behaviors. Bottom: After an initial direct answer, the ToCoRL-trained model adopts an emergent recalibrative reasoning behavior for factual problems (see the section "Underlying Mechanisms of Performance and Behavioral Emergence in ToCoRL (RQ2)"), while its complex math problem–solving behavior remains unchanged. We provide concrete examples for further illustration in Figure 4.
Figure 3: For clearer visualization, we plot the curves of these metrics over the first training epoch. Compared to ToCoRL, GRPO extends reasoning length while keeping the original reasoning style. Since unnecessary associations and unverified information persist, longer reasoning yields minor improvements, and response lengths eventually return to their initial levels. Adaptive-Thinking penalizes all reasoning, so evaluation scores rise quickly at first as ineffective context is removed. However, without additional reasoning capacity, the LRM’s potential is underutilized, limiting further gains. Instruct-tuning RL initially shortens reasoning, but the prompt instruction gradually loses control over the behavior.
Figure 4: Detailed demonstration of the behavior change brought by token-conditional generation and ToCoRL. Concrete queries and responses are provided.
Figure 5: Since most experiments are conducted with large reasoning models, we only illustrate the procedure in this setting. Moreover, we only show the case where the token prefix length is $k=3$ as an example. That is, besides the <think> token, two initial tokens from the direct answer (in orange) are used to construct the token prefix.
...and 1 more figures
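
The prefix construction described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `build_token_prefix` and `condition_on_prefix`, the use of plain strings as tokens, and the example query and answer are all hypothetical. It assumes the prefix starts with the `<think>` token and is followed by the first $k-1$ tokens of a direct answer, then concatenated ($\oplus$) to the query.

```python
from typing import List, Sequence


def build_token_prefix(
    direct_answer_tokens: Sequence[str],
    k: int,
    think_token: str = "<think>",
) -> List[str]:
    """Build a length-k token prefix (hypothetical helper).

    Assumption: the prefix is the <think> token followed by the first
    k - 1 tokens of a direct-answer response, as in the k = 3 case
    described in the Figure 5 caption.
    """
    if k < 1:
        raise ValueError("k must be at least 1 (the <think> token itself)")
    return [think_token] + list(direct_answer_tokens[: k - 1])


def condition_on_prefix(
    query_tokens: Sequence[str],
    prefix_tokens: Sequence[str],
) -> List[str]:
    """Concatenate query and prefix; the model would continue from here."""
    return list(query_tokens) + list(prefix_tokens)


# Illustrative tokens only; a real setup would use the model's tokenizer.
query = ["What", "is", "the", "capital", "of", "France", "?"]
direct_answer = ["The", "answer", "is", "Paris"]

prefix = build_token_prefix(direct_answer, k=3)
conditioned_input = condition_on_prefix(query, prefix)
print(prefix)  # ['<think>', 'The', 'answer']
```

In this sketch, the model would then generate the rest of the response from `conditioned_input`, so the borrowed tokens only steer the opening of the response rather than fixing its content.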

Theorems & Definitions (7)

Theorem 4.1
Theorem 4.2
Definition 4.3
Theorem 4.4
proof
proof
proof
