Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

Sarath Shekkizhar, Romain Cosentino, Adam Earle

Abstract

Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context consisting of a user query and an assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across $11$ open-weight LLMs (Qwen3.5, gpt-oss, GLM) and $5$ datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from $41\%$ ($0.8$B) to $96.8\%$ ($397$B-A$17$B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher-temperature sampling reveals that interaction awareness is latent, with follow-up rates reaching $22\%$. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B yields an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible to current assistant-only benchmarks.

Paper Structure

This paper contains 44 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: User-turn generation as a probe of interaction awareness. Left: A conversation context consisting of system, user, and assistant turns delimited by special tokens (e.g., <|im_start|>, <|im_end|>). Standard evaluation scores the assistant response for task accuracy. Our probe appends a <|im_start|>user header and lets model $M_\theta$ generate under the user role. A genuine follow-up indicates the model's weights encode conversational awareness. Right: Genuine follow-up rate (%) vs. sampling temperature for six models across three datasets. At greedy decoding ($T{=}0$), most models produce near-zero follow-ups. Increasing temperature surfaces latent interaction awareness in Qwen3.5 and GLM (e.g., Qwen3.5-27B rises from $0\%$ to $22\%$ on GSM8K), while gpt-oss remains near zero. Qwen3.5-9B starts with a higher baseline follow-up rate on GPQA Diamond ($13.1\%$) and rises further with temperature.
  • Figure 2: Controlled perturbation examples. Left: Truncation removes the end of the assistant response, prompting the model to produce a reaction to complete the response. Right: Appending a generic question elicits a grounded critique rather than a prompt restatement. Assistant and user turns in the examples are generated by GLM-4.7. These controls demonstrate that interaction awareness can surface in specific contexts.
  • Figure 3: Cross-family dissociation: task accuracy vs. follow-up rate. Top row: task accuracy (%). Bottom row: follow-up rate (%). Five representative models on GSM8K, GPQA Diamond, and IFBench. Task accuracy does not predict follow-up quality: gpt-oss models produce the highest follow-up rate on GPQA despite lower accuracy than Qwen3.5 or GLM-4.7. Full results including IFEval and GPQA Main in Table \ref{tab:cross-family-full}.
  • Figure 4: Qwen3.5 family: interaction awareness across temperature. Genuine follow-up rate (%) vs. sampling temperature for all eight Qwen3.5 models on three datasets. At $T{=}0$ (greedy), follow-up is near zero for most models despite high task accuracy (Table \ref{tab:qwen-scaling-full}). Higher temperatures surface latent interaction awareness, but this does not scale monotonically with model size: Qwen3.5-9B and Qwen3.5-27B often match or exceed larger models. MoE variants (35B-A3B, 122B-A10B) consistently lag their dense counterparts.
  • Figure 5: Same question, same correct answer, different interaction awareness. Both models correctly answer a GPQA chemistry question (Answer: D). Left: Qwen3.5-9B generates a user turn that critically engages with the assistant's reasoning about the Corey-Chaykovsky reagent and reaction conditions. Right: Qwen3.5-27B, a $3\times$ larger model, restates the original prompt verbatim, reflecting an inability to generate a realistic user turn after an assistant response.
  • ...and 1 more figure
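
The probe described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a ChatML-style template in which each turn is written as `<|im_start|>role`, a newline, the content, and `<|im_end|>`; only the two special tokens and the appended `<|im_start|>user` header come from the caption. The function name `build_user_turn_probe`, the exact newline placement, and the example conversation are hypothetical, and the authors' actual prompt construction may differ.

```python
from typing import Dict, List

# Special tokens named in the Figure 1 caption (ChatML-style delimiters).
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_user_turn_probe(messages: List[Dict[str, str]]) -> str:
    """Serialize a conversation and leave a user header open at the end.

    Hypothetical helper: the newline layout is an assumed ChatML convention,
    not something the source specifies.
    """
    parts = []
    for message in messages:
        # Each completed turn is closed with the end-of-turn token.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    # Standard evaluation would open an assistant header here; the probe
    # opens a user header instead, so the model continues as the user.
    parts.append(f"{IM_START}user\n")
    return "".join(parts)


# Illustrative context: one user query and one assistant response.
context = [
    {"role": "user", "content": "What is 12 * 7?"},
    {"role": "assistant", "content": "12 * 7 = 84."},
]
probe_prompt = build_user_turn_probe(context)
print(probe_prompt)
```

The resulting string would then be passed to a raw completion interface, with generation stopped at the next `<|im_end|>`. Chat-template helpers usually append an assistant header by default, so the user header generally has to be added by hand as shown.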