Conversational Behavior Modeling Foundation Model With Multi-Level Perception

Dingkun Zhou, Shuchang Pan, Jiachen Lian, Siddharth Banerjee, Sarika Pasumarthy, Dhruv Hebbar, Siddhant Patel, Zeyi Austin Li, Kan Jen Cheng, Sanay Bordia, Krish Patel, Akshaj Gupta, Tingle Li, Gopala Anumanchipalli

TL;DR

This work tackles natural, full-duplex spoken dialogue by formalizing a perception-reasoning-generation loop. It introduces hierarchical speech-act perception and a Graph-of-Thoughts (GoT) reasoning framework to enable real-time, interpretable decisions with auditable rationales. The authors build ConversationGoT-120h, a large-scale dataset of synthetic dialogues annotated with two-level speech acts and rationales, to train and evaluate the system. The GoT model demonstrates robust behavior detection and interpretable reasoning, with strong transfer to real-world data. The approach makes low-latency decisions once per second, suitable for streaming applications, and its explanations support transparency and benchmarking of conversational reasoning in duplex systems.

Abstract

Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high-quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full-duplex spoken dialogue systems.

Paper Structure

This paper contains 52 sections, 17 equations, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: Comparison of dialogue paradigms. (Traditional) Traditional duplex systems frame conversation as a direct sequence prediction task. (Ours) We propose a framework based on next-behavior perception and reasoning: the agent first perceives the speaker’s behaviors at multiple levels, then reasons via a Graph-of-Thoughts, and finally generates a response.
  • Figure 2: Causal streaming pipeline for conversational behavior modeling. At each 1 s tick, the model causally predicts hierarchical speech acts and generates an evidence-grounded rationale using a sliding-window Graph-of-Thoughts (GoT).
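
The Figure 2 caption describes a causal, per-tick loop over a sliding-window graph. The sketch below is only an illustration of that general idea, not the authors' implementation: the class names (TickNode, SlidingWindowGraph), the "temporal" edge type, the window size, and the intent and speech-act label strings are all hypothetical, and the per-tick predictions are passed in by hand rather than produced by a model.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TickNode:
    """One per-tick prediction, stored as a node in the evolving graph."""
    tick: int            # time index in seconds (one node per 1 s tick)
    intent: str          # high-level communicative intent (hypothetical label)
    speech_act: str      # low-level speech act (hypothetical label)
    rationale: str = ""  # short justification attached to this decision


class SlidingWindowGraph:
    """Keeps only the most recent `window` ticks and the edges between them."""

    def __init__(self, window: int = 5):
        self.nodes = deque(maxlen=window)  # oldest node is evicted automatically
        self.edges = []                    # (src_tick, dst_tick, relation)

    def step(self, tick, intent, speech_act, rationale=""):
        """Causal update: uses only the current tick and earlier nodes."""
        node = TickNode(tick, intent, speech_act, rationale)
        if self.nodes:
            # Link the previous tick to the new one (assumed edge type).
            self.edges.append((self.nodes[-1].tick, tick, "temporal"))
        self.nodes.append(node)
        # Drop edges whose source node has slid out of the window.
        oldest = self.nodes[0].tick
        self.edges = [e for e in self.edges if e[0] >= oldest]
        return node

    def context(self):
        """Window contents that a downstream predictor could condition on."""
        return [(n.tick, n.intent, n.speech_act) for n in self.nodes]


# Hand-written stand-ins for model outputs at ticks 0..3.
stream = [
    ("inform", "statement"),
    ("inform", "statement"),
    ("yield_turn", "pause"),
    ("take_turn", "interruption"),
]

graph = SlidingWindowGraph(window=3)
for t, (intent, act) in enumerate(stream):
    graph.step(t, intent, act, rationale=f"observed {act} at t={t}s")

print(graph.context())
print(graph.edges)
```

A bounded deque keeps the memory and per-tick cost constant regardless of dialogue length, which is one simple way to make a streaming loop of this kind run at a fixed 1 s cadence.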