RESPOND: Responsive Engagement Strategy for Predictive Orchestration and Dialogue

Meng-Chen Lee, Costas Panay, Javier Hernandez, Sean Andrist, Dan Bohus, Anatoly Churikov, Andrew D. Wilson

Abstract

The majority of voice-based conversational agents still rely on pause-and-respond turn-taking, leaving interactions sounding stiff and robotic. We present RESPOND (Responsive Engagement Strategy for Predictive Orchestration and Dialogue), a framework that brings two staples of human conversation to agents: timely backchannels ("mm-hmm," "right") and proactive turn claims that can contribute relevant content before the speaker yields the conversational floor. Built on streaming ASR (Automatic Speech Recognition) and incremental semantics, RESPOND continuously predicts both when and how to interject, enabling fluid, listener-aware dialogue. A defining feature is its designer-facing controllability: two orthogonal dials, Backchannel Intensity (frequency of acknowledgments) and Turn Claim Aggressiveness (depth and assertiveness of early contributions), can be tuned to match the etiquette of contexts ranging from rapid ideation to reflective counseling. By coupling predictive orchestration with explicit control, RESPOND offers a practical path toward conversational agents that adapt their conversational footprint to social expectations, advancing the design of more natural and engaging voice interfaces.

Paper Structure (25 sections, 3 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Illustration of the three annotation labels (turn claim, backchannel, and stay silent) and how they appear in different conversational contexts. Left: listener actions categorized as turn claims (interruption, overlap, turn-taking). Right: corresponding backchannel examples occurring in similar contexts. Gray-shaded regions represent stay-silent intervals where the listener produces no response. Color bars indicate who is speaking or backchanneling.
  • Figure 2: Label distributions before and after downsampling.
  • Figure 3: Mapping control parameters to a uniform scale. Here, $c_{\text{bc}}$ denotes backchannel intensity and $c_{\text{tc}}$ denotes turn claim aggressiveness. The left column shows the raw distributions of the calculated ratios, the middle column shows the distributions after quantile-based transformation, and the right column visualizes the scaled values as both scatter points and a heatmap.
  • Figure 4: System pipeline of our conversational agent module.
  • Figure 5: Confusion matrix on the CANDOR test set.
  • ...and 3 more figures
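
The quantile-based transformation mentioned in the Figure 3 caption can be illustrated with a short sketch. This is only one plausible reading, assuming an empirical-CDF (rank-based) mapping; the paper's exact procedure is not given in this overview. The helper name `to_uniform_scale` and the sample ratios are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def to_uniform_scale(ratios):
    """Map raw ratios onto (0, 1] using their empirical CDF.

    Hypothetical helper: each value is replaced by the fraction of
    samples that are less than or equal to it, so a skewed raw
    distribution is spread roughly evenly over the unit interval.
    """
    ordered = sorted(ratios)
    n = len(ordered)
    return [bisect_right(ordered, r) / n for r in ratios]


# Made-up, heavily skewed raw backchannel ratios (illustration only).
raw_c_bc = [0.01, 0.02, 0.02, 0.05, 0.40]
scaled_c_bc = to_uniform_scale(raw_c_bc)
print(scaled_c_bc)
```

A rank-based mapping like this preserves the ordering of the raw ratios while flattening their distribution, which is what makes a control value such as $c_{\text{bc}}$ or $c_{\text{tc}}$ behave like an evenly spaced dial rather than one where most settings cluster near zero.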