Applying General Turn-taking Models to Conversational Human-Robot Interaction

Gabriel Skantze, Bahar Irfan

TL;DR

This paper addresses the limitations of silence-based turn-taking in human-robot interaction by applying general, self-supervised turn-taking models trained on large human-human dialogue datasets. It combines TurnGPT (verbal-domain predictions) and VAP (acoustic-domain predictions) in a self-monitoring HRI architecture, enabling continuous, real-time predictions and preparation of responses. In a within-subject study with 39 participants using the Furhat robot, the proposed system substantially reduced response delays and interruptions and was preferred by users over a traditional baseline. The work demonstrates the viability of general turn-taking models for HRI, suggesting future work on incorporating additional cues (e.g., gaze), multi-party scenarios, and faster response pipelines to further enhance naturalistic interaction.

Abstract

Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.
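
As a rough illustration of the contrast drawn above, the sketch below compares a silence-threshold rule with a rule that acts on continuous predictions. It is not the authors' implementation. The names `Cues`, `baseline_policy`, and `predictive_policy`, all threshold values, and the choice of which score triggers which action are assumptions made for illustration only; the two probability fields merely stand in for TurnGPT-style and VAP-style outputs.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    PREPARE = "prepare"      # start generating a response in the background
    TAKE_TURN = "take_turn"  # start speaking


@dataclass
class Cues:
    silence_ms: float          # time since the user's last detected speech
    p_turn_end_text: float     # hypothetical TurnGPT-style score from the ASR text
    p_robot_next_audio: float  # hypothetical VAP-style score from the audio


def baseline_policy(cues: Cues, silence_threshold_ms: float = 1000.0) -> Action:
    """Silence-only rule: speak once the pause exceeds a fixed threshold."""
    if cues.silence_ms >= silence_threshold_ms:
        return Action.TAKE_TURN
    return Action.WAIT


def predictive_policy(cues: Cues, prepare_at: float = 0.3, take_at: float = 0.6) -> Action:
    """Illustrative rule (assumed thresholds): prepare on verbal evidence,
    take the turn on acoustic evidence once the user has paused."""
    if cues.silence_ms > 0 and cues.p_robot_next_audio >= take_at:
        return Action.TAKE_TURN
    if cues.p_turn_end_text >= prepare_at:
        return Action.PREPARE
    return Action.WAIT


# A short pause with strong end-of-turn evidence from both scores.
short_pause = Cues(silence_ms=200.0, p_turn_end_text=0.7, p_robot_next_audio=0.8)
print(baseline_policy(short_pause).value, predictive_policy(short_pause).value)
```

With these assumed values, the silence-only rule keeps waiting during a 200 ms pause, while the predictive rule can already take the turn, or at least begin preparing a response while the user is still speaking.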

Paper Structure

This paper contains 22 sections, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Example turn shift from user to robot when using the proposed system. From top to bottom: (1) The user's actual speech pattern (in orange), highlighted where the VAP model detects voice activity; (2) ASR transcription; (3) TurnGPT's likelihood for the turn to end; (4) Response generation (LLM+TTS); (5) the robot's actual speech pattern (in blue); (6) robot gaze (towards user or averted); (7) VAP model predictions.
  • Figure 2: Example turn shift from user to robot with the baseline.
  • Figure 3: System architecture. New components in proposed system shown in green.
  • Figure 4: Example of a user interruption and backchannel.
  • Figure 5: The setting for the evaluation, showing the red LED lights used in the baseline condition to indicate that the robot is not listening.
  • ...and 2 more figures