Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass

TL;DR

The paper tackles the challenge of temporal dynamics in conversational Spoken Language Models by introducing the Game-Time Benchmark, which tests timing, tempo, and simultaneous speaking. Tasks are formalized as Instruction-Following problems with a base task $t$ and a constraint set $\mathcal{C}$, and the dataset contains 1,475 test instances that cover Basic and Advanced scenarios. Evaluation uses a dual-channel setup with an LLM-as-a-judge and human validation to assess instruction-following under temporal constraints. Results show that while basic instruction-following is achievable for some models, temporal constraints cause substantial degradation, revealing a critical gap in time-awareness and real-time coordination. The benchmark provides a scalable framework to drive the development of temporally-aware conversational AI.
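As a rough illustration of the formalization above, a test instance can be pictured as a base task $t$ paired with a constraint set $\mathcal{C}$, where a Basic instance has an empty $\mathcal{C}$ and an Advanced instance adds temporal constraints. The sketch below is not the benchmark's code: the names `Response`, `TaskInstance`, and `max_duration` are hypothetical, and a simple timestamp check stands in for the paper's LLM-as-a-judge evaluation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Response:
    """Hypothetical record of one model turn with timestamps in seconds."""
    text: str
    start_s: float
    end_s: float


# A constraint is any predicate over a response (assumed representation).
Constraint = Callable[[Response], bool]


@dataclass
class TaskInstance:
    """A base task t plus a constraint set C (empty for Basic tasks)."""
    base_task: str
    constraints: List[Constraint] = field(default_factory=list)

    def constraints_satisfied(self, response: Response) -> bool:
        # Checks only the constraint set C; whether the base task itself
        # was completed is left to a separate judge and is not modeled here.
        return all(check(response) for check in self.constraints)


def max_duration(seconds: float) -> Constraint:
    """Illustrative temporal constraint: finish speaking within `seconds`."""
    return lambda r: (r.end_s - r.start_s) <= seconds


basic = TaskInstance("Count from one to five.")
advanced = TaskInstance("Count from one to five.", [max_duration(10.0)])

# A response lasting 12.5 seconds.
resp = Response("one two three four five", start_s=1.0, end_s=13.5)

basic_ok = basic.constraints_satisfied(resp)
advanced_ok = advanced.constraints_satisfied(resp)
```

Here the same response passes the Basic instance, which has no constraints, but fails the Advanced instance because it runs longer than the assumed 10-second limit.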

Abstract

Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity for temporal dynamics, including the ability to manage timing, tempo, and simultaneous speaking, remains a critical and unevaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework to systematically assess these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction-following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally aware conversational AI. Demos and datasets are available on our project website: https://ga642381.github.io/Game-Time.

Paper Structure

This paper contains 15 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the Game-Time Benchmark, evaluating temporal dynamics in conversational Spoken Language Models (SLMs).
  • Figure 2: Dual-channel Evaluation with LLM-as-a-judge.
  • Figure 3: Game-Time benchmark scores evaluated with LLM-as-a-judge. Top: results on Basic Tasks. Bottom: results on Advanced Tasks.
  • Figure 4: Human evaluation on Game-Time Advanced Tasks.