TiCo: Time-Controllable Training for Spoken Dialogue Models

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass

Abstract

We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
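
To make the marker idea concrete, the following is a minimal sketch of how Spoken Time Markers could be read out of a generated response. Only the marker syntax is taken from the abstract's example (<10.6 seconds>). The assumption that markers are interleaved with the response text, the regular expression, and the helper names (extract_markers, strip_markers, remaining_budget) are all hypothetical and are not described in this excerpt.

```python
import re
from typing import List

# Assumed marker syntax, modeled on the abstract's example "<10.6 seconds>".
STM_PATTERN = re.compile(r"<(\d+(?:\.\d+)?) seconds>")


def extract_markers(response: str) -> List[float]:
    """Return the elapsed-time estimates (seconds) in order of appearance."""
    return [float(value) for value in STM_PATTERN.findall(response)]


def strip_markers(response: str) -> str:
    """Remove the markers and collapse the leftover whitespace."""
    return " ".join(STM_PATTERN.sub(" ", response).split())


def remaining_budget(response: str, target_seconds: float) -> float:
    """Seconds left before the target, judged by the most recent marker."""
    markers = extract_markers(response)
    elapsed = markers[-1] if markers else 0.0
    return target_seconds - elapsed


# Hypothetical response text with two interleaved markers.
example = (
    "Sure, here is a quick overview. <4.2 seconds> "
    "The main idea is simple. <10.6 seconds>"
)
print(extract_markers(example))
print(strip_markers(example))
print(round(remaining_budget(example, 15.0), 1))
```

Under these assumptions, the last marker acts as the running estimate of elapsed speaking time, and the difference from the instructed duration is the budget left for the remaining content.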

Paper Structure

This paper contains 27 sections, 17 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Overview of TiCo, a two-stage framework for time-controllable speech generation. Stage 1 (top): The model leverages self-generation to produce responses annotated with Spoken Time Markers (STMs), which serve as supervision for learning time awareness, i.e., associating intermediate generation states with temporal progress and estimating elapsed speaking time. Stage 2 (bottom): The model is further optimized via RLVR, where rewards are derived from STMs, enabling the model to regulate response duration in real time.
  • Figure 2: Overview of TiCo-Bench construction. Base queries are collected from four distinct text and speech datasets (720 queries in total). Explicit time-control instructions are then inserted into these queries. Applying both a short-duration setting (10–30 seconds) and a long-duration setting (30–60 seconds) to each query doubles the initial dataset, yielding a final benchmark of 1440 evaluation samples.
  • Figure 3: Distribution of Spoken Time Markers in the first-stage training data.
  • Figure 4: Duration MAE and MAPE of Qwen2-Omni-7B and TiCo across instructed-duration bins. TiCo maintains consistently lower error across all duration ranges.
  • Figure 5: Duration error of TiCo across instructed-duration bins, comparing two reference signals: the instructed duration $t_{\mathrm{inst}}$ and the final Spoken Time Marker $t_{\mathrm{last}}$. The close alignment indicates that the final time marker accurately estimates realized speech duration.
  • ...and 3 more figures
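
The duration metrics named in Figure 4 and the benchmark size given in Figure 2 can be illustrated with a short sketch. The paper's exact metric definitions are not included in this excerpt, so the code assumes the standard definitions of mean absolute error (MAE) and mean absolute percentage error (MAPE) between instructed and realized durations. The function names and sample durations are made up for illustration.

```python
from typing import Sequence

# Benchmark size from the Figure 2 caption: each base query gets two settings.
BASE_QUERIES = 720
DURATION_SETTINGS = 2  # short (10-30 s) and long (30-60 s)
BENCHMARK_SIZE = BASE_QUERIES * DURATION_SETTINGS


def _check(instructed: Sequence[float], realized: Sequence[float]) -> None:
    if len(instructed) == 0 or len(instructed) != len(realized):
        raise ValueError("inputs must be non-empty and of equal length")


def duration_mae(instructed: Sequence[float], realized: Sequence[float]) -> float:
    """Mean absolute error in seconds (standard definition, assumed here)."""
    _check(instructed, realized)
    errors = [abs(r - t) for t, r in zip(instructed, realized)]
    return sum(errors) / len(errors)


def duration_mape(instructed: Sequence[float], realized: Sequence[float]) -> float:
    """Mean absolute percentage error relative to the instructed duration."""
    _check(instructed, realized)
    ratios = [abs(r - t) / t for t, r in zip(instructed, realized)]
    return 100.0 * sum(ratios) / len(ratios)


# Illustrative values only, not results from the paper.
instructed = [10.0, 20.0, 40.0]
realized = [12.0, 19.0, 50.0]
print(BENCHMARK_SIZE)
print(duration_mae(instructed, realized))
print(duration_mape(instructed, realized))
```

MAPE normalizes each error by the instructed duration, which makes errors comparable across the short and long settings. MAE reports the raw deviation in seconds.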