TiCo: Time-Controllable Training for Spoken Dialogue Models

Kai-Wei Chang; Wei-Chih Chen; En-Pei Hu; Hung-yi Lee; James Glass

Abstract

We propose TiCo, a simple post-training method that enables spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses of controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STMs) (e.g., <10.6 seconds>). These markers help the model stay aware of elapsed time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
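
As a rough illustration of how a marker such as <10.6 seconds> could be read back from a generated transcript, the sketch below extracts markers with a regular expression and reports the time left before a target duration. The marker syntax follows the example in the abstract; the regular expression, the function names `parse_stms` and `remaining_budget`, and the sample transcript are assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import List

# Assumed marker syntax, modeled on the abstract's example "<10.6 seconds>".
STM_PATTERN = re.compile(r"<(\d+(?:\.\d+)?) seconds>")


def parse_stms(transcript: str) -> List[float]:
    """Return the elapsed-time value of every marker, in order of appearance."""
    return [float(value) for value in STM_PATTERN.findall(transcript)]


def remaining_budget(transcript: str, target_seconds: float) -> float:
    """Target duration minus the most recent marker.

    If no marker has been emitted yet, the whole target is still available.
    """
    markers = parse_stms(transcript)
    elapsed = markers[-1] if markers else 0.0
    return target_seconds - elapsed


# Hypothetical transcript, invented for this example.
sample = (
    "Sure, here is a summary. <4.2 seconds> "
    "The main idea is simple. <10.6 seconds>"
)
print(parse_stms(sample))
print(remaining_budget(sample, 15.0))
```

A positive remaining budget means there is still time to fill before the target; a negative one means the response has already overrun it.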

Paper Structure (27 sections, 17 equations, 8 figures, 5 tables)

Introduction
Related Works
Spoken Dialogue Models
Length-Control Large Language Models
TiCo
TiCo Stage 1: Time-Awareness Training
TiCo Stage 2: Time-Controllable Training
Experiments
TiCo-Bench
Experimental Setup
Results
TiCo-Bench
Generalization to Longer Responses and Text Queries
Spoken Time Token Prediction Analysis
Qualitative Results
...and 12 more sections

Figures (8)

Figure 1: Overview of TiCo, a two-stage framework for time-controllable speech generation. Stage 1 (top): The model leverages self-generation to produce responses annotated with Spoken Time Markers (STMs), which serve as supervision for learning time awareness, i.e., associating intermediate generation states with temporal progress and estimating elapsed speaking time. Stage 2 (bottom): The model is further optimized via RLVR, where rewards are derived from STMs, enabling the model to regulate response duration in real time.
Figure 2: Overview of TiCo-Bench construction. Base queries are collected from four distinct text and speech datasets (720 queries in total). Explicit time-control instructions are then inserted into these queries. Each query is paired with both a short-duration setting (10–30 seconds) and a long-duration setting (30–60 seconds), doubling the base set to a final benchmark of 1440 evaluation samples.
Figure 3: Distribution of Spoken Time Markers in the first-stage training data.
Figure 4: Duration MAE and MAPE of Qwen2-Omni-7B and TiCo across instructed-duration bins. TiCo maintains consistently lower error across all duration ranges.
Figure 5: Duration error of TiCo across instructed-duration bins, comparing two reference signals: the instructed duration $t_{\mathrm{inst}}$ and the final Spoken Time Marker $t_{\mathrm{last}}$. The close alignment indicates that the final time marker accurately estimates realized speech duration.
...and 3 more figures
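
Figures 4 and 5 report duration MAE and MAPE against a reference duration, either the instructed duration $t_{\mathrm{inst}}$ or the final marker $t_{\mathrm{last}}$. The sketch below computes both metrics under their standard definitions. The exact definitions used in the paper are not given on this page, so the formulas, the function names, and the sample numbers are assumptions for illustration only.

```python
from typing import Sequence


def _check(reference: Sequence[float], realized: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before averaging."""
    if len(reference) == 0 or len(reference) != len(realized):
        raise ValueError("inputs must be non-empty and of equal length")


def duration_mae(reference: Sequence[float], realized: Sequence[float]) -> float:
    """Mean absolute error in seconds between reference and realized durations."""
    _check(reference, realized)
    return sum(abs(r - d) for r, d in zip(reference, realized)) / len(reference)


def duration_mape(reference: Sequence[float], realized: Sequence[float]) -> float:
    """Mean absolute percentage error, relative to the reference duration."""
    _check(reference, realized)
    ratios = [abs(r - d) / r for r, d in zip(reference, realized)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical values: instructed durations and measured speech durations.
t_inst = [10.0, 20.0, 40.0]
measured = [12.0, 18.0, 44.0]
print(duration_mae(t_inst, measured))
print(duration_mape(t_inst, measured))
```

Passing the final marker values in place of `t_inst` gives the second comparison described in Figure 5, which checks how closely the last marker tracks the realized speech duration.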
