Table of Contents
Fetching ...

LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun

TL;DR

LUCY introduces an end-to-end speech model that jointly optimizes emotion control, naturalness, and informativeness through carefully crafted synthetic data and a parallel text-speech architecture. By integrating linguistic and acoustic emotion cues, multi-round history, and function-calling, LUCY achieves richer emotional expression, more natural spoken responses, and real-time tool usage without sacrificing task performance. The three-stage training pipeline (encoder pretraining, language-model fine-tuning on AudioQA-1.0M, and targeted fine-tuning) enables robust multi-turn dialogue and silent function calls via batch-parallel decoding. Across evaluations on speech emotion, function calling, natural conversation, and spoken QA, LUCY demonstrates superior emotion control, competitive naturalness, and effective external knowledge access, with practical latency suitable for real-time interactions. These results underscore the potential of unified E2E speech systems for human-like audio agents in interactive settings.

Abstract

The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward more sophisticated audio agent from recent advancement in end-to-end (E2E) speech systems, we propose LUCY, a E2E speech model that (1) senses and responds to user's emotion, (2) deliver responses in a succinct and natural style, and (3) use external tool to answer real-time inquiries. Experiment results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. Lucy is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are out of its knowledge scope.

LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

TL;DR

LUCY introduces an end-to-end speech model that jointly optimizes emotion control, naturalness, and informativeness through carefully crafted synthetic data and a parallel text-speech architecture. By integrating linguistic and acoustic emotion cues, multi-round history, and function-calling, LUCY achieves richer emotional expression, more natural spoken responses, and real-time tool usage without sacrificing task performance. The three-stage training pipeline (encoder pretraining, language-model fine-tuning on AudioQA-1.0M, and targeted fine-tuning) enables robust multi-turn dialogue and silent function calls via batch-parallel decoding. Across evaluations on speech emotion, function calling, natural conversation, and spoken QA, LUCY demonstrates superior emotion control, competitive naturalness, and effective external knowledge access, with practical latency suitable for real-time interactions. These results underscore the potential of unified E2E speech systems for human-like audio agents in interactive settings.

Abstract

The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward more sophisticated audio agent from recent advancement in end-to-end (E2E) speech systems, we propose LUCY, a E2E speech model that (1) senses and responds to user's emotion, (2) deliver responses in a succinct and natural style, and (3) use external tool to answer real-time inquiries. Experiment results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. Lucy is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are out of its knowledge scope.

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: Architecture overview.
  • Figure 2: Illustration of Emotion and Speaker Tokens.
  • Figure 3: Batch Parallel Decoding for Function-Call Samples.