Reject or Not?: A Benchmark for Voice Assistant Query Rejection in Smart Home Scenario and an Improved Method Based on LLMs

Huichao Men, Yizhen Hu, Yingyang He, Yu Gao, Xiaofeng Mou, Yi Xu

TL;DR

This work tackles query rejection in smart-home voice assistants by introducing the first Chinese-oriented open multimodal rejection benchmark and an improved LLM-based rejection method. It makes two contributions: (1) a dataset of 11,913 text-speech samples across 13 invalid-utterance types with multi-turn context, and (2) a three-layer architecture combining a universal LLM adapter, household-level personalized memory, and a RAG-based mis-rejection knowledge base. Experiments show that the proposed method achieves up to 96.75% rejection accuracy and consistently outperforms zero-shot and fine-tuned baselines, especially on family-specific and complex multi-turn scenarios. The work provides a reproducible data foundation, an evaluation standard, and an extensible framework for reliable, context-aware rejection in smart-home voice interaction, with the aim of improving user trust and system efficiency.

Abstract

In smart-home voice assistant scenarios, deciding whether to accept or reject a user query is the first step before any downstream processing. To address the limited query-rejection capability of current voice assistants, this paper presents the first Chinese-oriented open-source benchmark and evaluation suite for smart homes, together with a personalized query-rejection method based on large language models. On the data side, we construct the first multimodal query-rejection dataset tailored to domestic scenarios, containing 11,913 manually labeled text-speech pairs that systematically cover twelve typical dialogue types (e.g., chit-chat, non-human sounds, valid commands, ambiguous references, device-irrelevant requests). Fine-grained labels, conversational context, and multi-turn information are provided to support both zero-shot and fine-tuning evaluations of language and multimodal large models. On the method side, we propose a three-tier collaborative architecture: first, a Qwen-2.5-3B adapter fine-tuned to model family-agnostic semantic boundaries; second, a dynamic household-level historical dialogue module that captures personalized habits; third, a household-specific RAG knowledge base that explicitly memorizes and revises past false-rejection cases. Experiments show that the proposed approach significantly outperforms zero-shot and fine-tuned general LLMs on the constructed dataset, with pronounced gains in rejection accuracy for family-specific expressions and complex multi-turn scenarios. This work provides a reproducible data foundation, an evaluation standard, and an extensible technical framework for reliability research in smart-home voice interaction.
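
To make the three-tier idea more concrete, the following is a minimal Python sketch of how such tiers could be wired together. It is not the authors' implementation: the class `HouseholdRejector`, the helper `token_overlap`, the keyword-based `keyword_base_model`, the similarity threshold, and the order in which the tiers are consulted are all illustrative assumptions. In particular, the fine-tuned adapter is replaced by a toy keyword rule, and retrieval uses token overlap rather than learned embeddings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens; a toy stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def keyword_base_model(query: str, context: List[str]) -> bool:
    """Hypothetical stand-in for the family-agnostic model.

    Returns True when the query should be rejected. A real system would call
    a fine-tuned LLM here and make use of `context`; this stub ignores it.
    """
    device_words = {"light", "lights", "tv", "fan"}
    return not any(word in device_words for word in query.lower().split())


@dataclass
class HouseholdRejector:
    base_model: Callable[[str, List[str]], bool]               # tier 1: shared model
    history: List[str] = field(default_factory=list)           # tier 2: household dialogue history
    false_rejections: List[str] = field(default_factory=list)  # tier 3: past mis-rejections
    history_size: int = 5       # assumed context window, in turns
    sim_threshold: float = 0.6  # assumed retrieval threshold

    def decide(self, query: str) -> str:
        # Tier 3 (assumed to act as an override): accept queries that closely
        # match a case this household previously reported as wrongly rejected.
        matched = any(
            token_overlap(query, case) >= self.sim_threshold
            for case in self.false_rejections
        )
        if matched:
            decision = "accept"
        else:
            # Tiers 1 and 2: the shared model sees the query plus recent turns.
            context = self.history[-self.history_size:]
            decision = "reject" if self.base_model(query, context) else "accept"
        self.history.append(query)
        return decision

    def report_false_rejection(self, query: str) -> None:
        """Store a wrongly rejected query so similar ones are accepted later."""
        self.false_rejections.append(query)
```

In this sketch, a household-specific phrase that the shared model rejects can be reported once through `report_false_rejection`, after which similar phrasings are accepted without retraining the shared model. How the paper actually combines the three tiers may differ.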

Paper Structure

This paper contains 50 sections, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Overview of the Method