Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks

Lucca Emmanuel Pineli Simões, Lucas Brandão Rodrigues, Rafaela Mota Silva, Gustavo Rodrigues da Silva

TL;DR

The study addresses hands-free drone control by comparing three voice-to-action pipelines on a Tello drone: STT followed by an LLM, direct audio-to-command classification, and a Siamese-network-based approach. It uses Wav2Vec2-based features, Llama3 for language interpretation, a direct classification model fine-tuned on Portuguese audio, and a Siamese architecture with a vector database and KNN retrieval for flexible generalization. Results show that the Direct Model achieves the highest accuracy (~0.99) with low latency, the Siamese network excels in inference speed (0.006 s) and in generalization to new commands, and STT+LLM offers robust interpretation at the cost of longer latency. These findings guide deployment choices for voice-controlled drones by balancing accuracy, responsiveness, and adaptability in real-world scenarios, and they point to future improvements through dataset expansion and continued model enhancements.

Abstract

This paper presents the development and comparative evaluation of three voice command pipelines for controlling a Tello drone using speech recognition and deep learning techniques. The aim is to enhance human-machine interaction by enabling intuitive voice control of drone actions. The pipelines are: (1) a traditional approach in which Speech-to-Text (STT) is followed by a Large Language Model (LLM), (2) a direct voice-to-function mapping model, and (3) a Siamese neural network-based system. Each pipeline was evaluated on inference time, accuracy, efficiency, and flexibility. Detailed methodologies, dataset preparation, and evaluation metrics are provided, offering a comprehensive analysis of each pipeline's strengths and applicability across different scenarios.
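
To make the retrieval step of the Siamese pipeline more concrete, the following is a minimal sketch of KNN lookup over stored command embeddings. It is not the authors' implementation. The toy 3-dimensional vectors stand in for embeddings that a Siamese network would produce, the in-memory list stands in for the vector database, and the names `cosine_similarity` and `knn_command`, the choice of cosine similarity, and majority voting with `k=3` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_command(query, index, k=3):
    """Return the majority label among the k stored embeddings closest to query.

    index is a list of (embedding, label) pairs; here it plays the role
    of the vector database (an assumption for this sketch).
    """
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy embeddings (hypothetical values, not taken from the paper).
index = [
    ([0.9, 0.1, 0.0], "takeoff"),
    ([0.8, 0.2, 0.1], "takeoff"),
    ([0.1, 0.9, 0.0], "land"),
    ([0.0, 0.8, 0.2], "land"),
]

# Adding a new command only requires storing new reference embeddings;
# nothing is retrained in this sketch.
index.append(([0.0, 0.1, 0.9], "flip"))
index.append(([0.1, 0.0, 0.8], "flip"))

print(knn_command([0.85, 0.15, 0.05], index))
```

The design point this illustrates is the one the summary attributes to the Siamese approach: because classification is a nearest-neighbor lookup, a new command can be supported by appending reference embeddings rather than by retraining a classifier head.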

Paper Structure (15 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Overview of the STT and LLM pipeline
  • Figure 2: Overview of the Classification Model pipeline
  • Figure 3: Overview of the Siamese Network pipeline