Optimizing Small Language Models for In-Vehicle Function-Calling

Yahya Sowti Khiabani, Farris Atif, Chieh Hsu, Sven Stahlmann, Tobias Michels, Sebastian Kramer, Benedikt Heidrich, M. Saquib Sarfraz, Julian Merten, Faezeh Tafazzoli

TL;DR

The paper addresses robust on-device function-calling for in-vehicle systems using small language models under strict hardware constraints. It proposes a holistic pipeline of structured pruning, healing, and task-specific fine-tuning, applied to Phi-3 mini and followed by 4-bit quantization and deployment via llama.cpp, which achieves real-time on-device inference at about $11$ tokens per second without accelerator hardware. Key findings: depth-wise pruning can remove up to about $1$–$2$B parameters with modest accuracy losses, whereas width pruning is more disruptive; extended healing followed by instruction tuning preserves or recovers capabilities, yielding function-calling accuracy of about $0.86$–$0.88$ across model sizes. The work demonstrates a scalable, edge-friendly approach to vehicle control that supports flexible user interactions and rapid software updates without dedicated hardware accelerators.
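
As a rough sanity check on why pruning and 4-bit quantization matter for fitting the model on a head unit, the weight footprint can be estimated from the parameter count and the bits stored per weight. The sketch below is illustrative only: the roughly 3.8B-parameter starting size comes from the public Phi-3 mini model card rather than from this summary, the 40-token function call is a hypothetical output length, and the estimate ignores quantization scales, activations, and the KV cache. The function names are made up for this example.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    original_params = 3.8e9                 # assumption: Phi-3 mini size from its model card
    pruned_params = original_params - 2e9   # "up to 2 billion parameters" removed

    for label, params, bits in [
        ("original, 16-bit", original_params, 16),
        ("original, 4-bit", original_params, 4),
        ("pruned, 4-bit", pruned_params, 4),
    ]:
        print(f"{label}: {weight_memory_gb(params, bits):.2f} GB")

    # Hypothetical 40-token function call at the reported 11 tokens per second.
    print(f"latency: {generation_time_s(40, 11.0):.2f} s")
```

Under these assumptions the weights shrink from about 7.6 GB at 16 bits to about 1.9 GB at 4 bits, and to about 0.9 GB once 2B parameters are pruned. Real 4-bit files are somewhat larger because block-wise quantization formats also store scale factors.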

Abstract

We propose a holistic approach for deploying Small Language Models (SLMs) as function-calling agents within vehicles as edge devices, offering a more flexible and robust alternative to traditional rule-based systems. By leveraging SLMs, we simplify vehicle control mechanisms and enhance the user experience. Given the in-vehicle hardware constraints, we apply state-of-the-art model compression techniques, including structured pruning, healing, and quantization, ensuring that the model fits within the resource limitations while maintaining acceptable performance. Our work focuses on optimizing a representative SLM, Microsoft's Phi-3 mini, and outlines best practices for enabling embedded models, including compression, task-specific fine-tuning, and vehicle integration. We demonstrate that, despite a significant reduction in model size, which removes up to 2 billion parameters from the original model, our approach preserves the model's ability to handle complex in-vehicle tasks accurately and efficiently. Furthermore, by executing the model in a lightweight runtime environment, we achieve a generation speed of 11 tokens per second, making real-time, on-device inference feasible without hardware acceleration. Our results demonstrate the potential of SLMs to transform vehicle control systems, enabling more intuitive interactions between users and their vehicles for an enhanced driving experience.

Paper Structure (12 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Proposed framework for optimizing and deploying an SLM for in-vehicle function-calling. Red marks the pruning stages, green the healing stages, and blue the function-calling alignment.
  • Figure 2: Heatmap of distances for all 32 decoder layers of Phi-3, with varying block sizes $n \in \{1,\ldots, 24\}$, calibrated with the fineweb dataset. Dark purple indicates regions of minimum distance or maximum similarity. Layers 21-29 (highlighted in green) were found to be the optimal block of size $n=8$ to prune.
  • Figure 3: CPU usage of the LLM process during inference on the vehicle head unit. The horizontal lines show binned values of the process's CPU usage over time. The Top 10% average (black line) is the mean of the highest 10% of CPU usage values.
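
The block selection described in the Figure 2 caption can be sketched as a search over contiguous blocks of $n$ layers for the one whose input and output representations are closest. The caption only says "distances"; the sketch below assumes an angular distance averaged over calibration samples, which is a common choice in the layer-pruning literature but is not confirmed by this summary. The function names and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def angular_distance(a: Vector, b: Vector) -> float:
    """Angle between two vectors, scaled to [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("angular distance is undefined for zero vectors")
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi


def best_block(hidden: List[List[Vector]], n: int) -> Tuple[int, float]:
    """Find the block of n consecutive layers that changes representations least.

    hidden[l][i] is the representation of calibration sample i at the input
    of layer l; hidden[-1] holds the outputs of the last layer.
    Returns (index of the first layer in the block, mean distance).
    """
    num_layers = len(hidden) - 1
    if not 1 <= n <= num_layers:
        raise ValueError("block size must be between 1 and the number of layers")
    best_start, best_dist = -1, math.inf
    for start in range(num_layers - n + 1):
        dists = [
            angular_distance(a, b)
            for a, b in zip(hidden[start], hidden[start + n])
        ]
        mean = sum(dists) / len(dists)
        if mean < best_dist:
            best_start, best_dist = start, mean
    return best_start, best_dist


def toy_hidden_states() -> List[List[Vector]]:
    """One calibration sample passing through 4 toy layers (2-D vectors)."""
    angles_deg = [0, 40, 45, 50, 90]
    return [
        [(math.cos(math.radians(d)), math.sin(math.radians(d)))]
        for d in angles_deg
    ]


if __name__ == "__main__":
    start, dist = best_block(toy_hidden_states(), n=2)
    print(f"prune layers {start}..{start + 1}, mean distance {dist:.4f}")
```

In the toy data the middle layers barely rotate the representation, so the search selects the block starting at layer index 1. Repeating the search for each block size would give one row of a heatmap like the one in Figure 2.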