Smarter, not Bigger: Fine-Tuned RAG-Enhanced LLMs for Automotive HIL Testing

Chao Feng, Zihan Liu, Siddhant Gupta, Gongpei Cui, Jan von der Assen, Burkhard Stiller

TL;DR

This work tackles fragmentation in automotive HIL validation data by introducing HIL-GPT, a retrieval-augmented agent that couples domain-adapted embeddings with a semantic vector store to access requirements, test sequences, and CAN/CAPL artifacts. The approach demonstrates that fine-tuned, compact embedding models can rival or exceed larger models in retrieval accuracy while offering lower latency and cost, and that RAG enhances grounding and reduces hallucinations in domain tasks. An extensive evaluation, including offline metrics and an engineering-focused A/B study, confirms improvements in relevance, truthfulness, and user satisfaction, albeit with some latency trade-offs. The results advocate for deploying efficient, domain-aligned LLM-based assistants in industrial HIL environments and point to future work in broader domain coverage and enhanced reasoning with structured signals.

Abstract

Hardware-in-the-Loop (HIL) testing is essential for automotive validation but suffers from fragmented and underutilized test artifacts. This paper presents HIL-GPT, a retrieval-augmented generation (RAG) system integrating domain-adapted large language models (LLMs) with semantic retrieval. HIL-GPT leverages embedding fine-tuning using a domain-specific dataset constructed via heuristic mining and LLM-assisted synthesis, combined with vector indexing for scalable, traceable test case and requirement retrieval. Experiments show that fine-tuned compact models, such as bge-base-en-v1.5, achieve a superior trade-off between accuracy, latency, and cost compared to larger models, challenging the notion that bigger is always better. An A/B user study further confirms that RAG-enhanced assistants improve perceived helpfulness, truthfulness, and satisfaction over general-purpose LLMs. These findings provide insights for deploying efficient, domain-aligned LLM-based assistants in industrial HIL environments.

Paper Structure

This paper contains 33 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: System architecture of HIL-GPT.
  • Figure 2: Pre- and post-fine-tuning accuracy comparison (Top-1 retrieval).
  • Figure 3: Top-1 retrieval accuracy of bge-base-en-v1.5 under different training regimes.
  • Figure 4: Comparison of source attribution accuracy between GPT-4o and GPT-4o-mini using original and fine-tuned bge-base-en-v1.5 embeddings.
  • Figure 5: User evaluation ratings: Bot A (with RAG) vs. Bot B (no RAG).
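
Figures 2 and 3 report Top-1 retrieval accuracy. As a rough illustration of how such a metric is commonly computed, the sketch below ranks a small set of artifacts by cosine similarity to each query embedding and counts how often the best match is the expected one. It is not the paper's evaluation code: the function names, the artifact IDs (REQ-1, TC-7, CAPL-3), and the two-dimensional toy vectors are all hypothetical, and a real setup would obtain the vectors from an embedding model such as bge-base-en-v1.5 and query a vector index instead of scanning a dictionary.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_accuracy(query_vecs, doc_vecs, gold_ids):
    """Fraction of queries whose most similar document is the gold document.

    query_vecs: list of query embeddings
    doc_vecs:   dict mapping document ID -> embedding
    gold_ids:   list of expected document IDs, one per query
    """
    if not gold_ids:
        raise ValueError("at least one labeled query is required")
    hits = 0
    for query, gold in zip(query_vecs, gold_ids):
        # Brute-force nearest neighbour; a vector store would replace this scan.
        best = max(doc_vecs, key=lambda doc_id: cosine(query, doc_vecs[doc_id]))
        if best == gold:
            hits += 1
    return hits / len(gold_ids)


# Hypothetical toy data: three artifacts and four labeled queries.
docs = {"REQ-1": [1.0, 0.0], "TC-7": [0.0, 1.0], "CAPL-3": [0.7, 0.7]}
queries = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8], [1.0, 0.2]]
gold = ["REQ-1", "TC-7", "CAPL-3", "CAPL-3"]

accuracy = top1_accuracy(queries, docs, gold)
print(f"Top-1 accuracy: {accuracy:.2f}")  # 3 of 4 queries hit, so 0.75
```

In this toy setup the last query sits closer to REQ-1 than to its labeled artifact CAPL-3, so it counts as a miss. Fine-tuning an embedding model on domain pairs aims to reduce exactly this kind of error by pulling queries toward their correct artifacts.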