A Multi-model Approach for Video Data Retrieval in Autonomous Vehicle Development
Jesper Knapp, Klas Moberg, Yuchuan Jin, Simin Sun, Miroslaw Staron
TL;DR
This paper tackles the challenge of locating specific driving scenarios in massive autonomous-vehicle logs without relying on SQL. It proposes a three-stage multi-model pipeline that converts high-frequency sensor signals and video into textual descriptions, computes embeddings with BGE-large, and stores them in the vector database ChromaDB for semantic search. Evaluation indicates that Gemma-7b provides strong descriptive quality and that multimodal inputs improve retrieval relevance, though challenges remain in video resolution and frame-level detail. The approach offers a practical natural-language search tool that complements SQL-based searches, potentially speeding up scenario analysis and software development in autonomous driving.
Abstract
Autonomous driving software generates enormous amounts of data every second, which software development organizations save as logs for future analysis and testing. Given the vast size of this data, locating specific scenarios within a collection of vehicle logs can be challenging. Writing the correct SQL queries to find these scenarios requires engineers to have a strong background in SQL and in the specific databases in question, which further complicates the search process. This paper presents and evaluates a pipeline that allows searching for specific scenarios in log collections using natural language descriptions instead of SQL. The generated descriptions were evaluated on a scale from 1 to 5 by engineers working with vehicle logs at Zenseact. Our approach achieved a mean score of 3.3, demonstrating the potential of using a multi-model architecture to improve the software development workflow. We also present an interface that visualizes the query process and the results.
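
To make the describe, embed, store, and search flow concrete, the following is a minimal, self-contained sketch in Python. It is not the authors' implementation: the log fields, the `describe` template, the bag-of-words `embed` function, and the `InMemoryIndex` class are all hypothetical stand-ins chosen so the example runs with the standard library alone. In the paper, descriptions come from language models, embeddings from BGE-large, and storage and search from ChromaDB.

```python
import math
import re
from collections import Counter


def describe(log):
    """Stage 1 stand-in: render a log record as a textual description.

    The field names are hypothetical; the paper generates descriptions
    with language models rather than a fixed template.
    """
    return (
        f"ego vehicle driving at {log['speed_kmh']} km/h, "
        f"{log['event']}, weather {log['weather']}"
    )


def embed(text):
    """Stage 2 stand-in: bag-of-words counts instead of a BGE-large vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Stage 3 stand-in: a toy replacement for a vector database."""

    def __init__(self):
        self.items = []  # (log_id, description, embedding)

    def add(self, log_id, description):
        self.items.append((log_id, description, embed(description)))

    def query(self, text, k=1):
        """Return the k stored descriptions most similar to the query text."""
        q = embed(text)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[2]), reverse=True)
        return [(log_id, desc) for log_id, desc, _ in ranked[:k]]


# Hypothetical log records, invented for illustration only.
logs = [
    {"id": "log-001", "speed_kmh": 90,
     "event": "overtaking a truck on the highway", "weather": "clear"},
    {"id": "log-002", "speed_kmh": 30,
     "event": "pedestrian crossing ahead", "weather": "rain"},
    {"id": "log-003", "speed_kmh": 50,
     "event": "cyclist on the right side", "weather": "fog"},
]

index = InMemoryIndex()
for log in logs:
    index.add(log["id"], describe(log))

# A natural-language query takes the place of a hand-written SQL statement.
results = index.query("pedestrian crossing in the rain", k=1)
print(results[0][0])
```

The word-count embedding only matches on shared tokens, so it cannot capture paraphrases the way a learned embedding model can; it serves here only to show where each stage sits in the flow.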
