
A Multi-model Approach for Video Data Retrieval in Autonomous Vehicle Development

Jesper Knapp, Klas Moberg, Yuchuan Jin, Simin Sun, Miroslaw Staron

TL;DR

This paper tackles the challenge of locating specific driving scenarios in massive autonomous-vehicle logs without relying on SQL. It proposes a three-stage multi-model pipeline that converts high-frequency sensor signals and video into textual descriptions, computes embeddings with BGE-large, and stores them in the vector database ChromaDB for semantic search. Evaluation indicates Gemma-7b provides strong descriptive quality and that multimodal inputs improve retrieval relevance, though challenges remain in video resolution and frame-level detail. The approach offers a practical natural-language tool that complements SQL-based searches, potentially speeding up scenario analysis and software development in autonomous driving.

Abstract

Autonomous driving software generates enormous amounts of data every second, which software development organizations save as logs for future analysis and testing. Given the vast size of this data, locating specific scenarios within a collection of vehicle logs can be challenging. Writing the correct SQL queries to find these scenarios requires engineers to have a strong background in SQL and in the specific databases in question, which further complicates the search process. This paper presents and evaluates a pipeline that allows searching for specific scenarios in log collections using natural language descriptions instead of SQL. The generated descriptions were evaluated on a scale from 1 to 5 by engineers working with vehicle logs at Zenseact. Our approach achieved a mean score of 3.3, demonstrating the potential of using a multi-model architecture to improve the software development workflow. We also present an interface that visualizes both the query process and the results.

Paper Structure (14 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: SQL code for "Find where left side emergency lka interventions are triggered"
  • Figure 2: Illustration of vehicle log format with randomized data-points.
  • Figure 3: Sample images captured in the vehicle from the Zenseact Open Dataset (ZOD)
  • Figure 4: The data retrieval pipeline consists of three stages: converting signal and video logs into text (yellow), combining and embedding these descriptions and storing the embeddings (grey), and retrieving the most similar scenarios to a natural language query (pink)
  • Figure 5: User interface for scenario querying.
  • ...and 3 more figures
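
The three stages in the Figure 4 caption (text descriptions, combined embeddings in a store, similarity search against a natural language query) can be illustrated with a minimal sketch. This is not the authors' implementation: the log IDs and descriptions are hypothetical, the bag-of-words `embed` function is a stand-in for BGE-large, and the in-memory dictionary is a stand-in for ChromaDB. It only shows the data flow under those assumptions.

```python
import math
import re
from collections import Counter

# Stage 1 (assumed output): hypothetical text descriptions per log.
# In the paper these come from models that describe signal and video logs.
SIGNAL_TEXT = {
    "log_001": "Emergency lane keeping intervention triggered on the left side at 80 km/h.",
    "log_002": "Vehicle brakes hard and comes to a full stop.",
}
VIDEO_TEXT = {
    "log_001": "Highway in daylight, the car drifts toward the left lane marking.",
    "log_002": "Urban street at night, a pedestrian crosses in front of the car.",
}


def embed(text):
    """Stand-in embedding: word counts. NOT BGE-large, only a placeholder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(signal_text, video_text):
    """Stage 2: combine both descriptions per log and store one embedding each.

    The dict plays the role of the vector database (ChromaDB in the paper).
    """
    index = {}
    for log_id, signal_description in signal_text.items():
        combined = signal_description + " " + video_text.get(log_id, "")
        index[log_id] = embed(combined)
    return index


def search(index, query, top_k=1):
    """Stage 3: rank stored logs by similarity to a natural language query."""
    query_vector = embed(query)
    ranked = sorted(
        index,
        key=lambda log_id: cosine(query_vector, index[log_id]),
        reverse=True,
    )
    return ranked[:top_k]


index = build_index(SIGNAL_TEXT, VIDEO_TEXT)
print(search(index, "left side emergency lane keeping intervention"))
```

Swapping `embed` for a real sentence-embedding model and the dictionary for a vector database would keep the same three-stage shape; word-count vectors only match on shared vocabulary, whereas a learned embedding is meant to match on meaning.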