Zero-shot Audio Topic Reranking using Large Language Models

Mengjie Qian, Rao Ma, Adian Liusie, Erfan Loweimi, Kate M. Knill, Mark J. F. Gales

TL;DR

This work examines reranking approaches to compensate for any performance loss from rapid, embedding-based archive search. In particular, it investigates zero-shot reranking methods using large language models (LLMs), as these are applicable to the audio content of any video archive.

Abstract

Multimodal Video Search by Examples (MVSE) investigates using video clips as the query term for information retrieval, rather than the more traditional text query. This enables far richer search modalities such as images, speaker, content, topic, and emotion. A key element of this process is rapid and flexible search to support large archives, which in MVSE is facilitated by representing video attributes with embeddings. This work aims to compensate for any performance loss from this rapid archive search by examining reranking approaches. In particular, zero-shot reranking methods using large language models (LLMs) are investigated, as these are applicable to the audio content of any video archive. Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus. Results demonstrate that reranking significantly improves retrieval ranking without requiring any task-specific in-domain training data. Furthermore, three sources of information (ASR transcriptions, automatic summaries and synopses) are compared as input for LLM reranking. To better understand the performance differences and limitations of these text sources, we employ a fact-checking approach to analyse the information consistency among them.
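
The two-stage idea in the abstract (fast embedding search followed by reranking of a short candidate list) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy embeddings and texts, and in particular the word-overlap scorer, which is only a stand-in for an LLM relevance judgment and is not the method used in the paper.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_emb: Sequence[float],
             archive: Dict[str, Sequence[float]],
             k: int) -> List[str]:
    """Stage 1: rank archive items by embedding similarity and keep the top k."""
    ranked = sorted(archive,
                    key=lambda doc_id: cosine(query_emb, archive[doc_id]),
                    reverse=True)
    return ranked[:k]


def rerank(query_text: str,
           candidates: List[str],
           texts: Dict[str, str],
           scorer: Callable[[str, str], float]) -> List[str]:
    """Stage 2: reorder the candidates with a text-based relevance scorer.

    In a real system the scorer would wrap an LLM call; here it is pluggable.
    """
    return sorted(candidates,
                  key=lambda doc_id: scorer(query_text, texts[doc_id]),
                  reverse=True)


def word_overlap(query_text: str, doc_text: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase word sets."""
    q = set(query_text.lower().split())
    d = set(doc_text.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


# Toy data (invented for this example).
archive = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
texts = {
    "a": "football match report",
    "b": "election results and politics",
    "c": "weather forecast",
}

candidates = retrieve([1.0, 0.0], archive, k=2)
reranked = rerank("politics and election coverage", candidates, texts, word_overlap)
print(candidates, reranked)
```

Keeping the scorer as a plain callable means the same pipeline could accept transcriptions, summaries, or synopses as the text source without changing the retrieval stage.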

Paper Structure

This paper contains 13 sections, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Pipeline of the MVSE audio topic retrieval system.
  • Figure 2: Topic embedding extraction.
  • Figure 3: Prompts for fact checking: 1-shot prompt for facts decomposition and zero-shot for fact evaluation.
  • Figure 4: Prompt design for listwise topic reranking.
  • Figure 5: Prompt design for pairwise topic reranking.
  • ...and 1 more figure