Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets

Huy M. Le, Dat Tien Nguyen, Phuc Binh Nguyen, Gia-Bao Le-Tran, Phu Truong Thien, Cuong Dinh, Minh Nguyen, Nga Nguyen, Thuy T. N. Nguyen, Huy Gia Ngo, Tan Nhat Nguyen, Binh T. Nguyen, Monojit Choudhury

TL;DR

The paper addresses the need for fast, accurate large-scale video retrieval under the tight time constraints of the Video Browser Showdown (VBS). It introduces Fusionista2.0, a multi-modal retrieval system that re-engineers core modules for speed, including an all-in-one ffmpeg-based keyframe extractor, Vintern-1B-v3.5 OCR, faster_whisper ASR, and lightweight VLLMs for QA, complemented by an ensemble textual search on CLIP-Sig400M and CLIP-ViT-5B using $s(q,v)=\alpha\, s_{\text{Sig400M}}(q,v)+(1-\alpha)\, s_{\text{ViT-5B}}(q,v)$ with $\alpha=0.7$. A GPT-4o-driven reranking pipeline and VLLM choices (VideoLLaMA, BLIP-2) refine results based on clarifying yes-no questions. A redesigned UI/UX and batch operations improve usability; results show up to 75% faster retrieval and improved accuracy and user satisfaction, confirming Fusionista2.0's practical impact for VBS2026.
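
The ensemble score in the TL;DR is a weighted sum of two per-keyframe similarities. A minimal sketch of that fusion follows, assuming the two similarity scores have already been computed for each keyframe and are on comparable scales. The function names, the dictionary layout, and the tie-breaking rule are illustrative assumptions, not details taken from the paper.

```python
def fuse_score(s_sig400m: float, s_vit5b: float, alpha: float = 0.7) -> float:
    """Weighted sum s = alpha * s_Sig400M + (1 - alpha) * s_ViT-5B."""
    return alpha * s_sig400m + (1.0 - alpha) * s_vit5b


def rank_keyframes(scores_sig400m: dict, scores_vit5b: dict, alpha: float = 0.7) -> list:
    """Return (keyframe_id, fused_score) pairs, best first.

    Both inputs map a keyframe id to a query-keyframe similarity
    (hypothetical layout). Only keyframes scored by both models are ranked.
    """
    shared = scores_sig400m.keys() & scores_vit5b.keys()
    fused = {
        k: fuse_score(scores_sig400m[k], scores_vit5b[k], alpha) for k in shared
    }
    # Sort by descending score; break ties by id so the order is deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sig = {"frame_a": 0.9, "frame_b": 0.2}
    vit = {"frame_a": 0.1, "frame_b": 0.8}
    for keyframe_id, score in rank_keyframes(sig, vit):
        print(keyframe_id, round(score, 3))
```

With $\alpha=0.7$ the first model dominates the ranking, so a keyframe it scores highly stays on top even when the second model disagrees; lowering `alpha` shifts that balance toward the second model.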

Abstract

The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of our system.
  • Figure 2: Illustration of the reranking process. Images from the prior search are reranked based on the number of yes answers from the vision-language model.
  • Figure 3: Our system's UI/UX for the main functions. The top-left, top-right, bottom-left, and bottom-right panels show ASR search, object search, OCR search, and QA, respectively.
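
The reranking described in the Figure 2 caption can be sketched as counting yes answers per candidate image. This is a rough illustration under stated assumptions: `answer_fn` is a hypothetical stand-in for whatever vision-language model answers a yes-no question about an image, and keeping the original order among ties is an assumed choice, not something the captions specify.

```python
from typing import Callable, List, Sequence


def rerank_by_yes_count(
    candidates: Sequence[str],
    questions: Sequence[str],
    answer_fn: Callable[[str, str], str],
) -> List[str]:
    """Reorder candidates by how many questions are answered "yes".

    candidates: image ids in their prior search order.
    questions:  clarifying yes-no questions.
    answer_fn:  hypothetical callable (image_id, question) -> answer text.
    """
    scored = []
    for prior_rank, image_id in enumerate(candidates):
        yes_count = sum(
            1
            for question in questions
            if answer_fn(image_id, question).strip().lower().startswith("yes")
        )
        # Negative count sorts higher counts first; prior_rank keeps ties
        # in their original search order.
        scored.append((-yes_count, prior_rank, image_id))
    scored.sort()
    return [image_id for _, _, image_id in scored]


if __name__ == "__main__":
    # Canned answers standing in for a real model (illustrative only).
    canned = {
        ("img1", "Is there a dog?"): "No",
        ("img2", "Is there a dog?"): "Yes",
    }
    fake_model = lambda image_id, question: canned.get((image_id, question), "No")
    print(rerank_by_yes_count(["img1", "img2"], ["Is there a dog?"], fake_model))
```

Passing the model in as a callable keeps the sketch independent of any particular model API, so the same function works with a real model wrapper or with canned answers in tests.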