A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts

Wenzhuo Du, Gerun Wang, Guancheng Chen, Hang Zhao, Xin Li, Jian Gao

TL;DR

MiLoRA-ViSum introduces a mixture-of-LoRA-experts framework that extends LoRA to dual temporal–spatial adaptation within the Video-LLaMA backbone for video summarization. By dynamically combining multiple low-rank updates across temporal-attention and spatial-convolution modules, and employing alignment-guided cross-modal fusion, the method achieves competitive summarization quality with substantially fewer trainable parameters. Extensive experiments on VideoXum and ActivityNet show strong performance on ROUGE, BERTScore, METEOR, SacreBLEU, and NIST, while reducing training effort and inference latency. The work demonstrates a practical, scalable approach for efficient high-quality video summarization in large-scale deployments.

Abstract

With the exponential growth of user-generated content on video-sharing platforms, efficient searching and browsing of videos has attracted significant attention. Concise, informative video summaries help users quickly locate and review relevant videos, and have therefore become increasingly important. Video-LLaMA is an effective tool for generating video summaries, but it does not effectively unify and optimize the modeling of temporal and spatial features, and it requires substantial computational resources and time. We therefore propose MiLoRA-ViSum, which captures the complex temporal dynamics and spatial relationships inherent in video data more efficiently while limiting the number of trainable parameters. By extending traditional Low-Rank Adaptation (LoRA) into a mixture-of-experts paradigm, MiLoRA-ViSum incorporates a dual temporal-spatial adaptation mechanism tailored specifically to video summarization. This approach dynamically integrates specialized LoRA experts, each fine-tuned to address a distinct temporal or spatial dimension. Extensive evaluations on the VideoXum and ActivityNet datasets demonstrate that MiLoRA-ViSum achieves the best summarization performance among the compared state-of-the-art models, while maintaining significantly lower computational costs. The proposed mixture-of-experts strategy, combined with the dual adaptation mechanism, highlights the model's potential to enhance video summarization, particularly in large-scale applications requiring both efficiency and precision.
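To make the idea of dynamically combining several low-rank updates concrete, here is a minimal NumPy sketch of one mixture-of-LoRA-experts linear layer. It is an illustration under stated assumptions, not the authors' implementation: the class name `MixtureOfLoRA`, the softmax gate over a linear projection, the zero initialization of `B`, and all hyperparameters are hypothetical choices made for this example. The paper's actual routing and its placement in Video-LLaMA's temporal and spatial modules are not reproduced here.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class MixtureOfLoRA:
    """Frozen linear layer plus a gated sum of low-rank (LoRA) expert updates.

    Hypothetical sketch: y = x W + (alpha / r) * sum_e g_e(x) * (x A_e B_e)
    """

    def __init__(self, d_in, d_out, num_experts=4, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pretrained weight (random here, standing in for a backbone weight).
        self.W = rng.normal(0.0, 0.02, size=(d_in, d_out))
        # Trainable low-rank factors, one pair (A_e, B_e) per expert.
        self.A = rng.normal(0.0, 0.02, size=(num_experts, d_in, rank))
        # B starts at zero so the layer initially matches the frozen layer.
        self.B = np.zeros((num_experts, rank, d_out))
        # Trainable gate (assumed form: softmax over a linear projection).
        self.W_gate = rng.normal(0.0, 0.02, size=(d_in, num_experts))
        self.scale = alpha / rank

    def gate(self, x):
        """Per-input mixing weights over experts, shape (n, num_experts)."""
        return softmax(x @ self.W_gate)

    def forward(self, x):
        g = self.gate(x)                                  # (n, E)
        base = x @ self.W                                 # (n, d_out)
        low = np.einsum("nd,edr->enr", x, self.A)         # (E, n, r)
        delta = np.einsum("enr,ero->eno", low, self.B)    # (E, n, d_out)
        mixed = np.einsum("ne,eno->no", g, delta)         # gated sum over experts
        return base + self.scale * mixed

    def trainable_params(self):
        """Number of parameters updated during fine-tuning (W stays frozen)."""
        return self.A.size + self.B.size + self.W_gate.size


# Usage: three feature vectors of width 512 passed through the adapted layer.
layer = MixtureOfLoRA(d_in=512, d_out=512, num_experts=4, rank=8)
features = np.random.default_rng(1).normal(size=(3, 512))
output = layer.forward(features)
print(output.shape, layer.trainable_params(), layer.W.size)
```

In this toy configuration the experts and gate together hold 34,816 trainable values against 262,144 frozen ones, which is the kind of parameter saving the abstract refers to. The real ratio depends on the backbone and on the ranks the authors chose.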

Paper Structure

This paper contains 21 sections, 17 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure S1: MiLoRA-ViSum model architecture.
  • Figure S2: Comparison between our proposal and other works in terms of accuracy.
  • Figure S3: Comparison between our proposal and other works in terms of accuracy.
  • Figure S4: Comparison between our proposal and other works in terms of intensity.