VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition

Lan Chen, Haoxiang Yang, Pengpeng Shao, Haoyu Song, Xiao Wang, Zhicheng Zhao, Yaowei Wang, Yonghong Tian

TL;DR

VELoRA is a parameter-efficient fine-tuning (PEFT) strategy for RGB-Event recognition. It applies modality-specific LoRA blocks to the lower Transformer layers of the RGB and event streams, adds frame-difference motion cues, and fuses the modalities with a modality-shared LoRA in the final fusion stage. The backbone is a ViT-B/16 (CLIP) model with all base parameters frozen; only the low-rank LoRA matrices are updated, and reconstruction losses enforce cross-modal alignment. On PokerEvent and HARDVS the method reaches competitive top-1 accuracy (57.99% and 50.89%, respectively) while reducing the trainable parameters from 1719 MB to 7.02 MB and lowering memory usage. The work also provides ablations and visual analyses to justify the design, and the authors state that code and pretrained models will be released.

Abstract

Pattern recognition that combines RGB and Event cameras can improve significantly when deep neural networks are adapted with a fine-tuning strategy. Inspired by the success of large models, such models can also be introduced to further enhance the performance of multi-modal tasks. However, fully fine-tuning these models is inefficient, and lightweight fine-tuning methods such as LoRA and Adapter have been proposed to achieve a better balance between efficiency and performance. To our knowledge, no existing work has conducted parameter-efficient fine-tuning (PEFT) for RGB-Event recognition based on pre-trained foundation models. To address this gap, this paper proposes a novel PEFT strategy to adapt pre-trained foundation vision models for RGB-Event-based classification. Specifically, given the RGB frames and event streams, we extract the RGB and event features with the vision foundation model ViT using a modality-specific LoRA tuning strategy. The frame differences of the two modalities are also used to capture motion cues via a frame-difference backbone network. These features are concatenated and fed into high-level Transformer layers for efficient multi-modal feature learning via modality-shared LoRA tuning. Finally, we concatenate these features and feed them into a classification head to achieve efficient fine-tuning. The source code and pre-trained models will be released on https://github.com/Event-AHU/VELoRA.
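
To make the "frozen backbone, trainable low-rank matrices" idea concrete, the following is a minimal NumPy sketch of a generic LoRA-adapted linear layer. It is not the authors' implementation: the class name `LoRALinear`, the rank, the `alpha` scaling, and the initialization are illustrative assumptions, and the summary above does not say which projection matrices VELoRA actually adapts. The width 768 matches the ViT-B hidden size.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, d_in, d_out, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a pre-trained weight; it would never be updated.
        self.W = rng.standard_normal((d_out, d_in)) * 0.02
        # Trainable low-rank factors (assumed init): A is small random,
        # B starts at zero, so the adapted layer initially equals the frozen one.
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out)
        frozen = x @ self.W.T
        update = (x @ self.A.T) @ self.B.T
        return frozen + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size

    def frozen_params(self):
        return self.W.size


layer = LoRALinear(768, 768, rank=4)
print(layer.trainable_params(), layer.frozen_params())
```

With these assumed settings, the low-rank factors hold 6,144 values against 589,824 in the frozen weight, roughly 1% of the layer. This is only meant to illustrate why tuning LoRA matrices is so much cheaper than full fine-tuning; the 1719 MB and 7.02 MB figures reported above come from the paper, not from this sketch.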

Paper Structure (16 sections, 10 equations, 7 figures, 8 tables)

This paper contains 16 sections, 10 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Comparison between existing event-based classification models and our newly proposed VELoRA. The horizontal axis represents training time, the vertical axis represents model accuracy, and the bubble size indicates the number of parameters that need to be tuned.
  • Figure 2: Comparison between existing LoRA strategies for (a) single-modality tuning and (b) multi-task tuning, and (c) our newly proposed VELoRA for effective RGB-Event fusion.
  • Figure 3: An overview of our proposed Low-Rank Adaptation Approach for Efficient Visible-Event Pattern Recognition, termed VELoRA. We introduce a novel fine-tuning approach that integrates modality-specific and shared components, enabling the model to preserve sensitivity to distinct modalities while also extracting shared information across them, which boosts performance on multimodal tasks. We designate the last block as the high-level Transformer block, with the remaining blocks functioning as low-level ones. For RGB and event inputs, we encode them using a pre-trained large vision model and introduce a reconstruction loss to enhance feature fusion and generalization. Additionally, we incorporate the differences between consecutive frames as an auxiliary modality in the modality-specific stage to aid RGB and event modalities. In the shared modality stage, we refine the original output with the combined tri-modal data. The refined features are then fed into the classification head for final categorization.
  • Figure 4: Visualization of the feature distributions of (a) ours and (b) full fine-tuning on PokerEvent.
  • Figure 5: Visualization of the RGB frame differences (left) and Event frame differences (right) on the HARDVS dataset.
  • ...and 2 more figures
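
The Figure 3 caption mentions two simple operations that can be sketched directly: taking differences between consecutive frames, and concatenating the per-modality features before the shared stage. The snippet below is a rough NumPy illustration under stated assumptions: the helper names `frame_differences` and `concat_tokens` are hypothetical, the difference is taken as a signed subtraction (the caption does not say whether an absolute value or any normalization is applied), and the token shapes are made up for the example.

```python
import numpy as np


def frame_differences(frames):
    """Signed differences between consecutive frames.

    frames: array of shape (T, H, W, C); returns shape (T - 1, H, W, C).
    """
    frames = np.asarray(frames, dtype=np.float32)
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames")
    return frames[1:] - frames[:-1]


def concat_tokens(*token_sets):
    """Concatenate per-modality token arrays of shape (N_i, D) along the token axis."""
    return np.concatenate(token_sets, axis=0)


# Toy clip: frame t is filled with the value t, so every difference equals 1.
clip = np.stack([np.full((2, 2, 3), t) for t in range(4)])
diffs = frame_differences(clip)

# Hypothetical token sets for RGB, event, and frame-difference features.
fused = concat_tokens(np.zeros((5, 8)), np.zeros((5, 8)), np.zeros((3, 8)))
print(diffs.shape, fused.shape)
```

In the described pipeline, the concatenated features would then pass through the high-level Transformer block with the modality-shared LoRA and on to the classification head; those stages are omitted here.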