EventDance++: Language-guided Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition

Xu Zheng, Lin Wang

TL;DR

This paper proposes EventDance++, a framework that tackles unsupervised source-free cross-modal (image-to-events) adaptation for event-based object recognition from a language-guided perspective. It introduces a language-guided reconstruction-based modality bridging (L-RMB) module, which reconstructs intensity frames from events in a self-supervised manner.

Abstract

In this paper, we address the challenging problem of cross-modal (image-to-events) adaptation for event-based recognition without accessing any labeled source image data. This task is arduous due to the substantial modality gap between images and events. With only a pre-trained source model available, the key challenge lies in extracting knowledge from this model and effectively transferring it to the event-based domain. Inspired by the natural ability of language to convey semantics across different modalities, we propose EventDance++, a novel framework that tackles this unsupervised source-free cross-modal adaptation problem from a language-guided perspective. We introduce a language-guided reconstruction-based modality bridging (L-RMB) module, which reconstructs intensity frames from events in a self-supervised manner. Importantly, it leverages a vision-language model to provide further supervision, enriching the surrogate images and enhancing modality bridging. This enables the creation of surrogate images from which knowledge (i.e., labels) can be extracted using the source model. On top of this, we propose a multi-representation knowledge adaptation (MKA) module to transfer knowledge to target models, utilizing multiple event representations to fully capture the spatiotemporal characteristics of events. The L-RMB and MKA modules are jointly optimized to bridge the modality gap. Experiments on three benchmark datasets demonstrate that EventDance++ performs on par with methods that utilize source data, validating the effectiveness of our language-guided approach in event-based recognition.
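
The abstract does not say which event representations the MKA module uses. As an illustration only, the sketch below builds two representations that are common in the event-vision literature from the same raw event stream: a two-channel event count image and a temporally interpolated voxel grid. The function names, the array layout (`x`, `y`, `t`, `p` as separate NumPy arrays), and the choice of representations are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def event_count_image(x, y, p, height, width):
    """Two-channel histogram: per-pixel counts of negative (ch 0) and positive (ch 1) events."""
    img = np.zeros((2, height, width), dtype=np.float32)
    chan = (p > 0).astype(np.int64)  # 1 for positive polarity, 0 otherwise
    # np.add.at accumulates correctly even when several events hit the same pixel.
    np.add.at(img, (chan, y, x), 1.0)
    return img


def voxel_grid(x, y, t, p, height, width, num_bins):
    """Signed events spread over num_bins temporal slices with linear interpolation in time."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if t.size == 0:
        return grid
    span = t.max() - t.min()
    if span > 0:
        # Map timestamps to the continuous range [0, num_bins - 1].
        t_norm = (t - t.min()) / span * (num_bins - 1)
    else:
        t_norm = np.zeros_like(t, dtype=np.float64)
    lo = np.floor(t_norm).astype(np.int64)
    hi = np.minimum(lo + 1, num_bins - 1)
    w_hi = (t_norm - lo).astype(np.float32)  # weight given to the later bin
    pol = np.where(p > 0, 1.0, -1.0).astype(np.float32)
    np.add.at(grid, (lo, y, x), pol * (1.0 - w_hi))
    np.add.at(grid, (hi, y, x), pol * w_hi)
    return grid


# Hypothetical toy stream: three events on a 2x2 sensor.
x = np.array([0, 1, 1])
y = np.array([0, 0, 1])
t = np.array([0.0, 0.5, 1.0])
p = np.array([1, -1, 1])

counts = event_count_image(x, y, p, height=2, width=2)
voxels = voxel_grid(x, y, t, p, height=2, width=2, num_bins=3)
```

The count image discards timing but keeps polarity statistics, while the voxel grid keeps coarse temporal ordering. Feeding a target model several such views of the same events is one plausible reading of "multiple event representations" in the abstract.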

Paper Structure (22 sections, 9 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Illustration of the challenging task of cross-modal adaptation from image to event modalities. We address it by introducing language-guided reconstruction-based modality bridging and multi-representation knowledge adaptation modules.
  • Figure 2: Cross-modal knowledge adaptation settings.
  • Figure 3: Overall framework of our proposed SFUDA for panoramic semantic segmentation.
  • Figure 4: (a) Example visualization of samples in the source (gray-scale image) and the surrogate (reconstructed) data in the image modality. (b) The reconstructed anchor data from the surrogate data across the knowledge adaptation.
  • Figure 5: Fine-tuning source model with both feature-wise and prediction-wise knowledge distillation.
  • ...and 4 more figures