EventDance++: Language-guided Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition

Xu Zheng; Lin Wang

TL;DR

This paper proposes EventDance++, a framework that tackles unsupervised source-free cross-modal (image-to-event) adaptation for event-based object recognition from a language-guided perspective. It introduces a language-guided reconstruction-based modality bridging (L-RMB) module, which reconstructs intensity frames from events in a self-supervised manner.

Abstract

In this paper, we address the challenging problem of cross-modal (image-to-event) adaptation for event-based recognition without access to any labeled source image data. The task is difficult because of the substantial modality gap between images and events. With only a pre-trained source model available, the key challenge lies in extracting knowledge from this model and effectively transferring it to the event-based domain. Inspired by the natural ability of language to convey semantics across modalities, we propose EventDance++, a novel framework that tackles this unsupervised source-free cross-modal adaptation problem from a language-guided perspective. We introduce a language-guided reconstruction-based modality bridging (L-RMB) module, which reconstructs intensity frames from events in a self-supervised manner. Importantly, it leverages a vision-language model to provide additional supervision, enriching the surrogate images and strengthening the modality bridging. The resulting surrogate images are used to extract knowledge (i.e., labels) from the source model. On top of this, we propose a multi-representation knowledge adaptation (MKA) module that transfers knowledge to target models, using multiple event representations to fully capture the spatiotemporal characteristics of events. The L-RMB and MKA modules are jointly optimized to bridge the modality gap. Experiments on three benchmark datasets demonstrate that EventDance++ performs on par with methods that utilize source data, validating the effectiveness of our language-guided approach in event-based recognition.
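
As a rough illustration of what "multiple event representations" can mean in practice, the sketch below converts one raw event stream into two commonly used tensor forms: a two-channel event count image and a temporally binned voxel grid. The abstract does not say which representations the MKA module uses, so both choices are assumptions, as are the `(x, y, t, polarity)` event layout and the hypothetical function names `event_count_image` and `voxel_grid`.

```python
import numpy as np

# Assumed event layout (not specified in the abstract): one row per event,
# columns are (x, y, t, polarity), with polarity > 0 meaning a brightness increase.

def event_count_image(events, height, width):
    """Two-channel histogram: per-pixel counts of negative and positive events."""
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    p = (events[:, 3] > 0).astype(np.int64)  # channel 1 = positive, 0 = negative
    img = np.zeros((2, height, width), dtype=np.float32)
    np.add.at(img, (p, y, x), 1.0)  # unbuffered add, so repeated pixels accumulate
    return img

def voxel_grid(events, height, width, num_bins):
    """Signed polarity accumulated into `num_bins` equal-width temporal bins."""
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    pol = np.where(events[:, 3] > 0, 1.0, -1.0).astype(np.float32)
    span = max(float(t.max() - t.min()), 1e-9)  # guard against a zero time span
    b = np.minimum(((t - t.min()) / span * num_bins).astype(np.int64), num_bins - 1)
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    np.add.at(grid, (b, y, x), pol)
    return grid

# Toy stream of four events on a 2x3 sensor.
events = np.array([
    [0, 0, 0.00, 1],
    [1, 0, 0.10, -1],
    [1, 0, 0.20, 1],
    [2, 1, 0.90, 1],
])
counts = event_count_image(events, height=2, width=3)
voxels = voxel_grid(events, height=2, width=3, num_bins=2)
```

The count image discards timing but keeps polarity statistics, while the voxel grid keeps coarse timing at the cost of letting opposite polarities cancel within a bin. That complementarity is one plausible reason to feed several representations to the target models.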

Paper Structure (22 sections, 9 equations, 9 figures, 9 tables)

Introduction
Related Work
Event-based Object Recognition
Cross-modal Knowledge Transfer
Source-free Unsupervised Domain Adaptation
The Proposed Framework
Problem Setup and Overview
Primary Objective
Language-guided Reconstruction-based Modality Bridging (L-RMB)
Self-supervised Pre-training
CLIP Feature Extraction
Fine-tuning Reconstruction & Source Model
Multi-representation Knowledge Adaptation
Experiments
Datasets and Implementation Details
...and 7 more sections

Figures (9)

Figure 1: Illustration of the challenging task of cross-modal adaptation from image to event modalities. We address it by introducing language-guided reconstruction-based modality bridging and multi-representation knowledge adaptation modules.
Figure 2: Cross-modal knowledge adaptation settings.
Figure 3: Overall framework of our proposed EventDance++ for unsupervised source-free cross-modal (image-to-event) adaptation.
Figure 4: (a) Example visualization of samples in the source (gray-scale image) and the surrogate (reconstructed) data in the image modality. (b) The reconstructed anchor data from the surrogate data across the knowledge adaptation.
Figure 5: Fine-tuning source model with both feature-wise and prediction-wise knowledge distillation.
...and 4 more figures
