Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition

Ilker Demirel, Karan Thakkar, Benjamin Elizalde, Miquel Espi Marques, Aditya Sarathy, Yang Bai, Umamahesh Srinivas, Jiajie Xu, Shirley Ren, Jaya Narain

TL;DR

The paper tackles the challenge of fusing heterogeneous time-series sensors for activity recognition in data-scarce contexts by leveraging large language models for late fusion. It presents a framework that uses per-modality predictions (audio captions/labels and IMU data) plus synthetic context as prompts to two LLMs, enabling zero-shot and one-shot closed-set classification on a curated Ego4D-derived dataset. Results show meaningful, above-chance performance without task-specific fine-tuning, with audio information typically being most informative and context augmentation improving results; open-ended evaluations also indicate potential for more flexible labeling. This approach offers deployment advantages by avoiding additional modality-alignment training and can extend to additional modalities, making it practical for privacy-conscious health and context-aware sensing scenarios.

Abstract

Sensor data streams provide valuable information about activities and context for downstream applications, though integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time series data. We curated a subset of data for diverse activity recognition across contexts (e.g., household activities, sports) from the Ego4D dataset. The evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores significantly above chance, with no task-specific training. Zero-shot classification via LLM-based fusion from modality-specific models can enable multimodal temporal applications where there is limited aligned training data for learning a shared embedding space. Additionally, LLM-based fusion can enable model deployment without requiring additional memory and computation for targeted application-specific multimodal models.
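To make the late-fusion idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each modality-specific model has already produced text outputs (an audio caption, audio labels, IMU predictions), combines them with optional context into one prompt, and maps the LLM's free-text answer back onto a closed label set. The function names, prompt wording, example class names, and the `llm` callable are all hypothetical; any text-in, text-out model could be passed in.

```python
from typing import Callable, Optional, Sequence


def build_fusion_prompt(
    audio_caption: str,
    audio_labels: Sequence[str],
    imu_labels: Sequence[str],
    classes: Sequence[str],
    context: Optional[str] = None,
) -> str:
    """Assemble per-modality text outputs into one closed-set prompt.

    The wording below is an assumption for illustration only.
    """
    lines = [
        "Outputs from separate sensor models for the same time window:",
        f"Audio caption: {audio_caption}",
        f"Audio labels: {', '.join(audio_labels)}",
        f"IMU predictions: {', '.join(imu_labels)}",
    ]
    if context:
        lines.append(f"Context: {context}")
    lines.append("Choose exactly one activity from: " + ", ".join(classes))
    lines.append("Answer with the activity name only.")
    return "\n".join(lines)


def parse_label(response: str, classes: Sequence[str]) -> Optional[str]:
    """Map a free-text answer onto the closed label set, or None if no match."""
    text = response.strip().lower()
    # Check longer names first so a short label cannot shadow a longer one.
    for name in sorted(classes, key=len, reverse=True):
        if name.lower() in text:
            return name
    return None


def classify(
    llm: Callable[[str], str],
    audio_caption: str,
    audio_labels: Sequence[str],
    imu_labels: Sequence[str],
    classes: Sequence[str],
    context: Optional[str] = None,
) -> Optional[str]:
    """Zero-shot late fusion: build the prompt, query the LLM, parse the label."""
    prompt = build_fusion_prompt(
        audio_caption, audio_labels, imu_labels, classes, context
    )
    return parse_label(llm(prompt), classes)
```

Because fusion happens over text, adding another modality in this sketch only means adding another line to the prompt; no shared embedding space has to be trained.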

Paper Structure

This paper contains 10 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Model Architecture for Prompt Creation.