EchoGuide: Active Acoustic Guidance for LLM-Based Eating Event Analysis from Egocentric Videos

Vineet Parikh; Saif Mahmud; Devansh Agarwal; Ke Li; François Guimbretière; Cheng Zhang

TL;DR

EchoGuide proposes a multimodal pipeline that combines low-power active acoustic sensing on eyeglasses with egocentric video, followed by captioning and retrieval-augmented QA, to efficiently record and analyze eating activities. By using ActSonic to guide clip selection and pairing ego-captioning with LLMs, the system produces compact activity records that retain semantic relevance to densely captured references. In a semi-in-the-wild study with 9 participants, EchoGuide reduced recording length by 68% on average while maintaining high alignment with dense captions, enabling effective QA over the activity records. This approach reduces data and energy costs for real-world eating-activity tracking and could be extended to other domains with constrained wearables and vision models.

Abstract

Self-recording eating behaviors is a step towards a healthy lifestyle recommended by many health professionals. However, the current practice of manually recording eating activities using paper records or smartphone apps is often unsustainable and inaccurate. Smart glasses have emerged as a promising wearable form factor for tracking eating behaviors, but existing systems primarily identify when eating occurs without capturing details of the eating activities (e.g., what is being eaten). In this paper, we present EchoGuide, an application and system pipeline that leverages low-power active acoustic sensing to guide head-mounted cameras to capture egocentric videos, enabling efficient and detailed analysis of eating activities. By combining active acoustic sensing for eating detection with video captioning models and large-scale language models for retrieval augmentation, EchoGuide intelligently clips and analyzes videos to create concise, relevant records of eating activities. We evaluated EchoGuide with 9 participants in naturalistic settings involving eating activities, demonstrating high-quality summarization and significant reductions in the amount of video data needed, paving the way for practical, scalable eating activity tracking.
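The idea of using acoustic eating detections to decide which parts of a video to keep, and then measuring how much recording is saved, can be illustrated with a minimal sketch. It is not the authors' implementation: the interval format, the padding parameter `pad_s`, the function names `merge_intervals` and `recording_reduction`, and the numbers in the usage note are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Assumed format: (start_s, end_s) in seconds for each detected eating event.
Interval = Tuple[float, float]


def merge_intervals(
    detections: List[Interval], pad_s: float, video_len_s: float
) -> List[Interval]:
    """Pad each detection, clamp it to the video bounds, and merge overlaps."""
    padded = sorted(
        (max(0.0, start - pad_s), min(video_len_s, end + pad_s))
        for start, end in detections
    )
    merged: List[Interval] = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous clip, so extend that clip.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def recording_reduction(clips: List[Interval], video_len_s: float) -> float:
    """Fraction of the dense recording that is not kept."""
    kept = sum(end - start for start, end in clips)
    return 1.0 - kept / video_len_s
```

As a usage example with made-up values, three detections at `(10, 20)`, `(22, 30)`, and `(100, 110)` seconds in a 200-second video, padded by 2 seconds, merge into two clips covering 38 seconds, so about 81% of the dense recording would be skipped. These numbers are illustrative and are unrelated to the 68% figure reported in the study.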

Paper Structure (22 sections, 3 figures, 2 tables)

Introduction
Related Work
The System Design of EchoGuide
Hardware Prototype
Glasses with active acoustic sensing
Head-mounted GoPro for egocentric video capture
Data Processing Pipeline
Using Active Acoustic Sensing to localize relevant actions and clip videos
Generating activity records from video and active acoustic sensing
Answering questions given activity records
User Study
Evaluating quality of eating activity summaries and responses with LLMs
Metrics
Answer alignment with dense captioning
Recording reduction compared to dense sampling
...and 7 more sections

Figures (3)

Figure 1: Glasses and GoPro Hardware setup for EchoGuide
Figure 2: EchoGuide vs Generic LLM Document Generation
Figure 3: Example showing field of view between GoPro vs Meta Ray-Bans
