EchoGuide: Active Acoustic Guidance for LLM-Based Eating Event Analysis from Egocentric Videos
Vineet Parikh, Saif Mahmud, Devansh Agarwal, Ke Li, François Guimbretière, Cheng Zhang
TL;DR
EchoGuide is a multimodal pipeline that combines low-power active acoustic sensing on eyeglasses with egocentric video, followed by captioning and retrieval-augmented QA, to efficiently record and analyze eating activities. By using ActSonic to guide clip selection and pairing egocentric video captioning with LLMs, the system produces compact activity records that remain semantically close to densely captured references. In a semi-in-the-wild study with 9 participants, EchoGuide achieved an average 68% reduction in recording length while maintaining high alignment with dense captions, enabling effective QA over the activity records. This approach reduces data and energy costs for real-world eating-activity tracking and can be extended to other domains with constrained wearables and vision models.
Abstract
Self-recording eating behaviors is a step towards a healthy lifestyle recommended by many health professionals. However, the current practice of manually recording eating activities using paper records or smartphone apps is often unsustainable and inaccurate. Smart glasses have emerged as a promising wearable form factor for tracking eating behaviors, but existing systems primarily identify when eating occurs without capturing details of the eating activities (e.g., what is being eaten). In this paper, we present EchoGuide, an application and system pipeline that leverages low-power active acoustic sensing to guide head-mounted cameras to capture egocentric videos, enabling efficient and detailed analysis of eating activities. By combining active acoustic sensing for eating detection with video captioning models and large language models for retrieval augmentation, EchoGuide intelligently clips and analyzes videos to create concise, relevant records of eating activity. We evaluated EchoGuide with 9 participants in naturalistic settings involving eating activities, demonstrating high-quality summarization and substantial reductions in the amount of video data required, paving the way for practical, scalable eating activity tracking.
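The idea of letting acoustic detections decide which parts of a session are recorded can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `plan_clips` and `recording_reduction`, the padding and gap parameters, and the example timestamps are all hypothetical, chosen only to show how detected eating intervals might be turned into video clips and how a reduction in recording length could be computed.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s) in seconds


def plan_clips(detections: List[Interval], total_s: float,
               pad_s: float = 2.0, max_gap_s: float = 5.0) -> List[Interval]:
    """Turn detected eating intervals into video clips to record.

    pad_s and max_gap_s are illustrative assumptions, not values
    taken from the paper.
    """
    clips: List[Interval] = []
    for start, end in sorted(detections):
        # Pad each detection and clamp it to the session bounds.
        start = max(0.0, start - pad_s)
        end = min(total_s, end + pad_s)
        # Merge with the previous clip if the gap is small enough.
        if clips and start - clips[-1][1] <= max_gap_s:
            clips[-1] = (clips[-1][0], max(clips[-1][1], end))
        else:
            clips.append((start, end))
    return clips


def recording_reduction(clips: List[Interval], total_s: float) -> float:
    """Fraction of the session that does not need to be recorded."""
    kept_s = sum(end - start for start, end in clips)
    return 1.0 - kept_s / total_s


# Made-up detections for a 100-second session (not study data).
detections = [(10.0, 14.0), (16.0, 20.0), (60.0, 65.0)]
clips = plan_clips(detections, total_s=100.0)
reduction = recording_reduction(clips, total_s=100.0)
```

In this toy session the first two detections merge into one clip, and only the resulting clips would be passed on to captioning and retrieval-augmented QA.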
