Understanding Image2Video Domain Shift in Food Segmentation: An Instance-level Analysis on Apples

Keonvin Park, Aditya Pal, Jin Hong Mok

TL;DR

This study tackles the mismatch between strong image-based food segmentation performance and unreliable video deployment. By reframing video food segmentation as an instance segmentation and tracking problem, it demonstrates that high per-frame accuracy does not guarantee stable instance identities over time, leading to significant counting errors for apples. The authors show that temporal variation in appearance causes mask flickering and identity fragmentation, which conventional frame-wise metrics fail to expose. They also test lightweight remedies such as post-hoc temporal smoothing and self-supervised temporal consistency, finding partial improvements but not a full solution, and argue that the root cause lies in image-centric training objectives rather than model capacity. The work highlights the need for temporally-aware training and evaluation protocols for reliable video-based food analysis.

Abstract

Food segmentation models trained on static images have achieved strong performance on benchmark datasets; however, their reliability in video settings remains poorly understood. In real-world applications such as food monitoring and instance counting, segmentation outputs must be temporally consistent, yet image-trained models often break down when deployed on videos. In this work, we analyze this failure through an instance segmentation and tracking perspective, focusing on apples as a representative food category. Models are trained solely on image-level food segmentation data and evaluated on video sequences using an instance segmentation with tracking-by-matching framework, enabling object-level temporal analysis. Our results reveal that high frame-wise segmentation accuracy does not translate to stable instance identities over time. Temporal appearance variations, particularly illumination changes, specular reflections, and texture ambiguity, lead to mask flickering and identity fragmentation, resulting in significant errors in apple counting. These failures are largely overlooked by conventional image-based metrics, which substantially overestimate real-world video performance. Beyond diagnosing the problem, we examine practical remedies that do not require full video supervision, including post-hoc temporal regularization and self-supervised temporal consistency objectives. Our findings suggest that the root cause of failure lies in image-centric training objectives that ignore temporal coherence, rather than model capacity. This study highlights a critical evaluation gap in food segmentation research and motivates temporally-aware learning and evaluation protocols for video-based food analysis.
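To make the counting failure concrete, the following is a minimal sketch of a tracking-by-matching counter, not the authors' implementation. It assumes axis-aligned bounding boxes stand in for instance masks, greedy IoU matching between consecutive frames, and a hypothetical `max_gap` parameter that acts as a very simple form of post-hoc temporal regularization by keeping a track alive across missed detections. All names (`iou`, `count_tracks`, `iou_thr`, `max_gap`) are illustrative.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); stand-in for a mask


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def count_tracks(frames: List[List[Box]],
                 iou_thr: float = 0.5,
                 max_gap: int = 0) -> int:
    """Greedy tracking-by-matching; returns the number of distinct track IDs.

    max_gap is the number of consecutive missed frames a track may survive
    (assumed parameter; 0 means a single dropped mask ends the track).
    """
    tracks: Dict[int, Tuple[Box, int]] = {}  # id -> (last box, last frame seen)
    next_id = 0
    for t, detections in enumerate(frames):
        # Retire tracks that have been unmatched for more than max_gap frames.
        tracks = {i: v for i, v in tracks.items() if t - v[1] <= max_gap + 1}
        # Score every (track, detection) pair and match greedily, best first.
        pairs = sorted(
            ((iou(box, det), tid, d)
             for tid, (box, _) in tracks.items()
             for d, det in enumerate(detections)),
            reverse=True,
        )
        used_tracks, used_dets = set(), set()
        for score, tid, d in pairs:
            if score < iou_thr:
                break
            if tid in used_tracks or d in used_dets:
                continue
            tracks[tid] = (detections[d], t)
            used_tracks.add(tid)
            used_dets.add(d)
        # Any unmatched detection starts a new identity.
        for d, det in enumerate(detections):
            if d not in used_dets:
                tracks[next_id] = (det, t)
                next_id += 1
    return next_id


# Two stationary apples; apple A's mask flickers off in frame 2.
apple_a, apple_b = (0, 0, 10, 10), (20, 0, 30, 10)
video = [[apple_a, apple_b], [apple_a, apple_b], [apple_b],
         [apple_a, apple_b], [apple_a, apple_b]]

print(count_tracks(video, max_gap=0))  # 3: apple A is counted twice
print(count_tracks(video, max_gap=1))  # 2: the one-frame gap is bridged
```

In this toy sequence four of the five frames are segmented perfectly, so a frame-wise metric would look strong, yet a single dropped mask fragments one identity and inflates the count from 2 to 3. Tolerating a one-frame gap recovers the correct count here, but such a rule only bridges short dropouts. A globally optimal assignment (for example SciPy's `linear_sum_assignment`) could replace the greedy matching step without changing this picture.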

Paper Structure (31 sections, 1 equation, 4 tables)