XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
Ho Kei Cheng, Alexander G. Schwing
TL;DR
XMem targets long-term video object segmentation (VOS) with a three-store feature memory inspired by the Atkinson-Shiffrin model: a rapidly updated sensory memory, a high-resolution working memory, and a compact long-term memory of prototypes. An anisotropic L2 memory reading mechanism retrieves relevant features from the combined memory stores, while prototype-based consolidation and memory potentiation move frequently used working-memory elements into the long-term store, keeping it compact with little loss of accuracy. XMem exceeds prior state-of-the-art results on long-video datasets and is on par with state-of-the-art methods on short-video datasets, while keeping GPU memory bounded on videos with thousands of frames. This bounded memory footprint may also make it easier to deploy on memory-constrained hardware.
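To make the reading step concrete, here is a minimal Python sketch of an anisotropic L2 similarity followed by a softmax-weighted readout. It assumes the similarity takes the form S_i = -s_i * sum_c e_c (k_ic - q_c)^2, where s_i is a per-memory-element shrinkage term and e_c is a per-channel selection term for the query. The function names, the list-based data layout, and the omission of any scaling or top-k filtering are simplifications chosen for illustration, not the authors' implementation.

```python
import math


def anisotropic_l2_similarity(mem_keys, shrinkage, query, selection):
    """Similarity of every memory key to a single query (higher is closer).

    mem_keys:  N keys, each a list of C floats
    shrinkage: N positive floats, one per memory element (assumed form)
    query:     C floats
    selection: C floats in [0, 1], one per channel of this query (assumed form)
    """
    sims = []
    for key, s in zip(mem_keys, shrinkage):
        # Channel-weighted squared L2 distance between key and query.
        dist = sum(e * (k - q) ** 2 for k, q, e in zip(key, query, selection))
        sims.append(-s * dist)
    return sims


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]


def read_memory(mem_keys, mem_values, shrinkage, query, selection):
    """Return the affinity-weighted average of memory values and the affinities."""
    weights = softmax(
        anisotropic_l2_similarity(mem_keys, shrinkage, query, selection)
    )
    channels = len(mem_values[0])
    readout = [
        sum(w * v[c] for w, v in zip(weights, mem_values)) for c in range(channels)
    ]
    return readout, weights


if __name__ == "__main__":
    keys = [[0.0, 0.0], [1.0, 0.0]]
    values = [[1.0], [3.0]]
    readout, weights = read_memory(
        keys, values, shrinkage=[1.0, 1.0], query=[0.0, 0.0], selection=[1.0, 1.0]
    )
    print(readout, weights)
```

In this sketch, a larger shrinkage value sharpens the affinity around a given memory element, while a selection value near zero makes the query ignore that channel when comparing keys.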
Abstract
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically only uses one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (that do not work on long videos) on short-video datasets. Code is available at https://hkchengrex.github.io/XMem
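The consolidation idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' algorithm: it assumes each working-memory element carries a scalar usage score, keeps the most-used elements as prototypes, and builds each prototype's value as a softmax-weighted average of all candidate values, using a plain negative squared L2 distance between keys. The function name `consolidate`, the usage scores, and the list-based layout are hypothetical.

```python
import math


def consolidate(keys, values, usage, num_prototypes):
    """Compress candidate memory elements into a few (key, value) prototypes.

    keys:   N candidate keys, each a list of floats
    values: N candidate values, each a list of floats
    usage:  N usage scores (hypothetical: how often each element was read)
    """
    # 1) Prototype selection: keep the most frequently used candidates.
    order = sorted(range(len(keys)), key=lambda i: usage[i], reverse=True)
    proto_idx = order[:num_prototypes]

    proto_keys, proto_values = [], []
    for p in proto_idx:
        # 2) Potentiation (simplified): weight every candidate by how close
        #    its key is to the prototype key, then average the values.
        sims = [-sum((a - b) ** 2 for a, b in zip(k, keys[p])) for k in keys]
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        value = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        proto_keys.append(list(keys[p]))
        proto_values.append(value)
    return proto_keys, proto_values


if __name__ == "__main__":
    keys = [[0.0], [0.1], [10.0], [10.1]]
    values = [[1.0], [1.0], [5.0], [5.0]]
    usage = [0.9, 0.1, 0.8, 0.2]
    print(consolidate(keys, values, usage, num_prototypes=2))
```

Because the number of prototypes is fixed per consolidation, the long-term store in this sketch grows by a constant amount each time, regardless of how many working-memory elements were compressed.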
