XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model

Ho Kei Cheng, Alexander G. Schwing

TL;DR

XMem addresses the challenge of long-term video object segmentation by introducing a tri-store memory architecture inspired by the Atkinson–Shiffrin model: a fast sensory memory, a high-resolution working memory, and a compact long-term memory of prototypes. A novel anisotropic L2 memory reading mechanism retrieves relevant features from the joined memory stores, while memory potentiation and prototype-based consolidation keep the long-term store compact without sacrificing accuracy. The approach yields state-of-the-art results on long-video benchmarks and remains competitive on short-video datasets, all with bounded GPU memory that scales gracefully to thousands of frames. Overall, XMem provides a scalable, memory-efficient framework for robust VOS across both long and short videos, with practical implications for deployment on resource-constrained devices.

Abstract

We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically uses only one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact and thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (that do not work on long videos) on short-video datasets. Code is available at https://hkchengrex.github.io/XMem
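
As a rough illustration of how the three stores might interact, the toy Python sketch below tracks bookkeeping only (frame indices, not features). The class name `TriStoreMemory`, its parameters (`r`, `working_cap`, `num_prototypes`, `long_term_cap`), and the simple usage-count ranking are hypothetical choices made for this example, not the authors' implementation. The schedule (sensory memory refreshed every frame, working memory every r-th frame, consolidation when the working memory is full) follows the Figure 2 caption.

```python
class TriStoreMemory:
    """Toy bookkeeping for three memory stores; entries are frame indices."""

    def __init__(self, r=5, working_cap=4, num_prototypes=2, long_term_cap=6):
        self.r = r                            # working memory update interval
        self.working_cap = working_cap        # assumed working memory size limit
        self.num_prototypes = num_prototypes  # entries kept per consolidation
        self.long_term_cap = long_term_cap    # assumed long-term size limit
        self.sensory = None                   # overwritten on every frame
        self.working = []
        self.long_term = []
        self.usage = {}                       # frame index -> times it was read

    def mark_used(self, frame):
        # Record that a stored frame contributed to a memory read.
        if frame in self.usage:
            self.usage[frame] += 1

    def step(self, t):
        self.sensory = t                      # sensory memory: every frame
        if t % self.r == 0:                   # working memory: every r-th frame
            self.working.append(t)
            self.usage[t] = 0
        if len(self.working) > self.working_cap:
            self._consolidate()

    def _consolidate(self):
        # Keep the newest working entry; the rest become candidates.
        candidates, self.working = self.working[:-1], self.working[-1:]
        ranked = sorted(candidates, key=lambda f: self.usage[f], reverse=True)
        self.long_term.extend(ranked[: self.num_prototypes])
        for f in ranked[self.num_prototypes:]:
            del self.usage[f]
        # Forget the least-used long-term entries once over capacity.
        if len(self.long_term) > self.long_term_cap:
            self.long_term.sort(key=lambda f: self.usage[f], reverse=True)
            for f in self.long_term[self.long_term_cap:]:
                del self.usage[f]
            del self.long_term[self.long_term_cap:]


mem = TriStoreMemory()
for t in range(1000):
    mem.step(t)
print(len(mem.working), len(mem.long_term))
```

Under these assumptions both stores stay within their caps however many frames are processed, which is the bounded-memory property the abstract attributes to consolidation.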

Paper Structure (30 sections, 5 equations, 14 figures, 19 tables)

Figures (14)

  • Figure 1: Do state-of-the-art VOS algorithms scale well? Left: Memory scaling with respect to short-term segmentation quality. Right: Segmentation quality scaling from standard short videos (y-axis) to long videos (x-axis) -- the dashed line indicates a 1:1 performance ratio. Error bars show standard deviations in memory sampling if applicable. See Section \ref{sec:expr-long-vid} for details.
  • Figure 2: Overview of XMem. The memory reading operation extracts relevant features from all three memory stores and uses those features to produce a mask. To incorporate new memory, the sensory memory is updated every frame while the working memory is only updated every $r$-th frame. The working memory is consolidated into the long-term memory in a compact form when it is full, and the long-term memory will forget obsolete features over time.
  • Figure 3: Process of memory reading and mask decoding of a single query frame. We extract query $\mathbf{q}$ from the image and perform attention-based memory reading from the working/long-term memory to obtain features $F$. Together with the sensory memory, it is fed into the decoder to generate a mask. For every $r$-th frame, we store new features into the working memory and perform a deep update to the sensory memory.
  • Figure 4: Visualization of similarity functions in 2D with the background color showing the influence of each memory element (RGB). L2 similarity (a) \cite{cheng2021stcn} considers all memory elements uniformly. The shrinkage term (b) allows encoding element-level confidence (visualized by the size of dots) that accounts for the area of influence and sharpness of the mixing weights. The selection term allows query-specific interpretation of the memory -- (c) and (d) show its effect with two different queries that focus on the vertical and horizontal dimension respectively. (b) can be seen as a case where the selection term is isotropic. When combined, we can model more complex similarity relations.
  • Figure 5: Memory consolidation procedure. Given an image, we extract features as memory keys (image stride exaggerated). We visualize these features with colors. For memory consolidation, we first select prototype keys (stars) from the candidates (all grids). Then, we invoke potentiation which non-locally aggregates values from all the candidates to generate more representative prototype values (golden outline). The resultant prototype keys and values are added to the long-term memory. Only one frame is shown here -- in practice multiple frames are used in a single consolidation.
  • ...and 9 more figures
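
The similarity, reading, and consolidation steps described in the captions of Figures 3 to 5 can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`anisotropic_l2`, `read_memory`, `potentiate`) are hypothetical, the similarity is taken to be a negative squared L2 distance scaled per memory element by a shrinkage term and per query channel by a selection term, and potentiation is approximated as a softmax-weighted aggregation of candidate values with each prototype key acting as the query.

```python
import math


def anisotropic_l2(keys, shrinkage, query, selection):
    """Similarity between one query and every memory key.

    keys:      N memory keys, each a list of C numbers
    shrinkage: N positive scalars, one per memory element (confidence)
    query:     one query key, a list of C numbers
    selection: C non-negative weights chosen per query (channel importance)
    """
    sims = []
    for key, s in zip(keys, shrinkage):
        weighted_sq_dist = sum(
            e * (k - q) ** 2 for k, q, e in zip(key, query, selection)
        )
        sims.append(-s * weighted_sq_dist)
    return sims


def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]


def read_memory(keys, shrinkage, values, query, selection):
    """Attention-style readout: mix memory values by normalized similarity."""
    weights = softmax(anisotropic_l2(keys, shrinkage, query, selection))
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def potentiate(cand_keys, cand_shrinkage, cand_values, proto_keys, proto_selection):
    """Assumed potentiation: each prototype value aggregates all candidate
    values, weighted by similarity between its key and the candidate keys."""
    return [
        read_memory(cand_keys, cand_shrinkage, cand_values, pk, ps)
        for pk, ps in zip(proto_keys, proto_selection)
    ]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
query = [0.0, 0.0]
# A selection of [1, 0] makes only the first channel matter for this query.
print(read_memory(keys, [1.0, 1.0], values, query, [1.0, 0.0]))
```

Setting every shrinkage and selection entry to 1 recovers plain negative squared L2 similarity, which corresponds to panel (a) of Figure 4. Raising a memory element's shrinkage steepens how fast its similarity falls off with distance, and a non-uniform selection vector reproduces the query-specific behavior of panels (c) and (d).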