RMem: Restricted Memory Banks Improve Video Object Segmentation

Junbao Zhou, Ziqi Pang, Yu-Xiong Wang

TL;DR

RMem challenges the standard practice of expanding memory banks in video object segmentation by uncovering a memory deciphering effect: larger memory banks introduce redundant information that hinders decoding. It remedies this with a simple, plug-and-play approach that restricts memory to a fixed number of frames, paired with a UCB-inspired memory update and a temporal positional embedding to improve temporal reasoning. Across challenging benchmarks like VOST and LVOS, RMem achieves state-of-the-art performance, while maintaining efficiency on shorter datasets, and demonstrates that training-inference memory alignment coupled with explicit temporal encoding enhances long-video understanding. The work offers practical, scalable improvements for memory-based VOS and suggests broader implications for temporal reasoning in long videos and other memory-reliant vision systems.
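
A minimal sketch of a fixed-capacity memory bank with a UCB-inspired eviction rule may make the idea concrete. Everything here is an assumption for illustration, not the authors' implementation: the class and method names (`RestrictedMemoryBank`, `record_attention`), the `explore` weight, the use of per-frame attention mass as the relevance signal, and the choice to always keep the first (reference) frame. The score follows the generic UCB form, mean reward plus an exploration bonus that shrinks the longer an entry has been in memory, so newer frames get a freshness bonus.

```python
import math
from dataclasses import dataclass


@dataclass(eq=False)  # compare entries by identity, features may be arrays
class MemoryEntry:
    frame_idx: int
    feature: object
    relevance: float = 0.0  # running mean of attention mass received
    visits: int = 0         # number of queries this entry has served


class RestrictedMemoryBank:
    """Hypothetical bounded memory bank with UCB-style eviction."""

    def __init__(self, capacity: int = 4, explore: float = 1.0):
        if capacity < 2:
            raise ValueError("need room for the reference frame plus one more")
        self.capacity = capacity
        self.explore = explore  # assumed weight of the freshness bonus
        self.entries: list[MemoryEntry] = []
        self.step = 0

    def record_attention(self, attention: dict[int, float]) -> None:
        """Update each entry's running-mean relevance after one query."""
        self.step += 1
        for entry in self.entries:
            entry.visits += 1
            mass = attention.get(entry.frame_idx, 0.0)
            entry.relevance += (mass - entry.relevance) / entry.visits

    def _score(self, entry: MemoryEntry) -> float:
        if entry.visits == 0:
            return math.inf  # never evict a frame before it has been used
        bonus = self.explore * math.sqrt(math.log(self.step + 1) / entry.visits)
        return entry.relevance + bonus

    def add(self, frame_idx: int, feature: object) -> None:
        """Insert a frame, evicting the lowest-scoring non-reference entry."""
        if len(self.entries) >= self.capacity:
            victim = min(self.entries[1:], key=self._score)
            self.entries = [e for e in self.entries if e is not victim]
        self.entries.append(MemoryEntry(frame_idx, feature))

    def frame_indices(self) -> list[int]:
        return [e.frame_idx for e in self.entries]
```

In use, a caller would invoke `record_attention` after each segmented frame with that frame's attention mass per memory entry, then call `add` for the frames it wants to store. The bank never grows past `capacity`, which is the property the restricted-memory argument relies on.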

Abstract

With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios, we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory banks to accommodate extensive historical information. Our specially designed "memory deciphering" study offers a pivotal insight underpinning such a strategy: expanding memory banks, while seemingly beneficial, actually increases the difficulty for VOS modules to decode relevant features due to the confusion from redundant information. By restricting memory banks to a limited number of essential frames, we achieve a notable improvement in VOS accuracy. This process balances the importance and freshness of frames to maintain an informative memory bank within a bounded capacity. Additionally, restricted memory banks reduce the training-inference discrepancy in memory lengths compared with continuous expansion. This fosters new opportunities in temporal reasoning and enables us to introduce the previously overlooked "temporal positional embedding." Finally, our insights are embodied in "RMem" ("R" for restricted), a simple yet effective VOS modification that excels at challenging VOS scenarios and establishes new state of the art for object state changes (on the VOST dataset) and long videos (on the Long Videos dataset). Our code and demo are available at https://restricted-memory.github.io/.
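
One way to picture the temporal positional embedding is as a per-slot vector added to each memory frame's features, so that attention can tell older entries from newer ones. The sketch below is an illustration under stated assumptions: it uses a fixed sinusoidal encoding as a stand-in (the paper's embedding may be learned or defined differently), indexes positions by slot order in the bounded bank, and represents features as plain lists of floats. The function names are hypothetical.

```python
import math


def sinusoidal_embedding(position: int, dim: int) -> list[float]:
    """Standard sine/cosine encoding of an integer position."""
    emb = []
    for i in range(dim):
        freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
        angle = position * freq
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb


def add_temporal_embedding(memory: list[list[float]]) -> list[list[float]]:
    """Add a slot-order embedding to each frame feature.

    `memory` is ordered oldest to newest. Because the bank is bounded,
    the slot index seen at inference stays within the range seen in training.
    """
    if not memory:
        return []
    dim = len(memory[0])
    return [
        [x + p for x, p in zip(feature, sinusoidal_embedding(slot, dim))]
        for slot, feature in enumerate(memory)
    ]
```

The design point this illustrates is the one the abstract makes: with an ever-growing memory bank, inference would see position indices far beyond anything used in training, whereas a restricted bank keeps the two ranges aligned.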

Paper Structure (28 sections, 10 equations, 6 figures, 11 tables)

This paper contains 28 sections, 10 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: In light of challenging object state changes [tokmakov2023breaking, xue2024learning, yu2023video], we rethink the conventional VOS approach of continuously accumulating the features into memory banks: despite capturing all the information, it complicates the deciphering of relevant features. Conversely, restricted memory banks significantly enhance VOS.
  • Figure 2: Sketch of Pilot Study. Our memory deciphering analysis emulates decoding the mask on frame 0 from the memory bank features to quantify the impact of a growing memory bank on VOS modules, where the "desired results" in the figure are the ground truth. For a video shown in Block (a), we visualize its decoding results in Block (b): the masks degrade both quantitatively (yellow curve) and qualitatively, deviating from the desired results. However, selecting a concise set of frames mitigates this issue (blue curve in Block (b)). Therefore, we conjecture that the drawback of a growing memory bank lies in confusing the attention of VOS modules. In Block (c), we use red lines to indicate highly weighted associations in attention, with thickness denoting the attention score values. As illustrated, the query $F_0$ focuses less on its most relevant frame after the memory bank expands, with the attention score dropping from 0.247 to 0.056. (The 2$^{\text{nd}}$ row shows ground-truth masks $\widetilde{S}_t$ as the reference. $\mathcal{J}_{\mathrm{mean}}$ is the average Jaccard between $S_0^t$ and $\widetilde{S}_0$ over all videos.)
  • Figure 3: RMem Overview. (a) RMem revisits restricting memory banks to enhance VOS (Sec. \ref{sec:restrict_memory}), motivated by the insight from our pilot study. (b) To maintain an informative memory bank, we balance both the relevance and freshness of frames when updating the latest features (Sec. \ref{sec:mem_update}). (c) Benefiting from smaller memory size gaps between training and inference, we introduce previously overlooked temporal positional embedding to encode the orders of frames explicitly (Sec. \ref{sec:mem_temporal}), which enhances spatio-temporal reasoning.
  • Figure 4: Impact of memory bank size on VOS, tested on VOST. With more frames in the restricted memory, the accuracy first increases and then decreases until it approximates unrestricted memory. This supports the limited deciphering capability of VOS modules and our insight into restricting memory banks.
  • Figure 5: (Best viewed zoomed in and in color.) Qualitative VOS results for object state changes on VOST [tokmakov2023breaking]. We provide two examples showing the challenges of object state changes, including slicing, occlusions, distraction from similar objects (other tomatoes), and shape changes. For both scenarios, using RMem shows advantages in robustly maintaining the masks of the target objects, as highlighted. (White pixels denote regions that VOST annotates as "ignored" for evaluation; these are hard and ambiguous even for human annotators.)
  • ...and 1 more figure
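
The attention-dilution effect described in the Figure 2 caption can be reproduced with a toy softmax. This is only a sketch under simplifying assumptions: a single query, one relevant key with a fixed logit, and a growing number of redundant keys that all share a slightly lower logit. The logit values and the function name `relevant_attention` are hypothetical and are not taken from the paper's experiments.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relevant_attention(
    n_redundant: int,
    relevant_logit: float = 2.0,
    redundant_logit: float = 1.0,
) -> float:
    """Attention weight on the relevant key as redundant keys are added."""
    logits = [relevant_logit] + [redundant_logit] * n_redundant
    return softmax(logits)[0]


if __name__ == "__main__":
    # The weight on the relevant key shrinks as the memory bank grows,
    # even though its own similarity to the query never changes.
    for n in (1, 4, 16, 64):
        print(n, round(relevant_attention(n), 3))
```

Because softmax normalizes over every key in memory, each additional redundant frame takes a share of the attention mass, which is consistent with the caption's conjecture that a growing bank confuses the attention of VOS modules.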