Explicit Memory through Online 3D Gaussian Splatting Improves Class-Agnostic Video Segmentation

Anthony Opipari, Aravindhan K Krishnan, Shreekant Gayaka, Min Sun, Cheng-Hao Kuo, Arnie Sen, Odest Chadwicke Jenkins

TL;DR

This paper addresses the instability of class-agnostic video segmentation by introducing an explicit 3D memory via online 3D Gaussian Splatting (3DGS). It presents two memory-augmented baselines, FastSAM-Splat and SAM2-Splat, that fuse or re-prompt with past segment memories to improve accuracy and temporal consistency. Through real-world (ScanNet-MV) and simulated (MVPd) benchmarks, the approach yields notable gains in VSQ and STQ over memoryless or purely implicit-memory baselines, with ablations clarifying the impact of segment-ID representations and re-prompting strategies. The results demonstrate the practical value of explicit spatial memory for open-world robotic perception, while identifying areas for efficiency and global optimization improvements.

Abstract

Remembering where object segments were predicted in the past is useful for improving the accuracy and consistency of class-agnostic video segmentation algorithms. Existing video segmentation algorithms typically use either no object-level memory (e.g. FastSAM) or they use implicit memories in the form of recurrent neural network features (e.g. SAM2). In this paper, we augment both types of segmentation models using an explicit 3D memory and show that the resulting models have more accurate and consistent predictions. For this, we develop an online 3D Gaussian Splatting (3DGS) technique to store predicted object-level segments generated throughout the duration of a video. Based on this 3DGS representation, a set of fusion techniques are developed, named FastSAM-Splat and SAM2-Splat, that use the explicit 3DGS memory to improve their respective foundation models' predictions. Ablation experiments are used to validate the proposed techniques' design and hyperparameter settings. Results from both real-world and simulated benchmarking experiments show that models which use explicit 3D memories result in more accurate and consistent predictions than those which use no memory or only implicit neural network memories. Project Page: https://topipari.com/projects/FastSAM-Splat/
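To make the idea of reusing remembered segments concrete, here is a minimal sketch of one way a per-frame prediction could be reconciled with a segment-ID image rendered from an explicit memory. It is not the paper's FastSAM-Splat or SAM2-Splat fusion rule, which this summary does not specify. The function name `assign_ids_from_memory`, the convention that ID 0 means "no remembered segment", and the IoU threshold of 0.5 are all assumptions made for illustration.

```python
import numpy as np

def assign_ids_from_memory(pred_masks, memory_ids, next_id, iou_thresh=0.5):
    """Give each predicted mask the ID of its best-overlapping memory segment.

    pred_masks: list of boolean (H, W) arrays from an image segmenter.
    memory_ids: integer (H, W) array rendered from memory; 0 = nothing stored
                (an assumed convention, not taken from the paper).
    next_id:    first unused segment ID, used for unmatched masks.
    """
    assigned = []
    for mask in pred_masks:
        best_id, best_iou = 0, 0.0
        # Only memory segments visible under this mask can match it.
        for seg_id in np.unique(memory_ids[mask]):
            if seg_id == 0:
                continue
            mem_mask = memory_ids == seg_id
            inter = np.logical_and(mask, mem_mask).sum()
            union = np.logical_or(mask, mem_mask).sum()
            iou = inter / union
            if iou > best_iou:
                best_id, best_iou = int(seg_id), iou
        if best_iou >= iou_thresh:
            assigned.append(best_id)   # reuse the remembered ID
        else:
            assigned.append(next_id)   # treat as a newly seen segment
            next_id += 1
    return assigned, next_id
```

Matching by overlap keeps a segment's ID stable across frames whenever the memory renders it in roughly the same place, which is the kind of temporal consistency the abstract describes; a real system would also need to write the matched segments back into the 3D memory.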

Paper Structure

This paper contains 13 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of the FastSAM-Splat model. 'Predicted image segments' refers to the image-level segments output by the foundation image segmentation model (FastSAM).
  • Figure 2: Illustration of the SAM2-Splat model. 'Splat-segment' is a segment stored and rendered by the Gaussian splat. 'Predicted image segments' are the outputs of the FastSAM/SAM2 models.
  • Figure 3: Illustration of the SAM2-Splat re-prompting strategy.
  • Figure 4: Qualitative comparison of the FastSAM and FastSAM-Splat models, showing that FastSAM-Splat has fewer segment inconsistencies. False-negative inconsistencies are outlined in yellow.
  • Figure 5: Comparison of SAM2 and SAM2-Splat on a sequence from ScanNet-MV, showing that the re-prompting mechanism reduces false-negative flickering. False negatives are outlined in yellow.