Loci-Segmented: Improving Scene Segmentation Learning

Manuel Traub, Frederic Becker, Adrian Sauter, Sebastian Otte, Martin V. Butz

TL;DR

The paper addresses scene segmentation learning without requiring predefined backgrounds or ground-truth slot initializations. It introduces Loci-Segmented (Loci-s), a slot-based video segmentation framework with a dedicated Background Module, Scene-Relative-Depth input, and a cascaded, top-down-aware encoder–decoder whose per-slot encodings comprise Gestalt and Position codes. On MOVi and related benchmarks, Loci-s achieves a $13.59\%$ relative improvement in IoU over SAVi++ on MOVi-E and demonstrates strong generalization across a compositional scene suite, while providing interpretable latent encodings that disentangle mask, depth, and texture. The work highlights the practical value of depth cues and segmentation preprocessing for unsupervised object discovery and suggests potential as a foundation module for downstream tasks in vision-based reasoning and planning.
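
For readers unfamiliar with the term, a relative improvement is usually read as the gain divided by the baseline score. The expression below is a sketch under that usual definition; the symbols $\mathrm{IoU}_{\text{Loci-s}}$ and $\mathrm{IoU}_{\text{SAVi++}}$ are introduced here for illustration only and stand for the respective IoU scores on MOVi-E, whose absolute values are not given in this summary.

$$
\text{relative improvement} = \frac{\mathrm{IoU}_{\text{Loci-s}} - \mathrm{IoU}_{\text{SAVi++}}}{\mathrm{IoU}_{\text{SAVi++}}} \times 100\%
$$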

Abstract

Current slot-oriented approaches for compositional scene segmentation from images and videos rely on provided background information or slot assignments. We present a segmented location and identity tracking system, Loci-Segmented (Loci-s), which requires neither. It learns to dynamically segment scenes into interpretable background and slot-based object encodings, separating RGB, mask, location, and depth information for each. The results reveal largely superior video decomposition performance on the MOVi datasets and on another established dataset collection targeting scene segmentation. The system's well-interpretable, compositional latent encodings may serve as a foundation model for downstream tasks.

Paper Structure (20 sections, 4 equations, 10 figures, 8 tables)


Figures (10)

  • Figure 1: Exemplar slot-based autoregressive segmentations inferred by Loci-s, generalizing to 7-10 objects while being trained on only 4-6 objects. Shown are moving MNIST digits and dSprites, the Abstract Scene dataset, CLEVR, SHOP VRB, and a combination of GSO and HDRI-Haven, all of which were considered in a recent compositional scene understanding review (Yuan:2023).
  • Figure 2: The primary Loci-s architecture features: a Hyper-ConvNeXt encoder, which generates Position and Gestalt codes for each slot individually; an Update module, which adaptively fuses current encoder information with prior temporal predictions; a Transition module, which calculates object dynamics via a GateL0RD layer (a strongly gated RNN, cf. Gumbsch:2021c) and inter-slot interactions via self-attention; and a Decoder module, which computes sequential slot-wise predictions, including depth estimates, for the subsequent frame.
  • Figure 3: New Encoder design with the added depth information and slot-wise depth and object mask channels (in red). One encoder head processes both common inputs (RGB frame $R^t$, depth frame $D^t$, uncertainty mask $U^t$, background mask $\hat{M}^t_{bg}$) and slot-specific inputs (decoder outputs from the previous iteration: slot RGB $\hat{R}^t_k$, slot depth $\hat{D}^t_k$, amodal mask $M^{t,o}_k$, visibility mask $\hat{M}^{t,v}_k$, summed masks from other slots $\hat{M}^{t,s}_k$, and the 2D Gaussian position $\hat{Q}^t_k$). In addition to the top-down feedback provided by the slot-specific inputs, Hyper-ConvNeXt blocks also provide top-down feedback in the form of dynamic weight residuals computed from predicted Gestalt-Codes $\hat{G}^t_k$.
  • Figure 4: A single Hyper-ConvNeXt block within the encoder, where a top-down hyper-network translates the Gestalt-Code predicted in the last iteration, $\hat{G}^t_k$, into spatial convolutional kernel weight residuals. These residuals adapt the receptive field of the particular slot encoder, making it more sensitive to the previously encoded entity.
  • Figure 5: Decoder visualization illustrating the cascaded reconstruction strategy, first decoding the mask, then the depth, and finally the RGB image of a slot-encoded entity. Specifically, the Gestalt code is partitioned into three equal-length segments of 256 elements each. The Mask Decoder is implemented by a compact convolutional network; its input comprises the Mask-Gestalt Code $\hat{Gm^{t}_k}$ modulated by a 2D Gaussian heatmap, which is derived from the Position Code. The Depth Decoder features a U-Net architecture with aggressive down-sampling and up-sampling pathways, altering the spatial resolution by a factor of 16. This Depth Decoder receives as input the Depth-Gestalt Code $\hat{Gd^{t}_k}$ modulated by the mask output from the Mask Decoder. The RGB Decoder operates on the same principle as the Depth Decoder but incorporates an additional input: the depth map generated by the Depth Decoder.
  • ...and 5 more figures
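
The first step of the cascade described in the Figure 5 caption (partitioning the Gestalt code and modulating the mask segment by a 2D Gaussian heatmap) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. Several details are assumptions: the function names `split_gestalt`, `gaussian_heatmap`, and `modulate` are hypothetical; the segment order (mask, depth, RGB) simply follows the decoding order in the caption; the Position Code is assumed to be parameterized as a center `(x, y)` in `[-1, 1]` with an isotropic width `sigma`; and modulation is assumed to be an element-wise product broadcast over space.

```python
import numpy as np

SEGMENT = 256  # per the Figure 5 caption: three segments of 256 elements each


def split_gestalt(gestalt):
    """Split a 768-element Gestalt code into three 256-element segments.

    The (mask, depth, rgb) order is an assumption that mirrors the
    decoding order named in the caption.
    """
    gestalt = np.asarray(gestalt)
    if gestalt.shape[-1] != 3 * SEGMENT:
        raise ValueError("expected a Gestalt code with 768 elements")
    mask_code, depth_code, rgb_code = np.split(gestalt, 3, axis=-1)
    return mask_code, depth_code, rgb_code


def gaussian_heatmap(x, y, sigma, height, width):
    """Isotropic 2D Gaussian evaluated on a [-1, 1] x [-1, 1] pixel grid.

    The (x, y, sigma) parameterization of the Position Code is assumed.
    """
    ys = np.linspace(-1.0, 1.0, height)[:, None]  # column of row coordinates
    xs = np.linspace(-1.0, 1.0, width)[None, :]   # row of column coordinates
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))


def modulate(code, heatmap):
    """Broadcast a code vector over space and scale it by the heatmap.

    Returns an array of shape (C, H, W); element-wise product is an
    assumed form of the 'modulation' mentioned in the caption.
    """
    return code[:, None, None] * heatmap[None, :, :]


if __name__ == "__main__":
    gestalt = np.random.default_rng(0).normal(size=3 * SEGMENT)
    mask_code, depth_code, rgb_code = split_gestalt(gestalt)
    heatmap = gaussian_heatmap(0.0, 0.0, sigma=0.5, height=8, width=8)
    mask_input = modulate(mask_code, heatmap)
    print(mask_input.shape)
```

In the architecture described in the caption, a tensor like `mask_input` would feed the Mask Decoder, whose output would in turn modulate the depth segment, and the decoded depth map would then be passed to the RGB Decoder as an extra input; those learned networks are omitted here.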