LFSamba: Marry SAM with Mamba for Light Field Salient Object Detection

Zhengyi Liu, Longzhen Wang, Xianyong Fang, Zhengzheng Tu, Linbo Wang

TL;DR

LFSamba is a state-of-the-art salient object detection model for multi-focus light field images. It is built on four main insights, including efficient feature extraction, where SAM is used to extract modality-aware discriminative features, and inter-slice relation modeling, where Mamba captures long-range dependencies across multiple focal slices and thereby extracts implicit depth cues.

Abstract

A light field camera can reconstruct 3D scenes using captured multi-focus images that contain rich spatial geometric information, enhancing applications in stereoscopic photography, virtual reality, and robotic vision. In this work, a state-of-the-art salient object detection model for multi-focus light field images, called LFSamba, is introduced to emphasize four main insights: (a) Efficient feature extraction, where SAM is used to extract modality-aware discriminative features; (b) Inter-slice relation modeling, leveraging Mamba to capture long-range dependencies across multiple focal slices, thus extracting implicit depth cues; (c) Inter-modal relation modeling, utilizing Mamba to integrate all-focus and multi-focus images, enabling mutual enhancement; (d) Weakly supervised learning capability, developing a scribble annotation dataset from an existing pixel-level mask dataset, establishing the first scribble-supervised baseline for light field salient object detection. https://github.com/liuzywen/LFScribble
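
To make insight (b) more concrete, the sketch below shows one simple way a recurrence can carry information across an ordered stack of focal slices. It is only an illustration under stated assumptions: the fixed coefficients `decay` and `in_gain`, the function names, and the per-slice feature shape `(N, C)` are all hypothetical, and the recurrence is a plain linear state-space scan rather than Mamba's input-dependent selective scan or the paper's FSS2D module.

```python
import numpy as np

def scan_slices(slice_feats, decay=0.9, in_gain=0.1):
    """Toy linear state-space scan along the focal-slice axis.

    slice_feats: array of shape (N, C), one feature vector per focal slice.
    Returns an array of shape (N, C); row t depends on slices 0..t.
    NOTE: decay and in_gain are fixed, assumed constants (not learned,
    not input-dependent as in Mamba).
    """
    slice_feats = np.asarray(slice_feats, dtype=float)
    state = np.zeros(slice_feats.shape[1])
    outputs = []
    for x_t in slice_feats:
        # h_t = decay * h_{t-1} + in_gain * x_t
        state = decay * state + in_gain * x_t
        outputs.append(state.copy())
    return np.stack(outputs)

def bidirectional_scan(slice_feats, decay=0.9, in_gain=0.1):
    """Sum a forward and a backward scan so every slice sees all others."""
    slice_feats = np.asarray(slice_feats, dtype=float)
    forward = scan_slices(slice_feats, decay, in_gain)
    backward = scan_slices(slice_feats[::-1], decay, in_gain)[::-1]
    return forward + backward

if __name__ == "__main__":
    feats = np.eye(3)  # three slices, each with a distinct one-hot feature
    print(bidirectional_scan(feats, decay=0.5, in_gain=1.0))
```

In this toy setting, each output row mixes features from every slice, with nearer slices weighted more heavily. This loosely mirrors the idea that relations between slices focused at different depths can act as an implicit depth cue.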

Paper Structure

This paper contains 12 sections, 14 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Examples of multi-focus light field images. Each example consists of a set of focal slices and an all-focus image. Red boxes indicate the focused region. Compared with "SAM", which is finetuned exclusively on the all-focus image, the results of "Ours", which is finetuned on multi-focus images, are closer to the pixel-level mask annotation "GT". "Scribble" refers to the sparse annotation.
  • Figure 2: The pipeline of LFSamba.
  • Figure 3: Inter-Slice Mamba and its core component FSS2D.
  • Figure 4: Inter-Modal Mamba and its core components, Slices-To-All SS2D and All-To-Slices SS2D.
  • Figure 5: The comparison of PR curves on three datasets.
  • ...and 1 more figure