Shift and matching queries for video semantic segmentation

Tsubasa Mizuno, Toru Tamaki

TL;DR

Experimental results on CityScapes-VPS and VSPW show significant improvements over the baselines, highlighting the method's effectiveness in enhancing segmentation quality while efficiently reusing pre-trained weights.

Abstract

Video segmentation is a popular task, but applying image segmentation models to videos frame by frame does not preserve temporal consistency. In this paper, we propose a method to extend a query-based image segmentation model to video using feature shift and query matching. The method uses a query-based architecture, where decoded queries represent segmentation masks. These queries must be matched across frames before the feature shift is performed, so that the shifted queries represent the same mask in different frames. Experimental results on CityScapes-VPS and VSPW show significant improvements over the baselines, highlighting the method's effectiveness in enhancing segmentation quality while efficiently reusing pre-trained weights.

Paper Structure

This paper contains 15 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The architecture of our proposed method includes feature shift and query matching. It applies an image segmentation model (the orange plate) frame by frame; the model has a backbone for feature extraction, a pixel decoder, a transformer decoder for processing mask queries, and a segmentation module for prediction. The proposed query matching ensures temporal consistency even when feature shift is applied to decoded queries of different frames.
  • Figure 2: Segmentation results for CityScapes-VPS. The first two rows show (a) original frames and (b) ground truth segmentation labels. The following rows show results of (c) the baseline (no shifts, no query matching), (d) 1/32 shift with matching and (e) without matching, (f) 1/16 shift with matching and (g) without matching.
  • Figure 3: Segmentation results for VSPW. The first two rows show (a) original frames and (b) ground truth segmentation labels. The following rows show results of (c) the baseline (no shifts, no query matching), (d) 1/16 shift with matching and (e) without matching, (f) 1/8 shift with matching and (g) without matching.
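
The abstract and Figure 1 state that decoded queries must be matched across frames before feature shift, so that a shifted channel still refers to the same mask. The sketch below illustrates that ordering on a toy tensor of decoded queries with shape (frames, queries, channels). It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the greedy one-to-one matching on cosine similarity (this page does not say which matching criterion the paper uses), the zero padding at clip boundaries, and the reading of a shift amount such as 1/16 as the fraction of channels moved in each temporal direction.

```python
import numpy as np


def match_queries(ref, cur, eps=1e-8):
    """Return `perm` such that cur[perm[i]] is the query matched to ref[i].

    Assumption: greedy one-to-one matching on cosine similarity, used here
    only as a simple stand-in for whatever matching the paper performs.
    """
    ref_n = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    cur_n = cur / (np.linalg.norm(cur, axis=1, keepdims=True) + eps)
    sim = ref_n @ cur_n.T  # (N, N) cosine similarities
    n = sim.shape[0]
    perm = np.full(n, -1, dtype=int)
    for _ in range(n):
        # Take the best remaining pair, then retire its row and column.
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        perm[i] = j
        sim[i, :] = -np.inf
        sim[:, j] = -np.inf
    return perm


def shift_queries(queries, fraction):
    """Shift channels along time for queries of shape (T, N, C).

    Assumption: the first `fold` channels come from the previous frame and
    the next `fold` channels from the next frame, with zeros at the clip
    boundaries; the remaining channels are left untouched.
    """
    t, n, c = queries.shape
    fold = int(c * fraction)
    out = queries.copy()
    out[1:, :, :fold] = queries[:-1, :, :fold]            # from frame t-1
    out[0, :, :fold] = 0
    out[:-1, :, fold:2 * fold] = queries[1:, :, fold:2 * fold]  # from frame t+1
    out[-1, :, fold:2 * fold] = 0
    return out


def match_then_shift(queries, fraction):
    """Reorder each frame's queries to follow the previous frame, then shift."""
    aligned = queries.copy()
    for t in range(1, queries.shape[0]):
        perm = match_queries(aligned[t - 1], queries[t])
        aligned[t] = queries[t][perm]
    return shift_queries(aligned, fraction)
```

Calling `match_then_shift(queries, 1 / 16)` on an array of shape (T, N, C) returns an array of the same shape. Skipping the matching step and calling `shift_queries` directly would mix channels from queries at the same index in neighbouring frames, which need not describe the same mask; that is the inconsistency the matching step is meant to avoid.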