Tokenizing Semantic Segmentation with RLE

Abhineet Singh, Justin Rozeboom, Nilanjan Ray

Abstract

This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks and then train a modified version of Pix2Seq \cite{p2s} to output these RLE tokens through autoregression. We propose novel tokenization strategies to compress the length of the token sequence to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our proposed models on two datasets to show that they are competitive with the state of the art in spite of being bottlenecked by our limited computational resources.

Paper Structure

This paper contains 34 sections, 1 equation, 19 figures, 14 tables.

Figures (19)

  • Figure 1: Generating the RLE sequence for a $2\times 5$ binary mask using row-major flattening. Note that the start indices use 0-based indexing instead of the 1-based indexing commonly used in RLE.
  • Figure 2: Visualization of RLE tokenization of binary segmentation masks with row-major (top) and column-major (bottom) flattening of the masks. Each figure shows (from left to right) the source image patch with the foreground mask drawn on it in yellow, the binary version of this mask with the already tokenized segment in yellow, and the corresponding tokens. The run that is currently being tokenized is shown in purple. Animated versions of these figures are available at https://webdocs.cs.ualberta.ca/~asingh1/p2s#seg_binary_row_major and https://webdocs.cs.ualberta.ca/~asingh1/p2s#seg_binary_column_major. Best viewed under high magnification.
  • Figure 3: An example of the sliding window patch extraction and mask subsampling process on an image from the IPSC dataset. The top row shows (from left to right) the source image resized to $2560\times 2560$ with the patch location shown by the blue box, the corresponding mask at its full resolution of $2560\times 2560$, and this mask subsampled by a factor of 8 to $320\times 320$. The bottom row shows (from left to right) the $640\times 640$ patch corresponding to the blue box, the corresponding patch mask at its full resolution of $640\times 640$, and this mask subsampled by a factor of 8 to $80\times 80$. This subsampled $80\times 80$ mask is the one that is used for generating the RLE sequence. An animated version of this figure is available at https://webdocs.cs.ualberta.ca/~asingh1/p2s#sliding_window_patches.
  • Figure 4: Visualization of LAC tokenization for multi-class segmentation. The figure shows (from left to right) the source image patch with the two classes in red and green, the binary version of this mask with the already tokenized segment in red or green depending on the class, and the corresponding tokens. The run that is currently being tokenized is shown in purple. The LAC tokens are shown here as concatenations of class name and length, but each such combination represents a single unique token. An animated version of this figure is available at https://webdocs.cs.ualberta.ca/~asingh1/p2s#seg_lac. Best viewed under high magnification.
  • Figure 5: Visualization of TAC tokenization for multi-class video segmentation with $N=2$. The top row shows (from left to right) $F_1$, $F_2$, the full resolution TAC mask, and the subsampled TAC mask. The TAC masks show 8 TAC classes whose colors are shown at the top. The bottom row shows the subsampled $F_1$ and $F_2$ masks, partially colored with TAC colors for runs whose tokens are shown on the right. Tokens are colored according to the TAC class in each run, except the current one, which is shown in purple. An animated version of this figure is available at https://webdocs.cs.ualberta.ca/~asingh1/p2s#vid_seg_multi_class_tac. Best viewed under high magnification.
  • ...and 14 more figures
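
The captions of Figures 1 and 2 describe flattening a binary mask in row-major or column-major order and encoding each foreground run by its 0-based start index and its length. The following Python sketch illustrates that idea under stated assumptions: the function names (`flatten`, `rle_encode`, `rle_decode`) and the example mask are hypothetical and are not taken from the paper or its code, and the output is a list of plain (start, length) pairs rather than the actual token vocabulary used by the model.

```python
def flatten(mask, order="row"):
    """Flatten a 2D mask (list of equal-length rows) into a 1D list.

    order="row" reads the mask row by row; order="col" reads it column by column.
    """
    if order == "row":
        return [v for row in mask for v in row]
    if order == "col":
        n_rows, n_cols = len(mask), len(mask[0])
        return [mask[r][c] for c in range(n_cols) for r in range(n_rows)]
    raise ValueError(f"unknown order: {order!r}")


def rle_encode(mask, order="row"):
    """Return (start, length) pairs for foreground runs, using 0-based starts."""
    flat = flatten(mask, order)
    runs = []
    start = None  # start index of the run currently being scanned, if any
    for i, v in enumerate(flat):
        if v and start is None:
            start = i                       # a new foreground run begins
        elif not v and start is not None:
            runs.append((start, i - start))  # the run ended at index i - 1
            start = None
    if start is not None:                   # run extends to the end of the mask
        runs.append((start, len(flat) - start))
    return runs


def rle_decode(runs, shape, order="row"):
    """Rebuild the 2D binary mask from (start, length) pairs."""
    n_rows, n_cols = shape
    flat = [0] * (n_rows * n_cols)
    for start, length in runs:
        for i in range(start, start + length):
            flat[i] = 1
    if order == "row":
        return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
    if order == "col":
        return [[flat[c * n_rows + r] for c in range(n_cols)] for r in range(n_rows)]
    raise ValueError(f"unknown order: {order!r}")


# Hypothetical 2x5 mask (not the one drawn in Figure 1).
mask = [
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
]
row_runs = rle_encode(mask, order="row")  # [(1, 2), (5, 2), (9, 1)]
col_runs = rle_encode(mask, order="col")  # [(1, 4), (9, 1)]
```

For this mask, column-major flattening merges vertically adjacent foreground pixels into longer runs, so it yields two runs instead of three. This is only meant to show that the flattening order changes the number of (start, length) pairs, and hence the sequence length, for the same mask.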