Towards Real-Time Open-Vocabulary Video Instance Segmentation

Bin Yan, Martin Sundermeyer, David Joseph Tan, Huchuan Lu, Federico Tombari

TL;DR

This work tackles real-time open-vocabulary video instance segmentation (OV-VIS) by introducing TROY-VIS, a framework that significantly accelerates OV-VIS without sacrificing accuracy. It identifies text encoding, feature interaction, and instance decoding as the primary bottlenecks and proposes three innovations: Decoupled Attention Feature Enhancer, Flash Embedding Memory, and Kernel Interpolation, plus a streamlined training strategy. The result is a real-time OV-VIS system that achieves speedups of up to $20\times$ and reaches around $25$ FPS on challenging benchmarks such as BURST and LV-VIS, while delivering state-of-the-art or competitive accuracy (HOTA and mAP) in zero-shot settings. These advances enable practical OV-VIS deployment in dynamic environments such as mobile robotics and augmented reality, where fast, open-category perception is crucial.

Abstract

In this paper, we address the challenge of performing open-vocabulary video instance segmentation (OV-VIS) in real time. We analyze the computational bottlenecks of state-of-the-art foundation models that perform OV-VIS and propose a new method, TROY-VIS, that significantly improves processing speed while maintaining high accuracy. We introduce three key techniques: (1) Decoupled Attention Feature Enhancer to speed up information interaction between different modalities and scales; (2) Flash Embedding Memory for obtaining fast text embeddings of object categories; and (3) Kernel Interpolation for exploiting the temporal continuity in videos. Our experiments demonstrate that TROY-VIS achieves the best trade-off between accuracy and speed on two large-scale OV-VIS benchmarks, BURST and LV-VIS, running $20\times$ faster than GLEE-Lite (25 FPS vs. 1.25 FPS) with comparable or even better accuracy. These results demonstrate TROY-VIS's potential for real-time applications in dynamic environments such as mobile robotics and augmented reality. Code and model will be released at https://github.com/google-research/troyvis.

Paper Structure

This paper contains 23 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Performance and speed comparison on the LV-VIS [LVVIS] benchmark. TROY-VIS is the only method that runs in real time. Compared with GLEE-Lite [GLEE], TROY-VIS runs $20\times$ faster while achieving better results. TROY-VIS surpasses OVSeg-R50 [LVVIS], the previously fastest model, by 6.7 AP.
  • Figure 2: Architecture comparison between the original feature enhancer and our decoupled attention feature enhancer. In our design, a fast modality attention and an efficient scale attention are used to replace the heavy modality-scale hybrid attention.
  • Figure 3: Illustration of kernel interpolation. Cuboids represent instance kernels, and kernels from the same frame share the same color. Accurate kernels on key frames and proxy kernels on non-key frames are circled with solid and dashed lines, respectively. We explore three types of interpolation methods: linear, nearest neighbor (NN), and causal NN interpolation.
  • Figure 4: Qualitative results of TROY-VIS on challenging indoor and outdoor scenarios. Best viewed in color with zoom-in.
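
The three interpolation variants named in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is one plausible reading of the caption, not the authors' implementation: the function name `interpolate_kernels`, the `(K, N, C)` kernel layout, and the assumption that frame 0 is a key frame are all hypothetical choices made for illustration.

```python
import numpy as np

def interpolate_kernels(key_kernels, key_frames, num_frames, mode="linear"):
    """Fill in per-frame instance kernels from kernels computed on key frames.

    key_kernels: array of shape (K, N, C), one set of N kernels per key frame
                 (hypothetical layout, not taken from the paper).
    key_frames:  sorted frame indices of the K key frames; frame 0 is assumed
                 to be a key frame.
    mode:        "linear", "nn", or "causal_nn".
    """
    key_kernels = np.asarray(key_kernels, dtype=np.float64)
    key_frames = np.asarray(key_frames)
    assert key_frames[0] == 0, "sketch assumes the first frame is a key frame"
    out = np.empty((num_frames,) + key_kernels.shape[1:])
    for t in range(num_frames):
        # Most recent key frame at or before t, and the key frame after it.
        prev = int(np.searchsorted(key_frames, t, side="right")) - 1
        nxt = min(prev + 1, len(key_frames) - 1)
        t0, t1 = key_frames[prev], key_frames[nxt]
        if mode == "causal_nn" or prev == nxt:
            # Causal NN never looks ahead; past the last key frame, every
            # mode falls back to that last key frame.
            out[t] = key_kernels[prev]
        elif mode == "nn":
            # Pick the temporally closer key frame (ties go to the past one).
            out[t] = key_kernels[prev] if (t - t0) <= (t1 - t) else key_kernels[nxt]
        elif mode == "linear":
            w = (t - t0) / (t1 - t0)
            out[t] = (1.0 - w) * key_kernels[prev] + w * key_kernels[nxt]
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out

# Two key frames (0 and 4), one scalar "kernel" each, six frames in total.
k = np.array([[[0.0]], [[4.0]]])
for mode in ("linear", "nn", "causal_nn"):
    print(mode, interpolate_kernels(k, [0, 4], 6, mode)[:, 0, 0].tolist())
```

In this sketch, linear and NN interpolation need the next key frame before the frames in between can be filled in, whereas causal NN uses only past key frames, which is the property a streaming setting would require.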