Memory Efficient Matting with Adaptive Token Routing

Yiheng Lin, Yihan Hu, Chenyi Zhang, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei

TL;DR

MEMatte addresses the memory bottleneck of ViT-based image matting on high-resolution inputs with three components: a router that scores each token before global attention, a Batch-constrained Adaptive Token Routing (BATR) mechanism that turns those scores into routing decisions, and a Lightweight Token Refinement Module (LTRM) that processes the less informative tokens. The framework dynamically allocates computation between a global-attention path and a lightweight refinement path, guided by a distillation loss from a ViTMatte teacher and a target compression degree that controls token routing. A new ultra high-resolution dataset, UHR-395, enables evaluation at an average resolution of about $4872\times6017$. MEMatte reduces memory usage by approximately 88% and latency by approximately 50% on the Composition-1K benchmark while delivering state-of-the-art matting performance on high-resolution and real-world data. These contributions make high-fidelity matting more practical to scale in real-world applications and establish a challenging benchmark for future work.

Abstract

Transformer-based models have recently achieved outstanding performance in image matting. However, applying them to high-resolution images remains challenging because of the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a memory-efficient matting framework for processing high-resolution images. MEMatte places a router before each global attention block, directing informative tokens to global attention and routing the remaining tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router uses a local-global strategy to predict the routing probability of each token, and the LTRM uses efficient modules to simulate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which lets each router route tokens dynamically according to the image content and the stage of its attention block in the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images with an average resolution of $4872\times6017$. The dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotation. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets while reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark. Our code is available at https://github.com/linyiheng123/MEMatte.
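
The routing idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the fixed threshold of 0.5, the function names, and both stand-in branches (`toy_global_mix` for global attention, `toy_local_refine` for the LTRM) are assumptions made purely for illustration, and the batch constraint of BATR is not modeled. The sketch only shows how per-token probabilities could split a token sequence into two branches and how the outputs are merged back in the original order.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Branch = Callable[[List[Token]], List[Token]]


def route_tokens(
    tokens: Sequence[Token],
    probs: Sequence[float],
    heavy_branch: Branch,
    light_branch: Branch,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[Token], List[int]]:
    """Split tokens by routing probability, process each group, merge in order."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")

    # Tokens at or above the threshold go to the expensive branch.
    heavy_idx = [i for i, p in enumerate(probs) if p >= threshold]
    light_idx = [i for i, p in enumerate(probs) if p < threshold]

    heavy_out = heavy_branch([tokens[i] for i in heavy_idx])
    light_out = light_branch([tokens[i] for i in light_idx])

    # Scatter both outputs back to their original positions.
    merged: List[Token] = [[] for _ in tokens]
    for i, tok in zip(heavy_idx, heavy_out):
        merged[i] = tok
    for i, tok in zip(light_idx, light_out):
        merged[i] = tok
    return merged, heavy_idx


def toy_global_mix(tokens: List[Token]) -> List[Token]:
    """Stand-in for global attention: every token sees the mean of all tokens."""
    if not tokens:
        return []
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[t[d] + mean[d] for d in range(dim)] for t in tokens]


def toy_local_refine(tokens: List[Token]) -> List[Token]:
    """Stand-in for a lightweight branch: a cheap per-token transform."""
    return [[0.5 * x for x in t] for t in tokens]


tokens = [[1.0], [2.0], [3.0], [4.0]]
probs = [0.9, 0.1, 0.8, 0.2]
merged, heavy_idx = route_tokens(tokens, probs, toy_global_mix, toy_local_refine)
print(heavy_idx)  # indices sent to the expensive branch
print(merged)     # outputs in the original token order
```

In this sketch only the tokens in `heavy_idx` interact with one another, so the cost of the token-mixing step depends on how many tokens are routed there rather than on the full sequence length.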

Paper Structure

This paper contains 16 sections, 13 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Illustration of the importance of full-resolution input in image matting. The crop-and-stitch approach introduces artifacts, while downsampling causes distortion.
  • Figure 2: Memory usage versus image resolution. OOM denotes an out-of-memory error encountered on an RTX 3090 GPU. The sharp increase in MEMatte's memory usage from 4K to 8K is due to the 4x increase in token count combined with the quadratic complexity of the attention mechanism. Despite this, MEMatte is capable of processing 8K images on the RTX 3090.
  • Figure 3: Overall framework of the proposed MEMatte. The router module is inserted before global attention to predict the routing probability $p^{b,m}$ for each token. The BATR then makes the routing decision $\delta^{b,m}$ based on $p^{b,m}$.
  • Figure 4: Qualitative comparison of the results on the UHR-395 test set. $D$ denotes downsampling the input and $P$ indicates dividing the input into patches. The resolution of each image is shown on the left.
  • Figure 5: Visualization of the token routing. The retained tokens are routed to the global attention branch, while the gray tokens are routed to the LTRM branch. More visualization results are shown in the supplementary materials.
  • ...and 1 more figures
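
The scaling mentioned in the Figure 2 caption can be checked with quick arithmetic. The sketch below assumes a ViT patch size of 16 and takes "4K" and "8K" to mean 3840x2160 and 7680x4320; both are illustrative assumptions and may differ from the paper's exact settings. It counts the entries of a dense token-by-token attention matrix, which is only a rough proxy for memory, and the `keep_ratio` parameter is a hypothetical knob showing the effect of sending only a fraction of tokens to global attention.

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for an image, rounding partial patches up."""
    rows = -(-height // patch)  # ceiling division
    cols = -(-width // patch)
    return rows * cols


def attention_entries(n_tokens: int, keep_ratio: float = 1.0) -> int:
    """Entries in a dense attention matrix over the tokens that are kept."""
    kept = round(n_tokens * keep_ratio)
    return kept * kept


tokens_4k = num_tokens(2160, 3840)  # assumed 4K resolution
tokens_8k = num_tokens(4320, 7680)  # assumed 8K resolution

token_growth = tokens_8k / tokens_4k
entry_growth = attention_entries(tokens_8k) / attention_entries(tokens_4k)
half_routed = attention_entries(tokens_4k, 0.5) / attention_entries(tokens_4k)

print(tokens_4k, tokens_8k)
print(token_growth, entry_growth)
print(half_routed)
```

Doubling each image side quadruples the token count and multiplies the dense attention matrix by sixteen, while keeping half of the tokens in the attention path shrinks that matrix to a quarter of its size.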