Learning Trimaps via Clicks for Image Matting
Chenyi Zhang, Yihan Hu, Henghui Ding, Humphrey Shi, Yao Zhao, Yunchao Wei
TL;DR
The paper addresses the bottleneck of manual trimap annotation in image matting by proposing Click2Trimap, an interactive model that converts sparse user clicks into high-quality trimaps and alpha mattes. It introduces an Iterative Three-class Training Strategy (ITTS) and a Conditioned Unknown Prioritized Simulation (CUPS) to simulate realistic user interactions and emphasize unknown-region recall. When integrated with trimap-based matting models, Click2Trimap achieves alpha matte accuracy comparable to that obtained with manual trimaps while sharply reducing interaction cost: in the user study, participants needed about 5 seconds per image on average. The approach also extends to video matting and is robust across diverse object types and real-world data.
Abstract
Despite significant advances in image matting, existing models depend heavily on manually drawn trimaps to produce accurate results on natural images. Obtaining these trimaps is time-consuming, not user-friendly, and poorly supported across devices, which greatly limits the practical application of all trimap-based matting methods. To address this issue, we introduce Click2Trimap, an interactive model that predicts high-quality trimaps and alpha mattes from minimal user click inputs. By analyzing the behavioral logic of real users and the characteristics of trimaps, we propose a powerful iterative three-class training strategy and a dedicated simulation function, which make Click2Trimap versatile across various scenarios. Quantitative and qualitative assessments on synthetic and real-world matting datasets demonstrate Click2Trimap's superior performance compared to all existing trimap-free matting methods. Notably, in the user study, Click2Trimap achieves high-quality trimap and matting predictions in an average of just 5 seconds per image, demonstrating its substantial practical value in real-world applications.
