Submodular video object proposal selection for semantic object segmentation
Tinghuai Wang
TL;DR
The paper tackles semantic video object segmentation under weak supervision by learning a data-driven, spatio-temporal representation that aggregates multiple instance proposals across frames. It introduces a submodular track selection method with a facility-location term $\mathcal{H}(\mathcal{D})$ and a discriminative term $\mathcal{P}(\mathcal{D})$ to prune noisy proposals, with the overall objective $\mathcal{E}(\mathcal{D})=\mathcal{H}(\mathcal{D})+\mathcal{P}(\mathcal{D})$ optimized greedily. Object segmentation is performed on a space-time superpixel graph via an energy $E(x)$ combining color and semantic unary potentials and a pairwise term, solved with alpha expansion. Experiments on YouTube-Objects show competitive improvements over state-of-the-art baselines, demonstrating that submodular track selection and cross-frame proposal aggregation can effectively exploit long-range context using pre-trained image recognizers.
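For orientation, the two objectives named above can be written in their generic textbook shapes. This is only a sketch: the candidate set $\mathcal{V}$, the pairwise similarity $w_{ij}$, and the potentials $\psi^{c}$, $\psi^{s}$, $\psi_{ij}$ are assumed notation, any weighting between terms is omitted, and the paper's exact definitions may differ.

$$\mathcal{H}(\mathcal{D}) = \sum_{i \in \mathcal{V}} \max_{j \in \mathcal{D}} w_{ij}, \qquad \mathcal{E}(\mathcal{D}) = \mathcal{H}(\mathcal{D}) + \mathcal{P}(\mathcal{D}), \qquad \mathcal{D} \subseteq \mathcal{V}$$

$$E(x) = \sum_{i} \psi^{c}_{i}(x_i) + \sum_{i} \psi^{s}_{i}(x_i) + \sum_{(i,j)} \psi_{ij}(x_i, x_j)$$

Here $\psi^{c}$ and $\psi^{s}$ stand for the color and semantic unary potentials, and the last sum runs over neighbouring superpixels in the space-time graph.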
Abstract
Learning a data-driven spatio-temporal semantic representation of objects is the key to coherent and consistent labelling in video. This paper proposes to achieve semantic video object segmentation by learning a data-driven representation that captures the synergy of multiple instances from continuous frames. To prune noisy detections, we exploit the rich information among multiple instances and select a discriminative and representative subset. This selection process is formulated as a facility location problem and solved by maximising a submodular function. Our method retrieves the longer-term contextual dependencies that underpin a robust semantic video object segmentation algorithm. We present extensive experiments on a challenging dataset that demonstrate the superior performance of our approach compared with state-of-the-art methods.
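The selection step described above can be illustrated with a minimal greedy sketch. It is not the paper's implementation: the function name `greedy_select`, the non-negative `similarity` matrix, the per-candidate `scores` used as a modular stand-in for the discriminative term, and the weight `lam` are all assumptions made for illustration.

```python
import numpy as np

def greedy_select(similarity, scores, budget, lam=1.0):
    """Greedily maximise E(D) = H(D) + lam * P(D) (illustrative sketch).

    H(D) = sum_i max_{j in D} similarity[i, j]   (facility location)
    P(D) = sum_{j in D} scores[j]                (assumed modular term)
    similarity is assumed to be non-negative.
    """
    n = similarity.shape[0]
    selected = []
    chosen = np.zeros(n, dtype=bool)
    coverage = np.zeros(n)  # best similarity of each item to the selected set
    for _ in range(min(budget, n)):
        # Value of H if candidate j were added, for every j at once.
        new_h = np.maximum(similarity, coverage[:, None]).sum(axis=0)
        gains = new_h - coverage.sum() + lam * scores
        gains[chosen] = -np.inf  # never pick the same candidate twice
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break  # no remaining candidate improves the objective
        selected.append(best)
        chosen[best] = True
        coverage = np.maximum(coverage, similarity[:, best])
    return selected

# Toy data: candidates {0, 1} and {2, 3} form two tight clusters.
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
p = np.array([0.5, 0.1, 0.1, 0.4])
print(greedy_select(S, p, budget=2))  # -> [0, 3]
```

On this toy input the greedy pass picks one candidate per cluster, favouring the higher-scoring member of each. That is the intended effect of combining a representativeness term with a discriminative one.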
