Lifelong Learning Using a Dynamically Growing Tree of Sub-networks for Domain Generalization in Video Object Segmentation
Islam Osman, Mohamed S. Shehata
TL;DR
The paper tackles domain generalization in video object segmentation by addressing catastrophic forgetting when learning from multiple domains. It introduces a dynamically growing tree of sub-networks (DGT) that performs task-specific network generation via path-based parameter replacement, coupled with a lifelong learning procedure that uses Fisher-information-based weighting to mitigate forgetting. Across single-domain, multi-domain sequential, and few-shot out-of-domain evaluations, DGT achieves state-of-the-art or competitive results, with notable gains in F-score and reduced forgetting, while revealing trade-offs in model size and initialization time. The approach has practical impact for robust VOS in real-world, multi-domain scenarios and offers a scalable framework for continual domain adaptation in vision tasks.
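The Fisher-information-based weighting mentioned above can be illustrated with a generic quadratic penalty in the style of elastic weight consolidation. This is only a minimal sketch under stated assumptions, not the authors' procedure: the function names `diagonal_fisher` and `fisher_penalty`, the strength `lam`, and the toy numbers are all hypothetical and do not come from the paper.

```python
def diagonal_fisher(per_sample_grads):
    """Estimate diagonal Fisher information as the mean squared gradient
    of each parameter over a set of samples (an assumed, generic estimator)."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def fisher_penalty(params, old_params, fisher, lam=1.0):
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.
    Parameters with high Fisher values are discouraged from drifting."""
    return 0.5 * lam * sum(
        f * (p - o) ** 2 for f, p, o in zip(fisher, params, old_params)
    )


# Toy data: two parameters, two samples from a previously learned domain.
# Only the first parameter ever receives a non-zero gradient.
grads = [[1.0, 0.0], [3.0, 0.0]]
fisher = diagonal_fisher(grads)

old = [0.0, 0.0]
# Moving the parameter that mattered for the old domain is penalized...
penalty_important = fisher_penalty([1.0, 0.0], old, fisher)
# ...while moving the parameter that did not matter is free.
penalty_unimportant = fisher_penalty([0.0, 1.0], old, fisher)
```

In this toy setting, adding such a penalty to the loss for a new domain leaves parameters with low Fisher values free to adapt while anchoring those that were important for earlier domains, which is the general intuition behind using Fisher information to limit forgetting.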
Abstract
Current state-of-the-art video object segmentation models have achieved great success using supervised learning with massive labeled training datasets. However, these models are trained on a single source domain and evaluated on videos sampled from that same domain. When they are evaluated on videos sampled from a different target domain, their performance degrades significantly due to poor domain generalization, i.e., their inability to learn from multi-domain sources simultaneously using traditional supervised learning. In this paper, we propose a dynamically growing tree of sub-networks (DGT) to learn effectively from multi-domain sources. DGT uses a novel lifelong learning technique that allows the model to learn continuously and effectively from new domains without forgetting previously learned domains. Hence, the model can generalize to out-of-domain videos. The proposed work is evaluated in three settings: single-source in-domain (traditional video object segmentation), multi-source in-domain, and multi-source out-of-domain video object segmentation. In the single-source in-domain setting, DGT shows performance gains of 0.2% and 3.5% on the DAVIS16 and DAVIS17 datasets, respectively. In the multi-source in-domain setting, DGT outperforms state-of-the-art video object segmentation models and other lifelong learning techniques, with an average F-score increase of 6.9% and minimal catastrophic forgetting. Finally, in the out-of-domain experiment, DGT performs 2.7% and 4% better than the state of the art in the 1-shot and 5-shot settings, respectively.
