Lifelong Learning Using a Dynamically Growing Tree of Sub-networks for Domain Generalization in Video Object Segmentation

Islam Osman, Mohamed S. Shehata

TL;DR

The paper tackles domain generalization in video object segmentation by addressing catastrophic forgetting when learning from multiple domains. It introduces a dynamically growing tree of sub-networks (DGT) that performs task-specific network generation via path-based parameter replacement, coupled with a lifelong learning procedure that uses Fisher-information-based weighting to mitigate forgetting. Across single-domain, multi-domain sequential, and few-shot out-of-domain evaluations, DGT achieves state-of-the-art or competitive results, with notable gains in F-score and reduced forgetting, while revealing trade-offs in model size and initialization time. The approach has practical impact for robust VOS in real-world, multi-domain scenarios and offers a scalable framework for continual domain adaptation in vision tasks.
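The Fisher-information-based weighting mentioned above can be illustrated with a minimal sketch. It assumes an EWC-style diagonal Fisher estimate and a quadratic penalty; this is a common way to use Fisher information against forgetting and is not necessarily the exact formulation used in the paper. The function names, the `lam` strength parameter, and the toy numbers are hypothetical.

```python
# Sketch only: EWC-style Fisher weighting (an assumption, not the paper's exact loss).

def diagonal_fisher(per_sample_grads):
    """Estimate the diagonal Fisher information as the mean squared
    gradient of the loss for each parameter over the old-domain samples."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def fisher_penalty(theta, theta_old, fisher, lam=1.0):
    """Quadratic penalty that discourages moving parameters that were
    important (high Fisher value) for previously learned domains."""
    return 0.5 * lam * sum(
        f * (t - o) ** 2 for f, t, o in zip(fisher, theta, theta_old)
    )


def fisher_penalty_grad(theta, theta_old, fisher, lam=1.0):
    """Gradient of the penalty, to be added to the new-domain loss gradient."""
    return [lam * f * (t - o) for f, t, o in zip(fisher, theta, theta_old)]


# Toy example: parameter 0 mattered for the old domain, parameter 1 did not.
old_grads = [[1.0, 0.0], [3.0, 0.0]]
fisher = diagonal_fisher(old_grads)
theta_old = [0.0, 0.0]
theta_new = [1.0, 10.0]
penalty = fisher_penalty(theta_new, theta_old, fisher, lam=2.0)
grad = fisher_penalty_grad(theta_new, theta_old, fisher, lam=2.0)
```

In this toy case the second parameter moves far from its old value without any penalty because its Fisher value is zero, while the first parameter is pulled back toward the old solution.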

Abstract

Current state-of-the-art video object segmentation models have achieved great success using supervised learning with massive labeled training datasets. However, these models are trained using a single source domain and evaluated using videos sampled from the same source domain. When these models are evaluated using videos sampled from a different target domain, their performance degrades significantly due to poor domain generalization, i.e., their inability to learn from multi-domain sources simultaneously using traditional supervised learning. In this paper, we propose a dynamically growing tree of sub-networks (DGT) to learn effectively from multi-domain sources. DGT uses a novel lifelong learning technique that allows the model to continuously and effectively learn from new domains without forgetting the previously learned domains. Hence, the model can generalize to out-of-domain videos. The proposed work is evaluated using single-source in-domain (traditional video object segmentation), multi-source in-domain, and multi-source out-of-domain video object segmentation. In the single-source in-domain setting, DGT shows a performance gain of 0.2% and 3.5% on the DAVIS16 and DAVIS17 datasets, respectively. In the multi-source in-domain setting, DGT outperforms state-of-the-art video object segmentation models and other lifelong learning techniques, with an average F-score increase of 6.9% and minimal catastrophic forgetting. Finally, in the out-of-domain experiment, DGT performs 2.7% and 4% better than the state of the art in the 1-shot and 5-shot settings, respectively.

Paper Structure

This paper contains 17 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Process of DGT in the testing phase. First, an agent requests a suitable network from DGT for the new video using a labeled reference frame. Then, DGT generates a task-specific network by selecting a suitable node for each layer of the network. Finally, the generated network is sent to the agent to segment the new video frames.
  • Figure 2: Architecture of the proposed network.
  • Figure 3: Visualization of building a base DGT given $6$ videos and a decoder with $4$ layers. On the left side is coarse to fine clustering of videos. On the right side is the produced DGT.
  • Figure 4: Visualization of the full DGT-Net$_L$ as a circular tree after training using YT-VOS18, CDNet, and DAVIS17. The node in the middle is the root of the tree, and the nodes in the first inner circle are children of the root node. The node color indicates the stage at which the node was added. Blue nodes are added during the initial phase using YT-VOS18. Red nodes are added during the lifelong learning phase using CDNet. Finally, black nodes are added during the few-shot learning phase using DAVIS17. The visualization is made using the ETE toolkit.
  • Figure 5: Visual results of our proposed model against state-of-the-art video object segmentation models.
  • ...and 2 more figures
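
The test-time process described in the Figure 1 caption (selecting a suitable node for each layer to assemble a task-specific network) can be sketched as a root-to-leaf walk over the tree. The selection rule is an assumption here: each node is given a hypothetical `centroid` feature, and the child closest to the reference-frame feature is chosen at every level. The caption does not state the actual criterion, and the `Node` class, `select_path` function, and toy features are illustrative only.

```python
# Sketch only: nearest-centroid path selection (assumed rule, not from the paper).

class Node:
    """One node of the tree; in this sketch each depth corresponds to one
    network layer, and a node stands for that layer's parameters."""

    def __init__(self, name, centroid, children=None):
        self.name = name
        self.centroid = centroid          # hypothetical cluster feature
        self.children = children or []


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_path(root, feature):
    """Walk from the root to a leaf, picking at each level the child whose
    centroid is closest to the reference-frame feature."""
    path = []
    node = root
    while node.children:
        node = min(node.children, key=lambda c: sq_dist(c.centroid, feature))
        path.append(node.name)
    return path


# Toy tree with two levels below the root.
tree = Node("root", [0.0, 0.0], [
    Node("A", [0.0, 0.0], [Node("A1", [0.0, 1.0]), Node("A2", [2.0, 0.0])]),
    Node("B", [10.0, 10.0], [Node("B1", [10.0, 9.0])]),
])
path = select_path(tree, [1.8, 0.1])
```

The returned list names one node per layer; a real system would then load the parameters attached to those nodes into the corresponding layers of the generated network.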