
Look, Learn and Leverage (L$^3$): Mitigating Visual-Domain Shift and Discovering Intrinsic Relations via Symbolic Alignment

Hanchen Xie, Jiageng Zhu, Mahyar Khayatkhoei, Jiazhi Li, Wael AbdAlmageed

TL;DR

We propose Look, Learn and Leverage (L$^3$), a novel learning framework that decomposes the learning process into three distinct phases and systematically uses class-agnostic segmentation masks as a common symbolic space to align visual domains.

Abstract

Modern deep learning models, such as Disentangled Representation Learning (DRL), Causal Representation Learning (CRL) and Visual Question Answering (VQA) methods, have demonstrated outstanding performance at discovering underlying mechanisms when both visual appearance and intrinsic relations (e.g., causal structure) data are sufficient. However, the generalization ability of these models is challenged when the visual domain shifts and the relations data are absent during finetuning. To address this challenge, we propose a novel learning framework, Look, Learn and Leverage (L$^3$), which decomposes the learning process into three distinct phases and systematically utilizes class-agnostic segmentation masks as a common symbolic space to align visual domains. Thus, a relations discovery model can be trained on the source domain, and when the visual domain shifts and the intrinsic relations are absent, the pretrained relations discovery model can be directly reused while maintaining satisfactory performance. Extensive performance evaluations are conducted on three different tasks: DRL, CRL and VQA. L$^3$ shows outstanding results on all three tasks, which demonstrates its advantages.
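The three phases can be illustrated with a deliberately tiny toy sketch. It is not the authors' implementation: the names `look_source`, `look_target`, `learn` and `make_image`, the thresholding rule, and the area-based "relations" model are all assumptions made for illustration. The only point is the data flow: each domain has its own Look module that maps images to binary masks, the relations model is learned on source masks only, and the same model is then reused on target masks without target labels.

```python
from typing import Callable, List

Image = List[List[float]]
Mask = List[List[int]]


def look_source(image: Image) -> Mask:
    # Hypothetical source-domain Look module: objects are bright on a dark background.
    return [[1 if px > 0.5 else 0 for px in row] for row in image]


def look_target(image: Image) -> Mask:
    # Hypothetical target-domain Look module: objects are dark on a bright background.
    return [[1 if px < 0.5 else 0 for px in row] for row in image]


def mask_area(mask: Mask) -> int:
    # Number of object pixels in a binary mask.
    return sum(sum(row) for row in mask)


def learn(masks: List[Mask], labels: List[int]) -> Callable[[Mask], int]:
    # Toy stand-in for the relations discovery model: an area threshold
    # placed halfway between the mean areas of the two classes.
    small = [mask_area(m) for m, y in zip(masks, labels) if y == 0]
    large = [mask_area(m) for m, y in zip(masks, labels) if y == 1]
    threshold = (sum(small) / len(small) + sum(large) / len(large)) / 2
    return lambda mask: int(mask_area(mask) > threshold)


def make_image(n_object_pixels: int, fg: float, bg: float, size: int = 4) -> Image:
    # Build a size x size image whose first n_object_pixels pixels belong to the object.
    flat = [fg] * n_object_pixels + [bg] * (size * size - n_object_pixels)
    return [flat[i * size:(i + 1) * size] for i in range(size)]


# Look + Learn on the source domain, where labels (0 = small, 1 = large) exist.
source_images = [make_image(n, fg=0.9, bg=0.1) for n in (2, 3, 10, 12)]
source_labels = [0, 0, 1, 1]
model = learn([look_source(x) for x in source_images], source_labels)

# Leverage on the target domain: new appearance, no labels, same model.
target_images = [make_image(n, fg=0.1, bg=0.9) for n in (3, 11)]
target_predictions = [model(look_target(x)) for x in target_images]

# Skipping the target-specific Look module breaks the alignment.
misaligned_prediction = model(look_source(target_images[0]))
```

In this toy setup the model never sees target pixels directly, so the appearance change is absorbed entirely by the domain-specific Look module. Feeding a target image through the source Look module instead yields an inverted mask and a wrong prediction for the small object.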

Paper Structure (27 sections, 13 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Conventional intrinsic relations discovery models struggle under visual-domain shift, and they cannot be trained without intrinsic relations data. The proposed framework, Look, Learn and Leverage (L$^3$), seeks to address this challenge via symbolic alignment. The Look phase maps raw visual input from the source domain to a common symbolic space, where the relations discovery module is Learned. The pretrained module can then be Leveraged on the target domain with the respective Look phase.
  • Figure 2: Look, Learn and Leverage (L$^3$) framework.
  • Figure 3: Performance on the DRL task on the MPI3D dataset. Normalized results are reported; the original results are in the Appendix.
  • Figure 4: (a) Performance on the VQA task when the visual domain shifts from VQAv2 to TDW-VQA. Various diffusion timesteps are introduced to the TDW-VQA visual features to increase the distribution shift. (b) Ablation study of feature alignment in the Leverage step. Normalized results are reported.
  • Figure 5: Visualization of $m^i$ and $x^i$ reconstruction on the MPI3D dataset. The source domain (red box) is Real and the target domain (blue box) is Toy. The baseline fails to produce any meaningful output on the target domain due to visual-domain shift, whereas L$^3$ produces meaningful output on both the source and target domains. L$^3$'s outputs follow the source domain, which shows the advantage of the Leverage alignment.
  • ...and 4 more figures