Towards Robust and Fair Vision Learning in Open-World Environments

Thanh-Dat Truong

TL;DR

This dissertation tackles fairness and robustness in vision learning under open-world conditions by integrating four intertwined lines: BiMaL-based fairness domain adaptation to reduce cross-domain biases without pixel-level independence assumptions; an Open-world Fairness Continual Learning framework that handles unseen classes while mitigating forgetting; a Geometry-based Cross-view framework for learning view-invariant representations across disparate camera perspectives; and Transformer-centered multimodal/temporal learning with diffusion-informed domain generalization to fortify large-scale visual foundation models. BiMaL provides a principled, invertible mapping to model cross-domain segmentation structure and yields a target-domain loss that upper-bounds entropy-based objectives, while FREDOM formalizes fairness over class distributions with a Conditional Structure Network. EAGLE extends cross-view learning to unpaired data using geodesic-flow metrics and view-conditioned prompts, and DirecFormer delivers directed attention for robust temporal understanding in videos. Collectively, the work delivers state-of-the-art performance across semantic segmentation, cross-view adaptation, open-world continual learning, and foundation-model generalization, with practical implications for autonomous systems and large multimodal models.

Abstract

This dissertation presents four key contributions toward fairness and robustness in vision learning. First, to address the problem of large-scale data requirements, it presents a novel Fairness Domain Adaptation approach built on two research findings: Bijective Maximum Likelihood and Fairness Adaptation Learning. Second, to enable open-world modeling in vision learning, it presents a novel Open-world Fairness Continual Learning Framework, which results from two research lines: Fairness Continual Learning and Open-world Continual Learning. Third, since visual data are often captured from multiple camera views, robust vision learning methods should be able to model features that are invariant across views. To this end, the dissertation presents a novel Geometry-based Cross-view Adaptation framework that learns robust feature representations across views. Finally, with the recent growth of large-scale video and multimodal data, understanding the feature representations of large-scale visual foundation models and improving their robustness is critical. The dissertation therefore presents novel Transformer-based approaches that improve the robustness of feature representations for multimodal and temporal data, followed by a novel Domain Generalization approach that improves the robustness of visual foundation models. Theoretical analysis and experimental results show the effectiveness of the proposed approaches and their superior performance compared to prior studies. Together, these contributions advance the fairness and robustness of machine vision learning.

Paper Structure

This paper contains 110 sections, 98 equations, 40 figures, 54 tables.

Figures (40)

  • Figure 1.1: The Challenges in Current Machine Vision Framework.
  • Figure 1.2: Overview of Research Towards Fairness and Robustness Learning. Unsupervised Domain Adaptation [dat2021bimal_iccv, truong2022otadapt, nguyen2022self, jalata2022eqadap]. Fairness Domain Adaptation [Truong:CVPR:2023FREDOM, truong2023comal]. Fairness Continual Learning [truong2022conda, truong2023fairness]. Open-world Fairness Continual Learning [truong2024falcon]. Robust Representation [duong2020vec2face, truong2023liaad]. Multi-Modality Learning [truong2021right2talk, nguyen2023insect]. Efficient Temporal Learning [truong2021direcformer]. Cross-View Robust Representation [truong2023crovia, truong2023cross, truong2024eagle]. Robust Foundation Model Learning [truong2024edsam, nguyen2023insect].
  • Figure 3.3: Two images have the same entropy, but one has a poor prediction (top image) and the other has a better prediction (bottom image). Columns 1 and 2 show the input image and the ground truth. Columns 3 and 4 show the entropy map and the prediction of AdvEnt [vu2019advent]. Column 5 shows the result of our proposed method. The two predictions produced by AdvEnt have similar entropy scores ($0.13$ and $0.14$), whereas the BiMaL value of the bottom prediction ($0.06$) is smaller than that of the top prediction ($0.14$). Our results in the last column, which have better BiMaL values than those of AdvEnt, model the structure of the image well. In particular, they give sharper predictions of the barrier and the rider (white dashed box) and a clear boundary between road and sidewalk.
  • Figure 3.4: The Proposed Framework. The RGB input image is first forwarded to a deep semantic segmentation network to produce a segmentation map. The supervised loss is applied to the source training samples. Meanwhile, the predicted segmentation of each target training sample is mapped to the latent space to compute the Bijective Maximum Likelihood loss. The bijective mapping network is trained on the ground-truth images of the source domain.
  • Figure 3.5: Ablation study of semantic segmentation performance (mIoU, %) showing the effectiveness of the proposed BiMaL loss.
  • ...and 35 more figures
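
The caption of Figure 3.3 notes that two predictions can share nearly the same entropy score while differing in quality. One reason a pixel-wise entropy score can miss this is that it is computed independently per pixel, so it ignores where each pixel sits in the image. The sketch below is a toy illustration of that property only, not the dissertation's code: the function name `mean_pixel_entropy`, the 8×8 three-class prediction, and the shuffling experiment are all assumptions made for illustration.

```python
import numpy as np

def mean_pixel_entropy(probs):
    """Mean normalized Shannon entropy over pixels; probs has shape (H, W, C)."""
    eps = 1e-12  # avoids log(0)
    num_classes = probs.shape[-1]
    per_pixel = -np.sum(probs * np.log(probs + eps), axis=-1)  # shape (H, W)
    return float(per_pixel.mean() / np.log(num_classes))

rng = np.random.default_rng(0)
H, W, C = 8, 8, 3

# Toy "structured" prediction: left half leans to class 0, right half to class 1.
logits = np.zeros((H, W, C))
logits[:, : W // 2, 0] = 3.0
logits[:, W // 2 :, 1] = 3.0
logits += 0.1 * rng.standard_normal((H, W, C))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Shuffle pixel positions: the spatial layout is destroyed, but every
# per-pixel class distribution is kept, so the entropy score cannot change.
flat = probs.reshape(-1, C)
shuffled = flat[rng.permutation(H * W)].reshape(H, W, C)

structured_entropy = mean_pixel_entropy(probs)
shuffled_entropy = mean_pixel_entropy(shuffled)
print(structured_entropy, shuffled_entropy)
```

Both printed scores agree up to floating-point rounding even though the shuffled label map no longer has a clean left/right boundary. A loss that scores the whole segmentation map jointly, as the Bijective Maximum Likelihood loss in Figure 3.4 is described as doing through a mapping to a latent space, is not bound by this per-pixel limitation.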