$φ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

Thanh-Dat Truong; Huu-Thien Tran; Jackson Cothren; Bhiksha Raj; Khoa Luu

TL;DR

This paper proposes a new continual learning paradigm based on Direct Preference Optimization (DPO) that mitigates catastrophic forgetting by aligning learning with pairwise preference signals. The approach achieves state-of-the-art performance across multiple benchmarks, outperforming prior continual learning methods for LMMs.

Abstract

Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting, the fairness problem caused by imbalanced data has received little attention. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or $φ$-DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Then, we identify the limitations of conventional DPO under imbalanced data and present a new $φ$-DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable $φ$-DPO-based continual learning, we construct pairwise preference annotations for existing continual learning benchmarks. Extensive experiments and ablation studies show that the proposed $φ$-DPO achieves state-of-the-art performance across multiple benchmarks, outperforming prior continual learning methods for LMMs.
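For readers unfamiliar with DPO, the following is a minimal sketch of the standard (vanilla) DPO loss for a single preference pair. It is not the paper's $φ$-DPO loss, which this summary does not spell out. The function and argument names are hypothetical, and using the previous-task model as the reference policy is an assumption suggested by the $\pi_{t-1}$ and $\pi_t$ notation in Lemma 1.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    logp_new_chosen: float,
    logp_new_rejected: float,
    logp_prev_chosen: float,
    logp_prev_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    logp_new_*  : log-probabilities under the model being trained.
    logp_prev_* : log-probabilities under the frozen reference model
                  (assumed here to be the previous-task model).
    beta        : temperature scaling the implicit reward (illustrative value).
    """
    chosen_ratio = logp_new_chosen - logp_prev_chosen
    rejected_ratio = logp_new_rejected - logp_prev_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -log_sigmoid(margin)


# Illustrative (made-up) log-probabilities.
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)
loss_after_update = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

When the trained model matches the reference, the margin is zero and the loss equals $\log 2$. The loss decreases as the trained model raises the chosen response's likelihood relative to the rejected one, measured against the reference.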

Paper Structure (21 sections, 7 theorems, 49 equations, 5 figures, 7 tables)

Introduction
Related Work
The Proposed $\phi$-DPO Approach
Direct Preference Optimization to Continual Learning in LMMs
DPO as Continual Learning
Theoretical Analysis of Direct Preference Optimization in Forgetting Mitigation
Fairness DPO in Continual Learning
DPO Data in Continual Learning Benchmark
Experimental Results
Benchmarks, Metrics, and Implementation
Main Results
Ablation Study
Conclusions and Limitations
Proof of Lemmas
Proof of Lemma 1
...and 6 more sections

Key Result

Lemma 1

Lower Bound of KL Divergence Governed by DPO Loss. The divergence $D_{\mathrm{KL}}(\pi_{t-1}\|\pi_t)$ admits a lower bound governed by the DPO loss, in which $C_{\mathrm{lower}}$ is a constant.
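To make the quantity in Lemma 1 concrete, the sketch below computes $D_{\mathrm{KL}}(\pi_{t-1}\|\pi_t)$ for two small discrete distributions. The function name and the probability values are illustrative assumptions, not taken from the paper. A larger value means the updated policy has drifted further from the previous one.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a three-token vocabulary.
pi_prev = [0.7, 0.2, 0.1]        # previous-task policy
pi_new_close = [0.6, 0.3, 0.1]   # updated policy that stays close
pi_new_far = [0.1, 0.2, 0.7]     # updated policy that drifts away

kl_close = kl_divergence(pi_prev, pi_new_close)
kl_far = kl_divergence(pi_prev, pi_new_far)
```

Here `kl_far` exceeds `kl_close`, matching the intuition that a policy which moves probability mass away from previously preferred outputs diverges more from the previous-task policy.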

Figures (5)

Figure 1: Our Fairness DPO ($\phi$-DPO) approach to Continual Learning in LMMs. Prior continual learning methods, e.g., LoRA, struggle under imbalanced multimodal data and suffer from catastrophic forgetting. Vanilla DPO is still affected by imbalanced data distributions. Our $\phi$-DPO approach can (1) mitigate forgetting, (2) adapt continuously to new learning tasks, and (3) maintain robustness under data imbalance.
Figure 2: The Imbalanced Distribution of Multimodal Continual Learning Benchmarks. The distribution of samples across ScienceQA topics is highly skewed: categories with fewer training examples (e.g., Grammar, Phonological Awareness, Word Study) exhibit significantly lower accuracy, while topics with more data (e.g., Biology, Physics) achieve stronger performance.
Figure 3: ScienceQA, Grounding, and OCR-VQA introduce progressively shifting visual distributions and alignment objectives, creating modality imbalance across tasks.
Figure 4: Our Proposed Continual Learning Approach via Fairness DPO for Large Multimodal Models. Traditional reinforcement learning from human feedback (RLHF) methods optimize models through explicit reward maximization. Our framework instead reformulates RLHF as Direct Preference Optimization (DPO). The Fairness DPO loss mitigates the gradient bias that arises under imbalanced data.
Figure 5: Example of Our DPO Data in the Continual Learning Benchmark. Best viewed in color.

Theorems & Definitions (7)

Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
Lemma 6
Lemma 7
