DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
Amitava Das, Suranjana Trivedy, Danush Khanna, Rajarshi Roy, Gurpreet Singh, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aishwarya Naresh Reganti, Aman Chadha
TL;DR
DPO-Kernels extends Direct Preference Optimization by adding kernelized representations and embedding-based semantics to the alignment objective. It broadens the set of divergences beyond KL to include Jensen–Shannon, Hellinger, Rényi, Bhattacharyya, Wasserstein, and f-divergences, introduces data-driven selection of kernel-divergence pairs, and adds a Hierarchical Mixture of Kernels to balance local and global structure. Empirically, it achieves state-of-the-art generalization across 12 datasets covering factuality, safety, reasoning, and instruction following, with the analysis grounded in Heavy-Tailed Self-Regularization. The framework emphasizes robustness, interpretability, and potential multimodal extensions, at the price of higher computational cost, and it raises ethical considerations that call for careful mitigation.
Abstract
The rapid rise of large language models (LLMs) has unlocked many applications but also underscores the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key contributions: (i) Kernelized Representations with polynomial, RBF, Mahalanobis, and spectral kernels for richer transformations, plus a hybrid loss combining embedding-based and probability-based objectives; (ii) Divergence Alternatives (Jensen–Shannon, Hellinger, Rényi, Bhattacharyya, Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels for both local precision and global modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance in factuality, safety, reasoning, and instruction following. Grounded in Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization for LLMs, offering a comprehensive resource for further alignment research.
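To make the building blocks named in the abstract concrete, the following is a minimal sketch of two of the listed kernels (polynomial and RBF) and three of the listed divergences (Jensen–Shannon, Hellinger, Bhattacharyya), using their standard textbook definitions on plain Python lists. It is not the paper's implementation: the function names, the default hyperparameters (`degree`, `c`, `sigma`), and the restriction to discrete distributions are assumptions made for illustration, and the Mahalanobis and spectral kernels, the Wasserstein distance, and the hybrid loss are omitted because the abstract does not specify them.

```python
import math


def polynomial_kernel(x, y, degree=2, c=1.0):
    """Polynomial kernel (x . y + c)^degree; degree and c are illustrative defaults."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + c) ** degree


def rbf_kernel(x, y, sigma=1.0):
    """RBF (Gaussian) kernel exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2 (natural log)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger_distance(p, q):
    """Hellinger distance, which lies in [0, 1]."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(0.5 * s)


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance -ln(sum_i sqrt(p_i q_i)); assumes overlapping support."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc)
```

Unlike KL, the Jensen–Shannon divergence and the Hellinger distance stay finite even when the two distributions have disjoint support, which is one common reason for treating bounded divergences as more stable. For example, `js_divergence([1.0, 0.0], [0.0, 1.0])` evaluates to `math.log(2)`, whereas `kl_divergence` on the same pair is undefined.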
