YINYANG-ALIGN: Benchmarking Contradictory Objectives and Proposing Multi-Objective Optimization based DPO for Text-to-Image Alignment

Amitava Das, Yaswanth Narsupalli, Gurpreet Singh, Vinija Jain, Vasu Sharma, Suranjana Trivedy, Aman Chadha, Amit Sheth

TL;DR

YinYangAlign tackles the problem of robust Text-to-Image (T2I) alignment by introducing six inherently conflicting objectives and a holistic benchmark to evaluate how T2I models balance them. It proposes Contradictory Alignment Optimization (CAO), a multi-objective extension of Direct Preference Optimization (DPO) that couples per-axiom losses with a global synergy term and a Synergy Jacobian to stabilize gradient interactions, enabling Pareto-aware optimization. The framework comes with a detailed annotation pipeline and per-axiom loss terms covering Artistic Freedom, Faithfulness to Prompt, Emotional Impact, Originality vs. Referentiality, Verifiability, and Cultural Sensitivity, with Sinkhorn regularization and CLIP-based references underpinning key components. Empirical results show that single-axiom DPO degrades the other objectives, while CAO achieves balanced improvements across all six axes and exposes Pareto-front trade-offs. The YinYangAlign benchmark and CAO framework thus provide a scalable, interpretable pathway toward ethically aware, user-tailored, and high-fidelity text-to-image generation in real-world settings.
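
As a rough illustration of what combining per-axiom preference losses can look like, the sketch below computes the standard DPO loss for a single preference pair and then takes a weighted sum over axioms. The weighted sum, the function names, the axiom weights, and the log-probability values are all assumptions made for this example; the global synergy term and Synergy Jacobian mentioned above are not reproduced, and for a diffusion model the log-probabilities would likely have to be replaced by a tractable surrogate.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy prefers the chosen sample than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def weighted_axiom_loss(pairs_by_axiom, weights, beta=0.1):
    """Hypothetical aggregate: weighted sum of per-axiom DPO losses (an assumption,
    not the paper's CAO objective, which also includes a synergy term)."""
    total = 0.0
    for axiom, pair in pairs_by_axiom.items():
        total += weights[axiom] * dpo_loss(*pair, beta=beta)
    return total


if __name__ == "__main__":
    # Placeholder log-probabilities: (policy chosen, policy rejected, ref chosen, ref rejected).
    pairs = {
        "faithfulness_vs_artistic_freedom": (-1.0, -5.0, -2.0, -2.0),
        "cultural_sensitivity_vs_artistic_freedom": (-3.0, -3.0, -3.0, -3.0),
    }
    weights = {
        "faithfulness_vs_artistic_freedom": 0.5,
        "cultural_sensitivity_vs_artistic_freedom": 0.5,
    }
    print(weighted_axiom_loss(pairs, weights, beta=1.0))
```

Keeping one loss per axiom before aggregation makes it possible to inspect each objective separately, which is the kind of view needed to see whether improving one axiom degrades another.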

Abstract

Precise alignment in Text-to-Image (T2I) systems is crucial to ensure that generated visuals not only accurately encapsulate user intents but also conform to stringent ethical and aesthetic benchmarks. Incidents like the Google Gemini fiasco, where misaligned outputs triggered significant public backlash, underscore the critical need for robust alignment mechanisms. In contrast, Large Language Models (LLMs) have achieved notable success in alignment. Building on these advancements, researchers are eager to apply similar alignment techniques, such as Direct Preference Optimization (DPO), to T2I systems to enhance image generation fidelity and reliability. We present YinYangAlign, an advanced benchmarking framework that systematically quantifies the alignment fidelity of T2I systems, addressing six fundamental and inherently contradictory design objectives. Each objective pairs two competing goals and represents a fundamental tension in image generation, such as balancing adherence to user prompts with creative modifications or maintaining diversity alongside visual coherence. YinYangAlign includes detailed axiom datasets featuring human prompts, aligned (chosen) responses, misaligned (rejected) AI-generated outputs, and explanations of the underlying contradictions.

Paper Structure

This paper contains 161 sections, 92 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: The figure illustrates six core trade-offs (e.g., Faithfulness vs. Freedom, Emotional Impact vs. Neutrality), highlighting key conflicts and their implications.
  • Figure 2: Illustrative examples of all six contradictory alignment axioms, with each row highlighting specific trade-offs between competing objectives (e.g., Faithfulness to Prompt vs. Artistic Freedom, Emotional Impact vs. Neutrality). Chosen and rejected outputs demonstrate the inherent tensions during text-to-image generation, underscoring the need for a multi-objective optimization framework. Examples of Originality vs. Referentiality are inspired by a recent report: https://www.wired.com/story/ai-art-copyright-matthew-allen/. The Verifiability vs. Artistic Freedom case reflects incidents like the dissemination of a fake Pentagon explosion image by ‘verified’ Twitter accounts, which caused confusion: https://www.cnn.com/2023/05/22/tech/twitter-fake-image-pentagon-explosion/index.html. To mitigate harm caused by misinformation, the system should avoid unverifiable content or produce subdued visuals when necessary. Lastly, the incident reported at https://www.theguardian.com/technology/2024/feb/22/google-pauses-ai-generated-images-of-people-after-ethnicity-criticism underscores the need for Cultural Sensitivity in T2I systems, inspiring our Cultural Sensitivity vs. Artistic Freedom example. The figure labeled fig:slider_selection depicts the controls, and the figures labeled fig:slider_selection_image_variations_1 and fig:slider_selection_image_variations_2 show the resulting generations under varied control settings.
  • Figure 3: Illustrative example of aligning T2I models with Faithfulness to Prompt vs. Artistic Freedom. The chosen outputs adhere closely to the prompt, depicting a highly detailed and accurate portrait of Albert Einstein in a realistic oil painting style, while the rejected outputs deviate significantly, introducing surreal or unrelated elements. This highlights the importance of balancing prompt adherence with artistic flexibility in alignment optimization.
  • Figure 4: Annotation Agreement Heatmap: The VLM column represents the kappa score indicating the average agreement between GPT-4o and LLaVA across all axioms. Columns H1–H10 correspond to the kappa scores measuring the agreement between each specific human annotator and the consolidated VLM annotations. Higher scores (darker blue) signify stronger agreement, while lower scores (lighter shades) highlight areas of disagreement.
  • Figure 5: Visualization of error loss surface tension for six axiom pairs in YinYang alignment. Each pair highlights the inherent trade-offs between competing objectives using a 3D surface plot (left) and a 2D contour plot (right). Blue regions represent synergy (low tension), red regions indicate conflict (high tension), and green markers highlight "sweet spots" where the tension is minimal. The first axiom pair, Faithfulness to Prompt vs. Artistic Freedom, shows sweet spots centered around moderate values, suggesting balanced trade-offs. For Emotional Impact vs. Neutrality, sweet spots are sparse, reflecting the difficulty of balancing emotional engagement and neutrality. The axiom pair Visual Realism vs. Artistic Freedom shows distributed sweet spots, indicating achievable trade-offs between realism and creative freedom. In Originality vs. Referentiality, sweet spots are concentrated, emphasizing the challenge of balancing uniqueness and references. The pair Verifiability vs. Artistic Freedom has central sweet spots, suggesting harmony between factual accuracy and creative expression. Lastly, Cultural Sensitivity vs. Artistic Freedom shows fewer sweet spots, reflecting the complexity of respecting cultural norms while granting artistic liberties. This visualization underscores the inherent trade-offs in T2I systems and identifies potential areas of optimization for aligning competing objectives.
  • ...and 10 more figures
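
The Figure 4 caption reports kappa scores for annotator agreement without saying which variant is used. As a minimal sketch, assuming Cohen's kappa between two annotators who each assign one categorical label per item, the computation could look like the following; the function name and the example labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels from a VLM and one human annotator on four items.
    vlm = ["chosen", "chosen", "rejected", "rejected"]
    human = ["chosen", "rejected", "rejected", "rejected"]
    print(cohen_kappa(vlm, human))
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 0 indicates chance-level agreement even when the annotators match on many items.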