CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

TL;DR

CARINOX tackles compositional misalignment in text-to-image diffusion by fusing reward-guided initial-noise optimization with multi-seed exploration in an inference-time framework. It leverages a one-step diffusion backbone to enable clean gradient propagation, with per-reward gradient clipping and a latent-space regularization term $K(\boldsymbol{\epsilon})$ on the initial noise $\boldsymbol{\epsilon}$ to stabilize updates, while a correlation-guided reward selection (based on human judgments) yields a robust composite objective. The method samples $N=5$ seeds and optimizes each for $T$ steps, with total compute scaling as $N\times T$, selecting the best candidate via a combined reward $R(I,p)$ that emphasizes compositional fidelity. Across T2I-CompBench++ and HRS, CARINOX outperforms optimization- and exploration-only baselines and maintains image quality and diversity, with ablations validating the necessity of gradient clipping, regularization, and the joint exploration-optimization pipeline.
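
The explore-then-optimize loop summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the image generator is omitted, the two rewards are hypothetical quadratic functions with analytic gradients, the regularizer is a simple stand-in that pulls the noise toward the origin, and the learning rate, clipping norm, and regularization weight are arbitrary illustrative values. Only the overall shape (sample $N$ seeds, run $T$ clipped gradient-ascent steps on each, keep the best by combined reward) follows the description.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

def combined_reward(noise, rewards, weights):
    """Weighted sum of all reward values (a stand-in for R(I, p))."""
    return sum(w * r(noise) for w, (r, _) in zip(weights, rewards))

def optimize_noise(noise, rewards, weights, steps,
                   lr=0.05, max_norm=1.0, reg_weight=0.01):
    """Gradient ascent on the noise with per-reward clipping.

    lr, max_norm and reg_weight are hypothetical values for this sketch.
    """
    for _ in range(steps):
        update = np.zeros_like(noise)
        for w, (_, grad_fn) in zip(weights, rewards):
            # Clip each reward's gradient separately so that no single
            # reward dominates the update because of its scale.
            update += w * clip_by_norm(grad_fn(noise), max_norm)
        # Assumed regularizer: gradient of -0.5 * ||noise||^2.
        update -= reg_weight * noise
        noise = noise + lr * update
    return noise

def explore_and_optimize(rewards, weights, dim, n_seeds=5, steps=50, seed=0):
    """Optimize n_seeds independent noises and return the best one."""
    rng = np.random.default_rng(seed)
    best_noise, best_score = None, -np.inf
    for _ in range(n_seeds):
        noise = optimize_noise(rng.standard_normal(dim), rewards, weights, steps)
        score = combined_reward(noise, rewards, weights)
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise, best_score

def make_toy_reward(target, scale):
    """Hypothetical reward -scale * ||x - target||^2 and its gradient."""
    reward = lambda x: -scale * float(np.sum((x - target) ** 2))
    grad = lambda x: -2.0 * scale * (x - target)
    return reward, grad

# Two toy rewards on very different scales.
rewards = [make_toy_reward(np.full(8, 1.0), 1.0),
           make_toy_reward(np.full(8, 0.5), 10.0)]
weights = [1.0, 1.0]

_, score_init = explore_and_optimize(rewards, weights, dim=8, steps=0)
_, score_opt = explore_and_optimize(rewards, weights, dim=8, steps=50)
print(score_init, score_opt)
```

With `n_seeds=5` and `steps=50`, the sketch performs $N \times T = 250$ update steps in total, which is the compute scaling the summary refers to. Setting `steps=0` reduces it to pure best-of-$N$ exploration, and `n_seeds=1` reduces it to pure optimization.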

Abstract

Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.
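
The idea of selecting rewards by their correlation with human judgments can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the paper: the abstract does not specify the correlation measure, so Spearman rank correlation is assumed here, and the reward names, scores, human ratings, and threshold are all made up. In a category-aware setting, a selection like this would presumably be run once per prompt category.

```python
import numpy as np

def rank(values):
    """Ranks 0..n-1; ties are broken by position (the toy data has no ties)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

def select_rewards(scores_by_reward, human, threshold):
    """Keep the rewards whose correlation with human ratings meets threshold."""
    corr = {name: spearman(s, human) for name, s in scores_by_reward.items()}
    selected = [name for name, c in corr.items() if c >= threshold]
    return selected, corr

# Hypothetical data for one category: five images rated by humans
# and scored by two candidate reward models.
human = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = {
    "reward_a": np.array([0.1, 0.2, 0.35, 0.5, 0.9]),  # agrees with humans
    "reward_b": np.array([0.9, 0.1, 0.5, 0.3, 0.2]),   # mostly disagrees
}

selected, corr = select_rewards(scores, human, threshold=0.5)
print(selected, corr)
```

The selected rewards would then form the composite objective for that category, which is what makes the guidance category-aware rather than a fixed ad-hoc combination.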

Paper Structure

This paper contains 54 sections, 15 equations, 15 figures, 13 tables.

Figures (15)

  • Figure 1: Qualitative results on T2I-CompBench++, showing that CARINOX faithfully captures compositional details such as counts, spatial arrangements, and attribute bindings.
  • Figure 2: Limitations of optimization (a) and exploration (b) when applied in isolation. Optimization often fails to capture attributes or relations despite refinement, while exploration struggles to reliably recover all prompt elements even with multiple seeds.
  • Figure 3: Overview of the CARINOX framework. (a) Optimization: An initial noise is refined through iterative updates guided by multiple reward functions, with per-reward gradient clipping and latent regularization ensuring stable alignment with the prompt. (b) Exploration: Several noise candidates are sampled and independently optimized, and the final image is chosen via best-of-$N$ selection, combining exploration diversity with optimization precision.
  • Figure 4: Qualitative results on the HRS benchmark, where CARINOX produces coherent, visually expressive outputs with accurate style and text rendering.
  • Figure 5: Effect of optimization iterations (a) and exploration seeds (b) on T2I-CompBench++. Performance improves with more iterations and seeds but saturates beyond 50 iterations and 5 seeds, motivating their use as CARINOX defaults for balanced efficiency and alignment.
  • ...and 10 more figures