CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration
Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
TL;DR
CARINOX tackles compositional misalignment in text-to-image diffusion by fusing reward-guided initial-noise optimization with multi-seed exploration in an inference-time framework. It leverages a one-step diffusion backbone to enable clean gradient propagation, with per-reward gradient clipping and a latent-space regularization term $K(\boldsymbol{\epsilon})$ on the initial noise $\boldsymbol{\epsilon}$ to stabilize updates, while a correlation-guided reward selection (based on human judgments) yields a robust composite objective. The method samples $N=5$ seeds and optimizes each for $T$ steps, with total compute scaling as $N\times T$, selecting the best candidate via a combined reward $R(I,p)$ that emphasizes compositional fidelity. Across T2I-CompBench++ and HRS, CARINOX outperforms optimization- and exploration-only baselines and maintains image quality and diversity, with ablations validating the necessity of gradient clipping, regularization, and the joint exploration-optimization pipeline.
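The joint exploration-optimization loop summarized above can be illustrated with a toy sketch. This is not the authors' implementation: the identity "generator", the quadratic rewards, the norm-based stand-in for the regularizer, the choice to keep the best candidate across all seeds and steps, and every name and hyperparameter value are assumptions made only so the example runs without a diffusion model.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip(g, max_norm):
    """Rescale a gradient so its L2 norm is at most max_norm."""
    n = norm(g)
    return list(g) if n <= max_norm else [x * max_norm / n for x in g]

def make_quadratic_reward(target):
    """Hypothetical toy reward -||x - target||^2 with its analytic gradient."""
    def value(x):
        return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))
    def grad(x):
        return [-2.0 * (xi - ti) for xi, ti in zip(x, target)]
    return value, grad

def reg_grad(x):
    """Gradient of a stand-in regularizer K(x) = -(||x|| - sqrt(d))^2,
    which pulls the noise norm toward that of a typical Gaussian sample."""
    n = norm(x)
    if n == 0.0:
        return [0.0] * len(x)
    coef = -2.0 * (n - math.sqrt(len(x))) / n
    return [coef * xi for xi in x]

def explore_and_optimize(rewards, dim, n_seeds=5, n_steps=20,
                         lr=0.05, max_norm=1.0, reg_weight=0.1, seed=0):
    """Sample n_seeds noises, run n_steps of gradient ascent on each,
    and return the candidate with the highest combined reward."""
    rng = random.Random(seed)

    def combined(x):
        return sum(value(x) for value, _ in rewards)

    best_x, best_score = None, -math.inf
    for _ in range(n_seeds):                       # exploration over seeds
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        score = combined(x)
        if score > best_score:
            best_x, best_score = list(x), score
        for _ in range(n_steps):                   # optimization of one seed
            step = [0.0] * dim
            for _, grad in rewards:                # clip each reward separately
                g = clip(grad(x), max_norm)
                step = [s + gi for s, gi in zip(step, g)]
            rg = reg_grad(x)
            x = [xi + lr * (s + reg_weight * r)
                 for xi, s, r in zip(x, step, rg)]
            score = combined(x)
            if score > best_score:
                best_x, best_score = list(x), score
    return best_x, best_score

# Toy usage: two rewards that prefer different targets.
toy_rewards = [make_quadratic_reward([1.0, 0.0, 0.0, 0.0]),
               make_quadratic_reward([0.0, 1.0, 0.0, 0.0])]
best_noise, best_reward = explore_and_optimize(toy_rewards, dim=4)
```

In this sketch the number of gradient updates is `n_seeds * n_steps`, mirroring the $N\times T$ compute scaling. Clipping each reward's gradient before summing keeps a single large-magnitude reward from dominating the update direction.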
Abstract
Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.
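One way to picture reward selection grounded in correlation with human judgments is the sketch below. It is an assumption-laden illustration, not the procedure from the paper: the use of Pearson correlation, the top-k rule, the metric names `metric_a` and `metric_b`, and all numbers are hypothetical placeholders rather than reported data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_rewards(human, metric_scores, top_k=1):
    """For each prompt category, keep the top_k metrics whose scores
    correlate best with the human ratings for that category.

    human:         {category: [rating per image]}
    metric_scores: {metric: {category: [score per image]}}
    """
    selection = {}
    for category, ratings in human.items():
        ranked = sorted(
            metric_scores,
            key=lambda m: pearson(metric_scores[m][category], ratings),
            reverse=True,
        )
        selection[category] = ranked[:top_k]
    return selection

# Made-up ratings and scores for five images per category (illustrative only).
human = {"color": [1, 2, 3, 4, 5], "spatial": [5, 3, 4, 1, 2]}
metric_scores = {
    "metric_a": {"color": [0.1, 0.2, 0.3, 0.4, 0.5],
                 "spatial": [0.5, 0.5, 0.4, 0.6, 0.5]},
    "metric_b": {"color": [0.5, 0.1, 0.4, 0.2, 0.3],
                 "spatial": [0.9, 0.6, 0.7, 0.2, 0.3]},
}
selection = select_rewards(human, metric_scores, top_k=1)
```

With these placeholder numbers, each category ends up with a different metric, which is the intuition behind making the composite reward category-aware rather than relying on a single metric or an ad-hoc mix.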
