GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Bozhou Li; Sihan Yang; Yushuo Guan; Ruichuan An; Xinlong Chen; Yang Shi; Pengfei Wan; Wentao Zhang; Yuanxing Zhang

TL;DR

The authors address the bottleneck of evaluating and adapting text encoders for diffusion-based generation by introducing TED-6K, a text-only benchmark, and GRAN-TED, a two-stage training paradigm that specializes a multimodal LLM for visual synthesis. TED-6K provides a robust, cost-efficient proxy to predict downstream T2I/T2V performance, using a sentence-level context aggregator to map text representations to a single conditioning vector across diverse architectures. GRAN-TED combines targeted fine-tuning on visual data with a learnable, layer-wise feature weighting mechanism to fuse hierarchical encoder features, stabilized by freezing the weights after an initial joint optimization phase. Experiments show GRAN-TED achieves state-of-the-art results on TED-6K, demonstrates strong correlations with downstream generation quality, and delivers measurable gains in both text-to-image and text-to-video tasks, offering a practical path to robust, aligned, and nuanced text embeddings for diffusion models.
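To make the layer-wise weighting idea concrete, here is a minimal pure-Python sketch of fusing per-layer token features with learnable weights that are frozen after an initial phase. The summary does not specify the exact formulation, so the softmax normalization, the plain gradient step, and the names `LayerWeights` and `fuse_layers` are all illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LayerWeights:
    """One learnable scalar logit per encoder layer (hypothetical helper)."""

    def __init__(self, num_layers: int) -> None:
        self.logits = [0.0] * num_layers  # uniform weighting at initialization
        self.frozen = False

    def step(self, grads: Sequence[float], lr: float) -> None:
        """Plain gradient-descent update; a no-op once the weights are frozen."""
        if self.frozen:
            return
        self.logits = [w - lr * g for w, g in zip(self.logits, grads)]

    def freeze(self) -> None:
        """Stop updating the weights, e.g. after the initial joint phase."""
        self.frozen = True


def fuse_layers(
    layer_features: Sequence[Sequence[Sequence[float]]],
    layer_logits: Sequence[float],
) -> List[List[float]]:
    """Fuse features indexed as layer_features[layer][token][dim].

    Assumption: the fused feature is a softmax-weighted sum over layers.
    """
    if len(layer_features) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    num_tokens = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for weight, feats in zip(weights, layer_features):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += weight * feats[t][d]
    return fused
```

With equal logits the fusion reduces to a plain average over layers. Once `freeze()` is called, later `step` calls leave the weights untouched, which mirrors the stabilization step described in the summary.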

Abstract

The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
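The claim that a benchmark score predicts downstream quality can be checked by correlating the two quantities across a set of encoders. The sketch below computes a Pearson coefficient in pure Python. The abstract does not say which correlation statistic is used, so the choice of Pearson and the function name `pearson` are assumptions made for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between paired observations.

    Here xs would hold one benchmark score per encoder and ys the matching
    downstream generation metric for the same encoders.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would indicate that encoders ranked higher by the benchmark also tend to score higher downstream. With only a handful of encoders the estimate is noisy, so a rank-based statistic such as Spearman's may be a sensible complement.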
