GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang

TL;DR

The authors address the bottleneck of evaluating and adapting text encoders for diffusion-based generation by introducing TED-6K, a text-only benchmark, and GRAN-TED, a two-stage training paradigm that specializes a multimodal LLM for visual synthesis. TED-6K provides a robust, cost-efficient proxy to predict downstream T2I/T2V performance, using a sentence-level context aggregator to map text representations to a single conditioning vector across diverse architectures. GRAN-TED combines targeted fine-tuning on visual data with a learnable, layer-wise feature weighting mechanism to fuse hierarchical encoder features, stabilized by freezing the weights after an initial joint optimization phase. Experiments show GRAN-TED achieves state-of-the-art results on TED-6K, demonstrates strong correlations with downstream generation quality, and delivers measurable gains in both text-to-image and text-to-video tasks, offering a practical path to robust, aligned, and nuanced text embeddings for diffusion models.

Abstract

The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
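The layer-wise weighting idea can be illustrated with a minimal sketch. It assumes one learnable scalar logit per encoder layer, Softmax normalization of those logits (as the Figure 4 caption indicates), and a weighted sum of per-layer hidden states; the freeze step mirrors the TL;DR's description of freezing the weights after an initial joint optimization phase. The names `LayerWeights`, `fuse_layers`, `step`, and `freeze` are hypothetical and are not taken from the authors' code, and plain Python lists stand in for tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LayerWeights:
    """Hypothetical container: one learnable logit per encoder layer."""

    def __init__(self, num_layers):
        # Zero logits give uniform weights after softmax.
        self.logits = [0.0] * num_layers
        self.frozen = False

    def step(self, grads, lr):
        """Plain gradient-descent update; a no-op once frozen."""
        if self.frozen:
            return
        self.logits = [w - lr * g for w, g in zip(self.logits, grads)]

    def freeze(self):
        """Assumed second step: stop updating the layer weights."""
        self.frozen = True

    def weights(self):
        return softmax(self.logits)


def fuse_layers(layer_features, layer_weights):
    """Weighted sum over layers.

    layer_features: list over layers, each a [tokens][dim] nested list.
    Returns a single [tokens][dim] nested list.
    """
    weights = layer_weights.weights()
    if len(layer_features) != len(weights):
        raise ValueError("need one feature map per layer weight")
    num_tokens = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for w, feats in zip(weights, layer_features):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += w * feats[t][d]
    return fused
```

In a real training setup the logits would be a framework parameter updated by the optimizer alongside the diffusion model, and freezing would amount to disabling gradients for that parameter; the sketch only shows the arithmetic of the fusion and the effect of the freeze.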

Paper Structure

This paper contains 33 sections, 13 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: An overview of our complete framework, integrating our evaluation and development pipelines. Figure (a) illustrates the TED Evaluation Framework, consisting of the TED-6K benchmark and a context aggregator to assess the representational capabilities of text encoders. Figure (b) shows the construction of our TED Encoder, where Qwen3-VL-8B-Instruct is fine-tuned on a curated VQA and captioning dataset to specialize the MLLM. Figure (c) depicts our final GRAN-TED solution, which incorporates a learnable layer-wise weighting module to generate GRAN-TED for diffusion models.
  • Figure 2: Left. The data construction pipeline for TED-6K, consisting of four stages: (1) Data Curation and Filtering; (2) Base Caption Generation; (3) Semantic Pair Construction; (4) Human Verification. Right. The data composition of the TED-6K dataset.
  • Figure 3: The context aggregator architecture and its training and inference processes. (a) The training process of the context aggregator. (b) The inference process during evaluation on TED-6K.
  • Figure 4: Dynamics of the learnable layer weights over the course of continuous training (i.e., without the two-step strategy). The weight values shown are normalized via Softmax.
  • Figure 5: Examples of visual question answering (VQA) training samples, covering both image-based and video-based settings.
  • ...and 8 more figures