
TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers

Yihua Liu, Fanjiang Ye, Bowen Lin, Rongyu Fang, Chengming Zhang

TL;DR

This work identifies the core factor behind prompt information loss, introduces a text anchoring mechanism to correct the imbalance between text and image tokens, and proposes TIDE, a training-free text-to-image extrapolation method that enables generation at arbitrary resolutions and aspect ratios without additional sampling overhead.

Abstract

Diffusion Transformers (DiTs) struggle to generate images at resolutions higher than their training resolution; in particular, attention dilution causes structural degradation. Previous approaches attempt to mitigate this by sharpening attention distributions, but they fail to preserve fine-grained semantic details and introduce obvious artifacts. In this work, we analyze the characteristics of DiTs and propose TIDE, a training-free text-to-image (T2I) extrapolation method that enables generation at arbitrary resolutions and aspect ratios without additional sampling overhead. We identify the core factor behind prompt information loss and introduce a text anchoring mechanism to correct the imbalance between text and image tokens. To further eliminate artifacts, we design a dynamic temperature control mechanism that leverages the pattern of spectral progression in the diffusion process. Extensive evaluations demonstrate that TIDE delivers high-quality resolution extrapolation and integrates seamlessly with existing state-of-the-art methods.
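The imbalance between text and image tokens described above can be illustrated with a toy calculation. The sketch below is not the paper's formulation: it assumes every key has the same raw attention logit, so only token counts matter, and it models anchoring as a hypothetical additive bias on the text-key logits. The token counts (`NUM_TEXT`, `NATIVE_TOKENS`, `TARGET_TOKENS`) and the function name `text_attention_mass` are illustrative assumptions.

```python
import numpy as np


def text_attention_mass(num_text, num_image, text_bias=0.0):
    """Fraction of one image query's softmax attention that lands on text keys."""
    # Toy assumption: all keys share the same raw logit (0), so only counts matter.
    logits = np.zeros(num_text + num_image)
    # Hypothetical anchoring: an additive boost applied to the text-key logits only.
    logits[:num_text] += text_bias
    # Numerically stable softmax over the joint text + image key sequence.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(weights[:num_text].sum())


# Illustrative token counts (assumptions, not taken from the paper).
NUM_TEXT = 512
NATIVE_TOKENS = 64 * 64      # e.g. a 1024 x 1024 image at 16 pixels per token
TARGET_TOKENS = 256 * 256    # e.g. a 4096 x 4096 image at 16 pixels per token

native = text_attention_mass(NUM_TEXT, NATIVE_TOKENS)
diluted = text_attention_mass(NUM_TEXT, TARGET_TOKENS)

# A bias of log(target / native) restores the native text share in this toy model.
bias = np.log(TARGET_TOKENS / NATIVE_TOKENS)
anchored = text_attention_mass(NUM_TEXT, TARGET_TOKENS, text_bias=bias)

print(f"native:   {native:.4f}")
print(f"diluted:  {diluted:.4f}")
print(f"anchored: {anchored:.4f}")
```

In this simplified setting, the share of attention on text tokens falls as the image token count grows, and a logit bias on the text keys brings it back to the native-resolution level. Real attention logits are not uniform, so the numbers only indicate the direction of the effect.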

Paper Structure

This paper contains 21 sections, 21 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Collage of multi-resolution results generated by TIDE. Prompts are from DrawBench and Aesthetic-4K. Zoom in for details.
  • Figure 2: Overview of our proposed TIDE framework. (a) Visual comparison at $4096 \times 4096$ resolution. While existing methods (left) suffer from information loss and repetitive artifacts, our approach (right) preserves prompt fidelity and generates realistic details. (b) Dynamic Temperature Control method, which dynamically adjusts the attention temperature across time-steps to eliminate high-frequency artifacts. (c) Text Anchoring mechanism, which counteracts attention dilution by reinforcing the cross-attention scores between image queries and text keys, recovering the influence of text tokens.
  • Figure 3: Visual comparison of subject vanishing at 4K resolution. We compare four extrapolation baselines at $4096 \times 4096$ against the native resolution of $1024 \times 1024$. While the model produces coherent subjects at native resolution, every baseline except YaRN suffers from severe subject vanishing, and YaRN's content richness also declines noticeably.
  • Figure 4: Visualization of text token influence decay. We visualize the spatial influence of text prompts at the early sampling stage, revealing that text guidance influence diminishes as the target resolution scales to $4096 \times 4096$. Specifically, Direct Extrapolation, NTK-Aware Interpolation and NTK-by-Parts Interpolation exhibit a near-total loss of spatial influence, providing a mechanistic explanation for the subject vanishing issue. YaRN only partially recovers token influence in the central region, resulting in suboptimal generation quality.
  • Figure 5: Qualitative comparison of TIDE with baseline methods. The input prompts correspond to the rows: 2K Prompt 1: "A Moai statue gazes upward against a starry night sky, with a colorful sunset illuminating the horizon."; 2K Prompt 2: "New York Skyline with 'Diffusion' written with fireworks on the sky."; 4K Prompt 1: "A futuristic spaceship flies over a lush landscape featuring rocky cliffs and a settlement with domed buildings near a serene body of water surrounded by greenery."; 4K Prompt 2: "Narrow cobblestone alley featuring colorful buildings with green and orange facades, potted plants lining the street, and a puddle reflecting the scene."
  • ...and 7 more figures