CHAI: CacHe Attention Inference for text2video

Joel Mathew Cherian; Ashutosh Muralidhara Bharadwaj; Vima Gupta; Anand Padmanabha Iyer

TL;DR

This work introduces Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents and shows that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps.

Abstract

Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x to 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.
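To make the idea of attending to cached latents concrete, the following is a minimal single-head sketch of cross-attention in which queries come from the current latent and keys and values come from a cached latent. It is an illustration under stated assumptions, not the paper's implementation: the function name `cache_attention`, the projection matrices, and all shapes are hypothetical, and NumPy stands in for the actual model code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cache_attention(current, cached, w_q, w_k, w_v):
    """Hypothetical sketch: queries from the current latent tokens,
    keys and values from cached latent tokens (single head, no batching)."""
    q = current @ w_q                        # (n_current, d)
    k = cached @ w_k                         # (n_cached, d)
    v = cached @ w_v                         # (n_cached, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

# Toy shapes chosen for illustration only.
rng = np.random.default_rng(0)
d = 16
current = rng.standard_normal((6, d))    # tokens of the latent being denoised
cached = rng.standard_normal((10, d))    # tokens of a latent from an earlier inference
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, weights = cache_attention(current, cached, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each output token is a weighted mix of cached tokens, so regions of the current latent can draw on whichever cached regions they match most closely.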

Paper Structure (22 sections, 10 figures)

Introduction
Preliminaries
Diffusion Models for Text-to-Video
Intra-Inference Caching
AdaCache
Cross-Inference Caching for Text-to-Video Diffusion Models
Methodology
Entity-Level Similarity
Cache Attention
Chai Design
Evaluation
Video Quality and Latency Evaluation
Performance under constrained scenarios
Storage Overhead and Cache Management
Intra-Inference vs. Cross-Inference Caching
...and 7 more sections

Figures (10)

Figure 1: Feature distance between latents produced by adjacent denoising steps in a single text-to-video inference. The highlighted region indicates steps that intra-inference caching approaches skip because adjacent latents differ very little.
Figure 2: Cache hit rate (%) vs. cache size on 2000 unseen VidProM prompts. Cached and unseen prompts show little overall similarity, but they share common entities and thus achieve a higher entity-similarity-based cache hit rate.
Figure 3: OpenSora STDiT block with the new Cache Attention layer capable of leveraging cross-inference entity reuse. When a cache hit occurs, the Cache Attention layer uses the latent cache as input to the key and value vectors to accelerate inference.
Figure 4: During cross-inference reuse, the STDiT blocks are scheduled to use the latent cache for the first block in the 2nd, 3rd, and 4th denoising steps.
Figure 5: Chai design. Each new prompt is compared against previous ones. On a cache hit, it retrieves and reuses stored latents to enable faster inference with fewer denoising steps. On a cache miss, it performs full inference and caches new latents for future use. The cache policy engine manages storage by evicting older latents once the cache exceeds a set limit.
...and 5 more figures
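The hit/miss flow described in the Figure 5 caption can be sketched roughly as follows. This is a toy model under loud assumptions, not the CHAI system: `extract_entities` is a stand-in that treats non-stopword tokens as entities, `LatentCache` and `generate` are hypothetical names, eviction is plain oldest-first, latents are placeholder strings, and the full-inference step count of 30 is an assumed default. Only the 8-step figure on a cache hit comes from the abstract.

```python
from collections import OrderedDict

STOPWORDS = {"a", "an", "the", "in", "on", "of", "at", "with", "and"}

def extract_entities(prompt):
    # Stand-in for a real entity extractor: keep non-stopword tokens.
    return frozenset(w for w in prompt.lower().split() if w not in STOPWORDS)

class LatentCache:
    def __init__(self, max_entries, min_shared=1):
        self.max_entries = max_entries
        self.min_shared = min_shared
        self.entries = OrderedDict()  # prompt -> (entities, latents), oldest first

    def lookup(self, prompt):
        # Return the latents of the cached prompt sharing the most entities.
        entities = extract_entities(prompt)
        best, best_overlap = None, 0
        for cached_entities, latents in self.entries.values():
            overlap = len(entities & cached_entities)
            if overlap > best_overlap:
                best, best_overlap = latents, overlap
        return best if best_overlap >= self.min_shared else None

    def insert(self, prompt, latents):
        self.entries[prompt] = (extract_entities(prompt), latents)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the oldest entry

def generate(prompt, cache, full_steps=30, reduced_steps=8):
    cached = cache.lookup(prompt)
    if cached is not None:
        # Cache hit: reuse stored latents and run fewer denoising steps.
        return {"hit": True, "steps": reduced_steps}
    # Cache miss: run full inference (placeholder) and store the latents.
    latents = f"latents({prompt})"
    cache.insert(prompt, latents)
    return {"hit": False, "steps": full_steps}

cache = LatentCache(max_entries=2)
r1 = generate("a dog running on the beach", cache)  # miss, cached
r2 = generate("a dog playing in the park", cache)   # hit, shares "dog"
r3 = generate("city skyline at night", cache)       # miss, cached
r4 = generate("rocket launch", cache)               # miss, evicts the oldest prompt
print(r1, r2, r3, r4, list(cache.entries))
```

The second prompt differs from the first as a whole sentence but shares the entity "dog", which mirrors the observation in the Figure 2 caption that entity-level matching yields more hits than overall prompt similarity.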
