Gencho: Room Impulse Response Generation from Reverberant Speech and Text via Diffusion Transformers

Jackie Lin; Jiaqi Su; Nishit Anand; Zeyu Jin; Minje Kim; Paris Smaragdis

TL;DR

Gencho addresses blind room impulse response (RIR) estimation by training a diffusion transformer that generates complex RIR spectrograms conditioned on a structure-aware encoding of the reverberant input. A two-channel encoder isolates early reflections from the late tail to provide robust conditioning, while the diffusion decoder yields diverse, perceptually realistic RIRs. The approach generalizes better than non-generative baselines and extends to text-conditioned RIR generation, enabling semantic control of acoustic environments. The framework supports flexible acoustic matching, RIR completion, and multi-modal acoustic simulation in real-world pipelines.

Abstract

Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages isolation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show richer generated RIRs than non-generative baselines while maintaining strong performance in standard RIR metrics. We further demonstrate its application to text-conditioned RIR generation, highlighting Gencho's versatility for controllable acoustic simulation and generative audio tasks.
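To make the two ideas in the abstract concrete (separating early from late reflections, and using an RIR for acoustic matching), here is a minimal NumPy sketch. It is not the paper's encoder or pipeline. The 50 ms boundary after the direct-path peak, the toy exponentially decaying noise RIR, and the function names `synth_rir`, `split_early_late`, and `apply_rir` are all assumptions chosen for illustration.

```python
import numpy as np

def synth_rir(sr=16000, t60=0.5, length_s=1.0, seed=0):
    """Toy RIR (assumption): unit direct path plus exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    n = int(sr * length_s)
    t = np.arange(n) / sr
    # Amplitude envelope that drops by 60 dB (a factor of 10**-3) after t60 seconds.
    envelope = 10.0 ** (-3.0 * t / t60)
    rir = 0.1 * rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct path
    return rir

def split_early_late(rir, sr, boundary_ms=50.0):
    """Split an RIR at a fixed offset after its strongest peak.

    The 50 ms default is a common convention, not a value taken from the paper.
    """
    onset = int(np.argmax(np.abs(rir)))
    cut = onset + int(sr * boundary_ms / 1000.0)
    early = np.zeros_like(rir)
    late = np.zeros_like(rir)
    early[:cut] = rir[:cut]
    late[cut:] = rir[cut:]
    return early, late

def apply_rir(dry, rir):
    """Reverberate a dry signal by linear convolution with the RIR, computed via FFT."""
    n = len(dry) + len(rir) - 1
    nfft = 1 << (n - 1).bit_length()  # next power of two >= n
    wet = np.fft.irfft(np.fft.rfft(dry, nfft) * np.fft.rfft(rir, nfft), nfft)
    return wet[:n]
```

Because convolution is linear, reverberating a signal with the early and late parts separately and summing the results gives the same output as using the full RIR. That is one reason the two parts can be treated as separate pieces of information about the room.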

Paper Structure (11 sections, 3 figures, 1 table)

This paper contains 11 sections, 3 figures, 1 table.

Introduction
Method
Blind Room Impulse Response Estimation
Reverberation Structure-Aware Audio Encoder
Diffusion-based Generative Decoder
Experiments
Experiment Setup
Datasets
Results and Discussion
Text-to-RIR Generation
Conclusion

Figures (3)

Figure 1: (a) Model architecture of Gencho, the proposed generative estimator. (b) The non-generative FiNS-based baseline.
Figure 2: Distributions of T60 vs. EDT and T60 vs. DRR for the evaluation samples, our generated samples, and the FiNS layernorm generated samples.
Figure 3: Violin plots of the DRR and T60 of generated RIRs (y-axis) for short text prompts perceptually related to reverberance (x-axis).
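
The captions above refer to three standard RIR descriptors: reverberation time (T60), early decay time (EDT), and direct-to-reverberant ratio (DRR). The sketch below shows one conventional way to compute them from a single RIR using Schroeder backward integration. It is an illustration only. The fit ranges (-5 to -35 dB for T60, 0 to -10 dB for EDT), the 2.5 ms direct-path window, and the function names are assumptions, and the paper's evaluation code may differ.

```python
import numpy as np

def energy_decay_curve_db(rir):
    """Schroeder backward integration, normalized so the curve starts at 0 dB."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(np.maximum(energy / energy[0], 1e-12))

def decay_time(edc_db, sr, start_db, end_db):
    """Fit a line to the decay curve between two levels and extrapolate to -60 dB."""
    idx = np.where((edc_db <= start_db) & (edc_db >= end_db))[0]
    slope, _ = np.polyfit(idx / sr, edc_db[idx], 1)  # slope in dB per second
    return -60.0 / slope

def t60(rir, sr):
    # Assumed convention: fit from -5 dB to -35 dB, then extrapolate to 60 dB.
    return decay_time(energy_decay_curve_db(rir), sr, -5.0, -35.0)

def edt(rir, sr):
    # Assumed convention: fit the first 10 dB of decay, then extrapolate to 60 dB.
    return decay_time(energy_decay_curve_db(rir), sr, 0.0, -10.0)

def drr_db(rir, sr, window_ms=2.5):
    """Energy in a short window around the strongest peak over all remaining energy, in dB."""
    peak = int(np.argmax(np.abs(rir)))
    half = int(sr * window_ms / 1000.0)
    lo, hi = max(0, peak - half), peak + half + 1
    direct = np.sum(rir[lo:hi] ** 2)
    reverb = np.sum(rir[:lo] ** 2) + np.sum(rir[hi:] ** 2)
    return 10.0 * np.log10(direct / (reverb + 1e-12))
```

For an ideal exponential decay, T60 and EDT coincide. Differences between the two in a measured or generated RIR indicate that the early part of the response decays at a different rate from the late tail.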
