The Disappearance of Timestep Embedding in Modern Time-Dependent Neural Networks
Bum Jun Kim, Yoshinobu Kawahara, Sang Woo Kim
TL;DR
The paper identifies a vulnerability in modern time-dependent neural networks: timestep embeddings can vanish under normalization, which erases time-awareness in NODE and diffusion-model architectures. It analyzes ConcatConv-based time injection in NODE and sinusoidal MLP-based embeddings in diffusion models, and shows that both reduce to a channel-wise scalar offset that normalization can cancel. The authors propose three remedies to keep the time-dependency alive: positional timestep embedding, zero bias initialization on the operand branch with a nonzero bias on the timestep branch, and a reduced number of GN groups. In NODE and diffusion-model experiments on CIFAR datasets, these strategies improve accuracy and generative metrics (e.g., FID/IS) without increasing the computational burden. The findings offer practical guidelines for improving time-awareness in modern time-dependent neural networks and challenge prevailing architectural choices.
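The cancellation of a channel-wise scalar offset can be illustrated with a small NumPy sketch. This is not the authors' code: `group_norm` is a simplified group normalization without learnable affine parameters, and the feature map, offsets, and the helper `time_gap` are hypothetical stand-ins chosen only for illustration. The sketch assumes the timestep enters the features purely as one scalar per channel.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Simplified group normalization (no affine parameters); x has shape (C, H, W)."""
    c, h, w = x.shape
    g = x.reshape(num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)  # statistics per group
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(c, h, w)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))    # hypothetical feature map with C = 8
offset_t1 = rng.normal(size=(8, 1, 1))   # channel-wise scalar offset at timestep t1
offset_t2 = rng.normal(size=(8, 1, 1))   # channel-wise scalar offset at timestep t2

def time_gap(num_groups):
    """Largest difference between the normalized outputs of the two timesteps."""
    a = group_norm(features + offset_t1, num_groups)
    b = group_norm(features + offset_t2, num_groups)
    return float(np.abs(a - b).max())

gap_per_channel = time_gap(8)  # one channel per group: mean subtraction removes the offset
gap_shared = time_gap(2)       # four channels per group: relative offsets survive

print(f"one channel per group:   {gap_per_channel:.2e}")
print(f"four channels per group: {gap_shared:.2e}")
```

With one channel per group, the per-group mean absorbs the whole offset, so the two timesteps become indistinguishable up to floating-point error. With several channels per group, only the group-wide mean of the offsets is removed, so differences between channels still carry timestep information. This is consistent with the remedy of reducing the number of GN groups.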
Abstract
Dynamical systems are often time-varying, and modeling them requires a function that evolves with respect to time. Recent studies, such as the neural ordinary differential equation, have proposed time-dependent neural networks, that is, neural networks that vary with respect to time. However, we claim that the architectural choices made when building a time-dependent neural network significantly affect its time-awareness, yet these choices still lack sufficient validation in their current state. In this study, we conduct an in-depth analysis of the architecture of modern time-dependent neural networks. We report a vulnerability, the vanishing timestep embedding, which disables the time-awareness of a time-dependent neural network. Furthermore, we find that this vulnerability can also be observed in diffusion models, because they employ a similar architecture that incorporates a timestep embedding to discriminate between different timesteps during the diffusion process. Our analysis provides a detailed description of this phenomenon as well as several solutions that address its root cause. In experiments on neural ordinary differential equations and diffusion models, we observed that keeping time-awareness alive via the proposed solutions boosted performance, which implies that current implementations lack sufficient time-dependency.
