The Disappearance of Timestep Embedding in Modern Time-Dependent Neural Networks

Bum Jun Kim; Yoshinobu Kawahara; Sang Woo Kim

TL;DR

The paper identifies a vulnerability in modern time-dependent neural networks: timestep embeddings can vanish under normalization, erasing time-awareness in neural ordinary differential equation (NODE) and diffusion-model architectures. It analyzes ConcatConv-based time injection in NODE and sinusoidal MLP-based embeddings in diffusion models, revealing that the embedding acts as a channel-wise scalar offset that the subsequent normalization can cancel. The authors propose three remedies to keep time-dependency alive: positional timestep embedding, zero bias initialization on the operand branch with nonzero bias for the timestep branch, and fewer groups in group normalization (GN). In NODE and diffusion-model experiments on CIFAR datasets, these strategies improve accuracy and generative metrics (e.g., FID/IS) without increasing computational burden. The findings offer practical guidelines for strengthening time-awareness in modern time-dependent neural networks and challenge prevailing architectural choices.
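The cancellation described above can be reproduced in a few lines. The following is a minimal NumPy sketch, not the authors' implementation: the `group_norm` helper, the tensor shapes, and the random embedding direction `v` are all assumptions made for illustration. It adds a channel-wise scalar offset `t * v` to a feature map and applies mean-std normalization, first with one channel per group and then with fewer groups.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Mean-std normalization over each group of channels; x has shape (C, H, W)."""
    c, h, w = x.shape
    g = x.reshape(num_groups, -1)  # one row per normalization unit
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(c, h, w)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))  # hypothetical feature map (C=8, H=W=4)
v = rng.normal(size=(8, 1, 1))         # hypothetical per-channel embedding direction

def block(t, num_groups):
    # Timestep enters only as a scalar offset that is constant within each channel.
    return group_norm(features + t * v, num_groups)

# One channel per group: the offset shifts the mean only, so it is subtracted away.
gap_per_channel = np.abs(block(0.1, 8) - block(0.9, 8)).max()

# Fewer groups: channels with different offsets share one unit, so t changes the output.
gap_two_groups = np.abs(block(0.1, 2) - block(0.9, 2)).max()

print(gap_per_channel, gap_two_groups)
```

In this toy setting the first gap is zero up to floating-point error, while the second is clearly nonzero, which is consistent with the proposal to reduce the number of GN groups.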

Abstract

Dynamical systems are often time-varying, and modeling them requires a function that evolves with respect to time. Recent studies such as the neural ordinary differential equation proposed a time-dependent neural network, that is, a neural network that varies with respect to time. However, we claim that the architectural choice made to build a time-dependent neural network significantly affects its time-awareness, yet this choice still lacks sufficient validation in its current state. In this study, we conduct an in-depth analysis of the architecture of modern time-dependent neural networks. We report a vulnerability, vanishing timestep embedding, which disables the time-awareness of a time-dependent neural network. Furthermore, we find that this vulnerability can also be observed in diffusion models because they employ a similar architecture that incorporates timestep embedding to discriminate between different timesteps during a diffusion process. Our analysis provides a detailed description of this phenomenon as well as several solutions that address the root cause. In experiments on neural ordinary differential equations and diffusion models, we observed that keeping time-awareness alive via the proposed solutions boosted performance, which implies that current implementations lack sufficient time-dependency.

Paper Structure (24 sections, 4 equations, 7 figures, 5 tables)

Introduction
Vanishing Timestep Embedding
Background: How NODE incorporates time-awareness
Problem statement: Timestep embedding is inherently prone to vanish
Solutions
Avoid channel-wise scalar offset: Positional timestep embedding
Control relative variance: Zero bias initialization in convolutions
Increase degree of freedom in normalization unit: Decrease the number of groups
Experiments
Experiments on NODE
Model
Hyperparameters
Do not use BN.
Positional timestep embedding improves performance.
Apply zero bias initialization to the convolution branch.
...and 9 more sections

Figures (7)

Figure 1: In the ConcatConv operation, applying a convolutional kernel $\mathbf{W}_{C+1}^k$ to $t\mathbf{J}$ is equivalent to using $t\mathbf{v}^k$, which has the same element at every spatial position.
Figure 2: Illustration of vanishing timestep embedding. An additive scalar offset is simply canceled out by the subsequent mean-std normalization.
Figure 3: To avoid a scalar offset, we should ensure that each normalization unit contains more than one element of timestep embedding in each channel, which would then not be canceled out by the subsequent normalization.
Figure 4: Injecting positional timestep embedding enables a spatial degree of freedom, which is not canceled out by the subsequent normalization.
Figure 5: Diffusion models compute sine and cosine from different frequencies and positions, which are fed to MLP to produce timestep embedding $\tilde{\mathbf{v}}_t$. We propose adding another branch to obtain positional timestep embedding $\tilde{\mathbf{p}}_t$ from the sinusoidal.
...and 2 more figures
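
The contrast drawn in the captions of Figures 2 and 4 can be illustrated with a small NumPy sketch. This is a toy example under stated assumptions, not the paper's architecture: `channel_norm`, the shapes, and the random `positional` tensor (standing in for a positional timestep embedding that varies over spatial positions) are hypothetical. It compares a spatially constant offset with a spatially varying one under per-channel mean-std normalization.

```python
import numpy as np

def channel_norm(x, eps=1e-5):
    """Per-channel mean-std normalization; x has shape (C, H, W)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))       # hypothetical feature map
scalar_offset = rng.normal(size=(8, 1, 1))  # one value per channel (Figure 2 setting)
positional = rng.normal(size=(8, 4, 4))     # varies over positions (Figure 4 setting)

def output(t, embedding):
    # Scale the embedding by the timestep, add it, then normalize.
    return channel_norm(features + t * embedding)

# Largest change in the normalized output when t moves from 0.1 to 0.9.
scalar_gap = np.abs(output(0.1, scalar_offset) - output(0.9, scalar_offset)).max()
positional_gap = np.abs(output(0.1, positional) - output(0.9, positional)).max()

print(scalar_gap, positional_gap)
```

Here the scalar offset leaves the normalized output unchanged across timesteps (up to floating-point error), whereas the spatially varying embedding does not, because it is not removed by subtracting the channel mean.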
