ResBit: Residual Bit Vector for Categorical Values

Masane Fuchi, Amar Zanashir, Hiroto Minami, Tomohiro Takagi

TL;DR

Residual Bit Vectors (ResBit) extend analog bits and overcome their limitations when applied to tabular data generation. The paper also reveals that multinomial diffusion faces the mode collapse phenomenon when the cardinality is high.

Abstract

One-hot vectors, a common method for representing discrete/categorical data in machine learning, are widely used because of their simplicity and intuitiveness. However, their dimensionality grows linearly with the number of categories, posing computational and memory challenges, especially for datasets containing numerous categories. In this paper, we focus on tabular data generation and reveal that multinomial diffusion faces the mode collapse phenomenon when the cardinality is high. Moreover, due to the limitations of one-hot vectors, training takes longer in such situations. To address these issues, we propose Residual Bit Vectors (ResBit), a technique for densely representing categorical data. ResBit is an extension of analog bits and overcomes the limitations of analog bits when applied to tabular data generation. Our experiments demonstrate that ResBit not only accelerates training but also maintains performance relative to the same methods without ResBit. Furthermore, our results indicate that many existing methods struggle with high-cardinality data, underscoring the need for lower-dimensional representations, such as ResBit and latent vectors.
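To make the dimensionality argument concrete, the following minimal sketch compares the width of a one-hot vector with that of a plain binary code in the style of analog bits. It is not the ResBit construction itself, which this abstract does not specify; the function names (`one_hot_dim`, `bit_dim`, `encode_bits`, `decode_bits`) and the {-1, +1} scaling are illustrative assumptions. It also shows why a plain bit code can decode to an index that does not correspond to any category.

```python
# Illustrative sketch only: a plain binary code in the style of analog bits,
# NOT the ResBit construction. All names here are hypothetical.

def one_hot_dim(num_classes: int) -> int:
    """A one-hot vector needs one dimension per category."""
    return num_classes

def bit_dim(num_classes: int) -> int:
    """Number of bits needed to index num_classes categories (at least 1)."""
    return max(1, (num_classes - 1).bit_length())

def encode_bits(index: int, n_bits: int) -> list[float]:
    """Encode an index as bits, most significant first, scaled to {-1, +1}."""
    bits = [(index >> i) & 1 for i in reversed(range(n_bits))]
    return [2.0 * b - 1.0 for b in bits]

def decode_bits(values: list[float]) -> int:
    """Threshold real-valued 'analog' bits at 0 and reassemble the index."""
    index = 0
    for v in values:
        index = (index << 1) | (1 if v > 0 else 0)
    return index

# One-hot width grows linearly; bit width grows logarithmically.
dims = {k: (one_hot_dim(k), bit_dim(k)) for k in (50, 1_000, 10**6)}

# With 50 categories (e.g. U.S. states), 6 bits span 64 codes,
# so some bit patterns decode to an index outside the valid range.
num_states = 50
n_bits = bit_dim(num_states)
unused_codes = 2**n_bits - num_states
out_of_range = decode_bits([1.0] * n_bits) >= num_states
```

In this sketch, a generated bit pattern that thresholds to all ones decodes to an index beyond the 50 valid categories, which illustrates the kind of limitation of analog bits that the abstract says ResBit is designed to overcome.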

Paper Structure (43 sections, 12 equations, 13 figures, 29 tables, 2 algorithms)


Figures (13)

  • Figure 1: Overview of the proposed ResBit (left) and application example to TabDDPM (right).
  • Figure 2: Plot of the total cardinalities of each dataset.
  • Figure 3: Overview of the "out of index" problem in the states of the U.S. example.
  • Figure 4: ResBit in the states of the U.S. example.
  • Figure 5: Comparison of the dimensionality between one-hot and ResBit when the number of classes is increased to $10^6$.
  • ...and 8 more figures