Modular Linear Tokenization (MLT)

Tcharlies Schmitz

TL;DR

This work tackles high-cardinality categorical representation by introducing Modular Linear Tokenization (MLT), a reversible and collision-free encoding built on base-$p$ digits and an invertible matrix over the finite field $\mathbb{Z}_{p}$. The method yields a bijective mapping $t = (M \cdot v) \bmod p$ with explicit dimensionality control via parameters $p$ and $n$, where $p^{n} > V$. It provides encoding/decoding algorithms, guidelines for choosing $p$ and $n$, and demonstrates significant output-cost reductions while maintaining competitive predictive performance, as shown on MovieLens 20M. The authors release an open-source implementation and show that MLT can approach the performance of supervised embeddings with far fewer learned parameters, offering a reproducible and scalable alternative for high-cardinality tokenization. This work opens avenues for hybrid architectures combining deterministic tokenization with learned compression and extends applicability to large-scale tabular and graph-based learning, where efficiency and auditability are crucial.
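The mapping $t = (M \cdot v) \bmod p$ can be illustrated with a minimal sketch in plain Python. This is not the `light-mlt` API. The function names, the least-significant-first digit order, and the example matrix are assumptions made for illustration, and $p$ is assumed prime so that every nonzero pivot has a modular inverse.

```python
from typing import List

Matrix = List[List[int]]


def to_digits(x: int, p: int, n: int) -> List[int]:
    """Base-p digits of x, least significant first, padded to length n (assumed order)."""
    if not 0 <= x < p ** n:
        raise ValueError("identifier out of range for the chosen p and n")
    digits = []
    for _ in range(n):
        x, r = divmod(x, p)
        digits.append(r)
    return digits


def from_digits(digits: List[int], p: int) -> int:
    """Inverse of to_digits: rebuild the integer from its base-p digits."""
    x = 0
    for d in reversed(digits):
        x = x * p + d
    return x


def mat_vec_mod(M: Matrix, v: List[int], p: int) -> List[int]:
    """Compute (M . v) mod p."""
    return [sum(m * x for m, x in zip(row, v)) % p for row in M]


def inverse_mod(M: Matrix, p: int) -> Matrix:
    """Gauss-Jordan inverse of M over Z_p (p assumed prime)."""
    n = len(M)
    # Augment M with the identity matrix.
    A = [[M[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible modulo p")
        A[col], A[pivot] = A[pivot], A[col]
        inv = pow(A[col][col], -1, p)  # modular inverse, Python 3.8+
        A[col] = [(a * inv) % p for a in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(a - f * b) % p for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]


def encode(x: int, M: Matrix, p: int) -> List[int]:
    """Identifier -> token vector t = (M . v) mod p."""
    return mat_vec_mod(M, to_digits(x, p, len(M)), p)


def decode(t: List[int], M_inv: Matrix, p: int) -> int:
    """Token vector -> identifier, via v = (M^-1 . t) mod p."""
    return from_digits(mat_vec_mod(M_inv, t, p), p)


# Hypothetical parameters: p = 7, n = 3, unit upper-triangular M (determinant 1).
p = 7
M = [[1, 2, 3], [0, 1, 4], [0, 0, 1]]
M_inv = inverse_mod(M, p)
token = encode(10, M, p)  # [5, 1, 0]
assert decode(token, M_inv, p) == 10
```

Because $M$ is invertible over $\mathbb{Z}_{p}$, distinct digit vectors map to distinct tokens, which is where the collision-free property comes from in this sketch.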

Abstract

This paper introduces Modular Linear Tokenization (MLT), a reversible and deterministic technique for encoding high-cardinality categorical identifiers into compact numerical vectors. Unlike traditional hashing or one-hot encodings, MLT preserves bijective mappings by leveraging modular arithmetic over finite fields and invertible linear transformations. The method offers explicit control of dimensionality and computational scalability while maintaining full reversibility, even for millions of identifiers. Experimental results on the MovieLens 20M dataset show that MLT achieves comparable predictive performance to supervised embeddings while requiring significantly fewer parameters and lower training cost. An open-source implementation of MLT is available on PyPI (https://pypi.org/project/light-mlt/) and GitHub (https://github.com/tcharliesschmitz/light-mlt).
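To make the dimensionality control concrete, the following sketch computes the smallest vector length $n$ satisfying the constraint $p^{n} > V$ for a given modulus $p$. The helper name `min_dimension` and the example values are hypothetical and are not taken from the paper or from `light-mlt`.

```python
def min_dimension(V: int, p: int) -> int:
    """Smallest n with p**n > V, computed with exact integer arithmetic."""
    if V < 0 or p < 2:
        raise ValueError("expected V >= 0 and p >= 2")
    n, capacity = 1, p
    while capacity <= V:
        capacity *= p
        n += 1
    return n


# Hypothetical example: one million identifiers.
print(min_dimension(1_000_000, 31))   # 5, since 31**4 = 923521 <= 1000000 < 31**5
print(min_dimension(1_000_000, 101))  # 3, since 101**3 = 1030301 > 1000000
```

A larger $p$ shortens the token vector at the price of a wider value range per component, which is the trade-off the two parameters expose.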

Paper Structure

This paper contains 5 sections, 4 equations, 2 tables.