
AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions

Michael A. Alcorn

TL;DR

AQuaMaM tackles the challenge of learning complex, multimodal distributions on the rotation group $\text{SO}(3)$ by proposing an autoregressive "quaternion language model" implemented on a Transformer. By representing rotations through projected unit quaternions and modeling each component as a mixture of uniform distributions over bins, the model enforces the unit-norm constraint and enables exact likelihoods in a single forward pass, avoiding the slow IPDF inference that requires many MLP evaluations. The approach reframes density learning as a three-token language modeling task, yielding a cubic scaling in the number of bins and enabling parameter-efficient, fast predictions. Empirical results on toy and die datasets show that AQuaMaM achieves higher log-likelihoods, more accurate sampling aligned with the true distributions, and substantially higher throughput (e.g., 52× faster on a single GPU) compared to IPDF, with strong generalization to multimodal pose uncertainty. Overall, AQuaMaM offers a scalable, precise method for rapid, high-fidelity $\text{SO}(3)$ distribution estimation with broad applicability to 3D pose reasoning and related domains.
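To make the "three-token" framing concrete, here is a minimal sketch of how a rotation's unit quaternion might be turned into three bin labels. It is an illustration under stated assumptions, not the author's code: the function names (`canonicalize`, `to_bin`, `tokenize`, `recover_q_w`) are hypothetical, and equal-width bins over $[-1, 1]$ are assumed for simplicity.

```python
import math

def canonicalize(q):
    """Normalize q = (q_x, q_y, q_z, q_w) and flip its sign so that q_w >= 0.

    q and -q encode the same rotation, so this loses no information.
    """
    norm = math.sqrt(sum(c * c for c in q))
    q = tuple(c / norm for c in q)
    return tuple(-c for c in q) if q[3] < 0 else q

def to_bin(value, n_bins):
    """Map a value in [-1, 1] to one of n_bins labels (assumed equal-width bins)."""
    idx = int((value + 1.0) / 2.0 * n_bins)
    return min(max(idx, 0), n_bins - 1)

def tokenize(q, n_bins):
    """Return the three labels for q_x, q_y, q_z; q_w is not modeled."""
    q_x, q_y, q_z, _ = canonicalize(q)
    return tuple(to_bin(c, n_bins) for c in (q_x, q_y, q_z))

def recover_q_w(q_x, q_y, q_z):
    """With q_w >= 0 and unit norm, q_w is determined by the other components."""
    return math.sqrt(max(0.0, 1.0 - q_x * q_x - q_y * q_y - q_z * q_z))
```

With $N$ bins per component there are $N^3$ possible label triples, which is where the cubic scaling mentioned above comes from.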

Abstract

Accurately modeling complex, multimodal distributions for rotations in three dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires $N$ forward passes through the network's final multilayer perceptron (where $N$ places an upper bound on the likelihood that can be calculated by the model), which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically restricted domain of values. When trained on an "infinite" toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. When trained on a constructed dataset of 500,000 renders of a die in different rotations, AQuaMaM reaches a test log-likelihood 14% higher than IPDF. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52$\times$ faster on a single GPU, and converges in a similar amount of time during training.
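The "geometrically restricted domain" can be illustrated for a single conditional, $p(q_{y}|q_{x})$. The sketch below is a simplified illustration rather than the paper's implementation: the names `bin_edges`, `clipped_bins`, and `conditional_density` are hypothetical, the bins are assumed to be equal-width over $[-1, 1]$, and `scores` stands in for the network's per-bin outputs. Bins lying entirely outside $|q_{y}| \le \sqrt{1 - q_{x}^{2}}$ receive a score of $-\infty$, and partially legal bins are clipped before the uniform density is computed.

```python
import math

def bin_edges(n_bins):
    # Assumption: equal-width bins [a_i, b_i) covering [-1, 1].
    width = 2.0 / n_bins
    return [(-1.0 + i * width, -1.0 + (i + 1) * width) for i in range(n_bins)]

def clipped_bins(edges, bound):
    # Clip each bin to [-bound, bound]; None marks a bin with no legal part.
    clipped = []
    for a, b in edges:
        lo, hi = max(a, -bound), min(b, bound)
        clipped.append((lo, hi) if hi > lo else None)
    return clipped

def conditional_density(q_y, q_x, scores):
    """Density of q_y given q_x under a mixture of uniforms over legal bins."""
    bound = math.sqrt(max(0.0, 1.0 - q_x * q_x))  # unit norm: |q_y| <= bound
    bins = clipped_bins(bin_edges(len(scores)), bound)
    # Illegal bins get a score of -inf, so their mixture weight is zero.
    masked = [s if b is not None else -math.inf for s, b in zip(scores, bins)]
    top = max(masked)
    if top == -math.inf:  # degenerate case: no legal bin (|q_x| == 1)
        return 0.0
    weights = [math.exp(s - top) for s in masked]  # softmax numerators
    total = sum(weights)
    for w, b in zip(weights, bins):
        if b is not None and b[0] <= q_y < b[1]:
            # Mixture proportion divided by the (clipped) bin width.
            return (w / total) / (b[1] - b[0])
    return 0.0
```

For example, with ten bins and $q_{x} = 0.7$, the two outermost bins are masked out and the bins touching $\pm\sqrt{1 - 0.7^{2}}$ are narrowed, so the density still integrates to one over the legal interval.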

Paper Structure (28 sections, 21 equations, 11 figures)


Figures (11)

  • Figure 1: When minimizing the unimodal Bingham loss for the two rotations ${\bm{R}}_{1}$ and ${\bm{R}}_{2}$, the maximum likelihood estimate $\widehat{{\bm{R}}}$ is a rotation that was never observed in the dataset. Note, the die images are for demonstration purposes only, i.e., no images were used during optimization. ${\bm{R}}_{0}$ is the identity rotation.
  • Figure 2: Because there is a bijective mapping between the unit disk $B^{2} = \{{\bm{u}} \in \mathbb{R}^{2}: \lVert {\bm{u}} \rVert < 1\}$ and the unit hemisphere $\widetilde{S}^{2} = \{{\bm{v}} \in \mathbb{R}^{3}: \lVert {\bm{v}} \rVert = 1, z > 0\}$, the challenging task of estimating a distribution on the curved $\widetilde{S}^{2}$ manifold can be simplified to estimating a distribution on the non-curved $B^{2}$. (a) Here, the true distribution on $\widetilde{S}^{2}$ is a uniform distribution, i.e., each point has a density of $\frac{1}{2\pi}$ (because $\widetilde{S}^{2}$ has a surface area of $2\pi$). (b) Points that are uniformly sampled from $\widetilde{S}^{2}$ and then projected onto $B^{2}$ are more concentrated towards the edges of $B^{2}$ due to the curvature of $\widetilde{S}^{2}$. If we model the distribution of $(x, y)$ coordinates on $B^{2}$ as a mixture of uniform distributions, we can calculate $p(x, y, z)$ by dividing $p(x, y)$ by the area of the parallelogram defined by the Jacobian located at $(x, y, z)$ on the hemisphere. (c) The $p(x, y, z)$ calculated through this procedure are generally quite close to the expected density. The mean density $\mu$ of the 1,000 points shown in (c) is 0.154 (compared to 0.159 for the true density). A similar procedure is used by AQuaMaM to obtain the probability of a unit quaternion $p({\bm{q}})$ while only modeling the first three components of ${\bm{q}}$: $q_{x}$, $q_{y}$, and $q_{z}$. See Section \ref{sec:hemisphere} for additional details on the $\widetilde{S}^{2}$ example.
  • Figure 3: When modeling the conditional distribution $p(q_{y}|q_{x})$ as a mixture of uniform distributions, the geometric constraints of the unit quaternion are easily enforced. Here, I focus on non-negative bins for clarity, i.e., intervals $[a_{i}, b_{i})$ where $0 \le a_{i} < b_{i} \le 1$, but the same logic applies to negative bins. Given $q_{x} = 0.7$, we know that $|q_{y}| \le \sqrt{1 - 0.7^{2}}$ because ${\bm{q}}$ has a unit norm. As a result, the mixture proportion $\pi_{i}$ for any bin where $\sqrt{1 - 0.7^{2}} < a_{i}$ must be zero. AQuaMaM enforces this constraint by assigning a value of $-\infty$ to the output scores for "strictly illegal bins" during training. For the remaining bins, the corresponding uniform distribution is $\mathcal{U}(q_{y}; a_{i}, \hat{b}_{i})$ where $\hat{b}_{i} = \min(\sqrt{1 - 0.7^{2}}, b_{i})$, i.e., the upper bound of the uniform distribution for the partially legal bin is reduced to $\sqrt{1 - 0.7^{2}}$.
  • Figure 4: An overview of the AQuaMaM architecture. Given an image/rotation matrix pair $({\bm{\mathsfit{X}}}, {\bm{R}})$, the image is first converted into a sequence of $P$ patch embeddings while the rotation matrix is converted into its unit quaternion representation ${\bm{q}} = [q_{x}, q_{y}, q_{z}, q_{w}]$. By restricting the unit quaternions to those with positive real components (which is allowed because ${\bm{q}}$ and $-{\bm{q}}$ encode the same rotation), $q_{w}$ becomes fully determined and does not need to be modeled. Next, each of the first two components $q_{c}$ of ${\bm{q}}$ is mapped to an embedding $z_{q_{c}}$ by a separate subnetwork $g_{q_{c}}$. The full input to the Transformer is thus a sequence consisting of the $P$ patch embeddings, a special [START] embedding $z_{0}$, and the two unit quaternion embeddings. The labels $l_{q_{x}}$, $l_{q_{y}}$, and $l_{q_{z}}$ are generated by assigning $q_{x}$, $q_{y}$, and $q_{z}$ to one of $N$ labels through a binning function $\text{Bin}$. Using a partially causal attention mask, AQuaMaM models the conditional distribution $p(q_{x}, q_{y}, q_{z}|{\bm{\mathsfit{X}}})$ autoregressively, i.e., $p(q_{x}, q_{y}, q_{z}|{\bm{\mathsfit{X}}}) = p(q_{x}|{\bm{\mathsfit{X}}}) p(q_{y} | q_{x}, {\bm{\mathsfit{X}}}) p(q_{z} | q_{x}, q_{y}, {\bm{\mathsfit{X}}})$ where each component is modeled as a mixture of uniform distributions that partition the component's geometrically constrained domain. Because minimizing the loss of a mixture of uniform distributions is equivalent (up to a constant) to minimizing the classification loss over the bins, AQuaMaM is trained as a "quaternion language model".
  • Figure 5: On the infinite toy dataset, AQuaMaM rapidly reached its theoretical minimum (classification) average negative log-likelihood (NLL). In contrast, IPDF never reached its theoretical minimum validation NLL, despite converging to its training theoretical minimum.
  • ...and 6 more figures
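
The change of variables described in the Figure 2 caption can be checked numerically. The following is a small sketch, not taken from the paper: the function names (`lift`, `jacobian_area`, `hemisphere_density`) are hypothetical. It lifts a point $(x, y)$ on the disk to the hemisphere, computes the area of the parallelogram spanned by the two columns of the Jacobian, and divides the disk density by that area.

```python
import math

def lift(x, y):
    """Map a point of the open unit disk to the upper unit hemisphere."""
    z = math.sqrt(1.0 - x * x - y * y)
    return (x, y, z)

def jacobian_area(x, y):
    """Area of the parallelogram spanned by d(lift)/dx and d(lift)/dy."""
    _, _, z = lift(x, y)
    tx = (1.0, 0.0, -x / z)  # partial derivative of lift with respect to x
    ty = (0.0, 1.0, -y / z)  # partial derivative of lift with respect to y
    cross = (
        tx[1] * ty[2] - tx[2] * ty[1],
        tx[2] * ty[0] - tx[0] * ty[2],
        tx[0] * ty[1] - tx[1] * ty[0],
    )
    return math.sqrt(sum(c * c for c in cross))  # analytically equals 1 / z

def hemisphere_density(p_xy, x, y):
    """Convert a density on the disk into a density on the hemisphere."""
    return p_xy / jacobian_area(x, y)
```

For a uniform distribution on the hemisphere, the projected density on the disk is $\frac{1}{2\pi z}$, and dividing by the Jacobian area $\frac{1}{z}$ recovers $\frac{1}{2\pi} \approx 0.159$, matching the true density quoted in the caption.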