Squared families: Searching beyond regular probability models

Russell Tsuchida, Jiawei Liu, Cheng Soon Ong, Dino Sejdinovic

TL;DR

This work introduces squared families, a class of probability densities formed by squaring a linear transformation of a statistic, and handles their intrinsic singularities through a dimension-augmentation trick to obtain regular models. It reveals a tight geometric-statistical structure: the normalising constant $z(\bm{\theta})$ factorises via a $\bm{\theta}$-independent kernel $\bm{K}_{\mu, \bm{\psi}}$, and the Fisher information, Bregman divergence, and statistical divergence are all computed from this same squared-family kernel, so a single integral serves all three. The paper further places squared families within the broader $g$-family framework, showing that, after removing singularities, exponential families and positively homogeneous families are the only cases whose Fisher information is a conformal transformation of a Hessian metric with a generator that depends on the parameter only through $z(\bm{\theta})$. It also proves estimation and universal approximation results, establishing rates such as $\mathcal{O}(N^{-1/2})$ for well-specified scenarios and an $\mathcal{O}(n^{-1/4})$ KL bound in universal approximation, which positions squared families as a practical, tractable alternative for density modelling with neural-network features. The framework unifies normalisation, geometry, and inference, and points to applications in density estimation, neural models, and probabilistic circuits where tractable normalisation and composition are desirable.

Abstract

We introduce squared families, which are families of probability densities obtained by squaring a linear transformation of a statistic. Squared families are singular; however, their singularity can easily be handled so that they form regular models. After handling the singularity, squared families possess many convenient properties. Their Fisher information is a conformal transformation of the Hessian metric induced from a Bregman generator. The Bregman generator is the normalising constant, and yields a statistical divergence on the family. The normalising constant admits a helpful parameter-integral factorisation, meaning that only one parameter-independent integral needs to be computed for all normalising constants in the family, unlike in exponential families. Finally, the squared family kernel is the only integral that needs to be computed for the Fisher information, statistical divergence and normalising constant. We then describe how squared families are special in the broader class of $g$-families, which are obtained by applying a sufficiently regular function $g$ to a linear transformation of a statistic. After removing special singularities, positively homogeneous families and exponential families are the only $g$-families for which the Fisher information is a conformal transformation of the Hessian metric, where the generator depends on the parameter only through the normalising constant. Even-order monomial families also admit parameter-integral factorisations, unlike exponential families. We study parameter estimation and density estimation in squared families, in the well-specified and misspecified settings. We use a universal approximation property to show that squared families can learn sufficiently well-behaved target densities at a rate of $\mathcal{O}(N^{-1/2})+C n^{-1/4}$, where $N$ is the number of datapoints, $n$ is the number of parameters, and $C$ is some constant.
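The parameter-integral factorisation can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes an unnormalised density $(\bm{\theta}^\top \bm{\psi}(x))^2$ with respect to a base measure $\mu$, so that $z(\bm{\theta}) = \bm{\theta}^\top \bm{K}_{\mu, \bm{\psi}} \bm{\theta}$ with $\bm{K}_{\mu, \bm{\psi}} = \int \bm{\psi} \bm{\psi}^\top d\mu$. The statistic $\bm{\psi}(x) = (1, x, x^2)$, the standard Gaussian base measure, the trapezoid quadrature, and all function names are hypothetical choices made for illustration.

```python
import math

def features(x):
    # Hypothetical statistic psi(x) = (1, x, x^2); an illustrative choice only.
    return [1.0, x, x * x]

def base_density(x):
    # Assumed base measure mu: the standard Gaussian on the real line.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate(f, lo=-12.0, hi=12.0, steps=20000):
    # Composite trapezoid rule on [lo, hi]; the Gaussian tails beyond are negligible.
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

def kernel_matrix(n=3):
    # K[i][j] = integral of psi_i(x) psi_j(x) d mu(x); computed once, independent of theta.
    return [
        [
            integrate(lambda x, i=i, j=j: features(x)[i] * features(x)[j] * base_density(x))
            for j in range(n)
        ]
        for i in range(n)
    ]

def z_from_kernel(theta, K):
    # Normalising constant as the quadratic form theta^T K theta (no new integral).
    n = len(theta)
    return sum(theta[i] * K[i][j] * theta[j] for i in range(n) for j in range(n))

def z_direct(theta):
    # Reference value: integrate the squared linear statistic directly for this theta.
    def integrand(x):
        linear = sum(t * p for t, p in zip(theta, features(x)))
        return linear * linear * base_density(x)
    return integrate(integrand)

K = kernel_matrix()
theta = [0.5, -1.0, 2.0]
z_kernel = z_from_kernel(theta, K)
z_quad = z_direct(theta)
print(z_kernel, z_quad)
```

In this sketch the integrals defining `K` are evaluated once; every subsequent normalising constant is a quadratic form in `theta`, whereas `z_direct` has to integrate again for each new parameter.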

Paper Structure

This paper contains 54 sections, 23 theorems, 112 equations, 1 figure, 2 tables.

Key Result

Theorem 3

The Bregman divergence generated by $\phi(\bm{\theta}) = z(\bm{\theta}) = \bm{\theta}^\top \bm{K}_{\mu, \bm{\psi}} \bm{\theta}$ is twice the squared $L^2$ distance between the functions whose squares are proportional to the probability densities. Suppose that Assumptions ass:simple_parameter_space, ass:strictly_pd and ass:rich_features hold. Then ${\mathop{\mathrm{SL^2}}\nolimits( \bm{\theta}^\to
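The link between a quadratic generator and an $L^2$ distance can be worked out in a few lines. The derivation below is a sketch under stated assumptions rather than the paper's proof: it uses the standard definition of a Bregman divergence, takes $\bm{K} = \bm{K}_{\mu, \bm{\psi}}$ to be symmetric with $\bm{K}_{\mu, \bm{\psi}} = \int \bm{\psi} \bm{\psi}^\top d\mu$, and introduces the notation $d_\phi$ and $\bm{\theta}'$ for illustration. How the result matches the factor of two in the statement depends on how $\mathrm{SL^2}$ is normalised, which this excerpt does not show.

```latex
\begin{align*}
d_\phi(\bm{\theta}, \bm{\theta}')
  &= \phi(\bm{\theta}) - \phi(\bm{\theta}')
     - \nabla \phi(\bm{\theta}')^\top (\bm{\theta} - \bm{\theta}') \\
  &= \bm{\theta}^\top \bm{K} \bm{\theta}
     - {\bm{\theta}'}^\top \bm{K} \bm{\theta}'
     - 2 {\bm{\theta}'}^\top \bm{K} (\bm{\theta} - \bm{\theta}') \\
  &= \bm{\theta}^\top \bm{K} \bm{\theta}
     - 2 {\bm{\theta}'}^\top \bm{K} \bm{\theta}
     + {\bm{\theta}'}^\top \bm{K} \bm{\theta}' \\
  &= (\bm{\theta} - \bm{\theta}')^\top \bm{K} (\bm{\theta} - \bm{\theta}') \\
  &= \int \left( \bm{\theta}^\top \bm{\psi}(x) - {\bm{\theta}'}^\top \bm{\psi}(x) \right)^2 d\mu(x).
\end{align*}
```

The last line is the squared $L^2(\mu)$ distance between the two linear functions $\bm{\theta}^\top \bm{\psi}$ and ${\bm{\theta}'}^\top \bm{\psi}$, whose squares are the unnormalised densities.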

Figures (1)

  • Figure 1: Useful parameter spaces for squared families. (Left) The space $\mathbbold{\Theta}=\{ \bm{\theta} \in \mathbb{R}^n \mid \theta_1 > 0\}$ removes ambiguity in the sign of the parameter $\bm{\theta}$. When using appropriate dimension-augmentation (§ \ref{sec:dim_aug}), it results in an identifiable model with a full-rank Fisher information (Lemma \ref{lemma:gaussian_fim}). The boundary of the half ellipsoid $\overline{\mathbbold{\Theta}}= \{ \bm{\theta} \in \mathbbold{\Theta} \mid \bm{\theta}^\top \bm{K}_{\mu, \bm{\psi}} \bm{\theta} = 1\}$ additionally allows for a proper statistical divergence (Right) by removing ambiguity in the scale of the parameter $\bm{\theta}$ (Left). This statistical divergence is equal to a Bregman divergence (Middle) generated through the normalising constant $z(\bm{\theta}) = \bm{\theta}^\top \bm{K}_{\mu, \bm{\psi}} \bm{\theta}$ restricted to $\overline{\mathbbold{\Theta}}$.

Theorems & Definitions (29)

  • Definition 1: Squared family
  • Remark 2
  • Theorem 3
  • Definition 4: $m$-squared family
  • Theorem 5
  • Theorem 6
  • Lemma 6
  • Remark 7: Bias reparameterisation
  • Remark 8
  • Lemma 8
  • ...and 19 more