Table of Contents
Fetching ...

PowerBin: Fast Adaptive Data Binning with Centroidal Power Diagrams

Michele Cappellari

TL;DR

PowerBin tackles the bottlenecks of adaptive binning in large astronomical datasets by recasting binning as an optimal-transport problem whose natural solution is a Centroidal Power Diagram (CPD). It replaces fragile CPD solvers with a fast, physically motivated heuristic that enforces capacity constraints, and it introduces a linear-time bin-accretion method, both contributing to an overall $\mathcal{O}(N\log N)$ scaling. The method yields convex, compact bins with excellent $S/N$ uniformity and demonstrates robustness to non-additive noise, including correlated noise and background-limited data, across mock and real IFS datasets. The resulting performance gains enable practical, scalable binning for million-pixel surveys, while a general-purpose stippling-like tessellation example shows broad applicability beyond astronomy; the public PowerBin Python package facilitates adoption by the community.

Abstract

Adaptive binning is a crucial step in the analysis of large astronomical datasets, such as those from integral-field spectroscopy, to ensure a sufficient signal-to-noise ratio (S/N) for reliable model fitting. However, the widely used Voronoi-binning method and its variants suffer from two key limitations: they scale poorly with data size, often as O(N^2), creating a computational bottleneck for modern surveys, and they can produce undesirable non-convex or disconnected bins. I introduce PowerBin, a new algorithm that overcomes these issues. I frame the binning problem within the theory of optimal transport, for which the solution is a Centroidal Power Diagram (CPD), guaranteeing convex bins. Instead of formal CPD solvers, which are unstable with real data, I develop a fast and robust heuristic based on a physical analogy of packed soap bubbles. This method reliably enforces capacity constraints even for non-additive measures like S/N with correlated noise. I also present a new bin-accretion algorithm with O(N log N) complexity, removing the previous bottleneck. The combined PowerBin algorithm scales as O(N log N), making it about two orders of magnitude faster than previous methods on million-pixel datasets. I demonstrate its performance on a range of simulated and real data, showing it produces high-quality, convex tessellations with excellent S/N uniformity. The public Python implementation provides a fast, robust, and scalable tool for the analysis of modern astronomical data.

PowerBin: Fast Adaptive Data Binning with Centroidal Power Diagrams

TL;DR

PowerBin tackles the bottlenecks of adaptive binning in large astronomical datasets by recasting binning as an optimal-transport problem whose natural solution is a Centroidal Power Diagram (CPD). It replaces fragile CPD solvers with a fast, physically motivated heuristic that enforces capacity constraints, and it introduces a linear-time bin-accretion method, both contributing to an overall scaling. The method yields convex, compact bins with excellent uniformity and demonstrates robustness to non-additive noise, including correlated noise and background-limited data, across mock and real IFS datasets. The resulting performance gains enable practical, scalable binning for million-pixel surveys, while a general-purpose stippling-like tessellation example shows broad applicability beyond astronomy; the public PowerBin Python package facilitates adoption by the community.

Abstract

Adaptive binning is a crucial step in the analysis of large astronomical datasets, such as those from integral-field spectroscopy, to ensure a sufficient signal-to-noise ratio (S/N) for reliable model fitting. However, the widely used Voronoi-binning method and its variants suffer from two key limitations: they scale poorly with data size, often as O(N^2), creating a computational bottleneck for modern surveys, and they can produce undesirable non-convex or disconnected bins. I introduce PowerBin, a new algorithm that overcomes these issues. I frame the binning problem within the theory of optimal transport, for which the solution is a Centroidal Power Diagram (CPD), guaranteeing convex bins. Instead of formal CPD solvers, which are unstable with real data, I develop a fast and robust heuristic based on a physical analogy of packed soap bubbles. This method reliably enforces capacity constraints even for non-additive measures like S/N with correlated noise. I also present a new bin-accretion algorithm with O(N log N) complexity, removing the previous bottleneck. The combined PowerBin algorithm scales as O(N log N), making it about two orders of magnitude faster than previous methods on million-pixel datasets. I demonstrate its performance on a range of simulated and real data, showing it produces high-quality, convex tessellations with excellent S/N uniformity. The public Python implementation provides a fast, robust, and scalable tool for the analysis of modern astronomical data.

Paper Structure

This paper contains 26 sections, 14 equations, 9 figures, 4 algorithms.

Figures (9)

  • Figure 1: Posterior probability distributions for the four kinematic parameters $(V_1,\sigma_1,V_2,\sigma_2)$ of two stellar components, recovered with ppxf coupled to an Adaptive Metropolis MCMC sampler ($10^5$ steps). Left: spectrum with $\mathcal{S/N}=2$ per pixel. The joint and marginal posteriors are highly non-Gaussian, multimodal and biased away from the true input values (magenta dashed lines), illustrating that parameter estimates at low $\mathcal{S/N}$ are unreliable. Right: spectrum with $\mathcal{S/N}=20$ per pixel. The posteriors become well behaved, approximately Gaussian and centred on the true values, allowing robust inference. Contours mark the 68 and 95 per cent credible regions; diagonal panels show marginal histograms. This comparison demonstrates why one must bin spaxels to reach sufficient $\mathcal{S/N}$ before fitting non-linear spectral models.
  • Figure 2: Recovery of a simple input star-formation history from full-spectrum fitting at three signal-to-noise levels. Top row: mock spectrum (black), ppxf best-fit model (red) and residuals (green) for $\mathcal{S/N} = 5$, 50 and 500 (left to right). Bottom row: the distribution of 1000 Monte Carlo recovered SFHs (transparent blue lines) compared to the input single burst at 0.3 Gyr (orange line with markers). At low $\mathcal{S/N}$ the recovered SFHs are biased and highly scattered; increasing $\mathcal{S/N}$ progressively reduces bias and scatter, and by $\mathcal{S/N}=500$ the input burst is recovered with high fidelity. The fits employ 25 solar-metallicity MILES templates and 1000 Monte Carlo realizations per $\mathcal{S/N}$ level.
  • Figure 3: Geometric interpretation of several Voronoi generalizations. Each column illustrates a different tessellation. The top row shows the view from above: the boundaries of the 2D tessellation are the vertical projections onto the $(x,y)$-plane of the pairwise intersection curves of the 3D surfaces (cones or spheres). The middle (inclined) and bottom (side-on) rows show the underlying 3D surfaces. In all cases, the surfaces are centred at the same $(x,y)$ locations (the generators), while their shapes are modified by weights. (a) Ordinary Voronoi diagram: Boundaries come from intersections of identical right circular cones (same slope and same apex height). The projected bisectors are straight lines, yielding convex cells. (b) Multiplicatively weighted Voronoi (Apollonius) diagram: Boundaries come from intersections of cones with different slopes but the same apex height. The weight $w_j$ controls the cone slope, so at any fixed height the base radius scales with $w_j$. The projected bisectors are circular arcs (Apollonius circles), and cells may be non-convex. The figure shows an example where one small cell is contained entirely within a larger one. (c) Additively weighted Voronoi (Johnson–Mehl) diagram: Boundaries come from intersections of cones with the same slope but different apex heights. The weight $w_j$ is encoded in the apex height, and $w_j$ still represents the radius of the cone base. The projected bisectors are hyperbolic arcs, and cells are not guaranteed to be convex. In this example, one cone lies entirely below another, so its corresponding cell is empty and the tessellation has one fewer region than in (a) or (b). (d) Power (Laguerre) diagram: Boundaries come from intersections of spheres centred at $(\mathbf{g}_j,0)$ with radii $r_j=\sqrt{w_j}$. The projected bisectors are straight lines (radical axes), so cells are convex. As in (c), one sphere lies entirely below another, so its corresponding cell is empty and the tessellation has one fewer region.
  • Figure 4: Natural analogues for optimal data binning, illustrating the principles of compactness and capacity uniformity. Left: A https://beekeeping.fandom.com/wiki/Honey demonstrates the optimal packing of equal-area cells (hexagons). Right: A 2D foam (soap bubbles) provides a physical model for a capacity-constrained tessellation. By minimizing surface energy, the bubbles form a structure equivalent to a power diagram, where different bubble sizes correspond to different cell capacities. This illustrates the physical principle behind the PowerBin algorithm. Image courtesy of https://lincolnmathsphys.wordpress.com/2015/02/26/fascinating-foams-the-mathematics-of-soap-bubbles/, Aberystwyth University.
  • Figure 5: The performance of the PowerBin algorithm is shown for two mock galaxies ($n_{\rm Ser}=1$ and $n_{\rm Ser}=8$). Both simulations have $R_{\rm eff}=10$ pixels and an axial ratio of $q'=0.5$. The target signal-to-noise ratio ($\mathcal{S/N}_\mathrm{T}$) was calibrated to ensure $\approx 40$ unbinned spaxels near the centre. Left: Resulting Power Diagrams; circle radii are $r_j = \sqrt{w_j}$, with small black circles marking the generator points. Contours show the galaxy's $\mathcal{S/N}$, spaced logarithmically by 1 mag. Right:$\mathcal{S/N}$ distribution. Original spaxel $\mathcal{S/N}$ values are grey circles; $\mathcal{S/N}_\mathrm{T}$ is the dashed line. Blue crosses are single spaxels ($\mathcal{S/N} > \mathcal{S/N}_\mathrm{T}$), and large red circles are the final bin $\mathcal{S/N}$. Top panels use a simple Poissonian noise model; bottom panels show robustness to non-additive capacities from correlated noise. In all cases, the algorithm optimizes the capacity function $\mathcal{C}=(\mathcal{S/N})^2$ (which is an additive capacity in the Poissonian case), but I plot the square root $\sqrt{\mathcal{C}}=\mathcal{S/N}$.
  • ...and 4 more figures