Hierarchical Latent Structure Learning through Online Inference

Ines Aitsahalia; Kiyohito Iigaya

Abstract

Learning systems must balance generalization across experiences with discrimination of task-relevant details. Effective learning therefore requires representations that support both. Online latent-cause models support incremental inference but assume flat partitions, whereas hierarchical Bayesian models capture multilevel structure but typically require offline inference. We introduce the Hierarchical Online Learning of Multiscale Experience Structure (HOLMES) model, a computational framework for hierarchical latent structure learning through online inference. HOLMES combines a variation on the nested Chinese Restaurant Process prior with sequential Monte Carlo inference to perform tractable trial-by-trial inference over hierarchical latent representations without explicit supervision over the latent structure. In simulations, HOLMES matched the predictive performance of flat models while learning more compact representations that supported one-shot transfer to higher-level latent categories. In a context-dependent task with nested temporal structure, HOLMES also improved outcome prediction relative to flat models. These results provide a tractable computational framework for discovering hierarchical structure in sequential data.
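The trial-by-trial inference described in the abstract can be illustrated with a minimal particle-filter sketch. It is a generic sequential Monte Carlo step for a flat CRP latent-cause model, not the HOLMES implementation: the Beta-Bernoulli feature likelihood, the resampling threshold, and all names (`ALPHA`, `PRIOR`, `new_particle`, `predictive`, `smc_step`) are assumptions made for illustration.

```python
import copy
import random

ALPHA = 1.0  # CRP concentration (hypothetical value)
PRIOR = 1.0  # symmetric Beta(PRIOR, PRIOR) pseudo-count per binary feature


def new_particle():
    # counts[k]: observations assigned to cluster k
    # ones[k][d]: how often binary feature d was 1 in cluster k
    return {"counts": [], "ones": []}


def predictive(particle, k, x):
    """Posterior predictive p(x | cluster k); k == len(counts) means a new cluster."""
    n = particle["counts"][k] if k < len(particle["counts"]) else 0
    p = 1.0
    for d, xd in enumerate(x):
        ones = particle["ones"][k][d] if n else 0
        p1 = (ones + PRIOR) / (n + 2 * PRIOR)
        p *= p1 if xd else 1.0 - p1
    return p


def smc_step(particles, weights, x, rng):
    """Assign x to a cluster in every particle, reweight, and resample if needed."""
    new_weights = []
    for particle, w in zip(particles, weights):
        counts = particle["counts"]
        total = sum(counts)
        # CRP prior times predictive likelihood for each existing cluster...
        scores = [counts[k] / (total + ALPHA) * predictive(particle, k, x)
                  for k in range(len(counts))]
        # ...and for opening a new cluster.
        scores.append(ALPHA / (total + ALPHA) * predictive(particle, len(counts), x))
        # Sample the assignment from its posterior; weight by the marginal evidence.
        k = rng.choices(range(len(scores)), weights=scores)[0]
        if k == len(counts):
            counts.append(0)
            particle["ones"].append([0] * len(x))
        counts[k] += 1
        for d, xd in enumerate(x):
            particle["ones"][k][d] += xd
        new_weights.append(w * sum(scores))
    z = sum(new_weights)
    new_weights = [w / z for w in new_weights]
    # Multinomial resampling when the effective sample size drops below half.
    ess = 1.0 / sum(w * w for w in new_weights)
    if ess < 0.5 * len(particles):
        idx = rng.choices(range(len(particles)), weights=new_weights, k=len(particles))
        particles = [copy.deepcopy(particles[i]) for i in idx]
        new_weights = [1.0 / len(particles)] * len(particles)
    return particles, new_weights


# Toy usage: two alternating observation types, 40 trials, 50 particles.
rng = random.Random(0)
particles = [new_particle() for _ in range(50)]
weights = [1.0 / 50] * 50
for x in [(1, 1), (1, 1), (0, 0), (0, 0)] * 10:
    particles, weights = smc_step(particles, weights, x, rng)
```

Each particle carries one hypothesis about the partition, so memory and compute per trial stay bounded by the number of particles rather than growing with the number of possible partitions.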

Paper Structure (33 sections, 19 equations, 9 figures, 2 tables)

Prior over cluster assignments
Likelihood and online updates
Likelihood computation
Sequential Monte Carlo inference
Hierarchical prior over latent paths
Depth-decayed concentration and stopping.
Persistence and node reuse
Hierarchical inference procedure
Compositional task environment
Context-dependent task with nested temporal structure
Outcome prediction accuracy
Representational efficiency
One-shot transfer
Code availability.
Depth-decay interpretation.
...and 18 more sections

Figures (9)

Figure 1: Online hierarchical extension of the Chinese Restaurant Process (CRP) for structure learning. (A) CRP algorithm. The CRP provides a Bayesian nonparametric prior over partitions, assigning each observation i (filled circle) either to an existing cluster k with probability proportional to its occupancy ($n_k$) or to a new cluster with probability proportional to the concentration parameter $\alpha$. (B) HOLMES prior with the nested Chinese Restaurant Process (nCRP) algorithm. The nCRP provides a Bayesian nonparametric prior over tree-structured partitions, assigning observations to paths by sequentially selecting branches at each level $L$ with probability proportional to occupancy for existing branches and to $\alpha_L$, the level-specific concentration, for a new branch.
Figure 2: Hierarchical inference preserves outcome prediction accuracy while improving representational efficiency. (A) Two-level hierarchical task structure. Observations (e.g., A', B") vary along binary features (observation level) and category identity (latent). Category determines reward outcome. (B) Example learned model structures. The flat latent-cause model represents each observation type separately at the observation level, while HOLMES learns a compressed category-level representation. (C) Outcome prediction performance on the 2-level task. Both models achieve equivalently high accuracy. Error bars: 95% CI across 200 parameter combinations. (D) Representational efficiency measured by average entropy of cluster assignments. HOLMES (hierarchical model) shows significantly lower entropy than the flat model (0.076±0.017 vs. 0.131±0.030), indicating better compression. Error bars: 95% CI across 200 parameter combinations. (E) Outcome prediction performance across task complexity. Both models achieve comparable asymptotic accuracy (84-100%) across all complexity levels, with small effect sizes (Cohen's $|d|$ < 0.4). Error bars: 95% CI across 200 parameter combinations. (F) Representational efficiency across complexity. HOLMES shows significantly lower entropy at all levels, with an increasing advantage at higher complexities (2-level: -0.055; 5-level: -0.827). Error bars: 95% CI across 200 parameter combinations. (G) Number of learned clusters across task complexity. HOLMES (light) maintains near-optimal compression across complexities, while flat models (dark) show increasing redundancy, using significantly more clusters at all levels (differences: -1.1 to -2.7 clusters). Error bars: 95% CI across 200 parameter combinations.
Figure 3: Hierarchical advantage in one-shot transfer emerges with task complexity. (A) Transfer task structure. Models are first trained on binary outcome prediction as in the previous task. After training, models receive a single labeled example assigning one observation to a category, such as "observation A' belongs to A" (teaching phase). In the transfer test, models must generalize this category label to identify which past observations belong to the same latent category, despite never receiving explicit category labels during training. (B) One-shot transfer across task complexity. Recall accuracy decreased for both models as task complexity increased, but HOLMES (light, triangle) maintained superior transfer at higher complexities relative to flat models (dark, circle). At 2-level complexity, both models performed comparably (flat: 89.7±2.0%, hierarchical: 89.3±1.8%); however, hierarchical models significantly outperformed flat models at higher complexities (3-level: +21.0%, 95% CI [19.2%, 22.8%]; 4-level: +24.7%, 95% CI [22.6%, 26.8%]; 5-level: +26.6%, 95% CI [24.4%, 28.8%]). Error bars represent 95% confidence intervals across 200 parameter combinations. (C) Hierarchical advantage (hierarchical minus flat transfer accuracy) increases systematically with task complexity, transitioning from no advantage at 2-level to substantial advantages at 3+ levels. Error bars represent 95% confidence intervals across n=200 parameter combinations. (D) One-shot transfer performance on the most complex task at each level tested. HOLMES (hierarchical model; right) outperforms the flat model, with a greater advantage at deeper levels. (E) Relationship between outcome prediction accuracy and one-shot transfer accuracy across 200 parameter settings for hierarchical (light) and flat (dark) models. Each point represents one parameter combination averaged over seeds. Regression lines show 95% confidence bands.
Figure 4: Hierarchical inference improves prediction in a context-dependent task with nested temporal structure. (A) Task structure. The task involves two slow-changing contexts, each specifying which of two binary feature dimensions determines reward (illustrated here as shape and texture). In shape-rule contexts, shape determines reward (circle or triangle), while texture is irrelevant. In texture-rule contexts, texture determines reward (dots or stripes), while shape is irrelevant. (B) Nested temporal structure. Each block presents all four stimulus combinations (shape $\times$ texture), ensuring the same stimulus can yield different outcomes depending on the current context. Within each slow context, the rewarded feature value switches in sub-blocks. Dashed lines indicate these fast value switches within contexts. Stars denote rewarded stimuli. (C) Outcome prediction accuracy. HOLMES (hierarchical model; light gray) achieves significantly higher accuracy than flat models (dark gray) (Flat: 48.1 ± 0.3, HOLMES: 80.3 ± 1.1 (95% CI)). (D) Within-state entropy. HOLMES achieves lower within-state entropy, indicating that each of the four latent states (shape-circle, shape-triangle, texture-stripes, texture-dots) maps to fewer clusters. Lower entropy indicates more efficient representations (Flat: 2.6 ± 0.03, HOLMES: 1.8 ± 0.1 (95% CI)). (E) Representational efficiency. HOLMES uses fewer clusters per latent state (Flat: 32.3 ± 0.9, HOLMES: 15.1 ± 1.1 (95% CI)).
Figure S1: Sequential learning in flat and hierarchical models. (A) One-layer ("flat") model. Standard CRP-based models maintain a single partition over observations, incrementally assigning each observation to an existing cluster or creating a new one. Trial 1: The first observation creates a new group. Trial 2: The second observation (dashed) creates another new group. Trial 3: The third observation is assigned to the existing compatible group. Trial n: After many trials, the model has discovered multiple groups at a single level of abstraction. (B) Online hierarchical latent structure learning. Our model generalizes the flat formulation by organizing latent structure across multiple levels of abstraction online. Trial 1: The first observation creates new nodes at multiple levels (each marked 'new'). Trial 2: The second observation creates a new branch, forming a sibling relationship with the first observation at higher levels while differing at lower levels, and deepening the hierarchy. Trial 3: The third observation reuses existing structure at higher levels but creates new structure at the observation level. Trial n: The model has discovered a multi-level tree with shared structure at higher levels.
...and 4 more figures
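The branch-selection rule in the Figure 1 caption can be sketched in a few lines. This is a textbook fixed-depth nCRP draw, not the HOLMES prior itself, which the abstract describes as a variation on the nCRP. The geometric schedule `alpha0 * decay ** level` for $\alpha_L$ and the names `Node`, `branch_probs`, and `sample_path` are assumptions for illustration, and stopping, persistence, and node reuse are not modeled.

```python
import random


class Node:
    """A tree node that tracks how many observations have passed through it."""

    def __init__(self):
        self.count = 0
        self.children = []


def branch_probs(node, alpha):
    """CRP probabilities over existing children plus one slot for a new child."""
    total = sum(child.count for child in node.children) + alpha
    probs = [child.count / total for child in node.children]
    probs.append(alpha / total)  # last entry: open a new branch
    return probs


def sample_path(root, depth, alpha0, decay, rng):
    """Draw one root-to-leaf path, updating occupancy counts along the way."""
    node = root
    node.count += 1
    path = []
    for level in range(depth):
        alpha = alpha0 * decay ** level  # assumed form of the level-specific alpha_L
        probs = branch_probs(node, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(node.children):
            node.children.append(Node())
        node = node.children[k]
        node.count += 1
        path.append(k)
    return path


# Toy usage: 20 observations on a depth-3 tree.
rng = random.Random(0)
root = Node()
paths = [sample_path(root, depth=3, alpha0=1.0, decay=0.5, rng=rng) for _ in range(20)]
```

With `decay < 1`, new branches become less likely at deeper levels, so most of the branching happens near the root; a `decay` above 1 would push new branches toward the leaves instead.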
