$\texttt{InfoHier}$: Hierarchical Information Extraction via Encoding and Embedding
Tianru Zhang, Li Ju, Prashant Singh, Salman Toor
TL;DR
InfoHier tackles the problem of extracting multi-level information hierarchies from unlabeled data by fusing self-supervised representation learning with hierarchical clustering. It introduces a joint objective that combines a differentiable Dasgupta-based hierarchical clustering loss on hyperbolic embeddings with a unified contrastive SSL loss, allowing end-to-end optimization of both representations and hierarchical structure. Preliminary demonstrations on CIFAR-100 with a ResNet-18 backbone show that the learned latent space reveals intrinsic hierarchy and improves clustering without labels, with the resulting hierarchy visualized in hyperbolic space. The work promises practical impact in scalable data analysis, retrieval, and structured data management by enabling hierarchy-aware representations and efficient hierarchical indexing.
Abstract
Analyzing large-scale datasets, especially those involving complex and high-dimensional data like images, is particularly challenging. While self-supervised learning (SSL) has proven effective for learning representations from unlabeled data, it typically focuses on flat, non-hierarchical structures, missing the multi-level relationships present in many real-world datasets. Hierarchical clustering (HC) can uncover these relationships by organizing data into a tree-like structure, but it often relies on rigid similarity metrics that struggle to capture the complexity of diverse data types. To address these limitations, we envision $\texttt{InfoHier}$, a framework that combines SSL with HC to jointly learn robust latent representations and hierarchical structures. This approach leverages SSL to provide adaptive representations, enhancing HC's ability to capture complex patterns. Simultaneously, it integrates an HC loss to refine SSL training, resulting in representations that are more attuned to the underlying information hierarchy. $\texttt{InfoHier}$ has the potential to improve the expressiveness and performance of both clustering and representation learning, offering significant benefits for data analysis, management, and information retrieval.
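To make the hierarchical clustering objective more concrete, the following is a minimal sketch of the discrete Dasgupta cost that the TL;DR's "Dasgupta-based" loss refers to: for every pair of points, the pairwise similarity is multiplied by the number of leaves under the pair's lowest common ancestor, so a good tree separates dissimilar points near the root. This is not the differentiable relaxation on hyperbolic embeddings described for the framework. The nested-tuple tree encoding, the function names `leaves` and `dasgupta_cost`, and the toy similarity matrix are all assumptions made for illustration.

```python
def leaves(tree):
    """Return the leaf indices under a binary tree given as nested tuples."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def dasgupta_cost(tree, sim):
    """Sum over leaf pairs (i, j) of sim[i][j] * |leaves(lca(i, j))|."""
    if isinstance(tree, int):
        return 0.0
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    n_here = len(left_leaves) + len(right_leaves)
    # Pairs split at this node have this node as their lowest common ancestor.
    split = sum(sim[i][j] for i in left_leaves for j in right_leaves)
    return n_here * split + dasgupta_cost(left, sim) + dasgupta_cost(right, sim)


# Toy data (hypothetical): points 0 and 1 are similar, as are 2 and 3.
sim = [
    [0.0, 1.0, 0.25, 0.25],
    [1.0, 0.0, 0.25, 0.25],
    [0.25, 0.25, 0.0, 1.0],
    [0.25, 0.25, 1.0, 0.0],
]
good_tree = ((0, 1), (2, 3))  # groups similar points first
bad_tree = ((0, 2), (1, 3))   # groups dissimilar points first

print(dasgupta_cost(good_tree, sim), dasgupta_cost(bad_tree, sim))
```

The tree that merges similar points low in the hierarchy receives the lower cost. In a joint setup such as the one envisioned here, the similarities would presumably come from the learned SSL representations rather than from a fixed matrix.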
