Table of Contents
Fetching ...

Exploiting Data Hierarchy as a New Modality for Contrastive Learning

Arjun Bhalla, Daniel Levenson, Jan Bernhard, Anton Abilov

TL;DR

This paper investigates whether hierarchical structure can serve as a modality for weakly-supervised representation learning, using the WikiScenes cathedral dataset to test hierarchical contrastive pre-training. It introduces a hierarchical triplet loss with level-dependent margins and replay to exploit data organization, reporting both quantitative improvements in downstream classification and qualitative evidence from latent-space visualizations. The proposed method achieves competitive downstream performance and reveals clearer semantic separation in the latent space compared to baselines, demonstrating the value of structure-aware learning for structured datasets. The work highlights a new modality for self-supervised learning and suggests pathways to extend hierarchical training to other structured domains and semantic hierarchies.

Abstract

This work investigates how hierarchically structured data can help neural networks learn conceptual representations of cathedrals. The underlying WikiScenes dataset provides a spatially organized hierarchical structure of cathedral components. We propose a novel hierarchical contrastive training approach that leverages a triplet margin loss to represent the data's spatial hierarchy in the encoder's latent space. As such, the proposed approach investigates if the dataset structure provides valuable information for self-supervised learning. We apply t-SNE to visualize the resultant latent space and evaluate the proposed approach by comparing it with other dataset-specific contrastive learning methods using a common downstream classification task. The proposed method outperforms the comparable weakly-supervised and baseline methods. Our findings suggest that dataset structure is a valuable modality for weakly-supervised learning.

Exploiting Data Hierarchy as a New Modality for Contrastive Learning

TL;DR

This paper investigates whether hierarchical structure can serve as a modality for weakly-supervised representation learning, using the WikiScenes cathedral dataset to test hierarchical contrastive pre-training. It introduces a hierarchical triplet loss with level-dependent margins and replay to exploit data organization, reporting both quantitative improvements in downstream classification and qualitative evidence from latent-space visualizations. The proposed method achieves competitive downstream performance and reveals clearer semantic separation in the latent space compared to baselines, demonstrating the value of structure-aware learning for structured datasets. The work highlights a new modality for self-supervised learning and suggests pathways to extend hierarchical training to other structured domains and semantic hierarchies.

Abstract

This work investigates how hierarchically structured data can help neural networks learn conceptual representations of cathedrals. The underlying WikiScenes dataset provides a spatially organized hierarchical structure of cathedral components. We propose a novel hierarchical contrastive training approach that leverages a triplet margin loss to represent the data's spatial hierarchy in the encoder's latent space. As such, the proposed approach investigates if the dataset structure provides valuable information for self-supervised learning. We apply t-SNE to visualize the resultant latent space and evaluate the proposed approach by comparing it with other dataset-specific contrastive learning methods using a common downstream classification task. The proposed method outperforms the comparable weakly-supervised and baseline methods. Our findings suggest that dataset structure is a valuable modality for weakly-supervised learning.
Paper Structure (21 sections, 2 equations, 5 figures, 1 table)

This paper contains 21 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Images of the Berliner Dom used to demonstrate the hierarchical structure of the WikiScenes dataset.
  • Figure 2: Visualizations of node structure underlying the hierarchical learning procedure.
  • Figure 3: t-SNE plots of model feature maps for an unseen cathedral's first level
  • Figure 4: t-SNE plots of the latent space with respect to the downstream classification labels for an unseen cathedral.
  • Figure 5: Each star marker represents a trained encoder model evaluated on the classification task. Relative $mAP*$ reflects the relative regression compared to the best classification performance of all models in the plot. For example, a Relative $mAP*$ of 0.10 is 90% worse than the best performing model.