Financial Bond Similarity Search Using Representation Learning

Amin Haeri, Mahdi Ghelichi, Nishant Agrawal, David Li, Catalina Gomez Sanchez

TL;DR

For bond similarity search in fixed-income analytics, the paper shows that learned embeddings of high-cardinality categorical attributes yield semantically meaningful bond neighbors and improve spread-curve reconstruction when issuer data are sparse. The authors train per-feature embeddings on six categorical attributes, retrieve neighbors by cosine similarity, and apply post-filtering before fitting Nelson–Siegel curves. Evaluated via sparse-issuer augmentation, the approach outperforms one-hot baselines and is competitive with supervised metric learners in sparse regimes. The work provides a practical framework for risk management and peer selection in fixed income, improving robustness and interpretability in curve construction and risk assessment. It also suggests hybrid and multimodal representations as avenues for further exploiting domain structure.
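
The retrieval step summarized above can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the embedding tables are hand-written 2-d toy vectors rather than learned ones, only two of the six categorical attributes are shown, per-feature scores are combined by a plain mean (the summary does not state the aggregation rule), and `keep` is a hypothetical post-filter predicate.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical per-feature embedding tables: toy vectors, not learned values.
EMBEDDINGS = {
    "industry": {"Tech": [1.0, 0.0], "Software": [0.9, 0.1], "Banks": [0.0, 1.0]},
    "country": {"US": [1.0, 0.0], "CA": [0.8, 0.6], "JP": [0.0, 1.0]},
}


def bond_similarity(query, candidate, embeddings=EMBEDDINGS):
    """Mean of per-feature cosine similarities (the mean is an assumed aggregation)."""
    scores = [
        cosine(table[query[feature]], table[candidate[feature]])
        for feature, table in embeddings.items()
    ]
    return sum(scores) / len(scores)


def top_k(query, catalog, k=2, keep=lambda bond: True):
    """Rank catalog bonds by similarity to the query, after a post-filter."""
    scored = [
        (bond_similarity(query, bond), bond["id"])
        for bond in catalog
        if bond["id"] != query["id"] and keep(bond)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


catalog = [
    {"id": "A", "industry": "Tech", "country": "US"},
    {"id": "B", "industry": "Software", "country": "CA"},
    {"id": "C", "industry": "Banks", "country": "JP"},
    {"id": "D", "industry": "Software", "country": "US"},
]

# Neighbors of bond "A" across the whole catalog.
ranked = top_k(catalog[0], catalog, k=2)
# Same query with a hypothetical post-filter restricting peers to US domicile.
us_only = top_k(catalog[0], catalog, k=2, keep=lambda bond: bond["country"] == "US")
```

Scoring each feature separately keeps the per-attribute contributions visible, which is the interpretability advantage the feature-wise design is credited with.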

Abstract

Finding similar bonds remains challenging in fixed-income analytics, as numerical financial attributes often overshadow categorical non-financial ones such as issuer sector and domicile. This paper shows that these categorical attributes dominate the predictability of spread curves and proposes embedding models to capture their semantic similarities, outperforming one-hot and many other baselines. Evaluated via sparse-issuer augmentation, the approach improves risk modeling and curve construction.

Paper Structure (15 sections, 2 equations, 18 figures, 1 table)

Figures (18)

  • Figure 1: Two-dimensional embedding projections colored by their similarities in the high-dimensional space. From left to right: Industry Subgroup, Issuer Bulk, and Country of Domicile. Each point represents an entity in the learned embedding space, projected onto the first two dimensions, with radial lines indicating displacement from the origin. The visualizations illustrate how semantically related entities cluster and separate according to industry, issuer identity, and geographic domicile, highlighting the embedding's ability to capture structured relationships across different categorical views.
  • Figure 2: Overview of the embedding-based approach for bond similarity search. Categorical queries (e.g., Industry, Country) are passed through a fine-tuned embedding model to produce dense numerical representations. These embeddings are then compared via similarity search against bond catalogs. After aggregating results across all features, the system outputs the most overall similar bonds.
  • Figure 3: Initial embedding-based methodology for bond similarity search. In this design, all categorical features are concatenated and jointly passed through the embedding model to produce dense numerical representations. Although this approach captures feature interactions more effectively, it is less interpretable than the revised, feature-wise embedding approach.
  • Figure 4: Evaluation pipeline for the proposed bond similarity search framework. Starting from a complete bond catalog, a subset of non-sparse issuers is selected. A fixed number of bonds is randomly removed from each to create sparsity. The sparse issuers are then augmented using the proposed similarity-based methodology. The augmented catalog is used to fit predicted bond spread curves via the Nelson–Siegel (NS) model, which are compared to the actual curves using the RMSE metric.
  • Figure 5: Visualization of bond similarity search results for the query bonds of AAPL US037833AL42 (left) and BAC US06051GFC87 (right). The first row represents the query bond's profile, while subsequent rows show the most similar bonds ranked by cosine similarity in embedding space. The color intensity indicates similarity, with higher scores (topmost rows) reflecting closer structural and economic resemblance to the query bond.
  • ...and 13 more figures
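
The curve-fitting and scoring steps of the evaluation pipeline in Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard three-factor Nelson–Siegel form, holds the decay parameter `lam` fixed so the fit reduces to linear least squares (the summary does not say how the decay is chosen), and runs on synthetic, noise-free spreads. All function names are hypothetical.

```python
import math


def ns_loadings(tau, lam):
    """Nelson-Siegel factor loadings (level, slope, curvature) at maturity tau > 0."""
    x = tau / lam
    slope = (1.0 - math.exp(-x)) / x
    return [1.0, slope, slope - math.exp(-x)]


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    n = 3
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n + 1):
                a[r][c] -= factor * a[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (a[i][n] - tail) / a[i][i]
    return x


def fit_ns(taus, spreads, lam):
    """Least-squares betas for a fixed decay lam, via the normal equations."""
    rows = [ns_loadings(t, lam) for t in taus]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, spreads)) for i in range(3)]
    return solve3(ata, aty)


def ns_curve(tau, betas, lam):
    """Evaluate the fitted spread curve at maturity tau."""
    return sum(b * w for b, w in zip(betas, ns_loadings(tau, lam)))


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Synthetic example: spreads (in bp) generated from known betas, then refit.
LAM = 2.0                          # assumed fixed decay parameter
TRUE_BETAS = [150.0, -60.0, 40.0]  # illustrative values only
taus = [1.0, 2.0, 3.0, 5.0, 7.0, 10.0, 20.0, 30.0]
actual = [ns_curve(t, TRUE_BETAS, LAM) for t in taus]

fitted_betas = fit_ns(taus, actual, LAM)
predicted = [ns_curve(t, fitted_betas, LAM) for t in taus]
error = rmse(actual, predicted)
```

In the pipeline described in the caption, `taus` and `spreads` for the predicted curve would come from the sparse issuer's remaining bonds plus the retrieved similar bonds, and `error` would be computed against the curve fitted on the issuer's full set of bonds.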