The Harmonic Indel Distance

Bob Pepin

TL;DR

The paper introduces the harmonic indel distance (HID), a length-normalized string distance in which the cost of an insertion or deletion is inversely proportional to the length of the intermediate string. It is given in closed form as $d(A,B)=2H_{|A|+|B|-|\mathrm{lcs}(A,B)|}-H_{|A|}-H_{|B|}$, where $H_n$ is the $n$-th harmonic number and $\mathrm{lcs}(A,B)$ is a longest common subsequence of $A$ and $B$, and it is proven to satisfy the triangle inequality. The paper situates HID relative to the indel distance (ID) and its Steinhaus transform, connects it to the contextualized normalized edit distance restricted to indels, and notes that it can be computed in quadratic time via the LCS. On classification and regression benchmarks for biomedical sequences, HID performs competitively with the normalized variants and outperforms the unnormalized ID on some tasks; the accompanying t-SNE visualizations show that the different distances induce visibly different geometric structure in the embeddings. The results support HID as a practical, parameter-free metric for sequence analysis and visualization, with potential applications to shorter sequences and to domains beyond biology.

Abstract

This short note introduces the harmonic indel distance (HID), a new distance between strings where the cost of an insertion or deletion is inversely proportional to the string length. We present a closed-form formula and show that the HID is a proper distance metric. Then we perform an experimental comparison of HID to normalized and unnormalized versions of the indel distance on benchmark tasks for biomedical sequence data. We finally show planar embeddings of the benchmark datasets to provide some insights into the geometry of the metric spaces associated with the different distance metrics.
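The closed-form formula quoted in the TL;DR can be evaluated directly from an LCS length. The following is a minimal Python sketch, not the author's implementation: the function names `lcs_length`, `harmonic`, and `harmonic_indel_distance` are hypothetical, and the convention $H_0 = 0$ (empty sum) is an assumption made here so that the empty string is handled.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b.

    Standard dynamic program: O(|a| * |b|) time, O(|b|) memory.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def harmonic(n: int) -> float:
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n (assumed H_0 = 0)."""
    return sum(1.0 / k for k in range(1, n + 1))


def harmonic_indel_distance(a: str, b: str) -> float:
    """Evaluate d(A, B) = 2 H_{|A|+|B|-|lcs(A,B)|} - H_{|A|} - H_{|B|}."""
    n = len(a) + len(b) - lcs_length(a, b)
    return 2.0 * harmonic(n) - harmonic(len(a)) - harmonic(len(b))
```

For instance, with $A = \texttt{ab}$ and $B = \texttt{b}$ the LCS has length 1, so the formula gives $2H_2 - H_2 - H_1 = 1/2$: a single deletion from a string of length 2. The quadratic cost comes entirely from the LCS step; the harmonic numbers could be precomputed if many pairs are compared.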

Paper Structure

This paper contains 10 sections, 4 theorems, 16 equations, 2 figures, 5 tables.

Key Result

Theorem 3.1

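The harmonic indel distance defines a distance on the space of strings. For any three strings $A, B, C$ it satisfies the distance axioms: $d(A,B) \ge 0$ with equality if and only if $A = B$; $d(A,B) = d(B,A)$; and $d(A,C) \le d(A,B) + d(B,C)$.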
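The statement of Theorem 3.1 can be sanity-checked by brute force on a small set of strings. The sketch below is an illustration only and no substitute for the proof: it evaluates the closed-form formula $d(A,B)=2H_{|A|+|B|-|\mathrm{lcs}(A,B)|}-H_{|A|}-H_{|B|}$ with exact rational arithmetic and tests positivity, symmetry, and the triangle inequality over all strings up to a given length. The names `hid` and `check_axioms`, the two-letter alphabet, the length bound, and the convention $H_0 = 0$ are assumptions chosen for this example.

```python
from fractions import Fraction
from itertools import product


def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def harmonic(n: int) -> Fraction:
    """Exact n-th harmonic number as a fraction (assumed H_0 = 0)."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))


def hid(a: str, b: str) -> Fraction:
    """Closed-form harmonic indel distance, evaluated exactly."""
    n = len(a) + len(b) - lcs_length(a, b)
    return 2 * harmonic(n) - harmonic(len(a)) - harmonic(len(b))


def check_axioms(alphabet: str = "ab", max_len: int = 3) -> bool:
    """Return True if the distance axioms hold on all strings up to max_len."""
    strings = ["".join(p) for n in range(max_len + 1)
               for p in product(alphabet, repeat=n)]
    # Cache all pairwise distances so the triple loop only does lookups.
    dist = {(a, b): hid(a, b) for a in strings for b in strings}
    for (a, b), d in dist.items():
        if d < 0 or (d == 0) != (a == b) or d != dist[(b, a)]:
            return False
    return all(dist[(a, c)] <= dist[(a, b)] + dist[(b, c)]
               for a, b, c in product(strings, repeat=3))
```

Using `Fraction` avoids floating-point rounding in cases where the triangle inequality holds with equality, for example $d(\texttt{a},\texttt{b}) = d(\texttt{a},\texttt{ab}) + d(\texttt{ab},\texttt{b}) = 1$.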

Figures (2)

  • Figure 1: t-SNE plots of the ncRNA training dataset using different distance metrics. Colors correspond to different classes. All metrics recover the clusters present in the data, with HID and STID obtaining a slightly better separation.
  • Figure 2: t-SNE plots of the FLIP training datasets using different distance metrics. Colors correspond to different target values for thermostability in the regression task. HID shows both global and local structure, STID shows local structure, and ID shows little apparent structure.

Theorems & Definitions (8)

  • Theorem 3.1
  • Lemma 3.2
  • proof
  • Lemma 3.3
  • proof
  • Lemma 3.4
  • proof
  • proof: Proof of Theorem 3.1