Triple-Encoders: Representations That Fire Together, Wire Together

Justus-Jonas Erker, Florian Mai, Nils Reimers, Gerasimos Spanakis, Iryna Gurevych

TL;DR

This work addresses the inefficiency of re-encoding entire dialog histories by introducing Contextualized Curved Contrastive Learning (C3L) via Triple-Encoders, which contextualize independently encoded utterances through a Hebbian-inspired co-occurrence objective without learnable weights, preserving linear inference complexity. By employing two before-spaces, [B1] and [B2], and a mean-pooling fusion, the model learns distributed mixtures that reflect sequential context better than standard bi-encoders. Empirically, C3L yields substantial improvements over bi-encoders and better zero-shot generalization than single-vector representation models on DailyDialog and PersonaChat, along with strong short-term planning performance, while keeping inference efficiency comparable to prior CCL approaches. The authors release code and models, underscoring the practical impact of self-organizing, context-aware representations for dialogue and potentially other sequential text tasks.

Abstract

Search-based dialog models typically re-encode the dialog history at every turn, incurring high cost. Curved Contrastive Learning, a representation learning method that encodes relative distances between utterances into the embedding space via a bi-encoder, has recently shown promising results for dialog modeling at far superior efficiency. While high efficiency is achieved by encoding utterances independently, this ignores the importance of contextualization. To overcome this issue, this study introduces triple-encoders, which efficiently compute distributed utterance mixtures from these independently encoded utterances through a novel Hebbian-inspired co-occurrence learning objective in a self-organizing manner, without using any weights, i.e., merely through local interactions. Empirically, we find that triple-encoders lead to a substantial improvement over bi-encoders, and even to better zero-shot generalization than single-vector representation models without requiring re-encoding. Our code (https://github.com/UKPLab/acl2024-triple-encoders) and model (https://huggingface.co/UKPLab/triple-encoders-dailydialog) are publicly available.
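
The idea of forming utterance mixtures from independently encoded utterances by mean-pooling, with no learnable fusion weights, can be illustrated with a minimal sketch. It assumes that utterance embeddings are already available (random vectors stand in for encoder outputs here), that every pair of context utterances is averaged, and that candidate scores are the mean cosine similarity over all mixtures. The function names and the aggregation over pairs are illustrative assumptions, not the authors' released implementation.

```python
import itertools

import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def pairwise_mixtures(context_embs):
    """Average every pair of context embeddings (weight-free fusion).

    Returns the normalized mixtures and the index pairs they came from.
    """
    pairs = list(itertools.combinations(range(len(context_embs)), 2))
    mixtures = np.stack(
        [(context_embs[i] + context_embs[j]) / 2.0 for i, j in pairs]
    )
    return l2_normalize(mixtures), pairs


def score_candidates(context_embs, candidate_embs):
    """Score candidates by mean cosine similarity to all pairwise mixtures.

    Averaging over pairs is an assumption made for this sketch.
    """
    mixtures, _ = pairwise_mixtures(context_embs)
    sims = mixtures @ l2_normalize(candidate_embs).T  # (num_pairs, num_candidates)
    return sims.mean(axis=0)


# Hypothetical stand-ins for encoder outputs: 4 context utterances, 3 candidates.
rng = np.random.default_rng(0)
context = rng.normal(size=(4, 8))
candidates = rng.normal(size=(3, 8))

mixtures, pairs = pairwise_mixtures(context)
scores = score_candidates(context, candidates)
best_candidate = int(np.argmax(scores))
```

Because each utterance is encoded once and the fusion is a plain average, adding a new turn only requires encoding that turn and averaging it with cached embeddings; nothing in the history is re-encoded.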

Paper Structure

This paper contains 35 sections, 6 equations, 9 figures, 8 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Comparison of our Triple-Encoder to Henderson et al. (2020) and Erker et al. (2023). Similar to CCL, we only need to encode and compute similarity scores for the latest utterance. At the same time, we achieve contextualization through pairwise mean-pooling with previously encoded utterances, combining the advantages of both previous works. Our analysis shows that co-occurrence training pushes representations that occur (fire) together closer together, leading to stronger additive properties (wiring) when they are superimposed (compared to Erker et al., 2023) and thus to better next-utterance selection.
  • Figure 2: Difficult example for next-utterance selection based solely on independent utterances. Here the model must know that both utterances occur together, since it has to consider them jointly to derive the third utterance (in red). This is reflected by the significant gap in normalized rank between our contextualized approach and the uncontextualized approach of Erker et al. (2023).
  • Figure 3: Concept of relativity in Imaginary Embeddings with $w=5$, using before [B] and after [A] tokens (Erker et al., 2023).
  • Figure 4: Our Triple-Encoder architecture with two directional before tokens, [B1] and [B2]. We create a combined state of two utterances as the average of their separately encoded embeddings. The target distance of this combined state is a normalized sum of the individual utterance scores from bi-encoder Curved Contrastive Learning.
  • Figure 5: Relative time dimension in our proposed Contextualized Curved Contrastive Learning. As the observation window moves from $t \rightarrow t+1$, $3$ new triplets are added (dark green), $3$ are removed (light green), and $3$ are decayed by $-0.4$ (green). As shown by the incoming green arrows at utterance $u_5$, we only have to encode the new incoming utterance with a [B2] token. In the next turn we require its encoding with the [B1] token, which can be computed at idle time while the dialog partner is speaking.
  • ...and 4 more figures
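
The bookkeeping described in the Figure 5 caption can be checked with a small sketch. It assumes that each triplet corresponds to one pair of context utterances inside a sliding window, and that the window holds 4 utterances; this window size is chosen here only because it reproduces the 3 added / 3 removed / 3 retained counts in the caption, and the paper may define the window differently. The helper name `window_pairs` is hypothetical.

```python
from itertools import combinations


def window_pairs(last_index, window_size):
    """Return all ordered-by-time utterance pairs inside the current window.

    Utterances are numbered from 1; the window ends at `last_index`.
    """
    start = max(1, last_index - window_size + 1)
    return set(combinations(range(start, last_index + 1), 2))


WINDOW_SIZE = 4  # assumption: chosen to match the counts in the caption

pairs_t = window_pairs(4, WINDOW_SIZE)   # window covers u_1..u_4
pairs_t1 = window_pairs(5, WINDOW_SIZE)  # window covers u_2..u_5

added = pairs_t1 - pairs_t     # pairs that involve the new utterance u_5
removed = pairs_t - pairs_t1   # pairs that involve the dropped utterance u_1
retained = pairs_t & pairs_t1  # pairs kept, whose target score would be decayed

print(sorted(added))     # [(2, 5), (3, 5), (4, 5)]
print(sorted(removed))   # [(1, 2), (1, 3), (1, 4)]
print(sorted(retained))  # [(2, 3), (2, 4), (3, 4)]
```

Every added pair contains the newest utterance, which is consistent with the caption's point that only the incoming utterance needs a fresh encoding at each turn.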