Interleaving Text and Number Embeddings to Solve Mathematics Problems

Marvin Alberts, Gianmarco Gabrieli, Irina Espejo Morales

TL;DR

This paper builds on continuous numerical encodings for language models. It addresses key shortcomings of that approach by eliminating numerical artefacts and handling a wide range of magnitudes without clipping, and it introduces a routing layer that differentiates between numerical and text embeddings.

Abstract

Integrating text and numbers effectively is a crucial step towards enhancing the capabilities of Large Language Models (LLMs) in assisting with scientific tasks. While most current approaches rely on discrete tokenization of numbers, for instance conversion to scientific notation or base-10 decomposition, a recent approach proposed a continuous numerical encoding as an inductive bias. In this paper, we build upon this approach by introducing more expressive numerical embeddings. Our method addresses key shortcomings of the earlier encoding: it eliminates numerical artefacts and handles a wide range of magnitudes without clipping. Our work presents two key contributions. First, we employ an MLP to assign distinct directions in the embedding space to different numbers. Second, we introduce a routing layer that differentiates between numerical and text embeddings. We hypothesise that this combined approach enables the model to distinguish between text and number distributions while maintaining its capacity for arithmetic operations. Using only a 45M-parameter encoder-decoder architecture, our method achieves an $R^2 = 0.9988$ over a wide range of magnitudes (from $10^{-3}$ to $10^{8}$). In addition, we empirically observe a reduction in numerical artefacts and biases compared to the baselines.
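
To make the two contributions concrete, here is a minimal NumPy sketch of the general idea, not the authors' implementation. It uses a small MLP that maps a scalar to an embedding vector, and a linear routing head that scores each position as "number" versus "text". The embedding width, the hidden size, the sign and log-magnitude input features, and all function names (`init_mlp`, `embed_number`, `embed_sequence`, `route`) are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

D_MODEL = 16  # hypothetical embedding width, not taken from the paper


def init_mlp(d_hidden=32, d_model=D_MODEL, seed=0):
    """Randomly initialise a two-layer MLP (untrained, for illustration only)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.5, (2, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(0.0, 0.5, (d_hidden, d_model)),
        "b2": np.zeros(d_model),
    }


def embed_number(params, x):
    """Map a scalar to an embedding vector with a small MLP.

    Assumption: the input features are the sign and the log10 magnitude,
    so that values such as 1e-3 and 1e8 live on a comparable scale.
    """
    feats = np.array([np.sign(x), np.log10(np.abs(x) + 1e-12)])
    hidden = np.tanh(feats @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]


def embed_sequence(tokens, text_table, mlp):
    """Embed an interleaved sequence: table lookup for text, MLP for numbers."""
    rows = []
    for tok in tokens:
        if isinstance(tok, (int, float)):
            rows.append(embed_number(mlp, float(tok)))
        else:
            rows.append(text_table[tok])
    return np.stack(rows)


def route(hidden_states, w_route, b_route):
    """Routing head: per-position probability that the position is a number."""
    logits = hidden_states @ w_route + b_route
    return 1.0 / (1.0 + np.exp(-logits))
```

In a trained model the routing probabilities would decide, per position, whether the number head or the text head produces the output; in this sketch all weights are random or user-supplied, so only the shapes and data flow are meaningful.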

Paper Structure

This paper contains 7 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Diagram summarizing the MMD decoding method presented in this paper. (Left) High-level workflow of the encoder, routing layer classifying the modality, and the decoder. (Right) Zoom in on encoding an interleaved text and number sequence where, for text, the usual tokenization scheme is followed and for each number a new embedding vector is trained end-to-end.
  • Figure 2: Log-log comparison of predicted vs ground-truth values for arithmetic computations in the test set, for the BPE and Word-level baselines (bottom row, from left to right) and our MMD method (top row). The colour of each point indicates the relative error between prediction and ground truth, with darker points indicating lower error.
  • Figure 3: Schematic representation of the different baselines and our models, indicating whether the routing layer is active or not when a number is in the input.
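
The two evaluation quantities mentioned in the abstract and in the Figure 2 caption, $R^2$ and relative error, can be computed as in the sketch below. This uses the standard textbook definitions; whether the paper computes $R^2$ on raw or on log-transformed values is not stated in this summary, so the `log_space` flag and both function names are illustrative assumptions.

```python
import numpy as np


def r_squared(y_true, y_pred, log_space=False):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    log_space=True compares log10 values instead; this option is an
    assumption for wide-magnitude data and requires positive inputs.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if log_space:
        y_true, y_pred = np.log10(y_true), np.log10(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def relative_error(y_true, y_pred):
    """Element-wise |prediction - truth| / |truth|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_pred - y_true) / np.abs(y_true)
```

When targets span many orders of magnitude, an $R^2$ computed on raw values is dominated by the largest targets, which is one reason to inspect per-point relative error alongside it.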