Asterisk*: Keep it Simple

Andrew Semenov

TL;DR

Asterisk is a compact GPT-based model for generating text embeddings, built on a minimalist architecture with two layers, two attention heads, and 256 embedding dimensions.

Abstract

This paper describes Asterisk, a compact GPT-based model for generating text embeddings. The model uses a minimalist architecture with two layers, two attention heads, and 256 embedding dimensions. By applying knowledge distillation from larger pretrained models, we explore the trade-offs between model size and performance while minimizing computational and memory requirements. The model is primarily evaluated and optimized for classification tasks, and experimental results show moderate zero-shot classification performance across various downstream applications. With additional configuration, its performance can approach or even surpass that of larger architectures on specific classification tasks.
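To give a sense of scale for the stated architecture, the following is a minimal sketch that counts parameters for a GPT-2-style transformer with two layers, two heads, and 256 embedding dimensions. Only those three values come from the abstract. The names `AsteriskConfig`, `block_params`, and `count_params`, the vocabulary size, the context length, the 4x MLP expansion, and the use of bias terms are all assumptions made for illustration, so the totals may differ from the actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsteriskConfig:
    # Values stated in the abstract.
    n_layer: int = 2
    n_head: int = 2
    n_embd: int = 256
    # ASSUMPTIONS for illustration only; not stated in the source.
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size
    block_size: int = 1024   # maximum sequence length


def block_params(d: int) -> int:
    """Parameters in one GPT-2-style block of width d (biases included)."""
    attention = (3 * d * d + 3 * d) + (d * d + d)  # QKV projection + output projection
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)    # assumed 4x expansion, then projection back
    layer_norms = 2 * (2 * d)                      # two LayerNorms, scale and shift each
    return attention + mlp + layer_norms


def count_params(cfg: AsteriskConfig) -> dict:
    """Break the parameter count into embeddings, blocks, and final norm."""
    d = cfg.n_embd
    if d % cfg.n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    embeddings = cfg.vocab_size * d + cfg.block_size * d  # token + position tables
    blocks = cfg.n_layer * block_params(d)
    final_norm = 2 * d
    return {
        "embeddings": embeddings,
        "blocks": blocks,
        "final_norm": final_norm,
        "total": embeddings + blocks + final_norm,
    }


if __name__ == "__main__":
    cfg = AsteriskConfig()
    print("head dimension:", cfg.n_embd // cfg.n_head)
    for name, value in count_params(cfg).items():
        print(f"{name}: {value:,}")
```

Under these assumptions, the two transformer blocks account for 1,579,520 parameters, and the embedding tables dominate the total. In other words, the tokenizer choice would affect model size far more than the layer count does.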

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Asterisk* architecture
  • Figure 2: Asterisk* training process
  • Figure 3: During training, the model reached a cosine similarity of 0.65 (out of a maximum of 1) with the teacher embeddings, and the loss dropped from 0.7080 to 0.2427
  • Figure 4: Asterisk* + FC classification setup
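The cosine similarity quoted in the Figure 3 caption measures how closely student embeddings align with teacher embeddings. The sketch below shows one way such a metric could be computed over a batch. The function names are hypothetical, and it assumes the student and teacher vectors already share the same dimensionality. It is not the paper's training loss, whose exact form is not given here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_cosine_similarity(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Average per-example cosine similarity over a batch of embedding pairs."""
    if len(student) != len(teacher) or not student:
        raise ValueError("batches must be non-empty and of equal size")
    sims = [cosine_similarity(s, t) for s, t in zip(student, teacher)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real embeddings would have 256 dimensions.
    student_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
    teacher_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(mean_cosine_similarity(student_batch, teacher_batch))
```

A value of 1 means the student and teacher vectors point in the same direction, so 0.65 indicates partial alignment with the teacher.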