Learning representations of learning representations

Rita González-Márquez, Dmitry Kobak

TL;DR

The paper introduces the ICLR dataset, a public corpus of 24,445 abstracts from 2017–2024 with metadata and keyword-based labels, intended both for metascience of machine learning and as an NLP benchmark where the challenge is to beat a TF-IDF baseline in $k$NN classification accuracy. It systematically evaluates embedding methods (TF-IDF, SVD, multiple sentence-transformer models, and commercial embeddings) and finds that TF-IDF outperforms most dedicated sentence-transformer models, with SBERT and OpenAI embeddings offering only modest gains. A 2D $t$-SNE visualization built on SBERT representations reveals topic shifts over time, including the rising popularity of diffusion models and the growing dominance of large language models within NLP, while showing no systematic gender bias across subfields. The dataset and accompanying code enable both field-wide analyses and practical benchmarking, and the results challenge the assumption that modern sentence transformers are superior for this specific scientific-text task.

Abstract

The ICLR conference is unique among the top machine learning conferences in that all submitted papers are openly available. Here we present the ICLR dataset consisting of abstracts of all 24 thousand ICLR submissions from 2017-2024 with meta-data, decision scores, and custom keyword-based labels. We find that on this dataset, bag-of-words representation outperforms most dedicated sentence transformer models in terms of $k$NN classification accuracy, and the top performing language models barely outperform TF-IDF. We see this as a challenge for the NLP community. Furthermore, we use the ICLR dataset to study how the field of machine learning has changed over the last seven years, finding some improvement in gender balance. Using a 2D embedding of the abstracts' texts, we describe a shift in research topics from 2017 to 2024 and identify hedgehogs and foxes among the authors with the highest number of ICLR submissions.
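The comparison described above, TF-IDF representations scored by $k$NN classification accuracy, can be sketched in a few lines. This is a minimal illustration and not the authors' code: the toy abstracts and labels are hypothetical, the smoothed IDF formula is one common variant, and leave-one-out evaluation with cosine similarity is an assumption made to keep the example small. The function names `tfidf_vectors`, `cosine`, and `knn_loo_accuracy` are invented for this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalised sparse TF-IDF vectors (dicts) for tokenised docs."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant; the paper's exact weighting may differ).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_loo_accuracy(vectors, labels, k=1):
    """Leave-one-out kNN accuracy using a majority vote over the k nearest."""
    correct = 0
    for i, query in enumerate(vectors):
        sims = [(cosine(query, other), j)
                for j, other in enumerate(vectors) if j != i]
        sims.sort(reverse=True)
        votes = Counter(labels[j] for _, j in sims[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(vectors)


# Hypothetical toy "abstracts" with hypothetical topic labels.
toy = [
    ("diffusion models generate images by denoising", "generative"),
    ("score based diffusion sampling for images", "generative"),
    ("policy gradient agents learn reward in environments", "rl"),
    ("agents explore environments to maximise reward", "rl"),
    ("large language models follow instructions in text", "nlp"),
    ("language models pretrained on text corpora", "nlp"),
]
docs = [text.split() for text, _ in toy]
labels = [label for _, label in toy]
vectors = tfidf_vectors(docs)
accuracy = knn_loo_accuracy(vectors, labels, k=1)
print(accuracy)
```

To compare another representation under the same protocol, one would replace `tfidf_vectors` with a function returning unit-length embeddings from the model under test and keep `knn_loo_accuracy` unchanged, so that any difference in accuracy comes from the representation alone.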

Paper Structure (7 sections, 8 figures, 2 tables)

This paper contains 7 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Summary statistics of the ICLR dataset (ICLR24v2).
  • Figure 2: $t$-SNE embedding of the SBERT representation of ICLR abstracts (2017–2024). Left: coloured by year; right: coloured by topic.
  • Figure 3: ICLR papers containing the words "understanding" (366), "rethinking" (155), and a question mark (550) in the title.
  • Figure 4: Top 18 authors by the total number of ICLR submissions over 2017–2024. Each panel shows the total number of submissions and the acceptance rate.
  • Figure S1: Acceptance decisions and average scores. Left: accepted papers are shown on top. Right: papers are shown in randomized order.
  • ...and 3 more figures