QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture

Shvetank Prakash, Andrew Cheng, Jason Yik, Arya Tschand, Radhika Ghosal, Ikechukwu Uchendu, Jessica Quaye, Jeffrey Ma, Shreyas Grampurohit, Sofia Giannuzzi, Arnav Balyan, Fin Amin, Aadya Pipersenia, Yash Choudhary, Ankita Nayak, Amir Yazdanbakhsh, Vijay Janapa Reddi

TL;DR

QuArch is presented as the first question-answering dataset focused on computer architecture, comprising $1{,}547$ expert-validated questions across 13 topics that test domain knowledge. An evaluation of state-of-the-art language models shows a performance ceiling near $84\%$ and a $12\%$ accuracy gap for small open-source models, with memory systems and interconnects as persistent weaknesses. QuArch serves as both a benchmark and a training resource: fine-tuning on it improves small-model accuracy on architecture tasks by $5.4\%$–$8.3\%$. The work points to the practical potential of AI-assisted architecture research and outlines directions toward deeper reasoning and system-level capabilities. The dataset and leaderboard are publicly available at https://harvard-edge.github.io/QuArch/.

Abstract

We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models' understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles in memory systems, interconnection networks, and benchmarking. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are at https://harvard-edge.github.io/QuArch/.

Paper Structure (14 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Example QAs from QuArch for various topics curated from different sources. The bolded answer is correct.
  • Figure 2: QuArch dataset construction pipeline.
  • Figure 3: Distribution of computer architecture topics in QuArch.
  • Figure 4: QuArch accuracy ranges from 39% to 84%. Larger models ($>$70B parameters) attain a maximum of 84%, while small-model ($<$10B parameters) performance is 12% lower.
  • Figure 5: Performance breakdown across topics. Color intensity indicates each topic's relative (intra-model) performance, with darker green showing stronger understanding and darker red showing weaker areas. Memory systems and interconnects are more challenging for current LMs. Benchmarking also shows low performance but accounts for only 1% of the QAs.
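
The per-topic breakdown described for Figure 5 can be illustrated with a minimal Python sketch. The record layout `(topic, predicted, correct)`, the function names, and the toy data are hypothetical. The summary does not say how the intra-model relative performance is normalized, so subtracting the model's overall accuracy is an assumption made here purely for illustration.

```python
from collections import defaultdict


def topic_accuracy(records):
    """Return {topic: accuracy} for (topic, predicted, correct) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for topic, predicted, correct in records:
        totals[topic] += 1
        hits[topic] += int(predicted == correct)
    return {topic: hits[topic] / totals[topic] for topic in totals}


def relative_to_overall(records):
    """Per-topic accuracy minus overall accuracy for the same model.

    Assumption: this is one plausible reading of "relative (intra-model)
    performance"; the actual normalization is not given in the source.
    """
    overall = sum(p == c for _, p, c in records) / len(records)
    return {t: acc - overall for t, acc in topic_accuracy(records).items()}


# Hypothetical toy records, not QuArch data.
toy = [
    ("memory systems", "A", "B"),
    ("memory systems", "C", "C"),
    ("processor design", "D", "D"),
    ("processor design", "A", "A"),
]

print(topic_accuracy(toy))
print(relative_to_overall(toy))
```

Negative values mark topics where a model does worse than its own average, which corresponds to the red cells in the figure under this assumed normalization.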