QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture

Shvetank Prakash; Andrew Cheng; Jason Yik; Arya Tschand; Radhika Ghosal; Ikechukwu Uchendu; Jessica Quaye; Jeffrey Ma; Shreyas Grampurohit; Sofia Giannuzzi; Arnav Balyan; Fin Amin; Aadya Pipersenia; Yash Choudhary; Ankita Nayak; Amir Yazdanbakhsh; Vijay Janapa Reddi


TL;DR

QuArch introduces the first architecture-focused QA dataset, comprising $1{,}547$ expert-validated questions across 13 topics to evaluate domain knowledge in computer architecture. The study surveys state-of-the-art language models and finds a performance ceiling near $84\%$, with the best small open-source model trailing by about 12 percentage points; memory systems and interconnects are persistent weaknesses. QuArch serves as both a benchmark and a training resource: fine-tuning on it improves small models by $5.4\%$–$8.3\%$ on architecture tasks. The work underscores the practical potential of AI-assisted architecture research and outlines directions toward deeper reasoning and system-level capabilities. The dataset and leaderboard are publicly available at https://harvard-edge.github.io/QuArch/.

Abstract

We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models' understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles in memory systems, interconnection networks, and benchmarking. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are at https://harvard-edge.github.io/QuArch/.


Paper Structure (14 sections, 5 figures, 2 tables)


Introduction
Related Work
QuArch
Dataset Curation: The Archipedia Corpus
Dataset Generation: QA Creation
Dataset Coverage: Architecture Topics
Results
Experimental Setup
Understanding of Architecture Concepts
Analysis by Architecture Topics
QuArch as an Architecture Benchmark
QuArch as an Architecture Training Dataset
Conclusion
Acknowledgements

Figures (5)

Figure 1: Example QAs from QuArch for various topics curated from different sources. The bolded answer is correct.
Figure 2: QuArch dataset construction pipeline.
Figure 3: Distribution of computer architecture topics in QuArch.
Figure 4: QuArch accuracy ranges from 39% to 84%. Larger models ($>$70B parameters) attain a maximum of 84%. Small models ($<$10B parameters) score 12% lower in comparison.
Figure 5: Performance breakdown across topics. Color intensity indicates topic's relative (intra-model) performance, with darker green showing stronger understanding and darker red showing weaker areas. Memory systems and interconnects are more challenging for current LMs. Benchmarking also shows low performance but only accounts for 1% of the QAs.
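
The accuracy figures and per-topic breakdown described in Figures 4 and 5 can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record keys (`topic`, `answer`, `prediction`) and the function names are hypothetical, and the "relative (intra-model)" score is assumed here to be a topic's accuracy minus the model's overall accuracy, since the captions do not specify the normalization.

```python
from collections import defaultdict


def score(records):
    """Return (overall accuracy, per-topic accuracy) for multiple-choice QA records.

    Each record is a dict with hypothetical keys:
    'topic', 'answer' (gold choice), 'prediction' (model choice).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["topic"]] += 1
        correct[r["topic"]] += int(r["prediction"] == r["answer"])
    if not totals:
        raise ValueError("no records to score")
    per_topic = {t: correct[t] / totals[t] for t in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_topic


def relative_to_model_mean(overall, per_topic):
    """Assumed intra-model normalization: topic accuracy minus overall accuracy."""
    return {t: acc - overall for t, acc in per_topic.items()}


# Toy data for illustration only; these are not QuArch questions or results.
toy = [
    {"topic": "Memory Systems", "answer": "B", "prediction": "B"},
    {"topic": "Memory Systems", "answer": "A", "prediction": "C"},
    {"topic": "Processor Design", "answer": "D", "prediction": "D"},
    {"topic": "Processor Design", "answer": "A", "prediction": "A"},
]
overall, per_topic = score(toy)
relative = relative_to_model_mean(overall, per_topic)
```

A negative relative score marks a topic where the model does worse than its own average, which is what the darker red cells in Figure 5 would correspond to under this assumption.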
