Scalable and Effective Arithmetic Tree Generation for Adder and Multiplier Designs
Yao Lai, Jinxin Liu, David Z. Pan, Ping Luo
TL;DR
This paper reframes the design of arithmetic modules as tree-generation problems (AddGame for adders and MultGame for multipliers) optimized by reinforcement learning. It introduces two specialized agents, MCTS for prefix-tree optimization and PPO for compressor-tree optimization, coupled in a co-design loop with a fast yet accurate synthesis flow. The approach yields Pareto-optimal 128-bit adders (e.g., $L=10$, size $244$) and improves on state-of-the-art RL methods in both delay and area: by up to 26% and 30% over PrefixRL for adders, and by up to 49% and 45% over RL-MUL for multipliers. The designs also transfer to 7nm technology, which suggests they can be integrated into modern synthesis flows to improve speed and reduce silicon area.
Abstract
Across a wide range of hardware scenarios, the computational efficiency and physical size of the arithmetic units significantly influence the speed and footprint of the overall hardware system. Nevertheless, prior arithmetic design techniques do not sufficiently optimize speed and area, resulting in a reduced processing rate and larger module size. To boost arithmetic performance, in this work we focus on the two most common and fundamental arithmetic modules: adders and multipliers. We cast the design tasks as single-player tree generation games and use reinforcement learning to optimize their arithmetic tree structures. This tree generation formulation allows us to navigate the vast search space efficiently and to discover, within just a few hours, arithmetic designs that improve both computational efficiency and hardware size. For adders, our approach discovers 128-bit adder designs that achieve Pareto optimality in theoretical metrics. Compared with the state-of-the-art PrefixRL, our method decreases computational delay and hardware size by up to 26% and 30%, respectively. For multipliers, compared with RL-MUL, our approach increases speed and reduces size by as much as 49% and 45%, respectively. Moreover, the flexibility and scalability of our method enable us to deploy our designs in cutting-edge technologies, as we show that they can be seamlessly integrated into 7nm technology. We believe our work will offer valuable insights into hardware design, further accelerating speed and reducing size through the refined search space and our tree generation methodologies. See our introduction video at https://bit.ly/ArithmeticTree. Code is released at https://github.com/laiyao1/ArithmeticTree.
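To make the "theoretical metrics" of an adder prefix tree concrete, the following is a minimal sketch that builds two classical prefix structures and measures them. It assumes that $L$ in the TL;DR denotes the level (logic depth) of the prefix tree and that size counts prefix operator nodes. The function names `ripple`, `sklansky`, and `metrics` are hypothetical and are not taken from the released code; this is not the authors' search procedure, only an illustration of the two quantities being traded off.

```python
# Illustrative sketch only: names and node encoding are assumptions,
# not the representation used in the ArithmeticTree repository.

def ripple(n):
    """Serial prefix structure: bit i combines with the prefix of bit i-1.

    A node (msb, lsb) covers bits msb..lsb and maps to its two parents
    (upper span, lower span). Inputs (i, i) are implicit.
    """
    return {(i, 0): ((i, i), (i - 1, 0)) for i in range(1, n)}


def sklansky(n):
    """Sklansky (divide-and-conquer) prefix structure for n a power of two."""
    nodes = {}
    span = 1
    while span < n:
        for i in range(n):
            if (i // span) % 2 == 1:          # bit i is in the upper half of a block
                base = (i // span) * span     # lowest bit of that upper half
                lsb = base - span             # lowest bit of the whole block
                nodes[(i, lsb)] = ((i, base), (base - 1, lsb))
        span *= 2
    return nodes


def metrics(nodes):
    """Return (level, size): maximum logic depth and number of operator nodes."""
    level = {}
    # Parents always cover a narrower span, so process nodes by width.
    for node in sorted(nodes, key=lambda s: s[0] - s[1]):
        upper, lower = nodes[node]
        # The two parents must be adjacent and together cover the node's span.
        assert upper[0] == node[0] and lower[1] == node[1]
        assert upper[1] == lower[0] + 1
        # Input spans (i, i) are not stored and sit at level 0.
        level[node] = 1 + max(level.get(upper, 0), level.get(lower, 0))
    return max(level.values()), len(nodes)


if __name__ == "__main__":
    for name, build in [("ripple", ripple), ("sklansky", sklansky)]:
        print(name, metrics(build(128)))
    # ripple   -> (127, 127): smallest size, largest level
    # sklansky -> (7, 448):   smallest level, much larger size
```

Under these assumptions, the 128-bit point quoted in the TL;DR ($L=10$, size $244$) sits between the two extremes: it is far shallower than the serial structure while using roughly half the operator nodes of the Sklansky structure. Level and size are only proxies, which is presumably why the paper also evaluates delay and area through a synthesis flow.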
