DSLUT: An Asymmetric LUT and its Automatic Design Flow Based on Practical Functions
Moucheng Yang, Kaixiang Zhu, Lingli Wang, Xuegong Zhou
TL;DR
This work tackles the high cost and redundancy of conventional LUTs in FPGAs by introducing DSLUT, an asymmetric, domain-specific, LUT-like programmable logic block (PLB) that targets practical functions rather than all possible functions. An automatic design flow combines a practical-function library, NPN-based function categorization, heuristic initialization, and hyper-parameter optimization to generate a bit assignment that connects fewer programmable bits to a compact MUX tree. SAT-based Boolean matching against the library of practical functions enables accurate technology mapping within an ABC-based flow. Full EDA validation shows that DSLUT6 (26 SRAM bits) covers 780 of the 3881 practical six-input NPN classes, reducing logic levels relative to LUT5 at a much lower area cost than LUT6. The approach delivers performance gains on benchmark circuits close to those of LUT6 with a smaller area overhead, enabling higher functional density in domain-specific FPGA deployments.
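As an illustration of what NPN-based categorization means, the following minimal sketch computes a canonical representative of a function's NPN class by brute force: it tries every input permutation, input negation, and output negation, and keeps the smallest resulting truth table. The function name `npn_canonical` and the integer truth-table encoding (bit `x` holds the output for input vector `x`) are assumptions made for this example. Exhaustive search is practical only for a few inputs and is not presented as the method used in the DSLUT flow.

```python
from itertools import permutations

def npn_canonical(tt: int, n: int) -> int:
    """Smallest truth table reachable from tt by NPN transforms (brute force).

    tt is an integer whose bit x is the function output for input vector x.
    Hypothetical helper for illustration; feasible only for small n.
    """
    size = 1 << n                 # number of truth-table entries
    mask = (1 << size) - 1        # used to negate the output
    best = None
    for perm in permutations(range(n)):      # P: permute inputs
        for neg in range(1 << n):            # N: negate a subset of inputs
            out = 0
            for x in range(size):
                xn = x ^ neg
                y = 0
                for i in range(n):           # route input i to position perm[i]
                    if (xn >> i) & 1:
                        y |= 1 << perm[i]
                if (tt >> y) & 1:
                    out |= 1 << x
            for cand in (out, out ^ mask):   # N: optionally negate the output
                if best is None or cand < best:
                    best = cand
    return best

# AND, OR, NAND and NOR of two inputs all fall into one NPN class.
and2, or2, nand2, nor2 = 0b1000, 0b1110, 0b0111, 0b0001
same_class = len({npn_canonical(t, 2) for t in (and2, or2, nand2, nor2)}) == 1
```

Grouping all sixteen 2-input functions by this canonical form yields four classes, which shows how categorization shrinks the set of functions a library has to store.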
Abstract
The conventional LUT is redundant because the functions that occur in real-world benchmarks make up only a small fraction of all possible functions. For example, only 3881 of the more than $10^{14}$ NPN classes of 6-input functions occur in the mapped netlists of the VTR8 and Koios benchmarks. We therefore propose a novel LUT-like architecture, named DSLUT, with asymmetric inputs and programmable bits, which efficiently implements the practical functions of domain-specific benchmarks rather than all functions. The compact MUX-tree structure of the conventional LUT is preserved, but fewer programmable bits are connected to the MUX tree, according to a bit assignment generated by the proposed algorithm. For evaluation, a 6-input DSLUT with 26 SRAM bits is generated from the practical functions of 39 circuits in the VTR8 and Koios benchmarks. Post-synthesis results from the ABC flow show that the proposed DSLUT6 architecture reduces the number of logic levels by 10.98% at a cost of 7.25% area overhead compared with the LUT5 architecture, whereas LUT6 reduces the number of levels by 15.16% at a cost of 51.73% more PLB area. Post-implementation results from the full VTR flow show that DSLUT6 improves performance by 4.59% over LUT5, close to the 5.42% improvement of LUT6 over LUT5, with less area overhead (6.81% for DSLUT6 versus 10.93% for LUT6).
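The idea of connecting fewer programmable bits to an unchanged MUX tree can be sketched as follows. This is a minimal model under stated assumptions, not the paper's implementation: the bit assignment is represented as a list `assign` that maps each MUX-tree leaf to one SRAM bit index, several leaves may share a bit, and a function fits only if all leaves sharing a bit require the same value. The names `dslut_eval` and `dslut_program` and the 3-input, 6-bit toy assignment are hypothetical, and the check uses one fixed input ordering, whereas the actual flow relies on SAT-based Boolean matching.

```python
from typing import List, Optional

def dslut_eval(assign: List[int], sram: List[int], x: int) -> int:
    """Output of the MUX tree for input vector x: leaf x reads SRAM bit assign[x]."""
    return sram[assign[x]]

def dslut_program(assign: List[int], n_bits: int, tt: int) -> Optional[List[int]]:
    """Find SRAM contents that realize truth table tt, or None if it does not fit.

    Assumption: leaves are wired directly to SRAM bits, with no constants
    or inverters; tt is an integer whose bit x is the output for input x.
    """
    sram: List[Optional[int]] = [None] * n_bits
    for leaf, bit in enumerate(assign):
        value = (tt >> leaf) & 1
        if sram[bit] is None:
            sram[bit] = value
        elif sram[bit] != value:
            return None               # two leaves sharing a bit disagree
    return [0 if v is None else v for v in sram]

# Toy 3-input block: 8 leaves but only 6 SRAM bits (hypothetical assignment).
# Leaves 6 and 7 reuse the bits of leaves 4 and 5, so the output cannot
# depend on x1 when x2 = 1.
assign = [0, 1, 2, 3, 4, 5, 4, 5]

f_fits = 0b10101000     # x0 AND (x1 OR x2): independent of x1 when x2 = 1
f_parity = 0b10010110   # 3-input XOR: needs all 8 leaves to be independent

config = dslut_program(assign, 6, f_fits)
rejected = dslut_program(assign, 6, f_parity)
```

In this toy case `config` holds a valid 6-bit configuration for the first function, while the parity function is rejected because it needs eight independent leaves. A full LUT is the special case in which every leaf has its own bit.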
