Small Logic-based Multipliers with Incomplete Sub-Multipliers for FPGAs
Andreas Böttcher, Martin Kumm
TL;DR
This work addresses the need for efficient small-width multipliers on FPGAs for AI inference by replacing conventional rectangular sub-multipliers with incomplete, irregular LUT-based tiles within a multiplier tiling framework. It expands the design space through a systematic search (restricted to a $4\times4$ board) and uses truth-table simplification to identify high-efficiency tiles, then applies an ILP-based optimization to jointly select tiles and compressor-tree structures. Empirical results on Kintex-7 show that incomplete tiles reduce LUT usage by up to $17.6\%$ (average $3.7\%$) across sizes up to $16\times16$, with notable gains in dense packing scenarios and competitive critical path delay (CPD) and latency. The approach delivers higher arithmetic density for AI workloads while maintaining practical critical-path and latency characteristics, and is broadly applicable to modern FPGA platforms.
Abstract
There is a recent trend in artificial intelligence (AI) inference towards lower-precision data formats of 8 bits and below. As multiplication is the most complex operation in typical inference tasks, there is a large demand for efficient small multipliers. The large DSP blocks of FPGAs are limited in how efficiently they can implement many small multipliers. Hence, this work proposes better logic-based multipliers, which are especially beneficial at small word sizes. Our work is based on the multiplier tiling method, in which a multiplier is composed of several sub-multiplier tiles. Our key observation is that these sub-multipliers do not necessarily have to perform a complete (rectangular) $N\times K$ multiplication: incomplete (non-rectangular) sub-multipliers can be more efficient. This work first identifies efficient incomplete, irregular sub-multipliers and then demonstrates improvements over state-of-the-art designs. Optimal solutions are found using integer linear programming (ILP) and evaluated in FPGA synthesis experiments.
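The tiling idea can be illustrated with a minimal Python sketch. It models a tile as the set of partial-product cells $(i, j)$ it covers, each contributing $a_i b_j 2^{i+j}$, and checks that non-rectangular tiles still sum to the exact product as long as every cell of the board is covered exactly once. The three tile shapes and all function names here are hypothetical choices for illustration; they are not the tiles identified in the paper, and the sketch says nothing about LUT cost or the ILP formulation.

```python
from itertools import product

def tile_value(cells, a, b):
    """Sum of partial products a_i * b_j * 2^(i+j) over the cells a tile covers."""
    return sum((((a >> i) & 1) * ((b >> j) & 1)) << (i + j) for i, j in cells)

def is_exact_tiling(tiles, n, k):
    """True if the tiles cover every cell of the n x k board exactly once."""
    covered = [cell for tile in tiles for cell in tile]
    board = set(product(range(n), range(k)))
    return len(covered) == n * k and set(covered) == board

def tiled_multiply(tiles, a, b):
    """Add up the outputs of all sub-multiplier tiles."""
    return sum(tile_value(tile, a, b) for tile in tiles)

# Illustrative (assumed) tiling of a 4x4 board into three non-rectangular tiles,
# split along anti-diagonals of the partial-product array.
N = K = 4
board = list(product(range(N), range(K)))
tiles = [
    [(i, j) for i, j in board if i + j <= 2],  # triangular tile, 6 cells
    [(i, j) for i, j in board if i + j == 3],  # diagonal tile, 4 cells
    [(i, j) for i, j in board if i + j >= 4],  # triangular tile, 6 cells
]

# Exhaustive check over all 4-bit operand pairs.
all_match = all(
    tiled_multiply(tiles, a, b) == a * b
    for a in range(2 ** N)
    for b in range(2 ** K)
)
```

In an actual design, each tile would be mapped to LUTs and the tile outputs summed by a compressor tree; the sketch only confirms the arithmetic identity that makes irregular tiles admissible.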
