Multiplier Design Addressing Area-Delay Trade-offs by using DSP and Logic resources on FPGAs

Andreas Böttcher, Martin Kumm

TL;DR

The paper tackles efficient FPGA multipliers under DSP-LUT constraints by integrating small Booth-Array sub-multipliers into a multiplier-tiling framework and optimizing tile selection and compressor-tree design via ILP. It extends FloPoCo to incorporate Booth tiles and shows that restricting Booth depth to four levels achieves a favorable balance between area and delay, especially when coupled with DSP-based tiling. The approach yields new Pareto-optimal points and often reduces LUT usage with competitive critical-path delay compared to state-of-the-art logic- and Booth-based designs. Open-source tooling and ILP-based global optimization enable practical, scalable design of medium-to-large multipliers on modern FPGAs.

Abstract

The major challenge when designing multipliers for FPGAs is to address several trade-offs: on the one hand at the performance level, and on the other hand at the resource level, utilizing DSP blocks or look-up tables (LUTs). With DSPs being a relatively limited resource, the problem of under- or over-utilization of DSPs has previously been addressed by the concept of multiplier tiling, which assembles multipliers from DSPs and small supplemental LUT multipliers. However, there has always been an efficiency gap between tiling-based multipliers and radix-4 Booth-Arrays. While the monolithic Booth-Array was shown to be considerably more efficient in terms of LUT resources on many modern FPGA architectures, it typically possesses a significantly higher critical path delay (or latency when pipelined) compared to multipliers designed by tiling. This work proposes and analyzes the use of smaller Booth-Arrays as sub-multipliers that are integrated into existing tiling-based methods, such that better trade-off points between area and delay can be reached while utilizing a user-specified number of DSP blocks. Synthesis experiments show that the critical path delay can be reduced compared to large Booth-Arrays, while achieving significant reductions in LUT resources compared to previous tiling.
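
To make the tiling idea from the abstract concrete, the following is a minimal behavioral sketch in Python, not the paper's implementation or FloPoCo code. It assumes unsigned operands; the `Tile` class, the helper names, and the tile sizes (one 17×17 tile labeled `dsp`, three smaller tiles labeled `lut`) are hypothetical choices made only for illustration. Each tile multiplies one slice of each operand, and the shifted partial products are summed, which is the role played in hardware by the compressor tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """One rectangular sub-multiplier (illustrative model, not from the paper)."""
    x_off: int  # LSB position of the slice taken from operand x
    y_off: int  # LSB position of the slice taken from operand y
    x_w: int    # width of the x slice in bits
    y_w: int    # width of the y slice in bits
    kind: str   # label only ("dsp" or "lut"); it does not change the arithmetic


def bit_slice(value: int, offset: int, width: int) -> int:
    """Extract `width` bits of `value`, starting at bit `offset`."""
    return (value >> offset) & ((1 << width) - 1)


def covers_exactly(tiles, wx: int, wy: int) -> bool:
    """Check that the tiles cover the wx-by-wy rectangle without gaps or overlaps."""
    seen = set()
    for t in tiles:
        for i in range(t.x_off, t.x_off + t.x_w):
            for j in range(t.y_off, t.y_off + t.y_w):
                if i >= wx or j >= wy or (i, j) in seen:
                    return False
                seen.add((i, j))
    return len(seen) == wx * wy


def tiled_multiply(x: int, y: int, tiles) -> int:
    """Sum the shifted sub-products of all tiles (unsigned operands)."""
    total = 0
    for t in tiles:
        partial = bit_slice(x, t.x_off, t.x_w) * bit_slice(y, t.y_off, t.y_w)
        total += partial << (t.x_off + t.y_off)  # weight of the tile's LSB
    return total


# Hypothetical tiling of a 24x24 multiplier into four tiles (sizes are assumptions).
TILES_24X24 = [
    Tile(0, 0, 17, 17, "dsp"),
    Tile(17, 0, 7, 17, "lut"),
    Tile(0, 17, 17, 7, "lut"),
    Tile(17, 17, 7, 7, "lut"),
]

if __name__ == "__main__":
    x, y = 0xABCDEF, 0x123456
    print(covers_exactly(TILES_24X24, 24, 24))
    print(tiled_multiply(x, y, TILES_24X24) == x * y)
```

Any set of tiles that passes `covers_exactly` yields the correct product, which is why tile selection can be treated as an optimization problem over cost and delay rather than as a correctness question.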

Paper Structure (12 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Tiling (a) and compressor tree (b) of a $24\times 24$-multiplier, composed of four sub-multiplier-tiles.
  • Figure 2: Slice configuration for Booth multipliers.
  • Figure 3: Overall structure of an $8\times 8$ Booth multiplier.
  • Figure 4: Dependency between size and efficiency.
  • Figure 5: Critical path delay vs. complexity.
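
As background for the Booth multipliers named in Figures 2 and 3, the sketch below shows textbook radix-4 Booth recoding in Python. It is a behavioral illustration only and does not model the LUT or slice mapping described in the paper; the function names are hypothetical. The multiplier operand is treated as a two's-complement number and recoded into digits from {-2, -1, 0, 1, 2}, so a `width`-bit operand produces about `width / 2` partial-product rows instead of `width`.

```python
def booth_radix4_digits(y: int, width: int) -> list:
    """Recode a two's-complement `width`-bit value into radix-4 Booth digits.

    Digits are returned least significant first, each in {-2, -1, 0, 1, 2}.
    Assumes -2**(width-1) <= y < 2**(width-1).
    """
    if width % 2:
        width += 1  # sign-extend to an even number of bits
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, width, 2):
        # Python's >> on negative ints is arithmetic, so this sign-extends.
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, width: int) -> int:
    """Multiply x by y using one partial product per Booth digit of y."""
    total = 0
    for i, d in enumerate(booth_radix4_digits(y, width)):
        total += (d * x) << (2 * i)  # each digit has weight 4**i
    return total


if __name__ == "__main__":
    print(booth_radix4_digits(53, 8))
    print(booth_multiply(-77, 53, 8) == -77 * 53)
```

In hardware, the multiples 0, ±x, and ±2x are obtained by shifting and conditional negation, so each row stays cheap while the number of rows is roughly halved.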