In designing the FPGA programmable logic fabric, we aimed to meet mainstream performance requirements with minimal power and cost. Microsemi’s PolarFire FPGAs typically consume one-tenth the static power of competing SRAM FPGAs, and half the total power. Some attributes of PolarFire FPGAs (such as the non-volatile configuration memory) reduce power directly; in other cases, power reduction is an indirect effect of reducing die area.
Selection of LUT-4 for the Logic Element
6-input LUTs can provide some speed benefits, but 4-input LUTs are the better choice for a power- and cost-optimized FPGA like Microsemi’s in a modern process technology. It is well established that 4-input LUTs make more efficient use of die area than 6-input LUTs: a given user design can be implemented in less silicon area using a 4-LUT architecture than using a 6-LUT architecture. One contributing factor is that a 6-input LUT requires 4x more configuration memory bits (64 versus 16) but can accommodate only about 1.6x as much logic as a 4-input LUT. This traditional observation applies even more strongly to advanced fabrication technologies, because SRAM configuration memory has not scaled as fast as ordinary logic, due to the need to mitigate the risk of single-event upsets (SEUs). The PolarFire SONOS configuration cell is immune to SEUs.
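The configuration-bit argument above can be sketched in a few lines of Python. This is an illustrative model only: the function names are hypothetical, the LUT is modeled as a plain truth table, and the 1.6x logic figure is taken directly from the text rather than measured.

```python
def lut_config_bits(k: int) -> int:
    """A k-input LUT stores one output bit per input combination."""
    return 2 ** k


def lut_eval(truth_table: int, inputs: tuple) -> int:
    """Evaluate a LUT: the input bits form an index into the truth table."""
    index = 0
    for position, bit in enumerate(inputs):
        index |= (bit & 1) << position
    return (truth_table >> index) & 1


# Relative logic capacity per LUT, as quoted in the text (6-LUT is about 1.6x a 4-LUT).
RELATIVE_LOGIC = {4: 1.0, 6: 1.6}


def bits_per_unit_logic(k: int) -> float:
    """Configuration bits spent per unit of user logic accommodated."""
    return lut_config_bits(k) / RELATIVE_LOGIC[k]


# Example truth table: a 4-input AND has only entry 15 (all inputs high) set.
AND4 = 1 << 15

if __name__ == "__main__":
    for k in (4, 6):
        print(k, lut_config_bits(k), bits_per_unit_logic(k))
```

Under these assumptions, the 6-input LUT spends about 2.5 times as many configuration bits per unit of logic as the 4-input LUT.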
Consider a PolarFire FPGA cluster of twelve 4-input LUTs versus a cluster of eight 6-input LUTs. The total logic capability of the cluster (that is, the amount of user logic that the cluster can accommodate) is similar in each case. The larger fan-in of the 6-input LUT means the critical path may traverse fewer levels of logic within each cluster, potentially reducing the contribution of intra-cluster delay to the critical path. However, from the outside, the two clusters appear similar: they have a similar typical number of incoming and outgoing signals, and so the total length and delay contributed by the inter-cluster wiring is similar in both cases.
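As a rough check on the cluster comparison, the sketch below tallies logic capacity and LUT configuration bits for the two clusters. The `Cluster` class is hypothetical, the per-LUT logic figures reuse the 1.6x estimate quoted earlier, and routing configuration bits are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    lut_inputs: int       # fan-in of each LUT
    lut_count: int        # LUTs per cluster
    logic_per_lut: float  # relative logic capacity (4-LUT = 1.0), from the text

    @property
    def config_bits(self) -> int:
        """LUT truth-table bits only; routing configuration is not modeled."""
        return self.lut_count * 2 ** self.lut_inputs

    @property
    def logic_capacity(self) -> float:
        return self.lut_count * self.logic_per_lut


four_lut_cluster = Cluster(lut_inputs=4, lut_count=12, logic_per_lut=1.0)
six_lut_cluster = Cluster(lut_inputs=6, lut_count=8, logic_per_lut=1.6)

if __name__ == "__main__":
    for c in (four_lut_cluster, six_lut_cluster):
        print(c.lut_inputs, c.logic_capacity, c.config_bits)
```

In this model the two clusters hold a similar amount of logic (12 versus roughly 12.8 units), while the 6-LUT cluster needs 512 truth-table bits against 192 for the 4-LUT cluster.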
The following illustrations show clusters with different LUTs, but similar numbers of incoming and outgoing signals:
As process technology has progressed from 65nm to 28nm and beyond, wiring delay has come to dominate logic delay, due to poor scaling of metal wire and via resistance. To some extent, this can be mitigated by widening the wires, but that adds die area and cost. So, with each succeeding generation of process technology, inter-cluster delay becomes a more significant contributor to the critical path, and the speed advantage of 6-input LUTs diminishes. The PolarFire FPGA family provides rapid direct connections between nearby LUTs. These can reduce intra-cluster delay, especially in conjunction with advanced synthesis and placement algorithms. Certain logic functions (such as MUX trees) benefit greatly from the direct connections.
Clock Dynamic Power
Clocks can be a significant contributor to dynamic power in applications targeted by PolarFire FPGAs, often consuming nearly as much power as the rest of the routing and logic. For this reason, clocks in PolarFire FPGAs were designed to conserve power. We allocated more area to clock wires to space them further apart, significantly reducing their capacitance and dynamic power. Flip-flops were designed to minimize clock power. Clock gating is provided at two levels in the clock tree, as well as for each individual flip-flop, to avoid wasting power on unused branches. As a result, clock power in PolarFire FPGAs is less than half that of competing 28nm FPGAs, averaged over a suite of designs (as shown in the following illustration):
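The effect of lower wire capacitance and clock gating can be illustrated with the standard first-order dynamic power model, P = activity × C × V² × f. The sketch below is not a PolarFire power model: the function names, branch capacitances, and clock frequency are hypothetical values chosen only to show the proportionalities.

```python
def dynamic_power(cap_farads: float, vdd: float, freq_hz: float,
                  activity: float = 1.0) -> float:
    """First-order switching power; a clock net is taken to have activity 1."""
    return activity * cap_farads * vdd ** 2 * freq_hz


def gated_clock_power(branch_caps, enabled, vdd: float, freq_hz: float) -> float:
    """Sum power over clock branches, skipping branches that are gated off."""
    return sum(
        dynamic_power(cap, vdd, freq_hz)
        for cap, is_on in zip(branch_caps, enabled)
        if is_on
    )


# Hypothetical clock tree: four branches of 10 pF each at 1.0 V and 100 MHz.
BRANCH_CAPS = [10e-12] * 4
ungated = gated_clock_power(BRANCH_CAPS, [True] * 4, vdd=1.0, freq_hz=100e6)
half_gated = gated_clock_power(BRANCH_CAPS, [True, True, False, False],
                               vdd=1.0, freq_hz=100e6)

if __name__ == "__main__":
    print(ungated, half_gated)
```

In this model, power falls in direct proportion to capacitance, and gating off unused branches removes their contribution entirely, which is why both techniques are attractive for the clock network.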
Choice of Operating Voltage
The PolarFire FPGA family’s power-performance tradeoff has been carefully optimized for a 1.0 V core logic supply, somewhat less than the 1.05 V nominal voltage for the UMC 28nm process on which it is manufactured. Customers desiring extra speed still have the option to use the full 1.05 V supply.
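To give a feel for what the lower supply buys, the sketch below applies the first-order rule that dynamic power scales with the square of the supply voltage. It assumes equal switched capacitance and clock frequency at both voltages and ignores static power, so it is an estimate rather than a datasheet figure; the function name is hypothetical.

```python
def dynamic_power_ratio(v_new: float, v_ref: float) -> float:
    """Ratio of dynamic power at v_new to that at v_ref (same C and f assumed)."""
    return (v_new / v_ref) ** 2


# Supply voltages quoted in the text.
ratio = dynamic_power_ratio(1.0, 1.05)
saving = 1.0 - ratio

if __name__ == "__main__":
    print(f"dynamic power ratio: {ratio:.3f}, saving: {saving:.1%}")
```

Under this model, running at 1.0 V instead of 1.05 V trims dynamic power by roughly 9 percent.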
The PolarFire FPGA provides a math block supporting 18-bit multiply-accumulate operations (as seen in Figure 8). The following are improvements over the previous-generation IGLOO™2 FPGA family.
• Provision of a pre-adder with a full 19-bit result. This eliminates the need for fabric adders when implementing symmetric FIR filters, saving power.
• Provision of an input value cascade chain. This reduces the need for fabric registers when implementing systolic FIR filters, again saving power.
• Accumulator widened to 48 bits.
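The pre-adder benefit for symmetric FIR filters can be sketched as follows. Because a symmetric filter uses the same coefficient for mirrored taps, the two samples can be added first and multiplied once, halving the number of multiplies. The function names are hypothetical, the samples are assumed to be signed 18-bit values, and the filter is assumed to have an even number of taps.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a two's-complement number of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def fir_direct(coeffs, window) -> int:
    """Reference form: one multiply per tap."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_symmetric(half_coeffs, window) -> int:
    """Pre-adder form: one multiply per pair of mirrored taps."""
    n = len(window)
    if n != 2 * len(half_coeffs):
        raise ValueError("window must hold two samples per coefficient")
    acc = 0
    for i, coeff in enumerate(half_coeffs):
        pre_sum = window[i] + window[n - 1 - i]  # pre-adder output
        assert fits_signed(pre_sum, 19)          # 18-bit + 18-bit needs 19 bits
        acc += coeff * pre_sum                   # multiply-accumulate
    return acc


if __name__ == "__main__":
    coeffs = [3, -5, 7, 7, -5, 3]  # symmetric 6-tap filter
    samples = [1, 2, 3, 4, 5, 6]
    print(fir_direct(coeffs, samples), fir_symmetric(coeffs[:3], samples))
```

Both forms give the same result, but the symmetric form uses three multiplies instead of six. The sum of two signed 18-bit values always fits in 19 bits, which is why the full 19-bit pre-adder result matters.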
In addition to the 18-bit × 18-bit multiplication mode, the math block supports reduced-precision 9-bit operations. The PolarFire math block supports two independent 9 × 9 multiplies with no requirement for a common factor. Unlike competing math blocks, which can exchange two 18 × 18 multipliers for three 9 × 9 multipliers, the PolarFire FPGA can exchange one 18 × 18 multiplier for two 9 × 9 multipliers, a 33% improvement.
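The 33% figure follows from comparing 9 × 9 multipliers obtained per 18 × 18 multiplier given up. A minimal check, using only the exchange rates stated above (the variable names are illustrative):

```python
from fractions import Fraction

# 9x9 multipliers gained per 18x18 multiplier exchanged, from the text.
polarfire_rate = Fraction(2, 1)    # one 18x18 -> two 9x9
competitor_rate = Fraction(3, 2)   # two 18x18 -> three 9x9

improvement = polarfire_rate / competitor_rate - 1

if __name__ == "__main__":
    print(f"improvement: {float(improvement):.0%}")
```

The ratio is 4/3, that is, one third more 9 × 9 multipliers for the same number of 18 × 18 multipliers.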
The PolarFire math block also supports a unique 9 × 9 dot-product mode (as seen in Figure 9). This is ideal for use in image processing and convolutional neural networks (CNNs). Compared to independent 9 × 9 multipliers, the dot-product operation reduces power in the following ways.
• No need for a separate fabric adder to sum the two products.
• The pre-adder is fully supported, allowing efficient implementation of symmetric 9-bit FIR filters or 2-D convolution.
• All four factors are independent, so designers of CNNs need not rely on complex weight-sharing or input-sharing schemes to maximize resource and power efficiency.
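The dot-product mode described above can be modeled behaviorally as a single operation that returns a × b + c × d with four independent operands. The sketch below is an assumption-laden illustration, not a description of the hardware: operands are assumed to be signed 9-bit values, the function names are hypothetical, and the running sum across operations is assumed to be handled by an accumulator.

```python
def check_signed9(value: int) -> int:
    """Reject operands outside the assumed signed 9-bit range."""
    if not -256 <= value <= 255:
        raise ValueError(f"{value} does not fit in signed 9 bits")
    return value


def dot2(a: int, b: int, c: int, d: int) -> int:
    """One dot-product operation: two 9x9 products summed, no external adder."""
    for operand in (a, b, c, d):
        check_signed9(operand)
    return a * b + c * d


def dot_product(weights, pixels):
    """Accumulate a longer dot product two terms at a time.

    Returns the result and the number of dot2 operations used.
    """
    w, p = list(weights), list(pixels)
    if len(w) != len(p):
        raise ValueError("weights and pixels must have equal length")
    if len(w) % 2:          # pad an odd-length input with a zero term
        w.append(0)
        p.append(0)
    acc = 0
    ops = 0
    for i in range(0, len(w), 2):
        acc += dot2(w[i], p[i], w[i + 1], p[i + 1])
        ops += 1
    return acc, ops


if __name__ == "__main__":
    print(dot_product([1, 2, 3], [4, 5, 6]))
```

With this pairing, a 3 × 3 convolution kernel (nine products) would take five dot-product operations rather than nine separate multiplies plus fabric adders, and no weight or input needs to be shared between the two products in each operation.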
Figure 9 shows a simplified diagram of the math block in dot-product mode.
If you have any questions, or would like to share your thoughts, connect with me on LinkedIn.