Timing Optimization Tutorial

If you only studied digital design in college, there’s a good chance you don’t have much experience with timing analysis, optimization, and closure. Most digital-design classes, especially at the undergraduate level, focus primarily on creating correct circuits, with little attention to achieving specific clock frequencies. Students then graduate and enter industry, where they discover many new challenges, among them timing analysis, optimization, and closure.

To better prepare students for industry, I previously developed the following timing-optimization tutorial, which was funded by Intel via a MindShare grant:

https://github.com/ARC-Lab-UF/intel-training-modules/tree/master/timing

The tutorial is targeted towards FPGA timing optimization, but much of the material also applies to ASICs. All examples were created for use with the Intel DevCloud, which provides access to Quartus synthesis tools, making it possible to test all the examples without installing any tools on your local machine. The tutorial includes video lectures, slides, code examples, and exercises.

What is timing analysis, optimization, and closure?

Real-world hardware generally has a variety of timing constraints that must be met during synthesis, placement, and routing. In the simplest case, the designer must specify a clock constraint that tells the synthesis tool the desired clock frequency.
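
With Intel Quartus, for example, such constraints are usually written in an SDC file. The following is a minimal sketch, assuming a hypothetical top-level clock port named clk and a 100 MHz (10 ns period) target; the port name and period are placeholders for your own design.

    # Minimal SDC sketch: constrain the hypothetical clock port "clk"
    # to a 100 MHz (10 ns period) clock.
    create_clock -name clk -period 10.0 [get_ports clk]

    # Let the timing analyzer derive PLL output clocks and clock uncertainty.
    derive_pll_clocks
    derive_clock_uncertainty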

Each step of the synthesis process is geared towards optimizing the design to satisfy the specified timing constraints. When the tools successfully compile a design that achieves all of those constraints, the design is commonly said to have “closed timing” or achieved “timing closure.” It is also common to say that a design “meets timing.”

If synthesis fails to close timing, it is the designer’s responsibility to perform timing optimization until the design meets its constraints. The first step of this optimization is identifying the parts of the design that contribute to any violated constraints. To help with this task, synthesis flows incorporate static timing analysis (STA), which produces detailed reports on the propagation delay of each path in the design. The designer then uses this information to identify timing bottlenecks that can be improved.

How do you perform timing optimization?

Before learning how to optimize timing, it is critically important to understand how to interpret the reports from your tool’s timing analyzer. The tutorial includes a set of slides and a corresponding video explaining the background information necessary to understand these reports. For convenience, the video is also included here:

Once a bottleneck has been identified in the timing reports, the next step is to optimize the circuit to eliminate it, a process commonly referred to as timing optimization. Timing optimization is a complex topic whose scope is far greater than a single post or video. In many cases, you will have to perform creative, application-specific optimizations. Fortunately, there are also many common optimization strategies that apply generally. Those techniques are summarized here:
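
As a small illustration of one widely used strategy, pipelining, the VHDL sketch below splits a long combinational adder chain into two register stages, shortening the worst-case path at the cost of an extra cycle of latency. The entity and signal names are hypothetical, and carry/overflow handling is omitted for brevity; the tutorial’s examples cover these techniques in far more depth.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Hypothetical example: adds four 32-bit inputs. Registering the partial
    -- sums breaks one long adder chain (a + b + c + d) into two shorter
    -- stages, which typically reduces the critical-path delay.
    entity add4_pipelined is
      port (
        clk        : in  std_logic;
        a, b, c, d : in  unsigned(31 downto 0);
        result     : out unsigned(31 downto 0));
    end add4_pipelined;

    architecture rtl of add4_pipelined is
      signal sum_ab_r, sum_cd_r : unsigned(31 downto 0);
    begin
      process(clk)
      begin
        if rising_edge(clk) then
          -- Stage 1: two parallel additions, each a much shorter path
          -- than computing a + b + c + d in a single cycle.
          sum_ab_r <= a + b;
          sum_cd_r <= c + d;
          -- Stage 2: combine the registered partial sums.
          result   <= sum_ab_r + sum_cd_r;
        end if;
      end process;
    end rtl;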

How is timing optimization different for FPGAs and ASICs?

Fortunately, the timing analysis background is identical between FPGAs and ASICs. However, the optimization strategies can be significantly different.

The biggest difference between ASIC and FPGA optimization is that FPGAs use lookup tables (LUTs) to implement all combinational logic. By contrast, ASICs implement combinational logic (actually, most logic) in “cells” that are provided by a technology/cell library. To create an ASIC, the designer chooses a specific semiconductor fabrication plant (aka “foundry” or “fab”), which provides a specific process technology (e.g., TSMC 7nm).

Each foundry provides a cell library for the corresponding process technology. When you perform synthesis for an ASIC using one of these cell libraries, it is generally referred to as “standard cell” design. These cell libraries are usually very large, with numerous optimized transistor-level implementations for common types of logic (e.g., gates, muxes, registers, latches, etc.). ASICs also provide the flexibility to create your own custom, transistor-level cells (often referred to as custom design). If a series of standard cells creates a propagation delay that is too long, you can often replace it with a custom cell that is faster, smaller, etc. While you can optimize a design implemented on an FPGA, you can’t optimize or modify the LUTs themselves.

When comparing timing analysis reports between FPGAs and ASICs, you’ll see similar information about the propagation delays of paths. However, the logic components along those paths will look considerably different. For FPGAs, you’ll primarily see LUTs, DSPs, and RAMs. For ASICs, you could see numerous different types of cells.

Another distinction is that implementing sequential logic can be more costly in ASICs than in FPGAs. FPGA architectures incorporate flip-flops throughout the entire device, making register resources readily available: if a register is needed, the flip-flops are already embedded, and if unused, they incur no additional cost. Conversely, ASICs only include the flip-flops that the design actually requires, rather than offering them freely throughout the device. While not typically a significant concern, adding registers indiscriminately to an ASIC design can lead to unnecessary area growth.

Unlike FPGAs, ASICs frequently use latches as a resource-efficient alternative to flip-flops. On FPGAs, latches are difficult to use safely and are generally considered bad practice: whereas ASIC designers have enough low-level control to address the challenges associated with latches, FPGA designers are limited by the basic primitives available, which makes safe latch implementations difficult.
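
To illustrate why latches on an FPGA are often accidental rather than intentional, the hypothetical VHDL process below infers a latch because the output is not assigned on every path through a combinational process. Most FPGA tools will report an inferred-latch warning for code like this, and the usual fix is to cover every case (or to use a clocked process instead).

    library ieee;
    use ieee.std_logic_1164.all;

    entity latch_example is
      port (
        en, d : in  std_logic;
        q     : out std_logic);
    end latch_example;

    architecture rtl of latch_example is
    begin
      -- Combinational process with an incomplete assignment: when en = '0',
      -- q must hold its previous value, so synthesis infers a level-sensitive
      -- latch instead of purely combinational logic.
      process(en, d)
      begin
        if en = '1' then
          q <= d;
        end if;
        -- Adding "else q <= '0';" (or a default assignment before the if)
        -- would remove the inferred latch.
      end process;
    end rtl;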

ASIC timing optimization also offers the flexibility of cell sizing: if a cell is too slow, it can be swapped for a larger version with more drive strength, improving performance at the cost of area. While some FPGAs provide higher-performance resources in certain locations, there is no way to resize the existing resources.