III. THE AGE OF IMPLEMENTATION
With the advent of ICs, the focus of design automation algorithms shifted from boards to chips. Traditional CMOS scaling allowed the sizes of these designs to grow very rapidly. As design sizes grew, design tool implementation became extremely important to keep up with the increasingly larger designs and to keep design time under control. New implementations and data structures were pioneered, and the algorithms that scaled most efficiently became the standard.
As design
sizes started to pick up, new layers of abstraction were invented. The invention
of standard cells allowed one to separate the detailed physical implementation
of the cells from the footprint image that is seen by the placement and routing
algorithms. Large-scale application of routing, placement, and later synthesis
algorithms took off with the introduction of the concept of standard cells.
The invention
of the standard cell can be compared to the invention of the printing press.
While manual book writing was known before, it was a labor-intensive process.
Significant automation in the development of printing was enabled by keeping
the height of the letters fixed and letting the width of the base of each of
the letters vary according to the letter’s size. Similarly, in standard cell
application-specific IC (ASIC) design, one uses standard cells of common height
but varying widths depending on the complexity of the single standard cell.
These libraries (Chapter 12) created significant levels of standardization and
enabled large degrees of automation. The invention of the first gate arrays
took the standardization to an even higher level.
This
standardization allowed the creation of the ASIC business model, which created
a huge market opportunity for automation tools and spawned a number of
innovations. Logic synthesis [5] was invented to bridge the gap from language
descriptions to these standard cell implementations.
In the
implementation era, a design flow could be pasted together from a sequence of
discrete steps (see Figure 1.1). RTL logic synthesis translated a Verilog or VHDL description, performing technology-independent optimizations of the netlist and then mapping it into a netlist of technology gates, followed by further technology-dependent gate-level netlist optimization.
FIGURE 1.1 Sequential design flow.
This was
followed by placement of the gates and routing to connect them together.
Finally, a timing simulator was used to verify the timing using a limited amount of extracted wiring parasitic capacitance data.
Digital
circuit sizes kept on increasing rapidly in this era, doubling in size every 2
years per Moore’s law [6]. Logic synthesis allowed rapid creation of netlists with millions of gates from high-level RTL designs. ICs grew from 40,000 gates in 1984 to 40,000,000 gates in 2000 to billion-gate designs in 2014.
New data structures like quadtrees [7] and R-trees [8] allowed very efficient searches in the geometric space. Applications of binary decision diagrams (BDDs) [9] enabled efficient Boolean reasoning on significantly larger logic partitions.
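To make the geometric-search idea concrete, here is a minimal point quadtree with a rectangular range query, the kind of operation a layout tool performs constantly over cell and wire coordinates. This is an illustrative sketch, not code from any particular EDA tool; the capacity threshold and splitting rule are assumptions.

```python
# Minimal point quadtree: each node holds up to `capacity` points, then
# splits its bounding box into four quadrants. Range queries visit only
# quadrants that overlap the query window, which is what makes geometric
# search fast compared with scanning every object.

class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)
        self.points = []
        self.children = None          # four sub-quadrants once split
        self.capacity = capacity

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y))
                return True
            self._split()
        return any(c.insert(x, y) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my),
                         QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1)]
        for p in self.points:
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1):
        """Return all stored points inside the query rectangle."""
        x0, y0, x1, y1 = self.bounds
        if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
            return []                 # query window misses this quadrant
        hits = [(x, y) for (x, y) in self.points
                if qx0 <= x < qx1 and qy0 <= y < qy1]
        if self.children:
            for c in self.children:
                hits.extend(c.query(qx0, qy0, qx1, qy1))
        return hits
```

R-trees apply the same overlap-pruning idea to rectangles rather than points, which suits wire segments and cell outlines.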
Much progress
was made in implementing partitioning algorithms. A more efficient version of
Kernighan and Lin’s partitioning algorithm was given by Fiduccia and Mattheyses
[10]. They used a specific algorithm for selecting vertices to move across the
cut that saved runtime and allowed for the handling of unbalanced partitions
and nonuniform edge weights. An implementation using spectral methods [11]
proved to be very effective for certain problems. Yang [12] demonstrated results that outperformed the two aforementioned methods by applying a network flow algorithm iteratively.
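The move-based mechanics of Fiduccia and Mattheyses’ improvement can be sketched as a single pass over a weighted graph: move the highest-gain unlocked vertex across the cut (subject to a balance constraint), lock it, and afterward revert to the best prefix of moves seen. This simplified sketch recomputes all gains each step for clarity; FM proper works on hypergraphs and uses bucket lists to select and update gains in constant time, which is where its runtime advantage comes from. All function and parameter names are illustrative.

```python
# One FM-style pass. adj: {v: {u: weight}} (symmetric), side: {v: 0 or 1}.
# Returns an improved partition and its cut cost.

def fm_pass(adj, side, max_imbalance=1):
    side = dict(side)
    def gain(v):
        # external minus internal connection weight of v
        return sum(w if side[u] != side[v] else -w
                   for u, w in adj[v].items())
    def cut():
        return sum(w for v in adj for u, w in adj[v].items()
                   if side[v] != side[u]) // 2
    locked, history = set(), []
    best_cut = cur = cut()
    best_prefix = 0
    for _ in range(len(adj)):
        # best unlocked vertex whose move keeps the partition balanced
        candidates = []
        for v in adj:
            if v in locked:
                continue
            size_from = sum(1 for u in side if side[u] == side[v])
            if size_from - 1 < len(side) // 2 - max_imbalance:
                continue
            candidates.append((gain(v), v))
        if not candidates:
            break
        g, v = max(candidates)
        side[v] ^= 1                  # move v across the cut and lock it
        locked.add(v)
        cur -= g
        history.append(v)
        if cur < best_cut:
            best_cut, best_prefix = cur, len(history)
    for v in history[best_prefix:]:   # undo moves past the best point
        side[v] ^= 1
    return side, best_cut
```

Allowing individually negative-gain moves while keeping the best prefix is what lets the pass climb out of local minima, a property the original Kernighan–Lin algorithm shares.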
Minimizing quadratic wirelength became the holy grail of placement. Quadratic placers exploited this objective by deploying efficient quadratic optimization algorithms, intermixed with various types of partitioning schemes [13].
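The appeal of the quadratic objective is that setting its gradient to zero yields a system of linear equations (the weighted graph Laplacian over the movable cells), which can be solved very efficiently. A one-dimensional toy version, with a dense Gaussian-elimination solve standing in for the sparse solvers real placers use, might look like this; all names are illustrative.

```python
# 1-D quadratic placement sketch: minimize sum of w_ij * (x_i - x_j)^2
# over movable cell coordinates, with pad cells fixed. The optimality
# condition is the linear system A x = b, where A is the Laplacian
# restricted to movable cells and b collects fixed-pad contributions.

def quadratic_place(edges, fixed, movable):
    """edges: [(u, v, w)]; fixed: {cell: x}; movable: [cells]."""
    idx = {c: i for i, c in enumerate(movable)}
    n = len(movable)
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for u, v, w in edges:
        for a, other in ((u, v), (v, u)):
            if a in idx:
                i = idx[a]
                A[i][i] += w
                if other in idx:
                    A[i][idx[other]] -= w
                else:
                    b[i] += w * fixed[other]
    # Gaussian elimination with partial pivoting (toy-scale dense solve)
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            f = A[r][k] / A[k][k]
            for c in range(k, n):
                A[r][c] -= f * A[k][c]
            b[r] -= f * b[k]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (b[k] - sum(A[k][c] * x[c] for c in range(k + 1, n))) / A[k][k]
    return {c: x[idx[c]] for c in movable}
```

Because the unconstrained optimum clumps cells together, real quadratic placers interleave such solves with partitioning or spreading steps, as the text notes.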
Early techniques in logic synthesis, such as kernel and cube factoring, were applied to small partitions of the network at a time. More efficient algorithms like global flow [14] and redundancy removal [15] based on test generation could be applied to much larger designs. Complete coverage of all timing paths by timing simulation became impractical because the number of paths grows exponentially with design size, and static timing analysis [16], based on early work in [17], was invented.
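Static timing analysis sidesteps path enumeration by propagating worst-case arrival times through the netlist once, in topological order, so its cost is linear in circuit size rather than in the number of paths. A minimal sketch with fixed per-gate delays (real STA propagates load- and slew-dependent rise/fall delays, as discussed later in this chapter) could look like this; the data model is an assumption for illustration.

```python
# STA sketch: arrival time of a gate output = max over fanins of
# (fanin arrival + gate delay). Memoized recursion gives each node's
# worst-case arrival in one traversal; tracing worst fanins recovers
# the critical path.

def sta(gates, primary_inputs):
    """gates: {out_pin: (gate_delay, [fanin_pins])}.
    Returns (arrival-time dict, critical path as a pin list)."""
    arrival = {pi: 0.0 for pi in primary_inputs}
    worst_fanin = {}
    def at(pin):                      # memoized longest-path arrival time
        if pin not in arrival:
            delay, fanins = gates[pin]
            best = max(fanins, key=at)
            worst_fanin[pin] = best
            arrival[pin] = at(best) + delay
        return arrival[pin]
    for out in gates:
        at(out)
    end = max(gates, key=lambda g: arrival[g])
    path = [end]
    while path[-1] in worst_fanin:    # walk back along worst fanins
        path.append(worst_fanin[path[-1]])
    return arrival, path[::-1]
```

One traversal visits each gate once, which is why STA scales to designs where exhaustive path simulation cannot.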
With larger
designs came more routing layers, allowing over-the-cell routing and sharing of
intracell and intercell routing areas. Gridded routing abstractions matched the standard cell templates well and became the basis for much improvement in
routing speed. Hierarchical routing abstractions such as global routing, switch
box, and area routing were pioneered as effective ways to decouple the routing
problem.
Algorithms
that are applicable to large-scale designs without partitioning must have order
of complexity less than O(n²) and preferably not more than O(n log n). These
complexities were met using the aforementioned advances in data structures and
algorithms, allowing design tools to handle large real problems. However, it
became increasingly difficult to find appropriate cost functions for such algorithms.
Accurate prediction of the physical effects earlier in the design flow became
more difficult.
Let us discuss
how the prediction of important design metrics evolved over time during the
implementation era. In the beginning of the implementation era, most of the
important design metrics such as area and delay were quite easy to predict.
(Performance, power, and area are the corresponding key metrics for today’s
designs.) The optimization algorithms in each of the discrete design steps were
guided by objective functions that relied on these predictions. As long as the
final values of these metrics could be predicted with good accuracy, the RTL-to-GDSII flow could indeed be executed in a sequence of fairly independent steps. However, the prediction of important design metrics was becoming
increasingly difficult. As we will see in the following sections, this led to
fundamental changes in the design closure flow. The simple sequencing of design
steps was not sufficient anymore.
Let us look at
one class of the prediction functions, estimates of circuit delay, in more
detail. In the early technologies, the delay along a path was dominated by the
delay of the gates. In addition, the delay of most gates was very similar. As
a result, as long as one knew how many gates there were on the critical path,
one could reasonably predict the delay of the path by counting the number of
levels of gates on a path and multiplying that with a typical gate delay. The
delay of a circuit was therefore known as soon as logic synthesis had
determined the number of logic levels on each path. In fact, in early timing
optimization, multiple gate sizes were used to keep delays reasonably constant
across different loadings rather than to actually improve the delay of a gate.
Right after the mapping to technology-dependent standard cells, the area could be reasonably well predicted by adding up the cell areas. Neither subsequent placement nor routing steps would change these two quantities significantly. Power and noise were of little concern at the time.
In newer libraries, the delays of gates
with different logic complexities started to vary significantly. Table 1.1
shows the relative delays of different types of gates. The logical effort
characteristic indicates how the delay of the gate increases with load, and the
intrinsic delay is the load-independent contribution of the gate delay [18].
The fourth column of the table shows that the fanout-of-4 (FO4) delay of a more
complex NOR4 logic gate can be three times the delay of a simple inverter. (The
FO4 delay is the delay of an inverter driving a load capacitance that has four
times the inverter’s input pin capacitance. An FO4 delay is a very useful
metric to measure gate delay as it is mostly independent of process technology
and operating conditions. Static gates vary only 20% in FO4 delay over a
variety of such conditions [19].)
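The load-dependent and intrinsic contributions described here are exactly the two terms of the logical-effort delay model, d = g·h + p: g (logical effort) scales with load, h (electrical effort) is the ratio of load capacitance to input capacitance, and p is the load-independent parasitic delay, all in units of a basic inverter delay. The g and p values below are the standard textbook estimates [18], not the entries of Table 1.1, but they reproduce the roughly 3× NOR4-versus-inverter ratio the text cites.

```python
# Logical-effort delay sketch: d = g*h + p (in units of the basic
# inverter delay). An FO4 delay corresponds to electrical effort h = 4.

def gate_delay(g, h, p):
    return g * h + p

GATES = {                  # name: (logical effort g, parasitic delay p)
    'INV':   (1.0, 1.0),
    'NAND2': (4 / 3, 2.0),
    'NOR2':  (5 / 3, 2.0),
    'NOR4':  (3.0, 4.0),   # g = (2n+1)/3 for an n-input static CMOS NOR
}

# Fanout-of-4 delay of each gate: it drives 4x its own input capacitance.
fo4 = {name: gate_delay(g, 4, p) for name, (g, p) in GATES.items()}
```

With these values, `fo4['NOR4'] / fo4['INV']` comes out to 3.2, consistent with the statement that a complex NOR4 can be about three times slower than an inverter at the same relative load.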
Simple addition of logic levels is
therefore insufficient to estimate path delay. Predicting the delay of a design
with reasonable accuracy requires knowing what gates the logic is actually
mapped to. It became necessary to include a static timing analysis engine
(Chapter 6) in the synthesis system to calculate these delays. The combination
of timing and synthesis was the first step on the way to the era of
integration. This trend started gradually in the 1980s; but by the beginning of
the 1990s, integrated static timing analysis tools were essential to predict
delays accurately. Once a netlist was mapped to a particular technology and the
gate loads could be approximated, a pretty accurate prediction of the delay
could be made by the timing analyzer.
At that time, approximating the gate
load was relatively easy. The load was dominated by the input capacitances of
the gates that were driven. The fact that the capacitance of the wire was
estimated by an inaccurate wire load model was hardly an issue. Therefore, as
long as the netlist was not modified in the subsequent steps, the delay
prediction was quite reliable.
Toward the mid-1990s, these predictions
based on gate delays started losing accuracy. Gate delays became increasingly
dependent on the load capacitance driven as well as on the rise and fall rates
of the input signals (input slew) to the gates. At the same time, the fraction
of loads due to wire capacitance started to increase. Knowledge of the physical
design became essential to reasonably predict the delay of a path. Initially,
it was mostly just the placement of the cells that was needed. The placement
affected the delay, but the wire route had much less impact, since any route
close to the minimum length would have similar load.
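Because any near-minimum-length route gives a similar load, a placement-only estimate of wire capacitance can use the net’s bounding-box half-perimeter (a lower bound on route length) times a per-unit capacitance. This is a common estimation idiom rather than a specific tool’s formula, and the per-unit constant below is an assumed illustrative value.

```python
# Placement-based load estimate: half-perimeter wirelength (HPWL) of the
# net's pin bounding box, scaled by capacitance per unit length, plus the
# driven pin capacitances.

C_PER_UNIT = 0.2   # wire capacitance per unit length (illustrative)

def hpwl(pins):
    """Half-perimeter wirelength of a net given pin (x, y) locations."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def net_cap_estimate(pins, pin_caps):
    """Total load seen by the driver: estimated wire cap plus pin caps."""
    return C_PER_UNIT * hpwl(pins) + sum(pin_caps)
```

For two- and three-pin nets HPWL equals the minimum Steiner length exactly, which is why this crude estimate worked well while routes stayed near-minimal.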
In newer technologies, more and more of
the delay started to shift toward the interconnect. Both gate and wire (RC)
delay really began to matter. Figure 1.2 shows how the gate delay and
interconnect delay compare over a series of technology generations. With a
Steiner tree approximation of the global routing, the lengths of the nets could be reasonably predicted. Using these lengths, delays could be calculated
for the longer nets. The loads from the short nets were not very significant,
and a rough estimate of these was still adequate. It rapidly became clear that buffering the long nets well was critically important. In Figure 1.2, we
see that around the 130 nm node, the difference between a net with repeaters
inserted at the right places and an unbuffered net starts to have a significant
impact on the delay. In automated place and route, buffering of long wires
became an integral part of physical design. Today’s standard approach for
repeater (buffer or inverter) insertion on long wires is van Ginneken’s dynamic programming buffering algorithm [20] and derivatives thereof. Van Ginneken’s buffering algorithm has
FIGURE 1.2 Gate and interconnect delay.
runtime complexity of O(n²), where n is
the number of possible buffer positions, but this has been improved to
near-linear [21].
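The core of van Ginneken’s dynamic program can be illustrated on the simplest case, a single driver-to-sink wire with a chain of candidate buffer sites. Walking from the sink toward the driver, each site keeps a set of (load capacitance, required arrival time) options, buffered or not, and discards dominated ones. The Elmore delay model is used, and all electrical constants here are illustrative assumptions, not library values.

```python
# Van Ginneken-style buffering sketch for one sink and a chain of
# candidate buffer positions, propagated sink -> driver.

R_WIRE, C_WIRE = 0.5, 0.2             # per-unit wire R and C (assumed)
R_BUF, C_BUF, D_BUF = 1.0, 0.5, 0.8   # buffer drive R, input C, delay (assumed)

def add_wire(options, length):
    """Push (cap, required time) options across a wire segment (Elmore)."""
    r, c = R_WIRE * length, C_WIRE * length
    return [(cap + c, req - r * (c / 2 + cap)) for cap, req in options]

def add_buffer(options):
    """At a candidate site, keep both the buffered and unbuffered choices."""
    out = list(options)
    for cap, req in options:
        out.append((C_BUF, req - D_BUF - R_BUF * cap))
    return prune(out)

def prune(options):
    """Drop dominated options: keep req strictly increasing with cap."""
    options.sort(key=lambda o: (o[0], -o[1]))
    kept, best_req = [], float('-inf')
    for cap, req in options:
        if req > best_req:
            kept.append((cap, req))
            best_req = req
    return kept

def best_required_time(sink_cap, sink_req, segments):
    """segments: wire lengths ordered from sink to driver."""
    opts = [(sink_cap, sink_req)]
    for i, seg in enumerate(segments):
        opts = add_wire(opts, seg)
        if i < len(segments) - 1:     # candidate buffer site between segments
            opts = add_buffer(opts)
    # the driver charges the remaining cap; pick the best surviving option
    return max(req - R_BUF * cap for cap, req in opts)
```

The pruning step is what keeps the option sets, and hence the runtime, polynomial; the near-linear variants [21] refine exactly this bookkeeping.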
Recently, a wire’s physical environment
has become more significant. Elmore RC wire delay models are no longer
accurate, and distributed RC network models must be used, accounting for
resistive shielding [22] and cross coupling. The cross-coupling capacitance
between wires has increased as the ratio between wire spacing and wire height
decreases. Furthermore, optical proximity effects when exposing the wafer vary with the distance between neighboring wires, and the resulting variation when etching the wires also affects their resistance. Therefore, the actual
routes for the wires from detailed routing are now very significant in predicting
the delays. Moreover, global routing estimates of routing congestion can differ
significantly from the actual routes taken, and global routing may mispredict
where routing violations occur due to severe detailed routing congestion [23].
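For reference, the Elmore model that the text describes as losing accuracy is itself simple to state: the delay to a node in an RC tree is the sum, over every resistor on the path from the root, of that resistance times all the capacitance downstream of it. A minimal sketch, with an illustrative tree encoding:

```python
# Elmore delay on an RC tree. tree: {child: (parent, R of edge into child)};
# caps: {node: C}. Resistive shielding and coupling, which this model
# ignores, are what drive the move to distributed RC analysis.

def elmore_delay(tree, caps, node):
    """Elmore delay from the tree root to `node`."""
    def downstream_cap(n):
        # capacitance at n plus everything hanging below it
        return caps[n] + sum(downstream_cap(c)
                             for c, (p, _) in tree.items() if p == n)
    delay, n = 0.0, node
    while n in tree:                  # walk the path from node up to the root
        parent, r = tree[n]
        delay += r * downstream_cap(n)
        n = parent
    return delay
```

Elmore treats all downstream capacitance as if it loaded each resistor fully; resistive shielding [22] means far capacitance is partially hidden behind intervening resistance, so Elmore systematically overestimates delay on resistive nets.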
Netlist optimizations, traditionally
done in logic synthesis, became an important part of place and route. Placement
and routing systems that were designed to deal with static (not changing)
netlists had to be reinvented.
Until recently, postroute optimizations
were limited to minor changes that would cause minimal perturbation to avoid
requiring a full reroute. These included footprint-compatible cell swaps, particularly for critical path delay fixing by swapping cells to low threshold voltage, and leakage power minimization by swapping to high threshold voltage on paths with timing slack; manual reroutes to overcome severe congestion; manual engineering change orders (ECOs); and the use of metal-programmable spare cells for fixes very late in the design cycle, avoiding the need to redo
base layers. However, the gap between global routing estimates and detailed-routing extracted capacitance and resistance has forced additional conservatism before detailed routing, for example, avoidance of small-drive-strength cells that would be badly affected by additional wire capacitance, leaving significant optimization opportunities in postroute.