1.4 THE AGE OF INTEGRATION
This decrease in predictability continued and firmly planted us in the age of integration. The following are some of the characteristics of this era:
· The impact of later design steps is increasing. Prediction is difficult.
· Larger designs allow less manual intervention.
· New cost functions are becoming more important. Design decisions interact.
· Aggressive designs allow less guardbanding.
The iterations between sequential design steps such as repeater insertion, gate sizing, and placement not only became cumbersome and slow but often did not even converge. EDA researchers and developers have explored several possible solutions to this convergence problem:
· Insert controls for later design steps into the design source.
· Fix problems at the end.
· Improve prediction.
· Concurrently design in different domains.
The insertion of controls proved to be very
difficult. As illustrated in Figure 1.3, the path through which source
modifications influence the final design result can be very indirect, and it
can be very hard to understand the effect of particular controls with respect
to a specific design and tools methodology. Controls inserted early in the
design flow might have a very different effect than desired or anticipated on
the final result. A late fix in the design flow requires an enormous increase
in manual design effort. Improving predictions has proven to be very difficult.
Gain-based delay models traded off area predictability for significant delay
predictability and gave some temporary relief, but in general it has been
extremely hard to improve predictions. The main lever seems to be concurrent
design by integrating the synthesis, placement, and routing domains and
coupling them with the appropriate design analyzers.
After timing/synthesis integration, placement-driven
(physical) synthesis was the next major step on the integration path. Placement
algorithms were added to the integrated static timing analysis and logic
synthesis environments. Well-integrated physical synthesis systems became
essential to tackle the design closure problems of the increasingly larger
chips.
This integration trend is continuing. Gate-to-gate
delay depends on the wire length (unknown during synthesis), the layer of the
wire (determined during routing), the configuration of the neighboring wires
(e.g., distance—near/far—which is unknown before detailed routing), and the
signal arrival times and slope of signals on the neighboring wires. Therefore,
in the latest technologies, we see that most of the routing needs to be
completed to have a good handle on the timing of a design. Local congestion
issues might force a significant number of routing detours. This needs to be
accounted for and requires a significantly larger number of nets to be routed
earlier for design flow convergence. Coupling between nets affects both noise
and delay in larger portions of the design. Therefore, knowledge of the
neighbors of a particular net is essential to carry out the analysis to the
required level of detail and requires a significant portion of the local
routing to be completed. In addition, power has become a very important design
metric, and noise issues are starting to significantly affect delays and even
make designs function incorrectly. In addition to integrated static timing
analysis, power and noise analyses need to be included as well.
For example, white space postroute optimization to fix timing violations after routing is now in common use. Added cells and resized cells are placed at a legal location in available white space to avoid perturbing the placement of the other cells and to minimize any changes to the detailed routing between the other cells, while still allowing significant rerouting of nets connected to the modified cells. Larger cells, particularly those spanning multiple standard cell row heights, may still not be permitted to move in postroute optimization, to limit the perturbation.
FIGURE 1.3 Controlled design flow.
Consider the analysis of upsizing a cell to reduce short circuit power, which is determined by input slews and load capacitances, in postroute white space dynamic power optimization. This requires incremental evaluation of a new cell location with corresponding changes to detailed routing. The setup and hold timing impact is checked across multiple corners and modes with signal integrity cross talk analysis. Input slews to the cell are determined accounting for the impact on fanin load capacitance and the cell’s output slew that affects its fanouts. The changed slews and extracted wire loads impact dynamic power consumption, and the tool must also ensure no degradation in total power, which includes both dynamic and leakage power in active mode. Such automated postroute optimizations can reduce cell area by up to 5%, helping yield and reducing fabrication cost; postroute dynamic power reductions of up to 12% have been achieved with Mentor Graphics’ Olympus-SoC™ place-and-route tool [24], reducing total power consumption and extending battery life. Significant setup and hold timing violations can also be fixed in an automated fashion. These are remarkable improvements given how late in the design flow they are achieved, providing significant value to design teams.
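The acceptance test at the heart of such an optimization can be stated compactly. The following self-contained C++ sketch uses made-up metric values and is not any tool’s actual API; it only illustrates the keep-or-roll-back decision described above, where a resize is kept only if worst setup and hold slacks do not degrade and total power improves.

```cpp
// Minimal sketch of the accept/reject check for a postroute upsizing move.
// Metric values are faked; in a real flow they would come from incremental
// timing and power engines after a legalized placement change and rerouting.
#include <iostream>

struct Metrics {
    double worst_setup_slack;  // ns, worst across all corners and modes
    double worst_hold_slack;   // ns
    double total_power;        // mW, dynamic + leakage in active mode
};

// Accept the upsize only if no timing metric degrades and total power improves.
bool accept_upsize(const Metrics& before, const Metrics& after) {
    return after.worst_setup_slack >= before.worst_setup_slack &&
           after.worst_hold_slack  >= before.worst_hold_slack  &&
           after.total_power       <  before.total_power;
}

int main() {
    Metrics before{-0.050, 0.020, 12.40};   // failing setup path, 12.40 mW
    Metrics after {-0.050, 0.020, 12.31};   // short-circuit power reduced
    std::cout << (accept_upsize(before, after) ? "keep resize" : "roll back")
              << "\n";
}
```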
These analysis and optimization algorithms need to work in an incremental fashion, because runtime constraints prevent us from recalculating all the analysis results when frequent design changes are made by the other algorithms. In the age of integration, not only are the individual algorithms important, but the way they are integrated to reach closure on the objectives of the design has become the differentiating factor. A fine-grained interaction between these algorithms, guided by fast incremental timing, power, and area calculators, has become essential.
In the era of integration, most progress has been made in the way the algorithms cooperate with each other. Most principles of the original synthesis, placement, and routing algorithms and their efficient implementations are still applicable, but their use in EDA tools has changed significantly. In other cases, the required incrementality has led to interesting new algorithms—for example, trialing different local optimizations with placement and global routing to pick the best viable solution that speeds up a timing critical path [25]. While focusing on integration, we must retain our ability to work on advanced problems in individual tool domains in order to address new problems posed by technology and to continue to advance the capabilities of design algorithms.
To achieve this, we have seen a shift to the development of EDA tools guided by four interrelated principles:
1. Tools must be integrated.
2. Tools must be modular.
3. Tools must operate incrementally.
4. Tools must sparsely access only the data that they need, in order to reduce memory usage and runtime.
Let us look at each of these in more detail in the following sections.
1.4.1 TOOL INTEGRATION
Tool integration allows tools to communicate directly with each other. A superficial form of integration can be provided by initiating, from a common user interface, the execution of traditional point tools that read input from files and save output to files. A tighter form of integration allows tools to communicate while concurrently executing rather than only through files. This tight integration is generally accomplished by building the tools on a common database (see Chapter 15) or through a standardized message passing protocol.
Tool integration enables the reuse of functions in different domains because the overhead of repeated file reading and writing is eliminated. This helps to reduce development resource requirements and improve consistency between applications. Although tool
integration enables incremental tool operation, it does not require it. For example, one could integrate placement and routing programs and still have the placement run to completion before starting routing. Careful design of an application can make it easier to integrate with other applications on a common database model, even if it was originally written as a stand-alone application.
Achieving tool integration requires an agreed-upon set of semantic elements in the design representation in terms of which the tools can communicate. These elements generally consist of cells, pins, and nets and their connectivity, cell placement locations, and wire routes. Cells include standard cells, macros, and hierarchical design blocks. Individual applications will augment this common data model with domain-specific information. For example, a static timing analysis tool will typically include delay and test edge data structures. In order for an integrated tool to accept queries in terms of the common data model elements, it must be able to efficiently find the components of the domain-specific data associated with these elements. Although this can be accomplished by name lookup, when integrated tools operate within a common memory space, it is more efficient to use direct pointers. This, in turn, requires that the common data model provide methods for applications to attach private pointers to the common model elements and callbacks to keep them consistent when there are updates to the elements (see Section 1.4.3.3).
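As an illustration of attaching application-private data to common-model elements and keeping it consistent through callbacks, the following C++ sketch uses hypothetical names (DataModel, Net, TimingData); it is not the API of any particular database, only a minimal rendering of the pattern just described.

```cpp
// Sketch: a common-model Net carries a private pointer slot for one
// application, and a removal callback keeps that private data consistent.
#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct Net {                     // common-model element shared by all tools
    std::string name;
    void* app_data = nullptr;    // private pointer slot for an application
};

class DataModel {
public:
    Net* create_net(const std::string& name) { return new Net{name}; }
    void remove_net(Net* n) {
        for (auto& cb : on_net_removed_) cb(n);   // notify before deletion
        delete n;
    }
    void on_net_removed(std::function<void(Net*)> cb) {
        on_net_removed_.push_back(std::move(cb));
    }
private:
    std::vector<std::function<void(Net*)>> on_net_removed_;
};

struct TimingData { double arrival = 0.0; };      // application-specific data

int main() {
    DataModel model;
    model.on_net_removed([](Net* n) {             // keep private data in sync
        delete static_cast<TimingData*>(n->app_data);
        std::cout << "timer dropped data for " << n->name << "\n";
    });
    Net* n1 = model.create_net("n1");
    n1->app_data = new TimingData{1.25};          // direct pointer, no lookup
    model.remove_net(n1);
}
```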
1.4.2 MODULARITY
Modular tools are developed in small, independent pieces. This gives several benefits. It simplifies incremental development because new algorithms can more easily be substituted for old ones if the original implementation was modular. It facilitates reuse. Smaller modular utilities are easier to reuse, since they have fewer side effects. It simplifies code development and maintenance, making problems easier to isolate. Modules are easier to test independently. Tool modularity should be made visible to and usable by application engineers and sophisticated users, allowing them to integrate modular utilities through an extension language.
In the past, some projects have failed largely due to the lack of modularity [26]. To collect all behavior associated with the common data model in one place, they also concentrated control of what should be application-specific data. This made the data model too large and complicated and inhibited the tuning and reorganization of data structures by individual applications.
1.4.3 INCREMENTAL TOOLS
Tools that operate incrementally update design information, or the design itself, without revisiting or reprocessing the entire design. This enables fine-grained interaction between integrated tools. For example, incremental processing capability in static timing analysis helps logic synthesis to change a design and evaluate the effect of that change on timing, without requiring a complete timing analysis to be performed.
Incremental processing can reduce the time required to cycle between analysis and optimization tools. As a result, it can improve the frequency of tool interaction, allowing a better understanding of the consequences of each optimization decision.
The ordering of actions between a set of incremental applications is important. If a tool like synthesis invokes another incremental tool like timing analysis, it needs immediate access to the effects of its actions as reported by the invoked tool. Therefore, the incremental invocation must behave as if every incremental update occurs immediately after the event that precipitates it.
An example of incremental operation is fast what-if local timing analysis for a cell resize that can be quickly rolled back to the original design state and timing state if the change is detrimental. The local region of timing analysis may be limited to the cell, the cell’s fanouts because of output slew impacting their delay, the cell’s fanins as their load is affected, and the fanins’ fanouts, as fanin slews changed with the change in load. Timing changes are not allowed to propagate beyond that region until the resize is committed or until global costing analysis is performed. Fully accurate analysis would require propagating arrival times to the timing endpoints and then
propagating back required times, which can be very runtime expensive. Even in a full incremental timing analysis update, only those affected paths are traversed and updated, and timing is not propagated further where the change is determined to have no impact, for example, if it remains a subcritical side path.
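A minimal sketch of how such a what-if region could be collected from netlist connectivity follows; the toy adjacency maps and function names are illustrative and not part of any real timer.

```cpp
// Sketch: collect the limited re-analysis region for a what-if resize of one
// cell: the cell itself, its fanouts, its fanins, and the fanins' fanouts.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

using Netlist = std::map<std::string, std::vector<std::string>>;  // cell -> fanouts

std::set<std::string> whatif_region(const Netlist& fanout,
                                    const Netlist& fanin,
                                    const std::string& cell) {
    std::set<std::string> region{cell};
    for (const auto& fo : fanout.at(cell)) region.insert(fo);      // slew changes
    for (const auto& fi : fanin.at(cell)) {
        region.insert(fi);                                         // load changes
        for (const auto& sib : fanout.at(fi)) region.insert(sib);  // fanin slew changes
    }
    return region;
}

int main() {
    Netlist fanout{{"A", {"X"}}, {"B", {"X", "C"}}, {"X", {"D"}},
                   {"C", {}}, {"D", {}}};
    Netlist fanin {{"X", {"A", "B"}}, {"A", {}}, {"B", {}},
                   {"C", {"B"}}, {"D", {"X"}}};
    for (const auto& c : whatif_region(fanout, fanin, "X"))
        std::cout << c << " ";          // prints: A B C D X
    std::cout << "\n";
}
```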
There are four basic characteristics desirable in an incremental tool:
1. Autonomy to avoid explicit invocation and control of other engines, for example, by registering callbacks so that an engine will be notified of events that impact it.
2. Lazy evaluation to defer and minimize additional computation needed.
3. Change notification to inform other engines of changes so that they may make associated updates where necessary.
4. Reversibility to quickly undo changes, save, and restore.
We shall explain these further in the following sections.
1.4.3.1 AUTONOMY
Autonomy means that applications initiating events that precipitate changes in another incremental tool engine do not need to explicitly notify the incremental tool of those events and that applications using results from an incremental tool do not need to explicitly initiate or control the incremental processing in that tool. Avoiding explicit change notification is important because it simplifies making changes to a design and eliminates the need to update the application when new incremental tools are added to the design tool environment. After registering callbacks, incremental applications can be notified of relevant events.
Avoiding explicit control of other incremental tools is important so that the caller does not need to understand the details of the incremental algorithm, facilitating future algorithmic improvements and making mistakes less likely. It also simplifies the use of the incremental tool by other applications.
1.4.3.2 LAZY EVALUATION (FULL AND PARTIAL)
Lazy evaluation means that an incremental tool should try, to the greatest extent possible, to defer processing until the results are needed. This can save considerable processing time when some results are never used. For example, consider a logic synthesis application making a series of individual disconnections and reconnections of pins to accomplish some logic transformation. If an incremental timing analyzer updates the timing results after each of these individual actions, it will end up recomputing time values many times, while only the last values computed are actually used.
Lazy evaluation simplifies the flow of control when recalculating values. If we have interdependent analysis functions that all try to update results when notified of a netlist change, then the callbacks would have to be ordered to ensure that updates occur in the correct order. For example, if timing updates are made before extraction updates, the timing results will be incorrect as stale extraction results are being used. With lazy evaluation, each application performs only invalidation when notified of a design change; updates are then ordered correctly through demand-driven recomputation. After a netlist change, when timing is requested, the timing engine requests extraction results before performing the delay computation. The extraction engine finds that the existing extraction results have been invalidated by the callbacks from the netlist change, so it updates those results, and only then is the timing analysis updated by the timing engine. An example of this is shown in Figure 1.4.
Lazy evaluation may be full, if all pending updates are performed as soon as any information is requested, or partial, if only those values needed to give the requested result are updated. If partial lazy evaluation is used, the application must still retain enough information to be able to determine which information has not yet been updated, since this information may be requested later. Partial lazy evaluation is employed in some timing analyzers [27] by levelizing the design and limiting the propagation of arrival and required times based on this levelization, providing significant benefits in the runtime of the tool.
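The control flow just described can be sketched as follows; the engine classes and the stubbed "analysis" values are hypothetical, but they show how a netlist edit only invalidates results and how the correct update order (extraction before timing) falls out of demand-driven recomputation.

```cpp
// Sketch of invalidate-then-recompute-on-demand between two engines.
#include <iostream>

class Extraction {
public:
    void invalidate() { valid_ = false; }
    double wire_rc() {                              // demand-driven update
        if (!valid_) { rc_ = 0.42; valid_ = true;   // re-extract (stubbed value)
                       std::cout << "re-extracting\n"; }
        return rc_;
    }
private:
    bool valid_ = false;
    double rc_ = 0.0;
};

class Timing {
public:
    explicit Timing(Extraction& ext) : ext_(ext) {}
    void invalidate() { valid_ = false; }
    double arrival() {
        if (!valid_) { arr_ = 1.0 + ext_.wire_rc(); // pull fresh extraction first
                       valid_ = true;
                       std::cout << "re-timing\n"; }
        return arr_;
    }
private:
    Extraction& ext_;
    bool valid_ = false;
    double arr_ = 0.0;
};

int main() {
    Extraction ext;
    Timing timing(ext);
    // A netlist change only invalidates both engines (cheap)...
    ext.invalidate();
    timing.invalidate();
    // ...and the correct update order falls out when timing is queried.
    std::cout << "arrival = " << timing.arrival() << "\n";
}
```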
FIGURE 1.4 This example shows change notification callbacks from the routing engine to the extraction and timing analysis engines. There are lazy updates to the wire RC extraction and the timing analysis upon request from the timing engine and optimization engine, respectively.
1.4.3.3 CHANGE NOTIFICATION (CALLBACKS AND UNDIRECTED QUERIES)
With change notification, the incremental tool
notifies other applications of changes that concern them. This is more than
just providing a means to query specific pieces of information from the
incremental tool, since the application needing the information may not be
aware of all changes that have occurred. In the simplest situations, a tool
initiating a design change can assume that it knows where consequent changes to
analysis results (e.g., timing) will occur. In this case, no change
notification is required. But in other situations, a tool may need to respond
not only to direct changes in the semantic elements of the common data model
but also to secondary changes within specific application domains (e.g.,
changes in timing).
Change notification is important because the
applications may not know where to find all the incremental changes that affect
them. For example, consider an incremental placement tool used by logic
synthesis. Logic synthesis might be able to determine the blocks that will be
directly replaced as a consequence of some logic change. But if the replacement
of these blocks has a ripple effect, which causes the replacement of other
blocks (e.g., to open spaces for the first set of replaced blocks), it would be
much more difficult for logic synthesis to determine which blocks are in this
second set. Without change notification, the logic synthesis system would need
to examine all blocks in the design before and after every transformation to
ensure that it has accounted for all consequences of that transformation.
Change notification may occur immediately after the
precipitating event or may be deferred until requested. Immediate change
notification can be provided through callback routines that applications can
register with and that are called whenever certain design changes occur.
Deferred change notification requires the
incremental tool to accumulate change information until a requesting
application is ready for it. This requesting application will issue an
undirected query to ask for all changes of a particular type that have
occurred since some checkpoint (the same sort of checkpoint required for
undoing design changes). It is particularly important for analysis results,
since an evaluation routine may be interested only in the cumulative effect of
a series of changes and may neither need nor want to pay the price (in nonlazy
evaluation) of immediate change notification.
A typical use of undirected queries is to get
information about changes that have occurred in the design, in order to decide
whether or not to reverse the actions that caused the changes.
For this purpose, information is needed not only
about the resultant state of the design but also about the original state so
that the delta may be determined. (Did things get better or worse?) Thus, the
application making an undirected query needs to specify the starting point from
which changes will be measured. This should use the same checkpoint capability
used to provide reversibility.
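Both notification styles can be sketched with a simple change log; the class and method names below are illustrative only, not part of any particular tool.

```cpp
// Sketch: immediate change notification via callbacks, plus a deferred
// undirected query ("what changed since checkpoint N?").
#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

class ChangeLog {
public:
    void on_change(std::function<void(const std::string&)> cb) {
        callbacks_.push_back(std::move(cb));           // immediate notification
    }
    void record(const std::string& what) {
        changes_.push_back(what);
        for (auto& cb : callbacks_) cb(what);
    }
    std::size_t checkpoint() const { return changes_.size(); }
    // Undirected query: everything that happened since a given checkpoint.
    std::vector<std::string> changes_since(std::size_t cp) const {
        return {changes_.begin() + cp, changes_.end()};
    }
private:
    std::vector<std::string> changes_;
    std::vector<std::function<void(const std::string&)>> callbacks_;
};

int main() {
    ChangeLog log;
    log.on_change([](const std::string& c) { std::cout << "callback: " << c << "\n"; });
    std::size_t cp = log.checkpoint();
    log.record("moved cell U7");
    log.record("rerouted net n42");
    for (const auto& c : log.changes_since(cp))        // deferred consumer
        std::cout << "since checkpoint: " << c << "\n";
}
```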
1.4.3.4 REVERSIBILITY (SAVE/RESTORE AND RECOMPUTE)
Trial evaluation of candidate changes is important
because of the many subtle effects any particular design change may cause and
because of the order of complexity of most design problems. An application
makes a trial design change, examines the effects of that change as determined
in part by the incremental tools with which it is integrated, and then decides
whether to accept or reject the change. If such a change is rejected, we need
to make sure that we can accurately recover the previous design state, which
means that all incremental tool results must be reversible.
Each incremental tool should handle the reversal of
changes to the data for which it is responsible. In some cases such as timing
analysis, the changed data (e.g., arrival and required times) can be
deterministically derived from other design data, and it may be more efficient
to recompute them rather than to store and recover them. In other cases such as
incremental placement, changes involve heuristics that are not reversible, and
previous state data must be saved, for example, in a local stack. The
incremental tool that handles the reversal of changes should be the one
actually responsible for storing affected data. Thus, changes to the netlist
initiated by logic synthesis should be reversed by the data model (or an
independent layer built on top of it) and not by logic synthesis itself.
All such model changes should be coordinated through
a central undo facility. Such a facility allows an application to set
checkpoints to which it can return. Applications that might have information to undo register callbacks with the facility, to be called when such a checkpoint is established or when a request is made to revert to a previous checkpoint.
Ordering of callbacks to various applications to
undo changes requires care. One approach is to examine the dependencies between
incremental applications and decide on an overall ordering that would eliminate
conflicts. A simpler solution is to ensure that each atomic change can be
reversed and to have the central undo facility call for the reversal of all
changes in the opposite order from that in which they were originally made.
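A minimal sketch of such a central undo facility, assuming each atomic change registers its own reversal closure, is shown below; the integer "design state" is a stand-in for real model data, and the reversals are replayed in the opposite order from the original changes.

```cpp
// Sketch: central undo facility that records per-change reversal operations
// and replays them in reverse order back to a checkpoint.
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

class UndoFacility {
public:
    std::size_t checkpoint() const { return undo_ops_.size(); }
    void record(std::function<void()> undo_op) {
        undo_ops_.push_back(std::move(undo_op));
    }
    void revert_to(std::size_t cp) {            // undo in reverse order
        while (undo_ops_.size() > cp) {
            undo_ops_.back()();
            undo_ops_.pop_back();
        }
    }
private:
    std::vector<std::function<void()>> undo_ops_;
};

int main() {
    int x = 1, y = 2;                           // stand-ins for design state
    UndoFacility undo;
    std::size_t cp = undo.checkpoint();

    int old_x = x; x = 10; undo.record([&x, old_x] { x = old_x; });
    int old_y = y; y = 20; undo.record([&y, old_y] { y = old_y; });

    undo.revert_to(cp);                         // y restored first, then x
    std::cout << x << " " << y << "\n";         // prints: 1 2
}
```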
Applications can undo changes in their data in one
of two ways, as outlined earlier. Save/restore applications (e.g., placement)
may merely store the previous state and restore it without requiring calls to
other applications. Recompute applications (e.g., timing analysis) may
recompute previous state data that can be uniquely determined from the state of
the model. Recompute applications generally do not have to register specific
undo callbacks, but to allow them to operate, model change callbacks (and all
other change notification callbacks) must be issued for the reversal of model
changes just as they are for the original changes.
Such a central undo facility can also be used to
help capture model change information. This can be either to save to a file,
for example, an ECO file, or to transmit to a concurrently executing parallel
process, on the same machine or another one, with which the current
applications need to synchronize.
Even when we choose to keep the results of a change,
we may need to undo it and then redo it. A typical incremental processing
environment might first identify a timing problem area and then try and
evaluate several alternative transformations. The original model state must be
restored before each alternative is tried, and after all alternatives have been
evaluated, the one that gives the best result is then reapplied.
To make sure that we get the same result when we
redo a transformation that we got when we did it originally, we also need to be
able to reverse an undo. Even for deterministic transformations, the exact
result may depend on the order in which certain objects are visited, and such orderings
may not be semantic invariants and thus may be altered when undoing a change.
Also, a considerable amount of analysis may be required by some transformations
to determine the exact changes that are to be made, and we would like to skip
this analysis when we redo the transformation. Avoiding nondeterministic
behavior is also highly desirable; otherwise, reproducing faulty behavior
during debug is fraught with complications.
For these reasons, the central undo/ECO facility
should be able to store and manage a tree of checkpoints rather than just a
single checkpoint. Trial transformations to a circuit may also be nested, so we
need to be able to stack multiple checkpoints.
It is important to remember that there is no magic
bullet that will turn a nonincremental tool into an incremental one. In
addition to having a common infrastructure, including such things as a common
data model and a callback mechanism, appropriate incremental algorithms need to
be developed.
Following some of these guidelines also encourages
incremental design automation tool development. Rewriting tools is an expensive
proposition. Few new ideas change all aspects of a tool. Incremental
development allows more stability in the tool interface, which is particularly
important for integrated applications. It also allows new ideas to be
implemented and evaluated more cheaply, and it can make a required function
available more quickly.
1.4.4 SPARSE ACCESS
Sparse access refers to only loading the portion of
the design and associated files that are necessary to perform a given task and
deferring loading additional data until needed. This can reduce both memory
usage and runtime by an order of magnitude in some cases.
For example, a design check to verify correct
connectivity to sleep header cells for power gating needs to only traverse the
cell pin and net connections and may avoid loading libraries if cell types are
identified by a naming convention. The netlist portion traversed might be
further limited to just the nets connecting to the sleep header cells and the
cells driving the sleep control signal, ensuring that logic is connected to the
always-on power supply.
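A sketch of such a check follows; the SLEEPHDR_ naming convention and the toy netlist are assumptions made purely for illustration.

```cpp
// Sketch of a sparse design check: only cells matching a (hypothetical)
// sleep-header naming convention and their net connections are visited;
// no library data is loaded at all.
#include <iostream>
#include <string>
#include <vector>

struct Cell { std::string name; std::vector<std::string> nets; };

bool is_sleep_header(const Cell& c) {             // naming convention, assumed
    return c.name.rfind("SLEEPHDR_", 0) == 0;
}

int main() {
    std::vector<Cell> design{
        {"SLEEPHDR_0", {"vdd_virtual_a", "sleep_en"}},
        {"INV_1",      {"n1", "n2"}},
        {"SLEEPHDR_1", {"vdd_virtual_b", ""}},    // missing sleep control net
    };
    for (const auto& c : design) {
        if (!is_sleep_header(c)) continue;        // sparse: skip everything else
        for (const auto& net : c.nets)
            if (net.empty())
                std::cout << "ERROR: unconnected pin on " << c.name << "\n";
    }
}
```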
Sparse access to netlist or parasitic data typically
requires paging memory to disk with an index to specify where to find data for
a given cell or net. Likewise, cells within a library can be indexed to provide
individual access rather than loading the entire library, which can reduce the
memory overhead for loading composite current source (CCS) libraries by an
order of magnitude or more as only a few hundred cells are used out of several
thousand. This may require precharacterization of libraries, for example, to
provide a cached list of which delay buffers are available to fix hold
violations and the subset of those that are Pareto optimal in terms of area or
power versus delay trade-off.
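The indexing scheme can be sketched as a cache that pages cells in on first use; the classes below are illustrative, and the "disk" load is stubbed rather than reading a real indexed library file.

```cpp
// Sketch: on-demand library access. Only the cells actually referenced are
// brought into memory; repeat requests are served from the cache.
#include <iostream>
#include <map>
#include <string>

struct CellTiming { double rise_delay_ns; };      // placeholder for CCS data

class SparseLibrary {
public:
    const CellTiming& get(const std::string& cell) {
        auto it = cache_.find(cell);
        if (it == cache_.end()) {
            std::cout << "paging in " << cell << "\n";
            it = cache_.emplace(cell, load_from_disk(cell)).first;
        }
        return it->second;
    }
private:
    static CellTiming load_from_disk(const std::string&) { return {0.05}; }
    std::map<std::string, CellTiming> cache_;     // only cells actually used
};

int main() {
    SparseLibrary lib;
    std::cout << lib.get("BUF_X2").rise_delay_ns << " ns\n";  // paged in
    std::cout << lib.get("BUF_X2").rise_delay_ns << " ns\n";  // from cache
}
```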
1.4.4.1 MONOLITHIC EDA TOOLS AND MEMORY LIMITS
In-house tools at design companies are often fast
point tools that load only the necessary portion of the design and associated
technology files or libraries to minimize the start-up runtime overhead, as
the task being performed may run in a minute to an hour. This can be
illustrated by the example of running tens of design checks across all the
design blocks in parallel on a server farm on a nightly basis. Such point tools
are very useful to analyze and check designs for any significant issues before
launching further runs that will take significant time.
In contrast, commercial EDA tools have typically
been designed for use in a monolithic manner. One or more major flow steps
would be performed within the tool, such as synthesis, placement, and
optimization; clock tree synthesis (CTS); post-CTS optimization; detailed routing;
and postroute optimization. The entire design, technology files, and libraries
are all loaded, taking significant memory. The initial loading runtime of several minutes is small compared to that of the larger flow step being performed.
Commercial EDA tools are now facing tight memory
limits when analyzing and optimizing large designs across multiple processes
and operating corners. A combination of 30 corners and modes or more is not
uncommon—for example, slow, typical, and fast process corners; low
temperature, room temperature, and high temperature; −5% voltage, typical
voltage, and +5% voltage; supply voltage modes such as 0.7, 0.9, and 1.1 V; and
asleep, standby, awake, and scan operating modes. CCS libraries can be 10 GB
each, so just loading the libraries for all these corners can be 100 GB or more
of memory. Today’s designs typically range from blocks of several hundred thousand gates to hundreds of millions of gates when analyzed in a flat manner at the top level and can also take 100 GB or more of memory, particularly with standard parasitic exchange format (SPEF)–annotated parasitic capacitances. When running multithreaded across a high-end 16 CPU core server to reduce runtimes that would otherwise take days, some data must also be replicated to avoid memory read/write collisions and to limit pauses needed to synchronize data. Today’s server farms only have a few machines with 256 GB of memory, and servers with more memory than that are very expensive.
EDA developers use a variety of standard programming
techniques to reduce memory usage. For example, a vector data structure takes
less memory than a linked list (to store the same data), whereas multiple
Boolean variables may be packed as bits in a single CPU word and extracted with
bitmasks when needed. Duplicated data can be compressed, for example, during
layout versus schematic verification [28] where hierarchical representations of
layout data represent individual library cells containing many repeated geometric
shapes. Memory usage, runtime, and quality of results, such as circuit timing,
area, and power, are carefully monitored across regression suites to ensure
that code changes in the EDA tool do not degrade results or tool performance.
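For instance, the bit-packing technique mentioned above might look like the following sketch; the flag names are illustrative, not from any actual tool.

```cpp
// Sketch: several Boolean attributes of a cell packed into one 32-bit word
// instead of separate bool members, extracted with bitmasks when needed.
#include <cstdint>
#include <iostream>

enum CellFlag : std::uint32_t {
    kFixed     = 1u << 0,   // cell may not be moved
    kDontTouch = 1u << 1,   // cell may not be resized
    kClockCell = 1u << 2,   // part of the clock network
};

struct Cell { std::uint32_t flags = 0; };

int main() {
    Cell c;
    c.flags |= kFixed | kClockCell;              // set two attributes
    bool movable = !(c.flags & kFixed);          // extract with a bitmask
    std::cout << "movable: " << std::boolalpha << movable << "\n";
    std::cout << "flags word size: " << sizeof(c.flags) << " bytes\n";
}
```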
1.4.4.2 PARALLEL AND DISTRIBUTED COMPUTING FOR EDA
EDA vendors are migrating tools to work in a
distributed manner with a controller on one machine distributing tasks to
workers on other machines to perform in parallel. Multithreading and fork–join
[28] are also common parallelism techniques used in EDA software, and they can
be combined with distributed computation. These methods can reduce runtime for
analysis, optimization, and verification by an order of magnitude. Shared
resources can also be more effectively used with appropriate queuing.
Many of the algorithms used in the RTL-to-GDSII flow
are graph algorithms, branch-and-bound search, or linear algebra matrix
computations [29]. While most of these algorithms can be parallelized, Amdahl’s
law [30] limits the speedup due to the serial portion of the computation and
overheads for communication, cache coherency, and bottlenecks where
synchronization between processes is needed. Performance improvements with
parallelism are also often overstated [31].
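As a rough worked illustration (the parallel fraction p = 0.9 below is an assumed value, not a figure from the text), Amdahl's law gives the speedup S(n) on n processors for a computation whose parallelizable fraction is p:

$$ S(n) = \frac{1}{(1-p) + p/n}, \qquad S(16)\Big|_{p=0.9} = \frac{1}{0.1 + 0.9/16} = 6.4 $$

That is, even with 90% of the work parallelized, 16 CPU cores give at most a 6.4× speedup before communication and synchronization overheads are counted.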
Much of the development time in parallelizing
software actually focuses on reducing the runtime of the serial portion, for
example, avoiding mutually exclusive locks on shared resources to minimize
stalls where one process has to wait for another to complete. Fast and memory-efficient
implementations of serial algorithms, such as Dijkstra’s shortest-path
algorithm [32], are still very important in EDA as there is often no better
parallel algorithm [31].
As an example of how distributed computing is used
in EDA, optimization at several important timing corners may require a server
with a large amount of memory, but lower-memory servers can verify that timing
violations do not occur at other corners. Analysis at dominant corners causing
the majority of setup and hold timing violations reduces the need for analysis
at other corners. However, there are still cases where a fix at a dominant
corner can cause a violation at another corner, so it cannot be entirely
avoided. The controller and the workers cannot afford the memory overhead of
loading libraries for all the corners, nor the additional analysis runtime.
Distributed optimization allows staying within the server memory limits. One
worker performs optimization with analysis at dominant corners only, with other
workers verifying that a set of changes do not violate constraints at the other
corners.
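The controller/worker pattern described above can be sketched as follows, using local threads in place of remote workers and a stubbed per-corner check; the corner names and pass/fail behavior are purely illustrative.

```cpp
// Sketch: a candidate change set is accepted only if every non-dominant
// corner's verification passes; checks run in parallel.
#include <future>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical per-corner check; a real worker would load only that corner's
// libraries and sparsely read the affected portion of the design.
bool verify_at_corner(const std::string& corner, int change_set_id) {
    return !(corner == "fast_cold" && change_set_id == 2);  // one failing case
}

bool accept_change_set(int change_set_id, const std::vector<std::string>& corners) {
    std::vector<std::future<bool>> checks;
    for (const auto& c : corners)
        checks.push_back(std::async(std::launch::async, verify_at_corner, c, change_set_id));
    bool ok = true;
    for (auto& f : checks) ok = f.get() && ok;               // join all workers
    return ok;
}

int main() {
    std::vector<std::string> other_corners{"fast_cold", "slow_hot_lowv"};
    for (int id = 1; id <= 2; ++id)
        std::cout << "change set " << id << ": "
                  << (accept_change_set(id, other_corners) ? "accepted" : "rejected")
                  << "\n";
}
```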
Partitioning large designs into smaller portions
that can be processed separately is a common approach to enabling parallelism
[29]. This may be done with min-cut partitioning [12] to minimize cross-border
constraints where there will be suboptimality. However, some regions may be
overly large due to reconvergent timing paths, latch-based timing, cross
coupling between wires, and other factors that limit a simple min-cut partitioning
approach. The regions need to be sized so as to balance loads between workers.
How best to subpartition a design is a critical problem in EDA [29] and varies
by application. For example, timing optimization may consider just critical
timing paths, whereas area optimization may try to resize every cell that is
not fixed.
Portions of a design can be rapidly optimized in
parallel in a distributed manner. Border constraints imposed to ensure
consistency between workers’ changes to the design do reduce the optimality.
Individual workers use sparse access to just read the portion of the design
that they are optimizing, though logical connectivity may not be sufficient as
cross-coupling analysis considers physically nearby wires.
Parallel computation can be a source of
nondeterminism when multiple changes are performed in parallel. To avoid this
nondeterminism, incremental analysis on distributed workers may need to use a
consistent snapshot of the circuit with periodic synchronization points at
which changes and analysis updates are done or undone in a fixed deterministic
order. Some operations may also need to be serial to ensure consistent
behavior.