This paper has evolved from an article by Clive “Max” Maxfield that was first published in EE Times and also on the
Programmable Logic DesignLine website. Any portions of the original article that
appear in this paper are reproduced here with the kind permission of CMP/EE Times.
Before We Start
Before we start, we should note that the following discussions relate to the illustration shown below (of which we are inordinately
proud, because capturing the diverse computing options graphically proved to be a non-trivial task).
Also, this paper is intended to be a "living-breathing" document that will be updated to reflect any new players and architectures in the multi-processor
and reconfigurable computing arena. So if you think we got anything wrong or we missed something out, don't hesitate to contact us and we'll leap into action
with gusto and abandon.
Defining Some Terms
OK, let's kick things off by defining a few concepts, because this will make things easier as we wend our way through the rest of this paper.
The term
central processing unit (CPU) refers to the "brain" of a general-purpose digital computer – this is where all of the decision making and
number crunching operations are performed. By comparison, a
digital signal processor (DSP) is a special-purpose CPU that has been created to process
certain forms of digital data more efficiently than can be achieved with a general-purpose CPU.
Both CPUs and DSPs may be referred to as "processors" for short. The term microprocessor refers to a processor that is implemented on a single
integrated circuit (often called a "silicon chip," or "chip") or a small number of chips. The term microcontroller refers to the combination of a
general-purpose processor along with all of the memory, peripherals, and input/output (I/O) interfaces required to control a target electronic system
(all of these functions are implemented on the same chip to cut down on size, cost, and power consumption).
The heart of a processor is its arithmetic-logic unit (ALU) – this is where arithmetic and logical operations are actually performed on the
data. Also, in the case of DSP algorithms, it is often required to perform multiply-accumulate (MAC) operations in which two values are multiplied
together and the result is added to an accumulator (that is, a register in which intermediate results are stored). Thus, DSP chips often contain special
hardware MAC units.
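As a minimal sketch of the MAC operation described above, the following Python fragment computes one output sample of a finite impulse response (FIR) filter as a sequence of multiply-accumulate steps. The function name `mac_fir_sample` and the sample values are our own illustrative assumptions; on a DSP chip, each pass through the loop body would be handled by a hardware MAC unit rather than by separate multiply and add instructions.

```python
def mac_fir_sample(coefficients, samples):
    """Return sum(c * x) using an explicit accumulator, one MAC per tap."""
    if len(coefficients) != len(samples):
        raise ValueError("coefficients and samples must have the same length")
    accumulator = 0  # the register in which intermediate results are stored
    for c, x in zip(coefficients, samples):
        accumulator += c * x  # multiply two values, add the product to the accumulator
    return accumulator


# Three filter taps applied to three input samples (made-up values).
result = mac_fir_sample([1, 2, 3], [4, 5, 6])
print(result)
```

The accumulator is kept as a separate variable (rather than using the built-in `sum`) purely to mirror the register described in the text.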
Last but not least, the term core is understood to refer to a microprocessor (CPU or DSP) or microcontroller that is implemented as a function
on a larger device such as a field-programmable gate array (FPGA) or a System-on-Chip (SoC). Depending on the context, the term processor may be
used to refer to a chip or a core. [The underlying concepts behind devices such as FPGAs and SoCs – and also ASICs and ASSPs as mentioned later in
this paper – are explained in excruciatingly interesting detail in our book Bebop to
the Boolean Boogie (An Unconventional Guide to Computers), ISBN: 0750675438.]
Introduction
The first commercial microprocessor was the Intel 4004, which was introduced in 1971. This device had a 4-bit CPU with a 4-bit data bus
and a 12-bit address bus (the data and address buses were multiplexed through the same set of four pins because the package was pin-limited). Comprising
only 2,300 transistors and with a system clock of only 108 kHz, the 4004 could execute only 60,000 operations per second.
For the majority of the three and a half decades since the 4004's introduction, increases in computational performance and throughput have been largely
achieved by means of relatively obvious techniques as follows:
a) Increasing the width of the data bus from 4 to 8 to 16 to 32 to the current 64 bits used in high-end processors.
b) Adding (and then increasing the size of) local high-speed "cache" memory.
c) Shrinking the size – and increasing the number – of transistors; today's high-end processors can contain hundreds of millions of transistors.
d) Increasing the sophistication of processor architectures, including pipelining and adding specialized execution blocks, such as dedicated floating-point units.
e) Increasing the sophistication of such things as branch prediction and speculative execution.
f) Increasing the frequency of the system clock; today's high-end processors have core clock frequencies of 3 GHz (that's three billion clock cycles a second) and higher.
The problem is that these approaches can only go so far, with the result that traditional techniques for increasing computational performance and throughput
are starting to run out of steam. When a conventional processor cannot meet the needs of a target application, it becomes necessary to evaluate alternative
solutions such as multiple processors (in the form of chips or cores) and/or configurable processors (in the form of chips or cores).
The Computing Universe
For the purposes of this paper, we will consider the term
computing in its most general sense; that is, we will understand "computing"
to refer to the act of performing computations. There are many different types of computational tasks we might wish to perform, including – but not limited
to – general-purpose office-automation applications (word-processing, spreadsheet manipulation, etc.); extremely large database manipulations such as
performing a Google search; one-dimensional digital-signal processing (DSP) applications such as an audio codec; and two-dimensional DSP applications such
as edge-detection in robotic vision systems.
In many cases, these different computational tasks are best addressed by a specific processing solution. For example, an FPGA may be configured (programmed)
to perform certain DSP tasks very efficiently, but one typically wouldn't consider using one of these devices as the main processing element in a desktop computer.
Similarly, off-the-shelf Intel and AMD processor chips are applicable to a wide variety of computing applications, but you wouldn't expect to find one powering a
cell phone (apart from anything else, the battery life of the phone would be measured in seconds).
Fundamentally, there are three main approaches when it comes to performing computations. At one end of the spectrum we have a single, humongously large
processor; at the other end of the spectrum we have a massively-parallel conglomeration of extremely fine-grained functions (which some may call "a great big
pile of logic gates"); and in the middle we have a gray area involving multiple medium- and coarse-grained processing elements. (Note that this paper
focuses on the microprocessor/CPU/DSP arenas; mainframe computers and supercomputers are outside the scope of these discussions.)
Single Processors
The classical processing solution for many applications is to use a single, humongously large "off-the-shelf" processor, such as a general-purpose
CPU chip from Intel (www.intel.com) or AMD (www.amd.com) or a special-purpose DSP chip from Texas Instruments (www.ti.com).
Similarly, in the case of embedded applications, one
might choose to use a single general-purpose processor core from ARM or ARC or a DSP core from TI.
At some stage, a single processor simply cannot meet the needs of a target application, in which case it becomes necessary to evaluate alternative solutions
as discussed in the following topics.
Multiple Processors (Homogeneous)
Perhaps the most famous early example of using multiple processors was the INMOS
transputer chip, which surfaced in the mid 1980s
(the all lowercase "transputer" was the official written form). As a point of interest, the native programming language for the transputer was
occam
(again, the all lowercase "occam" was the official written form), which was named in honor of the 14th century English philosopher and Franciscan friar
William of Ockham, also spelled Occam (1286–1348 give or take a few years).
Each transputer chip contained a single processor that was designed to communicate with – and work in parallel with – other transputers. The idea was
that users could hook as many transputer chips together on a circuit board as was necessary to satisfy the computational requirements of the target application.
Many believed that the transputer was going to be the next great leap in computing, but creating programs that ran efficiently on this parallel architecture was
non-trivial, and the transputer eventually faded away.
Although most non-engineers don't realize it, it is actually very common for systems to use multiple processors. Consider a home computer, for example;
in addition to the main CPU, the keyboard will also have its own processor; each hard disk and optical (CD/DVD) drive will typically contain two or more
processors, and so forth. Even a simple "USB Memory Stick" contains its own processor, which is used to make the contents of the stick appear to be a hard disk drive as
far as the host computer's operating system is concerned.
However, the above examples are characterized by the fact that these multiple processors all have very focused well-partitioned tasks that can be largely
performed in isolation. It is much more complicated to have tightly-coupled homogeneous processors, such as the dual-core chips that are now available
from AMD and Intel (the term "homogeneous" means that these processing elements are of the same kind). Another term that is applicable to this type of
configuration is symmetric multiprocessing (SMP), which means that the view of the rest of the system – memory, input/output, operating system, etc.
– is exactly the same (i.e. "symmetrical") for each processor.
When moving from a single processor/core to a dual-processor/core configuration, the system becomes noticeably more responsive, and users don't experience
those annoying "hang-ups" and "stalls" that are the hallmark of a single-processor environment. And two processors are only the start; for example,
Intel is already talking about a four-core microprocessor called "Clovertown," which is expected to appear on the market in early 2007.
Meanwhile, Sun Microsystems (www.sun.com) is already fielding an eight-core processor called the
UltraSPARC T1. Formerly known as Niagara, this extreme-performance device is well-suited to highly-threaded commercial environments, such as thread-aware web
servers, application servers, and database servers. Of particular interest is the fact that Sun is open sourcing this chip; the register transfer level
(RTL) representation of this device was made available to the engineering community when the www.opensparc.net
website went live on January 24th 2006.
And if you think an eight-core processor is impressive, you should check out the Vega processor chip from Azul
Systems (www.azulsystems.com). The current implementation of this device boasts
an array of twenty-four 64-bit CPU cores, and Azul have announced that a forty-eight core version will be made available in 2007.
Before we move on, we should also make mention of the Multicore Association (www.multicore-association.org),
which is a new industry group focused on companies involved with multi-processor hardware, software, and system implementations.
Multiple Processors (Heterogeneous)
As opposed to using multiple identical cores, it may be preferable to use a mixture of dissimilar cores. For example, the main digital chip
in even the most rudimentary cell phone will typically contain at least one CPU core (to manage the human-machine interface) coupled with at least one DSP core (to
perform the baseband signal processing functions). Such solutions are referred to as being "heterogeneous," meaning
"consisting of dissimilar elements or parts."
One example of this type of scenario is the Cell processor from IBM (www.ibm.com), which is a single chip
containing a general-purpose CPU core tightly coupled with eight DSP cores [IBM actually call these DSP cores Synergistic Processor Elements (SPEs); these
little scamps contain floating-point engines and other units, and they are predominantly used for graphics calculations]. Another example is a high-end cell phone,
which may include two or more CPU cores and two or more DSP cores combined with large numbers of hardware accelerator blocks and peripheral functions.
Things are further complicated by the fact that the processing cores and other functional units may have their own individual memories along with shared
memory structures; and everything may be connected together using multi-level buses and cross-point switches (some of the larger chips actually feature
a Network-on-Chip (NoC), which the various processors and peripherals use to communicate with each other). One term which is commonly associated with
this type of environment is asymmetric multiprocessing (AMP or ASMP), in which computational tasks (or threads) are strictly divided by
type between processors.
CPU Chips Linked to FPGA-based Coprocessors and Accelerators
As a starting point for this topic, we should note that several companies make computer motherboards that support two general-purpose
processors linked by a high-speed bus. For example, there are several motherboards that boast two of AMD's Opteron processor chips linked by the
high-speed, low-latency
HyperTransport (HTX) bus, where each of these
processors may contain two, four, or more CPU cores as new chips come onto the market.
The idea is to remove one of these general-purpose processor chips and replace it with a small pin-compatible card containing one or more high-end FPGAs.
In this case, the general-purpose AMD processor is used to execute control-type tasks, while the FPGA module is configured to perform
algorithmically-intensive data-processing and number-crunching tasks with extreme speed. Meanwhile, the HyperTransport bus is used to move massive
amounts of data around the system with extreme speed.
A good example of this type of approach is offered by the folks at XtremeData (www.xtremedatainc.com)
who combine an AMD processor with an FPGA-based module using high-capacity, high-performance FPGAs from Altera
(www.altera.com). Similar examples are provided by
Cray (www.cray.com) and DRC Computer Corporation
(www.drccomputer.com), who do much the same thing but with FPGAs from
Xilinx (www.xilinx.com).
Note #1: Many servers now include a HyperTransport (HTX) expansion slot along with their PCI and PCI Express slot(s). The HTX slot
attaches directly to the same bus that both processors use; this means that the FPGA-based accelerator card can communicate with the processors in the
same way as the 'socket' solutions discussed above, but now you get to keep both processor chips on the motherboard. (There is no technical reason why
there couldn't be multiple HTX expansion slots, but today's systems typically offer only one such slot.)
Note #2: AMD's initiative to promote the openness of the HyperTransport Bus is known as
Torrenza; in response, Intel announced a proposal –
codenamed Geneseo – to open up
their front side bus (FSB) to facilitate the same type of implementation.
There are many other vendors with interesting solutions, such as Celoxica (www.celoxica.com),
who plug into the HTX slot discussed above; SRC Computers (www.srccomputers.com), whose
FPGA-based accelerator card plugs directly into one or more memory slots; and Nallatech (www.nallatech.com),
who boast a wide variety of products and tools.
On-chip Coprocessors and Accelerators
If you are in the process of creating a new chip from the ground up, one technique is to augment a pre-defined processor core with one or
more dedicated coprocessors and/or hardware accelerators. For example, CriticalBlue (
www.criticalblue.com)
has a tool called Cascade that accepts as input compiled applications (which may be referred to as
binaries) in the form of executable ARM machine code. By means of a simple interface,
the user selects which functions are to be accelerated, and Cascade then generates the
register transfer level (RTL) description for a dedicated
coprocessor (and the microcode to run on that coprocessor) to implement the selected functions.
A somewhat similar approach is that taken by Binachip (www.binachip.com), whose tools also
take compiled (binary) programs. However, these tools first read the binary code into a neutral format, then they allow you to select which functions will be
implemented in hardware and which functions are to be realized in software. Finally, they re-generate the binary code for the software portions of the
system and generate register transfer level (RTL) representations for the accelerators used to implement the hardware portions of the system.
An alternative technique is that adopted by Poseidon Systems (www.poseidon-systems.com), whose
Triton tool suite allows users to analyze ANSI standard C source code, to identify areas of the code to be accelerated, and to generate accelerators/coprocessors
that can be used in conjunction with ARM, PowerPC, Nios, or MicroBlaze hard and soft processor cores implemented in SoCs and/or FPGAs.
And then there are the tools from Synfora (www.synfora.com), which can also analyze ANSI standard C source code and
generate register transfer level (RTL) representations for corresponding hardware accelerators.
In reality, there are quite a few other players in this arena; these include (but are not limited to)
Altera (www.altera.com) with its C2H (ANSI C to hardware accelerator) technology, Celoxica
(www.celoxica.com) with its Agility Compiler (SystemC to hardware accelerator) and DK Suite
(Handel-C to hardware accelerator) approaches, Forte Design Systems (www.forteds.com) with
its Cynthesizer (SystemC/C++ to hardware accelerator) suite, and Mentor Graphics (www.mentor.com)
with its Catapult BL and SL (C to hardware accelerator) technology.
Large Arrays of "Things"
One way to think of the hardware used to perform computations is in terms of its granularity. The finest level of granularity is provided by
an
application-specific integrated circuit (ASIC) or
application-specific standard part (ASSP), in which algorithms can be hand-crafted in silicon
at the level of individual logic gates. (An ASIC is a device that is custom-created for a particular application and is intended for use by only one – or very
few – companies. By comparison, ASSPs are devices that are created using ASIC technologies, but that are intended to be sold as standard parts to anybody
who wants to use them.)
Next, we have FPGAs with their four-input lookup tables (LUTs). These are off-the-shelf chips that can contain the equivalent of tens of thousands to tens of millions
of logic gates. FPGAs are designed in such a way that they can be configured (programmed) to perform some desired function or functions; the SRAM-based versions
of these devices have the advantage that they can be reconfigured as required. [Structured ASICs may be considered to occupy a space somewhere between ASICs
and FPGAs, especially in the case of devices from eASIC (www.easic.com), which combine custom routing with
FPGA-like SRAM-based LUTs.]
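To make the idea of a configurable four-input LUT a little more concrete, here is a minimal software model, assuming that a LUT is simply a 16-entry truth table addressed by its four inputs. The names `make_lut4`, `AND4`, and `XOR4` and the bit ordering are our own illustrative choices and do not correspond to any particular vendor's configuration format.

```python
def make_lut4(config_bits):
    """Build a 4-input LUT from a 16-bit configuration word.

    Bit i of config_bits is the output for the input combination whose
    binary value is i (input a is the least significant bit).
    """
    if not 0 <= config_bits < (1 << 16):
        raise ValueError("a 4-input LUT needs exactly 16 configuration bits")

    def lut(a, b, c, d):
        index = (d << 3) | (c << 2) | (b << 1) | a  # inputs select one table entry
        return (config_bits >> index) & 1

    return lut


# "Configuring" the same structure with different bits gives different functions.
AND4 = make_lut4(0x8000)  # only entry 15 (all inputs 1) is set
XOR4 = make_lut4(0x6996)  # entries with an odd number of 1 inputs are set
```

Reconfiguring the device then amounts to loading a different 16-bit word into the same structure, which is why the SRAM-based versions can change function as required.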
Note that we might decide to include one or more hard processor cores on an ASIC or ASSP, in which case we would refer to this device as a System-on-Chip (SoC).
Similarly, we might decide to include one or more hard and/or soft processor cores on an FPGA (which may also be viewed as an SoC by some folks). All of these
cases would then be considered to be a hybrid solution involving a mixture of traditional processor core(s) and algorithms implemented in gates/LUTs/etc.
In recent years, a number of companies have started to offer more exotic architectures, each of which is applicable to a focused set of computational
applications. If we consider these offerings in terms of granularity, then the first step above traditional FPGAs would be an architecture such as that
provided by Elixent (www.elixent.com). This reconfigurable algorithm processing (RAP) architecture –
which is targeted toward the efficient implementation of arithmetic/DSP functions – is based on an array of 4-bit arithmetic-logic units (ALUs) in a "sea"
of programmable interconnect. These ALUs can be linked using fast carry chains so as to implement wider functions. In addition to forming part of a datapath,
the output of one ALU may be used to select the instruction of another ALU. The programming model for these devices is to take the same register transfer level (RTL)
representation used to create an ASIC or to configure (program) an FPGA, and to use an appropriate synthesis engine to generate a corresponding configuration file.
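The way in which narrow ALUs can be linked by carry chains to implement wider functions can be sketched as follows. This is a generic software model of carry chaining, not a description of Elixent's actual hardware; the function names `alu4_add` and `chained_add` are hypothetical, and only the add operation is modeled.

```python
def alu4_add(a, b, carry_in):
    """Add two 4-bit values plus a carry; return (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + (carry_in & 1)
    return total & 0xF, total >> 4


def chained_add(a, b, width_bits=16):
    """Add two unsigned values by chaining 4-bit ALUs through their carries."""
    if width_bits % 4 != 0:
        raise ValueError("width_bits must be a multiple of 4")
    result, carry = 0, 0
    for shift in range(0, width_bits, 4):
        # Each 4-bit slice is handled by one ALU; its carry out feeds the next ALU.
        nibble_sum, carry = alu4_add(a >> shift, b >> shift, carry)
        result |= nibble_sum << shift
    return result, carry


total, carry_out = chained_add(0x1234, 0x0FCD)
print(hex(total), carry_out)
```

With the default width of 16 bits, four 4-bit ALUs are used; the final carry out indicates overflow beyond the chosen width.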
Next, we have the field programmable object array (FPOA) architecture from MathStar (www.mathstar.com). An
example FPOA device may contain around 400 silicon "objects" in the form of 16-bit ALUs (each with its own instruction cache and scratchpad memory), register files,
and multiply accumulators (MACs) – along with internal RAM banks and external high-speed memory interfaces – all of which can communicate with each
other through programmable interconnect fabric. Each object can be programmed individually and acts autonomously. All of the objects and the interconnect
run at 1 GHz. In addition to general-purpose I/O (GPIO) pins, the FPOA boasts high-speed I/O that can transmit and receive 2 × 32 GB/s. The main programming
model for these devices is to use a graphical interface that generates SystemC, and the target application area is for compute-intensive DSP tasks such as edge
detection and pattern recognition for robotic vision systems with high-frame-rates and high resolutions.
Another group of architectures may be classed as comprising one (or a small number) of general-purpose CPU cores coupled with an array of processing elements
(PEs). Depending on the implementation, each of these PEs can contain multipliers, adders, ALUs, MACs, counters, synchronizers, memory, etc. Three good examples
of this concept are IPFlex (www.ipflex.com) with an off-the-shelf device comprising two CPUs and hundreds
of 32-bit PEs; ClearSpeed (www.clearspeed.com) with an off-the-shelf device comprising a general-purpose
CPU coupled with an array of 32/64-bit PEs containing floating-point multipliers and suchlike targeted toward scientific and engineering calculations;
and IMEC (www.imec.be) with a configurable core comprising a single very long instruction word (VLIW) CPU coupled
with an array of 32/64 PEs each containing an ALU/MAC combo.
A good example of the next higher level of granularity is provided by picoChip (www.picochip.com), whose
picoArray features several hundred 16-bit CPU and DSP cores connected by a sea of programmable interconnect that can move 5 terabits of data per second around
the device. Each core has its own local memory (ranging from 1K to 64K depending on the core type). The programming model for a picoArray is an interesting
mixture of styles. A VHDL block-level netlist is used to define the connectivity between each of the CPU and DSP cores (each block in the netlist maps onto
a specific type of core); meanwhile, the actual function of each block is defined in C and/or assembly code.
Another example of this level of granularity is provided by the multiprocessor DSP (MDSP) architecture from Cradle Technologies
(www.cradle.com). Current incarnations of the MDSP offer up to 8 CPU cores and 16 DSP cores. Each of these
32-bit cores has its own local instruction and data memory. The latest programming model for these devices is to create a C program that is divided
into multiple threads, and to tag each thread as being either a control thread (to be executed on a CPU) or a signal processing thread (to be executed
on a DSP). A run-time dynamic scheduler is then used to assign threads to available resources on the device.
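The tag-then-schedule idea described above can be sketched as follows. This is a deliberately simplified model and not Cradle's actual scheduler: the function name `schedule`, the tag strings "control" and "signal", and the round-by-round assignment policy are all assumptions made for illustration.

```python
from collections import deque


def schedule(threads, cpu_count, dsp_count):
    """Assign tagged threads to cores, one round at a time.

    threads: list of (name, tag) pairs, where tag is "control" or "signal".
    Returns a list of rounds; each round maps a core name to a thread name.
    """
    queues = {"control": deque(), "signal": deque()}
    for name, tag in threads:
        if tag not in queues:
            raise ValueError(f"unknown tag {tag!r} on thread {name!r}")
        queues[tag].append(name)

    # Control threads may only run on CPUs, signal threads only on DSPs.
    cores = {
        "control": [f"cpu{i}" for i in range(cpu_count)],
        "signal": [f"dsp{i}" for i in range(dsp_count)],
    }
    rounds = []
    while any(queues.values()):
        assignment = {}
        for tag, pending in queues.items():
            if pending and not cores[tag]:
                raise ValueError(f"no cores available for {tag} threads")
            for core in cores[tag]:
                if not pending:
                    break
                assignment[core] = pending.popleft()
        rounds.append(assignment)
    return rounds


demo = schedule(
    [("ui", "control"), ("fft", "signal"), ("fir", "signal"), ("mix", "signal")],
    cpu_count=1,
    dsp_count=2,
)
print(demo)
```

A real run-time scheduler would react to threads finishing at different times; the fixed rounds here simply show how the tags keep each kind of thread on the matching kind of core.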
And yet another example is provided by Ambric (www.ambric.com). Right from the beginning, the folks
at Ambric resolved that massive parallelism is only practical if you first start with the programming model, and then build the chip accordingly, rather
than the other way around. Thus, they started by defining a structural object programming model, which is a hierarchical structure of self-contained objects
linked through asynchronous self-synchronizing channels. The objects are written in Java or assembly code, and the structure is programmed graphically
or textually in an Eclipse-based Integrated Design Environment (IDE). It was only after defining this programming model that the little rapscallions
built a silicon chip upon which to implement this programming model – and what a chip it is! This little scamp comprises an array of 360 32-bit
CPU/DSP cores and 360 1-KByte RAMs all linked by a configurable interconnect of channels. The result is a programmable chip capable of performing one
tera-operations/second (which makes your eyes water) that can be easily programmed by system architects and software engineers.
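The notion of self-contained objects linked through self-synchronizing channels can be illustrated with a small sketch. Ambric's objects are written in Java or assembly code; we use Python here only for illustration, modeling each channel as a one-entry blocking queue from the standard library. The object names (`source`, `scale`, `sink`), the `END` sentinel, and the `run_pipeline` helper are all hypothetical.

```python
import queue
import threading

END = object()  # sentinel that marks the end of a stream


def source(values, out_ch):
    for v in values:
        out_ch.put(v)  # blocks while the channel is full
    out_ch.put(END)


def scale(factor, in_ch, out_ch):
    while True:
        item = in_ch.get()  # blocks while the channel is empty
        if item is END:
            out_ch.put(END)
            return
        out_ch.put(item * factor)


def sink(in_ch, results):
    while True:
        item = in_ch.get()
        if item is END:
            return
        results.append(item)


def run_pipeline(values, factor):
    """Wire three objects together with two channels and run them in parallel."""
    a, b = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
    results = []
    workers = [
        threading.Thread(target=source, args=(values, a)),
        threading.Thread(target=scale, args=(factor, a, b)),
        threading.Thread(target=sink, args=(b, results)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


print(run_pipeline([1, 2, 3], 10))
```

No object knows anything about its neighbors except the channels it is handed, and the blocking `put` and `get` calls provide all of the synchronization; this is the property that makes the structure easy to compose hierarchically.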
Configurable Processors
As with so many things in computing, the term "configurable" is something of a slippery customer, because it means different
things to different people. In the case of cores from ARC (
www.arc.com), for example, you
have the ability to customize the instruction set – and therefore the architecture of the core. By analyzing your source code application(s) using
tools from ARC, you can determine which instructions aren’t used and remove them from the instruction set and the processor core. Also, you have
the ability to add new instructions to the core (this is a tad more complicated).
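The first half of this process (determining which instructions are never used) can be sketched in a few lines. This is not ARC's tool; the function name `unused_instructions`, the toy instruction set, and the assumption that the mnemonic is the first token on each assembly line are all ours.

```python
def unused_instructions(instruction_set, program):
    """Return the mnemonics in instruction_set that never appear in program.

    program is a list of assembly lines; the mnemonic is assumed to be the
    first whitespace-separated token on each non-blank line.
    """
    isa = {m.lower() for m in instruction_set}
    used = {line.split()[0].lower() for line in program if line.strip()}
    unknown = used - isa
    if unknown:
        raise ValueError(f"program uses unknown instructions: {sorted(unknown)}")
    return sorted(isa - used)


# A made-up seven-instruction core and a four-line program.
toy_isa = {"add", "sub", "mul", "div", "ld", "st", "br"}
toy_program = [
    "ld  r1, [r2]",
    "add r1, r1, r3",
    "st  r1, [r2]",
    "br  loop",
]
candidates = unused_instructions(toy_isa, toy_program)
print(candidates)
```

In this toy case the multiply, divide, and subtract instructions are candidates for removal from the core.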
Another technique is the concept fielded by Tensilica (www.tensilica.com).
In this case, you start with a predefined 32-bit post-RISC processing engine called Xtensa that comprises around 25K gates. Next, Tensilica's tools analyze
your C/C++ application and evaluate millions of possible processor extensions based on techniques like single-instruction-multiple-data (SIMD) and vector
operations, operator fusion, and parallel execution. Once you select the configuration that's best for your particular application, a processor generator
outputs the register transfer level (RTL) description for your custom processor along with a custom compiler, assembler, and source-level debugger.
A typical customer may end up with 5 or 6 heterogeneous Tensilica cores on their SoC, and some devices (for networking applications) have several
hundred such cores.
As an aside, in February 2006, Tensilica started offering a suite of off-the-shelf cores called the Diamond Standard family. These are cores that Tensilica
have pre-configured to perform a range of CPU and DSP functions extremely efficiently (these cores feature extremely high performance coupled with low
power consumption).
And then we have the guys and gals at CoWare (www.coware.com) with their Processor Designer
technology, which allows you to create a custom core from the ground up. As opposed to ARC and Tensilica whom we might regard as providing configurable IP,
the tools from CoWare should be regarded as being more of an Electronic Design Automation (EDA) approach. In this case, the folks at CoWare have developed a high-level language that is
designed to allow you to specify the required functionality of a processor core, including things like the instructions forming the instruction set,
register files, execution units, the memory subsystem, and so forth. Using this language, you can define CPU and/or DSP cores with a wide variety of
characteristics, such as single instruction multiple data (SIMD) capabilities, very long instruction word (VLIW) superscalar architectures, and so on.
Then, once you are ready, you press the "Go" button and Processor Designer generates the register transfer level (RTL) representation used to create
your core, along with a custom assembler, C compiler, linker, debugger, and instruction set simulator (ISS).
Another group of folks worth mentioning are the guys and gals at Target Compiler Technologies (www.retarget.com)
whose Chess/Checkers technology also allows you to create a custom core from the ground up. Once again, this is more of an EDA approach. Also known as TCT, Target
is an interesting company in that they are reputed to have more design wins in this space than any of their competitors, but not many people know about them (apart from
the folks who are in this arena). An industry expert told the author that this is largely because everyone who works at Target is an engineer with at least
five different jobs, and nobody has the time (or inclination) to do any marketing.
Last but not least, we should also note that ARM (www.arm.com) has a product called OptimoDE that can be used to generate specially configured cores. However, these cores are designed
to act as slaves (coprocessors); that is, they require a host processor to load their local memories and start them running. Also, someone who shall remain
nameless told the authors of this paper that: "OptimoDE is so difficult to work with that only a few guys in Belgium actually know how to use it!"
Reconfigurable Processors
The term "reconfigurable computing" means different things to different folks. The best comparison the authors have heard thus far is that
of the transporter systems on Star Trek. By this we mean that we all know how these devices are supposed to work and what they do, but we don't have a clue how
to build one with the technologies available today.
Similarly, engineers have a vision of the ideal reconfigurable computing scenario, which involves a silicon chip whose function can be reconfigured at the
level of individual logic gates (that is, changing an AND gate into an OR gate, for example) and whose connections between gates can be reconfigured
on-the-fly without any negative impact with regard to performance or power consumption. In this dream world, it would also be possible to reconfigure
certain portions of the device while other portions continued to function, thereby allowing new design variations to dynamically evolve in real-time.
The problem is that, at this time, we don't have a clue how to build such a device and – even if we did – we don’t have the tools required to
program one of these little scamps.
OK, back to the real world. One incarnation of reconfigurable computing that can be achieved with today's technologies is known as static reconfiguration.
In this case, a programmable device such as an FPGA is first configured to perform a certain task, and is later reconfigured to perform a different task.
By comparison, dynamic reconfiguration refers to configuring different portions of a device "on-the-fly" while other portions of the device continue
to perform their tasks.
One interesting scenario involves an FPGA containing a number of soft microprocessor and DSP cores, each executing its own local microcode. A special
controller block can be used to supply the various processor cores with new microcode as required (this new microcode could be stored in an external memory).
Perhaps the best example of reconfigurable computing to date is provided by Stretch Inc. (www.stretchinc.com),
which provides a family of off-the-shelf software-configurable processors. Each of these chips contains two main units: Tensilica's Xtensa core coupled with
Stretch's reconfigurable instruction set extension fabric (ISEF), which contains wide register files and lots of computational units
(multipliers, adders, and so forth) in a sea of programmable interconnect. Stretch's tools analyze your C/C++ application and generate a corresponding
configuration file to program the ISEF to perform specific tasks. The point here is that the ISEF can be reconfigured thousands of times a second so as
to tailor it to better serve different portions of the algorithm.
Summary
This paper has really only touched the surface of the state of play in modern computing. In addition to yet more hardware solutions,
it is also necessary to consider such things as operating system issues along with the problems of programming, debugging, verifying, and profiling
applications.
The point is that there are now a lot of options available to the designers of today's state-of-the-art systems. As usual, system architects have
to perform the traditional tradeoff between power, performance, and cost. Ultimately designers have to ask the questions: How much performance do
we want? How much do we need? And how much can we afford?