Power efficiency has
become the driving factor behind the design of new CPUs. Even
Intel has admitted that it can’t continue
to push up the clock speed and power dissipation of its processors,
instead using two or more CPUs per chip to increase performance.
While this multicore approach is a step in the right direction,
over time the CPUs themselves must evolve. In this case, however,
the change is more like de-evolution.
During the 1990s, CPU designers created ever-more complex microarchitectures
intended to maximize both clock speed and performance per cycle.
Achieving these goals required techniques such as superscalar,
out-of-order instruction issue, superpipelining, instruction
translation, and speculative execution. With a new focus on
performance per
watt instead of per cycle, we must now reduce or discard these
power-wasting techniques. Instead, CPUs must use power mainly
to execute instructions.

Shortening the Pipeline

Superpipelining was best exemplified by Intel’s “Prescott” Pentium
4, which used a staggering 31-stage pipeline to achieve clock
speeds in excess of 3GHz. Intel recently scrapped this design,
however, returning to the relatively moderate P6 pipeline, which
has 14 stages.
The main problem with the long pipeline is that the high clock frequency
(f) directly increases power dissipation (P = CV²f). Short pipelines
have other advantages as well. They require less buffering and
bypass logic. And they have shorter branch penalties, reducing
the need for power-wasting branch-prediction logic. These changes
reduce die area and design time as well as power dissipation.
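
The power relation cited above can be made concrete with a small sketch. The capacitance, voltage, and frequency values below are illustrative assumptions, not measurements of any real processor, and the idea that a faster clock may also call for a higher supply voltage is likewise an assumption added for the example.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """Dynamic switching power in watts, using P = C * V^2 * f."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical values chosen only to show the scaling.
C = 1e-9  # effective switched capacitance, in farads (assumed)

baseline = dynamic_power(C, 1.2, 2.0e9)     # 2GHz at 1.2V
faster = dynamic_power(C, 1.2, 3.0e9)       # 3GHz at the same voltage
faster_hot = dynamic_power(C, 1.4, 3.0e9)   # 3GHz with a raised voltage (assumed)

# Power grows linearly with frequency and with the square of voltage.
print(f"3GHz, same voltage:   {faster / baseline:.2f}x baseline power")
print(f"3GHz, higher voltage: {faster_hot / baseline:.2f}x baseline power")
```

In this model, a 50 percent clock increase costs 50 percent more power on its own, and roughly double once the assumed voltage increase is included.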
Although Intel has cut back to 14 stages, many embedded
processors use pipelines
with 7 to 11 stages for further savings.
Most CPUs execute a simple set of instructions at the core
of their designs. To achieve this, many CPUs break up and
sometimes
merge
together software instructions to create the desired internal
instructions. The most extreme examples are x86 processors,
which have extensive
logic to convert CISC instructions to RISC instructions and,
in some cases, fuse them back together. Clearly, if this conversion
were done ahead of time by software, the CPU would use less
power. PCs cannot take this approach without breaking software
compatibility,
but embedded systems often use CPUs with simpler, more efficient
instruction sets, such as MIPS and ARM.

Simpler Instruction Issue

Superscalar CPUs can issue two or more instructions per
cycle, increasing the execution rate over single-issue (scalar)
CPUs. This method requires extra circuitry to analyze each
group of instructions and determine if they can be issued
together. For example, most CPUs cannot issue two instructions of the
same type (such as two branches) together, nor an instruction
that depends on the result of another in the same group. Because each instruction
in a group must be checked against all others in the group,
this circuitry rapidly becomes more complex as CPUs attempt
to group three or more instructions.
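
The grouping check described above can be sketched in a few lines. This is a simplified model rather than any real CPU's issue logic: the `Instr` type, the one-unit-per-instruction-type rule, and the register names are all assumptions made for illustration.

```python
from itertools import combinations
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    unit: str                 # instruction type, e.g. "alu", "load", "branch"
    dest: Optional[str]       # destination register, or None
    srcs: Tuple[str, ...]     # source registers


def can_issue_together(group):
    """Check every pair in program order for the two conflicts in the text."""
    for older, younger in combinations(group, 2):
        if older.unit == younger.unit:
            return False      # two instructions of the same type (assumed rule)
        if older.dest is not None and older.dest in younger.srcs:
            return False      # younger needs a result the older one produces
    return True


def pair_checks(width):
    """Number of pairwise comparisons for a group of the given width."""
    return width * (width - 1) // 2


add = Instr("alu", "r1", ("r2", "r3"))
load = Instr("load", "r4", ("r5",))
branch = Instr("branch", None, ("r1",))  # depends on the add's result

print(can_issue_together([add, load]))          # independent pair
print(can_issue_together([add, load, branch]))  # branch reads r1 from add
print([pair_checks(n) for n in (2, 3, 4)])      # comparisons per group width
```

The pair count grows as n(n-1)/2, so a four-wide group needs six comparisons where a two-wide group needs one.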
Out-of-order issue adds another level of complexity by issuing
instructions in a different order than they appear in the
original program. This method improves instruction grouping
but requires extensive logic that can reconcile the actual
execution order with the original program order. This complex
logic burns power but does not perform useful work.
Another problem with these techniques is that most high-performance
CPUs are limited by memory speed, not CPU speed. These
CPUs must often stall to wait for memory. CPUs that execute
several
instructions per cycle may reach the memory access sooner,
but that gain is lost while waiting on the slow memory.
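
A rough model shows why wider issue helps less when memory dominates. It assumes the CPU stalls completely on every cache miss and that stalls do not overlap with execution. All of the workload numbers are hypothetical.

```python
def run_time_cycles(instructions, ipc, memory_accesses, miss_rate, miss_penalty):
    """Total cycles = issue cycles + memory stall cycles (no overlap assumed)."""
    compute = instructions / ipc
    stalls = memory_accesses * miss_rate * miss_penalty
    return compute + stalls


# Hypothetical workload: values are assumptions for illustration only.
workload = dict(
    instructions=1_000_000,
    memory_accesses=300_000,
    miss_rate=0.02,      # fraction of accesses that go to main memory
    miss_penalty=200,    # stall cycles per miss
)

single_issue = run_time_cycles(ipc=1.0, **workload)
dual_issue = run_time_cycles(ipc=2.0, **workload)

speedup = single_issue / dual_issue
print(f"Speedup from doubling issue rate: {speedup:.2f}x")
```

With these assumed numbers, doubling the issue rate yields a speedup of only about 1.29x, because the stall cycles are unchanged.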

Back to the Stone Age?

This is not to say that the ideal CPU is a classic single-issue
RISC with a five-stage pipeline. Judicious use of extra
pipeline stages and superscalar issue (most likely
two instructions
per cycle) can improve performance to a greater degree
than the increase in power dissipation. But the need
for speed
that drove CPU design for so long has pushed microarchitecture
in the wrong direction. When optimizing for performance
per watt, the pendulum must swing back toward simpler
designs.
The drawback is that simpler designs reduce CPU performance,
at least for applications with a single execution
thread. Intel
and AMD do not like this approach, because most PC
software is still single-threaded. Embedded processor
vendors,
however, have put 16 or more simplified CPUs on a
single chip. These
are not puny CPUs; Cavium, for example, has announced
a 16-CPU chip with 1GHz dual-issue MIPS CPUs. This
chip, by the way,
dissipates less power than an Intel processor with
only two CPUs.
Embedded designers are quickly creating multithreaded
software to take advantage of many CPUs per chip.
The same change
will eventually happen in the PC world, although
it will take longer.

Originally published in Nikkei Electronics Asia, November 2006
© 2002-2006 The Linley Group