Undoing CPU Design

By Linley Gwennap

Power efficiency has become the driving factor behind the design of new CPUs. Even Intel has admitted that it can’t continue to push up the clock speed and power dissipation of its processors; instead, it is using two or more CPUs per chip to increase performance. While this multicore approach is a step in the right direction, over time the CPUs themselves must evolve. In this case, however, the change is more like de-evolution.

During the 1990s, CPU designers created ever-more complex microarchitectures intended to maximize both clock speed and performance per cycle. Achieving these goals required techniques such as superscalar, out-of-order instruction issue, superpipelining, instruction translation, and speculative execution. With a new focus on performance per watt instead of per cycle, we must now reduce or discard these power-wasting techniques. Instead, CPUs must use power mainly to execute instructions.

Shortening the Pipeline

Superpipelining was best exemplified by Intel’s “Prescott” Pentium 4, which used a staggering 31-stage pipeline to achieve clock speeds in excess of 3GHz. Intel recently scrapped this design, however, returning to the relatively moderate P6 pipeline, which has 14 stages.

The main problem with the long pipeline is that the high clock frequency (f) directly increases power dissipation (P = CV²f). Short pipelines have other advantages as well. They require less buffering and bypass logic. And they have shorter branch penalties, reducing the need for power-wasting branch-prediction logic. These changes reduce die area and design time as well as power dissipation. Although Intel has cut back to 14 stages, many embedded processors use pipelines with 7 to 11 stages for further savings.
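
To make the formula concrete, here is a minimal sketch of the dynamic-power relation P = CV²f. The capacitance, voltage, and frequency values are hypothetical, chosen only for illustration; they do not describe any real processor. The third case assumes that a slower clock also permits a lower supply voltage, which the article does not state.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """Dynamic switching power in watts: P = C * V^2 * f."""
    return capacitance_f * voltage_v ** 2 * frequency_hz

# Hypothetical effective switched capacitance (farads), for illustration only.
C_EFF = 1e-9

fast = dynamic_power(C_EFF, 1.2, 3.0e9)        # 3 GHz at 1.2 V
slow = dynamic_power(C_EFF, 1.2, 2.0e9)        # 2 GHz at the same voltage
slow_low_v = dynamic_power(C_EFF, 1.0, 2.0e9)  # 2 GHz at an assumed 1.0 V

print(f"{slow / fast:.2f}")        # 0.67: power scales linearly with f
print(f"{slow_low_v / fast:.2f}")  # 0.46: voltage enters as a square
```

Because voltage is squared, any voltage reduction that comes with a slower clock compounds the linear saving from frequency alone.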

Most CPUs execute a simple set of instructions at the core of their designs. To achieve this, many CPUs break up and sometimes merge together software instructions to create the desired internal instructions. The most extreme examples are x86 processors, which have extensive logic to convert CISC instructions to RISC instructions and, in some cases, fuse them back together. Clearly, if this conversion were done ahead of time by software, the CPU would use less power. PCs cannot take this approach without breaking software compatibility, but embedded systems often use CPUs with simpler, more efficient instruction sets, such as MIPS and ARM.

Simpler Instruction Issue

Superscalar CPUs can issue two or more instructions per cycle, increasing the execution rate over single-issue (scalar) CPUs. This method requires extra circuitry to analyze each group of instructions and determine whether they can be issued together. For example, most CPUs cannot issue two instructions of the same type (such as two branches) in one cycle, nor can they pair an instruction with one that depends on its result. Because each instruction in a group must be checked against all others in the group, this circuitry rapidly becomes more complex as CPUs attempt to group three or more instructions.
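
The pairwise checking described above can be sketched in software. This is a simplified model, not how any particular CPU implements its issue logic: the Instr record, the rule that two instructions of the same unit type always conflict, and the function names are all assumptions made for illustration.

```python
from itertools import combinations
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    unit: str               # hypothetical unit type, e.g. "alu", "load", "branch"
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers

def can_issue_together(group):
    """Check every pair in program order for the two conflicts in the text."""
    for earlier, later in combinations(group, 2):
        if earlier.unit == later.unit:   # assumed: one unit of each type
            return False
        if earlier.dest in later.srcs:   # later instruction needs earlier result
            return False
    return True

def pair_checks(width):
    """Number of pairwise comparisons for an issue group of this width."""
    return width * (width - 1) // 2

add = Instr("alu", "r1", ("r2", "r3"))
load = Instr("load", "r4", ("r5",))
dependent_load = Instr("load", "r4", ("r1",))  # reads r1, written by add
```

In this model, can_issue_together([add, load]) succeeds while can_issue_together([add, dependent_load]) fails. The comparison count grows quadratically: pair_checks gives 1 for a two-wide group, 3 for three-wide, and 6 for four-wide.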

Out-of-order issue adds another level of complexity by issuing instructions in a different order than they appear in the original program. This method improves instruction grouping but requires extensive logic to reconcile the actual execution order with the original program order. This complex logic burns power but does not perform useful work.

Another problem with these techniques is that most high-performance CPUs are limited by memory speed, not CPU speed. These CPUs must often stall to wait for memory. CPUs that execute several instructions per cycle may reach the memory access sooner, but that advantage is masked by the slow memory.
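
A rough model shows why wider issue yields diminishing returns when memory stalls dominate. The instruction count, miss count, and miss penalty below are hypothetical, and the model assumes that stalls never overlap with execution and that the CPU always fills every issue slot; both are deliberate simplifications.

```python
def total_cycles(instructions, issue_width, misses, miss_penalty):
    """Issue cycles plus memory stall cycles, with no overlap between them."""
    issue_cycles = instructions / issue_width
    stall_cycles = misses * miss_penalty
    return issue_cycles + stall_cycles

# Hypothetical workload: 1,000 instructions, 20 misses of 100 cycles each.
scalar = total_cycles(1000, 1, 20, 100)  # 3000.0 cycles
dual = total_cycles(1000, 2, 20, 100)    # 2500.0 cycles
quad = total_cycles(1000, 4, 20, 100)    # 2250.0 cycles

print(f"{scalar / dual:.2f}")  # 1.20
print(f"{scalar / quad:.2f}")  # 1.33
```

Under these assumptions, quadrupling the issue width speeds the workload up by only a third, because the 2,000 stall cycles are unchanged.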

Back to the Stone Age?

This is not to say that the ideal CPU is a classic single-issue RISC with a five-stage pipeline. Judicious use of extra pipeline stages and superscalar issue (most likely two instructions per cycle) can improve performance to a greater degree than the increase in power dissipation. But the need for speed that drove CPU design for so long has pushed microarchitecture in the wrong direction. When optimizing for performance per watt, the pendulum must swing back toward simpler designs.
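
The test implied here is simple: a feature is worth adding only if it raises performance by a larger factor than it raises power. The sketch below expresses that test; the function name and the gain figures are hypothetical and are not measurements of any real design.

```python
def improves_perf_per_watt(perf_gain, power_gain):
    """True if performance per watt rises, i.e. perf_gain / power_gain > 1."""
    return perf_gain > power_gain

# Hypothetical: a second issue slot gives 1.3x performance for 1.15x power.
modest_dual_issue = improves_perf_per_watt(1.3, 1.15)   # True

# Hypothetical: aggressive out-of-order gives 1.5x performance for 2x power.
aggressive_design = improves_perf_per_watt(1.5, 2.0)    # False
```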

The drawback is that simpler designs reduce CPU performance, at least for applications with a single execution thread. Intel and AMD do not like this approach, because most PC software is still single-threaded. Embedded processor vendors, however, have put 16 or more simplified CPUs on a single chip. These are not puny CPUs; Cavium, for example, has announced a 16-CPU chip with 1GHz dual-issue MIPS CPUs. This chip, by the way, dissipates less power than an Intel processor with only two CPUs.

Embedded designers are quickly creating multithreaded software to take advantage of many CPUs per chip. The same change will eventually happen in the PC world, although it will take longer.


Originally published in
Nikkei Electronics Asia, November 2006




© 2002-2006 The Linley Group