Intel spills beans on Core 2 successor: SSE4, faster virtualization, bigger caches

From Ars Technica

By Jon Stokes | Published: March 28, 2007 – 02:23PM CT

At a press conference today, Intel’s Pat Gelsinger revealed fresh details of the company’s forthcoming 45nm processor family, codenamed Penryn. Penryn is the 45nm successor to the Merom/Conroe/Woodcrest microarchitecture that underlies the popular 65nm Core 2 Duo processor line.

Gelsinger opened the briefing with a discussion of the success of the company’s “tick-tock” model of processor innovation, a model in which process shrinks and major architectural revisions are rolled out on a staggered two-year time scale. “Today’s disclosures clearly lay out that this engine is delivering and delivering on track,” said Gelsinger, responding to what he characterized as initial skepticism that the model could work.

Gelsinger then moved on to discuss Penryn, which is the first product that will be produced on Intel’s new high-k dielectric 45nm process. Penryn is more than just a shrink—it’s a derivative of Core 2 Duo (codenamed Merom) with a number of improvements. Gelsinger laid out those improvements in more detail than we’ve seen so far, so I’ll outline them below.
Penryn’s improvements: SSE4 support, better virtualization

Penryn’s back end boasts two major advances over its predecessor. First is a new radix-16 divider that offers a 2x performance improvement on division operations vs. Core 2 Duo. The fast divider also speeds up a range of operations that depend on the divider hardware, like the square root function. Penryn’s SQRT operation is 4x the speed of Core 2.
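
To get a feel for why a higher radix helps, here is a toy software model of a digit-recurrence divider. It is only a sketch, not Intel’s circuit: the function name, the 32-bit width, and the radix-4 baseline used for comparison are my own assumptions. The point is simply that a radix-16 unit retires four quotient bits per step where a radix-4 unit retires two, so it needs half as many steps.

```python
def digit_recurrence_divide(dividend, divisor, bits_per_step, width=32):
    """Toy restoring divider (hypothetical model, not Intel's design).

    Retires `bits_per_step` quotient bits per iteration, i.e. it works in
    radix 2**bits_per_step. Returns (quotient, remainder, iterations).
    """
    if divisor <= 0:
        raise ValueError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the assumed width")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")

    radix = 1 << bits_per_step
    quotient = 0
    remainder = 0
    iterations = 0

    # Walk the dividend from its most significant digit to its least.
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        next_digit = (dividend >> shift) & (radix - 1)
        remainder = remainder * radix + next_digit

        # Pick the quotient digit by repeated subtraction. Real hardware
        # would use comparators or a lookup table here instead of a loop.
        q_digit = 0
        while remainder >= divisor:
            remainder -= divisor
            q_digit += 1

        quotient = quotient * radix + q_digit
        iterations += 1

    return quotient, remainder, iterations


# Same answer either way; radix-16 takes half the iterations of radix-4.
print(digit_recurrence_divide(1_000_000, 7, bits_per_step=2))  # (142857, 1, 16)
print(digit_recurrence_divide(1_000_000, 7, bits_per_step=4))  # (142857, 1, 8)
```

Halving the iteration count lines up with the 2x figure Intel quotes for division, though the model ignores everything else that determines real latency.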

The other major back-end improvement is support for the SSE4 extensions, a group of 50 new vector instructions aimed at speeding up media and other data-parallel applications. SSE4 will be paired with a new “Super Shuffle Engine,” a full-width, single-pass, 128-bit shuffle unit. This will enable Penryn’s vector hardware to perform 128-bit shuffle operations (e.g., pack, unpack, packed shift) in a single clock cycle. The beefed-up shuffle capabilities will help Penryn align incoming vector data in the SSE registers so that the execution hardware can go to work on it.
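
As an illustration of the kind of data rearrangement a shuffle unit performs, the sketch below models a 128-bit “unpack low bytes” operation in plain Python. It is a behavioral model under my own naming (unpack_low_bytes is hypothetical, not an Intel API), and it says nothing about how Penryn’s hardware is actually built. Interleaving pixel bytes with zeros is a common way to widen 8-bit data to 16 bits before doing arithmetic on it.

```python
def unpack_low_bytes(a, b):
    """Interleave the low 8 bytes of two 16-byte vectors: a0, b0, a1, b1, ...

    Models the behavior of a byte-wise "unpack low" shuffle on 128-bit
    registers; each vector is a list of 16 integers in the range 0..255.
    """
    if len(a) != 16 or len(b) != 16:
        raise ValueError("both inputs must hold exactly 16 bytes")
    result = []
    for i in range(8):
        result.append(a[i])
        result.append(b[i])
    return result


pixels = list(range(200, 216))   # sixteen 8-bit pixel values (made up)
zeros = [0] * 16

# Unpacking against a zero vector zero-extends each pixel to 16 bits
# (little-endian: low byte first, high byte second).
widened = unpack_low_bytes(pixels, zeros)
words = [lo | (hi << 8) for lo, hi in zip(widened[0::2], widened[1::2])]
print(words)  # [200, 201, 202, 203, 204, 205, 206, 207]
```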

Intel claims that SSE4, in combination with other new features that I’ll describe shortly, will offer Penryn a performance improvement of as much as 40 percent over Core 2 Duo on some software like video codecs, and as much as 20 percent on games.

A big part of this performance boost will no doubt be due to the higher frontside bus speeds that Penryn will support. Penryn-based Xeon systems will sport frontside bus speeds of up to 1600MHz. Intel estimates that the increased FSB speed could yield up to a 45 percent speedup on bandwidth- and floating-point-intensive applications on the fastest Penryn-based quad-core systems.

To go with the faster FSB, Intel has also upped the cache on the Penryn processors. Dual-core parts will have 6MB of shared L2, while quad-core products will have 12MB. These caches will also be paired with an enhanced version of Intel’s Smart Cache technology. The new Smart Cache will let Penryn speculatively execute across cache lines, eliminating the typical stall associated with non-aligned loads.
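
For readers unfamiliar with the problem, a non-aligned load becomes expensive when it straddles two cache lines. The short sketch below shows how to tell whether a given access does so. The 64-byte line size and the helper name are assumptions made for the example, not figures from Intel’s briefing.

```python
CACHE_LINE_BYTES = 64  # assumed line size for this sketch


def splits_cache_line(address, size, line_bytes=CACHE_LINE_BYTES):
    """Return True if an access of `size` bytes at `address` touches two lines."""
    first_line = address // line_bytes
    last_line = (address + size - 1) // line_bytes
    return first_line != last_line


# A 16-byte (128-bit) load at offset 48 covers bytes 48..63: one line.
print(splits_cache_line(48, 16))  # False
# The same load at offset 56 covers bytes 56..71: it straddles two lines.
print(splits_cache_line(56, 16))  # True
```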

Intel will take advantage of the 45nm process not only to increase the amount of cache, but also to raise clockspeeds without significantly boosting power dissipation. Penryn parts will eventually reach the 3GHz mark, and may go even higher. The TDP numbers for Penryn desktop quad-core parts will be 95 and 130 watts, with desktop dual-core parts coming in at a 65W TDP. The TDP numbers for the 45nm quad-core Xeon will be 50W/80W/120W, depending on clockspeed. For the dual-core Xeon, the numbers are 40W/65W/80W.

For Penryn-based mobile parts, Intel will introduce a new low-power state that they’re calling Deep Power Down. In the new state, the core clock is turned completely off, along with the L1 and L2 caches. Process state is saved in a special part of the processor so that the system can be restored on wakeup.

The other big power-related news about Penryn is that it will include an enhanced version of Intel’s Dynamic Acceleration Technology. The new version will let Penryn detect when one core is largely idle (and thus not drawing much power) so that it can boost the clockspeed of the other, more active core while remaining in the same power envelope. For single-threaded applications where only one core is used, this will enable Penryn to speed up that one thread by devoting more power to the core on which it’s running.
Penryn boosts virtualization

One of the major features that Penryn brings to the table is a pretty important improvement to the performance of its Virtualization Technology (VT). Specifically, the performance of its virtual machine (VM) exit and VM entry instructions has been boosted so that VM transition times decrease by an average of 25 to 75 percent.
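
To put that range in perspective, here is a back-of-the-envelope calculation of how much CPU time goes to VM transitions before and after such a reduction. Every input below (exit rate, cycles per round trip, clockspeed) is a made-up placeholder, not a measurement or an Intel figure; only the 25 and 75 percent reductions come from the briefing.

```python
def transition_overhead(exits_per_second, cycles_per_round_trip, clock_hz):
    """Fraction of one core's cycles spent on VM exit/entry round trips."""
    return exits_per_second * cycles_per_round_trip / clock_hz


def with_reduction(overhead, reduction):
    """Overhead after transition time shrinks by `reduction` (0.25 = 25%)."""
    return overhead * (1.0 - reduction)


# Hypothetical workload: 50,000 exits/s, 2,000 cycles each, on a 3GHz core.
baseline = transition_overhead(50_000, 2_000, 3e9)

print(f"baseline overhead: {baseline:.2%}")
print(f"with 25% faster transitions: {with_reduction(baseline, 0.25):.2%}")
print(f"with 75% faster transitions: {with_reduction(baseline, 0.75):.2%}")
```

The heavier a guest’s exit rate, the more the improvement matters; a workload that rarely exits will barely notice it.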

Right now, a lot of folks who’re testing out VT have been disappointed that its performance isn’t much better than existing, non-VT-based virtualization solutions like VMware. Specifically, VMware products use a binary translation engine that ingests regular x86 OS code and produces a “safe” subset; VMware claims that this binary translation approach is as fast as, or faster than, VT-based approaches because the OS doesn’t have to do costly VM transitions in order to execute privileged instructions. (These claims are debated; I’m merely reporting the fact that they are made.)

A major decrease in VM transition times will help the performance of VT-based solutions like Xen, and it will make the “which virtualization package to use?” debate even more about management and less about relative performance than it already is.

All told, Intel will introduce six Penryn products this year, spanning the full range of segments from ultra-mobile to server. A full fifteen Penryn products are currently in development.