Saturday, January 24, 2009

What Is the Nehalem Microarchitecture? How Does the Nehalem Microarchitecture Work?

Take the number two and double it and you've got four. Double it again and you've got eight. Continue this trend of doubling the previous product and within 10 rounds you're up to 1,024. By 20 rounds you've hit 1,048,576. This is called exponential growth. It's the principle behind one of the most important concepts in the evolution of electronics.
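The doubling described above takes only a few lines to reproduce (Python here, purely for illustration):

```python
# Double the number 2 repeatedly and watch the exponential growth.
value = 2
for round_number in range(1, 20):
    value *= 2
    if round_number in (9, 19):
        print(round_number, "doublings:", value)  # 1,024 then 1,048,576
```

After 19 doublings the running value reaches 2 to the 20th power, 1,048,576 -- the same figure quoted above.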

In 1965, Intel co-founder Gordon Moore made an observation that has since dictated the direction of the semiconductor industry. Moore noted that the density of transistors on a chip doubled every year. That meant that every 12 months, chip manufacturers were finding ways to shrink transistor sizes so that twice as many could fit on a chip substrate.

Moore pointed out that the density of transistors on a chip and the cost of manufacturing chips were tied together. But the media -- and just about everybody else -- latched on to the idea that the microchip industry was developing at an exponential rate. Moore's observations and predictions morphed into a concept we call Moore's Law.

Over the years, people have tweaked Moore's Law to fit the parameters of chip development. At one point, the length of time between doubling the number of transistors on a chip increased to 18 months. Today, it's more like two years. That's still an impressive achievement considering that today's top microprocessors contain more than a billion transistors on a single chip.

Another way to look at Moore's Law is to say that the processing power of a microchip doubles in capacity every two years. That's almost the same as saying the number of transistors doubles -- microprocessors draw processing power from transistors. But another way to boost processor power is to find new ways to design chips so that they're more efficient.

This brings us back to Intel. Intel's philosophy is to follow a tick-tock strategy. The tick refers to creating new methods of building smaller transistors. The tock refers to maximizing the microprocessor's power and speed. The most recent Intel tick chip to hit the market (at the time of this writing) is the Penryn chip, which has transistors on the 45-nanometer scale. A nanometer is one-billionth of a meter -- to put that in perspective, an average human hair is about 100,000 nanometers in diameter.

So what's the tock? That would be the new Core i7 microprocessor from Intel. It has transistors the same size as the Penryn's, but uses Intel's new Nehalem microarchitecture to increase power and speed. By following this tick-tock philosophy, Intel hopes to stay on target to meet the expectations of Moore's Law for several more years.

How does the Nehalem microprocessor use the same-sized transistors as the Penryn and yet get better results? Let's take a closer look at the microprocessor.

Nehalem Architecture

Gordon Moore
Justin Sullivan/Getty Images
Intel co-founder Gordon Moore, of Moore's Law fame.
You can look at the Nehalem microprocessor as a chip with two main sections: a core and the surrounding components, collectively called the un-core. The core of the microprocessor contains the following elements:

  • The processors, which do the actual number crunching. This can include anything from simple mathematical operations like adding and subtracting to much more complex functions.
  • A section devoted to out-of-order scheduling and retirement logic. This lets the microprocessor make calculations in a more efficient manner by tackling instructions in whichever order is fastest.
  • Cache memory takes up about one-third of the microprocessor's core. The cache allows the microprocessor to store information temporarily on the chip itself, decreasing the need to pull information from other parts of the computer. There are two sections of cache memory in the core called L1 and L2.
  • A branch prediction section on the core allows the microprocessor to anticipate functions based on previous input. By predicting functions, the microprocessor can work more efficiently. If it turns out the predictions are wrong, the chip can cease calculations and shift to perform the correct functions.
  • The rest of the core orders functions, decodes information and organizes data.
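To see why out-of-order scheduling helps, here is a toy one-instruction-per-cycle simulator in Python. It is a sketch of the general idea only, not a model of Nehalem's actual scheduling hardware:

```python
def finish_time(instructions, allow_reorder):
    """Cycles to finish all instructions on a toy 1-wide machine.
    Each instruction is (inputs, output, latency). In-order mode
    stalls behind the oldest instruction; out-of-order mode issues
    any instruction whose inputs are ready."""
    ready_at = {}               # register -> cycle its value is ready
    pending = list(instructions)
    cycle = 0
    last_done = 0
    while pending:
        for i, (ins, out, lat) in enumerate(pending):
            if all(ready_at.get(r, 0) <= cycle for r in ins):
                pending.pop(i)
                ready_at[out] = cycle + lat
                last_done = max(last_done, cycle + lat)
                break
            if not allow_reorder:
                break           # in-order: stall behind the slow load
        cycle += 1
    return last_done

program = [
    ([], "a", 3),      # slow load into register a
    (["a"], "c", 1),   # add that depends on a
    ([], "x", 1),      # independent work
]
print(finish_time(program, allow_reorder=False))  # 5 cycles
print(finish_time(program, allow_reorder=True))   # 4 cycles
```

The out-of-order machine slips the independent instruction into a cycle that the in-order machine wastes stalling on the load.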

The un-core section has an additional 8 megabytes of memory contained in the L3 cache. The L3 cache isn't in the core because the Nehalem design is scalable and modular. That means Intel can build chips that have multiple cores. The cores all share the same L3 memory cache, so multiple cores can work from the same information at the same time.

Why create scalable microprocessors? It's an elegant solution to a tricky problem -- building more processing power without having to reinvent the processor itself. In a way, it's like connecting several batteries in series. Intel plans on building Nehalem microprocessors in dual, quad and eight-core configurations. Dual-core processors are good for small devices like smartphones. You're more likely to find a quad-core processor in a desktop or laptop computer. Intel designed the eight-core processors for machines like servers -- computers that handle heavy workloads.

Intel says that it will offer Nehalem microprocessors that incorporate a graphics processing unit (GPU) in the un-core. The GPU will function much the same way as a dedicated graphics card.

Nehalem and QuickPath

Core i7 chip
© Intel
Intel built the Core i7 chip series using the Nehalem microarchitecture.

According to Intel, one of the most important developments in microarchitecture is an interconnect system the company calls QuickPath. QuickPath encompasses the connections between the processors, memory and other components.

In older Intel microprocessors, commands come in through an input/output (I/O) controller to a centralized memory controller. The memory controller contacts a processor, which may request data. The memory controller retrieves this data from memory storage and sends it to the processor. The processor makes computations based upon that data and sends the results back through the memory controller to the I/O controller. As microprocessors become more complex with multiple processors on a single chip, this model becomes less efficient.

Using the old microarchitecture, Intel's chips had a memory bandwidth of up to 21 gigabytes per second. But microprocessor power was beginning to outpace the speed of data transmissions. QuickPath connectivity improves the memory bandwidth, allowing more information to pass each second.

QuickPath decentralizes and compartmentalizes communication between processors and memory. Instead of a centralized memory controller, each processor has its own memory controller, dedicated memory and cache memory. The processors communicate directly with the I/O controller. Commands come from the I/O controllers to the processors. Because each processor has a dedicated memory controller, memory and cache, there's no centralized bottleneck. Each processor can communicate with its dedicated memory at a speed of 32 gigabytes per second.

Processors also have point-to-point interconnections between each other. That means if one processor needs to access data within another processor's cache, it can send a request directly to the respective processor and get a response. Within each interconnection are distinct data pathways. Data can flow in both directions at the same time, speeding up data transfers. Transfer speeds between the multiple processors and the I/O controller can be up to 25.6 gigabytes per second.
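The bandwidth figures above translate into concrete transfer times. A quick back-of-envelope calculation (Python, illustrative only):

```python
# Back-of-envelope: time to move 1 gigabyte at the bandwidths quoted above.
bandwidths_gbps = {
    "older shared bus": 21.0,              # pre-QuickPath figure
    "QuickPath link": 25.6,                # processor <-> I/O controller
    "local memory (per processor)": 32.0,  # processor <-> its own memory
}
for name, bandwidth in bandwidths_gbps.items():
    print(f"{name}: {1.0 / bandwidth * 1000:.1f} ms per GB")
```

Moving a gigabyte drops from roughly 48 milliseconds on the old bus to about 31 milliseconds from local memory -- and each processor gets that local bandwidth to itself.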

IMC
Intel calls the memory controller, memory and cache configuration the Integrated Memory Controller, or IMC.

QuickPath also allows processors to take shortcuts when requesting information from other processors. Imagine a quad-core microprocessor with processors A, B, C and D. There are links between each processor. In older architectures, if processor A needed information from D, it would send a request. D would then send a request to processors B and C to make sure D had the most recent instance of that data. B and C would send the results to D, which would then be able to send information back to A. Each round of messages is called a hop -- this example had four hops.

QuickPath skips one of these steps. Processor A would send its initial request -- called a "snoop" -- to B, C and D, with D designated as the respondent. Processors B and C would send data to D. D would then send the result to A. This method skips one round of messages, so there are only three hops. It seems like a small improvement, but over billions of calculations it makes a big difference.

In addition, if one of the other processors has the information A requests, it can send the data directly to A. That reduces the hops to two. QuickPath also packs information in more compact payloads.
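The hop counting above can be modeled in a few lines of code. This is a deliberately simplified sketch of the message rounds, not Intel's actual cache-coherence protocol:

```python
def old_protocol(requester, home, others):
    """Message rounds in the older scheme: four hops."""
    return [
        [(requester, home)],                        # A asks D
        [(home, p) for p in others],                # D snoops B and C
        [(p, home) for p in others],                # B and C reply to D
        [(home, requester)],                        # D answers A
    ]

def quickpath_protocol(requester, home, others):
    """QuickPath-style scheme: the snoop goes out in round one."""
    return [
        [(requester, p) for p in [home] + others],  # A snoops everyone at once
        [(p, home) for p in others],                # B and C report to D
        [(home, requester)],                        # D sends the result to A
    ]

print(len(old_protocol("A", "D", ["B", "C"])))        # 4 hops
print(len(quickpath_protocol("A", "D", ["B", "C"])))  # 3 hops
```

Each entry in the returned list is one round of messages -- one hop -- so the lengths reproduce the four-hop versus three-hop comparison from the text.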

Nehalem Branches and Loops

Core i7 chip without heatspreader
© Intel
The Core i7 chip with the heatspreader removed.

In a microprocessor, everything runs on clock cycles. A clock cycle is the basic unit of time in which a chip does work, and each instruction takes a certain number of cycles to execute. Clock speed is the number of cycles per second -- the faster the clock speed, the more instructions the microprocessor can handle per second.

One way microprocessors like the Core i7 try to increase efficiency is to predict future instructions based on old instructions. It's called branch prediction. When branch prediction works, the microprocessor completes instructions more efficiently. But if a prediction turns out to be inaccurate, the microprocessor has to compensate. This can mean wasted clock cycles, which translates into slower performance.

Nehalem has two branch target buffers (BTBs). These buffers load instructions for the processors in anticipation of what the processors will need next. Assuming the prediction is correct, the processor doesn't need to call up information from the computer's memory. Nehalem's two buffers let it hold more predicted instructions, reducing the lag if one set of predictions turns out to be incorrect.
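A classic way to implement branch prediction is a 2-bit saturating counter -- a textbook scheme, shown here only to illustrate the idea. Nehalem's real predictors are far more sophisticated:

```python
# 2-bit saturating-counter branch predictor (textbook illustration).
class TwoBitPredictor:
    def __init__(self):
        self.state = 0   # 0-1 predict not-taken, 2-3 predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the ends so one surprise doesn't flip the prediction.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

predictor = TwoBitPredictor()
history = [True] * 6 + [False] + [True] * 3   # a mostly-taken branch
correct = 0
for outcome in history:
    if predictor.predict() == outcome:
        correct += 1
    predictor.update(outcome)
print(correct, "of", len(history), "predicted correctly")  # 7 of 10
```

Because the counter saturates, the single not-taken branch in the middle costs only one misprediction instead of derailing the predictor for the rest of the run.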

Another efficiency improvement involves software loops. A loop is a string of instructions that the software repeats as it executes. It may come in regular intervals or intermittently. With loops, branch prediction becomes unnecessary -- one instance of a particular loop should execute the same way as every other. Intel designed Nehalem chips to recognize loops and handle them differently than other instructions.

Microprocessors without loop stream detection tend to have a hardware pipeline that begins with branch predictors, then moves to hardware designed to retrieve -- or fetch -- instructions, decode the instructions and execute them. Loop stream detection can identify repeated instructions, bypassing some of this process.

Intel used loop stream detection in its Penryn microprocessors. Penryn's loop stream detection hardware sits between the fetch and decode components of older microprocessors. When the Penryn chip's detector discovers a loop, the microprocessor can shut down the branch prediction and fetch components. This makes the pipeline shorter. But Nehalem goes a step further. Nehalem's loop stream detector is at the end of the pipeline. When it sees a loop, the microprocessor can shut down everything except the loop stream detector, which sends out the appropriate instructions to a buffer.
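The core idea -- noticing that the tail of the instruction stream is repeating -- can be sketched in a few lines. This is a toy illustration, not how the hardware detector actually works:

```python
# Toy loop-stream detector: spot when the tail of the decoded-instruction
# stream repeats, so those instructions could be replayed from a small
# buffer instead of being fetched and decoded again.
def detect_loop(trace, max_body=4):
    """Return the length of the shortest repeating tail, or 0 if none."""
    for n in range(1, max_body + 1):
        if len(trace) >= 2 * n and trace[-n:] == trace[-2 * n:-n]:
            return n
    return 0

stream = ["load", "add", "cmp", "jump", "load", "add", "cmp", "jump"]
print(detect_loop(stream))   # 4: the four-instruction loop body repeats
```

Once a repeating body is found, a real chip can power down the fetch and decode stages and stream the buffered instructions directly, which is the energy and pipeline win described above.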

The improvements to branch prediction and loop stream detection are all part of Intel's "tock" strategy. The transistors in Nehalem chips are the same size as Penryn's, but Nehalem's design makes more efficient use of the hardware.

Nehalem and Multithreading

Back of the Core i7 Intel chip
© Intel
The back side of the Core i7 chip with Nehalem microarchitecture.

As software applications become more sophisticated, sending instructions to processors becomes complicated. One way to simplify the process is through threading. Threading starts on the software side of the equation. Programmers build applications with instructions that processors can split into multiple streams or threads. Processors can work on individual threads of instructions, teaming up to complete a task. In the world of microprocessors, we call this parallelism because multiple processors work on parallel threads of data at the same time.

Nehalem's architecture allows each processor to handle two threads simultaneously. That means an eight-core Nehalem microprocessor can process 16 threads at the same time. This gives the Nehalem microprocessor the ability to process complex instructions more efficiently. According to Intel, the multithreading capability is more efficient than adding more processing cores to a microprocessor. Nehalem microprocessors should be able to meet the demands of sophisticated software like video editing programs or high-end video games.
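On the software side, a programmer splits work across that many threads. The Python sketch below divides a task into 16 pieces to match an eight-core, two-threads-per-core chip; note that Python's global interpreter lock means this illustrates the programming model rather than true hardware parallelism:

```python
from concurrent.futures import ThreadPoolExecutor

# An 8-core Nehalem part running two hardware threads per core can
# execute 16 threads at once.
CORES, THREADS_PER_CORE = 8, 2
hardware_threads = CORES * THREADS_PER_CORE   # 16

def work(chunk):
    return sum(chunk)

numbers = list(range(1_000))
# Stripe the data into one chunk per hardware thread.
chunks = [numbers[i::hardware_threads] for i in range(hardware_threads)]
with ThreadPoolExecutor(max_workers=hardware_threads) as pool:
    total = sum(pool.map(work, chunks))
print(hardware_threads, "threads, total =", total)
```

Each worker sums its own stripe of the data, and the partial results are combined at the end -- the same divide-and-merge pattern a video encoder or game engine uses to keep all 16 hardware threads busy.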

Another benefit to multithreading is that the processor can handle multiple applications at the same time. This lets you work on complex programs while running other applications like virus scanners in the background. With older processors, these activities could cause a computer to slow down or even crash.

Rock Around the Overclock
Nehalem's turbo boost feature is similar to an old hacking trick called overclocking. To overclock a microprocessor is to increase its processing frequency beyond the normal parameters of the chip. Some gamers overclock the processors on their machines to get better performance when playing sophisticated video games. But overclocking isn't always a good idea -- it can cause chips to overheat.

Intel has incorporated an additional technology the company calls turbo boost within Nehalem's architecture. If the processor is running below its limits on power consumption, processing capacity and temperature levels, it can increase its clock frequency. This makes the active processors work faster. With older applications that have a single thread, the chip can increase clock speeds even more.

The turbo boost feature is dynamic -- it makes the Nehalem microprocessor work harder as the workload increases, provided the chip is within its operating parameters. As workload decreases, the microprocessor can work at its normal clock frequency. Because the microchip has a monitoring system, you don't have to worry about the chip overheating or working beyond its capacity. And when you aren't placing heavy demands on your processor, the chip conserves power.
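The logic described above -- step the clock up only while the chip stays inside its limits -- can be modeled with a toy governor. All the numbers below are invented for illustration; the real turbo boost logic lives in hardware:

```python
# Toy model of a turbo-style governor (all figures are made up).
BASE_GHZ, STEP_GHZ, MAX_GHZ = 2.66, 0.133, 3.2
POWER_LIMIT_W, TEMP_LIMIT_C = 130, 100

def target_clock(power_w, temp_c):
    """Raise the clock in small steps while the chip stays inside
    its (hypothetical) power and temperature limits."""
    clock = BASE_GHZ
    while (clock + STEP_GHZ <= MAX_GHZ
           and power_w < POWER_LIMIT_W
           and temp_c < TEMP_LIMIT_C):
        clock += STEP_GHZ
        power_w += 5   # each step draws a bit more power
        temp_c += 2    # ...and generates a bit more heat
    return round(clock, 3)

print(target_clock(power_w=95, temp_c=70))    # headroom: boosts well above base
print(target_clock(power_w=128, temp_c=99))   # near the limits: barely boosts
```

With lots of thermal and power headroom the governor climbs several steps above the base clock; when the chip is already near its limits, it stays put -- which is why a lightly threaded workload on a cool chip gets the biggest boost.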

Intel's Tick Tock

Nehalem press conference
© Intel
Intel Executive Vice President Sean Maloney demonstrates the power of the Nehalem microarchitecture using a touchscreen interface at a press conference.

Developing a microprocessor takes years. While Intel unveiled Nehalem in 2008, the project was more than five years old at the time. That means even as people wait for an announced microchip to make its way into various electronic devices and computers, manufacturers like Intel are working on the next step in microprocessor evolution. They have to, if they want to keep up with Moore's Law.

The next step for Intel is another "tick" development. That means shrinking transistors to 32 nanometers wide. Producing one microprocessor with transistors that size is an amazing achievement. But what is even more daunting is finding a way to mass produce millions of chips with transistors that small in an efficient, reliable and cost-effective way.

The codename for the next Intel chip is Westmere. Westmere will use the same microarchitecture as Nehalem but will have the 32-nanometer transistors. That means Westmere will be more powerful than Nehalem. But that doesn't mean Westmere's architecture will make the most sense for a microprocessor with transistors that small. That will fall to the next "tock" microprocessor.

And the tock already has a name: Sandy Bridge. The Sandy Bridge microchip will have an architecture optimized for 32-nanometer transistors. It may take a couple of years before we see Sandy Bridge rolled out into the commercial market, but when it does it will likely be just as revolutionary as Nehalem is today.

Where will Intel go after that? It's hard to say. While transistors have shrunk down to sizes practically unimaginable a decade ago, we're getting close to hitting some fundamental laws of physics that could put a halt to rapid development. That's because as you work with smaller materials, you begin to enter the realm of quantum mechanics. The world of quantum mechanics can seem strange to someone only familiar with classical physics. Particles and energy behave in ways that seem counterintuitive from a classical perspective.

One of those behaviors is particularly problematic when it comes to microprocessors: electron tunneling. Normally, transistors can funnel electrons without much risk of leakage. But as barriers get thinner, the possibility for electron tunneling becomes more likely. When an electron encounters a very thin barrier -- something on the order of a single nanometer in width -- it can pass from one side of the barrier to the other even if the electron's energy levels seem too low for that to happen normally. Scientists call the phenomenon tunneling even though the electron doesn't make a physical hole in the barrier.
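The key property of tunneling is its exponential sensitivity to barrier width. A rough WKB-style estimate says the probability falls off as exp(-2kd) with width d; the decay constant below is a made-up figure, chosen only to show the shape of the curve:

```python
import math

# Rough WKB-style tunneling estimate: probability ~ exp(-2*k*d) for a
# barrier of width d. The decay constant k is illustrative, not a real
# transistor parameter.
def tunneling_probability(width_nm, k_per_nm=10.0):
    return math.exp(-2 * k_per_nm * width_nm)

for width in (3.0, 2.0, 1.0):
    print(f"{width} nm barrier: T ~ {tunneling_probability(width):.1e}")
```

Each nanometer shaved off the barrier multiplies the leakage by an enormous factor -- which is exactly why shrinking transistor features toward the single-nanometer scale makes electron leakage such a hard problem.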

This is a big problem for microprocessors. Microprocessors work by channeling electrons through transistor switches. Microprocessors with transistors on the nanoscale already have to deal with some levels of electron leakage. Leakage makes microprocessors less efficient. Without a dramatic change to the way Intel designs transistors, there's a danger that Moore's Law will finally become moot.

Still, engineers tend to think of ways around problems that seem completely insurmountable. Even if transistors can't get any smaller after one or two more generations, it won't be the end of electronics. It just might mean we advance a little more slowly than we're accustomed to.
