Bulldozer 20 Questions, Part 2
Here is the second set of answers from our 20 Questions blog. I am leaving the last 2 blogs (final 10 answers) until after the Hot Chips event. We’ll be unveiling new information there and I am pretty sure that will drive a lot of questions.
“Will Bulldozer implement new versions of Hypertransport?” – Rheo
No, Bulldozer takes advantage of the same version of HyperTransport™ (HT) technology as our existing AMD Opteron™ 4000 and 6000 series processors, HyperTransport 3.1.
“Is there any “programmable-tangible” improvement in synchronization between cores in the same module? In other words, will I get tangible performance improvement if I can partition my multi-threaded algorithm to pairs of closely interacting threads, and schedule each pair to a module?” – Edward Yang
That is a very interesting question.
For the majority of software, the OS will work in concert with the processor to manage the thread-to-core relationships. We are collaborating with Microsoft and the open source software community to ensure that future versions of Windows and Linux operating systems will understand how to enumerate and effectively schedule the Bulldozer core pairs. The OS will understand if your machine is set up for maximum performance or for maximum performance/watt, which takes advantage of Core Performance Boost.
However, let’s say you want to explore if you can get a performance advantage if your threads were scheduled on different modules. The benefit you can gain really depends on how much sharing the two threads are going to do.
Since the two integer cores are completely separate and have their own execution clusters (pipelines), you get no sharing of data in the L1 – and there are no specific optimizations needed at the software level. However, at the L2 cache level there could be some benefits. A shared L2 cache means that both cores have access to read the same cache lines – but obviously only one can write any cache line at any time. This means that if you have a workload with a main focus of querying data and your two threads are sharing a data set that fits in our L2, then having them execute in the same module could have some advantages. The main advantage we expect to see is an increase in the power efficiency of the cores that are idle. The more idle other cores are, the better chance the busy cores will have to boost.
However, there is another consideration: how available the other cores are. You need to weigh the benefits of data sharing against the benefit of starting the thread on the next available core. Stacking up threads to execute in proximity means that a thread might be waiting in line while an open core is available for immediate execution. If your multi-threaded application isn’t optimized to target the L2 (or possibly the L3 cache), or you have distinctly separate applications to run, and you don’t need to conserve power, then you’ll likely get better performance by having them scheduled on separate modules. So it is important to weigh both options to determine the best execution.
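If you want to try both placements yourself on Linux, here is a minimal sketch of how you might set it up. This is not an AMD tool or a recommended method. It assumes that logical CPUs 2k and 2k+1 belong to the same module, which is a hypothetical numbering; check the topology your OS actually reports before relying on it. The helper names `module_cpus`, `placement` and `pin_to_cpu` are made up for illustration, while `os.sched_setaffinity` is a real, Linux-only Python call.

```python
import os

CORES_PER_MODULE = 2  # from the post: each module holds two integer cores


def module_cpus(module_index):
    """CPU ids of one module, ASSUMING ids 2k and 2k+1 share a module."""
    first = module_index * CORES_PER_MODULE
    return {first, first + 1}


def placement(num_threads, share_module):
    """Map each thread index to a CPU id.

    share_module=True  packs threads in pairs onto one module (shared L2).
    share_module=False gives every thread a module of its own.
    """
    if share_module:
        return {t: t for t in range(num_threads)}
    return {t: t * CORES_PER_MODULE for t in range(num_threads)}


def pin_to_cpu(cpu_id):
    """Restrict the caller (pid 0) to a single CPU. Linux only."""
    os.sched_setaffinity(0, {cpu_id})
```

Each worker thread would call `pin_to_cpu(placement(n, share)[i])` at startup. Timing the same workload with `share_module=True` and then `False` is one way to see which side of the trade-off your application lands on.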
“How much extra performance will we see when running two-threaded applications on one Bulldozer Module compared to two cores in different modules?” – Simon
Without getting too specific around actual scaling across cores on the processor, let me share with you what was in the Hot Chips presentation. Compared to CMP (chip multiprocessing – which is, in simplistic terms, building a multicore chip with each core having its own dedicated resources), two integer cores in a Bulldozer module would deliver roughly 80% of the throughput. But, because they have shared resources, they deliver that throughput at low power and low cost. Using CMP has some drawbacks, including more heat and more die space. The heat can limit performance in addition to consuming more power. Ask yourself, would you rather have a 4-cylinder engine that delivered 300HP or a 6-cylinder engine that delivered 360HP and consumed less gas? The horsepower-per-cylinder ratio for the 4-cylinder is obviously higher (75HP/cylinder vs. the V6’s 60HP/cylinder), meaning that each cylinder can give you more performance. However, looking at the overall engine, you are getting less total output; and you are getting that lower output at a higher cost (higher gas consumption).
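The arithmetic behind the engine analogy is easy to check. The sketch below uses only the numbers quoted above; the function and variable names are illustrative and nothing here is a measured Bulldozer result.

```python
def per_unit(total_output, units):
    """Output delivered by each unit (here, horsepower per cylinder)."""
    return total_output / units


four_cyl = per_unit(300, 4)         # 75.0 HP per cylinder
six_cyl = per_unit(360, 6)          # 60.0 HP per cylinder

per_cyl_ratio = six_cyl / four_cyl  # each V6 cylinder vs. each 4-cylinder one
total_ratio = 360 / 300             # whole V6 vs. whole 4-cylinder engine

print(f"per cylinder: {per_cyl_ratio:.0%}, whole engine: {total_ratio:.0%}")
# prints "per cylinder: 80%, whole engine: 120%"
```

In the analogy, each V6 cylinder delivers 80% of what a 4-cylinder engine's cylinder does, which lines up with the rough 80% figure quoted for the module, while the engine as a whole delivers 20% more.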
“Current and forthcoming Nehalem EX based servers from IBM and HP top out at 8 sockets and 64 cores. What kind of vertical scalability can we expect from Bulldozer-based servers?” – David Roff
Bulldozer will fit into the current “Maranello” and “San Marino/Adelaide” platforms. “Maranello” is our high performance platform that will support up to 4 CPUs. Combining a “Maranello” platform with the upcoming 16-core “Interlagos” processors, the total core density of a 4P system will reach as many as 64 cores.
The 8P x86 market today is pretty small. According to IDC, last year it accounted for roughly 7,915 total servers, down 26% from the year before (Source: IDC Quarterly Server Tracker, Q4 2009). If you want to say that 2009 was a bad year, from 2007 to 2008 the 8P x86 market was essentially flat as well, so that isn’t a growth engine. Part of what is impacting that market is the core and memory densities of today’s systems. People bought 8P servers to get to 48 cores (8 x 6-core) or to get to large memory footprints. Today’s 4P systems are meeting those needs at a lower price, with lower power consumption and lower latency. When we get to 2011 with “Bulldozer,” you’ll see an increase up to 64 cores, and we expect the total memory footprint will increase again.
The bottom line is, you’ll get the 64 cores that you want, you’ll just have to spend a lot less to get them; is that OK?
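As a quick check of the core counts in this answer, here is the socket math written out. It uses only the figures quoted above, and the names are illustrative.

```python
def total_cores(sockets, cores_per_socket):
    """Cores in a system with identical processors in every socket."""
    return sockets * cores_per_socket


older_8p = total_cores(8, 6)        # the 8 x 6-core systems mentioned above
interlagos_4p = total_cores(4, 16)  # 4P "Maranello" with 16-core "Interlagos"

print(older_8p, interlagos_4p)  # prints "48 64"
```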
“As far as power usage goes, from what I understand BD is supposed to be taking power management features to a level of granularity that hasn’t been seen yet with consumer/business grade CPUs. Will those new features be available to current MC users or will a platform upgrade be necessary? Can you elaborate on any new power saving features that would make a business want to consider BD at this time?” – Jeremy Stewart
Current “Maranello” platforms with AMD Opteron™ 6100 Series processors already have the hooks embedded in them for any “Bulldozer”-level power efficiency features. When we specified the platforms for today’s processors, we did so with “Bulldozer” in mind.
As we have said already in this blog, we expect the shared architecture to provide us with a great deal of power savings – there are a lot of circuits that are essentially being duplicated in today’s multicore processors. Having a new “from the ground up” design allowed us to take a very close look at the circuits and determine which ones are ripe for consolidation and which ones really need their own dedicated resources.
We started with an inherently power-efficient microarchitecture and implementation that included dynamic sharing of shared resources, minimized data movement and took advantage of extensive clock and power gating. From there, we added active management support that allows us to digitally measure activity in order to estimate power. Support for chip-level core power gating was also added to the processor.
These new features join existing AMD Opteron processor technologies such as AMD PowerNow!™, AMD CoolCore™, low voltage DDR-3 memory support and more, all working in concert to help create a power efficient system.
Even though you’ll see processors with 33% more cores and larger caches than the previous generation, we’ll still be fitting them into the same power and thermal ranges that you see with our existing 12-core processors.
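The 33% figure follows from going from 12 cores to the 16 cores of “Interlagos” mentioned earlier. A short sketch of the calculation, with an illustrative function name:

```python
def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100


core_growth = percent_increase(12, 16)
print(f"{core_growth:.0f}% more cores")  # prints "33% more cores"
```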
John Fruehe is the Director of Product Marketing for Server/Workstation products at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. This blog contains forward-looking statements. Forward-looking statements are generally preceded by words such as “plans,” “expects,” “believes,” “anticipates” or “intends.” AMD Investors are cautioned that all forward-looking statements in this blog involve risks and uncertainties that could cause actual results to differ materially from current expectations.
Could you please tell me, will a “module” count as 1 processor core or as 2?
A module will have 2 cores in it. It will be seen by the hardware and software as 2 cores. The module will essentially be invisible to the system.
Being a massive gamer and a big fan of the X3 720BE,
and seeing on the 1090T that the turbo feature can speed up 3 cores when others are not being used:
could there be an option for the new processors that does something similar, but on only one core, to massively boost its potential?
Almost all games these days are still single-threaded, and some of them bring 3.6GHz to its knees (X3: Terran Conflict).
Overclocker dreaming… hopefully not.
There will be updated Turbo CORE technology in Bulldozer, but we will not release details until launch.
Really incredible. I absolutely cannot believe that this really works.