Monday, 28 March 2011

16 And 17 Core Processors

With an unusual core count of 17, IBM’s new BlueGene/Q processor draws attention. AMD’s 16-core processor Interlagos might arrive a bit earlier than expected, and the Itanium shows new signs of life.

The “Heptakaideka-Core” processor BlueGene/Q, presented by IBM at the supercomputing conference SC10 in New Orleans, is meant to power the 20-Pflops computer Sequoia, which IBM is scheduled to deliver to the Lawrence Livermore National Laboratory in about two years. However, only 16 of its 17 cores are meant for computing; the extra core handles control and I/O tasks. Strictly speaking, the BlueGene/Q even has 18 cores, but one of them is a spare that can improve the yield or the reliability during operation. Unlike its BlueGene predecessors, the Q version was upgraded to 64-bit processing, and the SIMD unit was widened so that each core can now execute four double-precision fused multiply-add instructions, that is, eight floating-point operations, per clock. Accordingly, at a clock speed of 1.6 GHz, the processor reaches 205 Gflops (16 cores × 8 operations × 1.6 GHz), and resourceful software engineers could still improve on that by making the 17th core calculate, too. Additionally, the processor supports four-way SMT and thus presents the operating systems (RHEL6 on the I/O nodes, a special compute OS on the compute nodes) with 64 “logical” cores or threads.
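The 205-Gflops figure follows directly from the numbers in this paragraph. A minimal sketch of that arithmetic in Python; the function and constant names are illustrative only and do not come from any IBM material:

```python
# Peak-performance arithmetic for one BlueGene/Q chip,
# using only the figures quoted in the text above.
COMPUTE_CORES = 16   # 16 of the 17 active cores do the computing
FMA_PER_CLOCK = 4    # double-precision FMA instructions per core per clock
FLOPS_PER_FMA = 2    # one multiplication plus one addition
CLOCK_GHZ = 1.6
SMT_WAYS = 4


def peak_gflops(cores, flops_per_clock, clock_ghz):
    """Theoretical peak in Gflops: cores x flops per clock x clock in GHz."""
    return cores * flops_per_clock * clock_ghz


flops_per_clock = FMA_PER_CLOCK * FLOPS_PER_FMA                      # 8 per core
chip_peak = peak_gflops(COMPUTE_CORES, flops_per_clock, CLOCK_GHZ)
with_17th_core = peak_gflops(COMPUTE_CORES + 1, flops_per_clock, CLOCK_GHZ)
logical_threads = COMPUTE_CORES * SMT_WAYS

print(f"16 compute cores: {chip_peak:.1f} Gflops")
print(f"17 cores computing: {with_17th_core:.1f} Gflops")
print(f"logical threads: {logical_threads}")
```

Putting the control core to work as well would add one sixteenth to the peak, about 13 Gflops.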

Thanks to the 64-bit support, the modules can now carry 8 or 16 GB of DDR3 memory. Five links (2 GB/s per direction) connect each module to its neighbors, making it possible to build different 5D topologies. Half a rack with 8192 BlueGene/Q cores has already proven its capabilities in the Linpack benchmark: with 65.3 Tflops, the test system at the Thomas J. Watson Research Center took 115th place in the new Top500 list. At a power consumption of 38.8 kW, it set a new energy-efficiency record of close to 1700 Mflops/watt. Sequoia is to get 96 fully equipped racks, which are supposed to deliver 20 Pflops of theoretical peak performance at the end of 2012.
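The efficiency record and the Sequoia target can be cross-checked with a few lines of Python. This is a rough sketch under two assumptions that the text does not state explicitly: the 8192 cores count compute cores only (16 per chip), and each chip peaks at the 204.8 Gflops derived from 16 cores × 8 operations × 1.6 GHz.

```python
# Cross-check of the Linpack and Sequoia figures quoted in the text.
LINPACK_TFLOPS = 65.3        # measured on the half rack
POWER_KW = 38.8              # power consumption of the half rack
HALF_RACK_CORES = 8192
CORES_PER_CHIP = 16          # assumption: compute cores only
CHIP_PEAK_GFLOPS = 204.8     # assumption: 16 cores x 8 flops x 1.6 GHz
SEQUOIA_RACKS = 96

# Energy efficiency in Mflops per watt.
mflops_per_watt = (LINPACK_TFLOPS * 1e6) / (POWER_KW * 1e3)

# Theoretical peak of the half rack and the share Linpack reached.
chips_half_rack = HALF_RACK_CORES // CORES_PER_CHIP
half_rack_peak_tflops = chips_half_rack * CHIP_PEAK_GFLOPS / 1e3
linpack_efficiency = LINPACK_TFLOPS / half_rack_peak_tflops

# Theoretical peak of the full Sequoia installation.
chips_per_rack = 2 * chips_half_rack
sequoia_pflops = SEQUOIA_RACKS * chips_per_rack * CHIP_PEAK_GFLOPS / 1e6

print(f"efficiency: {mflops_per_watt:.0f} Mflops/watt")
print(f"half-rack peak: {half_rack_peak_tflops:.1f} Tflops")
print(f"Linpack share of peak: {linpack_efficiency:.0%}")
print(f"Sequoia peak: {sequoia_pflops:.2f} Pflops")
```

Under these assumptions the 96 racks come out at just over 20 Pflops, which matches the announced target.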

The True AVX Processor

By then, AMD’s 16-core Interlagos with the new Bulldozer architecture should already have been on the race track for quite a while. At SC10, AMD even raised the scientists’ hopes that the processor might be ready earlier than expected, which would mean before the third quarter of 2011. AMD more or less coherently countered the widespread doubt in the HPC scene concerning the “halved” FPU (a Bulldozer module contains two integer cores, but only one FPU) with the argument that the “Flex FP” is capable of executing two 128-bit instructions (SSE, AVX) simultaneously. In particular, this is true for the fused multiply-add instructions (FMA), which are much valued in HPC, which Intel’s Sandy Bridge does not support, and which will probably be missing from the feature list of its successor, Ivy Bridge, too. Bulldozer links both units only for the 256-bit AVX operations, which are still rarely used.

Consequently, the Interlagos with its eight modules or 16 cores manages 64 double-precision floating-point operations per clock, which makes 224 Gflops at 3.5 GHz. At this clock speed, Intel’s planned 8-core Sandy Bridge EP will achieve the same theoretical peak value. While it doesn’t support FMA, it is able to execute an AVX multiplication and an AVX addition in full 256-bit width in parallel. The Bulldozer’s clock rate specification of 3.5 GHz and the number of transistors per module (213 million) can be found in the abstracts of the presentations for the next International Solid-State Circuits Conference (ISSCC) in February 2011.

Apart from some further details on the Sandy Bridge and Westmere-EX, Intel above all intends to release first specifications for the next Itanium generation, Poulson, at that conference. The abstract already gives away some details: 32-nm technology, 8 cores with simultaneous multithreading (SMT), 12-issue superscalar (4 bundles with 3 instructions each per clock, twice as many as before), 3.1 billion transistors on 544 mm², a total of 50 MB of cache, 128 GB/s of bandwidth between the processors and 45 GB/s of memory bandwidth. Now there are speculations that Poulson might feature fine-grained SMT, maybe even with different priorities, like the Power7. A further similarity with the Power7 architecture could be a possible switch to out-of-order execution. This was brought to attention by David Kanter from www.realworldtech.com, in whose forum a certain Linus Torvalds speaks out against the “failed” Itanium architecture in some pithy lines.
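The matching peak values quoted above for Interlagos and Sandy Bridge EP come from two different ways of reaching eight double-precision operations per clock. A small Python sketch of that comparison, with illustrative names and the 3.5 GHz clock applied to both chips as in the text:

```python
# Theoretical peak of Interlagos versus Sandy Bridge EP at the same clock.
CLOCK_GHZ = 3.5
DOUBLES_PER_128_BIT = 2   # a 128-bit register holds two doubles
DOUBLES_PER_256_BIT = 4   # a 256-bit register holds four doubles

# Interlagos: 8 modules, each Flex FP executes two 128-bit FMAs per clock.
INTERLAGOS_MODULES = 8
flops_per_module = 2 * DOUBLES_PER_128_BIT * 2   # 2 units x 2 doubles x (mul + add)
interlagos_flops_per_clock = INTERLAGOS_MODULES * flops_per_module

# Sandy Bridge EP: 8 cores, no FMA, but one 256-bit multiplication
# and one 256-bit addition in parallel per clock.
SANDY_BRIDGE_CORES = 8
flops_per_core = DOUBLES_PER_256_BIT + DOUBLES_PER_256_BIT
sandy_bridge_flops_per_clock = SANDY_BRIDGE_CORES * flops_per_core

interlagos_gflops = interlagos_flops_per_clock * CLOCK_GHZ
sandy_bridge_gflops = sandy_bridge_flops_per_clock * CLOCK_GHZ

print(f"Interlagos: {interlagos_flops_per_clock} flops/clock, {interlagos_gflops:.0f} Gflops")
print(f"Sandy Bridge EP: {sandy_bridge_flops_per_clock} flops/clock, {sandy_bridge_gflops:.0f} Gflops")
```

Both designs only reach this peak when multiplications and additions occur in equal numbers, which is the typical case in Linpack-style code.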

The Itanium only knows modern vector units like SSE in its 32-bit emulation. Whether Poulson will have AVX or maybe even something better is still unknown, but its transistor count, after deducting the caches, would most likely be insufficient in any case.

However, the current Advanced Vector Extensions (AVX) are not the same as the ones Intel presented almost three years ago. In the meantime, Intel has eliminated some permutation instructions and added 256-bit streaming instructions. Most importantly, the FMA operations (like “VFMADDPD”), originally planned as four-operand instructions, have been reduced to three operands. Consequently, one source operand is overwritten by the result; which operand that is can be chosen through the instruction variant. “VFMADD213PD”, for example, multiplies the second operand by the first, adds the third and overwrites the first with the result.
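The three digits in the mnemonic name the operand order: the first two operands listed are multiplied, the third is added, and operand 1 always receives the result. A minimal scalar model in Python, assuming that reading of the 132, 213 and 231 variants; the function name `vfmadd` is made up for this illustration, and plain Python arithmetic rounds twice where a real fused operation rounds once:

```python
def vfmadd(order, op1, op2, op3):
    """Scalar model of a three-operand FMA variant.

    `order` is the digit string from the mnemonic, e.g. "213".
    The operands named by the first two digits are multiplied,
    the operand named by the third digit is added.
    The return value is what would overwrite operand 1.
    """
    operands = {"1": op1, "2": op2, "3": op3}
    x, y, z = (operands[digit] for digit in order)
    return x * y + z


op1, op2, op3 = 2.0, 3.0, 10.0

r213 = vfmadd("213", op1, op2, op3)  # op2 * op1 + op3
r132 = vfmadd("132", op1, op2, op3)  # op1 * op3 + op2
r231 = vfmadd("231", op1, op2, op3)  # op2 * op3 + op1

print(r213, r132, r231)  # 16.0 23.0 32.0
```

Having three variants lets the compiler pick which of the three inputs it can afford to lose.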

Ronak Singhal, who leads the development in Oregon of the Haswell processor (slated for 2012 in the 22-nm process) and, later on, of the Rockwell in 16-nm structures, probably encouraged this change to make the instruction set compatible with that of the future 512-bit-wide vector unit. That instruction set, which emerged in the past as the Larrabee New Instruction Set (LNI), has exactly the same syntax for FMA; it only supports three operands.

Meanwhile, in addition to the new features, AMD intends to offer the initially planned four-operand version (FMA4) for the Bulldozer, although with a slightly changed encoding. The permutation instructions that Intel eliminated will probably be supported by the Bulldozer as well. Seen this way, the Bulldozer, not the Sandy Bridge, is the processor that Intel originally had in mind for AVX.
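The practical difference between the two encodings is whether an input survives the operation. The following Python sketch models registers as a dictionary; the register names and both helper functions are hypothetical and only illustrate the destructive three-operand form against the non-destructive four-operand form:

```python
def fma3_213(regs, op1, op2, op3):
    """Three-operand form: the result overwrites operand 1."""
    regs[op1] = regs[op2] * regs[op1] + regs[op3]


def fma4(regs, dest, src1, src2, src3):
    """Four-operand form: separate destination, all sources survive."""
    regs[dest] = regs[src1] * regs[src2] + regs[src3]


initial = {"x0": 2.0, "x1": 3.0, "x2": 10.0, "x3": 0.0}

# Four operands: result goes to x3, inputs x0..x2 stay intact.
regs_fma4 = dict(initial)
fma4(regs_fma4, "x3", "x0", "x1", "x2")

# Three operands: x0 is both an input and the destination.
regs_fma3 = dict(initial)
fma3_213(regs_fma3, "x0", "x1", "x2")

print(regs_fma4)  # x0 still holds 2.0, x3 holds 16.0
print(regs_fma3)  # x0 now holds 16.0, the original 2.0 is gone
```

With the three-operand form, code that still needs the overwritten input has to copy it to another register first.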
