Sports Car Forum - MotorWorld.net - View Single Post - Official PS3 thead. Pre Order at GameStop/eb games OCT 10!

Pimp Racer · 08-03-2006, 10:44 PM

Xbox360 “Xenon” compared to Playstation 3’s “Cell” – the CPUs:
Inter-core communication speed:
Another mystery with the Xbox360 (at least in my view) is the inter-core communication on the Xenon CPU. IBM clearly documents the Cell’s inter-core communication mechanism, both physically and in how it is implemented in hardware and software. This bandwidth needs to be extremely high if separate cores need to communicate and share data effectively. The EIB on the Cell is documented at a peak performance of 204GB/s with an observed rate of 197GB/s. The major factor that affects this rate is the direction, source, and destination of data flow between the SPEs and the PPE on the Cell. I tried to find the equivalent piece of hardware inside the Xenon CPU and haven’t found a direct answer.

Looking at the second architectural diagram of the Xenon, it seems that the fastest method the cores can use to talk to each other is through the L2 cache. Granted, the Xenon only has 3 cores, but game modules are usually highly dependent and will need to talk to each other frequently. I might be jumping the gun a bit, but given that the L2 cache and FSB run at half of the core speed, as opposed to the Playstation 3’s EIB which runs at the same clock speed as the cores, I’m pretty positive using the L2 cache to communicate is not going to be very fast. It seems that independent threads are really what Microsoft was aiming for with the Xbox360 CPU design, and games are not optimally implemented if they have massive streaming transfers to hand off to other cores. What would suggest that the Xbox360 cores can communicate quickly and with high bandwidth would be evidence that reads and writes to the L2 cache happen in larger segments than writes to the EIB, compensating for the lower clock speed. Additionally, just writing to memory isn’t enough, as the receiver needs some sort of notification that it has new data unless it is a permanent buffer. If anyone wants to do research on the topic, please add it to the discussion and include links to your sources.
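To illustrate the notification point, here is a minimal sketch in Python, not anything taken from either console’s SDK. The function name hand_off and the use of threading.Event as the "notification" are my own assumptions for illustration. The producer writes into a shared buffer and then signals; the consumer does nothing until it is told the data is there.

```python
import threading

def hand_off(payload):
    """Pass payload from a producer thread to a consumer thread."""
    shared_buffer = []               # stands in for memory both cores can see
    data_ready = threading.Event()   # the notification the receiver waits on
    received = []

    def producer():
        shared_buffer.extend(payload)  # write the data first...
        data_ready.set()               # ...then signal that it is complete

    def consumer():
        data_ready.wait()              # sleep until notified, no polling
        received.extend(shared_buffer)

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Without the event, the consumer would have to keep re-reading the buffer and guessing whether the write had finished, which is exactly the cost a hardware signalling mechanism is meant to avoid.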

Enhanced VMX-128 instruction set:
This is one of the features Microsoft boasts about to claim they have a better gaming machine than Sony. They focus on the fact that their enhancements support a single cycle dot product instruction and a larger register file. The problem with this boast over the Playstation 3 is that it compares against the PPE’s VMX unit, which by comparison has only 1 set of 32 128-bit registers and presumably fewer instructions. If the code requires 128 128-bit registers, or more complex instructions, then the code is most definitely vector processing heavy and should be run on an SPE, which sports the exact same register file size and includes a superset of the VMX instructions in terms of functionality (it is not a superset in terms of being binary compatible).

While each core in the Xbox360 also has two VMX-128 register sets, this is done to support the dual threaded nature of the cores better. It doesn’t actually have two vector execution units. Each core only has one VMX-128 execution unit meaning that even though there are two sets of registers per core, two threads that are using vector code have to share this single execution unit.

By comparison, the Cell’s PPE has the limited 32 128-bit register file with a single VMX vector unit. This is what Microsoft usually singles out when they compare the Playstation 3 to the Xbox360’s CPU. They forget (purposefully) that the Cell has 7 SPEs running at 3.2 GHz, which is far greater SIMD performance than their 3 enhanced VMX-128 execution units. For vector based computations, the Playstation 3 undeniably outdoes the Xbox360 by an order of magnitude.

The dot product instruction claim is matched at least on the SPEs on the Playstation 3 through a simple multiply-add instruction. For those of you that aren’t mathematically inclined, a dot product is basically a measure of how parallel or perpendicular two lines are. The calculation of a dot product is basically multiplying each corresponding dimension value together, and then taking those products and adding them all together. Take two vectors <2, 3, 4> and <6, 7, 8>. The dot product would be: 2*6 + 3*7 + 4*8 = 65. If you read the earlier section in this post covering the SPEs and SIMD architectures, you should remember that at the very least, an SPE can do all of the multiplying in one cycle, and all that needs to be done is a follow up add between the elements in the result vector. I do know that the SPEs have a few multiply-add instructions, but the bit of haziness is whether the multiply can be an inter-vector (between two separate vectors) operation while the add is an intra-vector (between elements in the same vector) operation on the result of the multiply. Sony claims that the dot product can be done in one cycle on an SPE, and it is very reasonable that this is the case, as there are vector permute/shuffle/shift instructions in the SPE instruction set. There just isn’t a labeled dot product instruction in the SPE instruction set, but an intelligent programmer should find what he needs.
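The two steps described above can be sketched in plain Python. This is only a model of the arithmetic, not SPE or VMX-128 code; the function names and the padding of each 3-component vector to 4 lanes (to mimic a 128-bit register holding four 32-bit values) are my own assumptions.

```python
def multiply_lanes(a, b):
    """Element-wise multiply: what one SIMD multiply does across all lanes."""
    return [x * y for x, y in zip(a, b)]

def horizontal_add(v):
    """Sum the lanes of a single vector (the follow-up add)."""
    total = 0
    for lane in v:
        total += lane
    return total

def dot(a, b):
    """Dot product as a lane-wise multiply followed by a horizontal add."""
    return horizontal_add(multiply_lanes(a, b))

# The vectors from the post, padded with a zero in the unused fourth lane.
a = [2, 3, 4, 0]
b = [6, 7, 8, 0]
products = multiply_lanes(a, b)   # [12, 21, 32, 0]
result = dot(a, b)                # 65
```

The multiply is trivially parallel across lanes; the horizontal add is the part that needs either a dedicated instruction or some shuffling of elements within the register.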

I found the multiply-add instruction in the Cell BE Handbook. It takes 4 vectors: one is definitely the result vector and two are operands, but there is also a third parameter named ‘rc’, which I think represents a control register that dictates how to perform inter- and intra-vector operations. If so, the multiply-add instruction operates on only two vectors, and the control vector is able to dictate an add between the result components of the multiply.

Symmetrical Cores?:
Symmetrical cores means identical cores. The appeal of this setup is entirely for developers. It represents no actual horsepower advantage over asymmetric cores, since code running on any one of the cores will run exactly the same as it would on another core. Relocating code to different cores has absolutely no performance gain or loss unless it means something with respect to how the 3 cores talk to each other. It should be noted though, that thread relocation does matter between the cores, as a thread might not co-exist well with another thread that is trying to use the same hardware that isn’t duplicated on the core. In that case, the thread would be better located on a core that has that execution resource free or less used. The only case of this I can think of is the VMX-128 execution unit. I think most other hardware is duplicated on the cores in the 360 to allow two threads to co-exist with almost no problem.

The Cell chip has asymmetrical cores, which means they are not all identical. That being said, the SPEs are all symmetrical with each other, and the code that runs on an SPE could be relocated to any other SPE in the Cell. While the execution speed local to each SPE is the same, there are performance issues related to the bandwidth the SPE is using and who it’s talking to on the EIB. Developers should look at where their SPE code is executing to ensure optimal bandwidth is being observed on the EIB, but once they find an optimal location to execute the code on, they can just put it there without rewriting anything. If a task was running on the PPE or the PPE’s VMX unit, then it would have to be recompiled from C, and probably rewritten if hardware specific instructions are in the code (C or ASM), before it moves to an SPE, and the same applies in reverse. Good design and architecture should immediately let developers know what should run on the PPE and what should run on the SPEs, eliminating the chance of rewriting code if they see something better fit to run on an SPE later in development.

Is general purpose needed?:
Another one of Microsoft’s claims for the Xbox360’s superiority in gaming is the general purpose processing advantage since they have 3 general purpose cores instead of 1.

To say “most of the code is general purpose” probably refers to code size, not execution time. First, it should be clarified that “general purpose code” is only a label for the garden variety of instructions that may be given to hardware. On the hardware end, this code fits into various classifications such as arithmetic, load/store, SIMD, floating point, and possibly more. General purpose applications are programs made up of general purpose code on the scale that one function might be arithmetically heavy, and another might be memory bound. Good examples of this are MS Word, a web browser, or an entire operating system. With MS Word there is a lot of string processing which involves some arithmetic, comparison, a lot of branching, and memory operations. When you click import or export and save to various file formats, it is an I/O heavy operation. Applications like these tend to not execute the same code over and over, and have many different functions that can occur on a relatively small set of data depending on what the user does. These functions can vary from being very I/O device bound (saving to disk), to string processing intensive (spelling/grammar check), to floating point intensive (embedded Flash media game or resizing an image). Ultimately, there is a large amount of code written to handle the small set of data and most of it never gets executed.

Games are not general purpose programs. Any basic game programming book will introduce you to the concept of a game loop. This loop contains all of the functionality a game performs each frame and handles all of the events that can occur in the game. An important principle in a game loop is to avoid unnecessary branches, as they slow down execution and make the code extremely and unnecessarily long. A good example of this is the Cohen-Sutherland line clipping algorithm. Instead of writing lengthy and complicated branches to check which of the 9 regions a point lies in, the code performs 4 simpler checks and computes a region code which can easily be used.
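As a rough sketch of the region code idea, here is how the 4 checks can be folded into a single 4-bit value in Python. The bit assignments and function names are my own choices for illustration; real implementations differ in which bit means what.

```python
LEFT, RIGHT, BOTTOM, TOP = 1, 2, 4, 8   # one bit per clip boundary

def region_code(x, y, xmin, ymin, xmax, ymax):
    """Four comparisons, each contributing one bit; no nested branching."""
    return ((x < xmin) * LEFT
            | (x > xmax) * RIGHT
            | (y < ymin) * BOTTOM
            | (y > ymax) * TOP)

def trivially_accept(code_a, code_b):
    """Both endpoints are inside the window, so the line needs no clipping."""
    return (code_a | code_b) == 0

def trivially_reject(code_a, code_b):
    """Both endpoints are outside the same boundary, so the line is invisible."""
    return (code_a & code_b) != 0
```

For a 10 by 10 window with its corner at the origin, the point (5, 5) gets code 0 (inside), while (-1, 11) gets LEFT | TOP. Two bitwise operations on the endpoint codes then decide the common cases without walking through all 9 regions.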

This automatic and repetitive processing has to occur for many game objects, which represents a massive amount of data with a relatively small code size. This is the opposite of the general purpose paradigm, which typically has a small set of data (a Word document or HTML page) and performs many different functions on it, representing a large code size. Game processing has a large data size, but a much smaller code size. Game objects also tend to be very parallel in nature, as they are typically independent until they interact (collision), which means they can be processed well on SIMD architectures if they are well thought out.
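A toy version of the "small code, large data" pattern looks something like this in Python. The function names and the one-dimensional positions are assumptions made to keep the sketch short; the point is only that the same tiny update runs over every object, every frame, with no object depending on another.

```python
def update_positions(positions, velocities, dt):
    """Apply the same multiply-add to every object; no per-object branching."""
    return [p + v * dt for p, v in zip(positions, velocities)]

def run_frames(positions, velocities, dt, frames):
    """A bare-bones game loop: the same small piece of code runs every frame."""
    for _ in range(frames):
        positions = update_positions(positions, velocities, dt)
    return positions
```

Because each element is updated independently, a SIMD unit could process several objects per instruction, and separate cores could each take a slice of the arrays.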

The whole integer advantage claim for the Xbox360 CPU is pretty stupid considering the SIMD architectures can operate on 4 32-bit integers at the same time, and integer processing abilities of games are not the bottleneck of 3D games processing.

What this general purpose power does grant Xbox360 owners over the Playstation 3 is the ability to run general purpose applications faster. If the Xbox360 had a web browser (official or not), the design for such an application would work better on general purpose CPUs. That being said, it’s too bad the Xbox360 doesn’t come with one, and web browsers don’t put the highest demand on general purpose processors to begin with. Most general purpose applications remain idle until the user actually gives input. The application will then process the task and complete it before sitting idle again.

AI routines that navigate through large game trees are probably another area where general purpose processing power might be better utilized, since this code tends to be more branch laden and varies depending on the task the AI is actually trying to accomplish. The plus side for the Playstation 3 is the generation of these game trees, which is also time consuming. Generating a game tree is a more computationally oriented task, and is likely to be executed faster by SIMD architectures. I am largely speculating from my Computer Science knowledge in this area. Anyone who knows more or has done more research on AI algorithms is welcome to add to the discussion in this area.

The only case where I can really see the general purpose computing power of the Xbox360 cores manifesting itself as a true advantage over the Playstation 3 is if Windows or a similar OS were put on an Xbox360, with multiple applications running simultaneously along with some background services. Again, it is funny that the Playstation 3 is more likely to have a general purpose operating system running on it than the Xbox360, even though it would perform worse doing such a task.

XDR vs GDDR3 – System Memory Latency:
XDR stands for eXtreme Data Rate while GDDR3 stands for Graphics Double Data Rate version 3. XDR RAM is a new next generation RAM technology from those old folks at Rambus, who brought out that extremely high bandwidth RDRAM back during the onset of Pentium 4 processors. DDR was released soon after and offered comparable bandwidth at a much lower cost. RDRAM also had increased latency, higher cost, and a few other drawbacks which ultimately led to it being dropped very quickly by Intel back when it was released. Take note that DDR RAM is not the same as GDDR RAM.

Anyways, it was hard to make a good assessment of the exact nature of the performance difference between these two RAM architectures, but from what I gathered, GDDR3 is primarily meant to serve GPUs, which means bandwidth is the goal of the architecture, at the cost of increased latency. For GPUs this is acceptable since large streaming chunks of data are being worked on instead of small random accesses. In the case of CPU main memory, when more general purpose tasks are being performed, latency matters more to memory access times because a CPU accesses data at random more frequently than a GPU would.

That being said, the Xbox360 CPU’s bandwidth to RAM tops out at 21.6GB/s while the Cell processor still has more bandwidth to its RAM at 25.6GB/s. XDR RAM also does this without incurring high latency, and I’m almost positive its latency is lower than that of GDDR3, which is considered to actually have high latency. Games are not going to be performing a lot of general purpose tasks so the latency advantage for the Playstation 3 might not be that large, but the CPU will be performing more random accesses to memory regardless. The Xbox360’s CPU latency may be made worse than the already inherent GDDR3 latency issues since the CPU is separated from the RAM by the GPU.
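To put the two peak figures above into per-frame terms, here is a quick back-of-the-envelope sketch in Python. The 60 frames per second target and the function name are my own assumptions, and both inputs are theoretical peaks rather than measured rates, so treat the results as upper bounds only.

```python
def peak_mb_per_frame(bandwidth_mb_per_s, frames_per_second=60):
    """Upper bound on data the CPU could move to or from RAM in one frame."""
    return bandwidth_mb_per_s / frames_per_second

XBOX360_CPU_MB_S = 21_600  # 21.6GB/s, the figure quoted in the post
CELL_XDR_MB_S = 25_600     # 25.6GB/s, the figure quoted in the post

xbox_budget = peak_mb_per_frame(XBOX360_CPU_MB_S)
cell_budget = peak_mb_per_frame(CELL_XDR_MB_S)
print(f"Xbox360 CPU: {xbox_budget:.0f} MB per frame at peak")
print(f"Cell:        {cell_budget:.0f} MB per frame at peak")
```

The gap is roughly 19% in the Cell’s favor on paper. Latency does not show up in a calculation like this at all, which is why the random access behavior discussed above matters separately from the bandwidth numbers.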
