Because those caches only send and access fragment data, you idiot. They have 'high bandwidth' usage because they constantly access the memory for calculations, or to send small fragments of a texture. It's from WORKING ON the assets, or feeding small pieces, not MOVING THEM en masse, in their entirety, which is the job of the main memory levels 1 and 2 which you are talking about.
If they were actually being fed terabytes of data to work on:
1. There would need to be hundreds or thousands of SIMD arrays.
2. The operational bandwidth of the LDS would shoot up into the petabytes per second as it pushed out petaflops' worth of results, like it does for GPGPU supercomputers.
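To make the point concrete, here is a rough toy model (a sketch with made-up numbers, not measurements from any real GPU): a shader that keeps re-sampling one small texture tile racks up enormous cache traffic while pulling almost nothing in from main memory. The cache serves the work; memory only supplies the misses.

```python
# Toy sketch with made-up numbers (not real GPU specs): a shader re-sampling a
# small texture tile. The texture cache serves almost every request; main
# memory only sees the handful of misses that first bring the tile in.

TEXEL_BYTES = 4      # assume a 32-bit RGBA texel
LINE_BYTES = 64      # assume 64-byte cache lines
CACHE_LINES = 256    # a 16 KB toy cache (256 lines x 64 bytes)

cache = set()        # line addresses currently resident
cache_bytes = 0      # traffic served out of the cache ("working on" the asset)
memory_bytes = 0     # traffic actually moved in from main memory

def sample(texel_addr):
    global cache_bytes, memory_bytes
    line = texel_addr // LINE_BYTES
    if line not in cache:
        memory_bytes += LINE_BYTES   # miss: the line is fetched from DRAM once
        cache.add(line)
        if len(cache) > CACHE_LINES:
            cache.pop()              # crude eviction, good enough for a sketch
    cache_bytes += TEXEL_BYTES       # every sample is served by the cache

# one 1280x720 frame, 4 bilinear taps per pixel, all hitting one 64x64 tile
for pixel in range(1280 * 720):
    for tap in range(4):
        sample(((pixel * 7 + tap * 13) % (64 * 64)) * TEXEL_BYTES)

print("served from cache :", cache_bytes / 1e6, "MB")   # ~14.7 MB
print("moved from memory :", memory_bytes / 1e6, "MB")  # ~0.016 MB
```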
And just in general.
1. Local data shares are not caches, as they are entirely under programmer control.
2. eDRAM is nowhere near half a TB/s in bandwidth, and you have no proof whatsoever to show that it is, only an assumption that it's bussed at 1024 pins a cell.
3. You are a moron.
Dude, I already knew that the local data share is under the programmer's control and that they aren't caches. I just used 'caches' so that you would understand I was referring to all the internal memories on the GPU.
So what? Does that change the fact that each SIMD core can handle 2 TB/s with its local data share?
Does that change the fact that the texture caches have 480 GB/s of bandwidth or more per texture unit in the HD 4000 GPUs?
Seriously, you just seem to be avoiding the obvious. Even if I don't count the SIMD cores, I still have the texture caches at 480 GB/s, and there could be 16 of them or more, because each texture unit has its own texture cache.
Seems you like to act smart, yet you fail.
You get it?
16 × 480 GB/s = ?
Terabytes, dude, for the caches.
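Just to spell out that multiplication (and it only holds if 480 GB/s really is a per-unit figure and if there really are 16 texture caches, both of which are my assumptions, not confirmed specs):

```python
# back-of-envelope only: assumes 480 GB/s per L1 texture cache and 16 caches
per_cache_gb_s = 480
cache_count = 16
print(per_cache_gb_s * cache_count, "GB/s =", per_cache_gb_s * cache_count / 1000, "TB/s")
# 7680 GB/s = 7.68 TB/s aggregate, on those assumptions
```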
Who is the moron? (This is what happens when you try to play smart; it's not going to work.)
The Wii U GPU, whether it's a derivative of an HD 4000 or an HD 5000 or whatever, can obviously handle 563.2 GB/s with the eDRAM without problems. That's natural, because GPUs are more advanced now: if a GPU from 2000 could handle 18 GB/s of bandwidth with its eDRAM or embedded memory, it's obvious that a new GPU from ten years later can surely handle more.
I have no proof?
1. It's made by NEC, the same company that made the Xbox 360 eDRAM.
2. The Xbox 360 eDRAM had a 1024-bit bus, and the formula proves it because you get the 256 GB/s (see the quick numbers after this list).
3. The Wii U uses a newer eDRAM, seven years more up to date, from the same NEC that made the Xbox 360 eDRAM.
4. Renesas confirmed that the Wii U eDRAM uses the latest technologies at that NEC plant.
5. Shin'en mentions that the Wii U has plenty of high bandwidth.
6. The Wii U requires about 7 MB for 720p while the Xbox 360 requires the whole 10 MB for 720p (confirmed by Microsoft on MSDN), which means those 7 MB should provide bandwidth similar to what the 360's old 10 MB had.
7. There's that report from a respected writer, Bob Peterson.
And again, I'm not just saying this; the terabytes of bandwidth with the caches and the local data shares is not an assumption, it's a fact.
Read, dude, read.
http://www.tomshardw...850,1957-5.html
"
With the RV770, the AMD engineers didn’t stop at optimizing their architecture to only slightly increase the die real-estate— they also borrowed a few good ideas from the competition. The G80 had introduced a small, 16-KB memory area per multiprocessor that’s entirely under the programmer’s control, unlike a cache. This memory area, accessible in CUDA applications, can share data among threads. AMD has introduced its version of this with the RV770. It’s called Local Data Share and is exactly the same size as its competitor’s Shared Memory. It also plays a similar role by enabling GPGPU applications to share data among several threads. The RV770 goes even further, with another memory area (also 16 KB) called Global Data Share to enable communication among SIMD arrays.
Texture units
While the ALUs haven’t undergone a major modification, the texture units have been completely redesigned. The goal was obvious – as with the rest of the GPU, it was to increase performance significantly while maintaining as small a die area as possible. The engineers set fairly ambitious goals, aiming for an increase of 70% in performance for an equivalent die area. To do that, they focused their efforts largely on the texture cache. The bandwidth of the L1 texture cache was increased to 480 GB/s.
But that's not all; the L1 cache that was shared by all the SIMD arrays has been broken down into 10 cache memories, one per SIMD array, and each contains only data exclusive to the corresponding SIMD array. Shared data are now stored in an L2 cache, which has also been completely redesigned, now having a bandwidth of 384 GB/s to the L1 cache. In order to reduce latency, this L2 cache has been positioned near the memory controllers. Let's see what the results of these improvements are in practice:
"
What?
Are you going to say that page is fake?
So I suppose these ones are too, right?
http://www.anandtech.com/show/2556/4
"
AMD did also make some enhancements to their texture units as well. By doing some "stuff" that they won't tell us about, they improved the performance per mm^2 by 70%. Texture cache bandwidth has also been doubled to 480 GB/s while bandwidth between each L1 cache and L2 memory is 384 GB/s. L2 caches are aligned with memory channels of which there are four interleaved channels (resulting in 8 L2 caches).
Now that texture units are linked to both specific SIMD cores and individual L1 texture caches, we have an increase in total texturing ability due to the increase in SIMD cores with RV770. This gives us a 2.5x increase in the number of 8-bit per component textures we can fetch and bilinearly filter per clock, but only a 1.25x increase in the number of fp16 textures (as fp16 runs at half rate and fp32 runs at one quarter rate). It was our understanding that fp16 textures could run at full speed on R600, so the 1.25x increase in performance for half rate texturing of fp16 data makes sense.
Even though AMD wouldn't tell us L1 cache sizes, we had enough info left over from the R600 time frame to combine with some hints and extract the data. We have determined that RV770 has 10 distinct 16k caches. This is as opposed to the single shared 32k L1 cache on R600 and gives us a total of 160k of L1 cache. We know R600's L2 cache was 256k, and AMD told us RV770 has a larger L2 cache, but they wouldn't give us any hints to help out.
"
Or maybe this one is a fake too.
http://techreport.co...ics-processor/5
"
With 10 texture units onboard, the RV770 can sample and bilinearly filter up to 40 texels per clock. That's up from 16 texels per clock on RV670, a considerable increase. One of the ways AMD managed to squeeze down the size of its texture units was taking a page from Nvidia's playbook and making the filtering of FP16 texture formats work at half the usual rate. As a result, the RV770's peak FP16 filtering rate is only slightly up from RV670. Still, Hartog described the numbers game here as less important than the reality of measured throughput.
To ensure that throughput is what it should be, the design team overhauled the RV770's caches extensively, replacing the R600's "distributed unified cache" with a true L1/L2 cache hierarchy.
Each L1 texture cache is associated with a SIMD/texture unit block and stores unique data for it, and each L2 cache is aligned with a memory controller. Much of this may sound familiar to you, if you've read about certain competitors to RV770. No doubt AMD has learned from its opponents.
Furthermore, Hartog said RV770 uses a new cache allocation routine that delays the allocation of space in the L1 cache until the request for that data is fulfilled. This mechanism should allow RV770 to use its texture caches more efficiently. Vertices are stored in their own separate cache. Meanwhile, the chip's internal bandwidth is twice that of the previous generation—a provision necessary, Hartog said, to keep pace with the amount of data coming in from GDDR5 memory. He claimed transfer rates of up to 480GB/s for an L1 texture fetch and up to 384GB/s for data transfers between the L1 and L2 caches.
"
If that's not enough, then how about its cousin, which is not a new GPU but rather an optimized RV770 with some modifications?
http://www.bit-tech....ure-analysis/11
"
ATI Radeon HD 5870 Architecture Analysis
Published on 30th September 2009 by Tim Smalley
The L1 texture cache has remained unchanged in terms of size and associativity - it still has effectively unlimited access per clock cycle - but the increased core count means that the number of texture caches has doubled. There are now twenty 8KB L1 texture caches, meaning a total of 160KB L1 texture cache GPU-wide. The four L2 caches, which are associated with each of the four memory controllers, have doubled in capacity as well and are now 128KB each, meaning a total of 512KB across the GPU.
Texture bandwidth has also been bolstered, with texture fetches from L1 cache happening at up to 1TB/sec (one terabyte per second) - that's more than double the L1 texture cache bandwidth available in RV770. I said so earlier, but it's worth reiterating again - that's a phenomenal amount of bandwidth. What's more, bandwidth between L1 and L2 caches has been increased to 435GB/sec from 384GB/sec on RV770 - another impressive figure.
"