Introduction
This paper focuses on the key improvements introduced as NVIDIA® GPUs progressed from the Pascal to the Turing to the Ampere architecture, specifically from GP104 to TU104 to GA104. (The Volta architecture, which preceded Turing, is mentioned but is not a focus of this paper.)
NVIDIA GPUs have long excelled at graphics processing and at general-purpose data processing for workloads that benefit from massively parallel algorithms. With the move from Pascal to Volta/Turing, NVIDIA also became a leader in artificial intelligence (AI) processing through the inclusion of Tensor cores, first introduced in the Volta architecture for data centers in 2017 and brought to desktop and other use cases with the Turing architecture in 2018. The Turing architecture also introduced Ray Tracing cores, used to accelerate photorealistic rendering. With Ampere, NVIDIA has continued to make significant improvements to the GPU, including updates to the CUDA® core processing data paths and next-generation Tensor cores and Ray Tracing cores.
Figure 1: NVIDIA Ampere GA104 architecture. Details for each SM are shown in Figure 2.
High-Level Components used in GPUs
The high-level components in the NVIDIA GPU architecture have remained the same from Pascal to Volta/Turing to Ampere:
- PCIe Host Interface
- GigaThread engine
- Memory controllers
- L2 Cache
- Graphics Processing Clusters (GPCs)
Table 1: Component Blocks used in an NVIDIA GPU
|  | Pascal GP104 | Turing TU104 | Ampere GA104 |
| --- | --- | --- | --- |
| PCIe Host Interface | Gen 3 | Gen 3 | Gen 4 |
| Memory type supported | GDDR5 | GDDR6 | GDDR6 |
| Memory Controllers | 8 × 32-bit (256-bit total) | 8 × 32-bit (256-bit total) | 8 × 32-bit (256-bit total) |
| Memory Bandwidth | 320 GB/s | 448 GB/s | 448 GB/s |
| L2 Cache Size | 2048 KB | 4096 KB | 4096 KB |
| Graphics Processing Clusters (GPCs) per GPU | 4 | 5 or 6 | 6 |
PCIe Host Interface: The Ampere GPU updates the PCIe host interface to PCIe 4.0, which doubles the per-lane signaling rate of Gen 3 (16 GT/s versus 8 GT/s, or roughly 32 GB/s versus 16 GB/s in each direction for an x16 link) while remaining fully backward compatible with earlier PCIe generations.
Memory Support: The Pascal GPU supported GDDR5 memory; the Turing and Ampere GPUs support GDDR6. GDDR6 provides higher per-pin bandwidth, splits each chip's 32-bit interface into two independent 16-bit channels, and is more energy efficient than GDDR5. It is also available in higher densities, so more memory fits in the same footprint.
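As a worked example (assuming the 10 Gbps GDDR5 and 14 Gbps GDDR6 parts these GPUs commonly shipped with, speeds which are not stated in the tables): peak bandwidth = per-pin data rate × bus width ÷ 8, so 14 Gbps × 256 bits ÷ 8 = 448 GB/s for TU104/GA104, and 10 Gbps × 256 bits ÷ 8 = 320 GB/s for GP104, matching Table 1.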
Components in Graphics Processing Clusters (GPCs)
Graphics processing clusters are the data processing engines of the GPU. Each GPC includes:
- 1 Raster Engine
- 2 Raster Operator Partitions (ROPs), each containing 8 ROP units
- Texture Processing Clusters (TPCs), each of which includes:
  - A PolyMorph Engine
  - Streaming Multiprocessors (SMs), which contain:
    - Since Volta/Turing: Tensor cores
    - Since Turing: a Ray Tracing core
Table 2: Component blocks in an NVIDIA Graphics Processing Cluster (GPC)
|  | Pascal GP104 | Turing TU104 | Ampere GA104 |
| --- | --- | --- | --- |
| ROP units | 64 (tied to the memory controller and L2 cache) | 64 (tied to the memory controller and L2 cache) | 96 (integrated into the GPCs) |
| Texture Processing Clusters (TPCs) per GPC | 5 | 4 | 4 |
| TPCs per GPU | 20 | 20 or 24 | 24 |
| Streaming Multiprocessors (SMs) per TPC | 1 | 2 | 2 |
| Maximum SMs per GPU | 20 | 48 | 48 |
Raster Operator (ROP) Units: In the Pascal and Turing architectures, ROPs were tied to the memory controller and L2 cache. In the Ampere architecture, ROPs are integrated into each Graphics Processing Cluster (GPC). Placing ROP partitions inside the GPC helps eliminate throughput bottlenecks, and Ampere GPUs also provide a higher overall number of ROP units.
Other High-Level Architecture Changes
Manufacturing Process and Power Efficiency: Chips are manufactured using a process whose feature size is measured in nanometers (nm). Smaller features generally allow transistors to switch faster and consume less power at the same performance level.
Display and Video Engine: Each generation has added support for higher-resolution display output, and an Ampere GPU with VESA Display Stream Compression (DSC) technology enabled also supports High Dynamic Range (HDR) rendering. Hardware-accelerated encoding and decoding have likewise continued to improve, offloading the most computationally intense codec work from the CPU to the GPU and providing real-time performance for high-resolution encoding and decoding.
Table 3: Other High-Level Architecture Changes to NVIDIA GPUs
|  | Pascal GP104 | Turing TU104 | Ampere GA104 |
| --- | --- | --- | --- |
| Manufacturing Process | 16 nm | 12 nm | 8 nm |
| Transistors per GPU | 7.2 billion | 13.6 billion | 17.4 billion |
| TGP (Watts) | 180 | 215-230 | 220 |
| DisplayPort output | 1.2 certified | 1.4a | 1.4a |
| HDMI output | 2.0b | 2.0b | 2.1 |
| NVENC (hardware-accelerated encode) | 4th gen | 7th gen | 7th gen |
| NVDEC (hardware-accelerated decode) | 3rd gen | 4th gen | 5th gen (adds AV1) |
Streaming Multiprocessor (SM) Architecture
Each subsequent generation has brought major improvements to many of the components in the Streaming Multiprocessor.
Figure 2: NVIDIA Streaming Multiprocessor architecture for Pascal, Turing, Ampere
Each Streaming Multiprocessor (SM) includes:
- Four SM processing blocks (partitions), each of which includes:
  - CUDA data paths that can handle floating-point (FP) or integer (INT) calculations. The way the CUDA cores are assigned to perform a specific type of calculation has changed over the generations (see below for more info).
  - Tensor Core (Turing/Ampere)
  - Instruction cache per SM (Pascal) or L0 instruction cache per SM partition (Turing/Ampere)
  - Warp scheduler and dispatch unit. The way tasks are assigned has significantly improved over the generations to optimize core use (see below for more info).
  - Register file
  - Load/store units (LD/ST units)
  - Special function units (SFUs) for transcendental math functions (e.g., log x, sin x, cos x, e^x); see the sketch after this list
- L1 data cache/shared memory; consolidated into a single unit starting with Turing
- Texture units
- Ray Tracing Core (Turing/Ampere)
- Two FP64 units (Turing/Ampere)
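To make the SFU and shared-memory items concrete, here is a minimal CUDA sketch (the kernel name and tile size are illustrative, not taken from this paper). The fast-math intrinsics `__sinf` and `__expf` execute on the SFUs, while the tile staging exercises shared memory and the LD/ST units:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: stages data through per-block shared memory, then
// applies fast-math intrinsics that run on the SM's special function units.
__global__ void sfuDemo(const float *in, float *out, int n) {
    __shared__ float tile[256];                    // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];                 // serviced by LD/ST units
    __syncthreads();                               // whole block participates
    if (i < n)
        // __sinf/__expf are hardware-approximated on the SFUs, unlike the
        // slower but more precise sinf/expf software sequences.
        out[i] = __sinf(tile[threadIdx.x]) + __expf(tile[threadIdx.x]);
}
```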
Table 4: Streaming Multiprocessor Changes
|  | Pascal GP104 | Turing TU104 | Ampere GA104 |
| --- | --- | --- | --- |
| CUDA Cores per SM | 128 (FP32 or INT32) | 64 FP32 and 64 INT32 | 64 FP32 and 64 FP32-or-INT32 |
| CUDA Cores per GPU | 2560 | 3072 | 6144 (3072 to 6144 usable for FP32) |
| Concurrent execution within an SM partition | Cores used for FP32 or INT32; no concurrent FP/INT execution | One FP32 data path and one INT32 data path; concurrent execution of FP and INT | One FP32 data path and one FP32-or-INT32 data path; concurrent execution of FP and INT possible |
| Shared Memory/L1 Cache per SM | 64 KB shared memory | 96 KB combined L1/shared memory | 128 KB combined L1/shared memory |
| Total Shared Memory/L1 Cache | 1280 KB | 4608 KB | 6144 KB |
| Memory handling | Separate instruction cache (per SM) with per-partition buffers; separate L1 caches and shared memory | New L0 instruction cache per partition; combined L1/shared memory (as in Volta) | Same structure as Turing, with larger capacity |
| Warp Scheduler and Dispatch Unit | 1 warp scheduler + 2 dispatch units per partition | 1 warp scheduler + 1 dispatch unit per partition; independent thread scheduling at sub-warp granularity (as in Volta) | 1 warp scheduler + 1 dispatch unit per partition (as in Volta/Turing) |
| Ray Tracing Cores | None | Gen 1, 1 RT core per SM | Gen 2, 1 RT core per SM |
| Tensor Cores | None | 320 (2nd gen) | 184 (3rd gen) |
CUDA Datapath Changes
CUDA cores can be used for FP32 or for INT32 operations. In the Pascal architecture, each SM partition could be assigned to FP32 or to INT32 operations, but it could not execute both simultaneously. The Turing architecture separated each partition's CUDA cores into two data paths, one dedicated to FP32 and the other dedicated to INT32, allowing Turing partitions to execute FP32 and INT32 operations simultaneously. The Ampere architecture keeps Turing's two data paths, and one is still dedicated to FP32, but the other can now run either FP32 or INT32 operations, depending on demand.
Graphics workloads often require more FP32 calculations than INT32 calculations. In its Turing architecture whitepaper, NVIDIA estimated that in then-current games, "for every 100 FP32 pipeline instructions there are about 35 additional instructions that run on the integer pipeline"; in other words, roughly 26% (35 of 135) of those games' operations are integer operations. (See NVIDIA TURING GPU ARCHITECTURE, page 66.) Given this unequal mix, letting one of the Ampere data paths flexibly run either FP32 or INT32 work ensures that no cores sit idle waiting for INT tasks, since those cores can be assigned FP tasks instead.
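The distinction is visible in ordinary kernels. In the hypothetical sketch below (the kernel name and parameters are illustrative), the index and stride arithmetic compiles to INT32 instructions while the fused multiply-add compiles to FP32 instructions; on Turing and Ampere the two instruction streams can issue concurrently on separate data paths instead of competing for the same cores as on Pascal:

```cuda
#include <cuda_runtime.h>

// Index/address math -> INT32 pipeline; fmaf -> FP32 pipeline.
__global__ void scaleOffset(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT32 work
    if (i < n) {
        int j = (i * stride) % n;                   // more INT32 work
        out[i] = fmaf(in[j], 2.0f, 1.0f);           // FP32 work
    }
}
```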
Ray Tracing Cores Generation 2
The second-generation Ray Tracing cores found in Ampere architecture GPUs can effectively deliver twice the performance of the first-generation Ray Tracing cores found in Turing architecture GPUs. Ampere SMs also allow RT core and CUDA core compute workloads to run concurrently, further improving efficiency. For users who need to render complex models with accurate shadows, reflections, and refractions, or to render ray-traced motion blur, the Ampere RT cores provide large performance improvements.
Tensor Cores Generation 3
The third-generation Tensor cores found in Ampere GPUs provide much higher performance than the second-generation Tensor cores found in Turing GPUs, and they accelerate many more data types. Volta's Tensor cores supported FP16; Turing's added INT8, INT4, and binary (1-bit) precisions; and Ampere's add support for the TF32 and BF16 data types. Depending on the type of workload, the third-generation Tensor cores can deliver 2x to 4x the throughput of the previous generation.
Ampere Tensor cores also add a Fine-Grained Structured Sparsity feature: weights that contribute little after training are pruned in a structured pattern (two out of every four), and the Tensor cores skip the zeroed weights during inference, delivering more efficient inference acceleration with sparsity.
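Developers reach the Tensor cores either through libraries (see CUDA-X below) or directly through the CUDA warp-level matrix (WMMA) API. Below is a minimal sketch of one warp multiplying 16×16 FP16 tiles and accumulating in FP32, a combination all three Tensor core generations accelerate; the kernel name is illustrative, and the code requires compute capability 7.0 or higher:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C on 16x16x16 half-precision tiles using
// Tensor cores via the WMMA API (compile with, e.g., -arch=sm_75).
__global__ void wmmaTile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);            // zero the FP32 accumulator
    wmma::load_matrix_sync(aFrag, a, 16);        // leading dimension = 16
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // executes on Tensor cores
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}
```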
SM Memory Changes
The update from Pascal to Turing redesigned the SM memory path to unify shared memory, texture caching, and memory load caching into one unit. For common workloads this provided twice the bandwidth and twice the L1 capacity. The amount of memory has also increased from generation to generation (see Table 4).
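One practical consequence: because shared memory and L1 now share one physical store, a kernel must opt in before it can use a dynamic shared-memory allocation larger than the default 48 KB. A minimal sketch (the kernel name and sizes are illustrative; the exact per-block ceiling varies by compute capability):

```cuda
#include <cuda_runtime.h>

__global__ void tileKernel(float *data) {
    extern __shared__ float tile[];        // dynamically sized shared memory
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    // Opt in to a 64 KB dynamic shared-memory carve-out (above the 48 KB default).
    cudaFuncSetAttribute(tileKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         64 * 1024);
    tileKernel<<<1, 256, 64 * 1024>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```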
Warp Scheduling Changes
In an NVIDIA GPU the basic unit for executing an instruction is the warp. A warp is a group of 32 threads that share the same code and are executed together by a Streaming Multiprocessor (SM). Multiple warps can be resident on an SM at once.
Pascal was designed to support many more active warps and thread blocks than previous architectures, and each of its warp schedulers could dispatch two warp instructions per clock cycle.
Volta SM processing blocks each had a single warp scheduler and a single dispatch unit, so each partition could issue only one instruction per clock cycle. In exchange, Volta gained independent thread scheduling: a program counter and call stack per thread, plus a schedule optimizer. Together these allow threads to diverge and reconverge at sub-warp granularity, which helps ensure optimal usage of the cores.
Turing and Ampere inherited all of the Volta improvements to warp scheduling, resulting in significant processing optimization.
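Independent thread scheduling changes how divergent code synchronizes within a warp. The hypothetical kernel below (names and values are illustrative) shows the post-Volta idiom: because each thread has its own program counter, the two halves of a warp may interleave divergent branches, so warp-wide data exchange uses explicit synchronization (`__syncwarp` and the `*_sync` shuffle intrinsics) rather than assuming lockstep execution:

```cuda
#include <cuda_runtime.h>

__global__ void divergentExchange(float *data) {
    int lane = threadIdx.x & 31;   // lane index within the warp
    float v = data[threadIdx.x];
    if (lane < 16)
        v *= 2.0f;                 // one side of the divergence
    else
        v += 1.0f;                 // the other side
    __syncwarp();                  // explicitly reconverge all 32 lanes
    // Exchange values with the opposite half-warp; the full mask declares
    // that every lane participates.
    float other = __shfl_xor_sync(0xffffffffu, v, 16);
    data[threadIdx.x] = v + other;
}
```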
Software Tools
NVIDIA provides numerous software tools that help developers accelerate GPU-based application development. With each new GPU generation, new tools and features are added.
CUDA Toolkit and CUDA Compute
The CUDA Toolkit includes GPU-accelerated libraries, a compiler, development tools and the CUDA runtime. Each major new architecture release is accompanied by a new version of the CUDA Toolkit, which includes tips for using existing code on newer architecture GPUs, as well as instructions for using new features only available when using the newer GPU architecture.
CUDA compute capability lets developers determine the features supported by a GPU: Ampere GPUs have compute capability 8.6, Turing GPUs 7.5, and Pascal GPUs 6.1.
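The compute capability can be queried at runtime through the CUDA runtime API; a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    // Prints, e.g., "compute capability 8.6" on a GA104-based board.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```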
For specific information the NVIDIA CUDA Toolkit Documentation provides tables that list the “Feature Support per Compute Capability” and the “Technical Specifications per Compute Capability”.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
CUDA-X AI and CUDA-X HPC
CUDA-X is a collection of libraries, tools, and technologies built on top of CUDA specifically to support AI and high-performance computing (HPC). These libraries take advantage of NVIDIA GPUs that include Tensor cores, as in the sketch below.
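For instance, on Ampere, cuBLAS can be permitted to route eligible FP32 math through the TF32 Tensor core path with a single mode switch (CUDA 11 or later; error checking omitted in this sketch):

```cuda
#include <cublas_v2.h>

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);
    // Allow cuBLAS to use TF32 Tensor core math for eligible FP32 routines
    // on Ampere GPUs (has no effect on GPUs without TF32 support).
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
    // ... call cublasSgemm and other routines as usual ...
    cublasDestroy(handle);
    return 0;
}
```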
NVIDIA also provides integrated support in a number of open source partner libraries, providing built-in GPU acceleration for numerous types of applications.
See: https://developer.nvidia.com/gpu-accelerated-libraries
See: https://developer.nvidia.com/hpc
Conclusion
With the release of each new GPU generation NVIDIA has continued to deliver huge increases in performance and revolutionary new features. Whether an application requires enhanced image quality or powerful compute and AI acceleration, upgrading to the latest NVIDIA Ampere architecture will provide significant performance improvements.
NVIDIA, the NVIDIA logo, and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. All other trademarks are property of their respective owners.