This paper focuses on the key improvements found when upgrading an NVIDIA® GPU from the Pascal to the Turing to the Ampere architectures, specifically from GP104 to TU104 to GA104. (The Volta architecture that preceded Turing is mentioned but is not a focus of this paper.)
AI and machine vision projects are no different from other embedded processor projects in that they must meet the budget, performance, and operational goals of a program. But the path to that success follows decidedly different development workflows. For example, whereas typical embedded system deployments may require only occasional hardware or software updates, updates to AI inference engines must be regularly scheduled and built into the operational plan so that they incorporate deep learning results from new (or changing) data. AI projects also commonly underestimate the amount of data required for successful training.
Advances in graphics processors offer extremely powerful computing capabilities for image capture and processing in aerospace and defense applications, while addressing past concerns about the heat dissipation and power consumption of GPUs in constrained, rugged embedded environments. The NVIDIA Tegra K1 GPU architecture offers a performance/power combination well suited to high-performance embedded applications, especially those that need to offload graphics processing from the main computing engine yet still transfer data between the two, typically over a PCI Express-based switched fabric on a form factor such as XMC or VPX.
Small form factor MXC modules can be combined to build extremely powerful video capture, display, and encoding solutions for VPX, VME, CompactPCI, and COM Express designs and for OEM products. Because of the interface connector used and its position on the bottom of the board, the MXC envelope size equals the board outline, which gives it a size advantage over MXM modules with similar functionality.
The VPX architecture is designed around the concept of a “system-in-a-chassis” topology. Each card performs a single function that adds to an overall system, connected through a backplane. In that model, too much functionality on any single card would quickly saturate the available bandwidth.
However, advances in interconnect bandwidth have greatly increased the data transfer rates between discrete cards. For instance, a 16-lane PCIe link using the ubiquitous v2.x serial fabric can transmit eight gigabytes of data per second in each direction, and the newer v3.0 can attain nearly sixteen gigabytes per second.
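The PCIe figures quoted above can be reproduced with a short calculation. The sketch below is illustrative only: it assumes the published per-lane signalling rates (5 GT/s for v2.x, 8 GT/s for v3.0) and line encodings (8b/10b and 128b/130b, respectively), and it ignores packet and protocol overhead, so real-world throughput will be somewhat lower. The names `PCIE_GENERATIONS` and `link_bandwidth_gbytes` are hypothetical helpers introduced for this example.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency.
# Assumed values: PCIe 2.x uses 8b/10b encoding, PCIe 3.0 uses 128b/130b.
PCIE_GENERATIONS = {
    "2.x": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),
}


def link_bandwidth_gbytes(generation: str, lanes: int = 16) -> float:
    """Return raw link bandwidth in GB/s, per direction.

    Protocol overhead (headers, flow control) is not modelled.
    """
    rate_gt, efficiency = PCIE_GENERATIONS[generation]
    # GT/s * encoding efficiency = usable Gbit/s per lane; divide by 8 for bytes.
    return rate_gt * efficiency * lanes / 8


for gen in PCIE_GENERATIONS:
    print(f"PCIe {gen} x16: {link_bandwidth_gbytes(gen):.2f} GB/s per direction")
```

Under these assumptions, a x16 v2.x link works out to 8.00 GB/s and a x16 v3.0 link to about 15.75 GB/s, which is the "sixteen gigabytes per second" figure rounded up.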