Advances in graphics processors offer powerful computing capabilities for image capture and processing suited to aerospace and defense applications, while addressing past concerns about the heat dissipation and power consumption that arise when implementing GPUs in constrained, rugged embedded environments. The NVIDIA Tegra K1 GPU architecture offers a performance/power combination well suited to high-performance embedded applications, especially those where there is a desire to offload the graphics processing from the main computing engine yet still facilitate data transfers between the two, typically over a PCI Express-based connection in a form factor such as XMC or VPX.
Tegra K1 Overview
The NVIDIA Tegra K1 is an advanced mobile graphics processor combining 192 CUDA graphics processing cores with 5 ARM cores (4 ARM Cortex-A15 cores for performance-intensive applications plus 1 battery-saver core for low-performance application processing and background tasks at idle). It is designed to accommodate up to 8 GB of DDR3L memory. With its ARM cores running at a maximum clock speed of 2.3 GHz, it delivers 325 GFLOPS of compute performance while consuming less than 10 watts. The CUDA parallel computing platform used by the Tegra K1 lets developers use the GPU for general-purpose processing (GPGPU) applications. The parallel computing capability of this device, coupled with its low power consumption, makes it well suited to embedded deployments targeting applications such as vehicle/obstacle recognition and tracking, terrain analytics and unmanned navigation.
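The headline figures can be sanity-checked with simple arithmetic. The sketch below assumes each CUDA core retires one fused multiply-add per clock, counted as two floating-point operations, which is the usual convention for quoting peak single-precision throughput. The GPU clock it derives is only implied by the numbers above, not taken from a datasheet, and the 10-watt figure is treated as an upper bound.

```python
# Back-of-the-envelope check of the Tegra K1 figures quoted in the text.
CUDA_CORES = 192               # from the text
PEAK_GFLOPS = 325.0            # from the text
POWER_WATTS = 10.0             # upper bound from the text ("less than 10 watts")
FLOPS_PER_CORE_PER_CLOCK = 2   # assumption: one fused multiply-add = 2 FLOPs


def implied_gpu_clock_mhz(gflops, cores, flops_per_clock=FLOPS_PER_CORE_PER_CLOCK):
    """GPU clock (MHz) that would yield `gflops` at peak on `cores` cores."""
    return gflops * 1e9 / (cores * flops_per_clock) / 1e6


def gflops_per_watt(gflops, watts):
    """Compute efficiency; a lower bound when `watts` is an upper bound."""
    return gflops / watts


clock_mhz = implied_gpu_clock_mhz(PEAK_GFLOPS, CUDA_CORES)
efficiency = gflops_per_watt(PEAK_GFLOPS, POWER_WATTS)

print(f"Implied GPU clock: about {clock_mhz:.0f} MHz")
print(f"Efficiency: at least {efficiency:.1f} GFLOPS/W")
```

Under that assumption the implied GPU clock is well below 2.3 GHz, which is consistent with the 2.3 GHz figure describing the ARM cores and not the GPU.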
Figure 1: Tegra K1 integrated into a WOLF XMC module
PCI Express Peer-to-Peer Communications Basics
The Tegra K1 was originally developed for mobile consumer products such as tablets. It has its own PCIe root complex, and thus its own PCIe domain hierarchy. When the Tegra K1 is implemented as part of an XMC or VPX peripheral, the primary computing engine typically has its own PCIe root complex and domain hierarchy as well. By default, these domains are independent of each other, with no traversal path between their PCIe memory spaces. The separate domains can be bridged with a PCIe inter-domain switch that incorporates non-transparent bridges (NTBs). An NTB isolates the PCIe address domains and, in essence, makes the device attached to it appear to be a PCIe endpoint. In this arrangement, the primary computing engine is considered the system domain and the endpoint computing peripheral a local domain. Each domain requires its own memory map as well as its own initialization and enumeration. Traversal between the domains requires address translation between the respective PCIe address spaces, which is performed by an address translation unit at the endpoint.
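The address translation step can be illustrated with a small model. This is a minimal sketch that assumes a simple base-plus-offset window, a common way to describe NTB translation. The class name `NtbWindow`, its fields and all addresses are hypothetical and do not come from any particular switch or driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtbWindow:
    """Hypothetical model of one translation window through an NTB."""
    local_base: int   # start of the window in the sender's PCIe address space
    size: int         # window length in bytes
    remote_base: int  # where the window lands in the other domain's address space

    def translate(self, local_addr: int) -> int:
        """Map an address in the sender's domain to the receiver's domain."""
        offset = local_addr - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError(f"address {local_addr:#x} is outside the NTB window")
        return self.remote_base + offset


# Illustrative values only: a 16 MB window from the system (host) domain
# into the local (Tegra K1) domain.
host_to_k1 = NtbWindow(local_base=0xA000_0000, size=0x0100_0000, remote_base=0x8000_0000)

translated = host_to_k1.translate(0xA000_1000)
print(hex(translated))  # 0x80001000
```

Addresses outside the window are rejected instead of being passed through, which mirrors the isolation the NTB provides between the two domains.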
Intelligent Peripheral Example
Examples of the Tegra K1 graphics processor implemented in intelligent computing peripherals are WOLF’s MXC-TK1-FGX and XMC-TK1-FGX modules. Incorporating a PCIe inter-domain switch solution, these cards handle the analysis and processing of captured frame grabber data without disrupting the host computing blade into which they are integrated, yet still support PCIe data transfers to and from the host via the XMC X15 connector or the MXC connector.
WOLF’s Tegra K1 modules are populated with the full 8 GB of DDR3L memory the device can support, which reduces the latency that memory constraints would otherwise cause in GPGPU applications. Each WOLF module includes a Tegra K1 directly connected to WOLF’s embedded FGX frame grabber technology via the K1’s native PCIe lanes.
Figure 2: WOLF MXC with Tegra K1 and FGX
Implementing Multiple Tegra K1s
For intense imaging applications, multiple Tegra K1 engines connected via PCI Express can be configured to work in tandem. An example of such an implementation is WOLF’s VPX3U-TK1-DUAL-FGX board, which implements two full-featured Tegra K1-based modules on a 3U VPX card for a combined total of 10 ARM cores (8 Cortex-A15 cores plus 2 battery-saver cores), 16 GB of DDR3L memory and 650 GFLOPS of CUDA processing power. This 3U VPX board uses two MXC modules, each with a Tegra K1 processor designed and manufactured with 8 GB of DDR3L memory and integrated with WOLF’s embedded FGX frame grabber engine, directly connected via the Tegra K1’s native PCIe bus. A PCIe inter-domain switch solution is implemented on the board to support peer-to-peer communications with other Intel or PowerPC computing blades in a 3U VPX chassis via the P1 connector.
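The combined totals follow directly from the per-device figures given in the overview. The sketch below uses a hypothetical `TegraK1Module` record to add them up. Note that the GFLOPS total is a sum of theoretical peaks, not a measured result, and real application scaling depends on how the workload is partitioned across the two devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TegraK1Module:
    """Per-module figures as quoted in the text (hypothetical record type)."""
    cortex_a15_cores: int = 4
    battery_saver_cores: int = 1
    ddr3l_gb: int = 8
    peak_gflops: float = 325.0


def combined(modules):
    """Sum the resources of several Tegra K1 modules on one board."""
    return {
        "arm_cores": sum(m.cortex_a15_cores + m.battery_saver_cores for m in modules),
        "cortex_a15_cores": sum(m.cortex_a15_cores for m in modules),
        "ddr3l_gb": sum(m.ddr3l_gb for m in modules),
        "peak_gflops": sum(m.peak_gflops for m in modules),
    }


# Two modules, as on the dual-module 3U VPX board described above.
totals = combined([TegraK1Module(), TegraK1Module()])
print(totals)
```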
Figure 3: WOLF 3U VPX with two MXC modules, each with a Tegra K1 and an FGX
The Tegra K1’s parallel processing performance, coupled with its low power consumption, makes it a natural fit for embedded systems with data-intensive image capture and processing requirements. Applications such as unmanned vehicles, surveillance and 3-D visualization of geospatial data, which require significant computational horsepower yet must be deployed in size, weight and power (SWaP)-constrained environments, are well matched to Tegra K1-based solutions. In addition, PCIe inter-domain switching technology allows the Tegra K1 to be implemented as an intelligent peripheral and allows multiple Tegra K1-based peripheral computing engines to be tiled so that they transfer data among themselves and with a host computing blade. This suits deployments that require the maximum possible graphics computation capability.