To address these industry challenges, Meta AI has developed and is open-sourcing AITemplate (AIT), a unified inference system with separate acceleration back ends for both AMD and NVIDIA GPU hardware. It delivers close to hardware-native Tensor Core (NVIDIA GPU) and Matrix Core (AMD GPU) performance on a variety of widely used AI models, such as convolutional neural networks, transformers, and diffusion models. With AIT, it is now possible to run performant inference on hardware from both GPU providers. We’ve used AIT to achieve performance improvements of up to 12x on NVIDIA GPUs and 4x on AMD GPUs compared with PyTorch eager mode.
AITemplate is a Python framework that transforms AI models into high-performance C++ GPU template code to accelerate inference. Our system is designed for speed and simplicity. AITemplate has two layers: a front-end layer, where we perform various graph transformations to optimize the graph, and a back-end layer, where we generate C++ kernel templates for the GPU target. In addition, AIT keeps external dependencies to a minimum. For example, the generated runtime library for inference is self-contained and requires only the CUDA or ROCm runtime environment. (CUDA, NVIDIA’s Compute Unified Device Architecture, allows AI software to run efficiently on NVIDIA GPUs. ROCm is an open source software platform that does the same for AMD’s GPUs.)
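To make this concrete, here is a minimal sketch of what the workflow looks like: a tiny model is defined with AITemplate's PyTorch-like Python front end and compiled into a standalone runtime. It follows the patterns used in the examples in the AITemplate repository; exact module paths, signatures, and the `_attrs` convention for marking outputs may vary between releases, so treat it as illustrative rather than canonical.

```python
# Illustrative sketch based on the examples in the AITemplate repository.
from aitemplate.compiler import compile_model
from aitemplate.frontend import nn, Tensor
from aitemplate.testing import detect_target


class SimpleNet(nn.Module):
    """A one-layer model defined with AITemplate's front-end API."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.dense = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.dense(x)


# Declare a symbolic fp16 input; fp16 maps onto Tensor Cores / Matrix Cores.
x = Tensor(shape=[8, 512], dtype="float16", name="input_0", is_input=True)

model = SimpleNet(512, 256)
y = model.forward(x)
# Mark the graph output (the convention used in the repository's examples).
y._attrs["name"] = "output_0"
y._attrs["is_output"] = True

# detect_target() selects the CUDA or ROCm back end for the local GPU.
# compile_model() runs the front-end graph optimizations, generates the
# C++ kernel templates, and builds them into a standalone runtime library.
target = detect_target()
module = compile_model(y, target, "./tmp", "simple_net")
```

The compiled module can then be invoked directly on GPU tensors. Because the generated library is self-contained, deploying it requires only the CUDA or ROCm runtime environment, not the Python front end used to build it.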