Nvidia offers new software to tame LLMs, improve AI inference

Nvidia is set to release new open source software in the coming weeks to help organizations accelerate and optimize large language model (LLM) inference. The move comes as the company deepens its focus on AI inference after quickly becoming a dominant player in hardware and software for AI training.

LLMs live up to that first “L”: they are increasingly large, and increasingly challenging to run in cost-effective ways. The new TensorRT-LLM open source software library is designed to help. It integrates Nvidia’s TensorRT deep learning compiler and includes optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives to improve LLM performance on Nvidia GPUs. Because it does not require deep knowledge of C++ or Nvidia CUDA, it lets developers experiment and get projects started more quickly.
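By way of illustration, the workflow the library targets can be sketched in a few lines of Python. The sketch below assumes the high-level LLM API exposed by the tensorrt_llm package; the class names, arguments, and model identifier are indicative only and may differ between releases.

```python
# Minimal sketch of serving a model through TensorRT-LLM's high-level Python API.
# Assumes the tensorrt_llm package is installed on a machine with an Nvidia GPU;
# names, arguments, and the model identifier are illustrative and may vary by release.
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine for the given model; the optimized kernels,
# batching, and multi-GPU communication are handled by the runtime rather than
# hand-written C++/CUDA code.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["Summarize what TensorRT-LLM does."], sampling):
    print(output.outputs[0].text)
```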

“We are seeing the large language model ecosystem advancing at the speed of light,” said Ian Buck, vice president and general manager of Nvidia’s Hyperscale and HPC business. “It’s exploding in terms of the number of models, and the diversity of the different model architectures. Models are getting bigger and more intelligent, and inference is getting harder.” As LLMs “get smarter,” they expand beyond the scope of what a single GPU can manage and may need to run across multiple GPUs or racks, he added.

Buck described AI inference as “one of the fastest growing parts of the datacenter.” That growth, together with the challenges above, means organizations working with LLMs need a software stack that can scale and execute even across multiple racks of GPUs.

Nvidia drew from projects that it worked on with companies such as Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML (now part of Databricks), OctoML, Tabnine, and Together AI, and integrated innovations from those projects into TensorRT-LLM.

For example, developers looking to ratchet up LLM inference performance previously had to rewrite their AI models, manually split them into fragments, and coordinate execution across multiple GPUs. TensorRT-LLM uses tensor parallelism, which splits individual weight matrices across devices so that each model can run in parallel across multiple GPUs connected through Nvidia’s NVLink, and across multiple servers, without developer intervention or model changes.
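To make the idea concrete (this is not TensorRT-LLM’s implementation), the following PyTorch sketch splits one weight matrix column-wise across two devices, computes the partial matrix multiplications in parallel, and concatenates the results; the device names and layer sizes are arbitrary.

```python
# Illustrative sketch of tensor parallelism for a single linear layer, not
# TensorRT-LLM's implementation: the weight matrix is split column-wise across
# two devices, each computes a partial result, and the shards are concatenated
# to reproduce the full output.
import torch

torch.manual_seed(0)
hidden, out_features = 1024, 1024
x = torch.randn(1, hidden)              # one token's activations
w = torch.randn(hidden, out_features)   # full weight matrix

devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]
shards = torch.chunk(w, chunks=len(devices), dim=1)   # column-wise split

# Each device multiplies the activations by its shard of the weights.
partials = [x.to(d) @ s.to(d) for d, s in zip(devices, shards)]

# Gather the partial outputs (NVLink/NCCL would handle this in a real system).
y_parallel = torch.cat([p.cpu() for p in partials], dim=1)
print("max abs difference vs. single-device matmul:",
      (y_parallel - x @ w).abs().max().item())
```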

The new software also uses an optimized scheduling technique called in-flight batching to handle especially large and highly variable workloads that might otherwise slow down inference. Nvidia described the technique in a blog post, stating that it “takes advantage of the fact that the overall text generation process for an LLM can be broken down into multiple iterations of execution on the model. With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the TensorRT-LLM runtime immediately evicts finished sequences from the batch. It then begins executing new requests while other requests are still in flight. In-flight batching and the additional kernel-level optimizations enable improved GPU usage and minimally double the throughput on a benchmark of real-world LLM requests on H100 Tensor Core GPUs, helping to minimize [total cost of ownership].”
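The scheduling pattern itself can be illustrated independently of TensorRT-LLM’s runtime. The toy Python loop below, with made-up request objects and a stubbed decode step, evicts finished sequences after every iteration and immediately admits waiting requests into the freed slots, which is the behavior the blog describes.

```python
# Toy illustration of in-flight (continuous) batching, not TensorRT-LLM's
# scheduler: finished sequences are evicted after every decoding step and
# waiting requests are admitted right away, instead of waiting for the whole
# batch to drain.
from collections import deque
from dataclasses import dataclass, field
import random

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    tokens: list = field(default_factory=list)

def decode_step(batch):
    """Stub for one model iteration: append one 'token' per active request."""
    for req in batch:
        req.tokens.append(random.randint(0, 31999))

def serve(requests, max_batch_size=4):
    waiting, active, done = deque(requests), [], []
    while waiting or active:
        # Admit new requests into any free batch slots before the next step.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        decode_step(active)
        # Evict finished sequences immediately rather than waiting for the batch.
        still_running = []
        for req in active:
            (done if len(req.tokens) >= req.max_new_tokens else still_running).append(req)
        active = still_running
    return done

if __name__ == "__main__":
    reqs = [Request(rid=i, max_new_tokens=random.randint(2, 8)) for i in range(10)]
    print(f"completed {len(serve(reqs))} requests")
```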

Regarding performance, Buck said that when combined with Nvidia’s Hopper GPU architecture, TensorRT-LLM delivered an 8x total increase in throughput on an H100 GPU compared with Nvidia’s earlier-generation A100.