Machine learning (ML) models require significant computational power and memory to run. When a model must serve many requests, the workload has to be distributed across the available resources to achieve maximum performance. Equally important is optimization, which reduces latency and increases inference speed, both crucial for real-time applications. Through efficient resource distribution and optimization, we can ensure fast, cost-effective, and reliable model hosting.
Scaling machine learning workloads, however, involves multiple complexities.
- Data complexity: Storing, versioning, and processing vast amounts of diverse data required to train ML models presents infrastructure, data quality, and cost challenges.
- Complex pipelines and algorithms: Intricate data pipelines and algorithms demand significant computational power.
- Complexity vs. performance: Models with millions of parameters demand substantial resources, and balancing model complexity against computational limits can be tricky.
- Distributed computing and scaling: Scaling ML models requires distributed systems, which introduces challenges in data synchronization, latency, and consistency across nodes.
In this article, we will explore how to maximize the performance of available hardware resources using NVIDIA Triton Inference Server, open-source software that standardizes AI model deployment and execution.
Setting up Triton
Triton requires a model repository in which each model has its own directory. Inside each model directory are versioned subfolders that store the model files. A sample directory structure for the model repository is given below.
models/
└── object_detection/
    ├── 1/
    │   └── model.pt
    └── config.pbtxt
For the purpose of this post, we will use an image input of size 640x640 with 3 channels as an example. Let’s look at the basic configuration required to host a model with Triton.
config.pbtxt
---------------------------------------------------------------------
name: "object_detection"
platform: "pytorch_libtorch"
max_batch_size: 8
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output__pred_boxes"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
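To show how these names and shapes map to client code, here is a minimal Python sketch using the tritonclient gRPC package (pip install tritonclient[grpc]). It assumes the server is reachable at localhost:8001, so the gRPC port must be published from the container (for example, by adding -p8001:8001 to the docker run commands shown later); the random array simply stands in for a real preprocessed image.

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the Triton gRPC endpoint (assumed to be localhost:8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# With max_batch_size > 0, the request shape includes the batch dimension.
image = np.random.rand(1, 3, 640, 640).astype(np.float32)
infer_input = grpcclient.InferInput("input_1", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

response = client.infer(
    model_name="object_detection",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("output__pred_boxes")],
)
boxes = response.as_numpy("output__pred_boxes")
print(boxes.shape)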
Memory Optimization
Limiting the Models Running on Triton
By default, Triton loads every model stored in the model repository, which can lead to unnecessary memory usage. To optimize memory consumption, you can load only the models you need by starting the Triton server with --model-control-mode=explicit and the --load-model argument.
The following example runs only a model named ‘object_detection’ on the Triton server. You can load multiple models by repeating --load-model with each model's name. With explicit model control enabled, models can also be loaded and unloaded at runtime, as sketched after the command below.
docker run --rm -p8000:8000 -v ./models:/models \
  nvcr.io/nvidia/tritonserver:22.12-py3 tritonserver \
  --model-store=/models --model-control-mode=explicit \
  --load-model=object_detection
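With explicit model control enabled, models can also be loaded and unloaded while the server is running, using Triton's model control API. A brief sketch with the tritonclient gRPC package (assuming the gRPC port is published):

import tritonclient.grpc as grpcclient

# Assumes Triton was started with --model-control-mode=explicit.
client = grpcclient.InferenceServerClient(url="localhost:8001")

client.load_model("object_detection")                  # load on demand
print(client.is_model_ready("object_detection"))       # True once loaded

client.unload_model("object_detection")                # free its memory again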
Limiting the Number of Model Versions
Triton can host several versions of a model from the versioned subdirectories in its model repository. To limit memory usage, you can restrict which versions are hosted on the server by adding a version policy to config.pbtxt as follows:
version_policy: {
  latest: {
    num_versions: 1
  }
}
This hosts only the latest n versions (one in the example above) of the model from the model repository. Alternatively, you can host specific versions using the configuration below:
version_policy: {
  specific: {
    versions: [ 1, 3 ]
  }
}
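When several versions are hosted, a client can target a particular one by passing model_version on the request; if it is omitted, Triton serves a version allowed by the version policy. A short sketch (the version number and shapes follow the earlier example and are assumptions):

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

infer_input = grpcclient.InferInput("input_1", [1, 3, 640, 640], "FP32")
infer_input.set_data_from_numpy(np.zeros((1, 3, 640, 640), dtype=np.float32))

# Explicitly request version 3; drop model_version to use the policy default.
response = client.infer(
    model_name="object_detection",
    inputs=[infer_input],
    model_version="3",
)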
Performance Optimization
Running Multiple Instances of the Model
Running multiple instances of a model reduces latency and increases throughput, as the instances can handle requests simultaneously. By default, Triton hosts a single instance of each model, but you can configure the number of instances with the instance_group setting. The configuration below hosts two instances of the model.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
Triton will automatically deploy the instances on the available GPUs, but you can specify which GPUs each instance group should use.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1, 2 ]
  }
]
This configuration will place one execution instance on GPU 0 and two execution instances on GPUs 1 and 2. Typically, having multiple instances of a model will improve performance because it allows overlap of memory transfer operations (for example, CPU to/from GPU) with inference compute. Multiple instances also improve GPU utilization by allowing more inference work to be executed simultaneously on the GPU, making efficient use of available resources.
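The benefit of multiple instances only materializes when requests actually arrive concurrently. The sketch below, which assumes the object_detection model from earlier and a local gRPC endpoint, issues requests from several threads so that both instances can be kept busy:

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.grpc as grpcclient

def run_inference(_):
    # Each worker uses its own connection to the server.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    infer_input = grpcclient.InferInput("input_1", [1, 3, 640, 640], "FP32")
    infer_input.set_data_from_numpy(
        np.random.rand(1, 3, 640, 640).astype(np.float32))
    response = client.infer(model_name="object_detection", inputs=[infer_input])
    return response.as_numpy("output__pred_boxes")

# Eight concurrent requests; with two model instances, Triton can execute
# them in parallel rather than strictly one after another.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_inference, range(8)))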
Batching Inputs
Whether inputs can be processed as a batch depends on how the batch dimension is expressed in the model configuration. When max_batch_size is greater than zero (as in the earlier example), Triton adds an implicit batch dimension as the first dimension, so it must not be listed in dims. If instead you set max_batch_size to 0 and let the model handle batching itself, add the batch dimension explicitly as the first dimension; you can either specify a fixed batch size or use -1 if the size varies, as shown below.
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ -1, 3, 640, 640 ]
  }
]
You will have to make the same change to the output dimensions as well.
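On the client side, a batch is sent by stacking inputs along the first dimension. A short sketch, assuming the earlier object_detection configuration with max_batch_size greater than zero:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Four images stacked into a single request; the first dimension is the batch.
batch = np.random.rand(4, 3, 640, 640).astype(np.float32)
infer_input = grpcclient.InferInput("input_1", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

response = client.infer(model_name="object_detection", inputs=[infer_input])
boxes = response.as_numpy("output__pred_boxes")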
Dynamic Batching
Triton implements multiple scheduling and batching algorithms, including a dynamic batcher that combines individual inference requests to improve throughput.
To enable dynamic batching, add the following to the model configuration.
dynamic_batching { }
Vertical Scaling
Vertical scaling involves assigning more than one full vGPU profile to a single Triton virtual machine (VM). You can add the --gpus all flag to the docker run command to host the Triton server with all available GPUs. If you prefer to use only specific GPUs, pass their device IDs instead, for example --gpus device=0 for a single GPU or --gpus '"device=0,1"' for several.
You can run the Triton server with all available GPUs using the below command:
docker run --gpus all -p8000:8000 -v ./models:/models \
  nvcr.io/nvidia/tritonserver:22.12-py3 tritonserver --model-store=/models
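Once the container is up, it is worth confirming that the server and model are ready before sending traffic. A small sketch using the tritonclient gRPC package (again assuming port 8001 is published):

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("object_detection"))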
Horizontal Scaling
Horizontal scaling requires setting up multiple Triton Inference Server VMs and configuring a load balancer to distribute requests efficiently among them.
Follow these steps to deploy the load balancer:
- On each Triton server, edit the /etc/hosts file to add the IP address and hostname of the load balancer.
sudo nano /etc/hosts
- Add the following entry to /etc/hosts (IP address first, followed by the hostname).
IP-address-of-HAproxy hostname-of-HAproxy
- On the load balancer, add the IPs of the Triton Servers to /etc/hosts and install HAproxy.
sudo apt install haproxy
- Configure HAProxy to point to the gRPC endpoints of both servers.
sudo nano /etc/haproxy/haproxy.cfg
frontend triton-frontend
    bind 10.110.16.221:80        # IP of the load balancer and port
    mode http
    default_backend triton-backend

backend triton-backend
    balance roundrobin
    server triton 10.110.16.186:8001 check     # gRPC endpoint of Triton
    server triton2 10.110.16.218:8001 check

listen stats
    bind 10.110.16.221:8080      # port for the load balancer statistics page
    mode http
    option forwardfor
    option httpclose
    stats enable
    stats show-legends
    stats refresh 5s
    stats uri /stats
    stats realm Haproxy\ Statistics
    stats auth nvidia:nvidia     # credentials for the statistics page
- In your Triton gRPC client, update the server URL to point to the load balancer's IP address; this is how the client will interact with the load-balanced Triton servers (see the client sketch after these steps).
- Restart the HAProxy service.
sudo systemctl restart haproxy.service
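As a sketch of the client-side change mentioned above, the gRPC client simply targets the HAProxy frontend (10.110.16.221:80 in the example configuration) instead of an individual Triton server; HAProxy then spreads the requests across the backends:

import numpy as np
import tritonclient.grpc as grpcclient

# Point the client at the load balancer rather than a single Triton server.
client = grpcclient.InferenceServerClient(url="10.110.16.221:80")

infer_input = grpcclient.InferInput("input_1", [1, 3, 640, 640], "FP32")
infer_input.set_data_from_numpy(np.zeros((1, 3, 640, 640), dtype=np.float32))
response = client.infer(model_name="object_detection", inputs=[infer_input])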
Best Practices for Optimizing CPU/GPU Performance
To get the maximum performance out of NVIDIA Triton Inference Server, it is important to implement framework-specific optimizations for the models you host. Triton has several optimization settings that are controlled by the model configuration optimization policy. Please refer to this guide for more information.
Triton Inference Server uses "backends" to execute models. These backends serve as wrappers for frameworks such as PyTorch, TensorFlow, TensorRT, or ONNX Runtime (ORT). Users also have the option to create custom backends. Each backend offers specific optimization options.
This section focuses on optimizations for ONNX Runtime (ORT).
TensorRT Acceleration for GPU
TensorRT provides better optimizations than the CUDA execution provider. However, its effectiveness depends on the model structure, or more precisely, the operators used in the network. If all the operators are supported, conversion to TensorRT will yield better performance. Below is the configuration for adding TensorRT acceleration for GPU.
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
  }
}
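Whether TensorRT acceleration actually helps a given model is best checked empirically, for example by comparing average request latency with and without the accelerator enabled (NVIDIA's perf_analyzer tool is the more thorough option). A rough sketch, assuming the model and endpoint from the earlier examples:

import time

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
infer_input = grpcclient.InferInput("input_1", [1, 3, 640, 640], "FP32")
infer_input.set_data_from_numpy(
    np.random.rand(1, 3, 640, 640).astype(np.float32))

# Warm up once (the first request may trigger TensorRT engine building),
# then time a number of requests and report the average latency.
client.infer(model_name="object_detection", inputs=[infer_input])
start = time.time()
for _ in range(50):
    client.infer(model_name="object_detection", inputs=[infer_input])
print("average latency: %.1f ms" % ((time.time() - start) / 50 * 1000))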
CUDA Execution Provider Optimization for GPU
When GPU execution is enabled for ORT, the CUDA execution provider (EP) is enabled automatically. If TensorRT is also enabled, the CUDA EP serves as a fallback, handling only the nodes that TensorRT cannot execute. If TensorRT is not enabled, the CUDA EP is the primary execution provider for the model.
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "cuda"
      parameters { key: "cudnn_conv_use_max_workspace" value: "0" }
      parameters { key: "use_ep_level_unified_stream" value: "1" }
    }]
  }
}
OpenVINO for CPU Acceleration
Triton Inference Server supports acceleration for CPU-only models using OpenVINO. While OpenVINO provides software-level optimizations, it is important to consider the CPU hardware being used. To enable CPU acceleration, add the following to the configuration file:
optimization {
  execution_accelerators {
    cpu_execution_accelerator : [ {
      name : "openvino"
    }]
  }
}
Sample Configuration
Following is a sample configuration we used in our project to host a YOLO object detection model:
name: "yolov7_object_detection"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output__pred_boxes"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
version_policy: {
  latest: {
    num_versions: 2
  }
}
This configuration will host the last two versions of the object detection model.