Machine learning (ML) models require a lot of computational power and memory to work. So when a model has several requests to handle, the workload must be distributed among available resources for maximum performance. Equally important is optimization, which helps reduce latency and enhance inference speed, crucial for real-time applications. Through efficient resource distribution and optimization, we can ensure fast, cost-effective, and reliable model hosting.

Scaling machine learning workloads, however, involves multiple complexities.

  • Data complexity: Storing, versioning, and processing vast amounts of diverse data required to train ML models presents infrastructure, data quality, and cost challenges.
  • Complex pipeline and algorithms: Intricate data pipelines and algorithms demand significant computational power.
  • Complexity vs performance: Models with millions of parameters demand substantial resources. Balancing model complexity and computational limits can be tricky.
  • Distributed computing and scaling: Scaling ML models requires distributed systems, which introduces challenges in data synchronization, latency, and consistency across nodes.

In this article, we will explore ways to maximize the performance of available hardware resources using NVIDIA Triton Inference Server, open-source software that standardizes AI model deployment and execution.

Setting up Triton

Triton requires a model repository, where each model has its own directory. Inside the model directory are versioned subfolders that store the model files. A sample directory structure for the model repository is given below.

models/
└── object_detection/
    ├── 1/
    │   └── model.pt
    └── config.pbtxt

For the purpose of this post, we will use an image input of size 640x640 with 3 channels as an example. Let’s look at the basic configuration required to host a model with Triton. 

config.pbtxt  
---------------------------------------------------------------------                                                                     
name: "object_detection"
platform: "pytorch_libtorch"
max_batch_size: 8

input [
    {
        name: "input_1"
        data_type: TYPE_FP32
        dims: [ 3, 640, 640 ]
    }
]

output [
    {
        name: "output__pred_boxes"
        data_type: TYPE_FP32
        dims: [-1]
    }
]
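
With the model configured, a client can send inference requests to the server's HTTP endpoint. Below is a minimal sketch using Triton's Python client (it assumes the tritonclient package is installed and the server is reachable on localhost:8000; the random image stands in for a real preprocessed input):

import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy 640x640 RGB image; the first dimension is the batch size,
# which is required because max_batch_size is greater than 0.
image = np.random.rand(1, 3, 640, 640).astype(np.float32)

infer_input = httpclient.InferInput("input_1", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

response = client.infer(
    model_name="object_detection",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output__pred_boxes")],
)
print(response.as_numpy("output__pred_boxes"))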

Memory Optimization

Limiting the Models Running on Triton

By default, Triton loads every model stored in the model repository, which can lead to unnecessary memory usage. To reduce memory consumption, you can start the server in explicit model-control mode and load only the models you need with the --load-model argument.

The following example runs only a model named ‘object_detection’ on the Triton server. You can also load multiple models by repeating the flag with each model's name.

docker run --rm -p8000:8000 -v ./models:/models \
    nvcr.io/nvidia/tritonserver:22.12-py3 tritonserver \
    --model-store=/models --model-control-mode=explicit \
    --load-model=object_detection
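
In explicit mode you can also load and unload models at runtime through the model control API, without restarting the server. A small sketch using the Python HTTP client (assuming the server was started with the command above):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Runtime model control; these calls only work with --model-control-mode=explicit.
client.load_model("object_detection")
print(client.is_model_ready("object_detection"))  # True once loading completes

client.unload_model("object_detection")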

Limiting the Number of Model Versions

Each model directory can contain multiple versions, and Triton decides which of them to host based on the model's version policy. To limit memory usage, you can restrict older model versions from being hosted on the server by adding a version_policy configuration to config.pbtxt as follows:

version_policy: { 
    latest: { 
        num_versions: 1
    }
}

This hosts only the latest n versions of the model (one in the example above). Alternatively, you can choose to host specific versions using the configuration below:

version_policy: { 
     specific: { 
        versions: [1,3]
     }
}
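
You can confirm which versions are actually being served by querying their readiness. The following sketch assumes versions 1 through 3 exist in the repository and the server is reachable on localhost:8000:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Only the versions allowed by the version_policy should report ready.
for version in ["1", "2", "3"]:
    ready = client.is_model_ready("object_detection", model_version=version)
    print(f"version {version}: {'ready' if ready else 'not ready'}")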

Performance Optimization

Running Multiple Instances of the Model

Running multiple instances of a model reduces latency and increases throughput, as multiple instances can handle requests simultaneously. By default, Triton creates a single execution instance of each model per available GPU, but you can configure the number of instances using the instance_group setting. The configuration below hosts two instances of the model.

instance_group [ 
    { 
        count: 2 
        kind: KIND_GPU
    }
]

Triton will automatically place model instances on the available GPUs, but you can also specify which GPUs each group of instances should use.

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
    },
    {
        count: 2
        kind: KIND_GPU
        gpus: [ 1, 2 ]
    }
]

This configuration will place one execution instance on GPU 0 and two execution instances on GPUs 1 and 2. Typically, having multiple instances of a model will improve performance because it allows overlap of memory transfer operations (for example, CPU to/from GPU) with inference compute. Multiple instances also improve GPU utilization by allowing more inference work to be executed simultaneously on the GPU, making efficient use of available resources.
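
To confirm that additional instances actually help for your model, measure throughput and latency under increasing load. One way to do this is with the perf_analyzer tool shipped in the Triton client SDK container; the command below is illustrative and assumes the server is reachable on localhost:8000.

perf_analyzer -m object_detection -u localhost:8000 --concurrency-range 1:8

Comparing the reported throughput with one instance versus two shows whether the extra instance pays off for your workload.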

Batching Inputs

To process multiple inputs as a batch, the model's inputs and outputs need a batch dimension. When max_batch_size is greater than 0, as in the configuration above, Triton adds a variable batch dimension automatically as the first dimension, so it should not be listed in dims; clients simply send inputs whose first dimension is the batch size. If the model handles batching itself, set max_batch_size to 0 and declare the batch dimension explicitly as the first entry of dims. You can either specify a fixed batch size or, if the size varies, leave it as -1 as shown below.

input [
    {
        name: "input_1"
        data_type: TYPE_FP32
        dims: [ -1, 3, 640, 640 ]
    }
]

You will have to do the same with the output dimensions as well.
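
With max_batch_size set to 8 as in the earlier configuration, a client can batch requests itself by stacking images along the first dimension. A minimal sketch with the Python HTTP client (same assumptions as before):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Four 640x640 RGB images stacked into a single batch of shape (4, 3, 640, 640).
batch = np.random.rand(4, 3, 640, 640).astype(np.float32)

infer_input = httpclient.InferInput("input_1", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

response = client.infer(model_name="object_detection", inputs=[infer_input])
print(response.as_numpy("output__pred_boxes").shape)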

Dynamic Batching

Triton implements multiple scheduling and batching algorithms, including a dynamic batcher that combines individual inference requests to improve throughput.

To enable dynamic batching, add the following to the model configuration.

dynamic_batching { }
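
The empty block enables dynamic batching with Triton's default settings. Optionally, you can tune which batch sizes the scheduler prefers and how long it may delay a request while forming a batch; the values below are only illustrative and should be chosen based on your latency budget:

dynamic_batching {
    preferred_batch_size: [ 4, 8 ]
    max_queue_delay_microseconds: 100
}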

Vertical Scaling

Vertical scaling involves adding more than one full vGPU profile to a single Triton virtual machine (VM). You can add the --gpus all flag to the docker run command to host the Triton server with all available GPUs, or limit it to specific GPUs by passing their device IDs, for example --gpus device=0 for a single GPU (an example with multiple GPUs follows the command below).

You can run the Triton server with all available GPUs using the below command:

docker run --gpus all -p8000:8000 -v ./models:/models \
    nvcr.io/nvidia/tritonserver:22.12-py3 tritonserver --model-store=/models
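
To expose only specific GPUs to the container instead, pass their device IDs (the extra quoting is needed when listing more than one ID); for example:

docker run --gpus '"device=0,1"' -p8000:8000 -v ./models:/models \
    nvcr.io/nvidia/tritonserver:22.12-py3 tritonserver --model-store=/models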

Horizontal Scaling

Horizontal scaling requires setting up multiple Triton Inference Server VMs and configuring a load balancer to distribute requests efficiently among them.

Follow these steps to deploy the load balancer:

  1. On each Triton Server, edit the /etc/hosts file to add the IP address of the load balancer.
sudo nano /etc/hosts
  2. Add the following entry to /etc/hosts, mapping the load balancer's IP address to its hostname.
IP-address-of-HAproxy hostname-of-HAproxy
  3. On the load balancer, add the IPs of the Triton Servers to /etc/hosts and install HAproxy.
sudo apt install haproxy
  4. Configure HAproxy to point to the gRPC endpoint of both servers.
sudo nano /etc/haproxy/haproxy.cfg

frontend triton-frontend
        bind     10.110.16.221:80 #IP of load balancer and port
        mode    http
        default_backend    triton-backend

backend triton-backend
        balance roundrobin
        server triton 10.110.16.186:8001 check #gRPC endpoint of Triton
        server triton2 10.110.16.218:8001 check

listen stats
        bind 10.110.16.221:8080 #port for showing load balancer statistics
        mode http
        option forwardfor
        option httpclose
        stats enable
        stats show-legends
        stats refresh 5s
        stats uri /stats
        stats realm Haproxy\ Statistics
        stats auth nvidia:nvidia #auth for statistics
  5. In your Triton gRPC client, update the server URL to point to the load balancer's IP address. This is how the client will interact with the load-balanced Triton servers.
  6. Restart the HAproxy service.
sudo systemctl restart haproxy.service
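
For step 5, the only client-side change is the URL, which now points at the HAProxy frontend. A minimal sketch with Triton's Python gRPC client, using the load balancer address from the configuration above:

import numpy as np
import tritonclient.grpc as grpcclient

# Point the client at the HAProxy frontend rather than an individual Triton server.
client = grpcclient.InferenceServerClient(url="10.110.16.221:80")

image = np.random.rand(1, 3, 640, 640).astype(np.float32)
infer_input = grpcclient.InferInput("input_1", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

response = client.infer(model_name="object_detection", inputs=[infer_input])
print(response.as_numpy("output__pred_boxes"))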

Best Practices for Optimizing CPU/GPU Performance

To get the maximum performance out of NVIDIA Triton Inference Server, it is important to implement framework-specific optimizations for the models you host. Triton has several optimization settings that are controlled by the model configuration optimization policy. Please refer to this guide for more information.

Triton Inference Server uses "backends" to execute models. These backends serve as wrappers for frameworks such as PyTorch, TensorFlow, TensorRT, or ONNX Runtime (ORT). Users also have the option to create custom backends. Each backend offers specific optimization options.

This section focuses on optimizations for ONNX Runtime (ORT).

TensorRT Acceleration for GPU

TensorRT provides better optimizations than the CUDA execution provider. However, its effectiveness depends on the model structure, or more precisely, the operators used in the network. If all the operators are supported, conversion to TensorRT will yield better performance. Below is the configuration for adding TensorRT acceleration for GPU.

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
  }
}

CUDA Execution Provider Optimization for GPU

When GPU execution is enabled for ORT, the CUDA execution provider (EP) is enabled automatically. If TensorRT is also enabled, the CUDA EP serves as a fallback, handling only the nodes that TensorRT cannot execute. If TensorRT is not enabled, the CUDA EP is the primary execution provider for the model.

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "cuda"
      parameters { key: "cudnn_conv_use_max_workspace" value: "0" }
      parameters { key: "use_ep_level_unified_stream" value: "1" }
    }]
  }
}

OpenVINO for CPU Acceleration

Triton Inference Server supports acceleration for CPU-only models using OpenVINO. While OpenVINO provides software-level optimizations, it is also important to consider the CPU hardware being used. To enable CPU acceleration, add the following to the model configuration file:

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}

Sample Configuration

Following is a sample configuration we used in our project to host a YOLO object detection model:

name: "yolov7_object_detection"
platform: "pytorch_libtorch"
max_batch_size: 32

input [
{
name: "input_1"
data_type: TYPE_FP32
dims: [ 3, 640, 640 ]

}
]

output [
{
name: "output__pred_boxes"
data_type: TYPE_FP32
dims: [2]
}
]

version_policy: {
latest: {
num_versions: 2
}
}

This configuration will host the last two versions of the object detection model.
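
With two versions hosted, a client can target a particular one through the model_version argument; if it is omitted, Triton serves the latest available version. A brief sketch using the same Python HTTP client as before (it assumes version 2 exists in the repository):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 640, 640).astype(np.float32)
infer_input = httpclient.InferInput("input_1", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# Explicitly request version 2 of the model; omit model_version to use the latest.
response = client.infer(
    model_name="yolov7_object_detection",
    inputs=[infer_input],
    model_version="2",
)
print(response.as_numpy("output__pred_boxes"))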
