Triton Inference

Triton is open-source inference serving software from NVIDIA, designed for high-throughput inference. It also supports a range of options for client-server communication and request handling (such as HTTP, gRPC, dynamic batching, and async requests).

If you need to serve more than one model, the flexibility of Triton Inference combined with TIR's high-performance GPU infrastructure is your best bet.

Utilization

Triton can be used to deploy models on either GPU or CPU. It maximizes GPU/CPU utilization with features such as dynamic batching and concurrent model execution. This means that even with a single GPU, you can load more than one model as long as GPU memory is available.
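Dynamic batching and concurrent model execution are both switched on per model in its `config.pbtxt`. The following is a minimal sketch, not a complete configuration: the model name `my_model`, the platform, and all numeric values are placeholders chosen for illustration, and the input/output tensor sections that many backends require are omitted.

```python
def render_config(name: str, platform: str, max_batch_size: int,
                  instances: int, max_queue_delay_us: int) -> str:
    """Return config.pbtxt text with dynamic batching and several model instances."""
    return (
        f'name: "{name}"\n'
        f'platform: "{platform}"\n'
        f"max_batch_size: {max_batch_size}\n"
        "\n"
        "# Let Triton group individual requests into larger batches.\n"
        "dynamic_batching {\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
        "\n"
        "# Run several copies of the model concurrently on the GPU.\n"
        "instance_group [\n"
        "  {\n"
        f"    count: {instances}\n"
        "    kind: KIND_GPU\n"
        "  }\n"
        "]\n"
    )


# All values below are illustrative assumptions, not recommended settings.
config_text = render_config(
    name="my_model",
    platform="onnxruntime_onnx",
    max_batch_size=8,
    instances=2,
    max_queue_delay_us=100,
)
print(config_text)
```

A larger queue delay gives the batcher more time to fill a batch at the cost of added latency, so these values are usually tuned per model.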

Scalability

You can auto-scale replicas and auto-deploy models across all of them with the click of a button.

Application Experience

You get HTTP/REST and gRPC endpoints out of the box. There is also support for real-time, batch (including in-flight batching), and streaming options for sending inference requests. Models can be updated in production without downtime.

Quick Start: Tutorial

Create a directory model-dir to download the model repository.

Download the sample model repository:

Upload the sample models from the local directory to a TIR Repository.

Create a model endpoint to access our model over REST/gRPC:

A model endpoint offers a way to serve your model over REST or gRPC. TIR will automatically provision an HTTP server (handler) for your model.
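Triton's HTTP/REST interface follows the KServe v2 inference protocol, so a request is a JSON body posted to `/v2/models/<model>/infer`. The sketch below uses only the Python standard library to show the shape of such a request. The base URL and the bearer token are placeholders (assumptions for illustration); take the actual endpoint address and authentication details from your TIR endpoint page.

```python
import json
import urllib.request


def infer_url(base_url: str, model_name: str) -> str:
    """Build the KServe v2 inference URL for a model."""
    return f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"


def build_infer_payload(inputs: dict, output_names: list) -> dict:
    """inputs maps tensor name -> (datatype, shape, flat list of values)."""
    return {
        "inputs": [
            {"name": name, "datatype": dtype, "shape": list(shape), "data": list(data)}
            for name, (dtype, shape, data) in inputs.items()
        ],
        "outputs": [{"name": name} for name in output_names],
    }


def post_infer(base_url: str, model_name: str, payload: dict, token=None) -> dict:
    """Send the request and return the decoded JSON response."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the endpoint expects a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(
        infer_url(base_url, model_name),
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Example body for a model with one INT32 input of shape [1, 4].
payload = build_infer_payload(
    {"INPUT0": ("INT32", [1, 4], [1, 2, 3, 4])},
    ["OUTPUT0"],
)
```

Calling `post_infer("https://<your-endpoint>", "<model-name>", payload, token="<token>")` would send the request; it is left out of the block because it needs a live endpoint.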

Use Triton Client to call the model endpoint

Triton Client is extremely flexible and supports a number of client settings. Apart from supporting a variety of programming languages (C++, Java, or Python), it also offers features such as async I/O, streaming, and decoupled connections.

For the sake of simplicity, we will use a Python client that synchronously calls the simple model endpoint.

The Triton client repo has examples of more complex use cases such as async and streaming.
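A synchronous call with the Python client might look like the sketch below. It assumes the sample repository's `simple` model has the same signature as in the upstream Triton examples (two INT32 inputs of shape [1, 16], returning their element-wise sum and difference). The `url`, the bearer-token header, and the `auth_headers` helper are hypothetical; substitute the values shown for your endpoint.

```python
def auth_headers(token=None):
    """Hypothetical helper: bearer-token header, if the endpoint requires one."""
    return {"Authorization": f"Bearer {token}"} if token else {}


def call_simple_model(url, token=None, use_ssl=True):
    """Send one synchronous request to the `simple` model and return both outputs."""
    # Imported here so the file loads even where the client is not installed
    # (pip install "tritonclient[http]" numpy).
    import numpy as np
    import tritonclient.http as httpclient

    # `url` is host[:port] without a scheme, e.g. "localhost:8000".
    client = httpclient.InferenceServerClient(url=url, ssl=use_ssl)

    # Assumed signature of the upstream `simple` model: two INT32 [1, 16] tensors.
    input0_data = np.arange(16, dtype=np.int32).reshape(1, 16)
    input1_data = np.ones((1, 16), dtype=np.int32)

    inputs = [
        httpclient.InferInput("INPUT0", [1, 16], "INT32"),
        httpclient.InferInput("INPUT1", [1, 16], "INT32"),
    ]
    inputs[0].set_data_from_numpy(input0_data)
    inputs[1].set_data_from_numpy(input1_data)

    outputs = [
        httpclient.InferRequestedOutput("OUTPUT0"),  # INPUT0 + INPUT1
        httpclient.InferRequestedOutput("OUTPUT1"),  # INPUT0 - INPUT1
    ]

    result = client.infer(
        "simple", inputs, outputs=outputs, headers=auth_headers(token)
    )
    return result.as_numpy("OUTPUT0"), result.as_numpy("OUTPUT1")
```

With a reachable endpoint, `sums, diffs = call_simple_model("<host:port>", token="<token>")` returns two NumPy arrays.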

Troubleshooting and Operations Guide

Model Updates

Deploying a new model version or model config is a two-step process:

* Push the updated model (files and config) to the Model Repository
* Restart the model endpoint service

When you restart the service, TIR will stop the existing container and start a new one. The new container will download the most recent files from the repository.
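In a Triton model repository, each model has numbered version subdirectories, and by default Triton serves the highest-numbered one. Preparing a new version locally before pushing it could be sketched as follows. The model name `my_model`, the file name `model.onnx`, and both helper functions are hypothetical, and the push to the TIR repository itself is not shown.

```python
import shutil
import tempfile
from pathlib import Path


def next_version(model_dir: Path) -> int:
    """Return one more than the highest numeric version directory (1 if none)."""
    versions = [
        int(p.name) for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
    ]
    return max(versions, default=0) + 1


def add_model_version(repo: Path, model: str, model_file: Path) -> Path:
    """Copy model_file into a new version directory under repo/model."""
    model_dir = repo / model
    model_dir.mkdir(parents=True, exist_ok=True)
    version_dir = model_dir / str(next_version(model_dir))
    version_dir.mkdir()
    shutil.copy2(model_file, version_dir / model_file.name)
    return version_dir


# Demo against a throwaway directory so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "model-dir"
    new_file = Path(tmp) / "model.onnx"
    new_file.write_bytes(b"placeholder weights")
    first = add_model_version(repo, "my_model", new_file)
    second = add_model_version(repo, "my_model", new_file)
    print(first.name, second.name)  # prints "1 2"
```

Keeping older version directories in place makes it straightforward to roll back by removing the newest one and restarting the endpoint.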

Metrics and Logging

You can use the dashboard to keep track of resource metrics as well as service-level metrics such as QPS, P99, and P50. The dashboard also reports detailed logs streamed from the running replicas.
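To make these service-level metrics concrete: QPS is the number of requests served per second, and P50/P99 are the latencies below which 50% and 99% of requests complete. The sketch below computes them from client-side latency samples using the nearest-rank method. This is an illustration of the definitions only; the dashboard's exact calculation may differ, and the sample data is made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, window_seconds):
    """Summarize request latencies observed during a window of the given length."""
    return {
        "qps": len(latencies_ms) / window_seconds,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Made-up data: 100 requests with latencies of 1..100 ms over a 10-second window.
summary = summarize(list(range(1, 101)), window_seconds=10)
print(summary)  # {'qps': 10.0, 'p50_ms': 50, 'p99_ms': 99}
```

A large gap between P50 and P99 usually indicates that a small share of requests is queuing or hitting a slow path.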

Autoscaling

You can configure autoscaling to dynamically launch new replicas when the load is high. Note that scaling depends on the availability of resources.

Multi-Model Support

Triton allows GPUs to be shared between multiple models, and TIR supports multi-model configuration. However, explicit model control, in which you load or unload only selected models, is not supported yet. If this feature is important to you, please raise a support ticket.

Frequently Asked Questions

Can I use Triton Server to deploy Large Language Models (LLMs)?

Yes. We recommend the TensorRT-LLM or vLLM server.

Can Triton server handle streaming or batching requests?

Yes. The Triton client repo has several examples.