Model Deployments

TIR makes it easy to deploy containers that serve a model API.

TIR offers two methods to create an inference service (API endpoint) for your AI model:

  • Deploy using pre-built (TIR provided) Containers

    Before you launch a service with pre-built containers, you must first create a TIR Model and upload model files to it. The pre-built containers are designed to auto-download the model files from an EOS (E2E Object Storage) bucket and launch the API server with them. Once the endpoint is ready, you can send synchronous requests to it for inference.

  • Deploy using your own container

    You can provide a public or private Docker image and launch an inference service with it. Once the endpoint is ready, you can make synchronous requests to it for inference (see the request sketch after this list). You may also choose to attach a TIR Model to your service to automate the download of model files from the EOS bucket into the container.
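
As a minimal illustration of a synchronous inference call, the sketch below posts a JSON payload to an endpoint using Python's requests library. The endpoint URL, API token, and payload shape are placeholders, not TIR-specific values: copy the real URL and token from your endpoint's detail page and match the payload to the model server you deployed.

    import requests

    # Placeholder values: copy the real endpoint URL and API token
    # from your endpoint's detail page in TIR.
    ENDPOINT_URL = "https://<your-endpoint-url>/v1/models/<model-name>:predict"
    API_TOKEN = "<your-api-token>"

    # Payload shape depends on the model server you deployed (TorchServe, Triton, etc.).
    payload = {"inputs": ["sample input text"]}

    response = requests.post(
        ENDPOINT_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=60,
    )
    response.raise_for_status()
    print(response.json())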

Pre-built Containers

TIR provides Docker container images that you can run as pre-built containers. These containers run HTTP inference servers that can serve inference requests with minimal configuration. They can also connect to E2E Object Storage and download model files into the container at startup.

This section lists deployment guides for all the integrated frameworks that TIR supports.

TorchServe

Go through this complete guide to deploy a TorchServe service.

NVIDIA Triton

Go through this detailed guide to deploy a Triton service.

LLAMA2

Go through this tutorial to deploy LLAMA v2.

CodeLlama

Go through this tutorial to deploy a CodeLlama service.

Stable Diffusion

Go through this tutorial to deploy a Stable Diffusion inference service.

Custom Containers

Go through this detailed tutorial on building custom images for model deployments.
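
If you are packaging your own image, the sketch below shows the kind of HTTP server such an image might run, using only Python's standard library. The /predict route, port 8080, and echo logic are assumptions for illustration; a real service would load your model at startup, return its predictions, and listen on whatever port you configure for the endpoint.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class InferenceHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Hypothetical route; replace the echo logic with real model inference.
            if self.path != "/predict":
                self.send_error(404, "unknown route")
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            result = {"predictions": payload.get("inputs", [])}  # placeholder: echoes inputs back
            body = json.dumps(result).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # Bind to all interfaces so the container port can be exposed by the platform.
        HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()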