Deploy Inference Endpoint For Meta Llama 2
In this tutorial, we will download Meta's Llama 2 (7B) model and create an inference endpoint against it.
Download the Llama-2-7b-chat (by Meta) model from Hugging Face
Upload the model to Model Bucket (EOS)
Create an inference endpoint (model endpoint) in TIR to serve API requests
Step 1: Define a model in TIR Dashboard
Before we proceed with downloading or (optionally) fine-tuning the model weights, let us first define a model in the TIR dashboard.
Go to TIR Dashboard
Choose a project
Go to Model section
Click on Create Model
Enter a model name of your choosing (e.g. Meta-Llama2-7b-Chat)
Select Model Type as Custom or PyTorch
Click on CREATE
You will now see details of EOS (E2E Object Storage) bucket created for this model.
EOS provides an S3-compatible API to upload or download content. We will be using the Minio CLI in this tutorial.
Copy the Setup Host command from the Setup Minio CLI tab to a notepad or leave it in the clipboard. We will soon use it to set up the Minio CLI (an illustrative example follows the note below).
Note: In case you forget to copy the setup host command for Minio CLI, don’t worry. You can always go back to model details and get it again.
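For reference, the setup command from the dashboard is typically an mc config host add (or mc alias set) command. The values below are placeholders for illustration only; use the exact command shown in your Setup Minio CLI tab.
# placeholder values -- copy the exact command from the Setup Minio CLI tab of your model
mc config host add <ALIAS> <EOS_ENDPOINT_URL> <ACCESS_KEY> <SECRET_KEY>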
Step 2: Start a new Notebook
To work with the model weights, we will first need to download them to a local machine or a notebook instance.
In TIR Dashboard, Go to Notebooks
Launch a new Notebook with the Transformers (or PyTorch) image and a hardware plan (e.g. A100 with 80 GB). We recommend a GPU plan if you plan to test or fine-tune the model.
Click on the Notebook name or the Launch Notebook option to start the JupyterLab environment
In JupyterLab, click New Launcher and select Terminal
Now, paste and run the command for setting up Minio CLI Host from Step 1
If the command works, you will have the mc CLI ready for uploading our model. You can quickly verify the configuration as shown below.
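To confirm the CLI is configured, you can list your model bucket with mc ls. The alias below is a placeholder; use the alias and bucket name from your own setup command.
# should list the (currently empty) model bucket if the host setup worked
mc ls <ALIAS>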
Step 3: Download the Llama-2-7b-chat (by Meta) model from the notebook
Now that our EOS bucket is ready to store the model weights, let us download the weights from Hugging Face.
Start a new notebook (untitled.ipynb) in JupyterLab
Add your Hugging Face API token and run the following command from a notebook cell. You will find the API token in your Hugging Face account settings. If you prefer not to use the API token this way, you can alternatively run notebook_login() (from the huggingface_hub package) in a notebook cell, as shown in the snippet further below.
%env HUGGING_FACE_HUB_TOKEN=hf_xxxx.......
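If you prefer the interactive login mentioned above, a minimal notebook cell looks like this (it uses notebook_login from the huggingface_hub package):
# interactive alternative to setting the token as an environment variable
from huggingface_hub import notebook_login

notebook_login()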
Run the following code to download the model. The model will be downloaded by the Hugging Face SDK into the $HOME/.cache folder.
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

# Hugging Face model id for the Llama 2 7B chat model
model_id = "meta-llama/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    tokenizer=tokenizer,
    device_map="auto",
)
Note
If you face any issues running the above code in the notebook cell, you may be missing required libraries. This can happen if you did not launch the notebook with the Transformers image. In that case, install the required libraries as shown below:
!pip install transformers torch accelerate
Let us run a simple inference to test the model.
pipeline(
    "It is said that life is beautiful when",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
Note
If you use the base Llama-2-7b-hf model instead of the chat variant, note that it is not trained on instructions or chat, so it is not capable of answering questions. However, the base model is trained for sentence completion. So instead of asking "What is life?", a more appropriate prompt would be "It is said that life is".
Step 4: Upload the model to Model Bucket (EOS)
Now that the model works as expected, you can fine-tune it with your own data or choose to serve the model as-is. This tutorial assumes you are uploading the model as-is to create an inference endpoint. If you fine-tune the model, you can follow similar steps to upload it to the EOS bucket.
# go to the directory that has the downloaded huggingface model snapshot.
cd $HOME/.cache/huggingface/hub/meta-llma2-7b-chat/snapshots
# push the contents of the folder to EOS bucket.
# Go to TIR Dashboard >> Models >> Select your model >> Copy the cp command from Setup Minio CLI tab.
# The copy command would look like this:
# mc cp -r <MODEL_NAME> llma-7b/llma-7b-323f3
# here we replace <MODEL_NAME> with '*' to upload all contents of snapshots folder
mc cp -r * llma-7b/llma-7b-323f3
Note
The model directory name may be a little different (we assume it is meta-llma2-7b-chat). With recent versions of the Hugging Face SDK, the cache directory typically looks like models--meta-llama--Llama-2-7b-chat-hf. If this command does not work, list the directories in $HOME/.cache/huggingface/hub to identify the model directory, as shown below.
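If you are unsure of the exact paths, the following commands may help. The bucket path llma-7b/llma-7b-323f3 is the example from above; substitute the path from your own Setup Minio CLI tab.
# list the cached models to find the exact directory name
ls $HOME/.cache/huggingface/hub

# after the upload, verify the bucket contents
mc ls llma-7b/llma-7b-323f3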
Step 5: Create an endpoint for our model
When a model endpoint is created in the TIR dashboard, a model server is launched in the background to serve inference requests.
The TIR platform supports a variety of model formats through pre-built containers (e.g. PyTorch, Triton, Meta Llama 2).
For the scope of this tutorial, we will use the pre-built Llama 2 (7B) container for the model endpoint, but you may choose to create your own custom container by following this tutorial.
In most cases, the pre-built container will work for your use case. The advantage is that you won't have to worry about building an API handler.
When you use pre-built containers, all you need to do is load your model weights (fine-tuned or not) into the TIR Model's EOS bucket and launch the endpoint. The API handler will be created automatically for you.
Steps to create endpoint:
Go to TIR Dashboard
Go to Model Endpoints
Create a new Endpoint
Choose the Llama 2 (7B) option
Pick a suitable GPU plan. We recommend an A100 (80 GB) and a disk size of 20 GB.
Select appropriate model (should be the EOS bucket that has your model weights)
Complete the endpoint creation
When your endpoint is ready, visit the Sample API Request section to test your endpoint using curl. An illustrative request is sketched below.
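As an illustration only, a request from a terminal might look like the sketch below. The URL, token, and payload here are placeholders, not the actual TIR request schema; copy the exact curl command from the Sample API Request section of your endpoint.
# placeholder values -- use the exact curl command from the Sample API Request section
curl -X POST "https://<YOUR_ENDPOINT_URL>" \
  -H "Authorization: Bearer <YOUR_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "It is said that life is beautiful when", "max_tokens": 200}'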