Deploying Codellama As A REST API Service
Introduction
Generative AI continues to grow in popularity, but the infrastructure required to support these models is still under active development. Closed-source models like ChatGPT and Midjourney are attractive for people looking to jump into this space quickly because they are so convenient to use.
But will closed-source end up winning in the long term?
While models like ChatGPT are quick and easy to use, there are quite a few downsides:
- Price
  - The cost of closed-source grows linearly with the number of requests sent to the model
  - At scale, the cost may balloon out of control
- Security
  - When a request is made to ChatGPT, all the data is sent directly to OpenAI's servers.
  - There is no guarantee the data is stored securely
- Privacy
  - Similar to security, since the data is being shared with OpenAI it is not private
  - For certain use-cases, data privacy and security are of paramount importance, so a closed-source solution may not be viable
To address some of these challenges, smaller open-source models such as Llama 2 and Mistral have been created. These open-source models are cheaper to host and can be run inside a private cloud environment. However, open-source models bring a new set of challenges.
How do I deploy these models? What kind of compute do they require?
Many companies are actively building tools to help developers work with open-source generative AI. One such company is Replicate, and they've built an open-source tool called cog which allows developers to easily containerize AI models.
Cog automatically handles some of the tedious work around containerizing a model such as writing a Dockerfile, figuring out which CUDA version to use, and setting up an API server to handle requests.
In this blog post, we'll be deploying Codellama 7B as a REST API service using cog. Let's get started!
Tutorial
Step 1: Installing Cog
The first thing we need to do is install cog. If you're on macOS, you can install it via Homebrew. Simply open up your terminal and type the command below:
brew install cog
If you're on Linux you can install it using the following command:
sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/latest/download/cog_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/cog
To check that you have installed cog successfully, run cog --version and you should see the version of cog that is installed.
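If you would rather do this check from a script, the sketch below wraps the same cog --version command in Python. The helper name cog_version is made up for this example, and it simply returns None when no cog executable is found on your PATH.

```python
import shutil
import subprocess
from typing import Optional


def cog_version(executable: str = "cog") -> Optional[str]:
    """Return the output of `<executable> --version`, or None if it is not installed."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some CLIs print the version to stderr, so fall back to it if stdout is empty.
    return result.stdout.strip() or result.stderr.strip() or None


version = cog_version()
print(version if version else "cog was not found on PATH")
```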
Step 2: Setting up the project
Once cog is installed, we can set up the project. Open up your terminal and go to the directory you want to create your project in.
Next, type in cog init
The above command creates the files necessary for cog to run and package your model. You should see two files get created: predict.py and cog.yaml.
If you open up the predict.py file you should see this:
# Prediction interface for Cog ⚙️
# https://github.com/replicate/cog/blob/main/docs/python.md

from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def setup(self) -> None:
        """Load the model into memory to make running multiple predictions efficient"""
        # self.model = torch.load("./weights.pth")

    def predict(
        self,
        image: Path = Input(description="Grayscale input image"),
        scale: float = Input(
            description="Factor to scale image by", ge=0, le=10, default=1.5
        ),
    ) -> Path:
        """Run a single prediction on the model"""
        # processed_input = preprocess(image)
        # output = self.model(processed_image, scale)
        # return postprocess(output)
What's going on here?
This file is where we will write all of the model-related code. It contains a class called Predictor with two methods, setup and predict. Inside setup, we will download the model and put it on the GPU. The predict method will handle incoming requests to the model and return the model output, which cog serializes to JSON.
Behind the scenes, cog automatically turns this Predictor class into a REST API service so we don't need to create a separate web server.
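To make the division of labor between setup and predict concrete, here is a toy sketch that does not use cog at all. EchoPredictor and its placeholder "model" are invented for illustration; the point is only that the expensive loading step runs once, while predict is called once per request.

```python
class EchoPredictor:
    """Stand-in for a cog Predictor; it neither imports cog nor loads a real model."""

    def __init__(self) -> None:
        self.setup_calls = 0
        self.model = None

    def setup(self) -> None:
        # Expensive one-time work (for example, loading weights) would happen here.
        self.setup_calls += 1
        self.model = str.upper  # placeholder "model" that upper-cases its input

    def predict(self, prompt: str) -> dict:
        # Cheap per-request work that reuses the already-loaded model.
        return {"output": self.model(prompt)}


predictor = EchoPredictor()
predictor.setup()  # called once when the service starts
responses = [predictor.predict(p) for p in ("hello", "world")]  # once per request
print(responses, predictor.setup_calls)
```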
The other file that was created is cog.yaml. This file describes what the container needs in order to run our code, such as the Python version and the Python dependencies.
Step 3: Adding the code for codellama
First, let's import the necessary libraries for Code Llama.
from cog import BasePredictor, Input
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from typing import Dict
Next, let's write a quick system prompt, as chat-based Llama variants require the prompt to be in a specific format.
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest coding assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
Feel free to modify the DEFAULT_SYSTEM_PROMPT to your liking.
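To see what the assembled prompt looks like, the sketch below wraps the delimiters in a small helper. build_prompt is a hypothetical name that is not part of the project files, and it mirrors the spacing of the f-string used later in predict; the system prompt here is a shortened stand-in.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(user_prompt: str, system_prompt: str) -> str:
    """Wrap a system prompt and a user prompt in the Llama chat delimiters."""
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {user_prompt} {E_INST}"


example = build_prompt(
    "Reverse a string in Python.",
    "You are a concise coding assistant.",  # shortened stand-in system prompt
)
print(example)
```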
Now, we need to fill out the setup function in which we will load the model.
class Predictor(BasePredictor):
    MODEL_NAME = "codellama/CodeLlama-7b-Instruct-hf"

    def setup(self) -> None:
        self.model = AutoModelForCausalLM.from_pretrained(
            self.MODEL_NAME,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.MODEL_NAME
        )
Using the transformers library, we can download the model from the Hugging Face Hub with the AutoModelForCausalLM.from_pretrained function. By specifying torch_dtype=torch.float16, we load the model weights as 16-bit floats. Setting device_map="auto" automatically places the model weights on the GPU if one exists; otherwise, it uses the CPU.
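As a back-of-the-envelope check on why 16-bit weights matter, the sketch below estimates the memory needed for the weights alone. It assumes a round 7 billion parameters and ignores activations and the generation cache, so treat the result as a lower bound: roughly 13 GiB in float16 versus roughly 26 GiB in float32.

```python
def weight_memory_gib(num_parameters: float, bytes_per_parameter: int) -> float:
    """Approximate memory needed just to hold the model weights, in GiB."""
    return num_parameters * bytes_per_parameter / 2**30


NUM_PARAMETERS = 7e9  # assumption: a round 7 billion parameters

fp16 = weight_memory_gib(NUM_PARAMETERS, 2)  # float16 uses 2 bytes per weight
fp32 = weight_memory_gib(NUM_PARAMETERS, 4)  # float32 uses 4 bytes per weight
print(f"float16: {fp16:.1f} GiB, float32: {fp32:.1f} GiB")
```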
Next, let's define the predict function.
    def predict(
        self,
        prompt: str = Input(
            description="Input Prompt.", default="How can I create a basic API in python?"
        ),
        max_tokens: int = Input(
            description="Maximum number of tokens to generate. As a rough rule of thumb, a word is 1-2 tokens",
            ge=1,
            default=100,
        ),
        top_p: float = Input(
            description="Valid if you choose top_p decoding. When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens",
            ge=0.01,
            le=1.0,
            default=1.0,
        ),
        temperature: float = Input(
            description="Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value.",
            ge=0.01,
            le=5,
            default=0.75,
        ),
        repetition_penalty: float = Input(
            description="Penalty for repeated words in generated text; 1 is no penalty, values greater than 1 discourage repetition, less than 1 encourage it.",
            ge=0.01,
            le=5,
            default=1.15,
        ),
    ) -> Dict:
        with torch.no_grad():
            # Llama needs the prompt to be formatted in this manner
            # formatted_prompt = f"[INST] <<SYS>> {system_message} <</SYS>> {prompt} [/INST]"
            formatted_prompt = (
                f"{B_INST} {B_SYS} {DEFAULT_SYSTEM_PROMPT} {E_SYS} {prompt} {E_INST}"
            )
            # Tokenize the prompt and move it to whichever device holds the model,
            # so this also works when device_map="auto" falls back to the CPU
            input_ids = self.tokenizer(
                formatted_prompt, return_tensors="pt"
            ).input_ids.to(self.model.device)
            output = self.model.generate(
                inputs=input_ids,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                repetition_penalty=repetition_penalty,
                eos_token_id=self.tokenizer.eos_token_id,
            )
            # Decode the generated tokens and strip the echoed prompt
            llm_output = (
                self.tokenizer.decode(output[0], skip_special_tokens=True)
                .replace(formatted_prompt, "")
                .strip()
            )
        return {"output": llm_output}
The predict function takes in a few arguments:
- prompt: The instruction the user has typed in
- max_tokens: The maximum number of tokens the LLM should output
- top_p: Restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches top_p (nucleus sampling); lower values make the output less random.
- temperature: Also controls the randomness of the LLM, with higher temperatures resulting in more creative and unpredictable results.
- repetition_penalty: Counteracts the model's tendency to repeat text and get stuck in a loop
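To build some intuition for temperature and top_p, here is a simplified pure-Python sketch over a toy four-token distribution. It is an illustration of the two ideas only, not how the transformers library implements them, and the function names are made up for this example.

```python
import math
from typing import Dict, List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities; lower temperatures sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> Dict[int, float]:
    """Keep the smallest set of most likely tokens whose cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature, peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature, flatter
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7)
print(sharp, flat, nucleus)
```

Sampling would then draw the next token from the filtered, renormalized distribution.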
The formatted_prompt variable wraps the user's prompt in the format Llama expects. Language models operate on numbers rather than raw text, so a tokenizer is used to convert the text into a numerical representation (token IDs).
The token IDs, along with the other parameters, are passed to the model through the self.model.generate function. Like the input, the output of the model is a sequence of token IDs, so the same tokenizer is used to convert it back to text.
The decoded output for Code Llama contains the system prompt and the user prompt in addition to the completion. To make the output a bit cleaner, we remove the prompt by replacing it with an empty string.
Finally, we return the output as a dictionary, which cog serializes to a JSON object.
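The round trip from text to token IDs and back can be sketched with a toy word-level tokenizer. ToyTokenizer is invented for this example; real tokenizers such as the one loaded above split text into subword pieces. The sketch also shows why the output contains the prompt: generation returns the prompt IDs followed by the new IDs, so the completion can also be recovered by slicing off the prompt.

```python
from typing import List


class ToyTokenizer:
    """Word-level tokenizer for illustration only."""

    def __init__(self, corpus: str) -> None:
        words = sorted(set(corpus.split()))
        self.token_to_id = {word: i for i, word in enumerate(words)}
        self.id_to_token = {i: word for word, i in self.token_to_id.items()}

    def encode(self, text: str) -> List[int]:
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)


tokenizer = ToyTokenizer("write a function def add")
prompt_ids = tokenizer.encode("write a function")

# Pretend the model generated two new tokens after echoing the prompt.
output_ids = prompt_ids + tokenizer.encode("def add")

full_text = tokenizer.decode(output_ids)
completion = tokenizer.decode(output_ids[len(prompt_ids):])
print(full_text, "|", completion)
```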
Step 4: Defining the cog.yaml
The main thing we need to define inside the cog.yaml file is the list of Python dependencies along with their specific versions. Here is what it looks like for this project:
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md

build:
  # set to true if your model requires a GPU
  gpu: true
  cuda: "11.7"

  # python version in the form '3.11' or '3.11.4'
  python_version: "3.10"

  python_packages:
    - "accelerate==0.23.0"
    - "safetensors==0.3.3"
    - "scipy==1.10.1"
    - "torch==2.0.1"
    - "transformers==4.33.2"

# predict.py defines how predictions are run on your model
predict: "predict.py:Predictor"
Things to note:
- Under build, gpu: true means that the cog is requesting a GPU
- Python dependencies are listed under python_packages, and it's a good idea to pin them to a specific version as the AI space moves fast and packages break all the time.
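If the dependency list grows, a tiny check like the one below can flag entries that are not pinned to an exact version. The helper unpinned_packages is a hypothetical name, and the check is deliberately naive: it only looks for the == operator.

```python
from typing import List


def unpinned_packages(requirements: List[str]) -> List[str]:
    """Return the requirement strings that are not pinned with '=='."""
    return [req for req in requirements if "==" not in req]


python_packages = [
    "accelerate==0.23.0",
    "safetensors==0.3.3",
    "scipy==1.10.1",
    "torch==2.0.1",
    "transformers==4.33.2",
]
print(unpinned_packages(python_packages))
```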
Step 5: Building the cog
Great, now that the coding part is complete, we need to build our container. Fortunately, cog comes with some built-in commands that simplify the container build process.
Go to the root of the cog project and run the command cog build. This will build a Docker image with a default name of cog-<project-name>. Notice that we did not provide a Dockerfile: cog creates one dynamically when you run the build command.
Once the build is complete, you can push your image to a remote registry such as Docker Hub. Use the commands below:
docker tag cog-codellamacog <your-docker-username>/cog-codellamacog
docker push <your-docker-username>/cog-codellamacog
If you don't want to build and push your own image, you can just use mine in the next step.
Step 6: Running the container
Now that the container is ready to go, all that's left to do is run it. There are a variety of ways to run Docker containers, but to keep it simple I'll be using a service called Runpod. Runpod allows you to run any container on a GPU-enabled machine.
Head over to their website and create an account along with an API key. You will also need to install the runpod Python package via pip install runpod.
Open up a blank python script and paste in the following code:
import runpod
import requests

runpod.api_key = "YOUR-RUNPOD-API-KEY"

pod = runpod.create_pod(
    name="codellama-cog",
    image_name="htrivedi05/cog-codellamacog",
    gpu_type_id="NVIDIA RTX A5000",
    cloud_type="SECURE",
    gpu_count=1,
    volume_in_gb=10,
    container_disk_in_gb=50,
    ports="5000/http",
    volume_mount_path="/data"
)
print(pod)
Here we are creating a runpod deployment.
- image_name: Use the image you created in the previous step, or use mine: htrivedi05/cog-codellamacog
- gpu_type_id: Specifies which type of GPU we want to use. For the 7B version of Code Llama, any GPU with 24GB of VRAM is enough.
- cloud_type: SECURE means our service will run in a data center. If you choose COMMUNITY instead, your service will run on an individual compute provider.
- gpu_count: The number of GPUs we want for our service
- volume_in_gb: Size of the persistent volume, which is mounted at volume_mount_path
- container_disk_in_gb: Disk space for the container itself, which needs to be large enough to hold the image and the downloaded model weights
- ports: The port our service is accessible at. By default, cog exposes the container at port 5000 so we need to use that one.
If you run the code above, you should see a pod ID printed on your console. On the Runpod website, under the Pods section, you should be able to see the logs of the pod. First, you will see system-level logs of the deployment pulling your Docker image. Then you will see the container logs of the model being downloaded and the REST API server starting up.
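Rather than watching the logs by hand, you could poll until the service responds. The sketch below is a generic polling helper; wait_until_ready and the check callable are assumptions made for this example, and here the check is simulated so the snippet runs without a pod.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            # Connection errors are expected while the container is still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Simulated check: not ready on the first two attempts, ready on the third.
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), timeout_s=60, sleep=lambda _: None)
print(ready)
```

In practice, check would be a function that sends a request to your pod's proxy URL and returns True on a successful response. Which path reliably signals readiness is something to confirm against cog's HTTP documentation.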
Once the container log shows that the model is ready to receive requests, you can use the following code to send requests to the model.
prompt = "Can you write some python code for a snake game"
payload = {"input": {"prompt": prompt, "max_tokens": 128, "temperature": 0.75, "top_p": 1.0, "repetition_penalty": 1.15}}
headers = {'Content-Type': 'application/json'}
res = requests.post("https://<your-pod-id>-5000.proxy.runpod.net/predictions", headers=headers, json=payload)
print(res.json())
output = res.json().get("output")
print(output)
In the code above, replace <your-pod-id> with your pod ID.
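Since the model accepts each parameter only within a fixed range, it can help to validate a request before sending it. The sketch below is one way to do that; build_payload is a hypothetical helper, and the bounds are copied from the Input definitions in predict.py.

```python
from typing import Any, Dict


def build_payload(
    prompt: str,
    max_tokens: int = 128,
    temperature: float = 0.75,
    top_p: float = 1.0,
    repetition_penalty: float = 1.15,
) -> Dict[str, Any]:
    """Build the JSON body for a prediction request, checking parameter ranges first."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if not 0.01 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0.01 and 1.0")
    if not 0.01 <= temperature <= 5:
        raise ValueError("temperature must be between 0.01 and 5")
    if not 0.01 <= repetition_penalty <= 5:
        raise ValueError("repetition_penalty must be between 0.01 and 5")
    return {
        "input": {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "repetition_penalty": repetition_penalty,
        }
    }


payload = build_payload("Can you write some python code for a snake game")
print(payload)
```

The resulting dictionary can be passed as the json argument of requests.post, exactly like the hand-written payload above.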
Here is the output I got after sending the request:
{'output': 'Sure, here is an example of how you could implement a simple Snake game using Python:\n```\nimport turtle\n\n# Set up the screen and background color\nscreen = turtle.Screen()\nscreen.bgcolor("lightgreen")\n\n# Create a border around the screen\nborder_pen = turtle.Turtle()\nborder_pen.hideturtle()\nborder_pen.pensize(3)\nborder_pen.color("black")\nborder_pen.penup()\nborder_pen.goto(-200, -'}
Here is what the output looks like once formatted:
Sure, here is an example of how you could implement a simple Snake game using Python:
import turtle
# Set up the screen and background color
screen = turtle.Screen()
screen.bgcolor("lightgreen")
# Create a border around the screen
border_pen = turtle.Turtle()
border_pen.hideturtle()
border_pen.pensize(3)
border_pen.color("black")
border_pen.penup()
border_pen.goto(-200, -
Note that the response stops mid-line because max_tokens was set to 128. Feel free to play around with the input JSON by changing the prompt or the max_tokens parameter to get a variety of results.
Once you're done experimenting, you can shut down the runpod deployment using the following python commands:
runpod.stop_pod("<YOUR-RUNPOD-POD-ID>")
runpod.terminate_pod("<YOUR-RUNPOD-POD-ID>")
This will stop and terminate your deployment so you no longer get charged.
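To make sure a pod is always cleaned up, even if your experiment raises an error, you could wrap its lifetime in a context manager. The sketch below uses stand-in functions so it runs without a Runpod account; temporary_pod is a hypothetical helper, and it assumes the object returned by the create function is a dictionary with an "id" key.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


@contextmanager
def temporary_pod(
    create: Callable[[], Dict], terminate: Callable[[str], None]
) -> Iterator[Dict]:
    """Create a pod, hand it to the caller, and always terminate it afterwards."""
    pod = create()
    try:
        yield pod
    finally:
        # Runs on normal exit and when the body raises an exception.
        terminate(pod["id"])


# Stand-ins for the real create and terminate calls.
terminated = []
with temporary_pod(lambda: {"id": "fake-pod-id"}, terminated.append) as pod:
    print("working with", pod["id"])
print(terminated)
```

With the runpod package, the two arguments would be a function that calls runpod.create_pod with your settings and runpod.terminate_pod.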
Conclusion
With the rise of generative AI, there will be a greater number of AI models deployed into production. For some organizations, the gap between prototype and production can be quite large and it leads to a slower release cycle for AI products.
With a tool like Cog, it becomes much easier for developers to ship models to the cloud. Cog is designed for production-level workloads and works great with generative AI models. Once a container is built using cog, you can deploy it through any cloud vendor or service.
Cog is just one of many tools that developers can use to build and deploy AI models. As the ML infrastructure space continues to mature, we will get even better tools to help us build and manage AI models.
I hope you enjoyed this post. If you want to see more articles like this, please consider subscribing.