Prerequisites
Before diving into the deployment process, gather the necessary tokens and accept Google’s terms for the Gemma model. This step ensures that you have access to the model and are in compliance with its usage policies. The next section will guide you through setting up your Serverless Endpoint with Runpod.

Get started
To begin, we’ll deploy a vLLM Worker as a Serverless Endpoint. Runpod simplifies the process of running large language models, offering an alternative to more complex Docker and Kubernetes deployment methods. Follow these steps in the Runpod Serverless console to create your Endpoint.

- Log in to the Runpod Serverless console.
- Select + New Endpoint.
- Provide the following:
  i. Endpoint name.
  ii. Select a GPU.
  iii. Configure the number of Workers.
  iv. (optional) Select FlashBoot.
  v. Enter the vLLM Worker image: runpod/worker-vllm:stable-cuda11.8.0 or runpod/worker-vllm:stable-cuda12.1.0.
  vi. Specify enough storage for your model.
  vii. Add the following environment variables:
    a. MODEL_NAME: google/gemma-7b-it.
    b. HF_TOKEN: your Hugging Face API token for private models.
- Select Deploy.
Interact with your model
With the Endpoint up and running, it’s time to leverage its capabilities by sending requests to interact with the model. This section demonstrates how to use the OpenAI APIs to communicate with your model. In this example, you’ll create a Python chat bot using the OpenAI library; however, you can use any programming language and any library that supports HTTP requests.
Here’s how to get started:
Use the OpenAI class to interact with the model. The OpenAI class takes the following parameters:
- base_url: the base URL of the Serverless Endpoint.
- api_key: your Runpod API key.
Set your environment variables
Set RUNPOD_API_KEY to your Runpod API key and RUNPOD_BASE_URL to your Endpoint’s base URL. Your RUNPOD_BASE_URL will be in the form of https://api.runpod.ai/v2/${RUNPOD_ENDPOINT_ID}/openai/v1, where ${RUNPOD_ENDPOINT_ID} is the ID of your Serverless Endpoint.
Then create a client and use it to interact with the model. For example, you can use the chat.completions.create method to generate a response from the model.
Provide the following parameters to the chat.completions.create method:
- model: the model name.
- messages: a list of messages to send to the model.
- max_tokens: the maximum number of tokens to generate.
- temperature: the randomness of the generated text.
- top_p: the cumulative probability cutoff used for nucleus sampling.