Run an Ollama server on CPU for LLM inference. This tutorial focuses on CPU compute, but you can also select a GPU for faster performance.

What you’ll learn

  • Deploy an Ollama container as a Serverless endpoint.
  • Configure a network volume to cache models and reduce cold start times.
  • Send inference requests to your Ollama endpoint.

Requirements

Before starting, you’ll need:
  • A Runpod account with credits.
  • (Optional) A network volume to store models.

Step 1: Deploy a Serverless endpoint

We recommend attaching a network volume to store downloaded models. Without a network volume, the worker downloads the model on every cold start, increasing latency. You can attach a network volume to your endpoint after it’s deployed.
  1. Log in to the Runpod console.
  2. Navigate to Serverless and select New Endpoint.
  3. Choose CPU and select a configuration (for example, 8 vCPUs and 16 GB RAM).
  4. Configure your worker settings as needed.
  5. In the Container Image field, enter: pooyaharatian/runpod-ollama:0.0.8
  6. In the Container Start Command field, enter the model name (for example, orca-mini or llama3.1). See the Ollama library for available models.
  7. Allocate at least 20 GB of container disk space.
  8. (Optional) Add an environment variable with key OLLAMA_MODELS and value /runpod-volume to store models on your attached network volume.
  9. Select Deploy.
Wait for the model to download and the worker to become ready.
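
If you prefer to check readiness programmatically rather than watching the console, the sketch below polls the endpoint's health route. This is a minimal example, assuming placeholder values for the endpoint ID and API key that you'd replace with your own.

import os
import requests

# Assumed placeholders: replace with your endpoint ID and Runpod API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ.get("RUNPOD_API_KEY", "your-api-key")

# The /health route reports worker and job counts for a Serverless endpoint.
resp = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # worker and job status for the endpoint

Once at least one worker reports ready, the endpoint can accept requests.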

Step 2: Send a request

Once your endpoint is deployed:
  1. Go to the Requests section in the Runpod console.
  2. Enter the following JSON in the input field:
    {
      "input": {
        "method_name": "generate",
        "input": {
          "prompt": "Why is the sky blue?"
        }
      }
    }
    
  3. Select Run.
You’ll receive a response like this:
{
  "delayTime": 153,
  "executionTime": 4343,
  "id": "c2cb6af5-c822-4950-bca9-5349288c001d-u1",
  "output": {
    "model": "orca-mini",
    "response": "The sky appears blue because of a process called scattering...",
    "done": true
  },
  "status": "COMPLETED"
}
Your Ollama endpoint is now ready to integrate into your applications using the Runpod API.
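
As a starting point for that integration, here is a minimal Python sketch that sends the same request through Runpod's /runsync route. The endpoint ID and API key are placeholder assumptions; the payload mirrors the JSON shown above.

import os
import requests

# Assumed placeholders: replace with your endpoint ID and Runpod API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ.get("RUNPOD_API_KEY", "your-api-key")

payload = {
    "input": {
        "method_name": "generate",
        "input": {
            "prompt": "Why is the sky blue?"
        }
    }
}

# /runsync waits for the job to finish and returns the result directly.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
result = resp.json()
print(result["output"]["response"])

For long-running generations, you can instead submit to /run and poll /status/{id} so your application isn't blocked while the worker generates the response.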

Next steps