- Downloading models at runtime (simpler, good for development).
- Using cached models (recommended for production).
Install dependencies
Your handler needs the transformers library to load Hugging Face models, and torch to run inference. Install both in your development environment:
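A typical install command for these dependencies (adding the runpod SDK, which the handler code below assumes, alongside the two libraries named above):

```shell
# Install the inference dependencies plus the Runpod serverless SDK
pip install transformers torch runpod
```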
Load models at runtime
Create a file named handler.py and follow these steps to build a handler that performs sentiment analysis using a Hugging Face model.
Import libraries
Start by importing the necessary libraries:
handler.py
The pipeline function from the transformers library provides a simple interface for using pre-trained models. It handles tokenization, model inference, and post-processing automatically.
Load the model efficiently
Load your model outside the handler function to avoid reloading it on every request. This significantly improves performance by initializing the model only once when the worker starts:
handler.py
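Module-level loading for this step might look like the following sketch (the sentiment model identifier is the one used as an example later in this guide):

```python
from transformers import pipeline

# Initialized once at import time, outside the handler,
# so every request reuses the same model instance.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)
```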
The pipeline function takes two arguments: the task type (like "sentiment-analysis", "text-generation", or "image-classification") and the specific model identifier from the Hugging Face model hub.
Define the handler function
Create a handler function that extracts input text from the request, validates it, runs inference, and returns results:
handler.py
The handler follows Runpod’s standard pattern: extract input, validate it, process it, and return results. The model returns a list of predictions, so we take the first result with [0] and extract the label and confidence score.
Complete implementation
Here’s the complete code:
handler.py
Test locally
Create a test input file to verify your handler works correctly:
test_input.json
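A matching test file (the text value is arbitrary) could be:

```json
{
  "input": {
    "text": "This is the best product I have ever used!"
  }
}
```

With the runpod SDK installed, running `python handler.py` locally should pick up test_input.json and print the handler's output.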
Adapt for other models
This pattern works for any Hugging Face model. To use a different model:
- Choose your model: Browse the Hugging Face model hub to find a model for your task.
- Update the pipeline: Change the task type and model identifier:
- Adjust input/output handling: Different models expect different input formats and return different output structures. Check the model’s documentation on Hugging Face to understand its API.
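For example, switching from sentiment analysis to text generation only changes the pipeline call; the model shown is an arbitrary small model from the hub:

```python
from transformers import pipeline

# Swap the task type and model identifier; the rest of the
# handler pattern stays the same, apart from input/output keys.
generator = pipeline("text-generation", model="distilgpt2")
```

Note that the output structure differs: this pipeline returns a list of dicts with a "generated_text" key rather than a label and score.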
Use cached models
The example above downloads models when workers start, which works fine for development and testing. For production endpoints, we highly recommend using cached models instead. Cached models provide faster cold starts (seconds instead of minutes) and eliminate charges for model download time.
Enable model caching
To enable cached models on your endpoint:
Open endpoint settings
Navigate to the Serverless section of the console. Either create a new endpoint or select Manage → Edit Endpoint on an existing one.
Configure the model
Scroll to the Model field and enter your Hugging Face model identifier. For example:
distilbert/distilbert-base-uncased-finetuned-sst-2-english
Locate cached models
Cached models are stored at /runpod-volume/huggingface-cache/hub/ following Hugging Face cache conventions. Add this helper function to your handler to resolve the correct snapshot path:
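A sketch of such a helper, walking the standard Hugging Face cache layout (`models--{org}--{name}/snapshots/{revision}/`); the function name is illustrative:

```python
import os

# Base path where cached models live on the Runpod volume (see above)
CACHE_DIR = "/runpod-volume/huggingface-cache/hub"


def get_model_path(model_id: str, cache_dir: str = CACHE_DIR) -> str:
    """Resolve the local snapshot directory for a cached model.

    Hugging Face stores each model under
    <cache>/models--{org}--{name}/snapshots/<revision>/.
    """
    model_dir = os.path.join(cache_dir, "models--" + model_id.replace("/", "--"))
    snapshots_dir = os.path.join(model_dir, "snapshots")
    if not os.path.isdir(snapshots_dir):
        raise FileNotFoundError(f"No cached model found at {model_dir}")
    revisions = sorted(os.listdir(snapshots_dir))
    if not revisions:
        raise FileNotFoundError(f"No snapshots found under {snapshots_dir}")
    # Use the first revision present (typically there is only one)
    return os.path.join(snapshots_dir, revisions[0])
```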
Adapt your handler for cached models
Once model caching is enabled, you need to update your handler to load the model from the local cache instead of downloading it. Here’s how the code changes:
- Runtime download
- Cached model
- Offline mode: Setting HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE prevents accidental downloads if the model isn’t cached.
- Local path: Instead of a model identifier, you pass the resolved local path to the cached model files.
- local_files_only: This flag tells the transformers library to only use local files.
Complete cached implementation
Here’s the complete handler using cached models:
handler.py
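A sketch of the cached variant, combining the offline-mode settings, the path-resolution helper, and the local_files_only flag described above. It only runs inside a Runpod worker with the model cache mounted at /runpod-volume; names and key structure are assumptions consistent with this guide:

```python
import os

# Force offline mode before transformers is imported, so a missing
# cache entry fails fast instead of silently downloading.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

import runpod
from transformers import pipeline

CACHE_DIR = "/runpod-volume/huggingface-cache/hub"
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"


def get_model_path(model_id: str, cache_dir: str = CACHE_DIR) -> str:
    """Resolve the local snapshot directory for a cached model."""
    model_dir = os.path.join(cache_dir, "models--" + model_id.replace("/", "--"))
    snapshots_dir = os.path.join(model_dir, "snapshots")
    revisions = sorted(os.listdir(snapshots_dir))
    return os.path.join(snapshots_dir, revisions[0])


# Pass the resolved local path instead of the hub identifier, and
# tell transformers to use only local files.
classifier = pipeline(
    "sentiment-analysis",
    model=get_model_path(MODEL_ID),
    model_kwargs={"local_files_only": True},
)


def handler(event):
    text = event.get("input", {}).get("text")
    if not text:
        return {"error": "No text provided."}
    prediction = classifier(text)[0]
    return {
        "sentiment": prediction["label"],
        "confidence": round(prediction["score"], 4),
    }


if __name__ == "__main__":
    runpod.serverless.start({"handler": handler})
```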
Other best practices
When deploying Hugging Face models to production endpoints, keep these additional considerations in mind:
- Model size: Larger models require more VRAM and take longer to load. Choose the smallest model that meets your accuracy requirements.
- GPU utilization: Most Hugging Face models run faster on GPUs. Ensure your endpoint uses GPU workers for optimal performance.
- Batch processing: If your model supports batching, process multiple inputs together to improve throughput.
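To illustrate the batching point above: a transformers pipeline accepts a list of inputs and can process them together in one call. A sketch, where the handler shape and the "texts" key are assumptions:

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)


def handler(event):
    texts = event.get("input", {}).get("texts")
    if not isinstance(texts, list) or not texts:
        return {"error": "Provide a non-empty 'texts' list."}
    # One batched call instead of a Python loop over single inputs;
    # batch_size controls how many inputs the model sees at once.
    predictions = classifier(texts, batch_size=8)
    return {
        "results": [
            {"sentiment": p["label"], "confidence": round(p["score"], 4)}
            for p in predictions
        ]
    }
```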
Next steps
- Learn more about how cached models work.
- Create a Dockerfile to package your handler with its dependencies.
- Deploy your worker to a Runpod endpoint.
- Explore optimization techniques to improve performance.