
# Use Hugging Face models

> Integrate pre-trained Hugging Face models into your Serverless handler functions.

export const WorkerTooltip = () => {
  return <Tooltip headline="Worker" tip="A container that runs your application code and processes requests to your Serverless endpoint. Workers are automatically started and stopped by Runpod to handle traffic spikes and ensure optimal resource utilization." cta="Learn more about workers" href="/serverless/workers/overview">worker</Tooltip>;
};

export const HandlerFunctionTooltip = () => {
  return <Tooltip headline="Handler function" tip="The core of a Runpod Serverless application. These functions define how a worker processes incoming requests and returns results." cta="Learn more about handler functions" href="/serverless/workers/handler-functions">handler function</Tooltip>;
};

Hugging Face provides thousands of pre-trained models for natural language processing, computer vision, audio processing, and more. You can integrate these models into your <HandlerFunctionTooltip /> to deploy AI capabilities without training models from scratch.

This guide shows you how to load and use Hugging Face models in your Serverless handlers, using sentiment analysis as an example that you can adapt for other model types.

This guide covers two approaches:

* [Downloading models at runtime](#load-models-at-runtime) (simpler, good for development).
* [Using cached models](#use-cached-models) (recommended for production).

## Install dependencies

Your handler needs the `transformers` library to load Hugging Face models, and `torch` to run inference. Install both in your development environment:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
pip install torch transformers
```

When deploying to Runpod, you'll need to include these dependencies in your [Dockerfile](/serverless/workers/create-dockerfile) or requirements file.
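
Before building an image, it can help to confirm that the packages are actually installed in the environment you're testing in. The following is a minimal sketch that uses only the Python standard library. The `installed_versions` helper is a hypothetical name used for illustration, not part of the Runpod SDK, and `runpod` is included in the list only because the handler in this guide imports it.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["runpod", "torch", "transformers"])
    for name, version in report.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

The versions it prints are also a reasonable starting point for pinning entries in a requirements file.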

## Load models at runtime

Create a file named `handler.py` and follow these steps to build a handler that performs sentiment analysis using a Hugging Face model.

<Steps>
  <Step title="Import libraries">
    Start by importing the necessary libraries:

    ```python handler.py theme={"theme":{"light":"github-light","dark":"github-dark"}}
    import runpod
    from transformers import pipeline
    ```

    The `pipeline` function from the `transformers` library provides a simple interface for using pre-trained models. It handles tokenization, model inference, and post-processing automatically.
  </Step>

  <Step title="Load the model efficiently">
    Load your model outside the handler function to avoid reloading it on every request. This significantly improves performance by initializing the model only once when the <WorkerTooltip /> starts:

    ```python handler.py theme={"theme":{"light":"github-light","dark":"github-dark"}}
    # Load model once when worker starts
    model = pipeline(
        "sentiment-analysis",
        model="distilbert/distilbert-base-uncased-finetuned-sst-2-english"
    )
    ```

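    Here, `pipeline` is called with two arguments: the task type (such as `"sentiment-analysis"`, `"text-generation"`, or `"image-classification"`) and the identifier of a specific model on the Hugging Face model hub. If you omit `model`, the library falls back to a default model for the task, so specify the identifier explicitly for reproducible results.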
  </Step>

  <Step title="Define the handler function">
    Create a handler function that extracts input text from the request, validates it, runs inference, and returns results:

    ```python handler.py theme={"theme":{"light":"github-light","dark":"github-dark"}}
    def handler(job):
        # Extract input from the job
        job_input = job["input"]
        text = job_input.get("text")

        # Validate input
        if not text:
            return {"error": "No text provided for analysis."}

        # Run inference
        result = model(text)[0]

        # Return formatted results
        return {
            "sentiment": result["label"],
            "score": float(result["score"])
        }
    ```

    The handler follows Runpod's standard pattern: extract input, validate it, process it, and return results. The pipeline returns a list of predictions, one per input, so we take the first result with `[0]` and extract the label and confidence score.
  </Step>

  <Step title="Start the Serverless worker">
    Add this line at the end of your file to register the handler and start the worker:

    ```python handler.py theme={"theme":{"light":"github-light","dark":"github-dark"}}
    runpod.serverless.start({"handler": handler})
    ```
  </Step>
</Steps>

### Complete implementation

Here's the complete code:

```python handler.py theme={"theme":{"light":"github-light","dark":"github-dark"}}
import runpod
from transformers import pipeline

# Load model once when worker starts
model = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english"
)

def handler(job):
    # Extract input from the job
    job_input = job["input"]
    text = job_input.get("text")

    # Validate input
    if not text:
        return {"error": "No text provided for analysis."}

    # Run inference
    result = model(text)[0]

    # Return formatted results
    return {
        "sentiment": result["label"],
        "score": float(result["score"])
    }

runpod.serverless.start({"handler": handler})
```
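
Because the model is loaded at module level, importing `handler.py` in a unit test also loads (and possibly downloads) the model. One way around this, sketched below, is to build the handler from a factory that accepts any callable with the same call shape as the sentiment pipeline. The `make_handler` and `fake_model` names are hypothetical and exist only for this illustration.

```python
def make_handler(model):
    """Build a handler around any callable that maps a string to a list of
    {"label": ..., "score": ...} dicts, as the sentiment pipeline does."""

    def handler(job):
        text = job["input"].get("text")
        if not text:
            return {"error": "No text provided for analysis."}

        result = model(text)[0]
        return {"sentiment": result["label"], "score": float(result["score"])}

    return handler


def fake_model(text):
    """Stand-in for the real pipeline so tests run without model weights."""
    label = "POSITIVE" if "wonderful" in text else "NEGATIVE"
    return [{"label": label, "score": 0.99}]


handler = make_handler(fake_model)
print(handler({"input": {"text": "This is absolutely wonderful and amazing!"}}))
print(handler({"input": {}}))
```

In the real worker you would pass the loaded pipeline instead of the stub, for example `runpod.serverless.start({"handler": make_handler(model)})`.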

### Test locally

Create a test input file to verify your handler works correctly:

```json test_input.json theme={"theme":{"light":"github-light","dark":"github-dark"}}
{
  "input": {
    "text": "This is absolutely wonderful and amazing!"
  }
}
```

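Run your handler locally from the directory that contains `test_input.json`. When started without arguments, the Runpod SDK uses that file as the job input. (Passing `--rp_server_api` instead starts a local API server for sending test requests over HTTP.)

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
python handler.py
```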

You should see output indicating successful sentiment analysis:

```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
--- Starting Serverless Worker |  Version 1.6.2 ---
INFO   | Using test_input.json as job input.
DEBUG  | Retrieved local job: {'input': {'text': 'This is absolutely wonderful and amazing!'}, 'id': 'local_test'}
INFO   | local_test | Started.
DEBUG  | local_test | Handler output: {'sentiment': 'POSITIVE', 'score': 0.999880313873291}
INFO   | Job local_test completed successfully.
```

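The first time you run this, the `transformers` library downloads the model files from the Hugging Face Hub and stores them in its local cache (by default under `~/.cache/huggingface`). Subsequent runs on the same machine reuse the locally cached files.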

### Adapt for other models

This pattern works for any model that the `transformers` `pipeline` function supports. To use a different model:

1. **Choose your model**: Browse the [Hugging Face model hub](https://huggingface.co/models) to find a model for your task.

2. **Update the pipeline**: Change the task type and model identifier:

   ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
   # Text generation example
   model = pipeline("text-generation", model="gpt2")

   # Image classification example
   model = pipeline("image-classification", model="google/vit-base-patch16-224")

   # Translation example
   model = pipeline("translation_en_to_fr", model="t5-base")
   ```

3. **Adjust input/output handling**: Different models expect different input formats and return different output structures. Check the model's documentation on Hugging Face to understand its API.
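
As an illustration of step 3, the sketch below normalizes the results of a few common tasks into JSON-friendly dictionaries. The `format_output` helper is hypothetical, and the sample values are hand-written stand-ins that mimic the result shapes these pipelines return for a single input; check the model card for the model you actually deploy.

```python
def format_output(task, raw):
    """Convert a pipeline result for one input into a JSON-friendly dict."""
    if task in ("sentiment-analysis", "text-classification"):
        top = raw[0]
        return {"label": top["label"], "score": float(top["score"])}
    if task == "text-generation":
        return {"text": raw[0]["generated_text"]}
    if task.startswith("translation"):
        return {"text": raw[0]["translation_text"]}
    if task == "image-classification":
        return {
            "predictions": [
                {"label": p["label"], "score": float(p["score"])} for p in raw
            ]
        }
    raise ValueError(f"Unsupported task: {task}")


# Hand-written samples shaped like real pipeline results (not real model output).
samples = {
    "sentiment-analysis": [{"label": "POSITIVE", "score": 0.99}],
    "text-generation": [{"generated_text": "Once upon a time there was"}],
    "translation_en_to_fr": [{"translation_text": "Bonjour le monde"}],
    "image-classification": [
        {"label": "tabby cat", "score": 0.8},
        {"label": "tiger cat", "score": 0.1},
    ],
}

for task, raw in samples.items():
    print(task, "->", format_output(task, raw))
```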

## Use cached models

The example above downloads models when workers start, which works fine for development and testing.

For production endpoints, we highly recommend using [cached models](/serverless/endpoints/model-caching) instead. Cached models provide faster cold starts (seconds instead of minutes) and eliminate charges for model download time.

### Enable model caching

To enable cached models on your endpoint:

<Steps>
  <Step title="Open endpoint settings">
    Navigate to the [Serverless section](https://www.console.runpod.io/serverless) of the console. Either create a new endpoint or select **Manage → Edit Endpoint** on an existing one.
  </Step>

  <Step title="Configure the model">
    Scroll to the **Model** field and enter your Hugging Face model identifier.

    For example: `distilbert/distilbert-base-uncased-finetuned-sst-2-english`
  </Step>

  <Step title="Deploy">
    Save your endpoint configuration. Runpod will automatically cache the model and make it available to your workers.
  </Step>
</Steps>

### Locate cached models

Cached models are stored at `/runpod-volume/huggingface-cache/hub/` following Hugging Face cache conventions. Add this helper function to your handler to resolve the correct snapshot path:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
import os

HF_CACHE_ROOT = "/runpod-volume/huggingface-cache/hub"


def resolve_snapshot_path(model_id: str) -> str:
    """
    Resolve the local snapshot path for a cached model.

    Args:
        model_id: The model name from Hugging Face
            (e.g., 'distilbert/distilbert-base-uncased-finetuned-sst-2-english')

    Returns:
        The full path to the cached model snapshot
    """
    if "/" not in model_id:
        raise ValueError(f"model_id '{model_id}' must be in 'org/name' format")

    org, name = model_id.split("/", 1)
    model_root = os.path.join(HF_CACHE_ROOT, f"models--{org}--{name}")
    refs_main = os.path.join(model_root, "refs", "main")
    snapshots_dir = os.path.join(model_root, "snapshots")

    # Read the snapshot hash from refs/main
    if os.path.isfile(refs_main):
        with open(refs_main, "r") as f:
            snapshot_hash = f.read().strip()
        candidate = os.path.join(snapshots_dir, snapshot_hash)
        if os.path.isdir(candidate):
            return candidate

    # Fall back to first available snapshot
    if os.path.isdir(snapshots_dir):
        versions = [
            d for d in os.listdir(snapshots_dir)
            if os.path.isdir(os.path.join(snapshots_dir, d))
        ]
        if versions:
            versions.sort()
            return os.path.join(snapshots_dir, versions[0])

    raise RuntimeError(f"Cached model not found: {model_id}")
```
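
To see the directory layout this lookup depends on without deploying anything, you can build a throwaway cache in a temporary directory. The sketch below repeats the helper's logic with the cache root passed in as a parameter (an adaptation made here for testability) and adds a hypothetical `build_fake_cache` function that creates empty directories rather than real model files.

```python
import os
import tempfile


def resolve_snapshot_path(model_id, cache_root):
    """Same lookup as the helper above, with the cache root as a parameter."""
    org, name = model_id.split("/", 1)
    model_root = os.path.join(cache_root, f"models--{org}--{name}")
    refs_main = os.path.join(model_root, "refs", "main")
    snapshots_dir = os.path.join(model_root, "snapshots")

    # Prefer the snapshot that refs/main points to.
    if os.path.isfile(refs_main):
        with open(refs_main) as f:
            candidate = os.path.join(snapshots_dir, f.read().strip())
        if os.path.isdir(candidate):
            return candidate

    # Otherwise fall back to the first snapshot directory in sorted order.
    if os.path.isdir(snapshots_dir):
        versions = sorted(
            d for d in os.listdir(snapshots_dir)
            if os.path.isdir(os.path.join(snapshots_dir, d))
        )
        if versions:
            return os.path.join(snapshots_dir, versions[0])

    raise RuntimeError(f"Cached model not found: {model_id}")


def build_fake_cache(cache_root, model_id, snapshot_hash):
    """Create the directories the resolver expects, with no real weights."""
    org, name = model_id.split("/", 1)
    model_root = os.path.join(cache_root, f"models--{org}--{name}")
    os.makedirs(os.path.join(model_root, "refs"))
    os.makedirs(os.path.join(model_root, "snapshots", snapshot_hash))
    with open(os.path.join(model_root, "refs", "main"), "w") as f:
        f.write(snapshot_hash + "\n")


MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

with tempfile.TemporaryDirectory() as cache_root:
    build_fake_cache(cache_root, MODEL_ID, "abc123")
    resolved = resolve_snapshot_path(MODEL_ID, cache_root)
    print(os.path.relpath(resolved, cache_root))
```

The printed path shows the `models--<org>--<name>/snapshots/<hash>` structure that the resolver walks.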

### Adapt your handler for cached models

Once model caching is enabled, you need to update your handler to load the model from the local cache instead of downloading it. Here's how the code changes:

<Tabs>
  <Tab title="Runtime download">
    ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
    from transformers import pipeline

    # Downloads model when worker starts
    model = pipeline(
        "sentiment-analysis",
        model="distilbert/distilbert-base-uncased-finetuned-sst-2-english"
    )
    ```
  </Tab>

  <Tab title="Cached model">
    ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
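    import os

    # Force offline mode to prevent accidental downloads.
    # Set these before importing transformers: the Hugging Face
    # libraries read them at import time.
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"

    from transformers import pipeline

    # Resolve the cached model path
    # (resolve_snapshot_path is the helper defined in the previous section)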
    LOCAL_PATH = resolve_snapshot_path(
        "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
    )

    # Load from local cache
    model = pipeline(
        "sentiment-analysis",
        model=LOCAL_PATH,
        local_files_only=True
    )
    ```
  </Tab>
</Tabs>

The key differences are:

* **Offline mode**: Setting `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` prevents accidental downloads if the model isn't cached.
* **Local path**: Instead of a model identifier, you pass the resolved local path to the cached model files.
* **local\_files\_only**: This flag tells the transformers library to only use local files.
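
If you want a single handler file to run both on your development machine (where nothing is cached) and on an endpoint with model caching enabled, one option is to choose the pipeline arguments at startup. The sketch below assumes a resolver that raises `RuntimeError` when the model isn't cached, as `resolve_snapshot_path` in this guide does. The `pipeline_kwargs` name and the stub resolvers are hypothetical. It deliberately leaves the offline environment variables unset, since forcing offline mode would make the fallback download fail.

```python
def pipeline_kwargs(model_id, resolver):
    """Return keyword arguments for transformers.pipeline().

    `resolver` is any callable that returns a local snapshot path, or raises
    RuntimeError when the model is not in the cache.
    """
    try:
        local_path = resolver(model_id)
    except RuntimeError:
        # Not cached: fall back to downloading by identifier (development).
        return {"model": model_id}
    # Cached: load from disk only, mirroring the arguments used in this guide.
    return {"model": local_path, "local_files_only": True}


# Stub resolvers that simulate the two situations.
def cached_resolver(model_id):
    return "/runpod-volume/huggingface-cache/hub/example-snapshot"


def missing_resolver(model_id):
    raise RuntimeError(f"Cached model not found: {model_id}")


MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
print(pipeline_kwargs(MODEL_ID, cached_resolver))
print(pipeline_kwargs(MODEL_ID, missing_resolver))
```

In a handler, you would then load the model with `pipeline("sentiment-analysis", **pipeline_kwargs(MODEL_ID, resolve_snapshot_path))`. For production endpoints, the strict offline setup shown in the tabs is the safer default because a missing cache fails loudly instead of triggering a download.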

### Complete cached implementation

Here's the complete handler using cached models:

```python handler.py theme={"theme":{"light":"github-light","dark":"github-dark"}}
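import os

# Force offline mode to use only cached models.
# Set these before importing transformers: the Hugging Face
# libraries read them at import time.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

import runpod
from transformers import pipeline

MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
HF_CACHE_ROOT = "/runpod-volume/huggingface-cache/hub"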



def resolve_snapshot_path(model_id: str) -> str:
    """Resolve the local snapshot path for a cached model."""
    if "/" not in model_id:
        raise ValueError(f"model_id '{model_id}' must be in 'org/name' format")

    org, name = model_id.split("/", 1)
    model_root = os.path.join(HF_CACHE_ROOT, f"models--{org}--{name}")
    refs_main = os.path.join(model_root, "refs", "main")
    snapshots_dir = os.path.join(model_root, "snapshots")

    if os.path.isfile(refs_main):
        with open(refs_main, "r") as f:
            snapshot_hash = f.read().strip()
        candidate = os.path.join(snapshots_dir, snapshot_hash)
        if os.path.isdir(candidate):
            return candidate

    if os.path.isdir(snapshots_dir):
        versions = [
            d for d in os.listdir(snapshots_dir)
            if os.path.isdir(os.path.join(snapshots_dir, d))
        ]
        if versions:
            versions.sort()
            return os.path.join(snapshots_dir, versions[0])

    raise RuntimeError(f"Cached model not found: {model_id}")


# Load model once when worker starts
LOCAL_PATH = resolve_snapshot_path(MODEL_ID)
model = pipeline("sentiment-analysis", model=LOCAL_PATH, local_files_only=True)


def handler(job):
    job_input = job["input"]
    text = job_input.get("text")

    if not text:
        return {"error": "No text provided for analysis."}

    result = model(text)[0]

    return {
        "sentiment": result["label"],
        "score": float(result["score"])
    }


runpod.serverless.start({"handler": handler})
```

<Tip>
  For a complete walkthrough including Dockerfile creation and deployment, see the [cached model tutorial](/tutorials/serverless/model-caching-text).
</Tip>

## Other best practices

When deploying Hugging Face models to production endpoints, keep these additional considerations in mind:

* **Model size**: Larger models require more VRAM and take longer to load. Choose the smallest model that meets your accuracy requirements.

* **GPU utilization**: Most Hugging Face models run faster on GPUs. Ensure your endpoint uses GPU workers for optimal performance.

* **Batch processing**: If your model supports batching, process multiple inputs together to improve throughput.
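
To illustrate the batch processing point, here is a minimal sketch of a handler that accepts either a single `text` or a list under a `texts` key (an input field invented for this example) and passes the whole list to the model in one call. Transformers pipelines accept a list of inputs and a `batch_size` argument at call time; a suitable value depends on your model and available GPU memory. The `make_batch_handler` and `fake_model` names are hypothetical, and the stub stands in for a real pipeline.

```python
def make_batch_handler(model, batch_size=8):
    """Build a handler that accepts "text" (str) or "texts" (list of str)."""

    def handler(job):
        job_input = job["input"]
        texts = job_input.get("texts") or job_input.get("text")
        if isinstance(texts, str):
            texts = [texts]

        # Require a non-empty list of non-empty strings.
        if (
            not isinstance(texts, list)
            or not texts
            or not all(isinstance(t, str) and t for t in texts)
        ):
            return {"error": "Provide 'text' or a non-empty list in 'texts'."}

        # One call for all inputs lets the pipeline group them into batches.
        results = model(texts, batch_size=batch_size)
        return {
            "results": [
                {"sentiment": r["label"], "score": float(r["score"])}
                for r in results
            ]
        }

    return handler


def fake_model(texts, batch_size=1):
    """Stand-in for a sentiment pipeline called with a list of strings."""
    return [
        {"label": "POSITIVE" if "good" in t else "NEGATIVE", "score": 0.9}
        for t in texts
    ]


handler = make_batch_handler(fake_model)
print(handler({"input": {"texts": ["good day", "bad day"]}}))
```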

## Next steps

* Learn more about [how cached models work](/serverless/endpoints/model-caching).
* [Create a Dockerfile](/serverless/workers/create-dockerfile) to package your handler with its dependencies.
* [Deploy your worker](/serverless/workers/deploy) to a Runpod endpoint.
* Explore [optimization techniques](/serverless/development/optimization) to improve performance.
