- Downloading models at runtime (simpler, good for development).
- Using cached models (recommended for production).
Install dependencies
Your handler needs the transformers library to load Hugging Face models, and torch to run inference. Install both in your development environment:
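A typical install command for these dependencies (adding the runpod SDK, which the handler code below assumes, alongside the two libraries named above):

```shell
# Install the inference dependencies plus the Runpod serverless SDK
pip install transformers torch runpod
```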
Load models at runtime
Create a file named handler.py and follow these steps to build a handler that performs sentiment analysis using a Hugging Face model.
Import libraries
Start by importing the necessary libraries:
handler.py
The pipeline function from the transformers library provides a simple interface for using pre-trained models. It handles tokenization, model inference, and post-processing automatically.
Load the model efficiently
Load your model outside the handler function to avoid reloading it on every request. This significantly improves performance by initializing the model only once when the worker starts:
handler.py
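Module-level loading for this step might look like the following sketch (the sentiment model identifier is the one used as an example later in this guide):

```python
from transformers import pipeline

# Initialized once at import time, outside the handler,
# so every request reuses the same model instance.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)
```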
The pipeline function takes two arguments: the task type (like "sentiment-analysis", "text-generation", or "image-classification") and the specific model identifier from the Hugging Face model hub.
Define the handler function
Create a handler function that extracts input text from the request, validates it, runs inference, and returns results:
handler.py
The handler follows Runpod’s standard pattern: extract input, validate it, process it, and return results. The model returns a list of predictions, so we take the first result with [0] and extract the label and confidence score.
Complete implementation
Here’s the complete code:
handler.py
Test locally
Create a test input file to verify your handler works correctly:
test_input.json
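A matching test file (the text value is arbitrary) could be:

```json
{
  "input": {
    "text": "This is the best product I have ever used!"
  }
}
```

With the runpod SDK installed, running `python handler.py` locally should pick up test_input.json and print the handler's output.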
Adapt for other models
This pattern works for any Hugging Face model. To use a different model:
- Choose your model: Browse the Hugging Face model hub to find a model for your task.
- Update the pipeline: Change the task type and model identifier:
- Adjust input/output handling: Different models expect different input formats and return different output structures. Check the model’s documentation on Hugging Face to understand its API.
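For example, switching from sentiment analysis to text generation only changes the pipeline call; the model shown is an arbitrary small model from the hub:

```python
from transformers import pipeline

# Swap the task type and model identifier; the rest of the
# handler pattern stays the same, apart from input/output keys.
generator = pipeline("text-generation", model="distilgpt2")
```

Note that the output structure differs: this pipeline returns a list of dicts with a "generated_text" key rather than a label and score.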
Use cached models
The example above downloads models when workers start, which works fine for development and testing. For production endpoints, we highly recommend using cached models instead. Cached models provide faster cold starts (seconds instead of minutes) and eliminate charges for model download time.
Enable model caching
To enable cached models on your endpoint:
Open endpoint settings
Navigate to the Serverless section of the console. Either create a new endpoint or select Manage → Edit Endpoint on an existing one.
Configure the model
Scroll to the Model field and enter your Hugging Face model identifier. For example:
distilbert/distilbert-base-uncased-finetuned-sst-2-english
Locate cached models
Cached models are stored at /runpod-volume/huggingface-cache/hub/ following Hugging Face cache conventions. Add this helper function to your handler to resolve the correct snapshot path:
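A sketch of such a helper, walking the standard Hugging Face cache layout (`models--{org}--{name}/snapshots/{revision}/`); the function name is illustrative:

```python
import os

# Base path where cached models live on the Runpod volume (see above)
CACHE_DIR = "/runpod-volume/huggingface-cache/hub"


def get_model_path(model_id: str, cache_dir: str = CACHE_DIR) -> str:
    """Resolve the local snapshot directory for a cached model.

    Hugging Face stores each model under
    <cache>/models--{org}--{name}/snapshots/<revision>/.
    """
    model_dir = os.path.join(cache_dir, "models--" + model_id.replace("/", "--"))
    snapshots_dir = os.path.join(model_dir, "snapshots")
    if not os.path.isdir(snapshots_dir):
        raise FileNotFoundError(f"No cached model found at {model_dir}")
    revisions = sorted(os.listdir(snapshots_dir))
    if not revisions:
        raise FileNotFoundError(f"No snapshots found under {snapshots_dir}")
    # Use the first revision present (typically there is only one)
    return os.path.join(snapshots_dir, revisions[0])
```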
Adapt your handler for cached models
Once model caching is enabled, you need to update your handler to load the model from the local cache instead of downloading it. Here’s how the code changes:
- Runtime download
- Cached model
- Offline mode: Setting HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE prevents accidental downloads if the model isn’t cached.
- Local path: Instead of a model identifier, you pass the resolved local path to the cached model files.
- local_files_only: This flag tells the transformers library to only use local files.
Complete cached implementation
Here’s the complete handler using cached models:
handler.py
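A sketch of the cached variant, combining the offline-mode settings, the path-resolution helper, and the local_files_only flag described above. It only runs inside a Runpod worker with the model cache mounted at /runpod-volume; names and key structure are assumptions consistent with this guide:

```python
import os

# Force offline mode before transformers is imported, so a missing
# cache entry fails fast instead of silently downloading.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

import runpod
from transformers import pipeline

CACHE_DIR = "/runpod-volume/huggingface-cache/hub"
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"


def get_model_path(model_id: str, cache_dir: str = CACHE_DIR) -> str:
    """Resolve the local snapshot directory for a cached model."""
    model_dir = os.path.join(cache_dir, "models--" + model_id.replace("/", "--"))
    snapshots_dir = os.path.join(model_dir, "snapshots")
    revisions = sorted(os.listdir(snapshots_dir))
    return os.path.join(snapshots_dir, revisions[0])


# Pass the resolved local path instead of the hub identifier, and
# tell transformers to use only local files.
classifier = pipeline(
    "sentiment-analysis",
    model=get_model_path(MODEL_ID),
    model_kwargs={"local_files_only": True},
)


def handler(event):
    text = event.get("input", {}).get("text")
    if not text:
        return {"error": "No text provided."}
    prediction = classifier(text)[0]
    return {
        "sentiment": prediction["label"],
        "confidence": round(prediction["score"], 4),
    }


if __name__ == "__main__":
    runpod.serverless.start({"handler": handler})
```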
Other best practices
When deploying Hugging Face models to production endpoints, keep these additional considerations in mind:
- Model size: Larger models require more VRAM and take longer to load. Choose the smallest model that meets your accuracy requirements.
- GPU utilization: Most Hugging Face models run faster on GPUs. Ensure your endpoint uses GPU workers for optimal performance.
- Batch processing: If your model supports batching, process multiple inputs together to improve throughput.
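To illustrate the batching point above: a transformers pipeline accepts a list of inputs and can process them together in one call. A sketch, where the handler shape and the "texts" key are assumptions:

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)


def handler(event):
    texts = event.get("input", {}).get("texts")
    if not isinstance(texts, list) or not texts:
        return {"error": "Provide a non-empty 'texts' list."}
    # One batched call instead of a Python loop over single inputs;
    # batch_size controls how many inputs the model sees at once.
    predictions = classifier(texts, batch_size=8)
    return {
        "results": [
            {"sentiment": p["label"], "confidence": round(p["score"], 4)}
            for p in predictions
        ]
    }
```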
Next steps
- Learn more about how cached models work.
- Create a Dockerfile to package your handler with its dependencies.
- Deploy your worker to a Runpod endpoint.
- Explore optimization techniques to improve performance.