# Cached models

> Accelerate worker cold starts and reduce costs by using cached models.

export const InferenceTooltip = () => {
  return <Tooltip headline="AI inference" tip="The execution phase where a trained model makes predictions on new data. When you prompt a model and it responds, that's inference.">inference</Tooltip>;
};

export const HandlerFunctionTooltip = () => {
  return <Tooltip headline="Handler function" tip="The core of a Runpod Serverless application. These functions define how a worker processes incoming requests and returns results." cta="Learn more about handler functions" href="/serverless/workers/handler-functions">handler function</Tooltip>;
};

export const WorkersTooltip = () => {
  return <Tooltip headline="Worker" tip="A container that runs your application code and processes requests to your Serverless endpoint. Workers are automatically started and stopped by Runpod to handle traffic spikes and ensure optimal resource utilization." cta="Learn more about workers" href="/serverless/workers/overview">workers</Tooltip>;
};

export const ColdStartTooltip = () => {
  return <Tooltip headline="Cold start" tip="The time between when an endpoint with no running workers receives a request, and when a worker is fully warmed up and ready to handle the request." cta="Learn more about cold starts" href="/serverless/overview#cold-starts">cold start</Tooltip>;
};

export const MachinesTooltip = () => {
  return <Tooltip headline="Machine" tip="The physical server hardware within a data center that hosts your compute resources.">machines</Tooltip>;
};

export const MachineTooltip = () => {
  return <Tooltip headline="Machine" tip="The physical server hardware within a data center that hosts your compute resources.">machine</Tooltip>;
};

<Tip>
  To learn how to use cached models with the Hugging Face Transformers library, see [Use Hugging Face models](/serverless/development/huggingface-models#use-cached-models). For a complete end-to-end deployment walkthrough, see the [cached model tutorial](/tutorials/serverless/model-caching-text).
</Tip>

Enabling cached models on your endpoints can reduce <ColdStartTooltip /> times and significantly lower the cost of loading large models.

## Why use cached models?

* **Faster cold starts:** Using cached models can reduce <ColdStartTooltip /> times to just a few seconds, even for large models.
* **Reduced costs:** You aren't billed for worker time while your model is being downloaded. This matters most for large models, which can take several minutes to download.
* **Accelerated deployment:** When a host already has the model cached, workers can start without waiting for external downloads or transfers.
* **Smaller container images:** By decoupling models from your container image, you can create smaller, more focused images that contain only your application logic.
* **Shared across workers:** Multiple <WorkersTooltip /> running on the same host <MachineTooltip /> can reference the same cached model, eliminating redundant downloads and saving disk space.

## Cached model compatibility

Cached models work with any model hosted on Hugging Face, including:

* **Public models:** Any publicly available model on Hugging Face.
* **Gated models:** Models that require you to accept terms before you can access them. You must provide a Hugging Face access token.
* **Private models:** Private models your Hugging Face token has access to.

<Tip>
  Cached models aren't suitable if your model is private and not hosted on Hugging Face. In that case, [bake it into your Docker image](/serverless/workers/deploy#including-models-and-external-files) instead.
</Tip>

## How it works

When you select a cached model for your endpoint, Runpod automatically tries to start your workers on hosts that already contain the selected model.

If no cached host <MachinesTooltip /> are available, the system delays starting your workers until the model is downloaded onto the machine where your workers will run, ensuring you still won't be charged for the download time.

<div style={{ marginLeft: '4rem'}}>
  ```mermaid theme={"theme":{"light":"github-light","dark":"github-dark"}}
  %%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'15px','fontFamily':'font-inter'}}}%%

  flowchart TD
      Start([Request received]) --> CheckWorkers{Worker<br/>ready?}
      
      CheckWorkers -->|"&nbsp;&nbsp;Yes&nbsp;&nbsp;"| Process[Process request]
      
      CheckWorkers -->|"&nbsp;&nbsp;No&nbsp;&nbsp;"| CheckCache{Cached model<br/>host available?}
      
      CheckCache -->|"&nbsp;&nbsp;Yes&nbsp;&nbsp;"| FastStart[Start worker on<br/>cached host]
      FastStart --> Ready1[Worker ready<br/>in seconds]
      Ready1 --> Process
      
      CheckCache -->|"&nbsp;&nbsp;No&nbsp;&nbsp;"| WaitForCache[Wait for model download<br/>on target host]
      WaitForCache --> Ready2[Worker ready<br/>after download]
      Ready2 --> Process
      
      Process --> Response([Return response])

      style Start fill:#5F4CFE,stroke:#5F4CFE,color:#FFFFFF,stroke-width:2px
      style Response fill:#5F4CFE,stroke:#5F4CFE,color:#FFFFFF,stroke-width:2px
      
      style CheckWorkers fill:#f87171,stroke:#f87171,color:#000000,stroke-width:2px
      style CheckCache fill:#fb923c,stroke:#fb923c,color:#000000,stroke-width:2px
      
      style Process fill:#22C55E,stroke:#22C55E,color:#000000,stroke-width:2px
      
      style FastStart fill:#22C55E,stroke:#22C55E,color:#000000,stroke-width:2px
      style Ready1 fill:#22C55E,stroke:#22C55E,color:#000000,stroke-width:2px
      
      style WaitForCache fill:#ecc94b,stroke:#ecc94b,color:#000000,stroke-width:2px
      style Ready2 fill:#ecc94b,stroke:#ecc94b,color:#000000,stroke-width:2px

      linkStyle default stroke-width:2px,stroke:#5F4CFE
  ```
</div>

## Enable cached models

Follow these steps to select and add a cached model to your endpoint:

<Steps>
  <Step title="Create a new endpoint">
    Navigate to the [Serverless section](https://www.console.runpod.io/serverless) of the console and click **New Endpoint**.
  </Step>

  <Step title="Configure the model">
    In the **Endpoint Configuration** step, scroll down to **Model** and add the link or path for the model you want to use.

    For example, `Qwen/qwen3-32b-awq`.

    <Frame alt="Cached model setting">
      <img src="https://mintcdn.com/runpod-b18f5ded/Yxk8joMX7rldAU9k/images/model-cache-setting.png?fit=max&auto=format&n=Yxk8joMX7rldAU9k&q=85&s=3ac8824c9c1e598b5aedc39333e3c8a4" width="2586" height="1830" data-path="images/model-cache-setting.png" />
    </Frame>
  </Step>

  <Step title="Add an access token (if needed)">
    If you're using a gated or private model, enter a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) that has access to it.
  </Step>

  <Step title="Deploy the endpoint">
    Complete your endpoint configuration and click **Deploy Endpoint**.
  </Step>
</Steps>

You can add a cached model to an existing endpoint by selecting **Manage → Edit Endpoint** in the endpoint details page and updating the **Model** field.

Once the endpoint is deployed, all of your workers have access to the cached model for <InferenceTooltip />.

## Using cached models in your workers

When using [vLLM workers](/serverless/vllm/overview) or other official Runpod worker images, you can usually just set the **Model** field as shown above (or use the `MODEL_NAME` environment variable), and your workers will automatically use the cached model for inference.

To use cached models with [custom workers](/serverless/workers/custom-worker), you'll need to manually locate the cached model path and integrate it into your worker code.

### Where cached models are stored

Cached models are available to your workers at `/runpod-volume/huggingface-cache/hub/`, following Hugging Face cache conventions. Each model gets a directory whose name starts with `models--` and replaces the forward slashes (`/`) in the original model name with double dashes (`--`). Inside it, a `snapshots` subdirectory holds the model files in a folder named after the commit hash.

<Note>
  Cached models use the same mount path as network volumes (`/runpod-volume/`), but a model loads significantly faster from the cache than the same model does from a network volume.
</Note>

For example, here is how the model `gensyn/qwen2.5-0.5b-instruct` would be stored:

<Tree>
  <Tree.Folder name="runpod-volume" defaultOpen>
    <Tree.Folder name="huggingface-cache" defaultOpen>
      <Tree.Folder name="hub" defaultOpen>
        <Tree.Folder name="models--gensyn--qwen2.5-0.5b-instruct" defaultOpen>
          <Tree.Folder name="refs" defaultOpen>
            <Tree.File name="main" comment="Contains the commit hash of the 'main' branch" />
          </Tree.Folder>

          <Tree.Folder name="snapshots" defaultOpen>
            <Tree.Folder name="abcdef1234567890..." comment="Actual model files, named by commit hash" />
          </Tree.Folder>
        </Tree.Folder>
      </Tree.Folder>
    </Tree.Folder>
  </Tree.Folder>
</Tree>
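
The naming rule above can be sketched in a few lines of Python. The function name `cache_folder_name` is hypothetical and used only for illustration; it isn't part of the Runpod SDK or the Hugging Face libraries.

```python
def cache_folder_name(model_id: str) -> str:
    """Return the cache folder name for a Hugging Face model ID.

    Hypothetical helper: prefixes the ID with "models--" and replaces
    each "/" with "--", as described above.
    """
    return "models--" + model_id.replace("/", "--")


print(cache_folder_name("gensyn/qwen2.5-0.5b-instruct"))
# prints: models--gensyn--qwen2.5-0.5b-instruct
```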

### Locate cached models in your handler

To use a cached model in your <HandlerFunctionTooltip />, you need to resolve the local path to the model files. The path follows a predictable pattern based on the model identifier:

```
/runpod-volume/huggingface-cache/hub/models--{org}--{name}/snapshots/{hash}/
```

For example, `Qwen/Qwen2.5-0.5B-Instruct` would be stored at:

```
/runpod-volume/huggingface-cache/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/abc123.../
```

For complete implementation details including a helper function to resolve these paths dynamically, see [Use Hugging Face models](/serverless/development/huggingface-models#use-cached-models).
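
As a rough illustration of the pattern above, the following sketch resolves a snapshot directory using only the Python standard library. It is not the helper from the linked guide: the function name `resolve_snapshot_path` and its fallback behavior are assumptions made for this example. It also assumes the model ID is written with the same casing as the folder name on disk.

```python
from pathlib import Path
from typing import Optional

# Cache location described on this page.
CACHE_ROOT = Path("/runpod-volume/huggingface-cache/hub")


def resolve_snapshot_path(model_id: str, cache_root: Path = CACHE_ROOT) -> Optional[Path]:
    """Return the snapshot directory for model_id, or None if it isn't cached.

    Hypothetical helper for illustration only.
    """
    model_dir = cache_root / ("models--" + model_id.replace("/", "--"))
    snapshots_dir = model_dir / "snapshots"
    if not snapshots_dir.is_dir():
        return None

    # Prefer the commit hash recorded for the "main" branch.
    ref_file = model_dir / "refs" / "main"
    if ref_file.is_file():
        candidate = snapshots_dir / ref_file.read_text().strip()
        if candidate.is_dir():
            return candidate

    # Assumed fallback: use the first snapshot folder in sorted order.
    snapshots = sorted(p for p in snapshots_dir.iterdir() if p.is_dir())
    return snapshots[0] if snapshots else None
```

In a handler, you could call `resolve_snapshot_path("Qwen/Qwen2.5-0.5B-Instruct")` once at startup and pass the returned directory to your model loader, treating a `None` result as "model not cached on this host".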

### Examples and resources

<CardGroup cols={2}>
  <Card title="Hugging Face integration" icon="face-smile" href="/serverless/development/huggingface-models#use-cached-models">
    Learn how to adapt your Transformers code to use cached models
  </Card>

  <Card title="Cached model tutorial" icon="graduation-cap" href="/tutorials/serverless/model-caching-text">
    End-to-end walkthrough deploying Phi-3 with model caching
  </Card>

  <Card title="Example repository" icon="github" href="https://github.com/runpod-workers/model-store-cache-example">
    Sample worker using cached models for LLM inference
  </Card>

  <Card title="vLLM workers" icon="bolt" href="/serverless/vllm/overview">
    Pre-built workers with automatic cached model support
  </Card>
</CardGroup>

## Current limitations

* Each endpoint is currently limited to one cached model at a time.
* If a Hugging Face repository contains multiple quantization versions of a model (for example, 4-bit AWQ and 8-bit GPTQ versions), the system currently downloads all quantization versions. The ability to select specific quantizations will be available in a future update.
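
Because every quantization version is downloaded, a custom worker may need to pick the files it wants from the snapshot directory itself. The sketch below shows one possible way to do that with a filename pattern. The function name `find_model_files` and the pattern `*awq*` are hypothetical, and actual file names depend on how the repository is organized.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import List


def find_model_files(snapshot_dir: str, pattern: str) -> List[str]:
    """List files under snapshot_dir whose names match pattern.

    Hypothetical helper: matching is case-insensitive, and results are
    returned as sorted paths relative to snapshot_dir.
    """
    root = Path(snapshot_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and fnmatch(path.name.lower(), pattern.lower())
    )
```

For example, `find_model_files(snapshot_dir, "*awq*")` would return only the files whose names contain `awq`, assuming the repository names its files that way.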
