Learn what to expect during planned and unplanned maintenance events, and how to keep your data safe.
Runpod operates on shared infrastructure. Like any cloud platform, maintenance and unexpected outages can occur. This page explains how these situations are handled and what you can do to protect your work.
Planned maintenance
When scheduled maintenance is required on a machine hosting your pod, Runpod notifies you in advance. Notifications are sent via email before the maintenance window begins so you have time to save your work, back up data, or migrate to another pod. During a maintenance window, Runpod does not charge you for the time your pod is unavailable. If you cannot wait for maintenance to complete, you can deploy another resource in the meantime. If you have questions about a maintenance window or believe your pod was impacted, contact Runpod Support.
Unplanned outages
Hardware failures and sudden crashes can happen without warning. In these cases:
- Runpod may only be able to notify you after the outage has begun, not before.
- You will be notified as soon as the issue is identified.
Data safety
Pods use temporary container storage by default. If your pod is interrupted, restarted, stopped, or terminated, any data that is only stored on container storage will be lost. To protect your work, always store important data on a network volume or an external backup.
Use a network volume
Attach a network volume to your pod to persist data across restarts and pod deletions. This is the most reliable way to ensure your data survives unexpected outages.
Set up checkpointing
For long-running jobs, implement checkpointing to save progress periodically (every hour to every few hours depending on job length). This limits the amount of work lost if a pod restarts unexpectedly. Most machine learning frameworks include built-in checkpointing support. See your framework’s documentation to get started:
- PyTorch: Saving and Loading Models.
- Hugging Face Transformers: Checkpointing with Trainer.
- Hugging Face Accelerate: Checkpointing guide.
- PyTorch Lightning: Checkpointing.
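If your job does not use one of these frameworks, the same idea can be sketched with the Python standard library alone. The example below is a minimal, framework-agnostic illustration, not Runpod's or any framework's implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the JSON file format, and the step-based interval are all assumptions made for illustration. It writes each checkpoint to a temporary file and renames it into place, so an interruption mid-write does not leave a half-written checkpoint behind, and it resumes from the last saved step on restart.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, checkpoint_dir: str, name: str = "checkpoint.json") -> Path:
    """Write `state` to disk atomically (hypothetical helper, not a Runpod API)."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / name
    # Write to a temporary file in the same directory, then rename it over
    # the target so readers never see a partially written checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target


def load_checkpoint(checkpoint_dir: str, name: str = "checkpoint.json"):
    """Return the saved state, or None if no checkpoint exists yet."""
    target = Path(checkpoint_dir) / name
    if not target.exists():
        return None
    with target.open() as handle:
        return json.load(handle)


def run_job(total_steps: int, checkpoint_dir: str, save_every: int = 100) -> dict:
    """Resume from the last checkpoint, saving progress every `save_every` steps."""
    state = load_checkpoint(checkpoint_dir) or {"step": 0, "running_total": 0}
    for step in range(state["step"], total_steps):
        state["running_total"] += step  # placeholder for the real unit of work
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, checkpoint_dir)
    save_checkpoint(state, checkpoint_dir)  # always persist the final state
    return state
```

To benefit from this during an outage, point `checkpoint_dir` at a directory on your network volume rather than on container storage. A time-based interval (for example, once an hour) can replace the step counter if your steps vary widely in duration.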
Maintain backups
The industry standard for data protection is the 3-2-1 rule:
- 3 copies of your data
- 2 different storage types (for example, a network volume and an external object store)
- 1 copy stored offsite or in a separate location
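As a rough illustration of the rule above, the sketch below copies a file to several destinations and verifies each copy against the original's SHA-256 checksum. The helper names (`sha256_of`, `back_up`) are hypothetical, and the destinations are plain directories standing in for real targets. In practice one destination would be a path on a network volume, and the offsite copy would typically go through your object store's own client, which is not shown here.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a file's SHA-256 digest, reading in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: str, destinations: list[str]) -> list[Path]:
    """Copy `source` into each destination directory and verify every copy."""
    src = Path(source)
    expected = sha256_of(src)
    copies = []
    for destination in destinations:
        dest_dir = Path(destination)
        dest_dir.mkdir(parents=True, exist_ok=True)
        # copy2 preserves file metadata such as modification time.
        copy_path = Path(shutil.copy2(src, dest_dir / src.name))
        if sha256_of(copy_path) != expected:
            raise IOError(f"Checksum mismatch for {copy_path}")
        copies.append(copy_path)
    return copies
```

With the original plus two verified copies on different storage types, one of them in a separate location, you have the three copies the rule calls for.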
Network volumes
Set up persistent, portable storage that survives pod restarts and deletions.
Storage options
Compare container disk, volume disk, and network volume storage types.