📄️ Overview
Deploy a highly optimized vLLM Worker as a serverless endpoint to serve Hugging Face LLMs through an OpenAI-compatible API, with dynamic batching and extensive customization options for a scalable, cost-effective solution.
📄️ Get started
Deploying a Serverless Endpoint for large language models (LLMs) on RunPod is a simple, efficient way to run vLLM Workers with minimal configuration; a request sketch follows below.
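As a taste of what the guide covers, here is a minimal sketch of calling a deployed endpoint with RunPod's `runsync` API using the `requests` library. The endpoint ID and API key are placeholders, and the `{"input": {"prompt": ...}}` payload assumes the vLLM Worker's prompt-based input format:

```python
# Minimal sketch: synchronous request to a deployed vLLM Worker endpoint.
# ENDPOINT_ID and API_KEY are placeholders for your own values.
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = "YOUR_RUNPOD_API_KEY"

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    # Assumed input schema for the vLLM Worker; see the guide for the full spec.
    json={"input": {"prompt": "Why is the sky blue?"}},
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```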
📄️ OpenAI compatibility
Discover how the vLLM Worker integrates with OpenAI's API for seamless interaction. With both streaming and non-streaming support, it's ideal for chatbots, conversational AI, and natural language processing applications.
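A minimal sketch of both modes, assuming the `openai` Python package (v1+) and an already-deployed endpoint; the endpoint ID, API key, and model name are placeholders for your own values:

```python
from openai import OpenAI

# Point the standard OpenAI client at the worker's OpenAI-compatible base URL.
client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

# Non-streaming: the full completion is returned in one response.
chat = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # the model your worker serves
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(chat.choices[0].message.content)

# Streaming: tokens arrive incrementally as they are generated.
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```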
📄️ Environment variables
Configure your vLLM Worker with environment variables that control model selection, access credentials, and operational parameters. This guide provides a reference for CUDA versions, image tags, and model-specific environment variable settings.
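An illustrative sketch of a typical configuration; the variable names follow the worker's documented pattern, but treat the exact names and values as assumptions and confirm them against the reference tables in the guide:

```python
# Illustrative values only; these key-value pairs would be entered as
# environment variables on the endpoint's configuration page, not set in code.
worker_env = {
    "MODEL_NAME": "mistralai/Mistral-7B-Instruct-v0.2",  # Hugging Face model to serve
    "HF_TOKEN": "YOUR_HUGGING_FACE_TOKEN",               # needed for gated or private models
    "MAX_MODEL_LEN": "8192",                             # cap on the model's context length
    "GPU_MEMORY_UTILIZATION": "0.95",                    # fraction of VRAM vLLM may allocate
}
```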
📄️ Configurable Endpoints
Deploy large language models with RunPod's Configurable Endpoints feature, which uses vLLM to handle model loading, hardware configuration, and execution so you can focus on model selection and customization.