Self-hosted models (vLLM)

Run your own open-weight models behind vLLM and plug them into Pai the same way as any hosted provider. Pai treats vLLM as an OpenAI-compatible endpoint — same agent spec, same Gateway, same guards.

Use this when you want to keep inference on-prem, run a model the public providers don't host, or hit a price/latency target only self-hosting can deliver.

Prerequisites

  • A running vLLM deployment with its OpenAI-compatible HTTP server enabled. vLLM exposes /v1/chat/completions, /v1/completions, and /v1/models by default when you run python -m vllm.entrypoints.openai.api_server (or vllm serve in recent releases).
  • A URL the Pai Gateway can reach. For in-cluster vLLM, this is typically a Kubernetes Service URL like http://vllm.vllm-system.svc.cluster.local:8000/v1; for an external host, use an HTTPS URL on a network the Gateway can route to. A quick reachability check follows this list.
  • Optional: an API key, if you started vLLM with the --api-key flag. Open vLLM servers can skip this.
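
Before wiring the endpoint into Pai, confirm the Gateway's network can actually reach it. vLLM's OpenAI-compatible server exposes a /health endpoint alongside the /v1 routes:

# Run from a pod in the cluster (or any host on the same network):
curl -sf http://vllm.vllm-system.svc.cluster.local:8000/health && echo reachable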

Setup

Open vLLM (no auth)

apiVersion: pai.io/v1
kind: ModelProvider
metadata:
  name: vllm-local
spec:
  provider: vllm
  endpoint: http://vllm.vllm-system.svc.cluster.local:8000/v1
  allowedModels:
    - meta-llama/Meta-Llama-3-70B-Instruct
    - mistralai/Mistral-7B-Instruct-v0.3

Apply with pai apply -f model-provider.yaml.

Token-gated vLLM

If vLLM is started with --api-key <token>, store the token in a Secret and reference it:

pai add secret vllm-token --from-literal api-key=YOUR_VLLM_TOKEN

apiVersion: pai.io/v1
kind: ModelProvider
metadata:
  name: vllm-local
spec:
  provider: vllm
  endpoint: http://vllm.vllm-system.svc.cluster.local:8000/v1
  apiKeySecretRef:
    name: vllm-token
    key: api-key
  allowedModels:
    - meta-llama/Meta-Llama-3-70B-Instruct
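
For reference, a token-gated vLLM launch looks something like this (the model and port here are illustrative; --api-key is the relevant flag):

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --port 8000 \
  --api-key YOUR_VLLM_TOKEN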

Verify:

pai get model-providers
# NAME         PROVIDER   ENDPOINT                                             MAX/DAY   LAST USED   AGE
# vllm-local   vllm       http://vllm.vllm-system.svc.cluster.local:8000/v1    —         —           5s

Supported models

Whatever your vLLM deployment is serving. vLLM accepts any model on the Hugging Face Hub compatible with its supported architectures — see the vLLM supported-models list.

allowedModels is the source of truth: list every model your vLLM server has loaded that agents should be able to use. Pai cannot auto-discover what's available.
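
To see what the server is actually serving, query /v1/models directly; the id values it returns are exactly what belongs in allowedModels (output abridged):

curl -s http://vllm.vllm-system.svc.cluster.local:8000/v1/models
# {"object":"list","data":[{"id":"meta-llama/Meta-Llama-3-70B-Instruct","object":"model",...}]}

Pass -H "Authorization: Bearer YOUR_VLLM_TOKEN" if the server runs with --api-key.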

Use in an agent

spec:
  models:
    - vllm-local/meta-llama/Meta-Llama-3-70B-Instruct

The model reference is <modelprovider-name>/<model-id> — exactly the same shape as with hosted providers. Note that Hugging Face model ids contain a slash of their own, so the provider name is only the first path segment; everything after it is the model id.

Mix with hosted providers for fallback:

spec:
  models:
    - vllm-local/meta-llama/Meta-Llama-3-70B-Instruct   # primary: self-hosted
    - anthropic/claude-haiku-4-5                        # fallback: hosted for reliability

Tips

  • Run vLLM on GPU nodes and reach it from Pai pods via a ClusterIP Service (a minimal Service manifest follows this list). Put vLLM in its own namespace (vllm-system is a common choice) so you can give it GPU nodeSelectors / tolerations without touching other workloads.
  • Pre-pull weights. vLLM downloads model weights from the Hugging Face Hub when the server starts; pre-pulling them (or caching them on a PVC) avoids a long cold start before the first agent call can be served.
  • Scale vLLM independently. Agents scale via Pai's autoscaler; vLLM scales via its own replicas. Keep the two decoupled.
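
A minimal ClusterIP Service matching the endpoint used throughout this page; the app: vllm selector is an assumption about how your vLLM pods are labeled:

apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: vllm-system
spec:
  selector:
    app: vllm          # assumed pod label; match your vLLM Deployment
  ports:
    - name: http
      port: 8000
      targetPort: 8000

This yields the vllm.vllm-system.svc.cluster.local:8000 DNS name used in the ModelProvider endpoint.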

Token budgets

Cap how many tokens this self-hosted deployment serves per day across every agent. This is less about cost (you own the GPUs) than about backpressure: it keeps a noisy agent from monopolising your inference cluster.

apiVersion: pai.io/v1
kind: ModelProvider
metadata:
  name: vllm-local
spec:
  provider: vllm
  endpoint: http://vllm.vllm-system.svc.cluster.local:8000/v1
  maxTokensPerDay: 50000000      # daily cap shared across all agents
  maxTokensPerRequest: 32000     # per-request context-window limit
  allowedModels:
    - meta-llama/Meta-Llama-3-70B-Instruct

When the daily cap is hit, the gateway returns HTTP 429 until midnight UTC. Agents that list another provider in spec.models automatically fail over to it.

Expose via the LLM Gateway

Set externalAccess.enabled: true to let developers outside the cluster — laptops, CI, scripts — route their own LLM traffic to your self-hosted models. Useful for letting the team prototype against in-house GPUs without VPN'ing into the cluster or distributing vLLM tokens.

spec:
  externalAccess:
    enabled: true
    maxTokensPerDay: 5000000   # separate budget for external usage

Once enabled, developers connect with two commands:

pai login https://api.pairun.dev --access-key pak_...
eval $(pai gateway env)
# OpenAI-compatible clients now reach vLLM through Pai
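
From there, any OpenAI-compatible client works unchanged. A curl sketch, assuming pai gateway env exports the standard OPENAI_BASE_URL and OPENAI_API_KEY variables (see the LLM Gateway page for the exact names) and that external requests use the same <modelprovider-name>/<model-id> reference:

curl -s "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "vllm-local/meta-llama/Meta-Llama-3-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello from outside the cluster"}]
      }'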

See LLM Gateway for the full onboarding flow, AccessKey management, and per-developer rate limits.

Access control

Narrow which models agents may call on this provider with allowedModels / deniedModels, or attach prompt-injection guards. See Security controls on the Model page for the full field list.
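
A sketch of the shape, assuming deniedModels sits alongside allowedModels on the ModelProvider spec (the Model page has the authoritative field list):

spec:
  deniedModels:
    - mistralai/Mistral-7B-Instruct-v0.3   # served by vLLM, but off-limits to agents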