
LLM Gateway

The Pai LLM Gateway lets developers outside the cluster route their LLM requests through Pai. This gives teams centralized API key management, per-user token budgets, model access control, and audit logging — without distributing API keys to individual machines.

This is the companion to the MCP Gateway: same token type (AccessKey), same pai gateway … CLI pattern, different upstream (LLM provider instead of MCP server).

How it works

Developer laptop                     Pai cluster                    LLM Provider
+---------------+      HTTPS     +------------------+     HTTPS     +------------+
|  Claude Code  | -------------> |   Pai Gateway    | ------------> | Anthropic  |
|  (any LLM     |     pak_...    |  - auth check    |   real key    | OpenAI     |
|  client)      |    AccessKey   |  - token budget  |    from MP    | Gemini     |
+---------------+                |  - audit log     |               +------------+
                                 +------------------+
  1. Admin creates a ModelProvider with externalAccess.enabled: true.
  2. Admin (or the developer, via CLI) mints an AccessKey bound to the ModelProvider.
  3. Developer runs pai login + eval $(pai gateway env).
  4. Claude Code / any OpenAI- or Anthropic-compatible client routes through Pai.

The developer's machine never sees the real LLM API key.

Setup

1. Create a ModelProvider with external access

apiVersion: pai.io/v1
kind: ModelProvider
metadata:
  name: anthropic
spec:
  provider: anthropic
  apiKeySecretRef:
    name: anthropic-key
    key: api-key
  externalAccess:
    enabled: true
    # maxTokensPerDay: 2000000  # optional: separate budget for external usage

Or via CLI:

pai create model-provider anthropic --provider anthropic --api-key sk-ant-...

2. Mint an AccessKey for the developer

pai access-key create --name alice-laptop \
  --model-provider anthropic \
  --allowed-cidr 10.0.0.0/8 \
  --allowed-model claude-haiku-4-5

The CLI prints the raw pak_... once — store it securely (1Password, vault, etc.). AccessKey restrictions can only narrow what the ModelProvider already permits. Rotate with pai access-key rotate alice-laptop.

3. Developer onboarding (3 commands)

# Connect to Pai
pai login https://api.pairun.dev --access-key pak_...

# Configure local environment
eval $(pai gateway env)

# Use Claude Code normally — requests route through Pai
claude

The pai gateway env command outputs:

export ANTHROPIC_BASE_URL=https://api.pairun.dev/ext/v1
export ANTHROPIC_API_KEY=sk-ant-api03-pak-...-AA

For a guided setup that also lists the available models:

pai gateway setup

4. Add to shell profile (optional)

# Add to ~/.zshrc or ~/.bashrc for persistence
pai gateway env >> ~/.zshrc

Gateway endpoints

The external proxy is served under the /ext/v1/ prefix:

| Endpoint | Compatible with |
|---|---|
| POST /ext/v1/chat/completions | OpenAI SDK, LangChain, CrewAI |
| POST /ext/v1/messages | Anthropic SDK, Claude Code |
| POST /ext/v1/messages/count_tokens | Anthropic SDK pre-flight token counting (Anthropic providers only; other providers return 501) |

Authentication: the AccessKey is wrapped in an Anthropic-compatible API key format (sk-ant-api03-pak-<key>-AA) so Claude Code accepts it without modification. Clients that accept a raw bearer (OpenAI SDK, LangChain) can pass the pak_... directly.
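For the raw-bearer path, a request to the OpenAI-compatible endpoint needs nothing beyond the standard library — an illustrative sketch (placeholder key; base URL and model name taken from the examples above):

```python
import json
import urllib.request

PAI_BASE = "https://api.pairun.dev/ext/v1"  # value printed by `pai gateway env`
ACCESS_KEY = "pak_..."                      # your raw AccessKey (placeholder)

# Build a chat completion request; the gateway validates the AccessKey and
# swaps in the ModelProvider's real key before calling the upstream provider.
req = urllib.request.Request(
    f"{PAI_BASE}/chat/completions",
    data=json.dumps({
        "model": "claude-haiku-4-5",
        "messages": [{"role": "user", "content": "hello"}],
    }).encode(),
    headers={
        "Authorization": f"Bearer {ACCESS_KEY}",
        "Content-Type": "application/json",
    },
)
# resp = urllib.request.urlopen(req)  # uncomment once a real key is in place
```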

Access control

| Control | Where | Effect |
|---|---|---|
| spec.externalAccess.enabled | ModelProvider | Must be true for external requests |
| spec.externalAccess.maxTokensPerDay | ModelProvider | Separate daily budget across all external usage |
| spec.allowedModels / deniedModels | ModelProvider | Which models this provider exposes |
| spec.restrictions.allowedModels | AccessKey | Per-key model allowlist (narrows the ModelProvider) |
| spec.restrictions.allowedCIDRs | AccessKey | Client IP allowlist |
| spec.limits.maxTokensPerDay | AccessKey | Per-key daily token cap |

Client IP for allowedCIDRs

The gateway trusts X-Forwarded-For only when the direct peer is in the configured gateway.externalProviderGateway.trustedProxies. Otherwise the peer address is used. See the External Provider Gateway guide for details.

Token format

The AccessKey is wrapped to look like a valid Anthropic API key:

pak_a7f3b9c2e1d5  ->  sk-ant-api03-pak-a7f3b9c2e1d5-AA

The gateway strips the wrapper, looks up the AccessKey by hash, validates the per-key restrictions, and then uses the ModelProvider's real API key for the upstream LLM call. The developer never sees the real key.
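The wrapping is pure string manipulation; both directions can be sketched as follows (illustrative helpers, not the gateway's actual code):

```python
PREFIX = "sk-ant-api03-pak-"
SUFFIX = "-AA"

def wrap_access_key(pak: str) -> str:
    # pak_<key> -> sk-ant-api03-pak-<key>-AA (Anthropic-compatible shape)
    assert pak.startswith("pak_"), "expected a raw AccessKey"
    return f"{PREFIX}{pak[len('pak_'):]}{SUFFIX}"

def unwrap_access_key(token: str) -> str:
    # Strip the wrapper if present; a raw pak_... bearer passes through unchanged.
    if token.startswith(PREFIX) and token.endswith(SUFFIX):
        return "pak_" + token[len(PREFIX):-len(SUFFIX)]
    return token
```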

Audit

All external proxy requests are logged in the gateway audit chain with workload=ext:<accesskey-name>, making it easy to track per-developer usage. Token consumption is visible via pai get metrics and on each key: AccessKey.status.tokensToday.

Cost tracking

The gateway computes per-call USD cost from a built-in price table and exposes it in three ways:

  • Response header x-pai-cost-usd on every LLM response.
  • Prometheus pai_cost_usd_total{workload,model,kind} where kind is input | cached_input | output.
  • pai get agents — a COST column showing today's cumulative spend.

Prices are USD per 1M tokens. The table ships with defaults for common Anthropic, OpenAI, and Gemini models; override or extend via the helm chart:

# values.yaml
gateway:
  modelPrices:
    "claude-sonnet-4-*":
      input: 3.00
      output: 15.00
      cached_input: 0.30
    "my-vllm-llama-70b":
      input: 0.50
      output: 1.00

Keys support a trailing wildcard (claude-sonnet-4-*); exact matches win over wildcards. Overrides merge on top of the built-ins and are hot-reloaded from the pai-model-prices ConfigMap every 60s — no gateway restart needed. When no price is known for a model, tokens still flow and the call is audited; only the $ figure is omitted.
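The lookup and the per-1M-token arithmetic can be sketched like this (hypothetical helper names; the longest-wildcard tiebreak is an assumption, the source only specifies that exact matches win):

```python
from typing import Optional

Price = dict[str, float]  # keys: input, output, optionally cached_input

def resolve_price(model: str, prices: dict[str, Price]) -> Optional[Price]:
    # Exact matches win over trailing-* wildcards.
    if model in prices:
        return prices[model]
    best_key = None
    for key in prices:
        if key.endswith("*") and model.startswith(key[:-1]):
            # Assumption: the most specific (longest) wildcard prefix wins.
            if best_key is None or len(key) > len(best_key):
                best_key = key
    return prices[best_key] if best_key else None  # None -> no $ figure emitted

def cost_usd(price: Price, input_toks: int, output_toks: int, cached_toks: int = 0) -> float:
    # Prices are USD per 1M tokens.
    return (
        input_toks * price["input"]
        + output_toks * price["output"]
        + cached_toks * price.get("cached_input", 0.0)
    ) / 1_000_000
```

With the example table above, a call on claude-sonnet-4-20250514 using 1,000 input and 500 output tokens would cost (1000 × 3.00 + 500 × 15.00) / 1,000,000 = $0.0105.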

Daily cost cap

Pair cost tracking with spec.rateLimits.maxCostPerDayUSD on an Agent to enforce a hard per-agent spend ceiling (HTTP 429 once the cap is hit).
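An illustrative Agent excerpt (hypothetical agent name; all other Agent fields omitted):

```yaml
apiVersion: pai.io/v1
kind: Agent
metadata:
  name: alice-agent
spec:
  rateLimits:
    maxCostPerDayUSD: 25   # requests get HTTP 429 once today's spend hits $25
```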

Reliability: retries + fallbacks

The gateway wraps every upstream call in a retry loop (429, 5xx, connection errors) with exponential backoff + jitter. When retries exhaust on the primary model, the ModelProvider.spec.fallbacks chain takes over. See ModelProvider → Reliability for full details.
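The retry shape described above can be sketched as exponential backoff with full jitter; the status set, attempt count, and base delay here are illustrative, not documented gateway values:

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}  # illustrative set of retryable statuses

def call_with_retries(send, max_attempts: int = 4, base: float = 0.5):
    # `send` performs one upstream call and returns (status, body).
    for attempt in range(max_attempts):
        try:
            status, body = send()
        except ConnectionError:
            status = None  # connection errors are retryable too
        else:
            if status not in RETRYABLE:
                return status, body
        if attempt < max_attempts - 1:
            # Full jitter: sleep a random amount in [0, base * 2^attempt).
            time.sleep(random.uniform(0, base * 2 ** attempt))
    raise RuntimeError("retries exhausted on primary model; fallbacks chain takes over")
```

When this loop gives up, control passes to the next model in ModelProvider.spec.fallbacks rather than surfacing the error directly.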

Tracing headers

Every LLM response carries x-pai-call-id, x-pai-model-id, x-pai-input-tokens, x-pai-output-tokens, x-pai-duration-ms, x-pai-retries, and — when applicable — x-pai-cost-usd, x-pai-cached-input-tokens, x-pai-fell-back-from. Clients can correlate calls without access to gateway logs.
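A client-side helper for collecting those headers might look like this (hypothetical helper; case-insensitive on header names):

```python
def parse_pai_trace(headers: dict[str, str]) -> dict[str, str]:
    # Gather the x-pai-* tracing headers into a dict, keyed without the prefix.
    out = {}
    for name, value in headers.items():
        if name.lower().startswith("x-pai-"):
            out[name.lower()[len("x-pai-"):]] = value
    return out
```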