How much are you overpaying for inference?

Compare open-source models on GPU CLI against closed APIs.

Estimate your workload with our form, or import a usage CSV from your current provider.

Sets the typical input/output token ratio for your use case.

We'll compare against an equivalent open-source model.

5M tokens
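To make the arithmetic behind a per-token estimate concrete, here is a minimal sketch. The function name, the 3:1 input/output split, and the formula are illustrative assumptions — the calculator also factors in GPU instance pricing, so the number it shows won't match pure token math.

```python
def estimate_monthly_token_cost(
    total_tokens: int,
    input_share: float,
    in_price_per_1m: float,
    out_price_per_1m: float,
) -> float:
    """Split total volume by the input/output ratio, then price
    each side at its per-1M-token rate."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1.0 - input_share)
    return (
        input_tokens * in_price_per_1m + output_tokens * out_price_per_1m
    ) / 1_000_000

# 5M tokens/month at a 3:1 input/output ratio, priced like the Large tier.
cost = estimate_monthly_token_cost(5_000_000, 0.75, 0.50, 2.00)
print(f"${cost:.2f} in token-equivalent cost")
```

Token math alone understates self-hosted pricing, which is dominated by GPU instance time rather than per-token rates — that gap is what the comparison above surfaces.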

GPU CLI supports LoRA fine-tuning on deployed models. We factor this into the recommendation.

Recommended open-source models for your workload

These shift as you change your use case, current model, and monthly volume. Pick one to deploy — or bring your own configuration.

Small
Qwen3.5 9B

Comparable to Claude Haiku 4.5.

$0.10 / $0.30 per 1M tokens
Est. monthly
$104
Deploy this model
Medium
Qwen3.5 35B-A3B

Comparable to GPT-5.4 Mini.

$0.30 / $1.20 per 1M tokens
Est. monthly
$102
Deploy this model
Recommended
Large
Qwen3.5 122B-A10B

Comparable to Claude Sonnet 4.6.

$0.50 / $2.00 per 1M tokens
Est. monthly
$104
Deploy this model
Prefer to configure your own?

Pick any open-weight model, GPU class, and schedule in the deploy wizard.

Configure your own

Your data. Your GPUs. No exceptions.

When you deploy through GPU CLI, inference runs on GPU instances you control. Prompts, completions, and fine-tuning data stay on your compute — not routed through us or a third party.

Inference on your GPUs

Every request is handled by a GPU instance your org provisioned — GPU CLI stays out of the data path.

Isolated deployments

Dedicated instances run in an isolated environment — no shared memory, no shared storage, no shared network.

Credentials in the OS keychain

Your GPU provider API key stays in your system keychain. It's never stored in a config file or transmitted to us.

Drop-in replacement for the OpenAI API.

Works with any library that supports OpenAI-compatible endpoints — the Python SDK, LangChain, LlamaIndex, Vercel AI SDK, and more.

python
from openai import OpenAI

# Point at your GPU CLI endpoint — everything else stays the same.
client = OpenAI(
    api_key="gpu-cli-...",
    base_url="https://your-instance.gpu-cli.sh/v1",
)

response = client.chat.completions.create(
    model="qwen3.5-122b-a10b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
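Because the endpoint speaks the OpenAI wire format, an SDK isn't strictly required — any HTTP client can build the same request. A stdlib-only sketch of the request the SDK sends under the hood (endpoint URL and key are the placeholders from the example above; the request is constructed but not sent here):

```python
import json
import urllib.request

# Same chat-completions payload the SDK serializes for you.
payload = {
    "model": "qwen3.5-122b-a10b",
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    "https://your-instance.gpu-cli.sh/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer gpu-cli-...",
    },
    method="POST",
)

# urllib.request.urlopen(req) would return the same JSON the SDK parses.
print(req.get_method(), req.full_url)
```

Any library that lets you override the base URL and bearer token — the Python SDK, LangChain, LlamaIndex, Vercel AI SDK — is doing exactly this against your instance.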

More info in the docs.