How much are you overpaying for inference?

Compare open-source models on GPU CLI against closed APIs.

Estimate your workload with our form, or import a usage CSV from your current provider.

Sets the typical input/output token ratio for your use case.

We'll compare against an equivalent open-source model.

5M tokens
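To make the arithmetic behind a per-token estimate concrete, here is a minimal sketch. The function name, the 3:1 input/output split, and the formula are illustrative assumptions — the calculator also factors in GPU instance pricing, so the number it shows won't match pure token math.

```python
def estimate_monthly_token_cost(
    total_tokens: int,
    input_share: float,
    in_price_per_1m: float,
    out_price_per_1m: float,
) -> float:
    """Split total volume by the input/output ratio, then price
    each side at its per-1M-token rate."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1.0 - input_share)
    return (
        input_tokens * in_price_per_1m + output_tokens * out_price_per_1m
    ) / 1_000_000

# 5M tokens/month at a 3:1 input/output ratio, priced like the Large tier.
cost = estimate_monthly_token_cost(5_000_000, 0.75, 0.50, 2.00)
print(f"${cost:.2f} in token-equivalent cost")
```

Token math alone understates self-hosted pricing, which is dominated by GPU instance time rather than per-token rates — that gap is what the comparison above surfaces.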

GPU CLI supports LoRA fine-tuning on deployed models. We factor this into the recommendation.

Recommended open-source models for your workload

These shift as you change your use case, current model, and monthly volume. Pick one to deploy — or bring your own configuration.

Small
Qwen3.5 9B

Comparable to Claude Haiku 4.5.

$0.10 / $0.30 per 1M tokens
Est. monthly
$104
Deploy this model
Medium
Qwen3.5 35B-A3B

Comparable to GPT-5.4 Mini.

$0.30 / $1.20 per 1M tokens
Est. monthly
$102
Deploy this model
Recommended
Large
Qwen3.5 122B-A10B

Comparable to Claude Sonnet 4.6.

$0.50 / $2.00 per 1M tokens
Est. monthly
$104
Deploy this model
Prefer to configure your own?

Pick any open-weight model, GPU class, and schedule in the deploy wizard.

Configure your own

Your data. Your GPUs. No exceptions.

When you deploy through GPU CLI, inference runs on GPU instances you control. Prompts, completions, and fine-tuning data stay on your compute — not routed through us or a third party.

Inference on your GPUs

Every request is handled by a GPU instance your org provisioned — GPU CLI stays out of the data path.

Isolated deployments

Dedicated instances run in an isolated environment — no shared memory, no shared storage, no shared network.

Credentials in the OS keychain

Your GPU provider API key stays in your system keychain. It's never stored in a config file or transmitted to us.

Drop-in replacement for the OpenAI API.

Works with any library that supports OpenAI-compatible endpoints — the Python SDK, LangChain, LlamaIndex, Vercel AI SDK, and more.

python
from openai import OpenAI

# Point at your GPU CLI endpoint — everything else stays the same.
client = OpenAI(
    api_key="gpu-cli-...",
    base_url="https://your-instance.gpu-cli.sh/v1",
)

response = client.chat.completions.create(
    model="qwen3.5-122b-a10b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
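Because the endpoint speaks the OpenAI wire format, an SDK isn't strictly required — any HTTP client can build the same request. A stdlib-only sketch of the request the SDK sends under the hood (endpoint URL and key are the placeholders from the example above; the request is constructed but not sent here):

```python
import json
import urllib.request

# Same chat-completions payload the SDK serializes for you.
payload = {
    "model": "qwen3.5-122b-a10b",
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    "https://your-instance.gpu-cli.sh/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer gpu-cli-...",
    },
    method="POST",
)

# urllib.request.urlopen(req) would return the same JSON the SDK parses.
print(req.get_method(), req.full_url)
```

Any library that lets you override the base URL and bearer token — the Python SDK, LangChain, LlamaIndex, Vercel AI SDK — is doing exactly this against your instance.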

More info in the docs.