Compare open-source models on GPU CLI against closed APIs.
Sketch it with our form or import a usage CSV from your current provider.
Sets the typical input/output token ratio for your workload.
We'll compare against an equivalent open-source model.
GPU CLI supports LoRA fine-tuning on deployed models. We factor this into the recommendation.
These shift as you change your use case, current model, and monthly volume. Pick one to deploy — or bring your own configuration.
Comparable to Claude Haiku 4.5.
Comparable to GPT-5.4 Mini.
Comparable to Claude Sonnet 4.6.
Pick any open-weight model, GPU class, and schedule in the deploy wizard.
When you deploy through GPU CLI, inference runs on GPU instances you control. Prompts, completions, and fine-tuning data stay on your compute — not routed through us or a third party.
Every request is handled by a GPU instance your org provisioned — GPU CLI stays out of the data path.
Dedicated instances run in an isolated environment — no shared memory, no shared storage, no shared network.
Your GPU provider API key stays in your system keychain. It's never stored in a config file or transmitted to us.
Works with any library that supports OpenAI-compatible endpoints — the Python SDK, LangChain, LlamaIndex, Vercel AI SDK, and more.
from openai import OpenAI

# Point at your GPU CLI endpoint — everything else stays the same.
client = OpenAI(
    api_key="gpu-cli-...",
    base_url="https://your-instance.gpu-cli.sh/v1",
)

response = client.chat.completions.create(
    model="qwen3.5-122b-a10b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

More info in the docs.
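Because the endpoint speaks the OpenAI chat-completions wire format, you don't need an SDK at all. A minimal sketch using only Python's standard library; the instance URL, API key, and model name below are placeholders, not real values:

```python
import json
import urllib.request

# Placeholder endpoint and key — substitute your own deployment's values.
BASE_URL = "https://your-instance.gpu-cli.sh/v1"
API_KEY = "gpu-cli-..."

# Standard OpenAI-style chat-completions request body.
payload = {
    "model": "qwen3.5-122b-a10b",
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)

# Uncomment to send the request against a live instance:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Any tool that can POST JSON with a bearer token — curl, a cron job, a serverless function — can talk to the instance the same way.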
