Inference model selection
The router's model: "auto" setting is the right default for ~80% of inference traffic. This page is for the other 20% — when you want to know which model to pin and why.
The available models (v1)
| Model | Open weights? | Best at | Typical J/request* |
|---|---|---|---|
| llama-3.3-405b-instruct | Yes | Long-context reasoning, code generation | ~14 |
| llama-3.3-70b-instruct | Yes | General-purpose chat, RAG aggregation, summarization | ~3.2 |
| deepseek-v3 | Yes | Reasoning chains, math, code | ~6.5 |
| qwen2.5-72b | Yes | Multilingual, especially CJK, plus general tasks | ~3.0 |
| mixtral-8x22b | Yes | MoE; balanced; fast on quality plateau | ~2.4 |
| gpt-oss-120b | Yes | OpenAI's open-weight model; OpenAI-tuned behaviour | ~5.1 |
| claude-haiku-4.5 | No (licensed) | Anthropic-tuned tone; long-context; tool use | ~3.8 |
| flux-schnell | Yes | Image generation, fast tier | ~110 (per image) |
| flux-dev | Yes | Image generation, quality tier | ~280 (per image) |
| sdxl-1.0 | Yes | Image generation, classic / fine-tuneable | ~95 (per image) |
| whisper-large-v3 | Yes | Speech-to-text | ~0.4 (per audio-second) |
| text-embedding-3-small | OAI-compat | Embeddings (1536-dim) | ~0.02 (per chunk) |
| text-embedding-3-large | OAI-compat | Embeddings (3072-dim) | ~0.05 (per chunk) |
*Per-request joule numbers assume mid-tier silicon and an average prompt length for that model class. Real numbers always land in the response header.
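To turn the table into a rough planning number, the sketch below multiplies each model's typical J/request by a planned request count. The values are copied from the table (text models only); `TYPICAL_JOULES` and `estimate_joules` are hypothetical names used for illustration, not part of any SDK, and the result is only an estimate since real figures come back with each response.

```python
# Typical joules per request, copied from the table above (text models only).
TYPICAL_JOULES = {
    "llama-3.3-405b-instruct": 14.0,
    "llama-3.3-70b-instruct": 3.2,
    "deepseek-v3": 6.5,
    "qwen2.5-72b": 3.0,
    "mixtral-8x22b": 2.4,
    "gpt-oss-120b": 5.1,
    "claude-haiku-4.5": 3.8,
}


def estimate_joules(requests_by_model: dict[str, int]) -> float:
    """Sum the typical energy for a planned workload (hypothetical helper)."""
    return sum(TYPICAL_JOULES[model] * count
               for model, count in requests_by_model.items())


# Example plan: 9,000 short requests on the cheap model, 1,000 hard ones on 405b.
plan = {"mixtral-8x22b": 9000, "llama-3.3-405b-instruct": 1000}
print(f"{estimate_joules(plan):.0f} J")
```

Sending the same 10,000 requests to llama-3.3-405b-instruct alone would come to roughly 140,000 J under the same assumptions, which is the gap that splitting traffic is meant to close.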
When to use "auto"
The classifier sniffs your prompt (free, sub-millisecond) and picks the cheapest model that should not regress on quality. "auto" is right when:
- You have mixed traffic: lookups, classifications, summaries, the occasional hard reasoning question.
- You care about the aggregate bill more than per-call latency.
- You're running an internal tool where 10% latency variance is invisible to users.
Across a realistic mixed workload, "auto" typically lands 30-60% cheaper than pinning llama-3.3-70b-instruct for everything.
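As a worked example of that range, the sketch below applies the 30-60% figure to a baseline pinned on llama-3.3-70b-instruct. It assumes cost scales with the typical ~3.2 J/request from the models table; `auto_cost_range` is a hypothetical helper, and the request count is made up for illustration.

```python
def auto_cost_range(baseline: float,
                    low_saving: float = 0.30,
                    high_saving: float = 0.60) -> tuple[float, float]:
    """Return (best_case, worst_case) cost under "auto" for a pinned baseline."""
    return baseline * (1 - high_saving), baseline * (1 - low_saving)


# Illustrative workload: 100,000 requests pinned to llama-3.3-70b-instruct.
baseline_joules = 100_000 * 3.2
best, worst = auto_cost_range(baseline_joules)
print(f"pinned: {baseline_joules:.0f} J, auto: {best:.0f} to {worst:.0f} J")
```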
When to pin a model
Pin a model when:
- You need deterministic behaviour for testing or evals — the same model on every call.
- You're running a customer-facing brand voice — the chosen model's output style is part of your product.
- You're hitting a specialty: Qwen for Chinese, DeepSeek for math chains, FLUX-dev for image quality.
- You're benchmarking and want apples-to-apples.
Pick-by-job rules of thumb
| Your job | Default pin | Why |
|---|---|---|
| RAG aggregation | llama-3.3-70b-instruct | Great context handling at mid-energy |
| Code generation | deepseek-v3 or llama-3.3-405b-instruct | Specialised on code reasoning chains |
| Long-context summarization | llama-3.3-405b-instruct | Holds 128k+ context without quality slump |
| Multilingual support | qwen2.5-72b | Strongest on CJK + EU langs at this size |
| Classification / tagging | mixtral-8x22b | MoE shape is cheap for short outputs |
| Chat with a "Claude" voice | claude-haiku-4.5 | Pin if your customers expect Claude responses |
| Image gen, fast iteration | flux-schnell | 4 steps; cheap |
| Image gen, final assets | flux-dev | 20+ steps; higher quality |
| Embeddings, RAG default | text-embedding-3-small | 1536-dim is enough for most |
| Embeddings, max recall | text-embedding-3-large | 3072-dim; pay double |
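One way to apply these rules of thumb in application code is a small lookup that falls back to "auto" when no rule applies. This is a minimal sketch: the job keys and the `model_for` helper are hypothetical names chosen for illustration, the model identifiers follow the models table, and where the table offers two pins for code generation the sketch arbitrarily picks deepseek-v3.

```python
# Default pins from the rules-of-thumb table; the job keys are illustrative.
DEFAULT_PINS = {
    "rag_aggregation": "llama-3.3-70b-instruct",
    "code_generation": "deepseek-v3",
    "long_context_summarization": "llama-3.3-405b-instruct",
    "multilingual": "qwen2.5-72b",
    "classification": "mixtral-8x22b",
    "claude_voice_chat": "claude-haiku-4.5",
    "image_fast": "flux-schnell",
    "image_final": "flux-dev",
    "embeddings_default": "text-embedding-3-small",
    "embeddings_max_recall": "text-embedding-3-large",
}


def model_for(job: str) -> str:
    """Return the default pin for a job, or "auto" when no rule applies."""
    return DEFAULT_PINS.get(job, "auto")


print(model_for("multilingual"))
print(model_for("ad_hoc_question"))
```

Keeping the mapping in one place makes it easy to swap a pin after an eval without touching call sites.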
Mixed-model patterns
The strongest cost optimisation is to use cheap models for cheap parts and expensive models for expensive parts within one chain:

    # Assumes `client` is an already-configured OpenAI-compatible client
    # and `q` holds the user's question.

    # Classify the question cheaply.
    question_class = client.chat.completions.create(
        model="mixtral-8x22b",
        messages=[{"role": "user", "content": f"Classify: {q}. Reply factual/reasoning/other."}],
        max_tokens=4,
    )
    label = (question_class.choices[0].message.content or "").lower()

    # Then route: only reasoning questions go to the expensive model.
    model = "llama-3.3-405b-instruct" if "reasoning" in label else "llama-3.3-70b-instruct"
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": q}],
    )

This is what the "auto" router does internally; doing it explicitly lets you tune the threshold to your domain.
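A sketch of what tuning that threshold might look like: the set of labels that escalate to the expensive model becomes a parameter, and the classifier's reply is normalised before matching. `pick_model` and `escalate_on` are hypothetical names, and the exact-match rule here is a stricter stand-in for the substring check in the example above, not a description of the router's internals.

```python
def pick_model(label: str,
               escalate_on: frozenset[str] = frozenset({"reasoning"})) -> str:
    """Map a classifier reply to a model id; unknown labels stay on the cheap model."""
    # Normalise whitespace, case, and stray punctuation from a short completion.
    cleaned = label.strip().lower().strip(".\"'")
    if cleaned in escalate_on:
        return "llama-3.3-405b-instruct"
    return "llama-3.3-70b-instruct"


# Default policy: only "reasoning" escalates.
print(pick_model(" Reasoning.\n"))
# A stricter domain might also escalate anything the classifier cannot place.
print(pick_model("other", escalate_on=frozenset({"reasoning", "other"})))
```

Defaulting unknown labels to the cheaper model keeps a malformed classifier reply from silently raising the bill.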
Quality evals
Before pinning, run a small held-out eval comparing 2-3 candidates on your actual prompts. We don't publish leaderboard scores — your task is the only benchmark that matters. The jc evals CLI runs a parallel batch:

    jc evals run --candidates llama-3.3-70b-instruct,deepseek-v3,mixtral-8x22b \
      --prompts ./my-eval-set.jsonl \
      --judge gpt-oss-120b \
      --output ./eval-results.csv

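A held-out set for the `--prompts` file could be drawn from logged traffic along these lines. This is a sketch under assumptions: the JSONL schema expected by the CLI is not documented here, so the `"prompt"` key is a guess to check against the CLI's own documentation, and `build_eval_set` and `to_jsonl` are hypothetical helpers.

```python
import json
import random


def build_eval_set(prompts: list[str], size: int, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample of prompts to hold out for evals."""
    # Deduplicate and sort so the same seed always yields the same sample.
    unique = sorted(set(prompts))
    rng = random.Random(seed)
    return rng.sample(unique, min(size, len(unique)))


def to_jsonl(prompts: list[str]) -> str:
    """Serialise one JSON object per line; the "prompt" key is an assumed schema."""
    return "\n".join(json.dumps({"prompt": p}, ensure_ascii=False) for p in prompts)


# Illustrative logged prompts; in practice these would come from your own traffic.
logged = ["Summarise this ticket", "Translate to Japanese", "Summarise this ticket",
          "Why does this loop never exit?", "Tag this email"]
held_out = build_eval_set(logged, size=3)
print(to_jsonl(held_out))
```

Writing the printed lines to `./my-eval-set.jsonl` gives a file in the assumed shape; fixing the seed keeps the set stable between eval runs so candidate comparisons stay like-for-like.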
Model deprecation policy
Open-weight models on the router get a minimum 6-month deprecation notice before removal. The closed-weight Claude models follow Anthropic's schedule (we mirror their deprecation dates). Subscribe to the changelog for explicit notices.