Inference model selection

The router's `model: "auto"` setting is the right default for roughly 80% of inference traffic. This page covers the other 20%: which model to pin, and why.

The available models (v1)

Model	Open weights?	Best at	Typical J/request*
`llama-3.3-405b-instruct`	Yes	Long-context reasoning, code generation	~14
`llama-3.3-70b-instruct`	Yes	General-purpose chat, RAG aggregation, summarization	~3.2
`deepseek-v3`	Yes	Reasoning chains, math, code	~6.5
`qwen2.5-72b`	Yes	Multilingual, especially CJK, plus general tasks	~3.0
`mixtral-8x22b`	Yes	MoE; balanced; fast on quality plateau	~2.4
`gpt-oss-120b`	Yes	OpenAI's open-weight model; OpenAI-tuned behaviour	~5.1
`claude-haiku-4.5`	No (licensed)	Anthropic-tuned tone; long-context; tool use	~3.8
`flux-schnell`	Yes	Image generation, fast tier	~110 (per image)
`flux-dev`	Yes	Image generation, quality tier	~280 (per image)
`sdxl-1.0`	Yes	Image generation, classic / fine-tuneable	~95 (per image)
`whisper-large-v3`	Yes	Speech-to-text	~0.4 (per audio-second)
`text-embedding-3-small`	OAI-compat	Embeddings (1536-dim)	~0.02 (per chunk)
`text-embedding-3-large`	OAI-compat	Embeddings (3072-dim)	~0.05 (per chunk)

*Per-request joule figures are estimates that assume mid-tier silicon and an average prompt length for that model class. The actual figure for each call is always returned in the response header.
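
To budget a pinned workload, the table figures can be multiplied out by request count. The following is a minimal sketch, assuming the approximate per-request values above (text models only) are good enough for planning. `TYPICAL_JOULES` and `estimate_joules` are illustrative names, not part of the router's API.

```python
# Approximate J/request figures copied from the table above (text models only).
TYPICAL_JOULES = {
    "llama-3.3-405b-instruct": 14.0,
    "llama-3.3-70b-instruct": 3.2,
    "deepseek-v3": 6.5,
    "qwen2.5-72b": 3.0,
    "mixtral-8x22b": 2.4,
    "gpt-oss-120b": 5.1,
    "claude-haiku-4.5": 3.8,
}


def estimate_joules(requests_per_model: dict[str, int]) -> float:
    """Rough energy estimate for a workload: sum of count x typical J/request."""
    return sum(TYPICAL_JOULES[model] * count
               for model, count in requests_per_model.items())


# 1,000 classification calls on Mixtral plus 50 hard questions on the 405B model.
workload = {"mixtral-8x22b": 1000, "llama-3.3-405b-instruct": 50}
print(f"{estimate_joules(workload):.0f} J")  # prints "3100 J"
```

Treat the result as a planning number only; the per-call figure in the response header is the one to reconcile against.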

When to use `"auto"`

The classifier sniffs your prompt (free, sub-millisecond) and picks the cheapest model that should not regress on quality. "auto" is right when:

You have mixed traffic: lookups, classifications, summaries, the occasional hard reasoning question.
You care about the aggregate bill more than per-call latency.
You're running an internal tool where 10% latency variance is invisible to users.

Across a realistic mixed workload, "auto" typically lands 30-60% cheaper than pinning llama-3.3-70b-instruct for everything.

When to pin a model

Pin a model when:

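You need reproducible behaviour for testing or evals: the same model answers every call. (Pinning fixes the model, not the sampling; identical outputs still depend on settings such as temperature.)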
You're running a customer-facing brand voice — the chosen model's output style is part of your product.
You're hitting a specialty: Qwen for Chinese, DeepSeek for math chains, FLUX-dev for image quality.
You're benchmarking and want apples-to-apples.

Pick-by-job rules of thumb

Your job	Default pin	Why
RAG aggregation	`llama-3.3-70b-instruct`	Great context handling at mid-energy
Code generation	`deepseek-v3` or `llama-3.3-405b-instruct`	Specialised on code reasoning chains
Long-context summarization	`llama-3.3-405b-instruct`	Holds 128k+ context without quality slump
Multilingual support	`qwen2.5-72b`	Strongest on CJK and European languages at this size
Classification / tagging	`mixtral-8x22b`	MoE shape is cheap for short outputs
Chat with a "Claude" voice	`claude-haiku-4.5`	Pin if your customers expect Claude responses
Image gen, fast iteration	`flux-schnell`	4 steps; cheap
Image gen, final assets	`flux-dev`	20+ steps; higher quality
Embeddings, RAG default	`text-embedding-3-small`	1536-dim is enough for most
Embeddings, max recall	`text-embedding-3-large`	3072-dim; roughly 2.5x the energy per chunk
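
In application code, the rules of thumb above can live in a small lookup so the pin is chosen in one place. This is a sketch only: the job labels are made-up keys, the model IDs are the full IDs from the model table, and falling back to `"auto"` for unlisted jobs is an assumption about what you want, not router behaviour.

```python
# Default pins copied from the table above; keys are illustrative job labels.
DEFAULT_PIN = {
    "rag_aggregation": "llama-3.3-70b-instruct",
    "code_generation": "deepseek-v3",
    "long_context_summarization": "llama-3.3-405b-instruct",
    "multilingual": "qwen2.5-72b",
    "classification": "mixtral-8x22b",
    "claude_voice_chat": "claude-haiku-4.5",
    "image_fast": "flux-schnell",
    "image_final": "flux-dev",
    "embeddings_default": "text-embedding-3-small",
    "embeddings_max_recall": "text-embedding-3-large",
}


def model_for(job: str) -> str:
    """Return the default pin for a job, or "auto" when no rule of thumb applies."""
    return DEFAULT_PIN.get(job, "auto")


print(model_for("multilingual"))  # qwen2.5-72b
print(model_for("ad_hoc_query"))  # auto
```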

Mixed-model patterns

The strongest cost optimisation is to use cheap models for cheap parts and expensive models for expensive parts within one chain:

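# `client` is assumed to be an OpenAI-compatible client already configured
# for the router, and `q` the user's question as a string.

# classify the question cheaply
question_class = client.chat.completions.create(
    model="mixtral-8x22b",
    messages=[{"role": "user", "content": f"Classify: {q}. Reply factual/reasoning/other."}],
    max_tokens=4,
)
label = (question_class.choices[0].message.content or "").strip().lower()

# then route: only reasoning questions go to the expensive model
model = "llama-3.3-405b-instruct" if "reasoning" in label else "llama-3.3-70b-instruct"
answer = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": q}],
)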

This is what the "auto" router does internally; doing it explicitly lets you tune the threshold to your domain.
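
In the explicit version, the "threshold" is effectively the set of classifier labels that get escalated to the expensive model. A minimal sketch of making that set tunable, kept separate from the API calls so it can be unit-tested; `pick_model` and its parameters are illustrative names, and the substring match mirrors the check in the snippet above.

```python
def pick_model(label: str,
               escalate: frozenset[str] = frozenset({"reasoning"}),
               expensive: str = "llama-3.3-405b-instruct",
               cheap: str = "llama-3.3-70b-instruct") -> str:
    """Map a classifier label to a model ID.

    `escalate` is the tunable part: widen it to send more traffic to the
    expensive model, narrow it to save energy.
    """
    normalized = label.strip().lower()
    return expensive if any(tag in normalized for tag in escalate) else cheap


# Default policy: only "reasoning" escalates.
print(pick_model("Reasoning"))  # llama-3.3-405b-instruct
print(pick_model("factual"))    # llama-3.3-70b-instruct

# A domain where unclassifiable questions tend to be hard: escalate "other" too.
strict = frozenset({"reasoning", "other"})
print(pick_model("other", escalate=strict))  # llama-3.3-405b-instruct
```

Checking the escalation set against a handful of labelled questions from your own domain is a cheap way to see how much traffic each policy would send to the larger model.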

Quality evals

Before pinning, run a small held-out eval comparing 2-3 candidates on your actual prompts. We don't publish leaderboard scores — your task is the only benchmark that matters. The jc evals CLI runs a parallel batch:

jc evals run --candidates llama-3.3-70b-instruct,deepseek-v3,mixtral-8x22b \
  --prompts ./my-eval-set.jsonl \
  --judge gpt-oss-120b \
  --output ./eval-results.csv
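
One way to build the held-out set is to sample real prompts reproducibly, so every candidate is judged on the same inputs across reruns. The sketch below assumes a JSONL layout of one object per line with a `prompt` field; that field name is a guess, since the expected schema is not described here, so check it against the `jc evals` documentation before use.

```python
import json
import random


def sample_eval_set(prompts: list[str], k: int, seed: int = 0) -> list[str]:
    """Pick k distinct prompts reproducibly, so reruns compare the same set."""
    rng = random.Random(seed)
    return rng.sample(prompts, min(k, len(prompts)))


def write_jsonl(prompts: list[str], path: str) -> None:
    """Write one JSON object per line. The "prompt" key is an assumed field name."""
    with open(path, "w", encoding="utf-8") as f:
        for p in prompts:
            f.write(json.dumps({"prompt": p}, ensure_ascii=False) + "\n")


logged = [f"example prompt {i}" for i in range(500)]  # stand-in for real traffic
write_jsonl(sample_eval_set(logged, k=50), "./my-eval-set.jsonl")
```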

Model deprecation policy

Open-weight models on the router get a minimum 6-month deprecation notice before removal. The closed-weight Claude models follow Anthropic's schedule (we mirror their deprecation dates). Subscribe to the changelog for explicit notices.