Inference model selection

The router's model: "auto" setting is the right default for ~80% of inference traffic. This page is for the other 20% — when you want to know which model to pin and why.

The available models (v1)

| Model | Open weights? | Best at | Typical J/request* |
| --- | --- | --- | --- |
| llama-3.3-405b-instruct | Yes | Long-context reasoning, code generation | ~14 |
| llama-3.3-70b-instruct | Yes | General-purpose chat, RAG aggregation, summarization | ~3.2 |
| deepseek-v3 | Yes | Reasoning chains, math, code | ~6.5 |
| qwen2.5-72b | Yes | Multilingual, especially CJK, plus general tasks | ~3.0 |
| mixtral-8x22b | Yes | MoE; balanced; fast on quality plateau | ~2.4 |
| gpt-oss-120b | Yes | OpenAI's open-weight model; OpenAI-tuned behaviour | ~5.1 |
| claude-haiku-4.5 | No (licensed) | Anthropic-tuned tone; long-context; tool use | ~3.8 |
| flux-schnell | Yes | Image generation, fast tier | ~110 (per image) |
| flux-dev | Yes | Image generation, quality tier | ~280 (per image) |
| sdxl-1.0 | Yes | Image generation, classic / fine-tuneable | ~95 (per image) |
| whisper-large-v3 | Yes | Speech-to-text | ~0.4 (per audio-second) |
| text-embedding-3-small | OAI-compat | Embeddings (1536-dim) | ~0.02 (per chunk) |
| text-embedding-3-large | OAI-compat | Embeddings (3072-dim) | ~0.05 (per chunk) |

*Per-request joule figures assume mid-tier silicon and an average prompt length for that model class. The measured figure for each request is returned in the response header.
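To get a feel for how the typical figures add up, here is a minimal sketch that estimates energy for a request mix. The per-model numbers are copied from the table of available models; the `estimate_joules` helper and the 900/100 request mix are hypothetical and only illustrate the arithmetic. Measured per-request values will differ.

```python
# Typical joules per request for the text models, as listed in the models table.
TYPICAL_JOULES = {
    "llama-3.3-405b-instruct": 14.0,
    "llama-3.3-70b-instruct": 3.2,
    "deepseek-v3": 6.5,
    "qwen2.5-72b": 3.0,
    "mixtral-8x22b": 2.4,
    "gpt-oss-120b": 5.1,
    "claude-haiku-4.5": 3.8,
}


def estimate_joules(request_counts: dict) -> float:
    """Sum typical energy for a mix of requests, keyed by model name."""
    return sum(TYPICAL_JOULES[model] * n for model, n in request_counts.items())


# Hypothetical mix: 900 general chat requests and 100 long-context ones.
mixed = estimate_joules(
    {"llama-3.3-70b-instruct": 900, "llama-3.3-405b-instruct": 100}
)
# The same 1000 requests, all pinned to the largest model.
all_large = estimate_joules({"llama-3.3-405b-instruct": 1000})

print(f"mixed: {mixed:.0f} J, all on 405b: {all_large:.0f} J")
```

Under these assumptions the mixed workload uses roughly a third of the energy of pinning everything to the 405b model.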

When to use "auto"

The classifier sniffs your prompt (free, sub-millisecond) and picks the cheapest model that should not regress on quality. "auto" is right when:

Across a realistic mixed workload, "auto" typically lands 30-60% cheaper than pinning llama-3.3-70b-instruct for everything.
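As a rough worked example of that range, the sketch below turns the 30-60% figure into joules for a given request volume. It assumes cost scales with the typical ~3.2 J/request listed for llama-3.3-70b-instruct; `auto_energy_range` and the 10,000-request volume are hypothetical and exist only for illustration.

```python
BASELINE_J = 3.2  # typical J/request for llama-3.3-70b-instruct (models table)


def auto_energy_range(n_requests: int, low_saving: float = 0.30,
                      high_saving: float = 0.60):
    """Return (baseline, best_case, worst_case) joules for n_requests.

    best_case applies the larger saving, worst_case the smaller one.
    """
    baseline = BASELINE_J * n_requests
    best_case = baseline * (1 - high_saving)
    worst_case = baseline * (1 - low_saving)
    return baseline, best_case, worst_case


baseline, best_case, worst_case = auto_energy_range(10_000)
print(f"pinned: {baseline:.0f} J, auto: {best_case:.0f} to {worst_case:.0f} J")
```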

When to pin a model

Pin a model when:

Pick-by-job rules of thumb

| Your job | Default pin | Why |
| --- | --- | --- |
| RAG aggregation | llama-3.3-70b-instruct | Great context handling at mid-energy |
| Code generation | deepseek-v3 or llama-3.3-405b-instruct | Specialised on code reasoning chains |
| Long-context summarization | llama-3.3-405b-instruct | Holds 128k+ context without quality slump |
| Multilingual support | qwen2.5-72b | Strongest on CJK + EU langs at this size |
| Classification / tagging | mixtral-8x22b | MoE shape is cheap for short outputs |
| Chat with a "Claude" voice | claude-haiku-4.5 | Pin if your customers expect Claude responses |
| Image gen, fast iteration | flux-schnell | 4 steps; cheap |
| Image gen, final assets | flux-dev | 20+ steps; higher quality |
| Embeddings, RAG default | text-embedding-3-small | 1536-dim is enough for most |
| Embeddings, max recall | text-embedding-3-large | 3072-dim; about 2.5x the energy per chunk |
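One way to apply these rules of thumb in application code is a small lookup with a fallback. This is only a sketch: the job keys and the `model_for` helper are hypothetical names, the model names are those in the list of available models, and falling back to "auto" for unlisted jobs is an assumption rather than documented router behaviour.

```python
# Hypothetical job keys mapped to the default pins from the rules of thumb.
DEFAULT_PINS = {
    "rag_aggregation": "llama-3.3-70b-instruct",
    "code_generation": "deepseek-v3",
    "long_context_summarization": "llama-3.3-405b-instruct",
    "multilingual": "qwen2.5-72b",
    "classification": "mixtral-8x22b",
    "claude_voice_chat": "claude-haiku-4.5",
    "image_fast": "flux-schnell",
    "image_final": "flux-dev",
    "embeddings_default": "text-embedding-3-small",
    "embeddings_max_recall": "text-embedding-3-large",
}


def model_for(job: str) -> str:
    """Return the pinned model for a known job, else fall back to "auto"."""
    return DEFAULT_PINS.get(job, "auto")


print(model_for("code_generation"))
print(model_for("something_unlisted"))
```

Keeping the mapping in one place makes it easy to swap a pin after an eval without touching call sites.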

Mixed-model patterns

The strongest cost optimisation is to mix models within one chain: use a cheap model for the easy steps and an expensive model only for the hard ones:

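# Assumes `client` is an OpenAI-compatible SDK client already pointed at the
# router, and `q` holds the user's question.

# Classify the question cheaply.
question_class = client.chat.completions.create(
    model="mixtral-8x22b",
    messages=[{"role": "user", "content": f"Classify: {q}. Reply factual/reasoning/other."}],
    max_tokens=4,
)
# Normalise the reply; the content can be empty or differently cased.
label = (question_class.choices[0].message.content or "").strip().lower()

# Then route: only reasoning questions go to the large model.
model = "llama-3.3-405b-instruct" if "reasoning" in label else "llama-3.3-70b-instruct"
answer = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": q}],
)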

This is, in essence, what the "auto" router does internally; doing it explicitly lets you tune the classification prompt and the routing rule to your domain.
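The tunable part of that pattern can be pulled out into a small pure function so it is easy to test without calling any model. The sketch below is an illustration, not the router's implementation: `pick_model`, its parameters, and the choice to send unknown or empty labels to the cheaper model are all assumptions.

```python
from typing import FrozenSet, Optional


def pick_model(
    label: Optional[str],
    escalate_on: FrozenSet[str] = frozenset({"reasoning"}),
    cheap: str = "llama-3.3-70b-instruct",
    expensive: str = "llama-3.3-405b-instruct",
) -> str:
    """Map a classifier reply to a model name.

    The reply is normalised first, because a 4-token completion may come back
    as "Reasoning", "reasoning.", or padded with whitespace. Anything that is
    not in `escalate_on` (including an empty reply) goes to the cheap model.
    """
    cleaned = (label or "").strip().lower().strip(".!\"'")
    return expensive if cleaned in escalate_on else cheap


print(pick_model("Reasoning."))
print(pick_model(" factual "))
```

Tuning for a domain then amounts to changing `escalate_on`, for example escalating "other" as well if unclassifiable questions tend to be hard ones.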

Quality evals

Before pinning, run a small held-out eval comparing 2-3 candidates on your actual prompts. We don't publish leaderboard scores — your task is the only benchmark that matters. The jc evals CLI runs a parallel batch:

jc evals run --candidates llama-3.3-70b-instruct,deepseek-v3,mixtral-8x22b \
  --prompts ./my-eval-set.jsonl \
  --judge gpt-oss-120b \
  --output ./eval-results.csv
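Once a run finishes, the results can be summarised per candidate. The layout of eval-results.csv is not documented on this page, so the sketch below assumes a hypothetical format with one row per judged prompt and `candidate` and `score` columns; adjust the column names to whatever the file actually contains. The sample rows are made up to keep the example self-contained.

```python
import csv
import io
from collections import defaultdict


def mean_score_by_candidate(csv_file) -> dict:
    """Average the `score` column per `candidate` (assumed column names)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["candidate"]] += float(row["score"])
        counts[row["candidate"]] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Made-up rows standing in for open("./eval-results.csv", newline="").
sample = io.StringIO(
    "candidate,score\n"
    "mixtral-8x22b,0.8\n"
    "deepseek-v3,0.9\n"
    "mixtral-8x22b,0.6\n"
    "deepseek-v3,0.7\n"
)
means = mean_score_by_candidate(sample)
best = max(means, key=means.get)
print(best, round(means[best], 2))
```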

Model deprecation policy

Open-weight models on the router get a minimum 6-month deprecation notice before removal. The closed-weight Claude models follow Anthropic's schedule (we mirror their deprecation dates). Subscribe to the changelog for explicit notices.