CircaOS — A Technical Whitepaper
This document is the dev-honest version of the landing page. It describes
what CircaOS is mechanically, what the open determinism bench actually
measures, what trade-offs you are accepting if you point your client at
cogos.5ceos.com/v1, and what we explicitly do not do. It
cites the bench wherever a claim is testable, and it admits limits where
limits exist. If you find a claim that doesn't survive a re-run, that's a
PR, not a footnote.
- The specific production failures CircaOS fixes
- The mechanism: grammar-constrained decoding
- What "deterministic" actually means here
- Tier routing: why most calls don't need a frontier model
- The open bench: methodology and what it locks
- The cost model, with numbers
- The carbon math
- What CircaOS does NOT do
- Comparison with the alternatives
- What's next
- References
Month-to-month. Cancel any time. No refunds (see Terms §9). The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.
| Tier | Price | Requests / mo | Start |
|---|---|---|---|
| Operator Starter | $29 / mo | 100,000 · Tier B | |
| Operator Pro | $99 / mo | 500,000 · A + B | |
| Operator Team | $299 / mo | 2,000,000 · A + B · 99.0% SLA | |
| Compliance | $1,500 / mo | 5,000,000 · A + B · SOC 2 · DPA + BAA | |
| Enterprise | $100K / yr | 50M · dedicated GPU · 99.9% SLA | Contact → |
Or read more first — landing · bench · full pricing detail
1 · The specific production failures CircaOS fixes
Every cloud LLM provider claims their structured-output mode is reliable
and their temperature=0 is deterministic. Most of those claims
don't survive a re-run. The result is four classes of production incident
that engineering teams burn weeks on:
1.1 — Schema-validity drift
You pass a JSON Schema to the provider. The model returns markdown-fenced
output. Or extra prose. Or a trailing comma. Your JSON.parse
throws. You wrap the call in a retry loop, then a permissive parser, then
a regex to strip fences. The retry loop is now ~30% of your latency budget
and ~30% of your token spend, and you still have a 0.5–3% silent failure
rate in production.
1.2 — Model-snapshot rotation
Your code worked two weeks ago. No one touched it. The provider rotated the model behind the same name (this is documented behaviour for several hosted providers: the model tag stays stable, the underlying weights ship quietly). Your prompt's pattern-matching against the old model's idioms silently degrades. You have no signal that anything changed.
1.3 — Sampling non-determinism even at temperature=0
"Temperature zero is greedy decoding" is mostly true and not sufficient. Hosted providers run batched inference with kernels that admit floating-point non-associativity at the matmul level; minor numerical differences propagate to token-level different selections; same prompt returns different bytes. The official line for at least one major provider is that temperature=0 is best-effort, not contractual.
1.4 — Rate-limit fragility
Your batch job is fine 364 nights a year. Tonight a different team in your org schedules a backfill that shares your account. You're throttled at 3 RPM on the starter tier. Your batch dies at 03:00. Your customers wake up to broken state at 07:00. You learn this is the per-account-not-per-key limit only by reading a forum post the next morning.
CircaOS exists because none of those four failure modes are fundamental to running an LLM in production. They're properties of the path your call is running through, not properties of LLMs.
2 · The mechanism: grammar-constrained decoding
The core idea is older than the recent boom: when a language model
generates a token, it produces a probability distribution over the entire
vocabulary, and you don't have to sample from the full distribution.
You can mask the distribution against a context-free
grammar derived from your JSON Schema, zero out every token that would
make the partial output non-conforming, renormalize, then sample (or take
the argmax at temperature=0).
The implementation matters. Two things land in production:
- A compiler from JSON Schema (Draft 2020-12) to a grammar representation the inference runtime can consume. CircaOS uses GBNF (used by llama.cpp / our substrate) and is portable to our substrate's grammar format. The compiler handles nested objects, arrays with minItems/maxItems, enums, oneOf/anyOf, $ref resolution, and tuple forms.
- A decoder hook in the inference runtime that, at each decoding step, walks the grammar state machine forward, computes the set of vocabulary token IDs that keep the output valid, and applies a bitmask to the logits before argmax / sampling.
The net result: the model is physically prevented from emitting a non-conforming token at the decoder level. There is no post-validation retry loop because there is nothing to retry. Schema validity is 1.0000 by construction, not by best-effort.
3 · What "deterministic" actually means here
We use the word carefully. CircaOS guarantees:
- Same input prompt + same schema + same model snapshot + same hardware → byte-identical output. Verifiable. The bench runs 20 identical calls per scenario and reports unique-output count; the production target is 1.
- Model snapshots are content-addressed and versioned
visibly. When we move from Our model-3B-Instruct to a
newer release, that ships as
cogos-tier-b-v2, not as a silent swap behindcogos-tier-b. The current weights' SHA is inX-Cogos-Modelon every response header. - Sampling parameters are pinned. Temperature 0, top_p 1, top_k 0, seed 42 by default. Override per-call if you want sampling; the bench measures both modes.
We do NOT guarantee:
- Byte-equality across different hardware. Different GPUs have different floating-point rounding behaviour; we can pin the hardware on a single-tenant deployment (Enterprise tier) and otherwise we pin a hardware class (T4 family today). Customers who need bit-perfect cross-machine reproducibility should run the bench against their own dedicated instance.
- Semantic correctness. Schema-locked decoding makes the JSON valid. It does not make the JSON right. The model's reasoning quality is the model's reasoning quality. The bench measures semantic validity with hand-coded rubrics precisely to separate "parseable" from "actually answers the question."
- Determinism against arbitrary upstreams. If you configure CircaOS to point at someone else's hosted inference endpoint, you inherit their non-determinism. The guarantees hold against CircaOS-operated inference.
4 · Tier routing: why most calls don't need a frontier model
This is the cost-and-energy lever. The doctrine is simple: sufficient is sufficient. If a task is well-served by a 3B-parameter model, you should not be running it on a 70B-parameter model. The industry default of "just use GPT-4" (or its successors) treats inference compute as free; it isn't, and the bench measures the gap.
4.1 — Task shapes
CircaOS distinguishes two task shapes:
| Shape | Tier | Examples |
|---|---|---|
| Classification-shaped | Tier B (3B) | Sentiment, routing, intent detection, extraction, scoring, binary/multi-class labels, schema-validation, content moderation, PII detection, language detection |
| Narrative-shaped | Tier A (7B) | Summarization, rewriting, multi-step reasoning, agent planning, code generation, structured-but-open-ended responses where the schema bounds form but not content |
The router decides via the model alias in the request:
model: "cogos-tier-b" →
Tier B model,
model: "cogos-tier-a" →
Tier A model.
There is no auto-classification at the request level; the developer
picks the tier, which is intentional — we don't believe a meta-classifier
should be making cost decisions for you silently. The default tier-A response
header tells you exactly which model served the call.
4.2 — Why this matters
The literature on capability-by-parameter-count is now well-established: classification-shaped tasks saturate at roughly 3B parameters, sometimes lower. Open-weight models in the 3B class (Our model-3B-Instruct, Llama 3.2-3B, Phi-3.5-mini) score within 1–3% of 70B+ models on classification benchmarks while consuming roughly 1/20th the compute per token. The 70B model is sometimes better; it is almost never 20× better.
The internal measurement: across a representative production workload mix (classification 75%, narrative 25%), 75% of calls served by Tier B yields a 78% reduction in inference compute spend and a 72% reduction in energy draw, with semantic-validity scores within 0.7% of the all-Tier-A baseline. The bench publishes the full table by tier and by scenario so the trade-off is something you can audit, not something we ask you to take on faith.
5 · The open bench: methodology and what it locks
The bench at https://github.com/5CEOS-DRA/llm-determinism-bench is MIT-licensed,
locked-methodology, and re-runs against the live inference path on a
published cadence (currently weekly, GitHub Actions, results committed to
results/<date>/ on the default branch).
5.1 — What it measures
- Schema-validity rate — fraction of N identical calls where the output parses to JSON and validates against the schema. Strict parser (must be valid JSON, no markdown fencing) and permissive parser (strip fences then parse) are reported separately.
- Semantic-validity rate — fraction of schema-valid
outputs where hand-coded rubrics confirm the JSON actually
answers the scenario. This is the "valid filler"
defence: a model can emit
{"answer":"yes"}to every question and score 100% on schema validity. Rubrics measure whetherprioritymatches the urgency wording, whetherdeadlinematches the relative time the scenario asked for, etc. - Determinism score — count of unique outputs across N identical-input calls. Target = 1. The bench reports this raw.
- Cost-per-valid-output — provider's published per-call cost divided by schema-valid-rate. Surfaces the "cheap but unreliable" failure mode that pure cost benchmarks miss.
5.2 — What's locked
This is the property that makes the receipts credible.
- Schemas — three tiers (flat 3-field, nested
operator-task-deadline, complex 8-field routing with enums and nested
constraints). Source:
schemas/tier1.jsonthroughtier3.json. Cannot be tweaked per-run. - Scenarios — three per schema tier. Source:
prompts/. Cannot be tweaked per-run. - Parsers — strict and permissive, both
hand-implemented in
parsers/. Cannot be replaced. - Rubrics — hand-coded per scenario in
harness/rubrics.py. Specifically not LLM-judged, to defuse the "my LLM scored my LLM" failure mode. - Sample sizes — N1=20, N2=20, N3=10 per scenario. Cannot be reduced to cherry-pick.
5.3 — What's open
- Which provider is run (our substrate local, cloud_a, cloud_b, cogos_live).
- Which model identifier within the provider.
- Trial count (env vars, can be raised but not lowered below the locked floor).
- Add new providers via PR — the runner shape is in
runners/*.py.
COGOS_LIVE_API_KEY, run python -m harness.loop,
compare your CSV to the one in results/<latest-date>/.
Any divergence is a publishable finding — either the gateway drifted or
your environment differs in a way the bench should record. Drift will
show up in the live-path CSV the same week.
6 · The cost model, with numbers
Pricing is per-month and per-request-budget, not per-token. We chose this shape because:
- Per-token pricing punishes you for the model's verbosity, which you don't control.
- Schema-locked decoding produces dramatically lower output-token counts on average (the model can't pad with prose), so per-token pricing would understate the actual savings.
- Predictable per-month spend lets you build the cost into your unit economics without spreadsheet acrobatics.
| Tier | Monthly | Requests / mo | $ / 1,000 requests | Tier access |
|---|---|---|---|---|
| Operator Starter | $29 | 100,000 | $0.29 | Tier B |
| Operator Pro | $99 | 500,000 | $0.20 | A + B |
| Operator Team | $299 | 2,000,000 | $0.15 | A + B |
| Compliance | $1,500 | 5,000,000 | $0.30 | A + B + SOC 2 + DPA + BAA |
| Enterprise | $100K / yr | 50,000,000 | $0.17 | A + B + dedicated GPU |
For comparison context (current public list prices, mid-2026, indicative not contractual):
- A frontier hosted provider at $2.50 / million input tokens and $10 / million output tokens, averaging 800 input + 200 output per call, is roughly $2.00 / 1,000 requests at list — before retry-loop overhead from schema-validity failures.
- Operator Pro at $0.20 / 1,000 requests is ~10× below that list, plus schema-validity is 1.0000 (no retry-loop overhead).
If your workload is 100% Tier-A-shaped and you're already getting
schema-locked outputs from another provider at competitive cost, CircaOS
probably saves you less than the headline number. The bench's
$/valid-output column is where you check.
Month-to-month. Cancel any time. No refunds (see Terms §9). The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.
| Tier | Price | Requests / mo | Start |
|---|---|---|---|
| Operator Starter | $29 / mo | 100,000 · Tier B | |
| Operator Pro | $99 / mo | 500,000 · A + B | |
| Operator Team | $299 / mo | 2,000,000 · A + B · 99.0% SLA | |
| Compliance | $1,500 / mo | 5,000,000 · A + B · SOC 2 · DPA + BAA | |
| Enterprise | $100K / yr | 50M · dedicated GPU · 99.9% SLA | Contact → |
Or read more first — landing · bench · full pricing detail
7 · The carbon math
Inference compute consumes energy; energy consumption produces emissions (carbon intensity depends on grid mix). The compute reduction from tier routing translates directly to energy reduction at roughly linear scale, modulo small fixed overheads (request routing, schema compilation, audit logging — all sub-1% in our measurements).
On the same representative production mix (75% classification, 25%
narrative), shifting classification from a 70B model to a 3B model and
keeping narrative on a 7B model yields a measured ~72% reduction
in joules per valid output. The bench captures
$/valid-output directly; J/valid-output is
available with hardware-level power monitoring (the bench has an
opt-in BENCH_MEASURE_POWER=1 flag using
nvidia-smi; we publish quarterly results from our own runs).
gCO2eq/kWh for your number.
8 · What CircaOS does NOT do
The substrate is opinionated. Where it stops is part of the contract.
- We do not train models. CircaOS runs open-weight models (the substrate, Llama, Mistral). Training and fine-tuning are out of scope. If you need a fine-tuned model, you can serve it via the same gateway, but we won't fine-tune it for you.
- We do not wrap third-party hosted LLMs in production. An OpenAI-compatible upstream adapter exists for operator-owned or BYO-customer endpoints (you point at your own our substrate deployment, a colo'd GPU, etc.). We do not silently relay your calls to any third-party hosted LLM provider behind the substrate. The doctrine is on the landing page: we can't sell against integration tax and be guilty of it.
- We do not store your prompts or completions by default. The audit log is metadata only — request ID, model, latency, token counts, schema-enforcement flag, timestamp. Content is opt-in (some compliance customers need it). See Privacy §2.3.
- We do not promise bit-equality across hardware classes. See §3. Single-tenant Enterprise deployments can; multi-tenant tiers pin a hardware family.
- We do not implement custom routing logic per-tenant in v0. Tier is selected by the request. If you want a meta-classifier deciding tier for you, that's an application-layer concern, not a substrate one — at least for now.
- We do not have a free tier. The cheapest plan is $29/mo. The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.
- We do not store credit cards or PII directly. Stripe holds the card; we hold a customer ID and a key hash.
9 · Comparison with the alternatives
| Option | Determinism | Schema-locked | Audit | Effort |
|---|---|---|---|---|
| Hosted frontier API | Best-effort at temp=0; documented to drift on snapshot rotation | Provider-dependent; varies by SDK; permissive parsers common | You implement | Low to start; high to audit |
| Self-host our substrate + GBNF grammar | Strong if you pin everything yourself | Yes, at the decoder | You implement | High (you operate the GPU, monitor the loop, build the audit, build the bench) |
| Self-host our substrate + grammar mode | Strong | Yes | You implement | High; our substrate ops is non-trivial |
| CircaOS | Pinned, audited, falsifiable via the bench | Yes, at the decoder | Append-only, hash-chained, header-exposed | Drop-in (chat-completions shape) |
If you have a serious infra team and the appetite to operate your own GPU inference stack, self-hosting our substrate or our substrate with grammar mode gives you the same primitive at the decoder level. CircaOS exists for teams that want the primitive without operating the substrate, plus the audit trail and the open determinism bench as a structural commitment.
10 · What's next
Public roadmap (subject to revision; we ship what survives the bench):
- Q3 2026: Add Llama 3.3 (3B and 8B) as alternative tier backends; the routing alias stays the same, the underlying weights become customer-selectable.
- Q4 2026: our substrate upstream support reaching parity with the our substrate path; enables larger batch sizes for the Operator Team and Compliance tiers.
- Q1 2027: Tool-use / function-calling with the same schema-locking guarantee applied to the tool-call arguments.
- Continuous: Bench expansion. Every additional provider PR'd in expands the comparative footprint; every scenario that the bench catches drift on becomes a permanent regression test.
11 · References
- Bench source & CSV results: https://github.com/5CEOS-DRA/llm-determinism-bench
- Our model model family: the substrate.ai
- GBNF grammar reference (llama.cpp): llama.cpp grammars
- JSON Schema Draft 2020-12 specification: json-schema.org
- Sampling non-determinism in batched GPU inference (community write-ups, multiple): search "temperature 0 nondeterminism floating point batched"
- CircaOS terms / privacy / acceptable use: terms · privacy · aup
If you read this and something here is wrong, please open an issue on the bench repo or email support@5ceos.com. We treat technical objections as the highest-value feedback we get. The doctrine, again: determinism by construction, not by hope.
Month-to-month. Cancel any time. No refunds (see Terms §9). The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.
| Tier | Price | Requests / mo | Start |
|---|---|---|---|
| Operator Starter | $29 / mo | 100,000 · Tier B | |
| Operator Pro | $99 / mo | 500,000 · A + B | |
| Operator Team | $299 / mo | 2,000,000 · A + B · 99.0% SLA | |
| Compliance | $1,500 / mo | 5,000,000 · A + B · SOC 2 · DPA + BAA | |
| Enterprise | $100K / yr | 50M · dedicated GPU · 99.9% SLA | Contact → |
Or read more first — landing · bench · full pricing detail