CircaOS — A Technical Whitepaper

v0.4 — 2026-05-13 · written for developers · ~15 min read

This document is the dev-honest version of the landing page. It describes what CircaOS is mechanically, what the open determinism bench actually measures, what trade-offs you are accepting if you point your client at cogos.5ceos.com/v1, and what we explicitly do not do. It cites the bench wherever a claim is testable, and it admits limits where limits exist. If you find a claim that doesn't survive a re-run, that's a PR, not a footnote.

Contents

The specific production failures CircaOS fixes
The mechanism: grammar-constrained decoding
What "deterministic" actually means here
Tier routing: why most calls don't need a frontier model
The open bench: methodology and what it locks
The cost model, with numbers
The carbon math
What CircaOS does NOT do
Comparison with the alternatives
What's next
References

Ready to try?

Month-to-month. Cancel any time. No refunds (see Terms §9). The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.

Tier	Price	Requests / mo	Start
Operator Starter	$29 / mo	100,000 · Tier B
Operator Pro	$99 / mo	500,000 · A + B
Operator Team	$299 / mo	2,000,000 · A + B · 99.0% SLA
Compliance	$1,500 / mo	5,000,000 · A + B · SOC 2 · DPA + BAA
Enterprise	$100K / yr	50M · dedicated GPU · 99.9% SLA	Contact →

Or read more first — landing · bench · full pricing detail

1 · The specific production failures CircaOS fixes

Every cloud LLM provider claims their structured-output mode is reliable and their temperature=0 is deterministic. Most of those claims don't survive a re-run. The result is four classes of production incident that engineering teams burn weeks on:

1.1 — Schema-validity drift

You pass a JSON Schema to the provider. The model returns markdown-fenced output. Or extra prose. Or a trailing comma. Your JSON.parse throws. You wrap the call in a retry loop, then a permissive parser, then a regex to strip fences. The retry loop is now ~30% of your latency budget and ~30% of your token spend, and you still have a 0.5–3% silent failure rate in production.

1.2 — Model-snapshot rotation

Your code worked two weeks ago. No one touched it. The provider rotated the model behind the same name (this is documented behaviour for several hosted providers: the model tag stays stable, the underlying weights ship quietly). Your prompt's pattern-matching against the old model's idioms silently degrades. You have no signal that anything changed.

1.3 — Sampling non-determinism even at `temperature=0`

"Temperature zero is greedy decoding" is mostly true and not sufficient. Hosted providers run batched inference with kernels that admit floating-point non-associativity at the matmul level; minor numerical differences propagate to token-level different selections; same prompt returns different bytes. The official line for at least one major provider is that temperature=0 is best-effort, not contractual.

1.4 — Rate-limit fragility

Your batch job is fine 364 nights a year. Tonight a different team in your org schedules a backfill that shares your account. You're throttled at 3 RPM on the starter tier. Your batch dies at 03:00. Your customers wake up to broken state at 07:00. You learn this is the per-account-not-per-key limit only by reading a forum post the next morning.

CircaOS exists because none of those four failure modes are fundamental to running an LLM in production. They're properties of the path your call is running through, not properties of LLMs.

2 · The mechanism: grammar-constrained decoding

The core idea is older than the recent boom: when a language model generates a token, it produces a probability distribution over the entire vocabulary, and you don't have to sample from the full distribution. You can mask the distribution against a context-free grammar derived from your JSON Schema, zero out every token that would make the partial output non-conforming, renormalize, then sample (or take the argmax at temperature=0).

The implementation matters. Two things land in production:

A compiler from JSON Schema (Draft 2020-12) to a grammar representation the inference runtime can consume. CircaOS uses GBNF (used by llama.cpp / our substrate) and is portable to our substrate's grammar format. The compiler handles nested objects, arrays with minItems/maxItems, enums, oneOf/anyOf, $ref resolution, and tuple forms.
A decoder hook in the inference runtime that, at each decoding step, walks the grammar state machine forward, computes the set of vocabulary token IDs that keep the output valid, and applies a bitmask to the logits before argmax / sampling.

The net result: the model is physically prevented from emitting a non-conforming token at the decoder level. There is no post-validation retry loop because there is nothing to retry. Schema validity is 1.0000 by construction, not by best-effort.

This is not a CircaOS invention. Grammar-constrained decoding is implemented in llama.cpp, our substrate (0.5+), our substrate, Outlines, and several research stacks. What CircaOS does is operationalize it as a hosted loop — schema compilation, tier routing, provenance, audit, billing — wrapped around the underlying mechanism. The substrate is what's novel; the decoder layer is shoulders we stand on.

3 · What "deterministic" actually means here

We use the word carefully. CircaOS guarantees:

Same input prompt + same schema + same model snapshot + same hardware → byte-identical output. Verifiable. The bench runs 20 identical calls per scenario and reports unique-output count; the production target is 1.
Model snapshots are content-addressed and versioned visibly. When we move from Our model-3B-Instruct to a newer release, that ships as cogos-tier-b-v2, not as a silent swap behind cogos-tier-b. The current weights' SHA is in X-Cogos-Model on every response header.
Sampling parameters are pinned. Temperature 0, top_p 1, top_k 0, seed 42 by default. Override per-call if you want sampling; the bench measures both modes.

We do NOT guarantee:

Byte-equality across different hardware. Different GPUs have different floating-point rounding behaviour; we can pin the hardware on a single-tenant deployment (Enterprise tier) and otherwise we pin a hardware class (T4 family today). Customers who need bit-perfect cross-machine reproducibility should run the bench against their own dedicated instance.
Semantic correctness. Schema-locked decoding makes the JSON valid. It does not make the JSON right. The model's reasoning quality is the model's reasoning quality. The bench measures semantic validity with hand-coded rubrics precisely to separate "parseable" from "actually answers the question."
Determinism against arbitrary upstreams. If you configure CircaOS to point at someone else's hosted inference endpoint, you inherit their non-determinism. The guarantees hold against CircaOS-operated inference.

4 · Tier routing: why most calls don't need a frontier model

This is the cost-and-energy lever. The doctrine is simple: sufficient is sufficient. If a task is well-served by a 3B-parameter model, you should not be running it on a 70B-parameter model. The industry default of "just use GPT-4" (or its successors) treats inference compute as free; it isn't, and the bench measures the gap.

4.1 — Task shapes

CircaOS distinguishes two task shapes:

Shape	Tier	Examples
Classification-shaped	Tier B (3B)	Sentiment, routing, intent detection, extraction, scoring, binary/multi-class labels, schema-validation, content moderation, PII detection, language detection
Narrative-shaped	Tier A (7B)	Summarization, rewriting, multi-step reasoning, agent planning, code generation, structured-but-open-ended responses where the schema bounds form but not content

The router decides via the model alias in the request: model: "cogos-tier-b" → Tier B model, model: "cogos-tier-a" → Tier A model. There is no auto-classification at the request level; the developer picks the tier, which is intentional — we don't believe a meta-classifier should be making cost decisions for you silently. The default tier-A response header tells you exactly which model served the call.

4.2 — Why this matters

The literature on capability-by-parameter-count is now well-established: classification-shaped tasks saturate at roughly 3B parameters, sometimes lower. Open-weight models in the 3B class (Our model-3B-Instruct, Llama 3.2-3B, Phi-3.5-mini) score within 1–3% of 70B+ models on classification benchmarks while consuming roughly 1/20th the compute per token. The 70B model is sometimes better; it is almost never 20× better.

The internal measurement: across a representative production workload mix (classification 75%, narrative 25%), 75% of calls served by Tier B yields a 78% reduction in inference compute spend and a 72% reduction in energy draw, with semantic-validity scores within 0.7% of the all-Tier-A baseline. The bench publishes the full table by tier and by scenario so the trade-off is something you can audit, not something we ask you to take on faith.

5 · The open bench: methodology and what it locks

The bench at https://github.com/5CEOS-DRA/llm-determinism-bench is MIT-licensed, locked-methodology, and re-runs against the live inference path on a published cadence (currently weekly, GitHub Actions, results committed to results/<date>/ on the default branch).

5.1 — What it measures

Schema-validity rate — fraction of N identical calls where the output parses to JSON and validates against the schema. Strict parser (must be valid JSON, no markdown fencing) and permissive parser (strip fences then parse) are reported separately.
Semantic-validity rate — fraction of schema-valid outputs where hand-coded rubrics confirm the JSON actually answers the scenario. This is the "valid filler" defence: a model can emit {"answer":"yes"} to every question and score 100% on schema validity. Rubrics measure whether priority matches the urgency wording, whether deadline matches the relative time the scenario asked for, etc.
Determinism score — count of unique outputs across N identical-input calls. Target = 1. The bench reports this raw.
Cost-per-valid-output — provider's published per-call cost divided by schema-valid-rate. Surfaces the "cheap but unreliable" failure mode that pure cost benchmarks miss.

5.2 — What's locked

This is the property that makes the receipts credible.

Schemas — three tiers (flat 3-field, nested operator-task-deadline, complex 8-field routing with enums and nested constraints). Source: schemas/tier1.json through tier3.json. Cannot be tweaked per-run.
Scenarios — three per schema tier. Source: prompts/. Cannot be tweaked per-run.
Parsers — strict and permissive, both hand-implemented in parsers/. Cannot be replaced.
Rubrics — hand-coded per scenario in harness/rubrics.py. Specifically not LLM-judged, to defuse the "my LLM scored my LLM" failure mode.
Sample sizes — N1=20, N2=20, N3=10 per scenario. Cannot be reduced to cherry-pick.

5.3 — What's open

Which provider is run (our substrate local, cloud_a, cloud_b, cogos_live).
Which model identifier within the provider.
Trial count (env vars, can be raised but not lowered below the locked floor).
Add new providers via PR — the runner shape is in runners/*.py.

Customer-side acceptance test: clone the bench, set COGOS_LIVE_API_KEY, run python -m harness.loop, compare your CSV to the one in results/<latest-date>/. Any divergence is a publishable finding — either the gateway drifted or your environment differs in a way the bench should record. Drift will show up in the live-path CSV the same week.

6 · The cost model, with numbers

Pricing is per-month and per-request-budget, not per-token. We chose this shape because:

Per-token pricing punishes you for the model's verbosity, which you don't control.
Schema-locked decoding produces dramatically lower output-token counts on average (the model can't pad with prose), so per-token pricing would understate the actual savings.
Predictable per-month spend lets you build the cost into your unit economics without spreadsheet acrobatics.

Tier	Monthly	Requests / mo	$ / 1,000 requests	Tier access
Operator Starter	$29	100,000	$0.29	Tier B
Operator Pro	$99	500,000	$0.20	A + B
Operator Team	$299	2,000,000	$0.15	A + B
Compliance	$1,500	5,000,000	$0.30	A + B + SOC 2 + DPA + BAA
Enterprise	$100K / yr	50,000,000	$0.17	A + B + dedicated GPU

For comparison context (current public list prices, mid-2026, indicative not contractual):

A frontier hosted provider at $2.50 / million input tokens and $10 / million output tokens, averaging 800 input + 200 output per call, is roughly $2.00 / 1,000 requests at list — before retry-loop overhead from schema-validity failures.
Operator Pro at $0.20 / 1,000 requests is ~10× below that list, plus schema-validity is 1.0000 (no retry-loop overhead).

If your workload is 100% Tier-A-shaped and you're already getting schema-locked outputs from another provider at competitive cost, CircaOS probably saves you less than the headline number. The bench's $/valid-output column is where you check.

Ready to try?

Month-to-month. Cancel any time. No refunds (see Terms §9). The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.

Tier	Price	Requests / mo	Start
Operator Starter	$29 / mo	100,000 · Tier B
Operator Pro	$99 / mo	500,000 · A + B
Operator Team	$299 / mo	2,000,000 · A + B · 99.0% SLA
Compliance	$1,500 / mo	5,000,000 · A + B · SOC 2 · DPA + BAA
Enterprise	$100K / yr	50M · dedicated GPU · 99.9% SLA	Contact →

Or read more first — landing · bench · full pricing detail

7 · The carbon math

Inference compute consumes energy; energy consumption produces emissions (carbon intensity depends on grid mix). The compute reduction from tier routing translates directly to energy reduction at roughly linear scale, modulo small fixed overheads (request routing, schema compilation, audit logging — all sub-1% in our measurements).

On the same representative production mix (75% classification, 25% narrative), shifting classification from a 70B model to a 3B model and keeping narrative on a 7B model yields a measured ~72% reduction in joules per valid output. The bench captures $/valid-output directly; J/valid-output is available with hardware-level power monitoring (the bench has an opt-in BENCH_MEASURE_POWER=1 flag using nvidia-smi; we publish quarterly results from our own runs).

Honest qualifier: power-savings numbers depend on (a) your workload mix (if you run all Tier A, the savings on power are zero), and (b) the grid carbon intensity at your inference site. We publish J/valid-output; we do not publish a single "CircaOS reduces your carbon footprint by X%" figure, because that figure depends on your specific workload and grid. The bench gives you the joules; multiply by your grid's gCO2eq/kWh for your number.

8 · What CircaOS does NOT do

The substrate is opinionated. Where it stops is part of the contract.

We do not train models. CircaOS runs open-weight models (the substrate, Llama, Mistral). Training and fine-tuning are out of scope. If you need a fine-tuned model, you can serve it via the same gateway, but we won't fine-tune it for you.
We do not wrap third-party hosted LLMs in production. An OpenAI-compatible upstream adapter exists for operator-owned or BYO-customer endpoints (you point at your own our substrate deployment, a colo'd GPU, etc.). We do not silently relay your calls to any third-party hosted LLM provider behind the substrate. The doctrine is on the landing page: we can't sell against integration tax and be guilty of it.
We do not store your prompts or completions by default. The audit log is metadata only — request ID, model, latency, token counts, schema-enforcement flag, timestamp. Content is opt-in (some compliance customers need it). See Privacy §2.3.
We do not promise bit-equality across hardware classes. See §3. Single-tenant Enterprise deployments can; multi-tenant tiers pin a hardware family.
We do not implement custom routing logic per-tenant in v0. Tier is selected by the request. If you want a meta-classifier deciding tier for you, that's an application-layer concern, not a substrate one — at least for now.
We do not have a free tier. The cheapest plan is $29/mo. The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.
We do not store credit cards or PII directly. Stripe holds the card; we hold a customer ID and a key hash.

9 · Comparison with the alternatives

Option	Determinism	Schema-locked	Audit	Effort
Hosted frontier API	Best-effort at temp=0; documented to drift on snapshot rotation	Provider-dependent; varies by SDK; permissive parsers common	You implement	Low to start; high to audit
Self-host our substrate + GBNF grammar	Strong if you pin everything yourself	Yes, at the decoder	You implement	High (you operate the GPU, monitor the loop, build the audit, build the bench)
Self-host our substrate + grammar mode	Strong	Yes	You implement	High; our substrate ops is non-trivial
CircaOS	Pinned, audited, falsifiable via the bench	Yes, at the decoder	Append-only, hash-chained, header-exposed	Drop-in (chat-completions shape)

If you have a serious infra team and the appetite to operate your own GPU inference stack, self-hosting our substrate or our substrate with grammar mode gives you the same primitive at the decoder level. CircaOS exists for teams that want the primitive without operating the substrate, plus the audit trail and the open determinism bench as a structural commitment.

10 · What's next

Public roadmap (subject to revision; we ship what survives the bench):

Q3 2026: Add Llama 3.3 (3B and 8B) as alternative tier backends; the routing alias stays the same, the underlying weights become customer-selectable.
Q4 2026: our substrate upstream support reaching parity with the our substrate path; enables larger batch sizes for the Operator Team and Compliance tiers.
Q1 2027: Tool-use / function-calling with the same schema-locking guarantee applied to the tool-call arguments.
Continuous: Bench expansion. Every additional provider PR'd in expands the comparative footprint; every scenario that the bench catches drift on becomes a permanent regression test.

11 · References

Bench source & CSV results: https://github.com/5CEOS-DRA/llm-determinism-bench
Our model model family: the substrate.ai
GBNF grammar reference (llama.cpp): llama.cpp grammars
JSON Schema Draft 2020-12 specification: json-schema.org
Sampling non-determinism in batched GPU inference (community write-ups, multiple): search "temperature 0 nondeterminism floating point batched"
CircaOS terms / privacy / acceptable use: terms · privacy · aup

If you read this and something here is wrong, please open an issue on the bench repo or email support@5ceos.com. We treat technical objections as the highest-value feedback we get. The doctrine, again: determinism by construction, not by hope.

Ready to try?

Month-to-month. Cancel any time. No refunds (see Terms §9). The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.

Tier	Price	Requests / mo	Start
Operator Starter	$29 / mo	100,000 · Tier B
Operator Pro	$99 / mo	500,000 · A + B
Operator Team	$299 / mo	2,000,000 · A + B · 99.0% SLA
Compliance	$1,500 / mo	5,000,000 · A + B · SOC 2 · DPA + BAA
Enterprise	$100K / yr	50M · dedicated GPU · 99.9% SLA	Contact →

Or read more first — landing · bench · full pricing detail

CircaOS — A Technical Whitepaper

1 · The specific production failures CircaOS fixes

1.1 — Schema-validity drift

1.2 — Model-snapshot rotation

1.3 — Sampling non-determinism even at temperature=0

1.4 — Rate-limit fragility

2 · The mechanism: grammar-constrained decoding

3 · What "deterministic" actually means here

4 · Tier routing: why most calls don't need a frontier model

4.1 — Task shapes

4.2 — Why this matters

5 · The open bench: methodology and what it locks

5.1 — What it measures

5.2 — What's locked

5.3 — What's open

6 · The cost model, with numbers

7 · The carbon math

8 · What CircaOS does NOT do

9 · Comparison with the alternatives

10 · What's next

11 · References

1.3 — Sampling non-determinism even at `temperature=0`