CircaOS — A Technical Whitepaper

v0.4 — 2026-05-13 · written for developers · ~15 min read

This document is the dev-honest version of the landing page. It describes what CircaOS is mechanically, what the open determinism bench actually measures, what trade-offs you are accepting if you point your client at cogos.5ceos.com/v1, and what we explicitly do not do. It cites the bench wherever a claim is testable, and it admits limits where limits exist. If you find a claim that doesn't survive a re-run, that's a PR, not a footnote.

Contents
  1. The specific production failures CircaOS fixes
  2. The mechanism: grammar-constrained decoding
  3. What "deterministic" actually means here
  4. Tier routing: why most calls don't need a frontier model
  5. The open bench: methodology and what it locks
  6. The cost model, with numbers
  7. The carbon math
  8. What CircaOS does NOT do
  9. Comparison with the alternatives
  10. What's next
  11. References
Ready to try?

Month-to-month. Cancel any time. No refunds (see Terms §9). The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.

Tier Price Requests / mo Start
Operator Starter$29 / mo100,000 · Tier B
Operator Pro$99 / mo500,000 · A + B
Operator Team$299 / mo2,000,000 · A + B · 99.0% SLA
Compliance$1,500 / mo5,000,000 · A + B · SOC 2 · DPA + BAA
Enterprise$100K / yr50M · dedicated GPU · 99.9% SLAContact →

Or read more first — landing · bench · full pricing detail


1 · The specific production failures CircaOS fixes

Every cloud LLM provider claims their structured-output mode is reliable and their temperature=0 is deterministic. Most of those claims don't survive a re-run. The result is four classes of production incident that engineering teams burn weeks on:

1.1 — Schema-validity drift

You pass a JSON Schema to the provider. The model returns markdown-fenced output. Or extra prose. Or a trailing comma. Your JSON.parse throws. You wrap the call in a retry loop, then a permissive parser, then a regex to strip fences. The retry loop is now ~30% of your latency budget and ~30% of your token spend, and you still have a 0.5–3% silent failure rate in production.

1.2 — Model-snapshot rotation

Your code worked two weeks ago. No one touched it. The provider rotated the model behind the same name (this is documented behaviour for several hosted providers: the model tag stays stable, the underlying weights ship quietly). Your prompt's pattern-matching against the old model's idioms silently degrades. You have no signal that anything changed.

1.3 — Sampling non-determinism even at temperature=0

"Temperature zero is greedy decoding" is mostly true and not sufficient. Hosted providers run batched inference with kernels that admit floating-point non-associativity at the matmul level; minor numerical differences propagate to token-level different selections; same prompt returns different bytes. The official line for at least one major provider is that temperature=0 is best-effort, not contractual.

1.4 — Rate-limit fragility

Your batch job is fine 364 nights a year. Tonight a different team in your org schedules a backfill that shares your account. You're throttled at 3 RPM on the starter tier. Your batch dies at 03:00. Your customers wake up to broken state at 07:00. You learn this is the per-account-not-per-key limit only by reading a forum post the next morning.

CircaOS exists because none of those four failure modes are fundamental to running an LLM in production. They're properties of the path your call is running through, not properties of LLMs.


2 · The mechanism: grammar-constrained decoding

The core idea is older than the recent boom: when a language model generates a token, it produces a probability distribution over the entire vocabulary, and you don't have to sample from the full distribution. You can mask the distribution against a context-free grammar derived from your JSON Schema, zero out every token that would make the partial output non-conforming, renormalize, then sample (or take the argmax at temperature=0).

The implementation matters. Two things land in production:

The net result: the model is physically prevented from emitting a non-conforming token at the decoder level. There is no post-validation retry loop because there is nothing to retry. Schema validity is 1.0000 by construction, not by best-effort.

This is not a CircaOS invention. Grammar-constrained decoding is implemented in llama.cpp, our substrate (0.5+), our substrate, Outlines, and several research stacks. What CircaOS does is operationalize it as a hosted loop — schema compilation, tier routing, provenance, audit, billing — wrapped around the underlying mechanism. The substrate is what's novel; the decoder layer is shoulders we stand on.

3 · What "deterministic" actually means here

We use the word carefully. CircaOS guarantees:

  1. Same input prompt + same schema + same model snapshot + same hardware → byte-identical output. Verifiable. The bench runs 20 identical calls per scenario and reports unique-output count; the production target is 1.
  2. Model snapshots are content-addressed and versioned visibly. When we move from Our model-3B-Instruct to a newer release, that ships as cogos-tier-b-v2, not as a silent swap behind cogos-tier-b. The current weights' SHA is in X-Cogos-Model on every response header.
  3. Sampling parameters are pinned. Temperature 0, top_p 1, top_k 0, seed 42 by default. Override per-call if you want sampling; the bench measures both modes.

We do NOT guarantee:


4 · Tier routing: why most calls don't need a frontier model

This is the cost-and-energy lever. The doctrine is simple: sufficient is sufficient. If a task is well-served by a 3B-parameter model, you should not be running it on a 70B-parameter model. The industry default of "just use GPT-4" (or its successors) treats inference compute as free; it isn't, and the bench measures the gap.

4.1 — Task shapes

CircaOS distinguishes two task shapes:

ShapeTierExamples
Classification-shaped Tier B (3B) Sentiment, routing, intent detection, extraction, scoring, binary/multi-class labels, schema-validation, content moderation, PII detection, language detection
Narrative-shaped Tier A (7B) Summarization, rewriting, multi-step reasoning, agent planning, code generation, structured-but-open-ended responses where the schema bounds form but not content

The router decides via the model alias in the request: model: "cogos-tier-b"Tier B model, model: "cogos-tier-a"Tier A model. There is no auto-classification at the request level; the developer picks the tier, which is intentional — we don't believe a meta-classifier should be making cost decisions for you silently. The default tier-A response header tells you exactly which model served the call.

4.2 — Why this matters

The literature on capability-by-parameter-count is now well-established: classification-shaped tasks saturate at roughly 3B parameters, sometimes lower. Open-weight models in the 3B class (Our model-3B-Instruct, Llama 3.2-3B, Phi-3.5-mini) score within 1–3% of 70B+ models on classification benchmarks while consuming roughly 1/20th the compute per token. The 70B model is sometimes better; it is almost never 20× better.

The internal measurement: across a representative production workload mix (classification 75%, narrative 25%), 75% of calls served by Tier B yields a 78% reduction in inference compute spend and a 72% reduction in energy draw, with semantic-validity scores within 0.7% of the all-Tier-A baseline. The bench publishes the full table by tier and by scenario so the trade-off is something you can audit, not something we ask you to take on faith.


5 · The open bench: methodology and what it locks

The bench at https://github.com/5CEOS-DRA/llm-determinism-bench is MIT-licensed, locked-methodology, and re-runs against the live inference path on a published cadence (currently weekly, GitHub Actions, results committed to results/<date>/ on the default branch).

5.1 — What it measures

  1. Schema-validity rate — fraction of N identical calls where the output parses to JSON and validates against the schema. Strict parser (must be valid JSON, no markdown fencing) and permissive parser (strip fences then parse) are reported separately.
  2. Semantic-validity rate — fraction of schema-valid outputs where hand-coded rubrics confirm the JSON actually answers the scenario. This is the "valid filler" defence: a model can emit {"answer":"yes"} to every question and score 100% on schema validity. Rubrics measure whether priority matches the urgency wording, whether deadline matches the relative time the scenario asked for, etc.
  3. Determinism score — count of unique outputs across N identical-input calls. Target = 1. The bench reports this raw.
  4. Cost-per-valid-output — provider's published per-call cost divided by schema-valid-rate. Surfaces the "cheap but unreliable" failure mode that pure cost benchmarks miss.

5.2 — What's locked

This is the property that makes the receipts credible.

5.3 — What's open

Customer-side acceptance test: clone the bench, set COGOS_LIVE_API_KEY, run python -m harness.loop, compare your CSV to the one in results/<latest-date>/. Any divergence is a publishable finding — either the gateway drifted or your environment differs in a way the bench should record. Drift will show up in the live-path CSV the same week.

6 · The cost model, with numbers

Pricing is per-month and per-request-budget, not per-token. We chose this shape because:

Tier Monthly Requests / mo $ / 1,000 requests Tier access
Operator Starter$29100,000$0.29Tier B
Operator Pro$99500,000$0.20A + B
Operator Team$2992,000,000$0.15A + B
Compliance$1,5005,000,000$0.30A + B + SOC 2 + DPA + BAA
Enterprise$100K / yr50,000,000$0.17A + B + dedicated GPU

For comparison context (current public list prices, mid-2026, indicative not contractual):

If your workload is 100% Tier-A-shaped and you're already getting schema-locked outputs from another provider at competitive cost, CircaOS probably saves you less than the headline number. The bench's $/valid-output column is where you check.

Ready to try?

Month-to-month. Cancel any time. No refunds (see Terms §9). The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.

Tier Price Requests / mo Start
Operator Starter$29 / mo100,000 · Tier B
Operator Pro$99 / mo500,000 · A + B
Operator Team$299 / mo2,000,000 · A + B · 99.0% SLA
Compliance$1,500 / mo5,000,000 · A + B · SOC 2 · DPA + BAA
Enterprise$100K / yr50M · dedicated GPU · 99.9% SLAContact →

Or read more first — landing · bench · full pricing detail


7 · The carbon math

Inference compute consumes energy; energy consumption produces emissions (carbon intensity depends on grid mix). The compute reduction from tier routing translates directly to energy reduction at roughly linear scale, modulo small fixed overheads (request routing, schema compilation, audit logging — all sub-1% in our measurements).

On the same representative production mix (75% classification, 25% narrative), shifting classification from a 70B model to a 3B model and keeping narrative on a 7B model yields a measured ~72% reduction in joules per valid output. The bench captures $/valid-output directly; J/valid-output is available with hardware-level power monitoring (the bench has an opt-in BENCH_MEASURE_POWER=1 flag using nvidia-smi; we publish quarterly results from our own runs).

Honest qualifier: power-savings numbers depend on (a) your workload mix (if you run all Tier A, the savings on power are zero), and (b) the grid carbon intensity at your inference site. We publish J/valid-output; we do not publish a single "CircaOS reduces your carbon footprint by X%" figure, because that figure depends on your specific workload and grid. The bench gives you the joules; multiply by your grid's gCO2eq/kWh for your number.

8 · What CircaOS does NOT do

The substrate is opinionated. Where it stops is part of the contract.


9 · Comparison with the alternatives

OptionDeterminismSchema-lockedAuditEffort
Hosted frontier API Best-effort at temp=0; documented to drift on snapshot rotation Provider-dependent; varies by SDK; permissive parsers common You implement Low to start; high to audit
Self-host our substrate + GBNF grammar Strong if you pin everything yourself Yes, at the decoder You implement High (you operate the GPU, monitor the loop, build the audit, build the bench)
Self-host our substrate + grammar mode Strong Yes You implement High; our substrate ops is non-trivial
CircaOS Pinned, audited, falsifiable via the bench Yes, at the decoder Append-only, hash-chained, header-exposed Drop-in (chat-completions shape)

If you have a serious infra team and the appetite to operate your own GPU inference stack, self-hosting our substrate or our substrate with grammar mode gives you the same primitive at the decoder level. CircaOS exists for teams that want the primitive without operating the substrate, plus the audit trail and the open determinism bench as a structural commitment.


10 · What's next

Public roadmap (subject to revision; we ship what survives the bench):


11 · References

If you read this and something here is wrong, please open an issue on the bench repo or email support@5ceos.com. We treat technical objections as the highest-value feedback we get. The doctrine, again: determinism by construction, not by hope.

Ready to try?

Month-to-month. Cancel any time. No refunds (see Terms §9). The bench is free and runs against our substrate locally if you want to validate the methodology before paying anything.

Tier Price Requests / mo Start
Operator Starter$29 / mo100,000 · Tier B
Operator Pro$99 / mo500,000 · A + B
Operator Team$299 / mo2,000,000 · A + B · 99.0% SLA
Compliance$1,500 / mo5,000,000 · A + B · SOC 2 · DPA + BAA
Enterprise$100K / yr50M · dedicated GPU · 99.9% SLAContact →

Or read more first — landing · bench · full pricing detail