CircaOS

The substrate behind 5CEOs. Available standalone.

Reproducible LLM calls. No retry loops. No model drift. Schema-locked decoding means the model physically can't emit malformed JSON. Same call → same bytes out. Same call next month → same bytes out. Same call under load → no rate limit, no throttle. 78% less inference spend, 72% less carbon on a typical production mix — because most of your calls don't need a frontier model. Stop debugging the LLM. Ship the feature.

🟢 Live now: this gateway is serving real traffic. Hit /health for the heartbeat. Every claim below is verifiable in the public bench — open-source, MIT, run it yourself with any provider's credentials.

🔒 Verifiable security, not vendor-attested. Every /v1 response carries an HMAC tamper signature and an Ed25519 attestation receipt binding the request, the response, the running build, and our audit chain head. Customer authentication can be Ed25519 keypair-based — you hold the private key, we hold only the public. Each customer's audit log is hash-chained per-(tenant, app), and the content of every row is encrypted to your X25519 public key at write time — we can't decrypt the rows we just wrote. Server-side HMAC secrets and the attestation signing key sit at rest as AES-256-GCM ciphertext under a Key-Vault-resolved DEK — a stolen disk yields ciphertext only. Customers self-rotate via POST /v1/keys/rotate with a 24-hour grace window, and a leaked key auto-quarantines the moment a scanner pattern fires from the same IP. Container images are cosign-signed, the runtime is distroless with no shell and no admin endpoint on the internet, and a Container App Job runs continuous probes against this domain daily. See /trust → · verify a signature →

→ Run the 90-second proof. Copy-paste code that proves determinism, schema-locking, and cost on your own machine. Open the demo →

🎁 Try free, no card. 100 requests/day on Tier B (the 3B substrate model, schema-locked, signed responses). One curl to get a key.

Stay informed. Drop your email and we'll ping you when there's news worth your attention — new tier, new substrate primitive, customer milestone, breaking benchmark result. No spam, no upsells, no third-party trackers.

Or browse the cookbook. Six archetypal patterns — extraction, classification, routing, scoring, agent step, multi-extract. Copy-paste recipes →

Or read the dev-honest version. The technical whitepaper — mechanism, bench methodology, cost math, and the explicit list of things CircaOS does not do. ~15 min read.

Why this matters

Most AI failures don't come from your code — they come from drift, retries, malformed JSON, and silent provider changes that break features without warning. But the deeper cost is bigger than debugging. When a company can't get a truthful, stable view of its own operations, everyone pays for it: higher prices, slower products, wasted compute, wasted labor, and decisions made on bad data.

This loop removes that waste. Reproducible LLM calls with no drift, no retry storms, and no malformed output because schema-locked decoding makes invalid JSON physically impossible. Same call → same bytes out. Same call next month → same bytes out. Same call under load → no throttles, no surprises.

The result is 78% less inference spend, 72% less carbon, and a business that finally sees what's actually happening instead of guessing.

Developers stop debugging.

Employers stop burning money.

Customers stop paying for the company's confusion.

The mechanism

Deterministic

Every call is a closed function: input → bytes out. Schema-locked at the decoder level (the model physically can't emit non-conforming JSON). Sampling settings pinned, temperature 0 by default. Run the same prompt 20 times, get 20 identical responses. Verifiable via the public bench — we re-run it against our live inference path on a published cadence so determinism is something you can audit, not something we ask you to take on faith.
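If you would rather check the "20 runs, 1 unique output" claim yourself than read about it, a minimal sketch of the counting logic might look like this. The `send` callable is a placeholder for whatever code of yours issues the request and returns the response bytes; the helper name and the choice of SHA-256 for comparison are assumptions made here, not part of the CircaOS bench.

```python
import hashlib
from typing import Callable


def count_unique_outputs(send: Callable[[], bytes], runs: int = 20) -> int:
    """Call `send` `runs` times and count the distinct response bodies.

    `send` is a hypothetical callable you supply: it should issue the same
    request each time and return the bytes you want to compare.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Hash each body so large responses are compared by digest, not kept in memory.
    digests = {hashlib.sha256(send()).hexdigest() for _ in range(runs)}
    return len(digests)


if __name__ == "__main__":
    # Stand-in for a real call; replace with a function that hits your endpoint.
    fake_send = lambda: b'{"country":"France","capital":"Paris"}'
    print(count_unique_outputs(fake_send, runs=20))
```

A result of 1 means every run returned identical bytes. Whether you compare the whole response body or only the message content is your call; the sketch compares whatever `send` returns.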

Uptime

Local inference, no third-party rate limit, no provider snapshot rotation, no ToS surface that can change under you. Your plan's request budget is yours — burst as hard as you need within it. The loop stays up because there's no remote dependency to fail.

Loop

Request → constrained decode → schema-validated response → provenance event → metered usage. Every step deterministic, every step observable, every step replayable from the hash-chained event log. The substrate isn't an LLM endpoint; it's a loop you can build production code on.
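To make "replayable from the hash-chained event log" concrete, here is a small sketch of how a hash chain works in general. It is not the CircaOS log format: the field names, the all-zero genesis value, and the use of SHA-256 over canonical JSON are assumptions chosen for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain head


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) so the hash is stable.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(log: list, payload: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash in order; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each entry's hash depends on the one before it, changing any earlier payload invalidates every later hash, which is what makes the log auditable after the fact.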

What breaks without it

What breaks in production today	What CircaOS guarantees
The model returned malformed JSON in prod. Worked fine in dev. You're debugging the LLM, not your code.	Schema-locked decoding at the token level. Pass a JSON Schema, the decoder is physically constrained. Non-conforming output is impossible — not retried, prevented.
Your code stopped working two weeks ago. No one touched it. The provider rotated the model behind the same name.	The public bench runs against our live path on a published cadence. Drift shows up in the CSV the same day. Customers see the same audit we see. No "trust us" — the receipts are open.
3 requests per minute on the starter tier. Your batch job runs at 3am. You wake to angry customers at 7.	100,000 requests/month, no per-minute throttle. Burst as hard as your business needs. No tier ladder to climb before you can scale.
"Temperature zero" is best-effort. Same input, different bytes, no reproducible test runs.	Byte-identical outputs at temperature 0. Verifiable — 20 identical calls return 1 unique output. Determinism = 1.0000. Provable.
Compliance asks where the inference happens. You don't know exactly. Their counsel doesn't sign off.	Local inference, no data egress to third-party clouds. Your provenance log is hash-chained, queryable, auditable.

How the loop is built

A runtime, not a model

Open-weight models (ours, Llama, Mistral) are commodities. CircaOS is the runtime layer above them: grammar-constrained decoders, tier routing per task shape, provenance events on every call, and an open determinism bench that audits the inference path on a published cadence. The model is the CPU. CircaOS is the OS that makes it operable. The loop is what you ship against.

Drop-in for your existing chat-completions client

The API speaks the same POST /v1/chat/completions shape your current SDK already sends. Point your client at https://cogos.5ceos.com/v1 and try it. If you don't like it, change it back in ten seconds.
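If you are not using an SDK, the same request can be built with nothing but the Python standard library. This is a sketch under the assumption that the endpoint accepts the usual chat-completions body; the function name `build_chat_request` is made up for this example, and the API key is whatever you received at signup.

```python
import json
import urllib.request

BASE_URL = "https://cogos.5ceos.com/v1"  # base URL given on this page


def build_chat_request(api_key: str, model: str, user_content: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen(req)` and read the response body. Switching back to another provider is then a matter of changing `BASE_URL` and the key.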

Tier-routed by task, not by guess

Use model: "cogos-tier-b" for classification-shaped work, "cogos-tier-a" for narrative. The router picks the right size of open-weight model per shape. "Sufficient is sufficient" is the GreenOps doctrine.
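On the client side, choosing the tier can be a one-line lookup. The sketch below is an assumption about how you might organize it in your own code, not a CircaOS API: the shape labels are hypothetical strings borrowed from the task types named on this page, and only the two model names come from the text above.

```python
TIER_B = "cogos-tier-b"  # classification-shaped work
TIER_A = "cogos-tier-a"  # narrative work

# Hypothetical labels for classification-shaped tasks; adjust to your own taxonomy.
CLASSIFICATION_SHAPED = {"classification", "sentiment", "routing", "extraction", "scoring"}


def pick_model(task_shape: str) -> str:
    """Map a task-shape label to the model name to send in the request."""
    shape = task_shape.strip().lower()
    if shape in CLASSIFICATION_SHAPED:
        return TIER_B
    if shape == "narrative":
        return TIER_A
    # Fail loudly instead of silently guessing a tier for an unknown shape.
    raise ValueError(f"unknown task shape: {task_shape!r}")
```

Raising on unknown shapes keeps the choice explicit, so a new task type forces a decision about which tier it belongs on.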

Power savings, by construction

Most production LLM workloads are classification-shaped — sentiment, routing, extraction, scoring — and burning frontier-model wattage on them is just lighting money on fire. The router runs that traffic on Tier B (3B params) and reserves Tier A (7B) for narrative. Internal measurements on a representative production mix: 78% reduction in inference spend, 72% reduction in energy draw, and ~75% of all calls served by Tiny/Mid tiers — while 100% of outputs remain schema-locked and auditable. The bench publishes $/valid-output by tier so the savings are something you can audit, not something we ask you to take on faith.

About us

We're privately backed. Not VC-funded.

Which means we get to hold pricing, refuse the growth-at-all-costs playbook, and keep the substrate determinism-first — instead of optimizing for the next funding round. Your tier won't get re-priced under you, the audit trail won't become a paid add-on, the bench stays open, and the substrate stays the substrate. We get to dream and build instead of pitch and exit.

Pricing

Free

3,000 requests/mo · Tier B · schema-locked decoding · deterministic at temp=0

Free tier: 100 requests/day, 1,000 fallback tokens/day, Tier B (3B) only. No card required.

Operator Starter

$25/mo

100,000 requests/mo · Tier B · schema-locked decoding · deterministic at temp=0

100,000 schema-locked requests per month on Tier B (classification-shaped workloads).

Enterprise

$100,000/yr

50M requests/mo · dedicated GPU container · single-tenant · 99.9% SLA · SOC 2 Type II · MSA + DPA + BAA · quarterly business review · 12-month minimum

Real deals close at $100K–$250K depending on add-ons (extra GPUs, 99.95% SLA, on-prem deployment, dedicated CSM).

Try it in 30 seconds (after signup)

curl https://cogos.5ceos.com/v1/chat/completions \
  -H "Authorization: Bearer sk-cogos-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cogos-tier-b",
    "messages": [{"role":"user","content":"Capital of France?"}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "answer",
        "strict": true,
        "schema": {
          "type": "object",
          "required": ["country","capital"],
          "properties": {
            "country": {"type":"string"},
            "capital": {"type":"string"}
          }
        }
      }
    }
  }'
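To confirm on your side that what came back matches the schema you sent, a small structural check is enough for this example. The sketch assumes the message content arrives as a JSON string, as in chat-completions-style APIs; it only covers the features the schema above uses (an object with required string properties) and is not a full JSON Schema validator.

```python
import json

# The same schema as in the curl example above.
ANSWER_SCHEMA = {
    "type": "object",
    "required": ["country", "capital"],
    "properties": {
        "country": {"type": "string"},
        "capital": {"type": "string"},
    },
}


def conforms(content: str, schema: dict) -> bool:
    """Return True if `content` is a JSON object with the required string fields."""
    try:
        value = json.loads(content)
    except json.JSONDecodeError:
        return False
    if not isinstance(value, dict):
        return False
    if any(key not in value for key in schema.get("required", [])):
        return False
    for key, spec in schema.get("properties", {}).items():
        if key in value and spec.get("type") == "string" and not isinstance(value[key], str):
            return False
    return True
```

For anything beyond this toy schema, a real JSON Schema validator is the better tool; the point here is only that the check is cheap to run on every response.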

FAQ

Why should I trust you on determinism?

Don't. Clone the bench and run it. MIT-licensed, open methodology, hand-coded rubrics — every claim on this page becomes a CSV you can publish or attack.

What models?

Our own models (3B and 7B) today. Open-weight, content-addressed. New tiers (Llama 3.3, Mistral) land as discrete versioned upgrades, with no silent swaps. The bench is re-run against the live inference path so any drift is published, not hidden.

What happens at your monthly quota?

A clean 429 with X-Cogos-Quota-Reset pointing at the start of the next billing cycle. Upgrade to a higher-quota package or wait for the next cycle. Plans aren't lottery tickets: you know what you're getting.
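In client code, that usually means checking for the 429 and surfacing the reset value instead of retrying blindly. A minimal sketch follows; the function name is made up here, and because this page does not say what format the X-Cogos-Quota-Reset value takes, the sketch returns it as an opaque string.

```python
from typing import Mapping, Optional


def quota_reset_hint(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Cogos-Quota-Reset value for a 429 response, else None."""
    if status != 429:
        return None
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("x-cogos-quota-reset")
```

If the function returns a value, stop sending until that point or upgrade the plan; if it returns None on a 429, the response did not carry the header and should be treated as an unexpected error.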