The 90-second proof

Code you run, not slides you read.

Three escalating proofs. Each is copy-paste-able. Each produces a verifiable result on your own machine. By the end you'll have:

  1. Proven determinism on your own key — 20 identical calls, byte-identical responses, SHA-256 verified
  2. Watched schema-locked decoding turn messy text into clean JSON with no retry loop
  3. Measured the cost & latency against your current LLM bill

Prereq: an API key. The cheapest path is Operator Starter at $29/mo: checkout takes 60 seconds through Stripe, and you get a sk-cogos-... key on the success page. Month-to-month, cancel any time.

Or run the open bench against our substrate first — same methodology, validates everything below before you spend a dollar.

1 The determinism proof

20 identical calls. Same prompt, same schema, same model. If the substrate is what we say it is, you should get 20 byte-identical responses and 1 unique SHA-256 hash. If you don't, the bench's job is to make that falsifiable in public.

Pure bash. No Python, no virtualenv, just curl + jq + sha256sum (or shasum -a 256 on macOS).

bash
export COGOS_API_KEY=sk-cogos-YOUR_KEY_HERE

for i in {1..20}; do
  curl -s https://cogos.5ceos.com/v1/chat/completions \
    -H "Authorization: Bearer $COGOS_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "cogos-tier-b",
      "messages": [{"role":"user","content":"What is 47 times 23?"}],
      "response_format": {
        "type":"json_schema",
        "json_schema": {
          "name":"answer",
          "strict":true,
          "schema":{
            "type":"object",
            "required":["product"],
            "properties":{"product":{"type":"integer"}}
          }
        }
      }
    }' | jq -r '.choices[0].message.content' | sha256sum
done | sort -u | wc -l

Expected output: 1, meaning one unique line across 20 calls. Determinism = 1.0000. (On macOS, swap sha256sum for shasum -a 256.)
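
The same uniqueness check can be run in Python over saved responses. This is a minimal sketch, assuming you have collected the 20 response strings into a list; the sample strings below are placeholders, not real API output, and `unique_digests` is a hypothetical helper name.

```python
import hashlib

def unique_digests(responses):
    """Return the set of SHA-256 hex digests over the raw response bytes."""
    return {hashlib.sha256(r.encode("utf-8")).hexdigest() for r in responses}

# Placeholder data standing in for 20 saved responses (hypothetical content).
identical = ['{"product":1081}'] * 20
# One response differs by a single space: same value, different bytes.
drifted = identical[:19] + ['{"product": 1081}']

print(len(unique_digests(identical)))
print(len(unique_digests(drifted)))
```

Hashing the bytes rather than comparing parsed values is the stricter test: a response that differs only in whitespace still counts as a second unique output.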

Run the same script against a hosted frontier API and the same prompt typically returns 3–8 unique lines at temperature=0. The mechanism section of the whitepaper explains why.


2 Schema-locked extraction from messy text

The actual job most production LLM features are doing: turn a paragraph of human prose into a row of structured data. Without schema-locking this needs retry logic, permissive JSON parsers, fallbacks. With it, the output is the schema.

bash
curl -s https://cogos.5ceos.com/v1/chat/completions \
  -H "Authorization: Bearer $COGOS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cogos-tier-b",
    "messages": [{
      "role":"user",
      "content":"Extract company, fiscal year, and revenue (in USD millions) from this filing excerpt: Acme Industries reported Q4 results yesterday, with annual revenue of $487 million for fiscal year 2025."
    }],
    "response_format": {
      "type":"json_schema",
      "json_schema": {
        "name":"filing",
        "strict":true,
        "schema":{
          "type":"object",
          "required":["company","fiscal_year","revenue_musd"],
          "properties":{
            "company":{"type":"string"},
            "fiscal_year":{"type":"integer","minimum":1900,"maximum":2100},
            "revenue_musd":{"type":"number","minimum":0}
          }
        }
      }
    }
  }' | jq '.choices[0].message.content'

Expected output: "{\"company\":\"Acme Industries\",\"fiscal_year\":2025,\"revenue_musd\":487}"

Guaranteed: JSON parses, schema validates, types check, fiscal_year falls in [1900,2100], revenue_musd is non-negative. By construction. You did not write a retry loop.
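
To spot-check those guarantees yourself, the output can be decoded and tested against the schema's constraints by hand. Note that jq without -r prints the content as a JSON-encoded string, so it has to be decoded twice. This is a minimal standard-library sketch that checks only the constraints in this one schema; it is not a general JSON Schema validator, and `check_filing` is a hypothetical helper name.

```python
import json

def check_filing(jq_line):
    """Decode jq's JSON-string output and check this schema's constraints by hand."""
    content = json.loads(jq_line)  # outer layer: the JSON-encoded string jq printed
    row = json.loads(content)      # inner layer: the object itself
    assert set(row) >= {"company", "fiscal_year", "revenue_musd"}
    assert isinstance(row["company"], str)
    # bool is a subclass of int in Python, so exclude it explicitly.
    assert isinstance(row["fiscal_year"], int) and not isinstance(row["fiscal_year"], bool)
    assert 1900 <= row["fiscal_year"] <= 2100
    assert isinstance(row["revenue_musd"], (int, float))
    assert not isinstance(row["revenue_musd"], bool)
    assert row["revenue_musd"] >= 0
    return row

# The expected output line from above, copied verbatim.
expected = r'"{\"company\":\"Acme Industries\",\"fiscal_year\":2025,\"revenue_musd\":487}"'
print(check_filing(expected))
```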

What you didn't have to write

the loop you don't need
# With a hosted provider that doesn't enforce at the decoder:
for attempt in range(MAX_RETRIES):
    raw = upstream_llm_call(prompt)
    try:
        parsed = json.loads(strip_markdown_fences(raw))
        jsonschema.validate(parsed, my_schema)
        break
    except (json.JSONDecodeError, jsonschema.ValidationError) as e:
        log.warning(f"Attempt {attempt} produced invalid JSON: {e}")
        if attempt == MAX_RETRIES - 1:
            raise UpstreamLLMFailure(...)
        prompt = augment_with_correction_prompt(prompt, raw, e)
        time.sleep(backoff(attempt))

That whole block, with its 0.5–3% silent failure rate, doesn't exist in a CircaOS codebase. Schema-validity is 1.0000 by construction.
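
For contrast, here is roughly what the call site reduces to if the schema guarantee holds as described. This is a sketch under that assumption, not a definitive implementation: `parse_content` is a hypothetical helper, the response body is a hand-built stand-in shaped like the fields the examples above read, and network or HTTP errors still need their own handling.

```python
import json

def parse_content(body):
    """Pull the structured payload out of a chat-completions response body."""
    return json.loads(body["choices"][0]["message"]["content"])

# Hypothetical response body; only the fields used above are included.
fake_body = {"choices": [{"message": {"content": '{"product":1081}'}}]}

print(parse_content(fake_body)["product"])
```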


3 Full benchmark — determinism, latency, cost

Same 20-call experiment, but with proper measurement: SHA-256 hash count, p50/p95 latency, cost per call, comparison to a frontier-API baseline. Save as cogos_demo.py:

python3 cogos_demo.py
#!/usr/bin/env python3
"""CircaOS 90-second proof: determinism + latency + cost."""
import hashlib, json, os, statistics, sys, time, urllib.request, urllib.error

KEY = os.environ.get("COGOS_API_KEY")
if not KEY:
    sys.exit("Set COGOS_API_KEY in your environment first.")

URL = "https://cogos.5ceos.com/v1/chat/completions"
N = 20

payload = {
    "model": "cogos-tier-b",
    "messages": [{"role": "user", "content": "What is 47 times 23?"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "answer",
            "strict": True,
            "schema": {
                "type": "object",
                "required": ["product"],
                "properties": {"product": {"type": "integer"}},
            },
        },
    },
}

hashes, latencies_ms = set(), []
for i in range(N):
    t0 = time.perf_counter()
    req = urllib.request.Request(
        URL,
        method="POST",
        headers={
            "Authorization": f"Bearer {KEY}",
            "Content-Type": "application/json",
        },
        data=json.dumps(payload).encode(),
    )
    with urllib.request.urlopen(req, timeout=30) as r:
        body = json.loads(r.read())
    elapsed_ms = (time.perf_counter() - t0) * 1000
    content = body["choices"][0]["message"]["content"]
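    # Keep this call's digest in a variable: a set has no stable order,
    # so list(hashes)[-1] is not guaranteed to be the hash just added.
    digest = hashlib.sha256(content.encode()).hexdigest()
    hashes.add(digest)
    latencies_ms.append(elapsed_ms)
    print(f"  call {i+1:2d}/{N}  {elapsed_ms:6.0f}ms  hash={digest[:12]}")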

uniq = len(hashes)
det_score = 1.0 / uniq
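# Report warm latency: the first call includes the cold start and would
# otherwise dominate p95 at N=20.
warm_ms = latencies_ms[1:]
p50 = statistics.median(warm_ms)
p95 = statistics.quantiles(warm_ms, n=20)[18]  # 19 cut points; index 18 is the 95th percentile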

# Operator Pro: $99 / 500,000 requests = $0.000198/call
COGOS_COST_PER_CALL_USD = 99 / 500_000
# Frontier-API baseline (illustrative list price, mid-2026):
# $2.50/M input + $10/M output tokens, ~400 in + 100 out per call
FRONTIER_PER_CALL_USD = (400 * 2.5 + 100 * 10) / 1_000_000

print()
print(f"  N                       = {N}")
print(f"  Unique outputs          = {uniq}    (target: 1)")
print(f"  Determinism score       = {det_score:.4f}  (target: 1.0000)")
print(f"  Latency p50             = {p50:.0f}ms")
print(f"  Latency p95             = {p95:.0f}ms")
print(f"  Cost on Operator Pro    = ${N * COGOS_COST_PER_CALL_USD:.4f}")
print(f"  Frontier-API equiv list = ${N * FRONTIER_PER_CALL_USD:.4f}  ({FRONTIER_PER_CALL_USD/COGOS_COST_PER_CALL_USD:.1f}x more)")

Typical output:
  call  1/20    1872ms  hash=a3f8c91b4d20  (cold start)
  call  2/20     186ms  hash=a3f8c91b4d20
  call  3/20     174ms  hash=a3f8c91b4d20
  ...
  N                       = 20
  Unique outputs          = 1    (target: 1)
  Determinism score       = 1.0000  (target: 1.0000)
  Latency p50             = 183ms
  Latency p95             = 412ms
  Cost on Operator Pro    = $0.0040
  Frontier-API equiv list = $0.0400  (10.1x more)
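
A note on reading the latency lines: with only 20 samples, one slow call can decide the p95. The sketch below uses made-up latencies (not measurements) to show how much the percentile moves depending on whether the cold-start call is included; `p50_p95` is a hypothetical helper that uses the same statistics calls as the script.

```python
import statistics

def p50_p95(samples_ms):
    """Median and 95th percentile, computed the same way as in the script."""
    p50 = statistics.median(samples_ms)
    # quantiles(n=20) returns 19 cut points; index 18 is the 95% cut.
    p95 = statistics.quantiles(samples_ms, n=20)[18]
    return p50, p95

# Made-up latencies: one slow first call followed by 19 warm calls.
samples = [1872.0] + [180.0] * 18 + [400.0]

all_p50, all_p95 = p50_p95(samples)
warm_p50, warm_p95 = p50_p95(samples[1:])

print(f"all calls: p50={all_p50:.0f}ms  p95={all_p95:.0f}ms")
print(f"warm only: p50={warm_p50:.0f}ms  p95={warm_p95:.0f}ms")
```

The median barely moves, while the p95 over all 20 calls lands close to the single slow call. When comparing against another provider, decide up front whether the first call counts and apply the same rule to both sides.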

What you just proved

Property | Your measurement | Implication for production
Determinism | 1 unique SHA-256 across 20 calls | Test fixtures stay valid. Cache hit rates jump to ~100%. Replay is real.
Schema validity | 100% of responses validated by construction | Delete your retry loop. Delete your permissive parser. Delete your fallback path.
Latency | p50 ~180ms warm, p95 ~400ms | Inside any reasonable user-facing budget. Cold start ~7s (not great, real).
Cost | ~10× below frontier-API list | A $4K/mo frontier-API bill becomes a $400/mo CircaOS bill at the same call volume.
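
The cache claim in the determinism row follows from a simple idea: if the same request always yields the same bytes, a response stored under a hash of the request is as good as a fresh call. Below is a minimal sketch of such a request-keyed cache. The names `cache_key`, `cached_call`, and `fake_api` are hypothetical, and the stub stands in for the real HTTP call.

```python
import hashlib
import json

def cache_key(payload):
    """SHA-256 over a canonical JSON encoding, so key order and spacing don't matter."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

cache = {}

def cached_call(payload, call_api):
    """Return the stored response for a repeated request; otherwise call and store."""
    key = cache_key(payload)
    if key not in cache:
        cache[key] = call_api(payload)
    return cache[key]

calls = []

def fake_api(payload):
    """Hypothetical stub for the HTTP call; records how often it is hit."""
    calls.append(payload)
    return '{"product":1081}'

question = [{"role": "user", "content": "What is 47 times 23?"}]
a = {"model": "cogos-tier-b", "messages": question}
b = {"messages": question, "model": "cogos-tier-b"}  # same request, different key order

cached_call(a, fake_api)
cached_call(b, fake_api)
print(len(calls))
```

Canonicalizing before hashing matters here: two payloads that differ only in key order should share one cache entry, and without sort_keys they would not.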

Next

If the proof checked out:

  1. Read the technical whitepaper for the mechanism, the bench methodology, and the explicit list of things CircaOS does not do.
  2. Clone the open determinism bench, run it against your own key, compare to the latest results/ commit. Any divergence is a publishable finding.
  3. Pick a tier: $29 to $100K/yr, month-to-month, cancel any time.

Found a bug, an unsupported edge case, or a measurement that doesn't replicate? Open an issue on the bench repo or email support@5ceos.com. Technical objections are the highest-value feedback we get.