Systems

Control surfaces for production AI systems.

The public proof layer behind the practice: eval harnesses, artifact runtimes, observability loops, domain fixtures, and cost controls extracted from real work.


6: Control surfaces
2: Public systems
1: Active build

Capability map

The control surfaces around expensive AI work.

Agent Runtime

Production AI needs more than a prompt. The runtime decides what the model can see, call, stream, and modify.

MCP clients and servers, tool-call contracts, structured outputs, streaming thought/tool/token/artifact events.

Scope360
Micro Pipelines

Evaluation

Model behavior shifts. Release gates need frozen tasks, deterministic checks, and human-calibrated rubrics.

Bencher runs production-shaped tool, chart, fixture, and prompt-quality tests with runtime and LLM judges.

Bencher
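A frozen task paired with a deterministic check can be sketched in a few lines. This is a minimal illustration, not Bencher's API: `FrozenTask`, `pass_rate`, `release_gate`, and `fake_model` are hypothetical names, and the 0.9 threshold is an assumed default.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class FrozenTask:
    """A task whose prompt and check stay fixed between releases."""
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # deterministic: same output, same verdict


def pass_rate(tasks: Sequence[FrozenTask], model: Callable[[str], str]) -> float:
    """Run every frozen task through `model` and return the share that pass."""
    passed = sum(1 for task in tasks if task.check(model(task.prompt)))
    return passed / len(tasks)


def release_gate(tasks, model, threshold: float = 0.9) -> bool:
    """Allow the release only if the pass rate meets the threshold."""
    return pass_rate(tasks, model) >= threshold


# Hypothetical stand-in for a real model call.
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"


TASKS = [
    FrozenTask("arith-1", "What is 2 + 2?", lambda out: out.strip() == "4"),
    FrozenTask("arith-2", "What is 3 + 3?", lambda out: out.strip() == "6"),
]

if __name__ == "__main__":
    print(pass_rate(TASKS, fake_model))     # one of two tasks passes
    print(release_gate(TASKS, fake_model))  # gate stays closed at 0.9
```

Rubric and LLM judges would sit beside checks like these; they are left out here because their verdicts are not deterministic.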

Artifacts

Serious AI systems create objects users keep using: reports, tables, charts, configs, and scripts.

Formatter layers and artifact renderers keep generated work inspectable instead of burying it inside chat text.

Scope360
Micro Pipelines

Observability

Teams need to know which step failed, which tool was called, and where cost or latency compounded.

Event streams, traces, budget caps, prompt versions, and route-level measurement.

Bencher
Profiler

Cost + Latency

Repeated context, poor routing, and unnecessary frontier calls turn small mistakes into operating expense.

Prompt caching, provider routing, context windowing, compaction, and per-route cost controls.

Real Estate Agent Audit
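Provider routing can be as small as "pick the cheapest route that is capable enough and fits the context". The sketch below assumes a hypothetical `Route` record, an invented tier scale, and illustrative prices; none of these come from a real provider or from the audit above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    tier: int                 # 0 = cheapest, higher = more capable (assumed scale)
    usd_per_1k_tokens: float  # illustrative prices, not real provider rates
    max_context_tokens: int


ROUTES = [
    Route("small-local", 0, 0.0002, 8_000),
    Route("mid-hosted", 1, 0.003, 32_000),
    Route("frontier", 2, 0.03, 200_000),
]


def pick_route(prompt_tokens: int, min_tier: int, routes=ROUTES) -> Route:
    """Return the cheapest route that meets the tier and fits the prompt."""
    fitting = [
        r for r in routes
        if r.tier >= min_tier and r.max_context_tokens >= prompt_tokens
    ]
    if not fitting:
        raise ValueError("no route fits this prompt; compact the context first")
    return min(fitting, key=lambda r: r.usd_per_1k_tokens)


def estimated_cost_usd(route: Route, prompt_tokens: int) -> float:
    """Rough input-side cost estimate for one call on this route."""
    return route.usd_per_1k_tokens * prompt_tokens / 1000


if __name__ == "__main__":
    route = pick_route(prompt_tokens=20_000, min_tier=0)
    print(route.name, estimated_cost_usd(route, 20_000))
```

A frontier call happens only when the caller asks for `min_tier=2` or nothing cheaper fits, which makes unnecessary frontier calls visible at the call site.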

Domain Models

Some domains require smaller, controlled models with the right training data and evaluation loop.

Fine-tuning pipelines, domain fixtures, research review, and mismatch detection before downstream handoff.

Pharmacology R&D Model

Public system artifacts

Line art of Bencher running model evaluations through a filtering sieve.

Released

LLM benchmark
Python
Tool calling
Charts

Benchmark Harness / Bencher

An open-source benchmark for real orchestrator work: tool calls, chart building, domain fixtures, prompt quality levels, and runtime plus LLM judges.

Python package - Published Apr 2026


Line art of Micro Pipelines being extracted from a parent project into an open-source package.

Released

OSS package
Python
Pipelines
Artifacts

Micro Pipelines

A small open-source package extracted from Scope AI: portable LLM pipelines for HTML artifacts, media analysis, and web search with minimal infrastructure around them.

Python package - Published May 2026


Looping workshop process visual for Profiler in active development.

In development

User modeling
Python SDK
Evidence
Profiles

User Modeling SDK / Profiler

An open-source Python SDK for evidence-based user modeling in LLM apps. It turns chat history into structured profiles with evidence, grades, confidence, and follow-up questions.

Python SDK - Active development

Operating stack

The infrastructure choices behind the control surfaces.

Agent layer

The runtime around the model: what it sees, what it can call, how it stays cheap, and how it streams back.

MCP servers & clients
Tool calling with JSON-schema contracts
Structured outputs
Prompt caching
Hybrid retrieval
Context windowing & compaction
Streaming
Artifact renderers
Computer use / browser agents
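A tool-call contract is a schema the runtime enforces before a tool runs. The sketch below checks model-produced arguments against a small hand-rolled subset of JSON Schema (`required`, `properties`, `additionalProperties`, basic types). The `get_listing` tool and `validate_args` helper are hypothetical; a production runtime would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical contract for a `get_listing` tool (subset of JSON Schema).
GET_LISTING_CONTRACT = {
    "type": "object",
    "required": ["listing_id"],
    "properties": {
        "listing_id": {"type": "string"},
        "include_photos": {"type": "boolean"},
    },
    "additionalProperties": False,
}

_PY_TYPES = {
    "string": str,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
    "object": dict,
    "array": list,
}


def validate_args(raw_json: str, contract: dict) -> list:
    """Return a list of contract violations; an empty list means valid."""
    try:
        args = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    errors = []
    props = contract.get("properties", {})
    for key in contract.get("required", []):
        if key not in args:
            errors.append(f"missing required field: {key}")
    for key, value in args.items():
        if key not in props:
            if contract.get("additionalProperties") is False:
                errors.append(f"unexpected field: {key}")
            continue
        type_name = props[key]["type"]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        wrong_type = not isinstance(value, _PY_TYPES[type_name]) or (
            isinstance(value, bool) and type_name != "boolean"
        )
        if wrong_type:
            errors.append(f"{key}: expected {type_name}")
    return errors


if __name__ == "__main__":
    print(validate_args('{"listing_id": "A1"}', GET_LISTING_CONTRACT))
    print(validate_args('{"listing_id": 7, "zip": "x"}', GET_LISTING_CONTRACT))
```

Returning violations as data, instead of raising, lets the runtime feed them back to the model as a repair prompt or log them as an event.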

Orchestration

How agent steps become runnable graphs: routing, retries, parallelism, and durable execution for work that can't be re-run for free.

LangGraph
Custom linear runners
Router -> executor split
Tool-loop with budget caps
Temporal
BullMQ / Redis queues
Idempotency keys & retry policies
LangChain
CrewAI / AutoGen
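A tool loop with budget caps, as listed above, can be sketched without any framework. `Budget`, `run_tool_loop`, and the `(done, cost_usd)` step signature are assumptions made for illustration, not the interface of LangGraph, Temporal, or any other tool named here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Budget:
    max_steps: int
    max_cost_usd: float
    steps: int = 0
    cost_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one completed step and its cost."""
        self.steps += 1
        self.cost_usd += cost_usd

    def exhausted(self) -> bool:
        return self.steps >= self.max_steps or self.cost_usd >= self.max_cost_usd


def run_tool_loop(step_fn: Callable[[], Tuple[bool, float]], budget: Budget) -> str:
    """Call step_fn until it reports done or the budget runs out.

    step_fn returns (done, cost_usd) for one model-plus-tool step.
    """
    while not budget.exhausted():
        done, cost_usd = step_fn()
        budget.charge(cost_usd)
        if done:
            return "done"
    return "budget_exceeded"


if __name__ == "__main__":
    # Hypothetical step that never finishes and costs $0.25 each time.
    budget = Budget(max_steps=10, max_cost_usd=1.0)
    print(run_tool_loop(lambda: (False, 0.25), budget), budget.steps)
```

The cap is checked before each step, so the final step can overshoot the cost limit by at most one step's cost; a stricter loop would estimate the cost before calling.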

ML Ops

Traces, prompt versions, datasets, costs, evals, releases - the operating loop that lets us change a prompt without breaking a product.

LangSmith / Langfuse
OpenTelemetry spans across the agent loop
Prompt registry & versioning
Dataset snapshots & golden sets
LLM-as-judge + rubric judges
Regression gates in CI
Cost & latency dashboards per route
Canary deploys for prompts and tools
Fine-tuning pipelines
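A regression gate in CI reduces to comparing candidate scores against a stored baseline and failing the build on a meaningful drop. The sketch below assumes per-suite scores between 0 and 1 and a 0.02 tolerance; the suite names, `find_regressions`, and `gate_exit_code` are hypothetical.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return {suite: drop} for suites whose score fell by more than tolerance.

    A suite missing from the candidate run is scored 0.0, so it always fails.
    """
    regressions = {}
    for suite, base_score in baseline.items():
        drop = base_score - candidate.get(suite, 0.0)
        if drop > tolerance:
            regressions[suite] = round(drop, 4)
    return regressions


def gate_exit_code(baseline: dict, candidate: dict, tolerance: float = 0.02) -> int:
    """Exit code for a CI step: 0 passes, 1 blocks the release."""
    return 1 if find_regressions(baseline, candidate, tolerance) else 0


if __name__ == "__main__":
    # Illustrative scores only, e.g. loaded from a golden-set snapshot.
    baseline = {"tool_calls": 0.92, "charts": 0.80}
    candidate = {"tool_calls": 0.91, "charts": 0.70}
    print(find_regressions(baseline, candidate))
    raise SystemExit(gate_exit_code(baseline, candidate))
```

The tolerance absorbs judge noise; setting it per suite would be a natural extension when some suites are deterministic and others are LLM-judged.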

Models

Provider and local models picked by task shape - tool reliability, latency, context length, cost, and a path to fine-tuning when the domain needs it.

Claude / OpenAI / Gemini
Qwen
MiniMax
GLM
Grok
Own fine-tuned models
Embeddings

Bring the system before it becomes expensive.

We review agent runtimes, eval harnesses, artifact flows, and domain models when the risk is concrete enough to deserve serious architecture.