Systems

Control surfaces for production AI systems.

The public proof layer behind the practice: eval harnesses, artifact runtimes, observability loops, domain fixtures, and cost controls extracted from real work.

6 Control surfaces
2 Public systems
1 Active build

Capability map

The control surfaces around expensive AI work.

Agent Runtime

Production AI needs more than a prompt. The runtime decides what the model can see, call, stream, and modify.

MCP clients and servers, tool-call contracts, structured outputs, streaming thought/tool/token/artifact events.
  • Scope360
  • Micro Pipelines
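
A tool-call contract of the kind described above can be sketched as a schema check that runs before any tool executes. This is a minimal illustration, not the runtime used in Scope360 or Micro Pipelines: the `search` tool, `SEARCH_CONTRACT`, `validate_args`, and `call_tool` are hypothetical names, and the validator covers only required fields and primitive types rather than full JSON Schema.

```python
from typing import Any, Callable

# Hypothetical contract: a JSON-schema-style description of one tool's arguments.
SEARCH_CONTRACT = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer"},
    },
    "required": ["query"],
}

# Mapping from JSON-schema primitive names to Python types (subset only).
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_args(contract: dict, args: dict) -> list[str]:
    """Return contract violations; an empty list means the call may proceed."""
    errors = []
    for name in contract.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    props = contract.get("properties", {})
    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected argument: {name}")
            continue
        type_name = props[name]["type"]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_numeric = type_name in ("integer", "number")
        if (is_numeric and isinstance(value, bool)) or not isinstance(
            value, _JSON_TYPES[type_name]
        ):
            errors.append(f"{name}: expected {type_name}")
    return errors


def call_tool(tool: Callable[..., Any], contract: dict, args: dict) -> dict:
    """Run the tool only if the model-supplied arguments satisfy the contract."""
    errors = validate_args(contract, args)
    if errors:
        return {"ok": False, "errors": errors}
    return {"ok": True, "result": tool(**args)}


def search(query: str, limit: int = 5) -> list[str]:
    """Stand-in tool used only for this example."""
    return [f"{query}-{i}" for i in range(limit)]
```

Returning the violations as data, instead of raising, lets the runtime feed them back to the model as a structured error it can correct on the next turn.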

Evaluation

Model behavior shifts. Release gates need frozen tasks, deterministic checks, and human-calibrated rubrics.

Bencher runs production-shaped tool, chart, fixture, and prompt-quality tests with runtime and LLM judges.
  • Bencher
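
A release gate built on frozen tasks and deterministic checks might look roughly like the sketch below. It is not Bencher's API: `Task`, `run_gate`, `FROZEN_TASKS`, and `stub_model` are hypothetical names, and the model is a stub so the example runs without any provider.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """One frozen evaluation task with a deterministic pass/fail check."""
    task_id: str
    prompt: str
    check: Callable[[str], bool]


def run_gate(
    tasks: list[Task],
    model: Callable[[str], str],
    min_pass_rate: float = 1.0,
) -> tuple[bool, dict[str, bool]]:
    """Run every task against the model and decide whether the release may ship."""
    if not tasks:
        raise ValueError("a release gate needs at least one task")
    results = {t.task_id: bool(t.check(model(t.prompt))) for t in tasks}
    pass_rate = sum(results.values()) / len(results)
    return pass_rate >= min_pass_rate, results


# Hypothetical frozen task set; in practice this would be versioned with the code.
FROZEN_TASKS = [
    Task("add", "What is 2 + 2?", lambda out: out.strip() == "4"),
    Task("capital", "Capital of France?", lambda out: "paris" in out.lower()),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "4" if "2 + 2" in prompt else "unknown"
```

Per-task results are returned alongside the verdict so a failed gate points at the specific task that regressed. Rubric-based or LLM judges would sit next to these deterministic checks rather than replace them.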

Artifacts

Serious AI systems create objects users keep using: reports, tables, charts, configs, and scripts.

Formatter layers and artifact renderers keep generated work inspectable instead of burying it inside chat text.
  • Scope360
  • Micro Pipelines

Observability

Teams need to know which step failed, which tool was called, and where cost or latency compounded.

Event streams, traces, budget caps, prompt versions, and route-level measurement.
  • Bencher
  • Profiler
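
To make the questions above concrete (which step failed, which tool was called, where cost compounded), here is a minimal in-memory trace sketch. `StepEvent` and `Trace` are hypothetical names for illustration; a production setup would more likely export spans to a tracing backend than keep a Python list.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StepEvent:
    """One recorded step of an agent run."""
    route: str
    step: str
    tool: Optional[str]
    cost_usd: float
    latency_ms: float
    ok: bool = True


@dataclass
class Trace:
    events: list[StepEvent] = field(default_factory=list)

    def record(self, event: StepEvent) -> None:
        self.events.append(event)

    def first_failure(self) -> Optional[StepEvent]:
        """Return the earliest failed step, or None if every step succeeded."""
        return next((e for e in self.events if not e.ok), None)

    def totals_by_route(self) -> dict[str, dict[str, float]]:
        """Sum cost and latency per route to show where spend accumulates."""
        totals: dict[str, dict[str, float]] = defaultdict(
            lambda: {"cost_usd": 0.0, "latency_ms": 0.0}
        )
        for e in self.events:
            totals[e.route]["cost_usd"] += e.cost_usd
            totals[e.route]["latency_ms"] += e.latency_ms
        return dict(totals)
```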

Cost + Latency

Repeated context, poor routing, and unnecessary frontier calls turn small mistakes into operating expense.

Prompt caching, provider routing, context windowing, compaction, and per-route cost controls.
  • Real Estate Agent Audit
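
Context windowing, one of the controls listed above, can be sketched as keeping the system message plus the newest turns that fit a token budget. The names `estimate_tokens` and `window_messages` are hypothetical, and the whitespace word count is a deliberately crude stand-in for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Assumption: whitespace word count as a rough token estimate."""
    return len(text.split())


def window_messages(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep a leading system message plus the most recent messages that fit."""
    # Only a system message in the first position is pinned.
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept: list[dict] = []
    # Walk backwards so the newest turns are kept first.
    for message in reversed(rest):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break  # stop at the first turn that does not fit, keeping history contiguous
        kept.append(message)
        budget -= cost
    return system + kept[::-1]
```

Stopping at the first message that does not fit, rather than skipping it and continuing, keeps the retained history contiguous. Compaction would instead replace the dropped turns with a summary.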

Domain Models

Some domains require smaller, controlled models with the right training data and evaluation loop.

Fine-tuning pipelines, domain fixtures, research review, and mismatch detection before downstream handoff.
  • Pharmacology R&D Model

Public system artifacts


Released

  • LLM benchmark
  • Python
  • Tool calling
  • Charts

Benchmark Harness / Bencher

An open-source benchmark for real orchestrator work: tool calls, chart building, domain fixtures, prompt quality levels, and runtime plus LLM judges.

Python package - Published Apr 2026


Released

  • OSS package
  • Python
  • Pipelines
  • Artifacts

Micro Pipelines

A small open-source package extracted from Scope AI: portable LLM pipelines for HTML artifacts, media analysis, and web search with minimal infrastructure around them.

Python package - Published May 2026


In development

  • User modeling
  • Python SDK
  • Evidence
  • Profiles

User Modeling SDK / Profiler

An open-source Python SDK for evidence-based user modeling in LLM apps. It turns chat history into structured profiles with evidence, grades, confidence, and follow-up questions.

Python SDK - Active development

Operating stack

The infrastructure choices behind the control surfaces.

Agent layer

The runtime around the model: what it sees, what it can call, how it stays cheap, and how it streams back.

  • MCP servers & clients
  • Tool calling with JSON-schema contracts
  • Structured outputs
  • Prompt caching
  • Hybrid retrieval
  • Context windowing & compaction
  • Streaming
  • Artifact renderers
  • Computer use / browser agents

Orchestration

How agent steps become runnable graphs: routing, retries, parallelism, and durable execution for work that can't be re-run for free.

  • LangGraph
  • Custom linear runners
  • Router -> executor split
  • Tool-loop with budget caps
  • Temporal
  • BullMQ / Redis queues
  • Idempotency keys & retry policies
  • LangChain
  • CrewAI / AutoGen
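
For work that can't be re-run for free, idempotency keys and retry policies usually go together. The sketch below shows one way to combine them, under stated assumptions: `idempotency_key` and `run_once` are hypothetical helpers, and the store is a plain dict standing in for Redis or a database table.

```python
import hashlib
import json
import time
from typing import Any, Callable


def idempotency_key(step: str, payload: dict) -> str:
    """Derive a stable key from the step name and its JSON-serializable payload."""
    body = json.dumps({"step": step, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def run_once(
    store: dict,
    step: str,
    payload: dict,
    fn: Callable[[dict], Any],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> Any:
    """Run fn at most once per (step, payload), retrying transient failures."""
    key = idempotency_key(step, payload)
    if key in store:
        return store[key]  # already completed: reuse the result, skip the spend

    last_error = None
    for attempt in range(max_attempts):
        try:
            result = fn(payload)
        except Exception as exc:  # a real policy would retry only known transient errors
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        else:
            store[key] = result
            return result
    raise RuntimeError(f"{step} failed after {max_attempts} attempts") from last_error
```

Because the key depends only on the step name and payload, a retried or replayed graph run reuses the stored result instead of paying for the model call again.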

ML Ops

Traces, prompt versions, datasets, costs, evals, releases - the operating loop that lets us change a prompt without breaking a product.

  • LangSmith / Langfuse
  • OpenTelemetry spans across the agent loop
  • Prompt registry & versioning
  • Dataset snapshots & golden sets
  • LLM-as-judge + rubric judges
  • Regression gates in CI
  • Cost & latency dashboards per route
  • Canary deploys for prompts and tools
  • Fine-tuning pipelines

Models

Provider and local models picked by task shape - tool reliability, latency, context length, cost, and a path to fine-tuning when the domain needs it.

  • Claude / OpenAI / Gemini
  • Qwen
  • MiniMax
  • GLM
  • Grok
  • Own fine-tuned models
  • Embeddings

Bring the system before it becomes expensive.

We review agent runtimes, eval harnesses, artifact flows, and domain models when the risk is concrete enough to deserve serious architecture.