Tools we build for ourselves first.

The Lab is where Factory tools become public: benchmarks, profiling SDKs, and small infrastructure pieces that first proved useful in our own AI systems.


Open source projects

Line art of Bencher running model evaluations through a filtering sieve.

Released

  • LLM benchmark
  • Python
  • Tool calling
  • Charts

Benchmark Harness / Bencher

An open-source benchmark for real orchestrator work: tool calls, chart building, domain fixtures, prompt quality levels, and both runtime and LLM judges.

Python package - Published Apr 2026
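Bencher's actual scoring rubric isn't reproduced here; as a hypothetical sketch, a runtime judge that checks tool-call coverage against a fixture might look like this (all names are illustrative, not the published API):

```python
from dataclasses import dataclass, field

@dataclass
class BenchCase:
    """One benchmark fixture: a prompt plus the tool calls we expect."""
    prompt: str
    expected_tools: list = field(default_factory=list)

def score_run(case: BenchCase, observed_tools: list) -> float:
    """Runtime judge: fraction of expected tool calls the agent actually made."""
    if not case.expected_tools:
        return 1.0
    hits = sum(1 for tool in case.expected_tools if tool in observed_tools)
    return hits / len(case.expected_tools)

case = BenchCase("Plot monthly revenue", expected_tools=["query_db", "render_chart"])
print(score_run(case, ["query_db", "render_chart"]))  # 1.0
print(score_run(case, ["query_db"]))                  # 0.5
```

An LLM judge would sit alongside this, grading answer quality where a deterministic check can't.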

In development

  • User modeling
  • Python SDK
  • Evidence
  • Profiles

User Modeling SDK / Profiler

An open-source Python SDK for evidence-based user modeling in LLM apps. It turns chat history into structured profiles with evidence, grades, confidence, and follow-up questions.

Python SDK - Active development
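The core idea behind Profiler — traits backed by evidence, graded, and weighted by confidence — can be sketched roughly like this (a minimal illustration with made-up names, not the SDK's real types):

```python
from dataclasses import dataclass, field

@dataclass
class Trait:
    """One claim about the user, with the evidence that supports it."""
    name: str
    grade: str          # coarse quality grade, e.g. "A".."D"
    confidence: float   # 0..1, how sure the model is
    evidence: list = field(default_factory=list)  # chat excerpts

def needs_follow_up(trait: Trait, threshold: float = 0.6) -> bool:
    """Low-confidence traits become follow-up questions, not stated facts."""
    return trait.confidence < threshold

trait = Trait(
    name="prefers concise answers",
    grade="B",
    confidence=0.45,
    evidence=["'keep it short please' (message 12)"],
)
print(needs_follow_up(trait))  # True -> ask the user instead of assuming
```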

Working stack

Technologies we use for development and research.

Agent layer

The runtime around the model: what it sees, what it can call, how it stays cheap, and how it streams back.

  • MCP servers & clients
  • Tool calling with JSON-schema contracts
  • Structured outputs
  • Prompt caching
  • Hybrid retrieval
  • Context windowing & compaction
  • Streaming
  • Artifact renderers
  • Computer use / browser agents
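"Tool calling with JSON-schema contracts" means the runtime validates model-emitted arguments against a schema before executing anything. A minimal sketch of that contract check (the tool and helper names are hypothetical):

```python
import json

# A JSON-schema contract for one tool; the model must emit matching arguments.
GET_WEATHER = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def validate_call(tool: dict, raw_args: str) -> dict:
    """Minimal contract check before the runtime executes the tool call."""
    args = json.loads(raw_args)
    schema = tool["parameters"]
    missing = [key for key in schema["required"] if key not in args]
    if missing:
        raise ValueError(f"missing required args: {missing}")
    return args

print(validate_call(GET_WEATHER, '{"city": "Oslo"}'))  # {'city': 'Oslo'}
```

In production a full validator (e.g. the `jsonschema` package) replaces the hand-rolled required-key check.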

Orchestration

How agent steps become runnable graphs: routing, retries, parallelism, and durable execution for work that can't be re-run for free.

  • LangGraph
  • Custom linear runners
  • Router -> executor split
  • Tool-loop with budget caps
  • Temporal
  • BullMQ / Redis queues
  • Idempotency keys & retry policies
  • LangChain
  • CrewAI / AutoGen
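The "tool-loop with budget caps" item is the simplest of these patterns: drive the agent's act/observe loop, but bound it so a confused model can't burn tokens forever. A hypothetical sketch:

```python
def run_tool_loop(step, max_steps: int = 5):
    """Drive an agent's tool loop until it finishes or the budget is spent.

    `step(i)` runs one model turn and returns (done, result). The cap keeps
    runaway loops cheap; the caller decides how to degrade on exhaustion.
    """
    for i in range(max_steps):
        done, result = step(i)
        if done:
            return result
    return None  # budget exhausted

# Fake step that "finds the answer" on its third turn.
def fake_step(i):
    return (i == 2, "answer")

print(run_tool_loop(fake_step))               # "answer"
print(run_tool_loop(fake_step, max_steps=2))  # None (cap hit first)
```

Real runners also cap tokens and wall-clock time, not just step count.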

ML Ops

Traces, prompt versions, datasets, costs, evals, releases - the operating loop that lets us change a prompt without breaking a product.

  • LangSmith / Langfuse
  • OpenTelemetry spans across the agent loop
  • Prompt registry & versioning
  • Dataset snapshots & golden sets
  • LLM-as-judge + rubric judges
  • Regression gates in CI
  • Cost & latency dashboards per route
  • Canary deploys for prompts and tools
  • Fine-tuning pipelines
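Judge scores and regression gates meet in CI: a prompt change ships only if its eval average stays within tolerance of the recorded baseline. A minimal sketch of that gate (names and thresholds are illustrative):

```python
def regression_gate(scores: list, baseline: float, tolerance: float = 0.02) -> bool:
    """CI gate: pass only if the eval average stays within tolerance of baseline.

    `scores` are per-case judge scores (0..1) from the golden set.
    """
    average = sum(scores) / len(scores)
    return average >= baseline - tolerance

print(regression_gate([0.91, 0.88, 0.93], baseline=0.90))  # True  (ship it)
print(regression_gate([0.70, 0.72], baseline=0.90))        # False (block the merge)
```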

Models

Provider and local models picked by task shape - tool reliability, latency, context length, cost, and a path to fine-tuning when the domain needs it.

  • Claude / OpenAI / Gemini
  • Qwen
  • MiniMax
  • GLM
  • Grok
  • Own fine-tuned models
  • Embeddings
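"Picked by task shape" amounts to a routing function over task features. A hypothetical sketch (the model names are placeholders, not our live routing config):

```python
def pick_model(task: dict) -> str:
    """Route by task shape: tool-heavy work to a reliable tool caller,
    long documents to a long-context model, everything else to the
    cheap default. Placeholder names, not real deployments."""
    if task.get("needs_tools"):
        return "frontier-tool-caller"
    if task.get("context_tokens", 0) > 100_000:
        return "long-context-model"
    return "cheap-default"

print(pick_model({"needs_tools": True}))          # "frontier-tool-caller"
print(pick_model({"context_tokens": 250_000}))    # "long-context-model"
print(pick_model({}))                             # "cheap-default"
```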

Build with us,
or learn alongside us.

The Factory ships for its own practice first, publishes what becomes useful, and helps outside teams when the problem fits our craft.