Tools we build for ourselves first.

The Lab is where Factory tools become public: benchmarks, profiling SDKs, and small infrastructure pieces that first proved useful in our own AI systems.


Open source projects

Line art of Bencher running model evaluations through a filtering sieve.

Released

  • LLM benchmark
  • Python
  • Tool calling
  • Charts

Benchmark Harness / Bencher

An open-source benchmark for real orchestrator work: tool calls, chart building, domain fixtures, prompt quality levels, and both runtime and LLM judges.

Python package - Published Apr 2026
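Bencher's actual scoring rubric isn't reproduced here; as a hypothetical sketch, a runtime judge that checks tool-call coverage against a fixture might look like this (all names are illustrative, not the published API):

```python
from dataclasses import dataclass, field

@dataclass
class BenchCase:
    """One benchmark fixture: a prompt plus the tool calls we expect."""
    prompt: str
    expected_tools: list = field(default_factory=list)

def score_run(case: BenchCase, observed_tools: list) -> float:
    """Runtime judge: fraction of expected tool calls the agent actually made."""
    if not case.expected_tools:
        return 1.0
    hits = sum(1 for tool in case.expected_tools if tool in observed_tools)
    return hits / len(case.expected_tools)

case = BenchCase("Plot monthly revenue", expected_tools=["query_db", "render_chart"])
print(score_run(case, ["query_db", "render_chart"]))  # 1.0
print(score_run(case, ["query_db"]))                  # 0.5
```

An LLM judge would sit alongside this, grading answer quality where a deterministic check can't.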

In development

  • User modeling
  • Python SDK
  • Evidence
  • Profiles

User Modeling SDK / Profiler

An open-source Python SDK for evidence-based user modeling in LLM apps. It turns chat history into structured profiles with evidence, grades, confidence, and follow-up questions.

Python SDK - Active development
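The core idea behind Profiler — traits backed by evidence, graded, and weighted by confidence — can be sketched roughly like this (a minimal illustration with made-up names, not the SDK's real types):

```python
from dataclasses import dataclass, field

@dataclass
class Trait:
    """One claim about the user, with the evidence that supports it."""
    name: str
    grade: str          # coarse quality grade, e.g. "A".."D"
    confidence: float   # 0..1, how sure the model is
    evidence: list = field(default_factory=list)  # chat excerpts

def needs_follow_up(trait: Trait, threshold: float = 0.6) -> bool:
    """Low-confidence traits become follow-up questions, not stated facts."""
    return trait.confidence < threshold

trait = Trait(
    name="prefers concise answers",
    grade="B",
    confidence=0.45,
    evidence=["'keep it short please' (message 12)"],
)
print(needs_follow_up(trait))  # True -> ask the user instead of assuming
```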

Working stack

Technologies we use for development and research.

Agent layer

The runtime around the model: what it sees, what it can call, how it stays cheap, and how it streams back.

  • MCP servers & clients
  • Tool calling with JSON-schema contracts
  • Structured outputs
  • Prompt caching
  • Hybrid retrieval
  • Context windowing & compaction
  • Streaming
  • Artifact renderers
  • Computer use / browser agents
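"Tool calling with JSON-schema contracts" means the runtime validates model-emitted arguments against a schema before executing anything. A minimal sketch of that contract check (the tool and helper names are hypothetical):

```python
import json

# A JSON-schema contract for one tool; the model must emit matching arguments.
GET_WEATHER = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def validate_call(tool: dict, raw_args: str) -> dict:
    """Minimal contract check before the runtime executes the tool call."""
    args = json.loads(raw_args)
    schema = tool["parameters"]
    missing = [key for key in schema["required"] if key not in args]
    if missing:
        raise ValueError(f"missing required args: {missing}")
    return args

print(validate_call(GET_WEATHER, '{"city": "Oslo"}'))  # {'city': 'Oslo'}
```

In production a full validator (e.g. the `jsonschema` package) replaces the hand-rolled required-key check.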

Orchestration

How agent steps become runnable graphs: routing, retries, parallelism, and durable execution for work that can't be re-run for free.

  • LangGraph
  • Custom linear runners
  • Router -> executor split
  • Tool-loop with budget caps
  • Temporal
  • BullMQ / Redis queues
  • Idempotency keys & retry policies
  • LangChain
  • CrewAI / AutoGen
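The "tool-loop with budget caps" item is the simplest of these patterns: drive the agent's act/observe loop, but bound it so a confused model can't burn tokens forever. A hypothetical sketch:

```python
def run_tool_loop(step, max_steps: int = 5):
    """Drive an agent's tool loop until it finishes or the budget is spent.

    `step(i)` runs one model turn and returns (done, result). The cap keeps
    runaway loops cheap; the caller decides how to degrade on exhaustion.
    """
    for i in range(max_steps):
        done, result = step(i)
        if done:
            return result
    return None  # budget exhausted

# Fake step that "finds the answer" on its third turn.
def fake_step(i):
    return (i == 2, "answer")

print(run_tool_loop(fake_step))               # "answer"
print(run_tool_loop(fake_step, max_steps=2))  # None (cap hit first)
```

Real runners also cap tokens and wall-clock time, not just step count.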

ML Ops

Traces, prompt versions, datasets, costs, evals, releases - the operating loop that lets us change a prompt without breaking a product.

  • LangSmith / Langfuse
  • OpenTelemetry spans across the agent loop
  • Prompt registry & versioning
  • Dataset snapshots & golden sets
  • LLM-as-judge + rubric judges
  • Regression gates in CI
  • Cost & latency dashboards per route
  • Canary deploys for prompts and tools
  • Fine-tuning pipelines
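Judge scores and regression gates meet in CI: a prompt change ships only if its eval average stays within tolerance of the recorded baseline. A minimal sketch of that gate (names and thresholds are illustrative):

```python
def regression_gate(scores: list, baseline: float, tolerance: float = 0.02) -> bool:
    """CI gate: pass only if the eval average stays within tolerance of baseline.

    `scores` are per-case judge scores (0..1) from the golden set.
    """
    average = sum(scores) / len(scores)
    return average >= baseline - tolerance

print(regression_gate([0.91, 0.88, 0.93], baseline=0.90))  # True  (ship it)
print(regression_gate([0.70, 0.72], baseline=0.90))        # False (block the merge)
```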

Models

Provider and local models picked by task shape - tool reliability, latency, context length, cost, and a path to fine-tuning when the domain needs it.

  • Claude / OpenAI / Gemini
  • Qwen
  • MiniMax
  • GLM
  • Grok
  • Own fine-tuned models
  • Embeddings
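"Picked by task shape" amounts to a routing function over task features. A hypothetical sketch (the model names are placeholders, not our live routing config):

```python
def pick_model(task: dict) -> str:
    """Route by task shape: tool-heavy work to a reliable tool caller,
    long documents to a long-context model, everything else to the
    cheap default. Placeholder names, not real deployments."""
    if task.get("needs_tools"):
        return "frontier-tool-caller"
    if task.get("context_tokens", 0) > 100_000:
        return "long-context-model"
    return "cheap-default"

print(pick_model({"needs_tools": True}))          # "frontier-tool-caller"
print(pick_model({"context_tokens": 250_000}))    # "long-context-model"
print(pick_model({}))                             # "cheap-default"
```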

Build with us,
or learn alongside us.

The Factory ships for its own practice first, publishes what becomes useful, and helps outside teams when the problem fits our craft.