woollama build log (slice-by-slice)¶

The running, chronological record of how woollama was built — one section per slice, each with the decisions, the load-bearing findings, and what was verified. For current status and what's next, see roadmap.md; this file is the detailed history behind it.

Historical note. These slices are the Python prototype (slices e–k). woollama has since cut over to the Rust daemon woollamad (published to crates.io + PyPI, v0.5.x). So the Python commands recorded below map to today's binary: woollama mcp → woollamad mcp, python -m woollama → woollamad. The entries are kept verbatim as the build record; for how to run woollama now see getting-started.md.

(Started life as the slice (e) handoff note; it now spans slices e → k. The original slice (e) plan is preserved verbatim below, followed by each follow-on.)

Index¶

Slice	What	Commit
(e)	woollama as an MCP server (stdio) — prompts + chat verb	`9a19be0`
(f)	MCP aggregator (tool re-export) + routing demo + allow-list boundary	`dd53122`
(h)	MCP over Streamable HTTP, mounted on one port	`58addb2`
(i)	Claude Code as a (tool-less) inference backend	`1fc0a24`
(j)	OpenAI-compat inferencer seam — Anthropic	`35c2dc0`
(k)	the rest of the providers — cloud built-ins + config-file inferencers	`5b2996e`

(Slices a–d — the OpenAI router, config files, multi-MCP discovery, test backfill — predate this log; see git log. There is no slice (g): the lettering jumped f→h during the session and we kept it faithful to the commits.)

Slice (e) — woollama as an MCP server (handoff / pickup note)¶

Status: DONE (2026-06-01). Implemented + tested. (Counts below are as-of that slice; the suite has grown since — see the per-slice "Verified" lines and roadmap.md for the current totals.)

Where this came from (session recap, 2026-05-31 → 06-01)¶

woollama is the graduation of a throwaway router probe (originally /tmp/router_probe/, now obsolete — do not look for it). The architecture was co-designed in the sibling cosmic-fabric repo; woollama is the actual router implementation. cosmic-fabric remains as the COSMIC frontend client and its design docs are now stale (still cite /tmp, still read the naming question as open — it's settled: woollama).
Mount move gotcha: the project moved /mnt/fast/... → /srv/physical/fast/... (also reachable as /home/teague/Projects/woollama via bind mount). This broke the .venv editable install (the _editable_impl_woollama.pth pointed at the dead /mnt/fast path). Fix already applied: uv sync --extra dev repointed it. All 43 tests pass. If imports break again as "unknown location", re-run uv sync --extra dev.
Dev deps live in the dev optional-dependency extra (not a dependency-groups table). Run tests with uv run --extra dev pytest.

Slices so far¶

78580a3 test suite: backfill manager + router + live integration; 19 → 43 tests
fa4d766 slice (c): multi-MCP-server discovery, long-lived connections, namespacing
fc05a04 slice (b): real config files — mcp.json + recipes.toml
1ef2c50 woollama v0.1.0 — Python prototype of the MCP + OpenAI router

Slice (d) was folded in; (e) is next.

What slice (e) is¶

Project woollama's inbound OpenAI surface onto an outbound MCP surface so MCP clients (Claude Desktop, the cosmic-fabric panel) can drive it natively. This unlocks the panel-as-MCP-client direction.

woollama-as-MCP-server exposes: - recipes → MCP prompts (prompts/list, prompts/get returns the rendered system message) - an orchestration verb → MCP tool (tools/list, tools/call) - capability negotiation on initialize

Decisions locked (user, 2026-06-01)¶

Transport: stdio first. Subprocess + stdio — matches what the panel and Claude Desktop need locally; easiest to test via subprocess. HTTP/SSE can follow as a later slice.
Orchestration verb is named chat. Mirrors the /v1/chat/completions surface. (Considered run_recipe / woollama_chat; chose chat.)
tools/list scope: start with just the chat verb, but build the projection to be EXTENSIBLE so re-exporting discovered downstream tools (textops.*, hello.*) can be added later without a redesign. i.e. the tools/list builder should be a function that today returns [chat] but is structured to concatenate registry-derived tools when we flip that on.

Implementation plan — TDD, tests are the spec¶

New file tests/test_mcp_server.py, five happy-path tests (all default suite), each fails today; implement minimally to pass each in order:

test_initialize_advertises_expected_capabilities client connects → server announces { tools, prompts, ... } caps.
test_prompts_list_returns_loaded_recipes prompts/list → sees the loaded recipe names (e.g. streamer, textcounter). Each recipe IS a prompt.
test_prompts_get_returns_rendered_system_message prompts/get("streamer") → returns the recipe's system text (the assemble_prompt equivalent on the MCP side).
test_tools_list_includes_chat_orchestration_verb tools/list → includes chat; its input schema has (recipe?, model?, messages, ...).
test_chat_tool_orchestrates_end_to_end tools/call chat {recipe: "streamer", messages: [...]} → final assistant message (loop hidden, same as the OpenAI surface). Mock the underlying inferencer call — this is a unit test, not integration.

Plus one opt-in integration test: spawn woollama and drive it as an MCP server over stdio with a real MCP client (mark it like the existing live/ integration tests so it skips when Ollama isn't up).

How (e) maps onto existing code (read these first next session)¶

The MCP server is a thin projection over machinery that already exists:

src/woollama/recipes.py — recipes.names(), recipes.get(name) → Recipe dict with keys system, inferencer, tools. prompts/list and prompts/get project directly off this. (NOTE: I had not finished re-reading recipes.py/manager.py when this note was written — read both before coding.)
src/woollama/manager.py — Registry: openai_tools_for(tool_names), dispatch(namespaced_name, args) (returns result with .content list of items carrying .text), all_tool_names(), start_all()/stop_all(), ServerManager(name, command, args).
src/woollama/router.py — _orchestrate_recipe(recipe, body) is the core chat-loop the chat tool must reuse. Do NOT duplicate it. Refactor the loop body out of the FastAPI handler into a transport-agnostic coroutine (e.g. orchestrate(recipe, messages) -> final_message) that BOTH /v1/chat/completions and the MCP chat tool call. The loop currently: prepends recipe["system"], requires an ollama/ inferencer, builds tools via registry.openai_tools_for(recipe["tools"]), loops ≤8 turns hitting Ollama, dispatches tool_calls via registry.dispatch, returns the final message. Extract this so the MCP path doesn't reimplement it.
src/woollama/config.py — load_mcp_servers() (mcp.json), recipe loading.
src/woollama/__main__.py — CLI wiring. Add an mcp subcommand (woollama mcp) that starts the stdio MCP server, alongside the existing HTTP serve command. This is what clients put in their mcp.json: { "command": "woollama", "args": ["mcp"] }.
The MCP server itself: use FastMCP (already a dev dep; the examples/mcp-hello/server.py is the reference shape). Mount prompts + the chat tool on a FastMCP("woollama") instance.

Open sub-questions raised by the test names (decide while writing tests)¶

chat tool input schema: recipe (name) vs model (woollama/<recipe>) vs both? Lean: accept recipe (bare name) primarily; optionally accept model for symmetry with the OpenAI surface.
Registry lifecycle under stdio: the stdio server needs the same long-lived registry.start_all() / stop_all() that the FastAPI lifespan does. Wire it into the MCP server's startup/shutdown.
tools/list extensibility hook: make the builder def mcp_tools_list() -> list[Tool]: return [_CHAT_TOOL] # + future registry tools so flipping on re-export is a one-line concat, per decision #3.

First concrete action next session¶

uv sync --extra dev (guard against mount-move staleness), confirm uv run --extra dev pytest -q is green (43 passing).
Read recipes.py and manager.py fully.
Refactor the orchestration loop out of router.py into a shared coroutine.
Write tests/test_mcp_server.py (5 tests) — RED.
Implement the FastMCP server + woollama mcp CLI subcommand — GREEN.
Add the opt-in stdio integration test.
Commit as slice (e): woollama as MCP server (stdio) — prompts + chat verb.

What shipped (2026-06-01)¶

src/woollama/router.py — extracted the chat-loop into a transport-agnostic orchestrate(recipe, user_msgs, reg) -> resp_dict coroutine + an OrchestrationError(message, kind, status, payload) that each transport maps to its own surface. _orchestrate_recipe (HTTP) is now a thin adapter. No loop duplication — the MCP chat tool reuses orchestrate.
src/woollama/mcp_server.py (new) — build_server(registry) -> FastMCP: one MCP prompt per recipe (rendering returns the system message; per-iter closure binding avoids the late-binding bug), the chat tool (messages, recipe, optional model="woollama/<name>"), and a lifespan that start_all()/stop_all() the registry inside the serving loop (the load-bearing detail — ServerManager binds futures/tasks to that loop). _chat_tool() builds the orchestration verb. serve() runs stdio with show_banner=False (stdout is the JSON-RPC channel). (Decision #3 / tool re-export landed as a follow-on — see below.)
src/woollama/__main__.py — woollama mcp subcommand → mcp_server.serve(), logging forced to stderr so stdout stays clean.
tests/test_mcp_server.py (new, 5 unit tests, in-memory fastmcp.Client over a bare Registry()): initialize caps, prompts/list, prompts/get, tools/list has chat, chat orchestrates end-to-end (inferencer mocked).
tests/test_integration.py — +2 opt-in tests:
test_mcp_stdio_surface_with_started_registry (not Ollama-gated): spawns woollama mcp over REAL stdio with a STARTED registry (hello + textops example servers). Proves registry.start_all() runs and the server comes up clean over stdio (where the documented anyio cancel-scope bug would bite) — the unit tests use an empty registry and can't catch that. Stops short of orchestration (the unknown-recipe call short-circuits before dispatch).
test_mcp_stdio_chat_orchestrates_end_to_end (@needs_ollama): drives a full chat orchestration over stdio — the MCP counterpart of the HTTP test_orchestrated_recipe... parity test. Gives the MCP transport the same end-to-end coverage HTTP has.
Verified: default suite green / 5 deselected; the non-Ollama stdio integration test passes (~5s, really spawns the example servers). The Ollama-gated end-to-end test runs when qwen3:14b-iq4xs is available. (Counts grew with the tool-re-export follow-on below.)

Client mcp.json entry¶

{ "command": "woollama", "args": ["mcp"] }

Natural next slices¶

~~HTTP/SSE transport~~ — DONE as Streamable HTTP, mounted (see follow-on below).
Re-export output_schema pass-through on proxy tools (currently content + structuredContent only).
Legacy SSE transport, if a client ever needs it (FastMCP transport="sse").

Follow-on slice — tool re-export (decision #3, 2026-06-01)¶

woollama is now an MCP aggregator: a client connecting over MCP sees the union of every configured downstream server's tools (namespaced <server>.<tool>, e.g. hello.count_to, textops.word_count) plus the chat verb.

Implementation note that corrects the original plan: decision #3 described re-export as a "one-line concat in the tools/list builder." That turned out to be wrong — a server's tools are only known after registry.start_all(), which (per the loop constraint) runs inside the lifespan, not at build_server() time. So re-export is a lifespan-time dynamic registration, not a build-time concat:

_ProxyTool(Tool) — a FastMCP Tool subclass carrying the downstream tool's own name + input schema; its run() dispatches through the unified Registry (the SAME long-lived connection layer the chat path uses — NOT a second FastMCP client stack, which as_proxy/mount would have introduced). It's raw passthrough (no orchestrate), so it owns its own error handling: dispatch exceptions and CallToolResult.isError both become ToolError. (Tool.from_function can't be used here — it rejects **kwargs handlers and derives the schema from the signature; we need an arbitrary downstream schema.) run() passes the downstream structuredContent through (so dict-returning tools like hello.count_to reach the client as structured data, not just JSON-as-text). It deliberately does NOT mirror the downstream output_schema onto the proxy tool — that would couple every call to schema validation and break content-only results; output_schema pass-through is a later refinement.
_register_reexported_tools(mcp, reg) — called in the lifespan right after start_all(); iterates reg.servers and add_tools a _ProxyTool per discovered tool. Schemas are defensively copied (HTTP's openai_tools_for reads the same spec objects).
Tests: +1 unit test (stubbed ServerManager: re-exported tool appears in tools/list with its schema and dispatches) and the non-Ollama stdio integration test extended to assert real hello.count_to / textops.word_count are re-exported and that hello.count_to {n:3} dispatches end-to-end over real stdio (content-block fidelity — the unit test's canned content can't prove that).
Verified: default suite 46 passed / 5 deselected; full integration suite (incl. the Ollama-gated chat-over-stdio test) green.

Follow-on slice — routing demonstration + allow-list boundary (2026-06-01)¶

A cohesive demonstration of the whole routing topology, plus a real fix the demonstration surfaced.

The allow-list is now a BOUNDARY, not a hint (behavior change in orchestrate). Previously the recipe's tools allow-list only governed which tools were offered to the model (openai_tools_for); dispatch would execute whatever namespaced name the model emitted — so a recipe scoped to hello.* could reach textops.* if the model emitted it. Now orchestrate computes allowed = set(recipe["tools"]) once and refuses any out-of-list tool_call: it does NOT dispatch, log.warnings the denial, and feeds an ERROR: ... not permitted result back as the tool message (every tool_call still needs a matching tool result, so the loop continues and the model can recover — feed-back-and-continue, not hard-fail). Scope: orchestrate only — the re-exported _ProxyTool aggregator surface is deliberately left open (different concept: what a recipe may reach vs. what the aggregator exposes).

tests/test_routing.py (new, 10 hermetic tests) — executable documentation of the routing map. Inferencer mocked with scripted turns; each downstream session is a stub ServerManager that records what IT received, so tests assert the RIGHT call reached the RIGHT session (per-session recording, not a global counter). Headline: woollama/textcounter (allow-lists textops.word_count + hello.count_to) fans one chat out to two sessions — proven over BOTH the HTTP and MCP chat transports. Plus proxy-routing, passthrough, and a rejection matrix: unknown namespace (400) / recipe (404) / non-ollama inferencer (501) / MCP unknown recipe (ToolError) / dispatch to unknown provider+tool (KeyError) / out-of-allow-list tool refused without dispatch (the boundary test asserts the forbidden session's call_tool was never invoked).

tests/test_integration.py — +1 @needs_ollama test (test_two_provider_recipe_uses_tools_from_two_sessions): real textcounter chat end-to-end; loose assert (non-empty content, no leaked tool_calls) since the hermetic matrix carries the deterministic routing proof.

examples/routing_demo.py (new, runnable) — python examples/routing_demo.py spins up both surfaces over the bundled defaults and prints every routing activity live (discovery, passthrough, two-provider orchestration, MCP aggregator + direct proxy calls, rejections). Parts needing Ollama skip cleanly when it's down. Confirmed: the live run shows the model calling word_count then count_to across two sessions, returning a single hidden-loop answer.

Verified: default suite 56 passed / 6 deselected; full integration suite 6 passed (incl. the live two-provider chat); the demo runs end-to-end.

Follow-on slice — MCP over Streamable HTTP, mounted on one port (2026-06-01)¶

woollama now serves BOTH surfaces on a single port: /v1/* (OpenAI) and /mcp (MCP over Streamable HTTP), over ONE shared registry — the router thesis. Both /mcp and /mcp/ work; the mount does not shadow /v1/*. (User decisions: mount-into-FastAPI over standalone; Streamable HTTP over legacy SSE.)

The load-bearing design (Design Y — registry owned in one place): - Single shared registry (router.registry) started once by router's FastAPI lifespan and used by BOTH the OpenAI orchestration path and the MCP chat/proxy tools. Under uvicorn there's one event loop, so start_all() and every request (both surfaces) run on it — the ServerManager loop-binding invariant holds for free. - build_server(registry, *, manage_registry=False) for the mounted server: it must NOT start/stop the registry (FastAPI's lifespan owns it) — avoids a double-start. The stdio path keeps manage_registry=True. - Lifespan composition in router.lifespan: populate + start_all → register_reexported_tools(_mcp, registry) (re-export is dynamic, post-start, and — verified — visible over the HTTP transport even though tools are added after http_app() is constructed) → async with _mcp_app.lifespan(app) (the Streamable HTTP session manager) → yield → stop_all. - Import cycle broken: mcp_server lazy-imports orchestrate/ OrchestrationError inside the chat tool, so router can import build_server/register_reexported_tools at module top and mount at import (_mcp_app = _mcp.http_app(path="/"); app.mount("/mcp", _mcp_app)). - register_reexported_tools is now public (was _register_…) since both build_server's stdio lifespan and router's HTTP lifespan call it.

Client mcp.json entry (HTTP): { "url": "http://<host>:<port>/mcp" }.

Tests (tests/test_integration.py, opt-in): test_mcp_over_http_shares_ one_port_and_registry (not Ollama-gated) — /v1/models still 200 (mount didn't shadow it), /mcp lists chat + re-exported tools, and hello.count_to{n:3} dispatches over HTTP returning the structured dict; test_mcp_over_http_chat_ orchestrates_end_to_end (@needs_ollama) — full chat orchestration over the mounted HTTP surface.

examples/routing_demo.py updated: its MCP-aggregator section now connects to /mcp on the SAME running HTTP server (Streamable HTTP) instead of spawning a separate woollama mcp stdio process — so the demo actually showcases the one-port-both-surfaces capability. Confirmed end-to-end against real Ollama.

Coupling note (intentional): router.lifespan now starts the shared registry entangled with the mounted MCP app's lifespan, and _mcp/_mcp_app are built at import. So the OpenAI surface (/v1/*) now has a hard dependency on the MCP server constructing + its lifespan entering cleanly — if MCP mounting throws, /v1/* goes down with it. Acceptable for "one unified router," but it widens the OpenAI surface's blast radius.

Verified: default suite 56 passed / 6 deselected (the import-time mount construction is exercised by every import woollama.router); full integration suite 8 passed (incl. both new HTTP-mount tests + the live chat-over-HTTP); the one-port demo runs end-to-end.

Follow-on slice — Claude Code as a (tool-less) inference backend (2026-06-01)¶

A recipe whose inferencer is claude-code/<model> routes to the local claude CLI in headless print mode (src/woollama/claude_code.py), using the user's EXISTING Claude auth — no ANTHROPIC_API_KEY. A keyless path to Claude. (Chosen over the OpenAI-compat Anthropic-API shim because there's no API key in the env; the user asked for "a claude code adapter".)

Scope: TOOL-LESS completions only. orchestrate dispatches by inferencer provider: claude-code/ (recipe with tools=[]) → claude_code.run_completion; claude-code/ WITH tools → 501 ("tool delegation not yet supported"); ollama/ → the existing woollama-owned loop; else → 501. Serves both the HTTP and MCP chat transports (both call orchestrate). Tool DELEGATION (Claude Code owns the loop and runs the recipe's MCP tools) is a deliberately separate, larger concept — an executor, not an inferencer — and a later slice with its own adversarial safety pass.

Subprocess, not the Agent SDK — the SDK needs the claude CLI on PATH anyway, so shelling out is fewer deps and trivially mockable (tests patch claude_code._invoke).

Findings pinned empirically against claude v2.1.160 (docs were wrong): - --output-format json emits a JSON array of events (system → assistant → result); the final text is the result event's result field. (Docs say a single object.) - 🔴 --permission-mode dontAsk does NOT make it tool-less — it auto-RUNS read-only Bash (verified: echo executed, permission_denials: []). The docs' "dontAsk denies everything unallowed" is false for read-only commands. So a naive tool-less backend would let crafted input run shell on the host. - There is no clean "disable all tools" flag. The lockdown used: --permission-mode dontAsk (non-interactive, no hang) + --strict-mcp-config (zero MCP servers) + --disallowedTools "Bash,Read,Write,Edit,NotebookEdit, WebFetch,WebSearch,Glob,Grep,Task" (closes the read-only-Bash gap + the other exec/file/network/subagent vectors; other tools are auto-denied by dontAsk). Plus a neutral temp cwd (don't inherit host CLAUDE.md/settings/plugins) and ANTHROPIC_API_KEY stripped from the child env (force subscription auth).

Verification split (honest): the runtime safety boundary could NOT be verified from inside this Claude Code session — nested claude invocations crash the harness (the H.replace glitch). So: - Hermetic tests/test_claude_code.py asserts the command is built locked-down (strict-mcp-config, dontAsk, Bash in disallowedTools), the API key is stripped, and the JSON-array output is parsed / errors surfaced. It CANNOT prove the lockdown actually holds at runtime. - Opt-in tests/test_integration.py::test_claude_code_backend_completes_and_ refuses_shell (@needs_claude_code: requires claude on PATH AND WOOLLAMA_TEST_CLAUDE_CODE=1, default-skipped — REAL subscription cost) is the runtime check: a real completion works AND a shell-exec attempt does NOT create an absolute-path canary file. This is the test to run (outside a nested Claude Code session) to confirm the safety boundary before trusting it.

Bundled recipe: cc-assistant (inferencer = "claude-code/haiku", tools = []) — a tool-less example to demo/test against. haiku keeps live cost low.

Roadmap honesty: this does NOT close "cloud inferencers." The OpenAI-compat HTTP seam architecture.md calls for (vLLM/Together/Groq/OpenRouter/anthropic-api) is still unbuilt; claude-code is an alternative, keyless path to Claude via a totally different mechanism (subprocess delegation), not that seam.

Verified: default suite 68 passed / 9 deselected; integration suite 8 passed + 1 skipped (the claude-code live test — awaiting opt-in). Runtime tool-lockdown verification PENDING the opt-in live run.

Follow-on slice — OpenAI-compat inferencer seam (Anthropic) (2026-06-02)¶

The cloud-inferencer track architecture.md calls for: woollama routes <provider>/<model> to any OpenAI-compatible chat-completions backend. New src/woollama/inferencers.py is the registry; orchestrate and the pass-through path are now provider-generic instead of ollama-hardcoded.

Inferencer(name, base_url, api_key_env, extra_body) + a built-in registry rebuilt per call (so env overrides are live): ollama (local, no auth, native options body) and anthropic (Claude API's OpenAI-compat endpoint, Authorization: Bearer $ANTHROPIC_API_KEY, default max_tokens + clamped temperature). Adding vLLM/Together/Groq/OpenRouter is just more entries; config-file-driven inferencers (architecture.md's inferencers block) is the natural follow-on.
orchestrate dispatch: claude-code/ → subprocess backend (prior slice); otherwise resolve the provider in the registry. Unknown provider → 501; known provider with a missing API key → 400 (distinct, fail-fast via inf.headers() before any network call). The loop posts to inf.chat_url() with inf.headers() and merges inf.extra_body into the request.
Pass-through (<provider>/<model> with no recipe) generalized the same way (was ollama-only). claude-code/ is NOT a pass-through provider (recipe- only) → falls through to the 400 unknown-namespace error.
Anthropic compat-endpoint facts (from docs, not memory): base https://api.anthropic.com/v1/ + chat/completions; Bearer auth; max_tokens supported (not required); temperature capped to [0,1]; system messages hoisted+concatenated (we send one — fine). Tools/function-calling ARE fully supported (tools[].function, response tool_calls, assistant tool_calls, tool-role messages) — so this is FULL orchestration over Anthropic, not chat- only. Only strict schema enforcement is dropped (use the native API for that). OLLAMA_URL module const removed; /v1/models now derives the ollama URL from the registry.
Tests: tests/test_inferencers.py — registry units + capturing-fake routing (orchestrate→anthropic asserts URL/Bearer/bare-model/max_tokens; ollama-unchanged regression; pass-through→anthropic). The old "non-ollama → 501" tests became "unknown-provider → 501" (anthropic is supported now) + "anthropic-without-key → 400". Opt-in @needs_anthropic live test (test_anthropic_inferencer_completes_live, skips without a key) does a real tool-less round-trip.
Roadmap: this BUILDS the OpenAI-compat seam → anthropic-API now done; vLLM/Together/Groq/OpenRouter are config entries away (pending config-file inferencers). Still open: streaming, Unix socket, the conversations surface, the Rust port. (claude-code remains a separate keyless mechanism, not this seam.)
Verified: default suite 76 passed / 9 deselected; integration 8 passed + 2 skipped (claude-code & anthropic live, both opt-in). Ollama path confirmed unchanged end-to-end (the generalization's regression canary). Caveats: the Anthropic tool-loop is doc-confirmed + unit-tested on the EMIT side (woollama sends the right request); the live round-trip is tool-LESS, and a live tool-using round-trip is unverified (no key). PENDING a run with ANTHROPIC_API_KEY.

Follow-on slice — the rest of the providers (config-file + cloud built-ins) (2026-06-02)¶

Completes the multi-provider story two ways: more verified built-in clouds, and a config file so ANY OpenAI-compat backend (self-hosted vLLM/llama.cpp, niche clouds, overrides) is a one-entry addition — the durable answer architecture.md calls for, since you can't hardcode someone's vLLM host.

Built-in clouds (base URLs verified from each vendor's docs, not memory — verification caught that Together is api.together.ai, NOT the .xyz I'd have guessed): openai (api.openai.com/v1), groq (api.groq.com/openai/v1), together (api.together.ai/v1), openrouter (openrouter.ai/api/v1) — each Bearer + <PROVIDER>_API_KEY, temperature=0. Plus the existing ollama + anthropic.
config.load_inferencers() reads an optional $config/inferencers.toml ([inferencers.<name>] with base_url, optional api_key_env, optional extra_body; ${VAR} expanded). MERGED OVER the built-ins (same name overrides) — a deliberate departure from recipes/mcp.json replace semantics: an inferencer registry is infrastructure you extend, not content you own wholesale, so adding one provider must not wipe the defaults. inferencers._registry() builds built-ins then overlays config, rebuilt per call (live env/edits).
Tests (tests/test_inferencers.py, +autouse clean-config fixture for hermeticity): the cloud built-ins resolve with the right URL/Bearer; config adds a custom vllm (no auth) and overrides a built-in's base_url while the other built-ins survive; ${VAR} expansion; missing base_url errors.
Roadmap: the OpenAI-compat cloud-inferencer seam is now functionally complete — 6 providers built in, and anything else is a config entry. (Live round-trips for the cloud providers remain unverified without keys, same posture as anthropic.) Lint note: ruff is configured (E/F/W/I/B) but not run by pytest and has drifted (~28 issues tree-wide, mostly the ; pattern in earlier test files) — a quick ruff --fix cleanup is a worthwhile small slice.
Verified: default suite 81 passed / 10 deselected; changed files lint-clean.

Follow-on slice — tool delegation to Claude Code (executor) (2026-06-06)¶

The "executor" concept the slice-i note deferred: a claude-code/<model> recipe WITH a non-empty tools list (previously a 501) now delegates — Claude Code owns the agentic loop and calls the recipe's allow-listed MCP tools itself; woollama returns only the final answer. Distinct from the tool-less inferencer (run_completion) and from the ollama recipes where woollama runs the tool loop.

Design fork settled by a spike, not by argument. The open question was how Claude reaches the tools and whether the allow-list stays a hard boundary. A headless claude -p spike (haiku, vs. the bundled hello server) established: --allowedTools X + --permission-mode dontAsk is a hard deny-all-else (an out-of-list MCP tool was denied and recorded in permission_denials; the allowed one ran and the loop terminated); a direct-server --mcp-config exposes tools as the clean mcp__<server>__<tool> (no dot). → Option B (config containment): hand Claude a per-recipe --mcp-config with ONLY the servers the allow-list references + --allowedTools with ONLY those tools. Defense-in- depth (config containment AND allow-list AND the slice-i built-in lockdown), clean naming, smallest blast radius, no loopback re-entrancy. Option A (point Claude at woollama's own /mcp aggregator) + per-session filtering is deferred.
claude_code.run_delegated: writes a temp --mcp-config, maps <server>.<tool> → mcp__<server>__<tool> for --allowedTools, keeps --strict-mcp-config + _DENY_TOOLS + neutral cwd, caps --max-turns.
Nested-contamination fix. The spike's child inherited THIS session's harness (its meta-tools) via CLAUDECODE/CLAUDE_CODE_*. New _child_env() strips that family (plus ANTHROPIC_API_KEY), used by the tool-less path too. A residual leak from global ~/.claude is not env-fixable, so the live gate is plain-terminal-only (like slice-i).
Routing: router.orchestrate_events claude-code branch delegates when recipe["tools"] is non-empty; _delegate_mcp_servers resolves the referenced servers from config.load_mcp_servers() (400 if one is missing — never a partial toolset). Works through every surface (HTTP chat, /v1/responses, MCP chat) for free. Bundled cc-counter recipe added.
Tests: hermetic unit (argv/config containment, the exact allow-list boundary, env strip) + routing (delegates not 501; missing-server 400). Live gate (@needs_claude_code, plain terminal): delegated count + a shell-exec attempt refused in delegation mode. Positive path verified at the EVENT level through woollama's GENERATED config (resolving ${WOOLLAMA_EXAMPLES_DIR}): the hello server launched and Claude invoked mcp__hello__count_to (result.is_error=False, no permission denials) — not merely a final-text claim. (The adversarial Bash-in-delegation refusal stays plain-terminal-only; it rests on construction + the shared _DENY_TOOLS, not yet observed.)
Verified: default suite 132 passed / 16 deselected; ruff clean.

Follow-on slice — claude-code tool lockdown: deny-list → allow-list-of-none (2026-06-06)¶

Running the delegation live gate surfaced a real robustness gap. The lockdown was a DENY-LIST (_DENY_TOOLS: Bash/Read/Write/…). On a machine whose global Claude config / plugins enable extra tools NOT in that list (this box has Skill, Workflow, Cron, Monitor, ToolSearch, LSP, …), those tools stayed reachable: a deny-list can't enumerate every deployment's tools, and --permission-mode dontAsk does NOT deny them (it hard-denies only tools that would prompt; auto-approved extras pass through — confirmed: a spike showed Skill launching, not being denied). What looked like "nested-session contamination" was actually this machine's global config: a fully-scrubbed env -i child still showed the extras, so they're not a session leak.

Fix (authoritative claude flags, confirmed via the claude-code guide + probes): - --tools "" — the primary lockdown: an ALLOW-LIST of built-in tools set to NONE. Removes the entire built-in/plugin set at the source, robust against any deployment's extras. (Added to BOTH the tool-less and delegation arg builders.) - ENABLE_TOOL_SEARCH=false (in _child_env) — --tools "" also disables the built-in ToolSearch that surfaces deferred MCP tools (Claude Code's default), so the recipe's MCP tools would vanish; this loads them UPFRONT instead. - _DENY_TOOLS kept as defense-in-depth, now carrying LSP — the one tool --tools "" leaves behind (only --bare drops it, and --bare breaks subscription auth — rejected: a probe returned "Not logged in"). - Rejected --bare (kills keychain/credential reads → no auth).

Verified at the event level through woollama's real _build_delegate_args + _child_env: the delegated Claude now exposes ONLY the recipe's MCP server tools (no built-ins, no LSP, no Skill/Workflow), the out-of-list MCP tool is hard-denied, the delegated tool runs, and a shell-exec attempt is refused (Bash absent). Because the harness tools are now stripped, the delegation + tool-less live gates pass trustworthily even nested. 154 hermetic pass; ruff clean.

Follow-on slice — executor adversarial safety pass (2026-06-06)¶

The "adversarial safety pass" the executor was always flagged as needing. A focused security review of the delegation path + lockdown + stored store (with each finding verified against the code, not speculated) found three real issues, now fixed:

HIGH — provider keys leaked to the child. _child_env was a DENY-LIST of 3 vars, so OPENAI_API_KEY/GROQ_API_KEY/TOGETHER_API_KEY/OPENROUTER_API_KEY (and any inferencers.toml api_key_env) flowed into the claude child AND the MCP servers it spawns. A key-custody router must not do that. Fixed: _child_env is now an ALLOW-LIST (_CHILD_ENV_ALLOW: HOME/PATH/locale/proxy/ TERM/… only) — no provider keys/secrets reach the child. Verified live that subscription auth still works with just that set (HOME carries ~/.claude).
MEDIUM — host settings could undercut the sibling boundary. The child read the host's ~/.claude/settings.json; a permissions.allow: mcp__* rule there would auto-approve a tool and slip past dontAsk. Fixed: --setting-sources project on both arg builders (neutral temp cwd → loads no settings). Verified auth survives it.
MEDIUM — comma in a recipe tool name injected an extra --allowedTools entry (a same-server sibling grant, e.g. count_to,mcp__hello__hello). Fixed: reject commas/whitespace in tool names — at the router (_delegate_mcp_servers → 400) and defensively in _mcp_tool_name (ValueError).

Verified SAFE (checked, hold up): no SQL injection (all duckdb queries parameterized), no --mcp-config JSON injection (json.dump + server-name validated against config), no argv/shell injection (create_subprocess_exec list form), no tempdir/path traversal (random TemporaryDirectory/mkdtemp), turn + timeout bounds present. Accepted LOW / out-of-scope: --tools "" empty-string semantics are CLI-version-coupled (deny-list _DENY_TOOLS is the backstop); conversations have no per-caller ownership (mitigated by 122-bit unguessable uuid4 ids; woollama is local single-user).

New: a live sibling-denial test (test_claude_code_delegation_denies_same_ server_sibling) — the review's top previously-untested risk — asserts at the event level that an un-allow-listed same-server tool never executes. 156 hermetic pass; ruff clean; all three @needs_claude_code live gates green.

Revert — drop the duckdb `stored` backend (conv-5) (2026-06-06)¶

conv-5 gave woollama an embedded duckdb conversation store ($XDG_DATA_HOME/woollama/conversations.duckdb) that persisted transcripts and replayed them for models with no native session. On reflection that's a category error: woollama must never own conversation storage — the design's own §1 says so. It may proxy a backend's transcript, retrieve it, or create one in ANOTHER system; it does not become the store. Shipped 2026-06-05, reverted the next day.

Removed in full: the duckdb dependency; StoredStore / StoredBackend / stored_store / rehydrate_stored / _default_db_path in conversations.py; the startup rehydration in the FastAPI lifespan; ConversationStore.add; and the backend_for_model → "stored" default (it returns None for non-claude models again). complete_stateless stays (it backs the stateless /v1/responses path).

Replacement is a non-decision: a model with no state-owning backend is stateless — /v1/responses store:true and /v1/conversations create both return a clean 501 ("use store:false; the caller owns history"), exactly as the Anthropic Messages API is itself stateless. Only state-OWNING backends live in BACKENDS (today: claude-resume). If non-claude models ever need stateful conversations, the answer is a backend that DEFERS to an external owner woollama is a client to — a conversation-store MCP server, or Managed Agents (conversations-api-design.md §8.7, drafted this session) — never woollama's DB.

Test ripple: dropped the stored unit/routing/items/rehydrate tests; restored the store:true → 501 / conversations.create non-owner → 501 assertions; the live e2e /v1/conversations journey moved off ollama+stored onto claude-resume (@needs_claude_code) and now asserts items → 501 (transcript-read is the driver/managed-agents slice's job). 151 hermetic pass; ruff clean; no DB leak.

Live confirmation (2026-06-07): ran test_conversations_surface_full_journey_live end-to-end against the real claude-resume backend (claude-code/haiku) — full CRUD + discover, two-turn recall proving the resumed Claude session owns the state, items → 501, delete → 404. Green in 17.2s. (The revert commit 2d45f23 predates this run and flags the journey as "not yet run live"; it now is.)

conv-6 — the `managed-agents` backend (2026-06-07)¶

The second state-owning conversation backend, and the purest one: it defers all state to Anthropic's Managed Agents API (/v1/agents + /v1/sessions, beta managed-agents-2026-04-01). Anthropic hosts the session, the loop, and a per-session container; woollama holds only the session_id. This is the alternative to the §6-blocked Rust tmux driver for a hosted, stateful, tool- capable session — the hard infra is Anthropic's.

Shape (design-doc §8.7): new namespace claude-agent/<model> → backend_for_model → managed-agents. managed_agents.py is the SDK wrapper (parallel to claude_code.py): the sync anthropic SDK run in asyncio.to_thread, with _client() as the single mock seam (lazy import anthropic, so the core install and the hermetic suite need neither the package nor a key). conversations.ManagedAgentsBackend holds the cache: ONE tool-less agent per model (created lazily, reused — never per session, the documented anti-pattern) + one shared environment; a session per conversation, created on the first turn. send_turn opens the event stream, sends only the NEW user turn (Anthropic owns prior history), and collects agent.message text to session.status_idle, under a per-event wall-clock deadline + an outer wait_for (SDK stream timeouts are per-chunk, not total). history is implemented (events.list → transcript items), so /v1/conversations/{id}/items SERVES the transcript for this backend — the first for which it does (claude-resume still 501s). delete → sessions.delete. ManagedAgentsError now also maps to 502 in _responses_stateful.

Dependency: anthropic>=0.107.1 in a new optional agents extra (verified the beta.{agents,environments,sessions.events} surface exists in 0.107.1 before writing against it).

Tests: tests/test_managed_agents.py (10) mocks the SDK with a fake client that records control-plane calls and scripts the event stream — asserts lazy create-then-reuse, agent/env created once across conversations, the new-turn-only send, history retrieval, /items now 200 (the capability win), delete teardown, and the 502 mapping. 161 hermetic pass; ruff clean.

Honesty: the hermetic tests prove woollama's WIRING, not the Managed Agents API contract — the bindings come from the claude-api skill docs against anthropic==0.107.1. Known limit (documented, not solved): the in-memory handle table means a restart orphans live (billed) sessions, and each fresh process re-creates its per-model agent — the ant-YAML / reuse-by-name control plane is the eventual fix.

Live confirmation (2026-06-07): ran test_managed_agents_conversation_journey_live against the real Managed Agents API (claude-agent/haiku) — CREATE (backend managed-agents) → discover → two-turn recall proving Anthropic resumed the hosted session → /items 200 serving the transcript (the capability win over claude-resume's 501) → session delete → 404. Green in 15.0s. (Commit 0552661 predates this run and flags the gate as "not run live"; it now is.)

issue #1 — honor `num_ctx` for the ollama provider (2026-06-07)¶

Bug: ollama/<model> passthrough forwards the client body to ollama's OpenAI-compat /v1/chat/completions, which ignores num_ctx (even nested as options.num_ctx) — long inputs silently loaded at the 4096 default and truncated. ollama only honors num_ctx on its NATIVE /api/chat.

Fix (ollama_native.py + router): when an ollama passthrough request asks for a context size (options.num_ctx) and carries no tools, route it to /api/chat, translating the request to ollama's native shape and the response BACK to the OpenAI chat-completions shape (both non-stream → chat.completion, and stream → chat.completion.chunk SSE, NDJSON-per-line → OpenAI deltas + [DONE]). The native wire shapes were captured live from ollama first (not recalled); created_at is an RFC3339 string so OpenAI created uses our own int(time.time()). Requests without num_ctx, and those WITH tools, stay on the unchanged /v1 path (we don't half-translate tool_calls; documented). Native endpoint gets a 600s read timeout — large num_ctx means a long prompt-eval before the first byte, correlated with exactly these requests.

Tests: tests/test_ollama_native.py (11) — the pure translators against the captured shapes + router wiring (mocked httpx) asserting num_ctx routes to /api/chat intact, no-ctx and ctx+tools stay on /v1, and the stream translator emits OpenAI SSE. 172 hermetic pass; ruff clean.

Live acceptance (2026-06-07): test_ollama_num_ctx_honored_via_native_endpoint drove a num_ctx=16384 chat-completion through a live woollama → ollama, then asserted ollama's /api/ps reports context_length=16384 for the loaded model (the issue's exact bar). Green in 12.3s.

issue #2 — design slice: store-only backend for non-claude models (2026-06-07)¶

A DESIGN slice (no code), by decision — to avoid a second conv-5-style revert on the "woollama never owns the store" principle. Read both cosmic-fabric primary sources first (cosmic-fabric#1, doc/woollama-coordination.md): the agreed architecture is woollama-as-inference-backbone, fabric-behind-woollama, and cosmic-fabricd thinning toward a desktop-session daemon.

Findings that reshaped it: (1) Managed Agents (conv-6) can't satisfy #2 — it pins inference to a Claude model, so it can't run ollama/<model>. (2) ollama has NO server-side sessions (verified, 0.24.0): /api/chat is stateless; the /api/generate context token-id array is caller-held, generate-only, opaque, reload-fragile — not a state owner.

Decision (user's call, it touches the principle): defer the transcript to fabric / the cosmic-fabricd session daemon, behind a PLUGGABLE conversation-store seam — consistent with how claude-resume defers to Claude's JSONL and managed-agents to Anthropic. Design doc gained §3.1 (two backend kinds — native-loop vs store-only/BYO-inference — sharing the "never the store" invariant), §8 item 8, and a full §10 (the store-provider protocol create/get/append/delete, the store-only send_turn assembly+inference loop, fabric as first provider, MCP-store / JSONL as future pluggable providers).

Status: impl BLOCKED pending the fabric read/append contract (cosmic-fabric side). Tracked as issue #2 / roadmap conv-7. No src/test changes this slice.

issue #2 — store-backed backend mechanism (2026-06-07, same day as the design)¶

Built the woollama-SIDE half of #2 behind an un-wired seam — the part that's codeable without the cross-repo contract. conversations.py gains: ConversationStoreProvider (a Protocol — the PROVISIONAL create/get/append/ delete contract woollama proposes to fabric; named to avoid clashing with the ConversationStore handle table, which is routing state, not bytes); StoreBackedBackend (store-only / BYO-inference: native_id = store.create() → assemble store.get(...) + new turn → STATELESS inference → store.append(...); history serves /items for free); register_store_backend(provider, complete) + a backend_for_model gate that routes non-claude models to it ONLY when a provider is registered. The inference fn is INJECTED (router passes complete_stateless) so conversations never imports router.

Error path (advisor-caught, the happy-path fake wouldn't surface it): complete_stateless raises OrchestrationError, which _responses_stateful didn't catch (only ClaudeCodeError/ManagedAgentsError) → a store inference failure would 500. Added an OrchestrationError arm there that surfaces payload/status like the stateless branch. Tested with a completer that raises.

Known seam (#1 ↔ #2, documented not fixed): complete_stateless sends ollama via /v1, so a store-backed ollama turn does NOT honor num_ctx yet — issue #1's native path is passthrough-only.

Default behavior UNCHANGED: no provider ships, so backend_for_model("ollama/…") → None and non-claude store:true still 501s (the no-regression gate, still green). Tests: tests/test_store_backend.py (8) over an in-memory fake provider — assemble/complete/append, cross-turn recall via reassembly, /items, delete, the routing gate, register(), and the clean-error path. 180 hermetic pass; ruff clean.

Framing (per the chosen design→confirm→implement plan): the protocol is a VISIBLE proposal, not a settled contract — fed back to cosmic-fabric on issue #2 as the "confirm" step. #2 stays open: the fabric provider + the agreed contract + the cosmic-fabric wiring remain.

#1 ↔ #2 seam closed — num_ctx in stateful turns (2026-06-07)¶

The store-backed slice shipped with a documented seam: complete_stateless sent ollama via /v1, so a stateful ollama turn ignored num_ctx (issue #1's native path was passthrough-only). Closed it: complete_stateless gained an options kwarg and routes ollama through the native /api/chat when num_ctx is present (reusing the #1 ollama_native translators); the /v1/responses options thread through _responses_stateful → send_turn(..., options=) → the injected completer. The three backends' send_turn took a keyword-only options (claude-resume / managed-agents ignore it — Claude/Anthropic own inference; store-backed forwards it). CompleteFn relaxed to Callable[..., Awaitable[str]] for the kwarg.

Tests: +3 — complete_stateless routes num_ctx native / falls back to /v1 (test_ollama_native.py), and a store-backed ollama turn threads num_ctx to /api/chat end-to-end with the REAL complete_stateless (test_store_backend.py). 183 hermetic pass; ruff clean. Live: test_responses_stateless_ollama_honors_ num_ctx drove a stateless /v1/responses turn with num_ctx=8192 → /api/ps shows 8192. Green in 10.1s. (The store-backed path uses the same completer; it's hermetic-only since no provider is wired.) Recipe/orchestrate path unaffected — num_ctx applies to direct ollama models.

conv-8 — interactive requires_action path (2026-06-07)¶

The §5 interactive pause/answer path, shipped via the managed-agents backend — WITHOUT the §6-blocked Rust tmux driver. A hosted CMA session pauses when the model calls a client-side custom tool; woollama maps that onto the Responses requires_action primitive.

Grounded the mechanism in the installed SDK first (the conv-6 discipline): agent.custom_tool_use carries .id/.name/.input; the resume is user.custom_tool_result with custom_tool_use_id (Required) + content; session.status_idle carries stop_reason whose .type is requires_action. Custom tools are client-executed → no container provisioning.

managed_agents.py: the agent gains ONE custom tool, ask_user (question[, options]). _collect_sync (was _run_turn_sync) now also watches for agent.custom_tool_use and returns {text, pending} (pending = the paused tool call); run_turn (new user.message) and answer_turn (new user.custom_tool_result) share it.
conversations.py: Conversation gains required_action + pending_tool_use_id. ManagedAgentsBackend._apply folds {text,pending} into the handle — pending → required_action={type:ask_user,question:<input>} + status awaiting_input, else clears. New answer(conv, answer) resumes via answer_turn.
responses.py: build_response gains required_action (with status="requires_action") — a documented Responses SUPERSET (extra key + non-OpenAI status), so that variant is NOT a strict openai Response; the completed variant is unchanged and still validates.
router.py: signal the pause through conv, not a new return type (the tool_use_id is conversation-scoped — consumed on a later request — so it must live on conv anyway; a TurnResult would be a redundant second encoding). Two gotchas handled: the post-send status="idle" is now conditional (don't clobber awaiting_input); and an attached turn on an awaiting_input conv routes to backend.answer BEFORE a fresh send_turn, clearing pending on resume. The requires_action Response carries conv.required_action.

Tests: test_managed_agents.py +2 — the pause→answer round-trip (asserts status:requires_action + the question, then the EXACT tool_use_id is returned on resume and the turn completes) and the routing discriminator (resume hits answer, not send_turn; the answer is NOT appended as a user.message). 185 hermetic pass; ruff clean; mkdocs --strict clean.

Scope/honesty: most build-ahead feature of the session — no consumer yet. Kept minimal (one ask_user tool, single round-trip; no tool_confirmation/always-ask, no multi-pending). Hermetic is the contract proof; the live gate (test_managed_agents_requires_action_round_trip_live, @needs_anthropic) is PAID and BEST-EFFORT — it SKIPS if the model answers directly instead of calling ask_user (nondeterministic) — and is written-but-NOT-run.

conv-9 — streaming /v1/responses (2026-06-07)¶

/v1/responses previously 400'd on stream:true; now a STATELESS turn emits OpenAI Responses SSE. Grounded the event shapes in the installed openai SDK first (required fields per event) so the sequence is real, not guessed.

router.py: complete_stream(model, messages) — an async generator of TEXT DELTAS: recipes via orchestrate_events(stream=True) (tool turns hidden), a plain inferencer via its chat-completions SSE (parse choices[0].delta.content). _responses_stream(...) wraps deltas in the canonical Responses event sequence (response.created → output_item.added → content_part.added → output_text.delta* → output_text.done → content_part.done → output_item.done → response.completed), emitted as named event: + JSON data: frames with monotonic sequence_numbers. Primes the delta source before returning the StreamingResponse, so a setup error (unknown model/recipe, missing key, upstream 4xx) maps to an HTTP status not an empty 200 stream (mirrors the chat streamers). responses_create now streams the stateless path; STATEFUL stream:true stays a 400 (claude-resume can't token-stream; managed-agents native streaming is a later slice).

Tests: tests/test_responses_stream.py (6) — the full event sequence + accumulated text for an inferencer stream and a recipe stream, the emitted frames validated against openai's ResponseCreatedEvent/ResponseTextDeltaEvent/ ResponseCompletedEvent, stateful→400, and setup errors (unknown model→400, unknown recipe→404) surfacing as HTTP not a stream. Repointed the old test_responses_stream_true_is_400 at the stateful case. 191 hermetic pass; ruff clean. Live: test_responses_streaming_live drove a real ollama stream — events created…delta…completed, deltas concatenating to the completed text. Green in ~15s. (#1↔#2 note: streaming uses the inferencer's /v1 SSE, so num_ctx-native is non-stream-only for now — documented follow-on.)

issue #3 — cloud models discoverable in GET /v1/models (2026-06-08)¶

cosmic-fabric's model picker is populated from /v1/models, which listed only ollama/<model> (live-discovered) + woollama/<recipe> — cloud models routed by id but weren't discoverable, so a list-backed combobox couldn't offer them.

Design (user's call): woollama owns NO model catalog. Each inferencer opts in via inferencers.toml: a static models = [...] list (no key needed to list) and/or discover = true (live-query the provider's own /v1/models) filtered by model_patterns (fnmatch globs, so openrouter's ~300 don't flood the picker). This generalizes ollama's existing live discovery (ollama is now just discover=True by default) to any provider, opt-in. Built-in clouds surface nothing until configured → no regression, and no listing models whose key isn't set.

inferencers.py: Inferencer gains models/discover/model_patterns; ollama built-in defaults discover=True; all() exposes the registry. Config merge is now FIELD-LEVEL over built-ins (was wholesale replace) so you can add models to anthropic without restating base_url; the "new provider needs base_url" check moved here (only here do we know if a name extends a built-in).
config.py: load_inferencers parses the new fields and emits only-present keys (so the merge can tell "unset" from "set"); base_url no longer required at parse time.
router.py: /v1/models loops all inferencers — static models + (if discover) _discover_models (queries {base_url}/models, fnmatch-filtered), deduped, namespaced provider/<id>; a failing discover is logged and skipped, not fatal.

Tests: test_inferencers.py +4 (field-merge extend-without-base_url, new-provider base_url error now from the registry, discover/patterns parse, and the /v1/models enumeration with static + pattern-filtered discovery over mocked httpx); fixed the shared HttpxResponseStub to add raise_for_status. 194 hermetic pass; ruff clean. Live: the existing @needs_ollama /v1/models test passes against real ollama (the refactored discovery path). The cloud-discovery-against-a-real-vendor variation is the same code path (hermetic-covered); could be live-checked with a key later. Config reference for the new fields lands with the docs pass.