# AI Pipeline

> **Note:** This example is in progress.
An LLM inference pipeline with a prompt editor, container-side model inference, and a streaming output viewer.
## Planned architecture
## Components

- **Prompt editor** — a JavaScript metaframe (e.g. editor.mtfm.io) that emits the prompt as a named output
- **LLM container** — a Docker container running inference (Ollama, vLLM, or similar); receives the prompt at `/inputs/prompt.txt`, writes the response to `/outputs/response.txt`, and optionally streams intermediate results via the WebSocket channel
- **Output viewer** — a JavaScript metaframe that renders the response, or a downstream container that processes structured output
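The container contract above can be sketched end to end. This is a minimal illustration, not platform code: `generate` stands in for the real inference call, and the default paths follow the input/output convention described above (overridable for local testing):

```python
import os

# Default paths follow the container contract; override for local testing
INPUT_PATH = "/inputs/prompt.txt"
OUTPUT_PATH = "/outputs/response.txt"

def generate(prompt: str) -> str:
    # Placeholder: replace with a real call into Ollama, vLLM, etc.
    return f"(model output for: {prompt})"

def run_job(input_path: str = INPUT_PATH, output_path: str = OUTPUT_PATH) -> str:
    # Read the prompt delivered as a container input
    with open(input_path) as f:
        prompt = f.read()
    response = generate(prompt)
    # Write where the output viewer expects the result
    with open(output_path, "w") as f:
        f.write(response)
    return response
```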
## Key patterns

### Streaming progress via WebSocket

For long-running inference, the container can stream tokens in real time without waiting for the job to finish:

```python
import asyncio
import json
import os

import websockets

async def stream_tokens(tokens):
    url = os.environ.get("WEBSOCKET_URL")
    async with websockets.connect(url) as ws:
        for token in tokens:
            await ws.send(json.dumps({"type": "token", "value": token}))
```
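One detail worth handling: `WEBSOCKET_URL` may be absent, for example when the job is run outside a metapage. A small sketch that degrades to a non-streaming path (the fallback behavior here is an assumption, not a platform guarantee):

```python
import json
import os

async def stream_tokens_safe(tokens):
    # Streaming is best-effort: without a WebSocket URL, just collect
    # the tokens so the caller can write the final output file.
    url = os.environ.get("WEBSOCKET_URL")
    if not url:
        return "".join(tokens)
    import websockets  # only needed when streaming is enabled
    async with websockets.connect(url) as ws:
        for token in tokens:
            await ws.send(json.dumps({"type": "token", "value": token}))
    return "".join(tokens)
```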
### Model caching

Use `$JOB_CACHE` to avoid re-downloading the model on every job:

```python
import os

cache_dir = os.environ["JOB_CACHE"]
model_path = os.path.join(cache_dir, "llama3.gguf")
if not os.path.exists(model_path):
    download_model(model_path)  # download_model: your model fetcher
```
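If a job can be interrupted mid-download, a truncated file would poison the cache for every later job. One way to harden the pattern above is to download to a temporary file and rename atomically; this is a sketch, with the fetcher passed in as a callable (standing in for the `download_model` above):

```python
import os
import tempfile

def ensure_model(cache_dir: str, download, filename: str = "llama3.gguf") -> str:
    # Cache hit: reuse the model left by a previous job
    model_path = os.path.join(cache_dir, filename)
    if os.path.exists(model_path):
        return model_path

    # Download to a temp file in the same directory, then rename
    # atomically so an interrupted job never leaves a partial model.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    try:
        download(tmp_path)
        os.replace(tmp_path, model_path)
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return model_path
```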
### Version-pinning the model container

Use git refs in URLs to pin the inference container to a specific commit:

```
https://github.com/myorg/llm-runner/tree/_model-version_
```

Then pass `?model-version=v2.1.0` in the metapage URL to pin to a release.
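Substituting the template makes the effect concrete: the `_model-version_` placeholder in the repo URL is replaced by the query-parameter value, so the metapage resolves a fixed ref:

```
https://github.com/myorg/llm-runner/tree/_model-version_   # template
https://github.com/myorg/llm-runner/tree/v2.1.0            # with ?model-version=v2.1.0
```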