Swap the base URL.
Use OpenAI-compatible endpoints for chat, embeddings and model routing so existing SDK-based tools can move to local infrastructure.
One private AI appliance with an OpenAI-compatible API gateway, local inference, RAG, agents, MCP connectors, audit logs and Pure Mode. A signed manifest enforces what runs where — and what remains inside the support boundary.
External entities at the top. The appliance below. The support boundary line cleanly separates certified core from anything you build in T3.
Developers · analysts · support · legal · operations.
Browser · IDE · Slack/Teams · Email · CLI.
Your existing IdP — we federate, never replace.
e.g., Okta · Azure AD · Google · Ping.
Whatever you already use — chat, source control, ticketing, docs, CRM, mail, storage.
TLS termination · reverse proxy · routing · rate-limiting
Traefik · Kong · NGINX
Federated to your IdP via OIDC / SAML — never replaces it · SCIM user provisioning · role mapping (Admin / User / Auditor / Read-Only)
Keycloak · Authentik · Zitadel
OpenAI & Anthropic-compatible API · model routing · per-team budgets · audit logging
LiteLLM
High-throughput model serving · chat · code · embeddings · client fine-tunes · loaded from on-box signed registry
SGLang · vLLM
Vetted MCP catalog (T1) + verified partner connectors (T2). All credentials in on-box vault — they never leave the appliance.
MCP servers · chat · source control · ticketing · CRM · docs · …
Agent runtimes for multi-step tasks · default catalog of agents we configure · client-extensible in T3
openclaw / nemoclaw
Approved workflow templates, scheduled jobs and long-running orchestration
Console workflow runtime · Temporal-style orchestration
Vector + RAG store inside knowledge workspace · object storage · cache · optional dedicated DB by agreement
Local vector store · MinIO · Redis · (Postgres + pgvector)
LLM tracing · metrics · logs — fully on-prem. No telemetry leaves the box.
Langfuse · Grafana · Loki · Prometheus
Container orchestration · VM management · OS · out-of-band management · signed-update + license daemons
Kubernetes · Portainer · Proxmox · Linux · BMC
Compute · memory · storage · network · power · physical security
Supermicro GPU(s) · CPU · NVMe · 25 / 100 GbE NIC · redundant PSU · TPM · tamper sensors
Custom apps · custom connectors · custom workflows · client-trained models
No host privileges · egress allowlist · isolated secrets · outage here never blocks T1
Defined by you, on your clock — outside our SLA
Every component is signed and labelled. T1 runs with host privileges. T2 in restricted containers. T3 sandboxed with no host access. The admin UI shows tier badges next to every installed component — never ambiguous, never argued.
One-click admin action that disables every T2/T3 component. Use it for security incidents, support diagnosis ("if it reproduces in Pure Mode it's our ticket"), or to keep an audit clean.
Chat, source control, ticketing, docs, CRM, mail, storage — all wired through curated MCP servers. Every credential lives in your on-box vault. Nothing leaves the appliance.
LLM Machines gives engineering teams familiar interfaces while keeping traffic, credentials, models and logs under enterprise control.
Use OpenAI-compatible endpoints for chat, embeddings and model routing so existing SDK-based tools can move to local infrastructure.
Integrate private models with IDE assistants, internal applications, LangChain-style workflows and MCP servers.
Keep request logs, model routing, usage attribution, rate limits and metrics available to admins without sending telemetry outside.
Unified endpoint for all LLM providers and local models. Usage tracking, rate limiting, cost control.
A polished, ChatGPT-like interface for all end users, SSO-linked and wired through the local gateway.
Document ingestion, vector search, and retrieval-augmented generation for enterprise knowledge bases.
AI-powered research and knowledge synthesis. Deep-dive reports generated automatically.
Autonomous agents for complex, multi-step enterprise workflows.
Automatic detection and redaction of sensitive data before it ever reaches a model.
A high-performance engine for running open-weight models locally — pure OSS, no NVIDIA AI Enterprise tax.
The connective tissue that turns these projects into a single, deployable, production-ready appliance. The signed manifest, the tier model, the support boundary, the runbook.
Details security, platform and developer teams usually ask before approving an on-prem AI deployment.
Yes. The gateway exposes OpenAI-compatible endpoints so applications can point to the appliance instead of a public cloud API, while keeping authentication, logging and routing local.
Yes. Security-sensitive deployments can use offline license activation and local model registries so core inference, RAG and application surfaces work without public internet access.
The architecture is designed for open-weight model families such as Llama, Mistral and Qwen, with model choice sized to your hardware, latency and quality requirements.
Connector credentials live in the on-box vault inside your environment. MCP servers and integration services use those credentials locally instead of sending them to our infrastructure.
Pure Mode disables T2/T3 custom components and keeps the certified T1 core running. It is useful for incident response, support diagnosis and audit preparation.
The signed certified core, tier model, manifest, gateway, inference services and documented T1/T2 components are supported. Client-built T3 extensions remain isolated from the SLA.
See how the technology lands inside your environment — onboarding, pricing, or just talk to us.