Technology

On-Prem AI Appliance Architecture

One private AI appliance with an OpenAI-compatible API gateway, local inference, RAG, agents, MCP connectors, audit logs and Pure Mode. A signed manifest enforces what runs where — and what remains inside the support boundary.

01 — Architecture

Reference architecture.

External entities at the top. The appliance below. The support boundary line cleanly separates certified core from anything you build in T3.

External
End Users

Developers · analysts · support · legal · operations.
Browser · IDE · Slack/Teams · Email · CLI.

Authenticate via client SSO
External
Client Identity Provider

Your existing IdP — we federate, never replace.
e.g., Okta · Azure AD · Google · Ping.

OIDC · SAML 2.0 · SCIM
External
Client's Existing Tools

Whatever you already use — chat, source control, ticketing, docs, CRM, mail, storage.

OAuth · REST · GraphQL · Webhooks
LLM Machine · On-Prem Appliance Certified Core

Edge / Gateway

TLS termination · reverse proxy · routing · rate-limiting

Traefik · Kong · NGINX

T1

Identity & SSO

Federated to your IdP via OIDC / SAML — never replaces it · SCIM user provisioning · role mapping (Admin / User / Auditor / Read-Only)

Keycloak · Authentik · Zitadel

T1

App Surfaces · user-facing

T1
Chat Interface
RAG · multi-user · MCP
LibreChat
IDE Backend
VS Code · JetBrains plugin
Continue
Code Completion
FIM · self-hosted
opencode / Tabby
Workflow Runner
templates · approvals · webhooks
Console workflows
Knowledge Workspace
RAG · workspace docs
Local RAG workspace

Inference Gateway

OpenAI & Anthropic-compatible API · model routing · per-team budgets · audit logging

LiteLLM

T1

Inference Servers

High-throughput model serving · chat · code · embeddings · client fine-tunes · loaded from on-box signed registry

SGLang · vLLM

T1

Tool / Integration Layer

Vetted MCP catalog (T1) + verified partner connectors (T2). All credentials in on-box vault — they never leave the appliance.

MCP servers · chat · source control · ticketing · CRM · docs · …

T1 T2

Agentic Layer

Agent runtimes for multi-step tasks · default catalog of agents we configure · client-extensible in T3

openclaw / nemoclaw

T1

Workflow & Orchestration

Approved workflow templates, scheduled jobs and long-running orchestration

Console workflow runtime · Temporal-style orchestration

T1

Data

Vector + RAG store inside knowledge workspace · object storage · cache · optional dedicated DB by agreement

Local vector store · MinIO · Redis · (Postgres + pgvector)

T1

Observability & Audit

LLM tracing · metrics · logs — fully on-prem. No telemetry leaves the box.

Langfuse · Grafana · Loki · Prometheus

T1

Platform

Container orchestration · VM management · OS · out-of-band management · signed-update + license daemons

Kubernetes · Portainer · Proxmox · Linux · BMC

T1

Hardware · enterprise / industry-grade

Compute · memory · storage · network · power · physical security

Supermicro GPU(s) · CPU · NVMe · 25 / 100 GbE NIC · redundant PSU · TPM · tamper sensors

T1
SUPPORT BOUNDARY · PURE MODE SHUTS DOWN EVERYTHING BELOW

Client BYO Sandbox

Custom apps · custom connectors · custom workflows · client-trained models

No host privileges · egress allowlist · isolated secrets · outage here never blocks T1

Defined by you, on your clock — outside our SLA

T3
Tier model

T1 / T2 / T3 with manifest enforcement.

Every component is signed and labelled. T1 runs with host privileges. T2 in restricted containers. T3 sandboxed with no host access. The admin UI shows tier badges next to every installed component — never ambiguous, never argued.

Pure Mode

Kill everything custom. Keep certified core running.

One-click admin action that disables every T2/T3 component. Use it for security incidents, support diagnosis ("if it reproduces in Pure Mode it's our ticket"), or to keep an audit clean.

MCP catalog

Vetted connectors out of the box.

Chat, source control, ticketing, docs, CRM, mail, storage — all wired through curated MCP servers. Every credential lives in your on-box vault. Nothing leaves the appliance.

02 — For developers

Build against local AI like a standard API.

LLM Machines gives engineering teams familiar interfaces while keeping traffic, credentials, models and logs under enterprise control.

API compatibility

Swap the base URL.

Use OpenAI-compatible endpoints for chat, embeddings and model routing so existing SDK-based tools can move to local infrastructure.

Tooling

Works with developer workflows.

Integrate private models with IDE assistants, internal applications, LangChain-style workflows and MCP servers.

Operations

Observable by default.

Keep request logs, model routing, usage attribution, rate limits and metrics available to admins without sending telemetry outside.

[ 01 ]

LiteLLM — Gateway & Router

Unified endpoint for all LLM providers and local models. Usage tracking, rate limiting, cost control.

[ 02 ]

LibreChat — User Interface

A polished, ChatGPT-like interface for all end users, SSO-linked and wired through the local gateway.

[ 03 ]

Knowledge RAG Layer — Retrieval Engine

Document ingestion, vector search, and retrieval-augmented generation for enterprise knowledge bases.

[ 04 ]

Open Notebook — Research Agent

AI-powered research and knowledge synthesis. Deep-dive reports generated automatically.

[ 05 ]

NemoClaw / OpenClaw — Agentic Framework

Autonomous agents for complex, multi-step enterprise workflows.

[ 06 ]

Microsoft Presidio — PII Anonymisation

Automatic detection and redaction of sensitive data before it ever reaches a model.

[ 07 ]

SGLang — Inference Engine

A high-performance engine for running open-weight models locally — pure OSS, no NVIDIA AI Enterprise tax.

[ 08 ]

LLM Machines — Integration Layer

The connective tissue that turns these projects into a single, deployable, production-ready appliance. The signed manifest, the tier model, the support boundary, the runbook.

03 — FAQ

Architecture questions.

Details security, platform and developer teams usually ask before approving an on-prem AI deployment.

Can existing OpenAI API clients use LLM Machines?

Yes. The gateway exposes OpenAI-compatible endpoints so applications can point to the appliance instead of a public cloud API, while keeping authentication, logging and routing local.

Can the appliance run air-gapped?

Yes. Security-sensitive deployments can use offline license activation and local model registries so core inference, RAG and application surfaces work without public internet access.

Which models can run locally?

The architecture is designed for open-weight model families such as Llama, Mistral and Qwen, with model choice sized to your hardware, latency and quality requirements.

Where are connector credentials stored?

Connector credentials live in the on-box vault inside your environment. MCP servers and integration services use those credentials locally instead of sending them to our infrastructure.

What does Pure Mode do?

Pure Mode disables T2/T3 custom components and keeps the certified T1 core running. It is useful for incident response, support diagnosis and audit preparation.

What is inside the support boundary?

The signed certified core, tier model, manifest, gateway, inference services and documented T1/T2 components are supported. Client-built T3 extensions remain isolated from the SLA.

What's next

Ready to dig deeper?

See how the technology lands inside your environment — onboarding, pricing, or just talk to us.