Get started on ARK Inference. One API. Four proven stacks.

Ship a private knowledge workspace, an always-on personal agent, an enterprise-secure assistant, or a self-improving autonomous agent — all on ARK's OpenAI v1-compatible inference layer. Point, run, done. No wrappers, no lock-in, no data leaving the region.

OpenAI v1 compatible: drop-in API
EU-hosted: zero retention by default
99% best-effort availability
−98.9% tokens/call in stateful mode

Start with the problem, not the plumbing.

Each stack below is optimized for a different kind of AI workload. Pick the one that matches the outcome you're trying to ship — the inference layer is shared.


Private knowledge assistant

Ingest your documents, your wiki, your contracts. Ask questions, get citations, never send a byte to a public model endpoint.

Best fit: AnythingLLM

Always-on personal & team agent

An assistant that lives in the channels your team already uses — WhatsApp, Slack, Telegram, Discord — and remembers context across sessions.

Best fit: OpenClaw or Hermes Agent

Enterprise-secure autonomous agent

Sandboxed execution, policy-based guardrails, on-premises deployment. For regulated environments that need autonomy without losing control.

Best fit: NVIDIA NemoClaw

Four best-in-class open-source frameworks. One sovereign inference layer.

Each card is a full, production-grade stack you can deploy today. Click through for what it is, what it's best at, and how to wire it to ARK in under five minutes.

Four stacks, one sovereign runtime.

Every stack below speaks OpenAI v1 and points at the same ARK gateway — so knowledge workspaces, personal agents, sandboxed autonomous agents, and self-improving agents all share the same inference, the same audit trail, the same region.

One base URL. Every stack.

All four stacks speak OpenAI v1. Set your base URL to https://api.ark-labs.cloud/api/v1, drop in your ARK API key, and every chat, agent, RAG query, and tool call runs inside the EU, under your compliance regime, with zero retention by default.

Running ARK Core or Tailored on your own hardware? Same config — just swap the hostname for your internal gateway.

Read the API docs
.env — works for all four stacks
# Point any OpenAI-compatible client at ARK:
OPENAI_BASE_URL="https://api.ark-labs.cloud/api/v1"
OPENAI_API_KEY="ark_sk_live_...redacted..."

# Pick any frontier open-source model:
MODEL="llama-3.3-70b-instruct"
# or: deepseek-r1, qwen2.5-72b, mistral-nemo-instruct-2407,
#     gpt-oss-120b, ...

# That's it. Every stack below reads these three vars.
# Works with any stack on this page: AnythingLLM, OpenClaw,
# Hermes Agent, NVIDIA NemoClaw.
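If you want to hit the gateway directly rather than through one of the stacks, the same three variables are all you need to assemble an OpenAI-v1 chat-completions request. A minimal sketch using only the Python standard library; the `build_chat_request` helper is illustrative, not part of any ARK SDK:

```python
import json
import os

# The three variables every stack reads (values from the .env above).
BASE_URL = os.environ.get("OPENAI_BASE_URL", "https://api.ark-labs.cloud/api/v1")
API_KEY = os.environ.get("OPENAI_API_KEY", "ark_sk_live_...redacted...")
MODEL = os.environ.get("MODEL", "llama-3.3-70b-instruct")

def build_chat_request(prompt: str) -> tuple[str, dict, bytes]:
    """Assemble an OpenAI-v1 chat-completions request: URL, headers, JSON body."""
    url = f"{BASE_URL}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, headers, body

url, headers, body = build_chat_request("Summarize this contract clause.")
```

Pass `url`, `headers`, and `body` to any HTTP client (`urllib.request`, `httpx`, etc.); any official OpenAI SDK does the same thing for you once `OPENAI_BASE_URL` is set.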

Built for the way agents actually run.

Agents and RAG loops hammer the inference layer with short, repetitive calls. ARK is engineered for exactly that traffic shape — and for the compliance constraints that come with running agents in regulated environments.

1

OpenAI v1 & Anthropic compatible

Every stack on this page works unchanged. Swap one URL, keep your SDK, keep your prompts.

2

EU data residency, zero retention

Every token stays in the region. Nothing logged, nothing retained, nothing used for training by default.

3

Stateful inference — agents stop paying for context twice

Session-level KV-cache persistence cuts tokens per call by 98.9% and TTFT by 87% for multi-step reasoning loops.

4

99% fault tolerance

Dynamic shard redundancy means the platform keeps serving even when hardware fails. Agents don't care about your GPU weather.

5

10+ curated frontier open models

Llama, Qwen, DeepSeek, Mistral, GPT-OSS, and more — benchmark-selected, always-on, served through the same API.

6

Elastic hot-scaling, heterogeneous GPU

Add or remove GPUs at runtime, mix vendors, mix generations. No model reloads, no restarts, no cluster downtime.

Common questions

Do I need to fork or modify these stacks to run them on ARK?

No. All four speak OpenAI v1 natively. You set OPENAI_BASE_URL to https://api.ark-labs.cloud/api/v1 and pass your ARK API key — that's the integration. Every tool call, embedding request, and chat completion flows through ARK.

Which ARK product do I deploy these on?

For most builders: ARK Cloud — managed, EU-hosted, free credits to start. If you need data sovereignty on your own hardware, run the same stacks against ARK Tailored or ARK Core. Same API surface, same SDK integrations, different deployment topology.

Can I mix and match — e.g. AnythingLLM for RAG and Hermes for agents?

Yes. Because they all hit the same ARK gateway with the same API key, multiple stacks can share a single inference backend, a single audit trail, and a single credit pool. Use AnythingLLM as the knowledge layer and Hermes as the autonomous operator — both authenticated, both EU-resident, both governed by the same policies.
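In practice the shared-gateway pattern is just every stack reading the same two variables. A sketch with an illustrative config shape (the stack and model names come from this page; the dict layout is not any stack's actual config format):

```python
import os

# One gateway, one key, shared by every stack.
ARK_BASE_URL = os.environ.get("OPENAI_BASE_URL", "https://api.ark-labs.cloud/api/v1")
ARK_API_KEY = os.environ.get("OPENAI_API_KEY", "ark_sk_live_...redacted...")

stacks = {
    # Knowledge layer: RAG over your documents.
    "anythingllm": {"base_url": ARK_BASE_URL, "api_key": ARK_API_KEY,
                    "model": "llama-3.3-70b-instruct"},
    # Autonomous operator: multi-step agent loops.
    "hermes": {"base_url": ARK_BASE_URL, "api_key": ARK_API_KEY,
               "model": "deepseek-r1"},
}

# Same endpoint and key, so both stacks share one audit trail
# and one credit pool.
assert stacks["anythingllm"]["base_url"] == stacks["hermes"]["base_url"]
```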

Why is stateful inference a big deal for agents specifically?

Agent loops re-send the same conversation context on every step. ARK's session-level KV-cache persistence means only new tokens are re-computed — a 98.9% reduction in tokens per call and an 87% reduction in time-to-first-token. For customers running on their own hardware, that translates directly into higher session density on the same GPUs.
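A back-of-the-envelope sketch of why that compounds, using made-up numbers for an agent loop (a fixed system prompt plus a history that grows every step):

```python
# Hypothetical agent loop: 1,000-token system prompt, 200 new tokens per step.
SYSTEM_TOKENS = 1_000
STEP_TOKENS = 200
STEPS = 20

# Stateless: every step re-sends (and re-computes) the full context so far.
stateless = 0
context = SYSTEM_TOKENS
for _ in range(STEPS):
    context += STEP_TOKENS
    stateless += context

# Stateful (persistent KV-cache): only the new tokens are processed each step.
stateful = STEP_TOKENS * STEPS

savings = 1 - stateful / stateless
print(f"stateless: {stateless:,} tokens, stateful: {stateful:,} tokens")
print(f"savings: {savings:.1%}")
```

With these toy numbers the loop saves roughly 93.5% of processed tokens; longer prompts, tool outputs, and deeper loops push the ratio toward the quoted 98.9%.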

What about models? Are the right ones available?

ARK serves 10+ curated frontier open-source models — Llama 3.3, Qwen 2.5, DeepSeek R1/V3, Mistral-Nemo, GPT-OSS, and embedding + guardrail models — all benchmark-selected. New releases are onboarded based on customer demand. See the model library.

Pick your stack. Ship in an afternoon.

Get free credits on ARK Cloud and wire your first solution into production — or talk to us about dedicated endpoints, custom models, and self-hosted deployment.