The ARK API is OpenAI-compatible. You can use the /chat/completions, /embeddings and /audio/transcriptions endpoints in the same way you would use OpenAI's endpoints. You can even use OpenAI's client libraries by customising the base_url parameter. The limitations and extensions that distinguish the ARK API from OpenAI's offering are listed below.
Limitations
Model name aliasing
ARK executes inference using open-weight models such as Meta's Llama rather than proprietary models such as OpenAI's GPT-4o. Because the openai library (and possibly others like it) validates model names against a predefined enum, the ARK API configuration supports assigning aliases to model names.
Examples:
- gpt-3.5-turbo → meta-llama/Llama-3.1-8B-Instruct
- gpt-4o → meta-llama/Llama-3.1-70B-Instruct
- text-embedding-ada-002 → BAAI/bge-m3
- whisper-1 → whisper-1
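Conceptually, aliasing is a simple mapping applied before inference. The sketch below mirrors the examples above; the real mapping lives in the ARK API deployment configuration.

```python
# Illustrative alias table; the real mapping is deployment configuration.
MODEL_ALIASES = {
    "gpt-3.5-turbo": "meta-llama/Llama-3.1-8B-Instruct",
    "gpt-4o": "meta-llama/Llama-3.1-70B-Instruct",
    "text-embedding-ada-002": "BAAI/bge-m3",
    "whisper-1": "whisper-1",
}

def resolve_model(name: str) -> str:
    """Map an OpenAI model name to the open-weight model actually served."""
    return MODEL_ALIASES.get(name, name)
```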
Unsupported or not fully supported OpenAI parameters
/chat/completions
- frequency_penalty — penalising repeated tokens is not supported.
- function_call — explicit function calls are unavailable.
- logit_bias — biasing token probabilities is unimplemented.
- logprobs — token log probabilities are unavailable.
- presence_penalty — adjusting the likelihood of introducing new tokens is unavailable.
- response_format — only text output is supported; JSON and other formats are unavailable.
- seed — random seed control for reproducibility is unsupported.
- stop — relies on eos_token_id rather than arbitrary string-based stop sequences.
- temperature — setting temperature=0 yields only approximate, not true, determinism; it is internally set to 0.0001 to prevent numerical issues.
- tools & tool_choice — function calling and tool integration are unimplemented at this level (see Tool Calling for the supported pattern).
- top_p — nucleus sampling is unimplemented.
- user — per-user request tracking is unsupported.
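When porting existing OpenAI code, it can help to drop the parameters listed above before sending a request. The helper below is a hypothetical sketch, not part of the ARK API; note that temperature is kept, since it is partially supported.

```python
# Parameters listed above that ARK's /chat/completions does not (fully) support.
UNSUPPORTED_CHAT_PARAMS = {
    "frequency_penalty", "function_call", "logit_bias", "logprobs",
    "presence_penalty", "response_format", "seed", "stop",
    "tools", "tool_choice", "top_p", "user",
}

def sanitize_chat_payload(payload: dict) -> dict:
    """Drop request parameters ARK ignores (illustrative helper, not part of the API)."""
    return {k: v for k, v in payload.items() if k not in UNSUPPORTED_CHAT_PARAMS}
```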
/embeddings
- dimensions — must stay within the model's predefined limits; arbitrary dimension settings are unsupported.
- encoding_format — only float encoding is supported; base64 encoding is unavailable.
- user — user parameter for request tracking is unsupported.
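An embeddings request is therefore a standard OpenAI-shaped request, as long as it sticks to float encoding and the model's native dimensions. A sketch using only the standard library (URL and key are placeholders):

```python
import json
import urllib.request

# Placeholder values; obtain the real URL and key from the Deployment Team.
payload = {
    "model": "text-embedding-ada-002",  # alias, e.g. for BAAI/bge-m3
    "input": ["first text", "second text"],
    "encoding_format": "float",  # the only supported encoding
}
request = urllib.request.Request(
    "https://ark.example.com/v1/embeddings",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_ARK_API_KEY",
        "Content-Type": "application/json",
    },
)
# Uncomment against a real deployment:
# with urllib.request.urlopen(request) as resp:
#     vectors = [d["embedding"] for d in json.load(resp)["data"]]
```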
/audio/transcriptions
- prompt — custom prompting is currently unsupported.
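A transcription request is a standard multipart upload; the sketch below uses the requests library and prepares the request without sending it. The endpoint, key, and file bytes are placeholders.

```python
import requests

# Placeholder values; obtain the real URL and key from the Deployment Team.
req = requests.Request(
    "POST",
    "https://ark.example.com/v1/audio/transcriptions",
    headers={"Authorization": "Bearer YOUR_ARK_API_KEY"},
    data={"model": "whisper-1"},  # note: no "prompt" field (unsupported)
    files={"file": ("sample.wav", b"\x00" * 16, "audio/wav")},  # dummy bytes
).prepare()
# requests.Session().send(req)  # uncomment with a real deployment and a real file
```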
Extensions
Custom parameters
/chat/completions
- ark_simplified — when streaming, set this to true to disable wrapping every single token in a full JSON object. SSE event payloads will then contain only the token itself. The token-usage JSON and [DONE] marker still arrive at the conclusion of inference.
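Consuming a stream with ark_simplified=true then amounts to concatenating the raw data: payloads. This is a client-side sketch that assumes standard "data: <payload>" SSE framing; the exact shape of the trailing usage JSON is not shown.

```python
def join_simplified_stream(lines):
    """Concatenate token payloads from an ark_simplified SSE stream (sketch).

    Assumes standard "data: <payload>" framing; skips the trailing
    token-usage JSON object and stops at the [DONE] sentinel.
    """
    tokens = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        if payload.startswith("{"):  # trailing token-usage JSON
            continue
        tokens.append(payload)
    return "".join(tokens)
```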
/embeddings
This endpoint currently has no ARK extensions.
Stateful processing
During inference, a rich internal state is built up in GPU memory, representing the current prompt, the message history, and the reasoning done by the model. OpenAI optimises throughput by processing each request on randomly selected GPUs, but in the process most of that state is lost, because only the final assistant reply is kept.
ARK allows users to have a session during which all requests are processed on the same set of GPUs and the full internal state is maintained between requests. Depending on the application, this strategy can enhance both response quality and performance.
To use sessions, enable cookie support in your client. The API will respond with:
set-cookie: ark_session_id=${SESSION_UUID}; Max-Age=86400; Path=/; SameSite=lax
Then sending:
cookie: ark_session_id=${SESSION_UUID}
with subsequent requests reuses the session. Note that inactive sessions are destroyed after a configured timeout, so that GPUs are not blocked indefinitely.
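With the requests library, a requests.Session handles this automatically: it stores the ark_session_id cookie from Set-Cookie and resends it on every subsequent call. The URL and key below are placeholders.

```python
import requests

# Placeholder values; obtain the real URL and key from the Deployment Team.
session = requests.Session()  # persists cookies across requests automatically
session.headers["Authorization"] = "Bearer YOUR_ARK_API_KEY"

# Both calls below would carry the same ark_session_id cookie, so they are
# routed to the same GPUs and the internal state is retained between them.
# Uncomment against a real deployment:
# session.post("https://ark.example.com/v1/chat/completions", json={...})
# session.post("https://ark.example.com/v1/chat/completions", json={...})
```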
Prerequisites
- Obtain the API URL and API key from the Deployment Team.
- Install Python 3 (pre-installed on most Linux distributions).
- Create a working directory, a virtual environment, and install the dependencies used across the examples:
mkdir ark
cd ark
python -m venv .venv
source .venv/bin/activate
pip install openai # all examples
pip install numpy # some examples
pip install requests # some examples