API¶
Salut exposes an OpenAI-compatible HTTP API on port 7258. Any client, library, or tool that speaks the OpenAI protocol works out of the box.
Base URL¶
http://localhost:7258/v1
If you’ve configured an API key in Settings, include it in the Authorization header:
Authorization: Bearer your-api-key
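As a quick sanity check, you can hit any endpoint with the header attached. A minimal Python sketch, assuming the third-party requests package and the placeholder key from above:

import requests

headers = {"Authorization": "Bearer your-api-key"}  # omit entirely if no key is configured
resp = requests.get("http://localhost:7258/v1/models", headers=headers)
resp.raise_for_status()  # a 401 here usually means a missing or wrong key
print(resp.json())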
Endpoints¶
List Models¶
GET /v1/models
Returns all models available in the cluster — both locally loaded models and models available on paired peers.
curl http://localhost:7258/v1/models
{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Qwen3-8B-MLX-4bit",
      "object": "model",
      "owned_by": "local"
    }
  ]
}
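The same endpoint is available through the openai Python client (see Client Libraries below) as models.list(). A short sketch, assuming a local node and no API key:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:7258/v1", api_key="unused")

# Print every model ID the cluster can serve, local or on paired peers.
for model in client.models.list():
    print(model.id)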
Chat Completions¶
POST /v1/chat/completions
The primary inference endpoint. Supports both regular and streaming responses.
Request body:
| Field | Type | Description |
|---|---|---|
| model | string | Model ID (e.g., mlx-community/Qwen3-8B-MLX-4bit) |
| messages | array | Conversation messages with role and content |
| temperature | float | Sampling temperature (0.0–2.0, default 1.0) |
| max_tokens | int | Maximum tokens to generate |
| stream | bool | Enable server-sent events streaming (default false) |
| top_p | float | Nucleus sampling threshold (default 1.0) |
| repetition_penalty | float | Repetition penalty (default 1.0) |
Example — regular response:
curl http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-8B-MLX-4bit",
    "messages": [{"role": "user", "content": "Explain distributed inference in one sentence."}],
    "max_tokens": 100
  }'
Example — streaming:
curl http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-8B-MLX-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
Streaming returns text/event-stream with one JSON chunk per data: line, followed by data: [DONE].
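The sampling fields from the table above can all be sent in one request body. A sketch using Python's requests package; the parameter values are illustrative, not tuned recommendations:

import requests

payload = {
    "model": "mlx-community/Qwen3-8B-MLX-4bit",
    "messages": [{"role": "user", "content": "Write a haiku about clusters."}],
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "max_tokens": 64,
}

resp = requests.post("http://localhost:7258/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])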
Cluster Info¶
GET /v1/cluster
Returns the current cluster state, including connected peers, their health, VRAM, and loaded models.
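The OpenAI client libraries have no helper for this endpoint, so call it directly. A minimal Python sketch using requests:

import requests

resp = requests.get("http://localhost:7258/v1/cluster")
resp.raise_for_status()
print(resp.json())  # peers, health, VRAM, loaded models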
Client Libraries¶
Python (openai)¶
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:7258/v1",
    api_key="unused",  # or your configured key
)

response = client.chat.completions.create(
    model="mlx-community/Qwen3-8B-MLX-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
TypeScript/JavaScript¶
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:7258/v1",
  apiKey: "unused",
});

const response = await client.chat.completions.create({
  model: "mlx-community/Qwen3-8B-MLX-4bit",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(response.choices[0].message.content);
curl¶
# One-shot
curl -s http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-8B-MLX-4bit","messages":[{"role":"user","content":"Hi"}]}' \
  | python -m json.tool

# Streaming
curl -N http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-8B-MLX-4bit","messages":[{"role":"user","content":"Hi"}],"stream":true}'
Integration Tips¶
IDE Extensions¶
Most AI coding extensions support custom OpenAI-compatible endpoints. Point them at http://localhost:7258/v1 with any API key value.
Popular extensions that work with Salut:
Continue — set the provider to “openai” with your local base URL
Cody — configure a custom OpenAI endpoint
Cursor — use the OpenAI-compatible API option
Multiple Nodes¶
You can query any node in your cluster — it doesn’t have to be the one that has the model loaded. If node A receives a request for a model loaded on node B, it proxies the request automatically.
For distributed models (sharded across multiple peers), any participating peer can serve as the coordinator. Send the request to whichever node is most convenient.
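In practice this just means changing the base URL. A sketch with the openai Python client, where node-b.local is a placeholder hostname for whichever peer you choose:

from openai import OpenAI

# Point the client at any reachable node; if the model lives elsewhere,
# Salut proxies the request to the right peer automatically.
client = OpenAI(base_url="http://node-b.local:7258/v1", api_key="unused")

response = client.chat.completions.create(
    model="mlx-community/Qwen3-8B-MLX-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)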