API¶
Salut exposes an OpenAI-compatible HTTP API on port 7258. Any client, library, or tool that speaks the OpenAI protocol works out of the box.
Base URL¶
http://localhost:7258/v1
If you’ve configured an API key in Settings, include it in the Authorization header:
Authorization: Bearer your-api-key
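As a quick sanity check, you can hit any endpoint with the header attached. A minimal Python sketch, assuming the third-party requests package and the placeholder key from above:

import requests

headers = {"Authorization": "Bearer your-api-key"}  # omit entirely if no key is configured
resp = requests.get("http://localhost:7258/v1/models", headers=headers)
resp.raise_for_status()  # a 401 here usually means a missing or wrong key
print(resp.json())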
Endpoints¶
List Models¶
GET /v1/models
Returns all models available in the cluster — both locally loaded models and models available on paired peers.
curl http://localhost:7258/v1/models
{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Qwen3-8B-MLX-4bit",
      "object": "model",
      "owned_by": "local"
    }
  ]
}
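The same endpoint is available through the openai Python client (see Client Libraries below) as models.list(). A short sketch, assuming a local node and no API key:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:7258/v1", api_key="unused")

# Print every model ID the cluster can serve, local or on paired peers.
for model in client.models.list():
    print(model.id)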
Chat Completions¶
POST /v1/chat/completions
The primary inference endpoint. Supports both regular and streaming responses.
Request body:
| Field | Type | Description |
|---|---|---|
| model | string | Model ID (e.g., mlx-community/Qwen3-8B-MLX-4bit) |
| messages | array | Conversation messages with role and content |
| temperature | float | Sampling temperature (0.0–2.0, default 1.0) |
| max_tokens | int | Maximum tokens to generate |
| stream | bool | Enable server-sent events streaming (default false) |
| top_p | float | Nucleus sampling threshold (default 1.0) |
| repetition_penalty | float | Repetition penalty (default 1.0) |
Example — regular response:
curl http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-8B-MLX-4bit",
    "messages": [{"role": "user", "content": "Explain distributed inference in one sentence."}],
    "max_tokens": 100
  }'
Example — streaming:
curl http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-8B-MLX-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
Streaming returns text/event-stream with one JSON chunk per data: line, followed by data: [DONE].
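The sampling fields from the table above can all be sent in one request body. A sketch using Python's requests package; the parameter values are illustrative, not tuned recommendations:

import requests

payload = {
    "model": "mlx-community/Qwen3-8B-MLX-4bit",
    "messages": [{"role": "user", "content": "Write a haiku about clusters."}],
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "max_tokens": 64,
}

resp = requests.post("http://localhost:7258/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])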
Cluster Info¶
GET /v1/cluster
Returns the current cluster state, including connected peers, their health, VRAM, and loaded models.
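The OpenAI client libraries have no helper for this endpoint, so call it directly. A minimal Python sketch using requests:

import requests

resp = requests.get("http://localhost:7258/v1/cluster")
resp.raise_for_status()
print(resp.json())  # peers, health, VRAM, loaded models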
Client Libraries¶
Python (openai)¶
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:7258/v1",
    api_key="unused",  # or your configured key
)

response = client.chat.completions.create(
    model="mlx-community/Qwen3-8B-MLX-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
TypeScript/JavaScript¶
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:7258/v1",
  apiKey: "unused",
});

const response = await client.chat.completions.create({
  model: "mlx-community/Qwen3-8B-MLX-4bit",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(response.choices[0].message.content);
curl¶
# One-shot
curl -s http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-8B-MLX-4bit","messages":[{"role":"user","content":"Hi"}]}' \
  | python -m json.tool

# Streaming
curl -N http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-8B-MLX-4bit","messages":[{"role":"user","content":"Hi"}],"stream":true}'
Integration Tips¶
IDE Extensions¶
Most AI coding extensions support custom OpenAI-compatible endpoints. Point them at http://localhost:7258/v1 with any API key value.
Popular extensions that work with Salut:
Continue — set the provider to “openai” with your local base URL
Cody — configure a custom OpenAI endpoint
Cursor — use the OpenAI-compatible API option
Multiple Nodes¶
You can query any node in your cluster — it doesn’t have to be the one that has the model loaded. If node A receives a request for a model loaded on node B, it proxies the request automatically.
For distributed models (sharded across multiple peers), any participating peer can serve as the coordinator. Send the request to whichever node is most convenient.
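In practice this just means changing the base URL. A sketch with the openai Python client, where node-b.local is a placeholder hostname for whichever peer you choose:

from openai import OpenAI

# Point the client at any reachable node; if the model lives elsewhere,
# Salut proxies the request to the right peer automatically.
client = OpenAI(base_url="http://node-b.local:7258/v1", api_key="unused")

response = client.chat.completions.create(
    model="mlx-community/Qwen3-8B-MLX-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)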