API

Salut exposes an OpenAI-compatible HTTP API on port 7258. Any client, library, or tool that speaks the OpenAI protocol works out of the box.

Base URL

http://localhost:7258/v1

If you’ve configured an API key in Settings, include it in the Authorization header:

Authorization: Bearer your-api-key
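For example, to list models with the key attached:

curl http://localhost:7258/v1/models \
  -H "Authorization: Bearer your-api-key"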

Endpoints

List Models

GET /v1/models

Returns all models available in the cluster — both locally loaded models and models available on paired peers.

curl http://localhost:7258/v1/models

{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Qwen3-8B-MLX-4bit",
      "object": "model",
      "owned_by": "local"
    }
  ]
}

Chat Completions

POST /v1/chat/completions

The primary inference endpoint. Supports both regular and streaming responses.

Request body:

Field               Type    Description
model               string  Model ID (e.g., mlx-community/Qwen3-8B-MLX-4bit)
messages            array   Conversation messages with role and content
temperature         float   Sampling temperature (0.0–2.0, default 1.0)
max_tokens          int     Maximum tokens to generate
stream              bool    Enable server-sent events streaming (default false)
top_p               float   Nucleus sampling threshold (default 1.0)
repetition_penalty  float   Repetition penalty (default 1.0)

Example — regular response:

curl http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-8B-MLX-4bit",
    "messages": [{"role": "user", "content": "Explain distributed inference in one sentence."}],
    "max_tokens": 100
  }'
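A non-streaming response follows the standard OpenAI chat completion shape. The example below is illustrative only; the ID, token counts, and generated text are placeholders:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "mlx-community/Qwen3-8B-MLX-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Distributed inference splits a model across several machines that work together to produce one output."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 21,
    "total_tokens": 37
  }
}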

Example — streaming:

curl http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-8B-MLX-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Streaming returns text/event-stream with one JSON chunk per data: line, followed by data: [DONE].
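Each chunk follows the OpenAI chat.completion.chunk shape. A sketch of the stream, with illustrative values:

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","model":"mlx-community/Qwen3-8B-MLX-4bit","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","model":"mlx-community/Qwen3-8B-MLX-4bit","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]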

Cluster Info

GET /v1/cluster

Returns the current cluster state, including connected peers, their health, VRAM, and loaded models.
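To inspect it from the command line:

curl -s http://localhost:7258/v1/cluster | python -m json.tool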

Client Libraries

Python (openai)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:7258/v1",
    api_key="unused",  # or your configured key
)

response = client.chat.completions.create(
    model="mlx-community/Qwen3-8B-MLX-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

TypeScript/JavaScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:7258/v1",
  apiKey: "unused",
});

const response = await client.chat.completions.create({
  model: "mlx-community/Qwen3-8B-MLX-4bit",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

curl

# One-shot
curl -s http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-8B-MLX-4bit","messages":[{"role":"user","content":"Hi"}]}' \
  | python -m json.tool

# Streaming
curl -N http://localhost:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-8B-MLX-4bit","messages":[{"role":"user","content":"Hi"}],"stream":true}'

Integration Tips

IDE Extensions

Most AI coding extensions support custom OpenAI-compatible endpoints. Point them at http://localhost:7258/v1 with any API key value.

Popular extensions that work with Salut:

  • Continue — set the provider to “openai” with your local base URL (a config sketch follows this list)

  • Cody — configure a custom OpenAI endpoint

  • Cursor — use the OpenAI-compatible API option
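For Continue, a model entry pointing at Salut might look like the sketch below. This assumes Continue's JSON config format; the exact schema can vary between versions, so check Continue's documentation, and substitute whichever model you actually have loaded:

{
  "models": [
    {
      "title": "Salut (local)",
      "provider": "openai",
      "model": "mlx-community/Qwen3-8B-MLX-4bit",
      "apiBase": "http://localhost:7258/v1",
      "apiKey": "unused"
    }
  ]
}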

Multiple Nodes

You can query any node in your cluster — it doesn’t have to be the one that has the model loaded. If node A receives a request for a model loaded on node B, it proxies the request automatically.

For distributed models (sharded across multiple peers), any participating peer can serve as the coordinator. Send the request to whichever node is most convenient.
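For example, assuming a second node is reachable at node-b.local (a placeholder hostname) and serves the API on the same port, the request is identical; only the host changes:

curl http://node-b.local:7258/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-8B-MLX-4bit","messages":[{"role":"user","content":"Hi"}]}'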