Models¶
Salut uses MLX for inference on Apple Silicon. Models must be in MLX format — typically quantized versions from the mlx-community organization on Hugging Face.
Preload Model¶
Set a model to load automatically when Salut starts. This is useful for machines that always serve the same model.
Enter the full Hugging Face model ID, for example:
- mlx-community/Qwen3-8B-MLX-4bit
- mlx-community/Llama-3.3-70B-Instruct-4bit
- mlx-community/Phi-4-mini-instruct-4bit
The model downloads from Hugging Face on first use and is cached in the ~/.cache/huggingface/ directory. Subsequent loads use the cached version.
The preload model can also be set via the SALUT_PRELOAD_MODEL environment variable.
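If you want to warm the cache ahead of the first load, the huggingface_hub Python package (which manages this cache) can pre-download the weights. A minimal sketch, reusing the first example ID above:

```python
# Pre-download model weights into the shared Hugging Face cache
# (~/.cache/huggingface/) so the first load in Salut skips the download.
from huggingface_hub import snapshot_download

path = snapshot_download("mlx-community/Qwen3-8B-MLX-4bit")
print(f"Cached at: {path}")
```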
Tip
The display name shown in the menu bar is derived from the model ID automatically. For example, mlx-community/Qwen3-8B-MLX-4bit becomes Qwen 3 8B (4-bit).
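The exact naming rules are internal to Salut; as a rough sketch of the kind of transformation involved (a hypothetical helper, not Salut's code):

```python
import re

def display_name(model_id: str) -> str:
    """Illustrative only: derive a friendly name from an MLX model ID."""
    name = model_id.split("/")[-1].replace("-MLX", "")  # "Qwen3-8B-4bit"
    parts = name.split("-")                             # ["Qwen3", "8B", "4bit"]
    if parts and re.fullmatch(r"\d+bit", parts[-1]):
        parts[-1] = f"({parts[-1][:-3]}-bit)"           # "4bit" -> "(4-bit)"
    # Split a trailing version digit off the family name: "Qwen3" -> "Qwen 3"
    parts[0] = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", parts[0], count=1)
    return " ".join(parts)

print(display_name("mlx-community/Qwen3-8B-MLX-4bit"))  # Qwen 3 8B (4-bit)
```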
Custom Models¶
Add model IDs to make them appear in your local model selector alongside the auto-detected ones. This is useful for models hosted on private Hugging Face repositories or custom fine-tunes.
Each model ID should be on its own line in the custom models field.
How Loading Works¶
When a model is requested (either via preload or an API request):

1. Salut checks whether the model is already loaded in memory.
2. If not, it downloads the model weights from Hugging Face, skipping the download if they are already cached.
3. The model is loaded into GPU memory using MLX.
4. The tokenizer is initialized.
5. The model is ready to serve requests.
Only one model is loaded at a time. Loading a new model unloads the previous one to free GPU memory.
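For reference, the open-source mlx-lm Python package walks this same download, load, and tokenize sequence, and is a convenient way to sanity-check a model outside Salut. A minimal sketch (assumes `pip install mlx-lm`; this is not Salut's internal code):

```python
# Downloads from Hugging Face if not cached, loads the weights into
# unified (GPU-visible) memory, and initializes the tokenizer.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-MLX-4bit")

# The model is now ready to serve requests.
print(generate(model, tokenizer, prompt="Hello", max_tokens=32))
```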
Model Sizing¶
The amount of GPU memory (VRAM) a model needs depends on:
- Parameter count — larger models need more memory (7B, 13B, 70B, etc.)
- Quantization — 4-bit models use roughly half the memory of 8-bit models
- Context length — longer contexts use more memory during inference
Rough guidelines for 4-bit quantized models on Apple Silicon:
| Model Size | Approximate VRAM | Example Machines |
|---|---|---|
| 1–4B | 2–4 GB | Any Mac with 8 GB+ |
| 7–8B | 4–6 GB | Any Mac with 16 GB+ |
| 13–14B | 8–10 GB | Mac with 16 GB+ |
| 30–34B | 18–22 GB | Mac with 32 GB+ |
| 70B | 38–42 GB | Mac with 64 GB+, or distributed |
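These figures track weight-only memory (roughly parameter count × bits per weight ÷ 8) plus headroom for the KV cache and runtime overhead. A quick estimate in Python:

```python
def approx_weight_gb(params_billion: float, bits: int) -> float:
    """Weight-only memory estimate; excludes KV cache and runtime overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(approx_weight_gb(8, 4))   # ~4 GB for an 8B model at 4-bit
print(approx_weight_gb(70, 4))  # ~35 GB for a 70B model at 4-bit
```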
Distributed Models¶
When a model is too large for a single machine, Salut can distribute it across paired peers. Each peer handles a subset of the model’s transformer layers (a “shard”).
This happens automatically during the rendezvous process — you don’t need to configure sharding manually. See Clustering for more details.
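As a conceptual sketch only (the layer count and peer count below are hypothetical, and Salut computes the actual split automatically at rendezvous), a contiguous layer split across peers looks like this:

```python
def shard_layers(n_layers: int, n_peers: int) -> list[range]:
    """Illustrative only: assign contiguous transformer-layer ranges to peers."""
    base, extra = divmod(n_layers, n_peers)
    shards, start = [], 0
    for peer in range(n_peers):
        count = base + (1 if peer < extra else 0)  # spread the remainder
        shards.append(range(start, start + count))
        start += count
    return shards

# e.g. an 80-layer 70B model across 3 peers -> layers 0-26, 27-53, 54-79
print(shard_layers(80, 3))
```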