Network¶
Salut’s clustering controls determine how your machine participates in the peer network. You can be a full participant, a client that offloads work, or completely standalone.
Participation Modes¶
Cluster Mode (Default)¶
Your machine both advertises itself on the network and accepts inference work from paired peers.
| Capability | Setting |
|---|---|
| Advertise | Yes: other peers can discover you |
| Browse | Yes: you discover other peers |
| Accept inbound | Yes: paired peers can send you work |
| Route outbound | Yes: you can distribute work to peers |
This is the default and the right choice for most setups. Every machine contributes its GPU to the shared pool.
Client-Only Mode¶
Your machine discovers and routes to other peers but doesn’t advertise itself or accept inbound work.
| Capability | Setting |
|---|---|
| Advertise | No |
| Browse | Yes |
| Accept inbound | No |
| Route outbound | Yes |
Use this for machines that should act as API clients without contributing their GPU, for example a laptop that sends queries to more powerful desktops.
Enable it in Settings → General by toggling Client-Only mode.
Alternatively, set the environment variable SALUT_CLIENT_ONLY=1.
Solo Mode¶
Your machine operates standalone — no advertising, no browsing, no peer communication.
| Capability | Setting |
|---|---|
| Advertise | No |
| Browse | No |
| Accept inbound | No |
| Route outbound | No |
Toggle clustering off in Settings → General or from the menu bar. All inference then runs locally on your machine.
Alternatively, set the environment variable SALUT_CLUSTER_ENABLED=0.
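The three modes can be summarized as a small lookup from the two documented environment variables to the four capability flags in the tables above. The sketch below is illustrative only: the `Participation` class and `resolve_mode` function are hypothetical names, and it assumes that clustering is on unless `SALUT_CLUSTER_ENABLED` is exactly `0`, that client-only is off unless `SALUT_CLIENT_ONLY` is exactly `1`, and that solo mode takes precedence when both are set. Salut's actual parsing rules may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Participation:
    """The four capabilities listed in the mode tables (hypothetical type)."""
    advertise: bool
    browse: bool
    accept_inbound: bool
    route_outbound: bool


def resolve_mode(env: Optional[Mapping[str, str]] = None) -> Tuple[str, Participation]:
    """Map the documented environment variables onto a mode name and its flags."""
    env = os.environ if env is None else env
    # Assumption: clustering stays enabled unless the variable is exactly "0".
    if env.get("SALUT_CLUSTER_ENABLED", "1") == "0":
        return "solo", Participation(False, False, False, False)
    # Assumption: client-only is enabled only when the variable is exactly "1".
    if env.get("SALUT_CLIENT_ONLY", "0") == "1":
        return "client-only", Participation(False, True, False, True)
    return "cluster", Participation(True, True, True, True)


if __name__ == "__main__":
    mode, flags = resolve_mode()
    print(mode, flags)
```

Running the script with no variables set reports cluster mode, matching the documented default.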
Distributed Inference¶
When a model is too large for a single machine’s GPU memory, Salut distributes it across the cluster automatically.
How It Works¶
A request comes in for a model (e.g., a 70B model).
The coordinator node builds a rendezvous plan — mapping transformer layers to peers based on available VRAM and bandwidth.
Each peer loads its assigned shard (a contiguous range of layers).
During inference, activations flow between peers in sequence: peer 1 processes layers 0–19, sends the result to peer 2 for layers 20–39, and so on.
The final peer returns the output to the coordinator, which sends the response to the client.
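The flow above amounts to a sequential pipeline over contiguous layer ranges. The following toy sketch shows only that control flow: the layers are stand-in arithmetic functions rather than transformer blocks, the "network hop" is an ordinary function call, and `Shard`, `run_pipeline`, and `make_toy_layer` are hypothetical names, not part of Salut.

```python
from typing import Callable, List, Sequence

Activations = List[float]
Layer = Callable[[Activations], Activations]


def make_toy_layer(index: int) -> Layer:
    """Stand-in for a transformer layer: doubles each value and adds the layer index."""
    def layer(x: Activations) -> Activations:
        return [2.0 * v + index for v in x]
    return layer


class Shard:
    """A contiguous range of layers held by one peer."""

    def __init__(self, peer: str, layers: Sequence[Layer]) -> None:
        self.peer = peer
        self.layers = list(layers)

    def forward(self, activations: Activations) -> Activations:
        for layer in self.layers:
            activations = layer(activations)
        return activations


def run_pipeline(shards: Sequence[Shard], inputs: Activations) -> Activations:
    """Pass activations through each shard in order, as the coordinator's plan dictates."""
    activations = list(inputs)
    for shard in shards:
        # In a real cluster this step would be a network transfer to shard.peer.
        activations = shard.forward(activations)
    return activations


MODEL = [make_toy_layer(i) for i in range(6)]
single = run_pipeline([Shard("local", MODEL)], [1.0, 2.0])
split = run_pipeline(
    [Shard("peer-1", MODEL[0:2]), Shard("peer-2", MODEL[2:4]), Shard("peer-3", MODEL[4:6])],
    [1.0, 2.0],
)
print(split == single)
```

Splitting the layer list across three shards gives the same result as running all six layers on one machine, which is the property the sharded plan relies on.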
Layer Assignment¶
Layers are assigned based on each peer’s estimated capacity:
Available VRAM — more memory means more layers
GPU type — Metal (Apple Silicon) performance characteristics
Network bandwidth — peers with faster connections get more layers to minimize activation transfer overhead
The assignment is recalculated whenever the cluster composition changes (peers join, leave, or become unhealthy).
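As a rough illustration of capacity-proportional assignment, the sketch below splits a layer count across peers in proportion to free VRAM alone and returns contiguous ranges. It is a simplification under stated assumptions: it ignores GPU type and bandwidth, the largest-remainder rounding is a choice made for this example, and `assign_layers` is a hypothetical function, not Salut's planner.

```python
from typing import Dict, Tuple


def assign_layers(num_layers: int, free_vram_mib: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {peer: (start, end)} half-open layer ranges, proportional to free VRAM.

    Peers keep the order in which they appear in free_vram_mib.
    """
    total = sum(free_vram_mib.values())
    if num_layers < 0 or total <= 0:
        raise ValueError("need a non-negative layer count and positive total VRAM")

    # Integer arithmetic keeps the split exact: floor share plus a remainder per peer.
    counts = {p: num_layers * v // total for p, v in free_vram_mib.items()}
    remainders = {p: num_layers * v % total for p, v in free_vram_mib.items()}

    # Hand the layers lost to rounding to the peers with the largest remainders.
    leftover = num_layers - sum(counts.values())
    for peer in sorted(free_vram_mib, key=lambda p: remainders[p], reverse=True)[:leftover]:
        counts[peer] += 1

    plan: Dict[str, Tuple[int, int]] = {}
    start = 0
    for peer in free_vram_mib:
        plan[peer] = (start, start + counts[peer])
        start += counts[peer]
    return plan


plan = assign_layers(80, {"studio": 49152, "desktop": 24576, "laptop": 8192})
print(plan)
```

Re-running the function with an updated peer map is the simplest way to model the recalculation that happens when peers join or leave.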
Activation Transport¶
Activations (the intermediate tensors passed between peers) can be transported in two modes:
- Passthrough (default)
Unquantized fp16 activations. Best quality, higher bandwidth usage.
- INT8 Blockwise
Activations are quantized to 8-bit integers using a blockwise scheme (similar to Petals). This roughly halves activation bandwidth relative to fp16, with minimal quality loss.
Configure the transport mode in Settings → General, or set SALUT_QUANT_ACTIVATIONS=1 to enable INT8 blockwise quantization.
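To make the idea of blockwise quantization concrete, here is a minimal pure-Python sketch of one common variant: each block is scaled by its own absolute maximum and rounded to signed 8-bit integers. The block size of 64, the linear absmax scheme, and the function names are assumptions for illustration. The source does not specify Salut's actual scheme beyond "blockwise".

```python
from typing import List, Sequence, Tuple

Block = Tuple[float, List[int]]  # (scale, int8 values)


def quantize_blockwise(values: Sequence[float], block_size: int = 64) -> List[Block]:
    """Quantize values block by block, storing one float scale per block."""
    blocks: List[Block] = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        absmax = max(abs(v) for v in block)
        # An all-zero block gets scale 1.0 to avoid dividing by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        quantized = [max(-127, min(127, round(v / scale))) for v in block]
        blocks.append((scale, quantized))
    return blocks


def dequantize_blockwise(blocks: Sequence[Block]) -> List[float]:
    """Reverse the mapping; the result approximates the original values."""
    return [q * scale for scale, quantized in blocks for q in quantized]


activations = [0.0, 0.5, -1.0, 2.0, 100.0, -50.0, 25.0, 0.1]
packed = quantize_blockwise(activations, block_size=4)
restored = dequantize_blockwise(packed)
print(restored)
```

Because each block carries its own scale, one large outlier only reduces precision within its own block. Each value shrinks from two bytes to one, and the per-block scale adds a small overhead, which is why the saving is close to, but not exactly, one half.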
Cluster Tokens¶
For lab or office setups with many machines, manual pairing is tedious. Cluster tokens provide automatic pairing.
How It Works¶
Choose a shared secret (any string) and set it on all machines.
Set it in Settings → General → Cluster Token or via the SALUT_CLUSTER_TOKEN environment variable.
When peers discover each other, they compare HMAC-SHA256 hashes of their tokens and fingerprints.
If the HMACs match, the peers are automatically paired — no manual acceptance needed.
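The comparison step can be sketched with Python's standard `hmac` module. This is an assumption-laden illustration, not Salut's wire protocol: it assumes the token is the HMAC key and the peer's fingerprint is the message, and the function names `pairing_proof` and `should_auto_pair` are hypothetical. The source does not state which fields are hashed or how they are encoded.

```python
import hashlib
import hmac


def pairing_proof(cluster_token: str, fingerprint: str) -> str:
    """HMAC-SHA256 over a peer fingerprint, keyed by the shared cluster token."""
    return hmac.new(
        cluster_token.encode("utf-8"),
        fingerprint.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


def should_auto_pair(local_token: str, peer_fingerprint: str, peer_proof: str) -> bool:
    """Recompute the proof locally and compare in constant time."""
    expected = pairing_proof(local_token, peer_fingerprint)
    return hmac.compare_digest(expected, peer_proof)


# The peer advertises its fingerprint and proof; the token itself never leaves the machine.
proof = pairing_proof("lab-secret", "peer-fingerprint-example")
print(should_auto_pair("lab-secret", "peer-fingerprint-example", proof))
print(should_auto_pair("other-secret", "peer-fingerprint-example", proof))
```

Using `hmac.compare_digest` instead of `==` avoids leaking how many leading characters matched through timing differences.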
Security Considerations¶
The token itself is never sent over the network — only the HMAC.
Tokens should be treated like passwords. Use a strong, unique value.
Peers with mismatched tokens go through the normal manual pairing flow.
You can use both: cluster token for trusted lab machines, manual pairing for ad-hoc guests.
Tip
A simple way to distribute the token: set it as an environment variable in a shared dotfile, configuration management tool, or launch script.