When you're testing different ideas, iteration speed is critical. So I built a utility that lets me explore different ML architectures quickly, and I'm open-sourcing it as "mixlab".
The whole point is iteration speed: define a model in JSON, train it on your Mac, ship the same config to a cloud GPU. No code changes between platforms.
When you're testing whether Mamba beats attention at your parameter budget, or whether GQA helps at your context length, you don't want to write PyTorch for each experiment. You want to edit a JSON file, hit enter, and see loss numbers in seconds.
```
local Mac (Metal)                     cloud GPU (CUDA)
mixlab -config my_model.json   ===>   mixlab -config my_model.json
       -train 'data/*.bin'                   -train 'data/*.bin'
```

mixlab compiles JSON configs into a typed Go IR and executes them on GPU through Apple's MLX framework. The same IR runs on Metal (macOS) and CUDA (Linux/Docker) with no translation layer.
Why Go, not Python?
Three reasons that matter in practice:
1. Build speed. `make build` takes 1.6 seconds. The full test suite runs in 5. When you're iterating on a new block type, that loop matters.
2. Built-in profiling. `mixlab -cpuprofile cpu.prof` gives you a flame graph with zero setup. No py-spy, no Nsight, no extra tooling. I used this to diagnose GPU utilization on a remote A40: generated a signed URL, had the RunPod worker upload the profile to GCS, and opened it locally in my browser.
3. Import-based extensibility. Need a custom block type? Create a Go package, import `mixlab/arch`, call `RegisterBlock()`, and you inherit Metal+CUDA backends, the training loop, Muon/AdamW optimizers, safetensors, checkpointing, and profiling. No C++ extensions, no custom build systems. Just a Go import and an `init()` function.
Bonus reason: I just have a personal preference for strongly typed, compiled languages, so that's what I used.
Does it work?
On a Shakespeare character-level benchmark matching nanoGPT's architecture (6-layer, 6-head, d=384, 10.8M params):
| Platform | Best val loss | Tokens/sec |
|---|---|---|
| M1 Max (Metal) | 1.5527 | ~37,900 |
| NVIDIA A40 (CUDA) | 1.5588 | ~91,300 |
| nanoGPT (reference) | 1.4697 | — |
The 0.08 gap comes from byte-level tokenization (256-token vocab vs. nanoGPT's 65-character vocab). A PyTorch numerical-parity test confirms the training pipeline matches to 8 decimal places.
Install
Or build from source with `make build`. Docker images are on Docker Hub for CUDA.