Hanzo ML
Overview
Hanzo ML is a Rust-based minimalist ML framework forked from HuggingFace Candle, optimized for edge AI, quantization, and multimodal inference. It provides the tensor computation layer for the Hanzo ecosystem, with GPU acceleration via CUDA, Metal (Apple Silicon), and CPU optimizations (MKL, Accelerate).
The repository maintains dual crate namespaces: original candle-* crates for upstream compatibility plus hanzo-* crates as the forward-looking Hanzo-branded API. Both coexist in the workspace and share the same underlying code.
Key Differentiators from Upstream Candle
- Hanzo-branded crates: `hanzo-ml`, `hanzo-nn`, `hanzo-transformers`, `hanzo-datasets` published at v0.9.2-alpha.2
- HANZO_INTEGRATION.md: Integration guide for Hanzo Engine (mistral-rs fork)
- Hanzo-specific kernels: `hanzo-flash-attn`, `hanzo-metal-kernels`, `hanzo-kernels`
- Python bindings: `hanzo-ml-pyo3` for Python interop
- WASM examples: `hanzo-ml-wasm-examples` for browser inference
When to use
- High-performance ML inference in Rust applications
- CUDA/Metal GPU acceleration without Python runtime
- Loading GGUF, safetensors, ONNX models in Rust
- Browser-based ML via WebAssembly
- Integration with Hanzo Engine for model serving
- Building custom inference pipelines with zero-cost abstractions
Hard requirements
- Rust 1.75+
- CUDA Toolkit 12+ (for CUDA backend) or macOS 13+ (for Metal)
- For WASM: `wasm-pack` and a compatible browser
Quick reference
| Item | Value |
|---|---|
| Repo | github.com/hanzoai/ml |
| Branch | main |
| Language | Rust (primary), C++/CUDA/Metal/Python |
| Version | 0.9.2-alpha.2 (hanzo crates), 0.9.2 (candle crates) |
| Build | cargo build --workspace |
| Test | cargo test --workspace |
| License | BSD-3-Clause OR Apache-2.0 |
Workspace Crates
Hanzo-Branded (v0.9.2-alpha.2)
| Crate | Purpose |
|---|---|
| `hanzo-ml` | Core tensor operations, Device, DType, quantization |
| `hanzo-nn` | Neural network layers (Linear, Conv, LayerNorm, Attention) |
| `hanzo-transformers` | Transformer model implementations (90+) |
| `hanzo-datasets` | Dataset loading (MNIST, CIFAR, TinyStories) |
| `hanzo-ml-pyo3` | Python bindings via PyO3 |
| `hanzo-flash-attn` | Flash Attention v2 CUDA kernels |
| `hanzo-metal-kernels` | Custom Metal GPU kernels |
| `hanzo-kernels` | Custom CUDA kernels |
| `hanzo-onnx` | ONNX model evaluation |
| `hanzo-ug` | Universal Graph backend |
| `hanzo-ml-examples` | Example binaries |
| `hanzo-ml-wasm-examples` | Browser WASM examples |
| `hanzo-ml-wasm-tests` | WASM test suite |
Upstream-Compatible (v0.9.2)
| Crate | Purpose |
|---|---|
| `candle-core` | Original tensor core (same code as hanzo-ml) |
| `candle-nn` | Original NN layers |
| `candle-transformers` | Original transformer models |
| `candle-datasets` | Original datasets |
| `candle-examples` | Original examples |
| `candle-pyo3` | Original Python bindings |
GPU Backend Crates (opt-in)
| Crate | Purpose |
|---|---|
| `hanzo-kernels` | Custom CUDA kernels (reduce, cast, affine, etc.) |
| `hanzo-metal-kernels` | Custom Metal kernels (Apple GPU) |
| `hanzo-flash-attn` | Flash Attention v2 (CUDA SM80+) |
Feature Flags
| Feature | Effect |
|---|---|
| `cuda` | NVIDIA GPU via cudarc 0.19.1 |
| `cudnn` | Additional cuDNN kernels |
| `nccl` | Multi-GPU distribution |
| `mkl` | Intel Math Kernel Library |
| `accelerate` | Apple Accelerate framework |
| `metal` | Apple GPU via hanzo-metal-kernels + objc2-metal |
| `ug` | Universal Graph backend |
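Cargo features such as `cuda` and `metal` are resolved at compile time, so a downstream crate that forwards them can branch on which one was enabled. The following is a minimal sketch of that pattern using only the standard library; the `Backend` enum and `compiled_backend` function are hypothetical names for illustration and are not part of hanzo-ml.

```rust
/// Hypothetical enum for illustration; not a hanzo-ml type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cuda,
    Metal,
    Cpu,
}

/// Report which backend feature this crate was compiled with.
/// `cfg!` is evaluated at compile time, so the untaken branches
/// are constant-folded away.
fn compiled_backend() -> Backend {
    if cfg!(feature = "cuda") {
        Backend::Cuda
    } else if cfg!(feature = "metal") {
        Backend::Metal
    } else {
        // No GPU feature enabled: fall back to the CPU.
        Backend::Cpu
    }
}

fn main() {
    println!("compiled backend: {:?}", compiled_backend());
}
```

Built without any feature flags, the function falls through to the CPU branch; building the enclosing crate with `--features cuda` or `--features metal` selects the corresponding variant.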
One-file quickstart
Tensor Operations
    use hanzo_ml::{DType, Device, Tensor};

    fn main() -> hanzo_ml::Result<()> {
        // Use CUDA device 0 when available, otherwise fall back to the CPU.
        let device = Device::cuda_if_available(0)?;

        let a = Tensor::randn(0f32, 1., (2, 3), &device)?;
        let b = Tensor::randn(0f32, 1., (3, 4), &device)?;
        let c = a.matmul(&b)?;
        println!("Shape: {:?}", c.shape()); // [2, 4]

        let _d = a.relu()?;
        // Upstream Candle exposes softmax in the nn crate's `ops` module
        // rather than as a `Tensor` method.
        let _e = hanzo_nn::ops::softmax(&a, 1)?;
        let _f = a.to_dtype(DType::BF16)?;
        Ok(())
    }

Neural Network Training
    use hanzo_ml::{DType, Device, Module, Tensor};
    use hanzo_nn::{linear, AdamW, Optimizer, VarBuilder, VarMap};

    fn main() -> hanzo_ml::Result<()> {
        let device = Device::cuda_if_available(0)?;

        // VarMap owns the trainable variables; VarBuilder creates them by name.
        let varmap = VarMap::new();
        let vb = VarBuilder::from_varmap(&varmap, DType::F32, &device);
        let layer1 = linear(784, 256, vb.pp("layer1"))?;
        let layer2 = linear(256, 10, vb.pp("layer2"))?;

        // Forward pass on a random batch of 32 inputs.
        let input = Tensor::randn(0f32, 1., (32, 784), &device)?;
        let h = layer1.forward(&input)?.relu()?;
        let output = layer2.forward(&h)?;

        // One optimization step against an all-zero target.
        // The `Optimizer` trait must be in scope for `AdamW::new` and `backward_step`.
        let mut opt = AdamW::new(varmap.all_vars(), Default::default())?;
        let target = Tensor::zeros((32, 10), DType::F32, &device)?;
        let loss = hanzo_nn::loss::mse(&output, &target)?;
        opt.backward_step(&loss)?;
        println!("Loss: {}", loss.to_scalar::<f32>()?);
        Ok(())
    }

Load GGUF Model
    use hanzo_ml::quantized::gguf_file;
    use std::fs::File;

    fn main() -> anyhow::Result<()> {
        let mut file = File::open("model.gguf")?;
        // Read the GGUF header and tensor index (not the tensor data itself).
        let model = gguf_file::Content::read(&mut file)?;
        for (name, info) in model.tensor_infos.iter() {
            println!("{}: {:?}", name, info.shape);
        }
        Ok(())
    }

Load safetensors
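    use hanzo_ml::{DType, Device};
    use hanzo_nn::VarBuilder;

    fn main() -> hanzo_ml::Result<()> {
        let device = Device::cuda_if_available(0)?;
        // The file is memory-mapped, so it must not be modified while in use.
        let vb = unsafe {
            VarBuilder::from_mmaped_safetensors(&["model.safetensors"], DType::F32, &device)?
        };
        // The requested shape must match the tensor stored in the file.
        let weight = vb.get((768, 768), "transformer.h.0.attn.c_attn.weight")?;
        println!("{:?}", weight.shape());
        Ok(())
    }

Cargo.toml Setup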
    [dependencies]
    # From crates.io (when published). Use these instead of the git entries below;
    # a dependency key may appear only once per table.
    # hanzo-ml = { version = "0.9.2-alpha.2", features = ["metal"] }
    # hanzo-nn = "0.9.2-alpha.2"
    # hanzo-transformers = "0.9.2-alpha.2"

    # From git (current)
    hanzo-ml = { git = "https://github.com/hanzoai/ml", branch = "main" }
    hanzo-nn = { git = "https://github.com/hanzoai/ml", branch = "main" }
    hanzo-transformers = { git = "https://github.com/hanzoai/ml", branch = "main" }

Integration with Hanzo Engine
Hanzo Engine (mistral-rs fork) uses Hanzo ML as its tensor backend:
    # In engine Cargo.toml
    [dependencies]
    hanzo-ml = { git = "https://github.com/hanzoai/ml", branch = "main" }
    hanzo-nn = { git = "https://github.com/hanzoai/ml", branch = "main" }
    hanzo-transformers = { git = "https://github.com/hanzoai/ml", branch = "main" }

    [features]
    default = ["metal"]
    metal = ["hanzo-ml/metal", "hanzo-nn/metal"]
    cuda = ["hanzo-ml/cuda"]

Quantization Support
| Format | Use Case |
|---|---|
| GGUF/GGML | Universal, llama.cpp compatible |
| AFQ (Affine) | Optimized for Metal/Apple Silicon |
| GPTQ/AWQ | GPU-optimized quantization |
| ISQ | In-situ runtime quantization |
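To make the "affine" idea in the table concrete, here is a generic sketch of 4-bit group-wise affine quantization in plain Rust. It is not the AFQ implementation used by hanzo-ml: the `AffineGroup` struct, the function names, and the unpacked one-code-per-byte storage are all assumptions chosen for readability.

```rust
/// One quantized group: 4-bit codes plus the affine parameters that decode them.
/// Hypothetical layout for illustration; codes are stored unpacked, one per byte.
struct AffineGroup {
    scale: f32,
    min: f32,
    codes: Vec<u8>, // each value is in 0..=15
}

/// Map each value in the group to the nearest of 16 evenly spaced levels.
fn quantize_group(values: &[f32]) -> AffineGroup {
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // 4 bits give 16 levels, so 15 steps span [min, max].
    let scale = if max > min { (max - min) / 15.0 } else { 1.0 };
    let codes = values
        .iter()
        .map(|&v| (((v - min) / scale).round() as u8).min(15))
        .collect();
    AffineGroup { scale, min, codes }
}

/// Reconstruct approximate values: x ≈ min + scale * code.
fn dequantize_group(group: &AffineGroup) -> Vec<f32> {
    group
        .codes
        .iter()
        .map(|&c| group.min + group.scale * c as f32)
        .collect()
}

fn main() {
    // One group of 64 synthetic weights.
    let weights: Vec<f32> = (0..64).map(|i| (i as f32 * 0.37).sin()).collect();
    let group = quantize_group(&weights);
    let restored = dequantize_group(&group);
    let max_err = weights
        .iter()
        .zip(&restored)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!("scale = {}, max abs error = {}", group.scale, max_err);
}
```

Because every value is rounded to the nearest level, the reconstruction error per element is bounded by about half of `scale`; smaller groups tighten the range and therefore the error, at the cost of storing more scale/offset pairs.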
Supported Models (90+ via hanzo-transformers)
| Category | Models |
|---|---|
| LLMs | LLaMA 1/2/3, Falcon, Gemma 1/2, Phi 1-3, Mistral, Mixtral, Mamba/2, StarCoder/2, Qwen3 MoE, Yi, GLM4, DeepSeek v2, SmolLM3, Olmo |
| Vision | DINOv2, ConvMixer, EfficientNet, ResNet, ViT, VGG, YOLO v3/v8, SAM, SegFormer, MobileNet v4, CLIP, SigLIP |
| Audio | Whisper, EnCodec, MetaVoice, Parler-TTS, Mimi, Silero VAD |
| Diffusion | Stable Diffusion 1.5/2.1/XL/3, Flux |
| Multimodal | BLIP, LLaVA, Moondream, PaddleOCR-VL, Pixtral, PaliGemma |
| Quantized | GGUF/GGML format, llama.cpp compatible |
Supported Formats
| Format | Extension | Use Case |
|---|---|---|
| GGUF | .gguf | Quantized models (llama.cpp compatible) |
| safetensors | .safetensors | HuggingFace standard (fast, safe) |
| ONNX | .onnx | Cross-framework interop |
| PyTorch | .bin, .pt | Legacy format |
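A loader front end typically needs to pick a code path from the file it is given. The sketch below maps the extensions from the table to a format and additionally checks the four-byte `GGUF` magic that GGUF files start with. The `ModelFormat` enum and both function names are hypothetical helpers, not hanzo-ml APIs.

```rust
use std::path::Path;

/// Hypothetical enum mirroring the formats table; not a hanzo-ml type.
#[derive(Debug, PartialEq, Eq)]
enum ModelFormat {
    Gguf,
    Safetensors,
    Onnx,
    PyTorch,
}

/// Guess the model format from the file extension (case-insensitive).
fn format_from_path(path: &Path) -> Option<ModelFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "gguf" => Some(ModelFormat::Gguf),
        "safetensors" => Some(ModelFormat::Safetensors),
        "onnx" => Some(ModelFormat::Onnx),
        "bin" | "pt" => Some(ModelFormat::PyTorch),
        _ => None,
    }
}

/// GGUF files begin with the four ASCII bytes "GGUF".
fn has_gguf_magic(header: &[u8]) -> bool {
    header.starts_with(b"GGUF")
}

fn main() {
    println!("{:?}", format_from_path(Path::new("model.safetensors")));
    println!("{:?}", format_from_path(Path::new("weights.PT")));
    println!("{}", has_gguf_magic(b"GGUF\x03\x00\x00\x00"));
}
```

Extensions are only a hint, since `.bin` in particular is used for many unrelated file types; checking magic bytes where a format defines them is the more reliable signal.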
Project Structure
ml/
├── hanzo-ml/ # Core tensor ops (hanzo-branded)
│ ├── src/
│ │ ├── lib.rs
│ │ ├── tensor.rs # Tensor type
│ │ ├── device.rs # CPU/CUDA/Metal device
│ │ ├── dtype.rs # Data types (F16, BF16, F32, etc.)
│ │ ├── backend.rs # Backend trait
│ │ ├── cuda_backend/ # CUDA implementation
│ │ ├── metal_backend/ # Metal implementation
│ │ ├── cpu_backend/ # CPU implementation
│ │ ├── quantized/ # GGUF/GGML quantization
│ │ └── safetensors.rs # safetensors loading
│ ├── benches/ # Performance benchmarks
│ └── tests/ # Unit tests
├── hanzo-nn/ # Neural network layers
├── hanzo-transformers/ # 90+ transformer model implementations
├── hanzo-datasets/ # Dataset loading utilities
├── hanzo-ml-pyo3/ # Python bindings
├── hanzo-flash-attn/ # Flash Attention CUDA kernels
│ └── kernels/ # CUDA kernel source files
├── hanzo-metal-kernels/ # Metal GPU kernels
├── hanzo-kernels/ # Generic CUDA kernels
├── hanzo-onnx/ # ONNX evaluation
├── hanzo-ml-examples/ # Example binaries
├── hanzo-ml-wasm-examples/ # WASM browser examples
├── candle-core/ # Upstream-compatible core (v0.9.2)
├── candle-nn/ # Upstream-compatible NN
├── candle-transformers/ # Upstream-compatible transformers
├── candle-datasets/ # Upstream-compatible datasets
├── candle-examples/ # Upstream-compatible examples
├── candle-book/ # Documentation book
├── tensor-tools/ # CLI tensor manipulation
├── Cargo.toml # Workspace root
├── Makefile
├── HANZO_INTEGRATION.md # Engine integration guide
└── LLM.md

Development Workflow
    # Build entire workspace
    cargo build --workspace

    # Test everything
    cargo test --workspace

    # Build with Metal (Apple Silicon)
    cargo build --workspace --features metal

    # Build with CUDA
    cargo build --workspace --features cuda

    # Run example (LLaMA inference)
    cargo run --release --example llama -- --model meta-llama/Llama-3.2-3B-Instruct

    # Sync from upstream
    git remote add upstream https://github.com/huggingface/candle.git
    git fetch upstream
    git merge upstream/main

Performance Considerations
Apple Silicon (Metal)
- Use `AFQ4` quantization for best throughput
- Enable `--features "metal accelerate"` for CPU fallback ops
- Group size 64 balances speed and accuracy
CUDA
- Use GPTQ or AWQ quantization
- Flash Attention enabled via `hanzo-flash-attn` (SM80+)
- PagedAttention for memory efficiency in long sequences
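The memory benefit of PagedAttention comes from storing each sequence's KV cache in fixed-size blocks that need not be contiguous. The following sketch shows only the bookkeeping behind that idea, under stated assumptions: `BlockPool` and `Sequence` are hypothetical types, and this is not the implementation used by hanzo-ml or Hanzo Engine.

```rust
/// Pool of free physical KV-cache blocks (hypothetical, for illustration).
struct BlockPool {
    free: Vec<usize>,
}

impl BlockPool {
    fn new(num_blocks: usize) -> Self {
        // Reversed so that `pop` hands out block 0 first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    fn allocate(&mut self) -> Option<usize> {
        self.free.pop()
    }
}

/// Per-sequence block table: logical block index -> physical block id.
struct Sequence {
    block_size: usize, // tokens per block; must be non-zero
    blocks: Vec<usize>,
    len: usize,
}

impl Sequence {
    fn new(block_size: usize) -> Self {
        Self { block_size, blocks: Vec::new(), len: 0 }
    }

    /// Reserve a slot for one more token, allocating a new block only when
    /// the current one is full. Returns (physical block, offset), or `None`
    /// if the pool is exhausted.
    fn append_token(&mut self, pool: &mut BlockPool) -> Option<(usize, usize)> {
        if self.len % self.block_size == 0 {
            self.blocks.push(pool.allocate()?);
        }
        let slot = (self.blocks[self.len / self.block_size], self.len % self.block_size);
        self.len += 1;
        Some(slot)
    }

    /// Translate a logical token position into (physical block, offset).
    fn slot(&self, pos: usize) -> Option<(usize, usize)> {
        if pos >= self.len {
            return None;
        }
        Some((self.blocks[pos / self.block_size], pos % self.block_size))
    }
}

fn main() {
    let mut pool = BlockPool::new(4);
    let mut a = Sequence::new(16);
    let mut b = Sequence::new(16);
    for _ in 0..16 {
        a.append_token(&mut pool);
    }
    b.append_token(&mut pool);
    a.append_token(&mut pool);
    // Sequence `a` now spans two non-adjacent physical blocks.
    println!("a blocks: {:?}, b blocks: {:?}", a.blocks, b.blocks);
    println!("a token 16 lives at {:?}", a.slot(16));
}
```

Memory is committed one block at a time as a sequence grows, so a long sequence never needs a single large contiguous allocation, and blocks freed by finished sequences can be reused by others.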
CPU
- GGUF models with appropriate quantization level
- `mkl` feature for Intel platforms
- `accelerate` feature for Apple platforms
Related Skills
- `hanzo/hanzo-candle.md` - Upstream candle documentation (planned fork)
- `hanzo/hanzo-engine.md` - Inference engine using hanzo-ml
- `hanzo/hanzo-ane.md` - Apple Neural Engine (complementary to Metal)
- `hanzo/hanzo-kensho.md` - Image generation model
- `hanzo/hanzo-sho.md` - Text diffusion engine