
Hanzo Engine

High-performance LLM inference engine — fast Rust-based serving with Metal/CUDA acceleration, quantization, vision, audio, and MCP tools

Hanzo Engine is a high-performance, multimodal inference engine written in Rust. It serves LLMs, vision models, speech models, image generation models, and embedding models through an OpenAI-compatible HTTP API, with optional MCP (Model Context Protocol) server support.

Key design goals: zero-config model loading, hardware-aware quantization, and production-grade throughput via PagedAttention and continuous batching.

Features

  • Hardware Acceleration: Native Metal (Apple Silicon) and CUDA (NVIDIA) backends, FlashAttention V2/V3, multi-GPU tensor parallelism
  • Quantization: GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB, AFQ (Metal-optimized), and In-Situ Quantization (ISQ) of any HuggingFace model
  • PagedAttention: High-throughput continuous batching on CUDA and Metal with prefix caching and KV cache quantization
  • Multimodal: Text, vision, audio/speech, image generation, and embeddings in a single binary
  • MCP Integration: Built-in MCP server exposing model tools via JSON-RPC, plus MCP client for connecting to external tool servers
  • OpenAI-Compatible API: Drop-in replacement for OpenAI endpoints — /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models
  • Agentic: Integrated tool calling with Python/Rust callbacks, web search, and structured output (JSON schema, regex, grammar)
  • LoRA / X-LoRA: Adapter model support with runtime weight merging
  • AnyMoE: Create mixture-of-experts on any base model
  • Per-Layer Topology: Fine-tune quantization bit depth per layer for optimal quality/speed tradeoffs
  • Web UI: Built-in chat interface at /ui when serving

Architecture

┌───────────────────────────────────────────────────────────────────┐
│                         Hanzo Engine                              │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  Clients                                                          │
│  ───────                                                          │
│  OpenAI SDK ──┐    ┌───────────────────────────────────┐          │
│  curl / HTTP ─┤──▶ │   HTTP Server (axum)              │          │
│  Python SDK ──┤    │   :36900                          │          │
│  Rust SDK ────┤    │   /v1/chat/completions            │          │
│  MCP Client ──┘    │   /v1/completions                 │          │
│                    │   /v1/embeddings                  │          │
│                    │   /v1/models                      │          │
│                    │   /health                         │          │
│                    │   /docs (SwaggerUI)               │          │
│                    └─────────┬─────────────────────────┘          │
│                              │                                    │
│                              ▼                                    │
│                    ┌───────────────────────────────────┐          │
│                    │   Pipeline Engine                 │          │
│                    │   ───────────────                 │          │
│                    │   • Continuous batching           │          │
│                    │   • PagedAttention + prefix cache │          │
│                    │   • ISQ / GGUF / GPTQ / AWQ       │          │
│                    │   • Device mapping (multi-GPU)    │          │
│                    │   • Chat template detection       │          │
│                    └─────────┬─────────────────────────┘          │
│                              │                                    │
│           ┌──────────────────┼─────────────────┐                  │
│           ▼                  ▼                 ▼                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐         │
│  │  Metal (GPU) │  │  CUDA (GPU)  │  │  CPU (fallback)  │         │
│  │  Apple Si    │  │  FlashAttn   │  │  AVX2 / MKL      │         │
│  │  AFQ quants  │  │  Marlin kern │  │  GGUF / HQQ      │         │
│  └──────────────┘  └──────────────┘  └──────────────────┘         │
│                                                                   │
│  ┌──────────────────────────────────────────────────────┐         │
│  │   MCP Server (:4321)                                 │         │
│  │   Exposes model tools via JSON-RPC 2.0               │         │
│  └──────────────────────────────────────────────────────┘         │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

Supported Models

Text Models

Model      Architectures                   Notes
Llama      Llama 2, 3, 3.1, 3.2            GGUF supported
Mistral    Mistral 7B, Mixtral 8x7B/8x22B  MoE support
Qwen       Qwen 2, 3, 3 Next, 3 MoE        Thinking mode
Gemma      Gemma, Gemma 2
Phi        Phi 2, 3, 3.5 MoE
DeepSeek   V2, V3 / R1                     MoE with MoQE
GLM        GLM 4, 4.7 Flash, 4.7 MoE
Starcoder  Starcoder 2                     Code models
SmolLM     SmolLM 3                        Small models
Granite    Granite 4.0
GPT-OSS    GPT-OSS (Harmony format)        reasoning_effort support

Vision Models

Model                     Input Types
Qwen 3-VL, 3-VL MoE       Image + Text
Qwen 2-VL, 2.5-VL         Image + Text
Gemma 3, 3n               Image + Text
Llama 4, 3.2 Vision       Image + Text
Mistral 3                 Image + Text
Phi 4 Multimodal, Phi 3V  Image + Text
MiniCPM-O                 Image + Text
Idefics 2, 3              Image + Text
LLaVA, LLaVA Next         Image + Text

Speech, Image Generation, and Embedding Models

Category          Models
Speech (ASR)      Voxtral, Dia
Image Generation  FLUX
Embeddings        Embedding Gemma, Qwen 3 Embedding

Quick Start

Install

Build from source with the appropriate backend:

# macOS (Metal)
cargo build --package hanzo-engine --release --no-default-features --features metal

# Linux (CUDA)
cargo build --package hanzo-engine --release --features cuda

# Linux (CUDA + FlashAttention)
cargo build --package hanzo-engine --release --features "cuda flash-attn cudnn"

# Install binary
cargo install --path hanzo-engine --no-default-features --features metal

Run a Model

# Interactive chat with auto-detected architecture
hanzo-engine run -m Qwen/Qwen3-4B

# Start HTTP server with web UI
hanzo-engine serve --ui -m google/gemma-3-4b-it

# Start server with ISQ 4-bit quantization
hanzo-engine serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4

# Run a GGUF quantized model
hanzo-engine run --format gguf -f ./model.Q4_K_M.gguf

# Auto-tune for your hardware
hanzo-engine tune -m Qwen/Qwen3-4B --emit-config config.toml
hanzo-engine from-config -f config.toml

# System diagnostics
hanzo-engine doctor

The server listens on port 36900 by default. Visit http://localhost:36900/ui for the built-in chat interface, or http://localhost:36900/docs for interactive Swagger API docs.
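
Scripts that start the server and immediately send requests can race model loading. The sketch below polls the /health endpoint until the server answers; the probe is injected as a parameter so the helper can be exercised without a running server (default URL assumes the documented port).

```python
import time
import urllib.request


def wait_for_ready(url: str = "http://localhost:36900/health",
                   timeout: float = 60.0,
                   probe=None) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` elapses."""
    if probe is None:
        def probe(u):
            # Real GET against the health endpoint
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe(url) == 200:
                return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(0.5)
    return False
```

Call `wait_for_ready()` once after launching `hanzo-engine serve`, then proceed with normal API calls.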

CLI Commands

Command      Purpose
run          Interactive chat mode
serve        Start HTTP + optional MCP server
from-config  Run from a TOML config file
quantize     Generate UQFF quantized model
tune         Auto-benchmark and recommend settings
doctor       System diagnostics (CUDA, Metal, HF connectivity)
login        Authenticate with HuggingFace Hub
cache        Manage downloaded model cache
bench        Run performance benchmarks

API Reference

Hanzo Engine exposes an OpenAI-compatible REST API. Any client that speaks the OpenAI API works out of the box.

Chat Completions

curl http://localhost:36900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain PagedAttention in one paragraph."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Using the Python OpenAI SDK

import openai

client = openai.OpenAI(
    base_url="http://localhost:36900/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about Rust."},
    ],
    max_tokens=128,
)

print(completion.choices[0].message.content)

Streaming

Set "stream": true in the request body. The server returns Server-Sent Events (SSE) compatible with the OpenAI streaming format.

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Count to 10 slowly."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
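
The stream can also be consumed without the SDK by parsing the `data:` lines of the SSE response directly. A minimal sketch of the framing (the sample events below are illustrative, not captured server output):

```python
import json


def iter_sse_content(lines):
    """Yield content deltas from OpenAI-style SSE event lines.

    Each event line looks like `data: {...json...}`; the stream
    terminates with the sentinel `data: [DONE]`.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blanks, comments, keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        content = delta.get("content")
        if content:
            yield content


# Illustrative event stream mirroring the OpenAI chunk shape:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))  # prints "Hello"
```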

Text Completions

curl http://localhost:36900/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "prompt": "The Rust programming language is",
    "max_tokens": 128
  }'

Embeddings

Serve an embedding model to enable this endpoint:

hanzo-engine serve -m google/embeddinggemma-300m
curl http://localhost:36900/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": "Hello, world!"
  }'

Endpoints Summary

Method  Path                  Description
POST    /v1/chat/completions  Chat completions (streaming supported)
POST    /v1/completions       Text completions
POST    /v1/embeddings        Vector embeddings
GET     /v1/models            List loaded models
GET     /health               Server health check
GET     /docs                 Interactive Swagger UI
GET     /ui                   Built-in chat web UI

Extended Parameters

Beyond the standard OpenAI API, Hanzo Engine supports additional request parameters:

Parameter           Type    Description
top_k               int     Top-K sampling
min_p               float   Min-P sampling threshold
grammar             object  Constrained generation (regex, JSON schema, Lark grammar, llguidance)
enable_thinking     bool    Enable chain-of-thought for supported models
truncate_sequence   bool    Truncate over-length prompts instead of rejecting
repetition_penalty  float   Multiplicative penalty for repeated tokens
web_search_options  object  Enable web search integration
reasoning_effort    string  Reasoning depth for Harmony-format models: low, medium, high
dry_multiplier      float   DRY anti-repetition penalty strength
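
These fields ride alongside the standard ones in the request body (with the Python OpenAI SDK they can be forwarded via `extra_body`). The sketch below simply assembles the JSON by hand; all values, and the exact shape of the `grammar` object, are illustrative:

```python
import json

payload = {
    # Standard OpenAI fields
    "model": "default",
    "messages": [{"role": "user", "content": "List three Rust crates."}],
    "max_tokens": 128,
    # Hanzo Engine extensions (values illustrative)
    "top_k": 40,
    "min_p": 0.05,
    "repetition_penalty": 1.1,
    "grammar": {"type": "json_schema", "value": {"type": "object"}},
}

# POST this body to /v1/chat/completions as usual
print(json.dumps(payload, indent=2))
```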

Quantization Guide

Hanzo Engine supports an extensive set of quantization methods for reducing model size and increasing throughput.

In-Situ Quantization (ISQ)

ISQ quantizes model weights as they are loaded, so the full unquantized model never needs to fit in memory at once. Just pass --isq <bits>:

# 4-bit quantization (auto-selects best method for your hardware)
hanzo-engine serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4

# 8-bit quantization
hanzo-engine serve -m Qwen/Qwen3-4B --isq 8

On Metal, ISQ uses AFQ (affine quantization) optimized for Apple Silicon. On CUDA, it uses Q/K quantization for best throughput.

Quantization Methods

Method      Bit Widths           Devices           Notes
ISQ (auto)  2, 3, 4, 5, 6, 8     All               Hardware-aware auto-selection
GGUF        2-8 bit (Q/K types)  CPU, CUDA, Metal  Most portable format
GPTQ        2, 3, 4, 8           CUDA only         Marlin kernel for 4/8-bit
AWQ         4, 8                 CUDA only         Marlin kernel for 4/8-bit
HQQ         4, 8                 All               Half-quadratic quantization
FP8         8                    All               E4M3 floating point
BNB         4, 8                 CUDA              bitsandbytes int8, fp4, nf4
AFQ         2, 3, 4, 6, 8        Metal only        Fastest on Apple Silicon
MLX         Pre-quantized        Metal             Apple MLX format
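
As a rule of thumb, weight memory scales linearly with bits per parameter. A back-of-the-envelope estimate for a hypothetical 3B-parameter model (weights only; activations, KV cache, and quantization metadata add overhead on top):

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB at a given bit width."""
    return n_params * bits_per_param / 8 / 2**30


n = 3e9  # hypothetical 3B-parameter model
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_gib(n, bits):.1f} GiB")
```

At 4 bits the same weights take a quarter of the FP16 footprint, which is why --isq 4 often makes a model fit on consumer GPUs.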

Running GGUF Models

# Direct GGUF file
hanzo-engine run --format gguf -f ./Qwen3-4B-Q4_K_M.gguf

# Auto-detect GPTQ model from HuggingFace
hanzo-engine run -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit

Per-Layer Topology

Fine-tune quantization per layer using a topology file. This lets you keep attention layers at higher precision while aggressively quantizing feed-forward layers:

hanzo-engine serve -m Qwen/Qwen3-4B --topology topology.toml

PagedAttention

PagedAttention accelerates inference and enables high-throughput continuous batching:

  • CUDA: Enabled by default; disable with --no-paged-attn
  • Metal: Opt-in with --paged-attn
  • Prefix caching: Reuses KV cache blocks across requests sharing common prefixes (system prompts)
  • KV cache quantization: FP8 (E4M3) reduces KV cache memory by ~50%

# Custom KV cache memory allocation
hanzo-engine serve -m Qwen/Qwen3-4B --paged-attn --paged-attn-gpu-memory 8GiB

# KV cache quantization for longer contexts
hanzo-engine serve -m Qwen/Qwen3-4B --paged-attn --paged-cache-type f8e4m3
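
The ~50% figure follows directly from the cache layout: every cached token stores one key and one value vector per layer, so halving bytes per element halves the whole cache. A rough sizing sketch (the model dimensions below are hypothetical, chosen to resemble a small dense model with grouped-query attention):

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_elem: int) -> float:
    """Approximate KV cache size: 2 tensors (K and V) per layer."""
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 2**30


# Hypothetical model: 36 layers, 8 KV heads, head_dim 128, 32k context
fp16 = kv_cache_gib(36, 8, 128, 32_768, 1, 2)  # 2 bytes/elem
fp8 = kv_cache_gib(36, 8, 128, 32_768, 1, 1)   # 1 byte/elem (E4M3)
print(f"FP16 KV cache: ~{fp16:.1f} GiB, FP8: ~{fp8:.1f} GiB")
```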

MCP Integration

Hanzo Engine can act as both an MCP server (exposing model tools) and an MCP client (connecting to external tool servers).

MCP Server

Expose model capabilities as MCP tools alongside the HTTP API:

hanzo-engine serve \
  -m Qwen/Qwen3-4B \
  --port 36900 \
  --mcp-port 4321

The MCP server exposes tools based on loaded model modalities (text, vision, etc.) via JSON-RPC 2.0.

MCP Client

Connect to external MCP tool servers so the model can call external tools during inference:

hanzo-engine serve \
  -m Qwen/Qwen3-4B \
  --mcp-client-config mcp-tools.json

Docker

# CUDA
docker run --gpus all -p 36900:36900 ghcr.io/hanzoai/engine:latest \
  serve -m Qwen/Qwen3-4B --port 36900

# With ISQ quantization
docker run --gpus all -p 36900:36900 ghcr.io/hanzoai/engine:latest \
  serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4 --port 36900

Configuration

Environment Variables

Variable                     Description
HANZO_ENGINE_PORT            Server port (default: 36900)
KEEP_ALIVE_INTERVAL          SSE keep-alive interval in ms
OTEL_EXPORTER_OTLP_ENDPOINT  OpenTelemetry export endpoint
OTEL_SERVICE_NAME            Service name for traces
HF_TOKEN                     HuggingFace Hub authentication token

TOML Configuration

For reproducible deployments, use a TOML config:

# Generate optimal config for your hardware
hanzo-engine tune -m Qwen/Qwen3-4B --emit-config config.toml

# Run from config
hanzo-engine from-config -f config.toml

Performance Tips

  1. Use ISQ: --isq 4 gives the best balance of quality and speed on most hardware
  2. Enable PagedAttention: Required for high-throughput batched serving
  3. Run tune: hanzo-engine tune benchmarks your hardware and emits optimal settings
  4. KV cache quantization: --paged-cache-type f8e4m3 halves KV cache memory for longer contexts
  5. Multi-GPU: Use device mapping for models that exceed single-GPU memory
  6. FlashAttention: Add --features flash-attn at build time for CUDA (significant speedup)
