Hanzo

Edge

Compute at the edge — on-device AI inference close to your users, offline and private.

Run AI inference at the edge (on the device, in the browser, or on embedded hardware) instead of round-tripping to the cloud. Hanzo Edge executes Zen models and any GGUF model locally, with zero network latency and full data privacy.

On-Device Inference

Edge runs quantized models on hardware you already have: Apple Silicon (Metal), NVIDIA GPUs (CUDA), x86/ARM CPUs, and WebAssembly in the browser. Data never leaves the device, and once a model has been downloaded, inference works completely offline.

# Install
curl -sSL https://edge.hanzo.ai/install.sh | sh

# Run a model (auto-downloads from Hugging Face)
hanzo-edge run --model zenlm/zen3-nano --prompt "Hello!"

| Model | Params | Quantized size | Best for |
| --- | --- | --- | --- |
| zenlm/zen3-nano | 600M | ~400 MB | Embedded, IoT |
| zenlm/zen-eco | 4B | ~2.5 GB | Mobile, tablets |
| zenlm/zen4-mini | 8B | ~5 GB | Desktop, laptop |
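
As a rough illustration of how the table can drive model choice, the sketch below picks the largest listed model whose quantized size fits a memory budget given in MB. The `pick_model` helper is hypothetical and not part of the `hanzo-edge` CLI. The thresholds are the approximate sizes from the table, and they should be treated as lower bounds, since inference needs some memory beyond the weights themselves.

```shell
# pick_model BUDGET_MB
# Hypothetical helper, not part of hanzo-edge. Prints the largest model from
# the table whose approximate quantized size fits in BUDGET_MB megabytes.
pick_model() {
  budget_mb=$1
  if [ "$budget_mb" -ge 5000 ]; then
    echo "zenlm/zen4-mini"    # ~5 GB
  elif [ "$budget_mb" -ge 2500 ]; then
    echo "zenlm/zen-eco"      # ~2.5 GB
  elif [ "$budget_mb" -ge 400 ]; then
    echo "zenlm/zen3-nano"    # ~400 MB
  else
    echo "no listed model fits in ${budget_mb} MB" >&2
    return 1
  fi
}
```

The result can be passed straight to the run command, for example `hanzo-edge run --model "$(pick_model 3000)" --prompt "Hello!"`.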

OpenAI-Compatible Server

Edge ships an OpenAI-compatible server, so existing SDKs and tools work against a local endpoint with no code changes:

# Start the local server
hanzo-edge serve --model zenlm/zen4-mini --port 8080

# In a second terminal, send a streaming chat request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen4-mini",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
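
The request above sets "stream": true. OpenAI-style servers answer such requests with server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. Assuming Edge follows the same framing (the source does not spell this out), a small filter like the hypothetical `sse_payloads` below reduces the stream to one JSON chunk per line.

```shell
# sse_payloads: read an SSE stream on stdin, print one JSON chunk per line.
# Hypothetical helper; assumes the "data: ..." / "data: [DONE]" framing used
# by OpenAI-style streaming responses.
sse_payloads() {
  awk '
    { sub(/\r$/, "") }             # tolerate CRLF line endings
    /^data:/ {
      sub(/^data: */, "")          # drop the SSE field name
      if ($0 == "[DONE]") exit     # end-of-stream sentinel
      print
    }
  '
}
```

To use it, add `-N` to the curl command so output is not buffered, and pipe the response into `sse_payloads`. Each printed line is then a JSON chunk whose generated text, in the OpenAI format, sits under `choices[0].delta.content`.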

Edge serves the same /v1/chat/completions, /v1/completions, and /v1/models routes locally as the cloud LLM Gateway does. Point your client at localhost for private, offline inference, or at the gateway for scale.
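
Because the routes are shared, switching between local and cloud can come down to changing a base URL. The sketch below shows one way to do that in a script. `EDGE_BASE_URL`, `api_url`, and `list_models` are hypothetical names chosen for illustration, and the gateway address is deliberately not filled in because the source does not give it.

```shell
# EDGE_BASE_URL is a hypothetical variable name, not one hanzo-edge defines.
# It defaults to the local server started with: hanzo-edge serve --port 8080
: "${EDGE_BASE_URL:=http://localhost:8080}"

# api_url ROUTE: print the full URL of a /v1 route on the selected backend.
api_url() {
  printf '%s/v1/%s\n' "${EDGE_BASE_URL%/}" "$1"
}

# list_models: query /v1/models on whichever backend is selected.
# Requires a reachable server, so it is defined here but not called.
list_models() {
  curl -sS "$(api_url models)"
}
```

With the default, `api_url chat/completions` prints the localhost URL used in the curl example. Exporting `EDGE_BASE_URL` with your gateway address before running the script redirects the same calls to the cloud.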

Edge vs Cloud

| | Edge | Cloud |
| --- | --- | --- |
| Where | On-device | Hanzo GPU clusters |
| Latency | Zero network latency | Network round-trip |
| Privacy | Data stays local | Sent to the cloud |
| Models | Quantized GGUF | Full precision, any size |

Use Edge when privacy, offline capability, or minimal latency matter; use the cloud when you need full-precision or high-throughput serving.
