Hanzo
Hanzo Skills Reference

Hanzo Kensho

Kensho is a 17B parameter image generation foundation model using a Mixture of Experts (MoE) Diffusion Transformer architecture. It generates high-quality images from text prompts at multiple resol...

Overview

Kensho is a 17B parameter image generation foundation model using a Mixture of Experts (MoE) Diffusion Transformer architecture. It generates high-quality images from text prompts at multiple resolutions, achieving state-of-the-art scores on DPG-Bench (85.89) and GenEval benchmarks.

The architecture combines a DiT (Diffusion Transformer) backbone with a MoE feed-forward network (4 routed experts, 2 activated per token) and uses Llama 3.1-8B-Instruct as a text encoder for deep prompt understanding. Generation uses flow matching schedulers for fast, high-fidelity output.

Zen MoDE Architecture

Kensho implements the Zen MoDE (Mixture of Diverse Experts) methodology applied to image diffusion:

  • MoE gating with softmax scoring and top-k expert selection
  • Load balancing loss for uniform expert utilization across distributed training
  • SwiGLU feed-forward blocks within both joint and single transformer layers
  • Flash Attention for memory-efficient cross-modal attention between image patches and text embeddings

When to use

  • Text-to-image generation at production quality
  • High-resolution image synthesis (up to 1360x768, multiple aspect ratios)
  • Applications requiring deep prompt understanding (complex scenes, text rendering)
  • Research into MoE diffusion architectures

Hard requirements

  1. Python 3.10+ with PyTorch 2.5+
  2. CUDA 12.4+ with Flash Attention installed
  3. GPU VRAM: ~40GB (full model in bfloat16)
  4. Llama 3.1-8B-Instruct access (HuggingFace gated model)

Quick reference

ItemValue
Repogithub.com/hanzoai/kensho
Branchmain
LanguagePython
Parameters17B
ArchitectureMoE Diffusion Transformer
Precisionbfloat16
Installpip install -r requirements.txt && pip install -U flash-attn --no-build-isolation
Inferencepython inference.py --model_type full
LicenseSee repo

Model Variants

VariantInference StepsGuidance ScaleFlow ShiftScheduler
Full505.03.0FlowUniPCMultistep
Dev280.06.0FlashFlowMatchEuler
Fast160.03.0FlashFlowMatchEuler

Supported Resolutions

ResolutionAspect
1024 x 1024Square
768 x 1360Portrait
1360 x 768Landscape
880 x 1168Portrait
1168 x 880Landscape
1248 x 832Landscape
832 x 1248Portrait

Architecture Details

Text Prompt
    |
    v
┌──────────────────────────────┐
│ Llama 3.1-8B-Instruct        │ (frozen text encoder)
│ + CLIP/T5 text encoders      │
└──────────┬───────────────────┘
           |
           v
┌──────────────────────────────┐
│ Kensho DiT (17B params)      │
│                              │
│  PatchEmbed(4ch -> 1024dim)  │
│  + RoPE positional encoding  │
│  + Timestep embedding        │
│                              │
│  Joint Transformer Blocks:   │
│    - Cross-attention (img+txt)│
│    - adaLN modulation        │
│    - MoE FFN (SwiGLU)        │
│      4 experts, top-2 routing│
│      + load balancing loss   │
│                              │
│  Single Transformer Blocks:  │
│    - Self-attention (img)    │
│    - adaLN modulation        │
│    - MoE FFN (SwiGLU)        │
│                              │
│  Output projection           │
└──────────────────────────────┘
           |
           v
     Generated Image

MoE Gate Implementation

The expert routing uses a softmax-based gating mechanism:

  • Input projected through learned weight matrix to num_routed_experts logits
  • Softmax scoring with top-k selection (k=2 by default)
  • Auxiliary load balancing loss (alpha=0.01) for training stability
  • All-gather across distributed ranks for balanced expert utilization

One-file quickstart

import torch
from hi_diffusers import HiDreamImagePipeline, HiDreamImageTransformer2DModel
from hi_diffusers.schedulers.fm_solvers_unipc import FlowUniPCMultistepScheduler
from transformers import LlamaForCausalLM, PreTrainedTokenizerFast

# Load text encoder
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", use_fast=False
)
text_encoder = LlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    output_hidden_states=True,
    output_attentions=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load pipeline
scheduler = FlowUniPCMultistepScheduler(
    num_train_timesteps=1000, shift=3.0, use_dynamic_shifting=False
)
transformer = HiDreamImageTransformer2DModel.from_pretrained(
    "HiDream-ai/HiDream-I1-Full",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
).to("cuda")

pipe = HiDreamImagePipeline.from_pretrained(
    "HiDream-ai/HiDream-I1-Full",
    scheduler=scheduler,
    tokenizer_4=tokenizer,
    text_encoder_4=text_encoder,
    torch_dtype=torch.bfloat16,
).to("cuda", torch.bfloat16)
pipe.transformer = transformer

# Generate
image = pipe(
    "A landscape painting of mountains at sunset",
    height=1024, width=1024,
    guidance_scale=5.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("output.png")

Dependencies

torch>=2.5.1
torchvision>=0.20.1
diffusers>=0.32.1
transformers>=4.47.1
einops>=0.7.0
accelerate>=1.2.1
flash-attn (manual install)

Project Structure

kensho/
├── hi_diffusers/
│   ├── __init__.py
│   ├── models/
│   │   ├── attention.py              # HiDreamAttention + FeedForwardSwiGLU
│   │   ├── attention_processor.py    # Flash Attention processor
│   │   ├── embeddings.py             # PatchEmbed, RoPE, TimestepEmbed
│   │   ├── moe.py                    # MoEGate + MOEFeedForwardSwiGLU
│   │   └── transformers/
│   │       └── transformer_hidream_image.py  # Main DiT model
│   ├── pipelines/
│   │   └── hidream_image/
│   │       ├── pipeline_hidream_image.py     # Inference pipeline
│   │       └── pipeline_output.py
│   └── schedulers/
│       ├── flash_flow_match.py       # FlashFlowMatchEulerDiscreteScheduler
│       └── fm_solvers_unipc.py       # FlowUniPCMultistepScheduler
├── inference.py                      # CLI inference script
├── gradio_demo.py                    # Gradio web UI
├── requirements.txt
└── assets/

Benchmark Results (DPG-Bench)

ModelOverallGlobalEntityAttributeRelationOther
DALL-E 383.5090.9789.6188.3990.5889.83
Flux.1-dev83.7985.8086.7989.9890.0489.90
SD3-Medium84.0887.9091.0188.8380.7088.68
Kensho85.8976.4490.2289.4893.7491.83
  • hanzo/hanzo-ml.md - Rust ML framework for inference optimization
  • hanzo/hanzo-engine.md - Model serving infrastructure
  • hanzo/hanzo-candle.md - Candle tensor operations
  • hanzo/hanzo-sho.md - Text diffusion (complementary modality)

How is this guide?

Last updated on

On this page