Hanzo Mugen
Mugen is a PyTorch framework for deep learning research on audio generation. It provides training and inference code for multiple state-of-the-art generative audio models: text-to-music (MusicGen),...
Overview
Mugen is a PyTorch framework for deep learning research on audio generation. It provides training and inference code for multiple state-of-the-art generative audio models: text-to-music (MusicGen), text-to-sound (AudioGen), neural audio codecs (EnCodec), multi-band diffusion decoding, non-autoregressive generation (MAGNeT), audio watermarking (AudioSeal), and joint conditioning (JASCO).
The core architecture is a transformer language model operating over quantized audio tokens from EnCodec, with text conditioning via T5 or CLAP embeddings. Model scales range from ~300M to ~3.3B parameters.
Models Included
| Model | Task | Description |
|---|---|---|
| MusicGen | Text-to-music | Controllable music generation with melody conditioning |
| AudioGen | Text-to-sound | General audio/sound effect generation |
| EnCodec | Audio codec | High-fidelity neural audio tokenizer |
| Multi-Band Diffusion | Decoder | Diffusion-based EnCodec decoder for improved quality |
| MAGNeT | Text-to-music/sound | Non-autoregressive masked generation |
| AudioSeal | Watermarking | Imperceptible audio watermarking |
| MusicGen Style | Style-to-music | Text + style conditioning |
| JASCO | Multi-conditioned | Chords, melodies, drums + text conditioning |
When to use
- Text-conditioned music or sound effect generation
- Training custom audio generation models on proprietary data
- Neural audio compression/tokenization (EnCodec)
- Audio watermarking for generated content
- Research into transformer-based audio generative models
Hard requirements
- Python 3.9+ with PyTorch 2.1
- GPU recommended for training and fast inference
- ffmpeg installed (system or conda)
- xformers < 0.0.23 for memory-efficient attention
Quick reference
| Item | Value |
|---|---|
| Repo | github.com/hanzoai/mugen |
| Branch | main |
| Language | Python |
| Package | audiocraft v1.4.0a2 |
| Install | pip install -e . |
| Train | Hydra-based: dora run |
| Test | make tests |
| Lint | make linter |
| License | MIT (code), CC-BY-NC 4.0 (weights) |
Architecture
Transformer Language Model
The core generative model is an autoregressive transformer operating on EnCodec audio tokens:
Text Prompt: "epic orchestral music with drums"
|
v
┌────────────────────────────────┐
│ Conditioning │
│ T5 / CLAP text embeddings │
│ + optional: chroma, chords, │
│ drums, melody, style │
│ ConditionFuser (cross-attn │
│ or prepend or input_interp) │
└──────────┬─────────────────────┘
|
v
┌────────────────────────────────┐
│ Streaming Transformer LM │
│ Codebook pattern: parallel │
│ n_q codebook streams (8) │
│ card: 1024 (codebook size) │
│ │
│ Configurable scale: │
│ small: dim=512, 8 layers │
│ base: dim=1024, 24 layers │
│ medium: dim=1536, 32 layers │
│ large: dim=2048, 48 layers │
│ │
│ Positional: sin/rope/sin_rope │
│ Attention: causal, streaming │
│ Optional: xformers flash attn │
│ CFG: classifier-free guidance │
└──────────┬─────────────────────┘
|
v
┌────────────────────────────────┐
│ EnCodec Decoder │
│ (or Multi-Band Diffusion) │
│ Audio tokens -> waveform │
│ 16kHz (sound) / 32kHz (music) │
└────────────────────────────────┘Model Scale Configurations
| Scale | dim | Heads | Layers | ~Params |
|---|---|---|---|---|
| small | 512 | 8 | 8 | ~300M |
| base | 1024 | 16 | 24 | ~1.5B |
| medium | 1536 | 24 | 32 | ~2.2B |
| large | 2048 | 32 | 48 | ~3.3B |
EnCodec (Neural Audio Codec)
- SEANet encoder/decoder architecture
- Residual Vector Quantization (RVQ) with configurable codebooks (n_q=8 default)
- 1024 codes per codebook
- Supports 16kHz and 32kHz sample rates
- Streaming capable for real-time applications
One-file quickstart
Music Generation
import torchaudio
from audiocraft.models import MusicGen
model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=8)
# Text-conditioned
wav = model.generate(["epic cinematic orchestral soundtrack"])
# Melody-conditioned
melody, sr = torchaudio.load("melody.wav")
wav = model.generate_with_chroma(
["epic cinematic version"],
melody[None].expand(1, -1, -1),
sr,
)
# Save output
torchaudio.save("output.wav", wav[0].cpu(), model.sample_rate)Sound Effect Generation
from audiocraft.models import AudioGen
model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)
wav = model.generate(["sirens and a humming engine approach and pass"])Audio Compression
from audiocraft.models import EncodecModel
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)
wav = torch.randn(1, 1, 24000) # 1 second at 24kHz
encoded = model.encode(wav)
decoded = model.decode(encoded)Training
Training uses Facebook's Dora experiment manager with Hydra configs:
# Install with training deps
pip install -e '.[dev]'
# Run training (example: MusicGen base at 32kHz)
dora run solver=musicgen/musicgen_base_32khz
# Custom training config
dora run solver=musicgen/musicgen_base_32khz \
dataset.batch_size=8 \
optim.epochs=100 \
transformer_lm.dim=1024 \
transformer_lm.num_layers=24Hydra Config Structure
config/
├── augmentations/default.yaml
├── conditioner/
│ ├── text2music.yaml # T5 text conditioning
│ ├── text2sound.yaml # Text for sound generation
│ ├── chroma2music.yaml # Chromagram conditioning
│ ├── chords2music.yaml # Chord conditioning
│ ├── drums2music.yaml # Drum track conditioning
│ ├── style2music.yaml # Style embedding conditioning
│ └── jasco_chords_drums.yaml
├── dset/audio/ # Dataset configs
├── model/
│ ├── encodec/ # Codec model configs
│ └── lm/
│ ├── default.yaml # Base LM config
│ └── model_scale/ # small/base/medium/large
└── config.yaml # Root configDependencies
torch==2.1.0
torchaudio>=2.0.0,<2.1.2
torchvision==0.16.0
av==11.0.0
einops
flashy>=0.0.1
hydra-core>=1.1
xformers<0.0.23
transformers>=4.31.0
demucs
librosa
soundfile
gradio
encodec
torchdiffeqProject Structure
mugen/
├── audiocraft/
│ ├── __init__.py # v1.4.0a2
│ ├── models/
│ │ ├── musicgen.py # MusicGen model
│ │ ├── audiogen.py # AudioGen model
│ │ ├── encodec.py # EnCodec codec
│ │ ├── multibanddiffusion.py # Multi-band diffusion decoder
│ │ ├── magnet.py # MAGNeT model
│ │ ├── jasco.py # JASCO multi-conditioned
│ │ ├── watermark.py # AudioSeal watermarking
│ │ ├── lm.py # Core transformer LM
│ │ ├── lm_magnet.py # MAGNeT LM variant
│ │ ├── flow_matching.py # Flow matching model
│ │ ├── builders.py # Model factory
│ │ └── loaders.py # Checkpoint loading
│ ├── modules/
│ │ ├── transformer.py # StreamingTransformer
│ │ ├── conditioners.py # Text/audio/chroma conditioners
│ │ ├── codebooks_patterns.py # Parallel/delay codebook patterns
│ │ ├── seanet.py # SEANet encoder/decoder
│ │ ├── conv.py # Causal/streaming convolutions
│ │ ├── rope.py # Rotary positional embeddings
│ │ └── streaming.py # Streaming inference support
│ ├── solvers/
│ │ ├── musicgen.py # MusicGen training solver
│ │ ├── audiogen.py # AudioGen training solver
│ │ ├── compression.py # EnCodec training solver
│ │ ├── diffusion.py # Diffusion training solver
│ │ ├── magnet.py # MAGNeT training solver
│ │ ├── jasco.py # JASCO training solver
│ │ └── watermark.py # Watermark training solver
│ ├── data/ # Dataset loading and processing
│ ├── losses/ # Training losses (balancer, STFT, loudness)
│ ├── metrics/ # FAD, KLD, CLAP, PESQ, ViSQOL
│ ├── adversarial/ # GAN discriminators (MPD, MSD, MSSTFTD)
│ ├── optim/ # Optimizers, schedulers, EMA, FSDP
│ ├── quantization/ # RVQ implementation
│ ├── grids/ # Dora experiment grids
│ ├── utils/ # Utilities, caching, checkpoints
│ └── train.py # Training entry point
├── config/ # Hydra configs
├── assets/ # Example audio files
├── setup.py
├── Makefile
└── requirements.txtRelated Skills
hanzo/hanzo-koe.md- Text-to-dialogue model (speech-specific)hanzo/hanzo-engine.md- Model serving infrastructurehanzo/hanzo-ml.md- Rust ML frameworkhanzo/hanzo-candle.md- Candle audio model support (Whisper, EnCodec)
How is this guide?
Last updated on
Hanzo ML
Hanzo ML is a Rust-based minimalist ML framework forked from HuggingFace Candle, optimized for edge AI, quantization, and multimodal inference. It provides the tensor computation layer for the Hanz...
Hanzo Sho
Sho is a text diffusion engine that generates text using masked diffusion rather than autoregressive token prediction. The core model (Genjo, 8B parameters) uses a Transformer Encoder with bidirect...