Hanzo Jin - Visual Self-Supervised Learning Framework
Jin is a research-stage visual self-supervised learning framework implementing Joint Embedding Predictive Architectures (JEPA).
Overview
Jin is written in Python with PyTorch. It implements I-JEPA, Energy I-JEPA, Saccade JEPA (a novel variant), and Self-Distillation MAE.
NOTE: Jin is vision-only. The "multimodal" roadmap (text + audio + 3D) exists in grant proposals but is not yet implemented in code. The current codebase covers image-patch JEPA training only: there is no inference API and there are no published weights.
Why Jin?
- JEPA architecture: Self-supervised visual representation learning
- Novel Saccade JEPA: Inspired by mammalian saccadic eye movement
- Self-Distillation MAE: Masked autoencoder with DINO-style centering
- Energy I-JEPA: Alternative using Hopfield-based energy attention
- Research tool: Benchmarking, visualization, attention maps
OSS Base
Jin originated from LumenPallidium/jepa and was adopted under the Zen LM org. Repo: github.com/hanzoai/jin (redirects to zenlm/jin). Package name: jin-tac.
When to use
- Research into visual self-supervised learning
- Training JEPA models on image datasets (ImageNet)
- Experimenting with novel JEPA variants
- Benchmarking visual representation quality
- NOT for production multimodal inference (not yet implemented)
Hard requirements
- Python 3.8+ with PyTorch
- GPU recommended for training (ImageNet-scale)
Quick reference
| Item | Value |
|---|---|
| Language | Python (PyTorch) |
| Package | jin-tac |
| Repo | github.com/hanzoai/jin (→ zenlm/jin) |
| Train | python jepa/train.py |
| Config | config/training.yml |
Model Variants (Implemented)
| Model | Class | Backbone | Description |
|---|---|---|---|
| I-JEPA | ViTJepa | ViT (Transformer) | Paper implementation (arXiv:2301.08243) |
| Energy I-JEPA | EnergyIJepa | Energy Transformer | Hopfield-based energy attention |
| Saccade JEPA | SaccadeJepa | ConvNeXT tiny | Novel variant inspired by mammalian saccadic eye movement |
| Self-Distillation MAE | SelfDistillMAE | ViT + cross-attention | Masked autoencoder with DINO centering |
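The `type` strings accepted in config/training.yml appear to line up one-to-one with the classes in the table above. The sketch below makes that lookup explicit. `MODEL_CLASS_NAMES` and `resolve_model_class_name` are hypothetical helpers written for illustration, and the pairing is an assumption based on the names alone; how `jepa/train.py` actually selects a model is not shown here.

```python
# Hypothetical lookup from the config `type` string to the class name in the
# table above. The pairing is inferred from naming, not taken from Jin's code.
MODEL_CLASS_NAMES = {
    "vit_jepa": "ViTJepa",
    "energy_ijepa": "EnergyIJepa",
    "saccade_jepa": "SaccadeJepa",
    "self_distill_mae": "SelfDistillMAE",
}


def resolve_model_class_name(model_type: str) -> str:
    """Return the class name for a config `type`, failing loudly on typos."""
    try:
        return MODEL_CLASS_NAMES[model_type]
    except KeyError:
        known = ", ".join(sorted(MODEL_CLASS_NAMES))
        raise ValueError(
            f"Unknown model type {model_type!r}; expected one of: {known}"
        ) from None


print(resolve_model_class_name("saccade_jepa"))  # SaccadeJepa
```

Failing with a `ValueError` that lists the known types makes a misspelled `type` in the config visible before any training starts.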
Architecture
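Two pieces of this architecture can be sketched in plain Python: the EMA link from the context encoder to the target encoder, and the MSE term of the loss. Plain lists stand in for tensors, the momentum of 0.996 is the `ema_momentum` value from config/training.yml, and `ema_update` and `mse` are hypothetical names. In a PyTorch implementation the target embeddings would be computed without gradients (the "target_stopped" in the diagram).

```python
from typing import List


def ema_update(target: List[float], context: List[float],
               momentum: float = 0.996) -> List[float]:
    """Move each target weight a small step toward the matching context weight."""
    return [momentum * t + (1.0 - momentum) * c for t, c in zip(target, context)]


def mse(predicted: List[float], target: List[float]) -> float:
    """Mean squared error between predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Toy "weights": the target encoder lags far behind the context encoder.
target_w = [0.0, 0.0]
context_w = [1.0, -1.0]
target_w = ema_update(target_w, context_w)
print(target_w)  # each entry moves 0.4% of the way toward the context weight

# Toy embeddings: only the second coordinate is mispredicted.
print(mse([1.0, 2.0], [1.0, 0.0]))  # 2.0
```

With a momentum this close to 1, the target encoder changes slowly, which gives the predictor a stable regression target.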
┌───────────────────────────────────────────────────────────┐
│                   Jin JEPA Architecture                   │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  Image ──▶ Patcher ──▶ Patch Embeddings                   │
│                    ┌──────────┐                           │
│  Context patches ──│ Context  │                           │
│                    │ Encoder  │── EMA ──▶ Target Encoder  │
│                    └────┬─────┘                           │
│                         │                                 │
│                    ┌────┴─────┐                           │
│                    │Predictor │                           │
│                    └────┬─────┘                           │
│                         │                                 │
│             Predict target embeddings                     │
│              from context embeddings                      │
│                                                           │
│  Loss: MSE(predicted, target_stopped)                     │
│      + VICReg (variance + covariance)                     │
│      + Cycle consistency (Saccade only)                   │
└───────────────────────────────────────────────────────────┘
One-file quickstart
# Training (the only supported mode)
import yaml

from jepa.train import train
from jepa.jepa import ViTJepa

# Load config
with open("config/training.yml") as f:
    config = yaml.safe_load(f)

# Create model
model = ViTJepa(
    image_size=224,
    patch_size=16,
    embed_dim=768,
    depth=12,
    num_heads=12,
)

# Train on ImageNet
train(model, config)
Training Configuration (config/training.yml)
model:
  type: "vit_jepa"  # or "saccade_jepa", "energy_ijepa", "self_distill_mae"
  image_size: 224
  patch_size: 16
  embed_dim: 768
  depth: 12
  num_heads: 12
training:
  dataset: "imagenet"
  batch_size: 128
  gradient_accumulation: 128
  learning_rate: 1.5e-4
  warmup_epochs: 40
  total_epochs: 300
  weight_decay: 0.05
  ema_momentum: 0.996  # Target encoder EMA
  schedule:
    type: "cosine"
    min_lr: 1.0e-6  # decimal point needed: PyYAML reads a bare 1e-6 as a string
Saccade JEPA (Novel Variant)
from jepa.jepa import SaccadeJepa

model = SaccadeJepa(
    image_size=224,
    patch_size=16,
    embed_dim=768,
    # Uses ConvNeXT tiny backbone
    # NeRF-like positional encoding of rotation/translation affine transforms
    # VICReg loss + cycle consistency
)

# Cycle consistency: a forward saccade prediction followed by the backward one must reconstruct the original
# Mimics the saccadic eye movements of the mammalian visual system
Evaluation
| Method | Purpose |
|---|---|
| Linear probes | Evaluate frozen representations |
| KNN (k-nearest neighbors) | Non-parametric evaluation |
| Correlation dimension | Representation geometry |
| UMAP visualization | Embedding space visualization |
| Attention map dashboard | Interactive Dash visualization |
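As an illustration of the KNN row above, here is a minimal standard-library sketch of k-nearest-neighbor evaluation on embeddings that have already been extracted from a frozen encoder. The dependency list suggests the repository uses sklearn for this, so this is a stand-in, not Jin's code. `knn_predict`, `knn_accuracy`, the Euclidean metric, `k=3`, and the toy 2-D "embeddings" are all assumptions.

```python
import math
from collections import Counter
from typing import Sequence


def knn_predict(train_x: Sequence[Sequence[float]], train_y: Sequence[int],
                query: Sequence[float], k: int = 3) -> int:
    """Label a query embedding by majority vote among its k nearest neighbors."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def knn_accuracy(train_x, train_y, test_x, test_y, k: int = 3) -> float:
    """Fraction of test embeddings whose predicted label matches the true one."""
    correct = sum(
        knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y)
    )
    return correct / len(test_y)


# Toy 2-D embeddings forming two well-separated classes.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [[0.05, 0.05], [0.95, 1.0]]
test_y = [0, 1]

print(knn_accuracy(train_x, train_y, test_x, test_y))  # 1.0
```

Because no parameters are trained, the score depends only on the geometry of the frozen embeddings, which is why KNN complements linear probes.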
Dependencies
- torch, torchvision: core framework
- einops: tensor operations
- pyyaml: config parsing
- tqdm, numpy, matplotlib: utilities
- sklearn (optional): KNN evaluation
- umap-learn (optional): visualization
- energy_transformer (optional): for Energy I-JEPA variant
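Since several of these dependencies are optional, it can help to check what is installed before picking a variant or evaluation method. The sketch below uses only the standard library. `missing_modules` is a hypothetical helper, and the names are import names rather than package names: pyyaml is imported as `yaml`, umap-learn as `umap`, and `energy_transformer` is assumed to import under its listed name.

```python
import importlib.util
from typing import Iterable, List

# Import names for the dependencies listed above.
REQUIRED = ["torch", "torchvision", "einops", "yaml", "tqdm", "numpy", "matplotlib"]
OPTIONAL = ["sklearn", "umap", "energy_transformer"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing required:", missing_modules(REQUIRED))
print("missing optional:", missing_modules(OPTIONAL))
```

`importlib.util.find_spec` only locates a module, so the check stays fast even for heavy packages such as torch.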
Project Structure
jin/
├── jepa/
│ ├── jepa.py # Core models (ViTJepa, SaccadeJepa, EnergyIJepa)
│ ├── masked_autoencoder.py # Self-Distillation MAE
│ ├── train.py # Training loop
│ ├── patcher.py # Image patch embedding (Conv, Hybrid, Conv3d)
│ ├── saccade.py # Saccade cropper (NeRF positional encoding)
│ └── vicreg.py # VICReg loss terms
├── config/
│ └── training.yml # Training configuration
├── papers/ # Research papers and grant proposals
├── pyproject.toml
└── LLM.md
Roadmap (Aspirational, NOT in code)
Grant proposals describe future multimodal expansion:
- Text encoder (not implemented)
- Audio encoder (not implemented)
- Diffusion Transformer MoE (not implemented)
- Cross-modal alignment (not implemented)
Related Skills
- hanzo/hanzo-engine.md - Inference engine (future: serve trained Jin models)
- hanzo/hanzo-candle.md - Rust ML framework
- hanzo/zenlm.md - Text-only Zen models
- hanzo/hanzo-gym.md - Training infrastructure