Hanzo Skills Reference

Hanzo Jin - Visual Self-Supervised Learning Framework

Overview

Jin is a research-stage visual self-supervised learning framework implementing Joint Embedding Predictive Architectures (JEPA). It is written in Python on top of PyTorch and implements I-JEPA, Energy I-JEPA, Saccade JEPA (a novel variant), and Self-Distillation MAE.

NOTE: Jin is vision-only. The "multimodal" roadmap (text + audio + 3D) exists in grant proposals but is not yet implemented in code. The current codebase covers image-patch JEPA training only: there is no inference API and there are no published weights.

Why Jin?

  • JEPA architecture: Self-supervised visual representation learning
  • Novel Saccade JEPA: Inspired by mammalian saccadic eye movement
  • Self-Distillation MAE: Masked autoencoder with DINO-style centering
  • Energy I-JEPA: Alternative using Hopfield-based energy attention
  • Research tool: Benchmarking, visualization, attention maps

OSS Base

Jin originated from LumenPallidium/jepa and was adopted under the Zen LM organization. Repo: github.com/hanzoai/jin (redirects to zenlm/jin). Package name: jin-tac.

When to use

  • Research into visual self-supervised learning
  • Training JEPA models on image datasets (ImageNet)
  • Experimenting with novel JEPA variants
  • Benchmarking visual representation quality
  • NOT for production multimodal inference (not yet implemented)

Hard requirements

  1. Python 3.8+ with PyTorch
  2. GPU recommended for training (ImageNet-scale)

Quick reference

Item     | Value
---------|------------------------------------
Language | Python (PyTorch)
Package  | jin-tac
Repo     | github.com/hanzoai/jin (→ zenlm/jin)
Train    | python jepa/train.py
Config   | config/training.yml

Model Variants (Implemented)

Model                 | Class          | Backbone              | Description
----------------------|----------------|-----------------------|----------------------------------------
I-JEPA                | ViTJepa        | ViT (Transformer)     | Paper implementation (arXiv:2301.08243)
Energy I-JEPA         | EnergyIJepa    | Energy Transformer    | Hopfield-based energy attention
Saccade JEPA          | SaccadeJepa    | ConvNeXT tiny         | Novel: mammalian saccadic eye movement
Self-Distillation MAE | SelfDistillMAE | ViT + cross-attention | Masked autoencoder with DINO centering
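
The "DINO centering" mentioned for the Self-Distillation MAE can be illustrated with a framework-free sketch. It follows the centering idea from the DINO paper (subtract a running mean from the teacher outputs, then sharpen with a low temperature); it is not taken from Jin's masked_autoencoder.py. The function names, the temperatures, and the center momentum are illustrative assumptions.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]

def update_center(center, teacher_batch, momentum=0.9):
    """EMA of the per-dimension batch mean of teacher outputs (assumed momentum)."""
    n = len(teacher_batch)
    batch_mean = [sum(row[d] for row in teacher_batch) / n for d in range(len(center))]
    return [momentum * c + (1.0 - momentum) * m for c, m in zip(center, batch_mean)]

def teacher_probs(teacher_logits, center, temperature=0.04):
    """Center the teacher logits, then sharpen them with a low temperature."""
    centered = [x - c for x, c in zip(teacher_logits, center)]
    return [math.exp(x) for x in log_softmax(centered, temperature)]

def self_distill_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered teacher and the student distribution."""
    p_teacher = teacher_probs(teacher_logits, center, teacher_temp)
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(t * ls for t, ls in zip(p_teacher, log_p_student))

# Toy example: the center starts at zero and drifts toward the batch mean.
center = update_center([0.0, 0.0], [[1.0, 3.0], [3.0, 5.0]])
loss = self_distill_loss([0.0, 0.0], [2.0, 0.0], center=[2.0, 0.0])
```

Centering keeps any single output dimension from dominating the teacher distribution, while the low teacher temperature pushes in the opposite direction and prevents a collapse to uniform outputs.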

Architecture

┌─────────────────────────────────────────┐
│           Jin JEPA Architecture          │
├──────────────────────────────────────────┤
│                                          │
│  Image ──▶ Patcher ──▶ Patch Embeddings │
│                    ┌──────────┐          │
│  Context patches ──│ Context  │          │
│                    │ Encoder  │── EMA ──▶ Target Encoder
│                    └────┬─────┘          │
│                         │                │
│                    ┌────┴─────┐          │
│                    │Predictor │          │
│                    └────┬─────┘          │
│                         │                │
│           Predict target embeddings      │
│           from context embeddings        │
│                                          │
│  Loss: MSE(predicted, target_stopped)    │
│  + VICReg (variance + covariance)        │
│  + Cycle consistency (Saccade only)      │
└──────────────────────────────────────────┘
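
The EMA arrow and the loss terms in the diagram can be sketched on plain Python lists. This is a minimal illustration under stated assumptions, not Jin's training code: the function names are hypothetical, the momentum default mirrors the ema_momentum value from the training config, and the VICReg terms follow the variance hinge and off-diagonal covariance penalty from the VICReg paper with assumed gamma and eps values. In a PyTorch model the same update would run per parameter tensor, and the target embeddings would be detached from the graph.

```python
import math

def ema_update(target_params, context_params, momentum=0.996):
    """Move each target parameter a small step toward the context parameter."""
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]

def mse(predicted, target):
    """Mean squared error; the target is treated as a constant (stop-gradient)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def variance_term(batch, gamma=1.0, eps=1e-4):
    """VICReg variance hinge: penalise dimensions whose std falls below gamma."""
    n, dim = len(batch), len(batch[0])
    total = 0.0
    for d in range(dim):
        column = [row[d] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / dim

def covariance_term(batch):
    """VICReg covariance penalty: squared off-diagonal covariances over dim."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dim)]
    total = 0.0
    for i in range(dim):
        for j in range(dim):
            if i != j:
                cov = sum((row[i] - means[i]) * (row[j] - means[j])
                          for row in batch) / (n - 1)
                total += cov ** 2
    return total / dim

# One EMA step: the target encoder barely moves toward the context encoder.
target = ema_update([1.0, 0.0], [0.0, 1.0])
```

With a momentum of 0.996 the target encoder changes slowly, which gives the predictor a stable regression target between optimizer steps.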

One-file quickstart

# Training (the only supported mode)
import yaml
from jepa.train import train
from jepa.jepa import ViTJepa

# Load config
with open("config/training.yml") as f:
    config = yaml.safe_load(f)

# Create model
model = ViTJepa(
    image_size=224,
    patch_size=16,
    embed_dim=768,
    depth=12,
    num_heads=12,
)

# Train on ImageNet
train(model, config)
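
For orientation, the quickstart settings (image_size=224, patch_size=16) give a 14 x 14 grid of 196 patches. The sketch below shows that arithmetic and one plausible way to split the grid into context and target patches. It is an illustration only: the helper names are hypothetical, and the block scales and aspect ratios are the ranges reported in the I-JEPA paper, not values read from Jin's code.

```python
import math
import random

def patch_grid(image_size, patch_size):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side

def sample_block(grid_side, scale, aspect_ratio, rng):
    """Sample a rectangular block of patch indices covering roughly `scale` of the grid."""
    area = scale * grid_side * grid_side
    height = max(1, min(grid_side, round(math.sqrt(area * aspect_ratio))))
    width = max(1, min(grid_side, round(math.sqrt(area / aspect_ratio))))
    top = rng.randint(0, grid_side - height)
    left = rng.randint(0, grid_side - width)
    return {r * grid_side + c
            for r in range(top, top + height)
            for c in range(left, left + width)}

def split_context_target(grid_side, rng, num_targets=4):
    """Sample target blocks, then a large context block with the targets removed."""
    targets = [sample_block(grid_side, rng.uniform(0.15, 0.2),
                            rng.uniform(0.75, 1.5), rng)
               for _ in range(num_targets)]
    context = sample_block(grid_side, rng.uniform(0.85, 1.0), 1.0, rng)
    for block in targets:
        context -= block  # the context encoder never sees target patches
    return context, targets

side, total = patch_grid(224, 16)
context, targets = split_context_target(side, random.Random(0))
```

Removing the target patches from the context forces the predictor to infer their embeddings from the surrounding patches instead of copying them.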

Training Configuration (config/training.yml)

model:
  type: "vit_jepa"        # or "saccade_jepa", "energy_ijepa", "self_distill_mae"
  image_size: 224
  patch_size: 16
  embed_dim: 768
  depth: 12
  num_heads: 12

training:
  dataset: "imagenet"
  batch_size: 128
  gradient_accumulation: 128
  learning_rate: 1.5e-4
  warmup_epochs: 40
  total_epochs: 300
  weight_decay: 0.05
  ema_momentum: 0.996      # Target encoder EMA

  schedule:
    type: "cosine"
    min_lr: 1e-6
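
To make the schedule values concrete, here is a minimal sketch of a linear warmup followed by cosine decay, using the numbers from the config above. It assumes per-epoch updates and a linear warmup, and it assumes gradient_accumulation counts micro-batches per optimizer step; Jin's train.py may implement either differently. The function name is hypothetical.

```python
import math

def learning_rate(epoch, base_lr=1.5e-4, min_lr=1e-6,
                  warmup_epochs=40, total_epochs=300):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Ramp up linearly and reach base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Effective batch size if each optimizer step accumulates 128 micro-batches of 128.
effective_batch_size = 128 * 128

schedule = [learning_rate(epoch) for epoch in range(300)]
```

Under these assumptions the rate peaks at 1.5e-4 at the end of warmup and decays toward min_lr by the final epoch, and each optimizer step sees 16,384 images.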

Saccade JEPA (Novel Variant)

from jepa.jepa import SaccadeJepa

model = SaccadeJepa(
    image_size=224,
    patch_size=16,
    embed_dim=768,
    # Uses ConvNeXT tiny backbone
    # NeRF-like positional encoding of rotation/translation affine transforms
    # VICReg loss + cycle consistency
)

# Cycle consistency: forward-backward saccade prediction must reconstruct original
# Mimics mammalian visual system's saccadic eye movements
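
The two ideas in the comments above, a NeRF-like encoding of the transform parameters and cycle consistency, can be illustrated geometrically. This sketch works on 2D points rather than embeddings and is not taken from Jin's saccade.py: the function names and the number of frequencies are assumptions. In the model, the forward and backward steps would be learned predictions, so the cycle error acts as a loss rather than being exactly zero.

```python
import math

def positional_encoding(value, num_frequencies=4):
    """NeRF-style encoding: sin and cos of value at frequencies 2^k * pi."""
    encoding = []
    for k in range(num_frequencies):
        freq = (2.0 ** k) * math.pi
        encoding.extend([math.sin(freq * value), math.cos(freq * value)])
    return encoding

def apply_saccade(point, angle, shift):
    """Rotate a 2D point by angle (radians), then translate it by shift."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1])

def invert_saccade(angle, shift):
    """Parameters of the inverse transform: p = R(-angle) q - R(-angle) t."""
    c, s = math.cos(-angle), math.sin(-angle)
    tx, ty = shift
    return -angle, (-(c * tx - s * ty), -(s * tx + c * ty))

def cycle_error(point, angle, shift):
    """Distance between a point and its forward-then-backward reconstruction."""
    moved = apply_saccade(point, angle, shift)
    back_angle, back_shift = invert_saccade(angle, shift)
    restored = apply_saccade(moved, back_angle, back_shift)
    return math.dist(point, restored)

# Encode the rotation and translation parameters of one saccade.
angle, shift = 0.5, (0.1, 0.4)
encoded = [v for p in (angle, *shift) for v in positional_encoding(p)]
error = cycle_error((0.3, -0.2), angle, shift)
```

The exact inverse gives a cycle error near zero; a learned predictor that fails to undo its own saccade would be penalized by the same quantity.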

Evaluation

Method                    | Purpose
--------------------------|--------------------------------
Linear probes             | Evaluate frozen representations
KNN (k-nearest neighbors) | Non-parametric evaluation
Correlation dimension     | Representation geometry
UMAP visualization        | Embedding space visualization
Attention map dashboard   | Interactive Dash visualization
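
As an illustration of the KNN row, the sketch below classifies frozen embeddings by majority vote over the most cosine-similar training embeddings. It is a plain-Python stand-in under stated assumptions: the function names and toy data are hypothetical, and Jin's own evaluation may rely on scikit-learn and a different distance or voting rule.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(query, train_embeddings, train_labels, k=3):
    """Label a query by majority vote among its k most similar training embeddings."""
    ranked = sorted(range(len(train_embeddings)),
                    key=lambda i: cosine_similarity(query, train_embeddings[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(test_embeddings, test_labels, train_embeddings, train_labels, k=3):
    """Fraction of test embeddings whose KNN prediction matches the label."""
    correct = sum(knn_predict(q, train_embeddings, train_labels, k) == y
                  for q, y in zip(test_embeddings, test_labels))
    return correct / len(test_labels)

# Toy embeddings standing in for frozen encoder outputs.
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = ["cat", "cat", "dog", "dog"]
accuracy = knn_accuracy([[0.8, 0.2], [0.2, 0.8]], ["cat", "dog"], train_x, train_y)
```

Because no parameters are trained, the resulting accuracy reflects only how well the frozen embedding space separates the classes.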

Dependencies

  • torch, torchvision — core framework
  • einops — tensor operations
  • pyyaml — config parsing
  • tqdm, numpy, matplotlib — utilities
  • scikit-learn (optional, imported as sklearn) — KNN evaluation
  • umap-learn (optional) — visualization
  • energy_transformer (optional) — for Energy I-JEPA variant
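
Since several dependencies are optional, a quick check of what is importable can save a failed run. The sketch below uses only the standard library. The import names sklearn and umap are the usual ones for scikit-learn and umap-learn; energy_transformer is assumed to be importable under the name listed above, and the helper name is hypothetical.

```python
from importlib.util import find_spec

# Import name -> feature that needs it (purposes taken from the list above).
OPTIONAL_MODULES = {
    "sklearn": "KNN evaluation",
    "umap": "UMAP visualization",
    "energy_transformer": "Energy I-JEPA variant",
}

def missing_optional(modules=OPTIONAL_MODULES):
    """Return the optional modules that cannot be found, with the feature each enables."""
    return {name: purpose
            for name, purpose in modules.items()
            if find_spec(name) is None}

for name, purpose in missing_optional().items():
    print(f"optional module '{name}' not found: {purpose} will be unavailable")
```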

Project Structure

jin/
├── jepa/
│   ├── jepa.py          # Core models (ViTJepa, SaccadeJepa, EnergyIJepa)
│   ├── masked_autoencoder.py  # Self-Distillation MAE
│   ├── train.py         # Training loop
│   ├── patcher.py       # Image patch embedding (Conv, Hybrid, Conv3d)
│   ├── saccade.py       # Saccade cropper (NeRF positional encoding)
│   └── vicreg.py        # VICReg loss terms
├── config/
│   └── training.yml     # Training configuration
├── papers/              # Research papers and grant proposals
├── pyproject.toml
└── LLM.md

Roadmap (Aspirational — NOT in code)

Grant proposals describe future multimodal expansion:

  • Text encoder (not implemented)
  • Audio encoder (not implemented)
  • Diffusion Transformer MoE (not implemented)
  • Cross-modal alignment (not implemented)

Related Skills

  • hanzo/hanzo-engine.md - Inference engine (future: serve trained Jin models)
  • hanzo/hanzo-candle.md - Rust ML framework
  • hanzo/zenlm.md - Text-only Zen models
  • hanzo/hanzo-gym.md - Training infrastructure
