Hanzo Skills Reference

Hanzo Extract

Content Extraction & Sanitization

Overview

Hanzo Extract is a Rust crate for extracting clean text from web pages, PDFs, and Claude Code conversation logs. It is designed to produce LLM-ready content, with optional PII sanitization via hanzo-guard. The crate ships two CLI binaries (extract-web, extract-conversations) and a library API with async trait-based extractors.

Why Hanzo Extract?

  • Web extraction: Fetches pages, strips scripts/nav/footer, extracts main content area
  • PDF extraction: Text extraction from PDF documents via lopdf
  • Conversation export: Turns Claude Code JSONL session logs into training datasets with quality scoring and train/val/test splits
  • Sanitization: PII redaction pipeline via hanzo-guard (feature-gated)
  • Feature flags: Compile only what you need (web, pdf, conversations, sanitize)

Tech Stack

  • Language: Rust (edition 2021)
  • Crate: hanzo-extract v0.1.0
  • Async runtime: Tokio
  • Web: reqwest + scraper (HTML parsing/CSS selectors)
  • PDF: lopdf
  • Conversations: walkdir, sha2, chrono, rand (anonymization + splits)
  • Error handling: thiserror
  • Serialization: serde + serde_json
  • CI: GitHub Actions (ci.yml, pages.yml for docs)

Repo: github.com/hanzoai/extract

When to use

  • Extracting clean text from web pages for LLM context/RAG
  • Extracting text from PDF documents
  • Exporting Claude Code conversations for fine-tuning datasets
  • Building content pipelines that need PII redaction

Quick reference

  • Crate: hanzo-extract
  • Repo: github.com/hanzoai/extract
  • Version: 0.1.0
  • License: MIT OR Apache-2.0
  • Docs: docs.rs/hanzo-extract
  • Default features: web, pdf

Installation

cargo add hanzo-extract
# Cargo.toml - pick one of the following lines
[dependencies]
hanzo-extract = "0.1"                                         # web + pdf (default)
hanzo-extract = { version = "0.1", features = ["full"] }      # everything
hanzo-extract = { version = "0.1", features = ["conversations"] }  # dataset export

Usage

Web Extraction

use hanzo_extract::{WebExtractor, ExtractorConfig, Extractor};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = WebExtractor::new(ExtractorConfig::default());
    let result = extractor.extract("https://example.com").await?;

    println!("Title: {:?}", result.title);
    println!("Text: {}", result.text);
    println!("Words: {}", result.word_count);
    Ok(())
}

PDF Extraction

use hanzo_extract::{PdfExtractor, Extractor};

let extractor = PdfExtractor::default();
let result = extractor.extract("document.pdf").await?;
println!("{}", result.text);

Conversation Export (training datasets)

use hanzo_extract::conversations::ConversationExporter;
use std::path::Path;

let mut exporter = ConversationExporter::new();
// Note: Rust does not expand `~`; pass an absolute path in real code.
exporter.export(
    Path::new("~/.claude/projects"),
    Path::new("./training-data"),
)?;

Output structure:

./training-data/
  conversations_20260313.jsonl   # Full conversation data
  training_20260313.jsonl        # Instruction/response pairs
  splits/
    train_20260313.jsonl         # 80%
    val_20260313.jsonl           # 10%
    test_20260313.jsonl          # 10%
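The 80/10/10 split above can be illustrated with a small sketch. This is a hypothetical helper, not the crate's actual API; it assumes integer division with the remainder going to the test split:

```rust
// Hypothetical illustration of an 80/10/10 train/val/test split.
// Remainder from integer division is assigned to the test split.
fn split_counts(total: usize) -> (usize, usize, usize) {
    let train = total * 80 / 100;
    let val = total * 10 / 100;
    let test = total - train - val;
    (train, val, test)
}

fn main() {
    let (train, val, test) = split_counts(100);
    println!("train={train} val={val} test={test}");
}
```

For 100 conversations this yields 80/10/10; for counts not divisible by 10, the extra examples land in the test split under this convention.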

CLI Binaries

# Web extraction
cargo install hanzo-extract --features web
extract-web https://example.com
extract-web https://example.com --json

# Conversation export
cargo install hanzo-extract --features conversations
extract-conversations --source ~/.claude/projects --output ./conversations

Configuration

use hanzo_extract::ExtractorConfig;

let config = ExtractorConfig::default()
    .with_max_length(200_000)       // Max chars (default: 100,000)
    .with_timeout(60)               // Request timeout secs (default: 30)
    .with_clean_text(true);         // Strip HTML/scripts (default: true)

// With sanitization (requires `sanitize` feature)
let config = config
    .with_sanitize(true)
    .with_redact_pii(true)
    .with_detect_injection(true);

Feature Flags

  • web (default): Web page extraction (reqwest + scraper)
  • pdf (default): PDF text extraction (lopdf)
  • conversations: Claude Code session export
  • sanitize: PII redaction via hanzo-guard
  • full: All features

Key Files

  • src/lib.rs: Crate root, Extractor trait, re-exports
  • src/web.rs: WebExtractor with HTML parsing and content-area detection
  • src/pdf.rs: PdfExtractor for PDF text extraction
  • src/conversations.rs: ConversationExporter with quality scoring and splits
  • src/config.rs: ExtractorConfig with builder pattern
  • src/sanitize.rs: Sanitization pipeline (hanzo-guard integration)
  • src/error.rs: ExtractError enum (thiserror)
  • src/result.rs: ExtractResult struct

Quality Scoring (Conversations)

Conversations are scored 0.0-1.0 based on:

  • Thinking/reasoning presence (+0.2)
  • Tool usage (+0.15)
  • Agentic tools like Task/dispatch (+0.1)
  • Opus/Sonnet model (+0.1/+0.05)
  • Response length (+0.1)

Related Skills

  • hanzo/hanzo-aci.md - Agent Computer Interface (uses extraction for document conversion)
  • hanzo/hanzo-guard.md - LLM I/O sanitization (powers the sanitize feature)
