Mojo

Neutron Mojo is an ML inference library targeting Mojo 1.0. It provides tensor operations, quantization formats, neural network layers, and an inference serving pipeline — all with SIMD-accelerated kernels.

Status: Implementation complete for pre-1.0 Mojo syntax. Awaiting Mojo 1.0 compiler release (expected H1 2026) for testing and migration.

Tensor Operations

Type-safe tensors with compile-time dimension checking:

from neutron.tensor import Tensor, Dim, matmul, softmax, rmsnorm, silu

# Typed dimensions (compile-time shape safety)
alias Batch = Dim[0]
alias Seq = Dim[1]
alias Hidden = Dim[2]

# Core operations
let output = matmul(weights, input)      # Matrix multiply
let probs = softmax(logits, axis=-1)     # Softmax
let normed = rmsnorm(x, weight, 1e-5)    # RMS normalization
let activated = silu(x)                   # SiLU activation
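For reference, the math behind rmsnorm(x, weight, eps) can be sketched in plain Python (illustrative only; the library's kernel operates on SIMD vectors, not lists):

```python
import math

def rmsnorm(x, weight, eps=1e-5):
    # RMS normalization: x / sqrt(mean(x^2) + eps), scaled elementwise by weight.
    ms = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(ms + eps)
    return [v * scale * w for v, w in zip(x, weight)]
```

With unit weights and eps=0, the output's sum of squares equals the vector length, which is the invariant RMS normalization enforces.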

SIMD Kernels

Hot-path operations use SIMD intrinsics for maximum throughput:

| Function | Description |
|----------|-------------|
| simd_dot(a, b) | Dot product |
| simd_matvec(A, v) | Matrix-vector multiply |
| simd_rmsnorm(x, w, eps) | RMS layer normalization |
| simd_attention_scores(Q, K, scale) | Attention score computation |
| simd_online_softmax_attention(Q, K, V) | Fused attention (FlashAttention-style) |
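The fused attention kernel relies on the online-softmax trick: one pass over the keys, tracking a running max and running denominator so the full score vector is never materialized. A single-query Python sketch of that idea (not the library's SIMD implementation):

```python
import math

def online_softmax_attention(Q, K, V, scale=1.0):
    # m: running max of scores; d: running softmax denominator.
    # Each new score rescales the accumulator by exp(m - m_new).
    m, d = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for k, v in zip(K, V):
        s = scale * sum(qi * ki for qi, ki in zip(Q, k))
        m_new = max(m, s)
        corr = math.exp(m - m_new)   # 0.0 on the first step (m = -inf)
        w = math.exp(s - m_new)
        d = d * corr + w
        acc = [a * corr + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / d for a in acc]
```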

Additional Operations

layernorm(x, weight, bias)    # Layer normalization
gelu(x)                       # GELU activation
swiglu(x, w1, w2, w3)        # SwiGLU (used in LLaMA)
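swiglu's shape follows the LLaMA feed-forward block: a SiLU-gated branch multiplied elementwise with a linear branch, then down-projected. A dense-matrix Python sketch using the same w1/w2/w3 naming (illustrative, not the library's API):

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def swiglu(x, w1, w2, w3):
    # gate = silu(w1 @ x), up = w3 @ x, output = w2 @ (gate * up)
    gate = [silu(sum(wij * xj for wij, xj in zip(row, x))) for row in w1]
    up   = [sum(wij * xj for wij, xj in zip(row, x)) for row in w3]
    h = [g * u for g, u in zip(gate, up)]
    return [sum(wij * hj for wij, hj in zip(row, h)) for row in w2]
```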

Quantization

Support for 8 quantization formats used in production LLM deployment:

| Format | Bits | Block Size | Use Case |
|--------|------|------------|----------|
| Q4_0 | 4 | 32 | Basic 4-bit |
| Q4_1 | 4+min | 32 | 4-bit with offset |
| Q8_0 | 8 | 32 | High-quality 8-bit |
| Q4_K_S | 4 | K-quant | Small K-quant |
| Q4_K_M | 4 | K-quant | Most common (best quality/size) |
| NF4 | 4 | NormalFloat | QLoRA fine-tuning |
| FP8_E4M3 | 8 | — | Training (4 exp, 3 mantissa) |
| FP8_E5M2 | 8 | — | Inference (5 exp, 2 mantissa) |

from neutron.quant import QuantType

let qt = QuantType.Q4_K_M
qt.bits_per_element()  # 4
qt.block_size()        # 256 (K-quant super-block)
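To make the block-quantization idea concrete, here is a simplified Python sketch of the Q4_0 scheme: one floating-point scale per 32-value block, values stored as 4-bit integers in [-8, 7]. The exact rounding in ggml differs slightly (it divides by -8 times the signed max); this version just uses the absolute max:

```python
def q4_0_quantize(block):
    # One scale per 32-element block; quantized values clamp to [-8, 7].
    assert len(block) == 32
    amax = max(abs(v) for v in block)
    d = amax / 7.0 if amax else 1.0
    qs = [max(-8, min(7, round(v / d))) for v in block]
    return d, qs

def q4_0_dequantize(d, qs):
    return [d * q for q in qs]
```

The round-trip error is bounded by half the block scale, which is why per-block scales beat a single global scale.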

Neural Network Layers

Models

Pre-built model architectures:

| Model | Description |
|-------|-------------|
| LLaMA | Meta's LLaMA family |
| Phi | Microsoft Phi series |
| Mistral | Mistral AI models |
| GPT | GPT-2/NeoX variants |

Key Components

from neutron.nn import Attention, KVCache, RoPE, BPETokenizer

# Attention with KV cache
let cache = KVCache(max_seq_len=4096, n_heads=32, head_dim=128)
let output = Attention(Q, K, V, cache)

# Rotary position embeddings
let Q_rot, K_rot = RoPE(Q, K, position)

# Tokenization
let tokenizer = BPETokenizer.load("tokenizer.json")
let tokens = tokenizer.encode("Hello, world!")
let text = tokenizer.decode(tokens)
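RoPE rotates consecutive feature pairs by a position-dependent angle, so relative positions fall out of dot products between rotated Q and K. A single-vector Python sketch of the rotation (base=10000 follows the original RoPE paper; the library's RoPE(Q, K, position) applies the same rotation to both Q and K):

```python
import math

def rope(x, position, base=10000.0):
    # Rotate each (x[i], x[i+1]) pair by angle position * base^(-i/d).
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Position 0 is the identity, and rotations preserve vector norms, so RoPE never changes attention-score magnitudes on its own.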

Inference Pipeline

End-to-end text generation from quantized models:

from neutron.nn import Q4Model, q4_pipeline_generate, PipelineConfig

let model = Q4Model.load("model.gguf")
let tokenizer = BPETokenizer.load("tokenizer.json")

let config = PipelineConfig(
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
    chat_template="llama",  # or "chatml"
)

let response = q4_pipeline_generate(model, tokenizer, "What is Rust?", config)
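The chat_template option selects how the raw prompt is wrapped before tokenization. A simplified Python sketch of the two formats named above (real LLaMA-2 templating also handles system prompts and BOS tokens; this is illustrative only):

```python
def apply_chat_template(prompt, template="llama"):
    # "llama" follows the LLaMA-2 [INST] convention; "chatml" follows ChatML.
    if template == "llama":
        return f"[INST] {prompt} [/INST]"
    if template == "chatml":
        return ("<|im_start|>user\n" + prompt + "<|im_end|>\n"
                "<|im_start|>assistant\n")
    raise ValueError(f"unknown template: {template}")
```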

Pipeline Steps

  1. Apply chat template (LLaMA, ChatML, etc.)
  2. Encode prompt with BPE tokenizer
  3. Create KV cache (quantized to Q8 for memory efficiency)
  4. Prefill all prompt tokens
  5. Autoregressive decode with sampling (temperature, top-p, repetition penalty)
  6. Decode output tokens back to text
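The sampling in step 5 can be sketched in Python: temperature-scaled softmax, then nucleus (top-p) truncation, keeping the smallest set of tokens whose probability mass reaches top_p. Illustrative only; the rng parameter is a test hook, not part of the library's API:

```python
import math, random

def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random.random):
    # Softmax with temperature (max-subtracted for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Keep tokens until cumulative mass >= top_p, then sample from that set.
    kept, mass = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        mass += p
        if mass >= top_p:
            break
    r = rng() * mass
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```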

E-Graph Optimizer

Algebraic rewrite engine for compute graph fusion — 30+ rules:

| Category | Examples |
|----------|----------|
| Identity | x + 0 → x, x * 1 → x |
| Idempotence | relu(relu(x)) → relu(x) |
| Involution | transpose(transpose(x)) → x |
| Translation invariance | softmax(x + c) → softmax(x) |
| Operator fusion | gelu(linear(x)) → fused_linear_gelu(x) |
| Reassociation | matmul(A, matmul(B, C)) → matmul(matmul(A, B), C) |

The optimizer represents the compute graph as an e-graph and applies rewrite rules until saturation, i.e. until no rule adds a new equivalence (equality saturation), then extracts the cheapest equivalent graph.
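The individual rules are ordinary term rewrites. A minimal Python sketch of the flavor (a greedy fixpoint rewriter on nested tuples, not an e-graph, which would keep every equivalent form and extract the cheapest):

```python
def simplify(expr):
    # Expressions are tuples like ("add", x, y); leaves are names or numbers.
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(simplify(a) for a in expr[1:])
        op = expr[0]
        if op == "add" and expr[2] == 0:   # identity: x + 0 -> x
            return expr[1]
        if op == "mul" and expr[2] == 1:   # identity: x * 1 -> x
            return expr[1]
        if op == "relu" and isinstance(expr[1], tuple) and expr[1][0] == "relu":
            return expr[1]                 # idempotence: relu(relu(x)) -> relu(x)
    return expr
```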

Serving

Text Protocol

Lightweight stdin/stdout protocol for local inference:

REQUEST
prompt=What is Rust?
max_tokens=256
temperature=0.7

RESPONSE
text=Rust is a systems programming language...
tokens=42
time_ms=1523
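A client for this protocol is a few lines in any language. A Python sketch of request parsing (the verb-then-key=value split matches the example above; treating the value as everything after the first "=" is an assumption here):

```python
def parse_request(text):
    # First non-empty line is the verb; the rest are key=value fields.
    lines = [l for l in text.strip().splitlines() if l]
    verb, fields = lines[0], {}
    for line in lines[1:]:
        key, _, value = line.partition("=")
        fields[key] = value
    return verb, fields
```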

HTTP Server

OpenAI-compatible API:

POST /v1/completions
POST /v1/chat/completions

Supports streaming responses.

Model I/O

| Format | Read | Write | Description |
|--------|------|-------|-------------|
| GGUF | Yes | Yes | llama.cpp quantized models |
| SafeTensors | Yes | Yes | HuggingFace format |
| Checkpoints | Yes | Yes | Training checkpoints |
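GGUF files are easy to sniff: the header starts with the magic bytes "GGUF" followed by little-endian counts. A Python sketch (field names are illustrative; see the GGUF specification for the full header and metadata layout):

```python
import struct

def read_gguf_header(data: bytes):
    # GGUF layout (little-endian): magic b"GGUF", uint32 version,
    # uint64 tensor_count, uint64 metadata_kv_count.
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensors, kv_pairs = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensors, "kv_pairs": kv_pairs}
```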

Project Structure

mojo/
├── tensor/         # Multi-dtype tensor ops, SIMD kernels
├── quant/          # Quantization (NF4, Q4_K, Q8_0, FP8)
├── nn/             # Neural network layers, models, pipelines
├── serve/          # HTTP + text protocol inference server
├── train/          # Training loops, optimizers
├── optim/          # Adam, SGD, AdamW, schedules
├── autograd/       # Automatic differentiation
├── fusion/         # E-graph algebraic optimization
├── data/           # Tokenizers, datasets, data loaders
├── io/             # GGUF, SafeTensors, checkpoints
├── model/          # LLaMA, Phi, Mistral, GPT
├── python/         # Python interop bindings
├── dlpack/         # DLPack tensor exchange
└── cli/            # Inference + benchmark CLI