# Mojo

Neutron Mojo is an ML inference library targeting Mojo 1.0. It provides tensor operations, quantization formats, neural network layers, and an inference serving pipeline, all with SIMD-accelerated kernels.

**Status:** Implementation complete for pre-1.0 Mojo syntax. Awaiting the Mojo 1.0 compiler release (expected H1 2026) for testing and migration.
## Tensor Operations
Type-safe tensors with compile-time dimension checking:
```mojo
from neutron.tensor import Tensor, Dim, matmul, softmax, rmsnorm

# Typed dimensions (compile-time shape safety)
alias Batch = Dim[0]
alias Seq = Dim[1]
alias Hidden = Dim[2]

# Core operations
let output = matmul(weights, input)    # Matrix multiply
let probs = softmax(logits, axis=-1)   # Softmax
let normed = rmsnorm(x, weight, 1e-5)  # RMS normalization
let activated = silu(x)                # SiLU activation
```
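The math behind the normalization and activation ops is standard; as a NumPy sketch (illustrative Python, not the Mojo API), `rmsnorm` and `silu` reduce to:

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-5):
    # Divide by the root-mean-square of the last axis, then apply
    # the learned elementwise weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

x = np.array([1.0, -2.0, 3.0])
out = rmsnorm(x, np.ones(3))  # output has (near-)unit RMS
```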
## SIMD Kernels
Hot-path operations use SIMD intrinsics for maximum throughput:
| Function | Description |
|----------|-------------|
| `simd_dot(a, b)` | Dot product |
| `simd_matvec(A, v)` | Matrix-vector multiply |
| `simd_rmsnorm(x, w, eps)` | RMS layer normalization |
| `simd_attention_scores(Q, K, scale)` | Attention score computation |
| `simd_online_softmax_attention(Q, K, V)` | Fused attention (FlashAttention-style) |
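The idea behind the fused kernel can be sketched in Python: stream over the keys while maintaining a running max and normalizer, so the full score vector is never materialized. The function below is an illustrative single-query sketch, not the `simd_*` API:

```python
import numpy as np

def online_softmax_attention(q, K, V, scale):
    # FlashAttention-style streaming softmax: keep a running max (m)
    # and normalizer (l); rescale the accumulator when the max moves.
    m = float("-inf")            # running max of scores
    l = 0.0                      # running softmax normalizer
    acc = np.zeros(V.shape[1])   # running weighted sum of values
    for k, v in zip(K, V):
        s = float(q @ k) * scale
        m_new = max(m, s)
        correction = np.exp(m - m_new)   # exp(-inf) == 0 on first step
        w = np.exp(s - m_new)
        l = l * correction + w
        acc = acc * correction + w * v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 8))
scale = 1.0 / np.sqrt(4)
fused = online_softmax_attention(q, K, V, scale)

# Reference: materialize all scores, softmax, then weight the values.
scores = (K @ q) * scale
w = np.exp(scores - scores.max()); w /= w.sum()
reference = w @ V
```

The streaming version is algebraically identical to the materialized softmax, which is what makes the fusion safe.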
### Additional Operations
```mojo
layernorm(x, weight, bias) # Layer normalization
gelu(x)                    # GELU activation
swiglu(x, w1, w2, w3)      # SwiGLU (used in LLaMA)
```
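For reference, a NumPy sketch of SwiGLU under the assumption that `w1` feeds the gate, `w3` is the up-projection, and `w2` projects back down (the actual weight roles in the library may differ):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu(x, w1, w2, w3):
    # LLaMA-style gated FFN: the gate path silu(x @ w1) elementwise-
    # multiplies the up-projection x @ w3, then w2 projects down.
    return (silu(x @ w1) * (x @ w3)) @ w2

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w1 = rng.standard_normal((4, 8))
w3 = rng.standard_normal((4, 8))
w2 = rng.standard_normal((8, 4))
y = swiglu(x, w1, w2, w3)  # back to the input width
```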
## Quantization
Support for 8 quantization formats used in production LLM deployment:
| Format | Bits | Block Size | Use Case |
|--------|------|------------|----------|
| Q4_0 | 4 | 32 | Basic 4-bit |
| Q4_1 | 4+min | 32 | 4-bit with offset |
| Q8_0 | 8 | 32 | High-quality 8-bit |
| Q4_K_S | 4 | K-quant | Small K-quant |
| Q4_K_M | 4 | K-quant | Most common (best quality/size) |
| NF4 | 4 | NormalFloat | QLoRA fine-tuning |
| FP8_E4M3 | 8 | — | Training (4 exp, 3 mantissa) |
| FP8_E5M2 | 8 | — | Inference (5 exp, 2 mantissa) |
```mojo
from neutron.quant import QuantType

let qt = QuantType.Q4_K_M
qt.bits_per_element() # 4
qt.block_size()       # 32
```
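A simplified Python sketch of symmetric 4-bit block quantization in the spirit of Q4_0 (one fp32 scale per 32-element block; this illustrates the arithmetic, not the library's exact bit layout):

```python
import numpy as np

BLOCK = 32  # elements per quantization block

def q4_quantize(block):
    # One scale per block; values map to 4-bit ints in [0, 15] with an
    # implicit zero-point of 8, i.e. representable levels -8*s .. 7*s.
    scale = float(np.abs(block).max()) / 7.0
    if scale == 0.0:
        return 0.0, np.full(BLOCK, 8, dtype=np.uint8)
    q = np.clip(np.round(block / scale) + 8, 0, 15).astype(np.uint8)
    return scale, q

def q4_dequantize(scale, q):
    return (q.astype(np.float32) - 8) * scale

rng = np.random.default_rng(1)
x = rng.standard_normal(BLOCK).astype(np.float32)
scale, q = q4_quantize(x)
x_hat = q4_dequantize(scale, q)  # round-trip error is at most scale/2
```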
## Neural Network Layers
### Models
Pre-built model architectures:
| Model | Description |
|-------|-------------|
| LLaMA | Meta's LLaMA family |
| Phi | Microsoft Phi series |
| Mistral | Mistral AI models |
| GPT | GPT-2/NeoX variants |
### Key Components
```mojo
from neutron.nn import Attention, KVCache, RoPE, BPETokenizer

# Attention with KV cache
let cache = KVCache(max_seq_len=4096, n_heads=32, head_dim=128)
let output = Attention(Q, K, V, cache)

# Rotary position embeddings
let Q_rot, K_rot = RoPE(Q, K, position)

# Tokenization
let tokenizer = BPETokenizer.load("tokenizer.json")
let tokens = tokenizer.encode("Hello, world!")
let text = tokenizer.decode(tokens)
```
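For reference, the standard RoPE math that a call like `RoPE(Q, K, position)` applies per head, as a NumPy sketch (the helper below is illustrative, not the library function): each consecutive pair of dimensions is rotated by a position-dependent angle.

```python
import numpy as np

def rope(x, position, base=10000.0):
    # Rotate each pair (x[2i], x[2i+1]) by angle position * base^(-2i/d),
    # the standard rotary-embedding formulation.
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)
    theta = position * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
q0 = rope(q, position=0)  # position 0 is the identity rotation
q1 = rope(q, position=1)  # rotations preserve the vector's norm
```

Because only relative angles matter in the Q·K dot product, this encodes relative position directly into attention scores.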
## Inference Pipeline
End-to-end text generation from quantized models:
```mojo
from neutron.nn import Q4Model, q4_pipeline_generate, PipelineConfig

let model = Q4Model.load("model.gguf")
let tokenizer = BPETokenizer.load("tokenizer.json")

let config = PipelineConfig(
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
    chat_template="llama", # or "chatml"
)

let response = q4_pipeline_generate(model, tokenizer, "What is Rust?", config)
```
### Pipeline Steps
- Apply chat template (LLaMA, ChatML, etc.)
- Encode prompt with BPE tokenizer
- Create KV cache (quantized to Q8 for memory efficiency)
- Prefill all prompt tokens
- Autoregressive decode with sampling (temperature, top-p, repetition penalty)
- Decode output tokens back to text
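The sampling step of the decode loop can be sketched in Python (a hypothetical helper, not the pipeline's API): temperature scaling followed by nucleus (top-p) truncation, then a draw from the renormalized distribution.

```python
import numpy as np

def sample_token(logits, temperature=0.7, top_p=0.9, rng=None):
    # Temperature-scaled softmax, then keep the smallest set of tokens
    # whose cumulative probability reaches top_p; sample from that set.
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                       # descending
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()                   # renormalize
    return int(rng.choice(keep, p=p))

logits = np.array([2.0, 1.0, 0.1, -1.0])
tok = sample_token(logits, rng=np.random.default_rng(0))
```

Lower temperature sharpens the distribution; a very small `top_p` degenerates to greedy argmax decoding.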
## E-Graph Optimizer

Algebraic rewrite engine for compute-graph fusion, with 30+ rules:
| Category | Examples |
|----------|----------|
| Identity | `x + 0 → x`, `x * 1 → x` |
| Idempotence | `relu(relu(x)) → relu(x)` |
| Involution | `transpose(transpose(x)) → x` |
| Translation invariance | `softmax(x + c) → softmax(x)` |
| Operator fusion | `gelu(linear(x)) → fused_linear_gelu(x)` |
| Reassociation | `matmul(A, matmul(B, C)) → matmul(matmul(A, B), C)` |
The optimizer represents the compute graph as an e-graph (equality saturation) and applies rewrite rules until no more simplifications are possible.
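A toy Python rewriter illustrates a few of these rules; the real optimizer applies them non-destructively inside an e-graph until saturation, whereas this sketch rewrites the expression tree bottom-up:

```python
def simplify(expr):
    # Expressions are nested tuples: ("add", x, 0), ("relu", x), ...
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]   # simplify children first
    # Identity: x + 0 -> x, x * 1 -> x
    if op == "add" and args[1] == 0:
        return args[0]
    if op == "mul" and args[1] == 1:
        return args[0]
    # Idempotence: relu(relu(x)) -> relu(x)
    if op == "relu" and isinstance(args[0], tuple) and args[0][0] == "relu":
        return args[0]
    # Involution: transpose(transpose(x)) -> x
    if op == "transpose" and isinstance(args[0], tuple) and args[0][0] == "transpose":
        return args[0][1]
    return (op, *args)

e = ("relu", ("relu", ("add", ("transpose", ("transpose", "x")), 0)))
result = simplify(e)  # collapses to ("relu", "x")
```

Destructive rewriting like this commits to one rule ordering; the e-graph keeps all equivalent forms and extracts the cheapest one, which is why it can find fusions a greedy rewriter misses.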
## Serving

### Text Protocol
Lightweight stdin/stdout protocol for local inference:
```
REQUEST
prompt=What is Rust?
max_tokens=256
temperature=0.7

RESPONSE
text=Rust is a systems programming language...
tokens=42
time_ms=1523
```
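Parsing a frame of this format takes only a few lines; a hypothetical Python parser, assuming one `key=value` pair per line and leaving values as strings for the caller to convert:

```python
def parse_request(raw):
    # Parse the line-oriented key=value request frame shown above.
    lines = raw.strip().splitlines()
    assert lines[0] == "REQUEST", "frame must start with REQUEST"
    fields = {}
    for line in lines[1:]:
        key, _, value = line.partition("=")  # value may contain '='
        fields[key] = value
    return fields

raw = "REQUEST\nprompt=What is Rust?\nmax_tokens=256\ntemperature=0.7\n"
req = parse_request(raw)
```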
### HTTP Server
OpenAI-compatible API:
```
POST /v1/completions
POST /v1/chat/completions
```
Supports streaming responses.
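For reference, the shape of an OpenAI-compatible chat-completions request body; the model name and field values below are illustrative placeholders, not fixed by the server:

```python
import json

# An OpenAI-compatible /v1/chat/completions request body.
body = {
    "model": "local-model",          # placeholder model identifier
    "messages": [
        {"role": "user", "content": "What is Rust?"},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    "stream": True,                  # request a streaming response
}
payload = json.dumps(body)           # POST this as application/json
```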
## Model I/O
| Format | Read | Write | Description |
|--------|------|-------|-------------|
| GGUF | Yes | Yes | llama.cpp quantized models |
| SafeTensors | Yes | Yes | HuggingFace format |
| Checkpoints | Yes | Yes | Training checkpoints |
## Project Structure
```
mojo/
├── tensor/    # Multi-dtype tensor ops, SIMD kernels
├── quant/     # Quantization (NF4, Q4_K, Q8_0, FP8)
├── nn/        # Neural network layers, models, pipelines
├── serve/     # HTTP + text protocol inference server
├── train/     # Training loops, optimizers
├── optim/     # Adam, SGD, AdamW, schedules
├── autograd/  # Automatic differentiation
├── fusion/    # E-graph algebraic optimization
├── data/      # Tokenizers, datasets, data loaders
├── io/        # GGUF, SafeTensors, checkpoints
├── model/     # LLaMA, Phi, Mistral, GPT
├── python/    # Python interop bindings
├── dlpack/    # DLPack tensor exchange
└── cli/       # Inference + benchmark CLI
```