Inference engine · C++17 / CUDA · Apache 2.0

ternative

An inference engine for ternary-weight LLMs with runtime LoRA — the llama.cpp of BitNet models. It loads a BitNet I2_S base and a separate LoRA adapter, merges them at full F32 precision, and serves the result over an OpenAI-compatible HTTP server — on GPU or CPU-only hardware.

−1 0 +1 · v1.0.0 · May 2026 · Windows · Linux · CUDA 12.x

Why ternative

No other stack serves an I2_S base with a LoRA adapter correctly.

Merging a LoRA adapter into an I2_S base and then re-quantizing rounds every delta to zero: a delta of magnitude ≈ 10⁻⁵ cannot move a base weight of ≈ 1.2 to a different ternary value, so the fine-tuning is silently discarded. ternative keeps the adapter in a separate file and merges it at full F32 precision at load time, without re-quantizing to I2_S.
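The rounding loss can be illustrated with a minimal C++ sketch. It assumes a simple round-and-clamp ternary quantizer with a single per-tensor scale; the function names and the quantizer itself are hypothetical and only stand in for the real I2_S scheme, which may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical ternary quantizer: maps a weight to -1, 0, or +1 given a
// per-tensor scale. Illustrative assumption only, not the actual I2_S code
// used by ternative or bitnet.cpp.
inline int quantize_ternary(float w, float scale) {
    const float q = std::round(w / scale);
    return static_cast<int>(std::clamp(q, -1.0f, 1.0f));
}

// Value that a ternary code decodes back to.
inline float dequantize_ternary(int q, float scale) {
    return static_cast<float>(q) * scale;
}

// Returns true when adding `delta` to `w` leaves the ternary code unchanged,
// i.e. the delta would be lost if the merged weight were re-quantized.
inline bool delta_is_lost(float w, float delta, float scale) {
    return quantize_ternary(w + delta, scale) == quantize_ternary(w, scale);
}
```

With the magnitudes quoted above, `delta_is_lost(1.2f, 1e-5f, 1.2f)` returns true: both the original and the updated weight quantize to +1, so the adapter's contribution disappears. Only a delta comparable in size to the scale itself can change the ternary code.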

Engine	BitNet I2_S	Runtime LoRA	I2_S + LoRA	Server
llama.cpp	⚠ type-36 error	✓ Q4/Q8 only	✗	via llama-server
bitnet.cpp	✓ native	✗ no path	✗	✗
ternative	✓	✓ full precision	✓	✓ built-in

How it works

De-quantize, apply, re-cast.

The pipeline keeps the fine-tuned alignment intact where other engines lose it: the I2_S base is de-quantized to F32, the LoRA delta is added at full precision, and the merged weights are cast to F16 for inference and cached to disk for fast reloads.

Load the I2_S base GGUF (~1.1 GB on disk)

De-quantize I2_S → F32

Apply delta: W = W_base + (B·A)·α/r

Cast → F16, cached as .tvcache

Offload to GPU — mixed F16 + INT8, 30 layers, GPU KV-cache
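The merge step in the list above, W = W_base + (B·A)·α/r, can be sketched in plain C++17 as follows. This is a minimal illustration rather than ternative's actual implementation: the function name `merge_lora_f32`, the row-major layout, and the use of `std::vector<float>` are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merge a LoRA adapter into de-quantized base weights, in place, in F32:
//   W[o][i] += (alpha / r) * sum_k B[o][k] * A[k][i]
// All matrices are row-major: W is out_dim x in_dim, B is out_dim x r,
// A is r x in_dim. Names and layout are illustrative assumptions.
inline void merge_lora_f32(std::vector<float>& W,
                           const std::vector<float>& B,
                           const std::vector<float>& A,
                           std::size_t out_dim, std::size_t in_dim,
                           std::size_t r, float alpha) {
    assert(r > 0);
    assert(W.size() == out_dim * in_dim);
    assert(B.size() == out_dim * r);
    assert(A.size() == r * in_dim);

    const float scale = alpha / static_cast<float>(r);
    for (std::size_t o = 0; o < out_dim; ++o) {
        for (std::size_t k = 0; k < r; ++k) {
            // Fold the alpha/r scaling into the B element once per (o, k).
            const float b = B[o * r + k] * scale;
            for (std::size_t i = 0; i < in_dim; ++i) {
                W[o * in_dim + i] += b * A[k * in_dim + i];
            }
        }
    }
}
```

Because the accumulation happens in F32 and the result is never re-quantized to ternary values, small deltas survive the merge; the cast to F16 would happen only afterwards. The loop order keeps the innermost accesses to W and A contiguous, which is a common choice for row-major data.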

Performance

Benchmarked on a 4 GB laptop.

GPU · 30 layers

6–7 tok/s

14 layers F16 + 16 layers INT8 · 3,296 MB VRAM

CPU-only

~6 tok/s

AVX2 F16C GEMM + OpenMP, no GPU needed

Prefill

~50× faster

Batched GEMV kernel — 100 prompt tokens in milliseconds

Quick start

Build & run (Linux)

bash

# clone & build
git clone --depth 1 \
  https://github.com/michelangeloromerochisco/ternative
cd ternative
cmake -B build -DCMAKE_BUILD_TYPE=Release \
  && cmake --build build --parallel

# GPU build: add -DTERNATIVE_CUDA=ON

Serve, OpenAI-compatible

bash

./build/ternative \
  --model ./models/ggml-model-i2_s.gguf \
  --lora  ./models/dpo_aligned-lora.gguf \
  --server --port 8080

# then point any OpenAI client at http://localhost:8080/v1

Supported models

What it runs today

Model	Format	Status
Orchid 1.0	I2_S + LoRA	Production
BitNet b1.58-2B-4T	I2_S	✓
Terse	I2_S + ext	Upcoming

Roadmap

Shipped & next

✓ GGUF v3 loader + I2_S tensor ops
✓ LoRA merge at F32 — zero rounding loss
✓ CPU inference · AVX2 F16C + OpenMP
✓ OpenAI-compatible server
✓ GPU-resident forward pass + KV-cache
○ cuBLAS GEMM for large-batch prefill
○ Metal backend (Apple Silicon)
○ Python bindings — ternative-py

Get the engine.

Apache 2.0 — free for research and commercial use. Windows & Linux.

View on GitHub ↗ Releases ↗