Inference engine · C++17 / CUDA · Apache 2.0

ternative

An inference engine for ternary-weight LLMs with runtime LoRA — the llama.cpp of BitNet models. It loads a BitNet I2_S base and a separate LoRA adapter, merges them at full F32 precision, and serves the result over an OpenAI-compatible HTTP server — on GPU or CPU-only hardware.

−10+1 v1.0.0 · May 2026 Windows · Linux CUDA 12.x

Why ternative

No other stack can serve this correctly.

Merging a LoRA adapter into an I2_S base and re-quantizing rounds every delta to zero — the fine-tuning is silently discarded (delta magnitude ≈ 10⁻⁵ versus base weight ≈ 1.2). ternative keeps the LoRA separate and applies it at full F32 precision at load time.

EngineBitNet I2_SRuntime LoRAI2_S + LoRAServer
llama.cpp⚠ type-36 error✓ Q4/Q8 onlyvia llama-server
bitnet.cpp✓ native✗ no path
ternative✓ full precision✓ built-in

How it works

De-quantize, apply,
re-cast.

The pipeline keeps the alignment intact where everyone else loses it: the I2_S base is de-quantized to F32, the LoRA delta is applied at full precision, then cast to F16 for inference and cached to disk for fast reloads.

01
Load the I2_S base GGUF (~1.1 GB on disk)
02
De-quantize I2_S → F32
03
Apply delta: W = W_base + (B·A)·α/r
04
Cast → F16, cached as .tvcache
05
Offload to GPU — mixed F16 + INT8, 30 layers, GPU KV-cache

Performance

Benchmarked on a 4 GB laptop.

GPU · 30 layers
6–7tok/s

14 layers F16 + 16 layers INT8 · 3,296 MB VRAM

CPU-only
~6tok/s

AVX2 F16C GEMM + OpenMP, no GPU needed

Prefill
~50× faster

Batched GEMV kernel — 100 prompt tokens in milliseconds


Quick start

Build & run (Linux)

bash
# clone & build
git clone --depth 1 \
  https://github.com/michelangeloromerochisco/ternative
cd ternative
cmake -B build -DCMAKE_BUILD_TYPE=Release \
  && cmake --build build --parallel

# GPU build: add -DTERNATIVE_CUDA=ON
 

Serve, OpenAI-compatible

bash
./build/ternative \
  --model ./models/ggml-model-i2_s.gguf \
  --lora  ./models/dpo_aligned-lora.gguf \
  --server --port 8080

# then point any OpenAI client at
http://localhost:8080/v1

Supported models

What it runs today

ModelFormatStatus
Orchid 1.0I2_S + LoRAProduction
BitNet b1.58-2B-4TI2_S
TerseI2_S + extUpcoming
Roadmap

Shipped & next

  • GGUF v3 loader + I2_S tensor ops
  • LoRA merge at F32 — zero rounding loss
  • CPU inference · AVX2 F16C + OpenMP
  • OpenAI-compatible server
  • GPU-resident forward pass + KV-cache
  • cuBLAS GEMM for large-batch prefill
  • Metal backend (Apple Silicon)
  • Python bindings — ternative-py

Get the engine.

Apache 2.0 — free for research and commercial use. Windows & Linux.