An inference engine for ternary-weight LLMs with runtime LoRA — the llama.cpp of BitNet models. It loads a BitNet I2_S base and a separate LoRA adapter, merges them at full F32 precision, and serves the result over an OpenAI-compatible HTTP server — on GPU or CPU-only hardware.
Merging a LoRA adapter into an I2_S base and re-quantizing rounds every delta to zero — the fine-tuning is silently discarded (delta magnitude ≈ 10⁻⁵ versus base weight ≈ 1.2). ternative keeps the LoRA separate and applies it at full F32 precision at load time.
| Engine | BitNet I2_S | Runtime LoRA | I2_S + LoRA | Server |
|---|---|---|---|---|
| llama.cpp | ⚠ type-36 error | ✓ Q4/Q8 only | ✗ | via llama-server |
| bitnet.cpp | ✓ native | ✗ no path | ✗ | ✗ |
| ternative | ✓ | ✓ full precision | ✓ | ✓ built-in |
The pipeline keeps the alignment intact where everyone else loses it: the I2_S base is de-quantized to F32, the LoRA delta is applied at full precision, then cast to F16 for inference and cached to disk for fast reloads.
14 layers F16 + 16 layers INT8 · 3,296 MB VRAM
AVX2 F16C GEMM + OpenMP, no GPU needed
Batched GEMV kernel — 100 prompt tokens in milliseconds
# clone & build git clone --depth 1 \ https://github.com/michelangeloromerochisco/ternative cd ternative cmake -B build -DCMAKE_BUILD_TYPE=Release \ && cmake --build build --parallel # GPU build: add -DTERNATIVE_CUDA=ON
./build/ternative \ --model ./models/ggml-model-i2_s.gguf \ --lora ./models/dpo_aligned-lora.gguf \ --server --port 8080 # then point any OpenAI client at http://localhost:8080/v1
| Model | Format | Status |
|---|---|---|
| Orchid 1.0 | I2_S + LoRA | Production |
| BitNet b1.58-2B-4T | I2_S | ✓ |
| Terse | I2_S + ext | Upcoming |
Apache 2.0 — free for research and commercial use. Windows & Linux.