The "Memory Wall" Bottleneck
Training in 16-bit precision requires large, multi-device memory pools. A 90B-parameter model needs over 1 terabyte (TB) of memory to hold its weights, gradients, and optimizer states. Qsbits compresses this footprint by 95%, which brings training within reach of consumer hardware.
Training Memory Footprint Comparison (90B Model)
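The comparison can be reproduced with simple arithmetic. The sketch below is illustrative only: the per-parameter bit counts are assumptions, not figures from the Qsbits specification. It assumes a baseline of 16-bit weights and gradients plus two 32-bit Adam moments, and a packed layout of 1 bit per weight, 2 bits per gradient, and 2 bits per optimizer state. Under those assumptions a 90B model needs 1.08 TB at baseline and about 56 GB packed, a reduction of roughly 94.8%, which is consistent with the figures quoted above.

```cpp
// Bits stored per parameter for one training configuration.
struct BitsPerParam {
    double weights;
    double gradients;
    double optimizer_states;
};

// Total training footprint in bytes for `params` parameters.
inline double footprint_bytes(double params, const BitsPerParam& b) {
    return params * (b.weights + b.gradients + b.optimizer_states) / 8.0;
}

// ASSUMPTION: 16-bit weights, 16-bit gradients, two 32-bit Adam moments.
inline BitsPerParam baseline_16bit() { return {16.0, 16.0, 64.0}; }

// ASSUMPTION: 1-bit weights, 2-bit gradient slots, one 2-bit state slot.
inline BitsPerParam qsbits_packed() { return {1.0, 2.0, 2.0}; }

// Fraction of the baseline footprint that packing saves (0.0 to 1.0).
inline double reduction(double params) {
    const double base = footprint_bytes(params, baseline_16bit());
    const double packed = footprint_bytes(params, qsbits_packed());
    return 1.0 - packed / base;
}
```

Calling `reduction(90e9)` gives the saved fraction for the 90B case; changing the two helper functions shows how sensitive the headline number is to the assumed baseline.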
Asymmetric Quantization Specification
Qsbits packs parameters into word-aligned memory structures (32-bit registers). Weights are mapped strictly to 1 bit each, while gradients and optimizer states use a ternary mapping: three possible values carry log2(3) ≈ 1.58 bits of information, and each value is stored in a 2-bit field.
Weight Tensor (1-bit)
32 weights packed into one uint32_t (4 bytes).
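A minimal host-side sketch of this packing follows. It uses the same convention as the dot-product snippet in this document (weight i lives in bit i, bit 1 means +1, bit 0 means -1). The function names `pack_weights` and `unpack_weight` are hypothetical and not part of any Qsbits API.

```cpp
#include <array>
#include <cstdint>

// Pack 32 sign weights (+1 or -1) into one word; weight i lands in bit i.
// Bit 1 encodes +1 and bit 0 encodes -1.
inline std::uint32_t pack_weights(const std::array<std::int8_t, 32>& w) {
    std::uint32_t word = 0;
    for (int i = 0; i < 32; ++i) {
        if (w[i] > 0) {
            word |= (std::uint32_t{1} << i);
        }
    }
    return word;
}

// Decode weight i from a packed word as +1 or -1.
inline int unpack_weight(std::uint32_t word, int i) {
    return ((word >> i) & 1u) ? 1 : -1;
}
```

Packing then unpacking returns the original signs, so the pair can serve as a round-trip check for a real packing routine.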
Gradient/State Tensor (1.58-bit)
16 states packed into one uint32_t (2 bits per state).
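The 2-bit layout can be sketched the same way. The source does not give the bit codes, so the encoding below (00 for 0, 01 for +1, 10 for -1, 11 unused) and the ternary value set {-1, 0, +1} are assumptions chosen for illustration, as are the function names.

```cpp
#include <array>
#include <cstdint>

// ASSUMED 2-bit codes: 00 -> 0, 01 -> +1, 10 -> -1 (11 is unused).
inline std::uint32_t encode_ternary(int v) {
    return v > 0 ? 0x1u : (v < 0 ? 0x2u : 0x0u);
}

// Inverse of encode_ternary; the unused code 11 decodes to 0.
inline int decode_ternary(std::uint32_t code) {
    return code == 0x1u ? 1 : (code == 0x2u ? -1 : 0);
}

// Pack 16 ternary values into one word; value i occupies bits 2i and 2i+1.
inline std::uint32_t pack_states(const std::array<int, 16>& s) {
    std::uint32_t word = 0;
    for (int i = 0; i < 16; ++i) {
        word |= encode_ternary(s[i]) << (2 * i);
    }
    return word;
}

// Extract ternary value i from a packed word.
inline int unpack_state(std::uint32_t word, int i) {
    return decode_ternary((word >> (2 * i)) & 0x3u);
}
```

One of the four 2-bit codes goes unused, which is the gap between the 1.58 bits of information per value and the 2 bits of storage.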
Register-Level 16-Bit Math (CUDA BFE)
Weights are never unpacked in VRAM. They are expanded strictly inside GPU registers: each bit is isolated with a Bit-Field Extract (BFE) operation and used as the sign applied to a 16-bit activation.
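// Thread-level dot product of 32 elements (one packed word of weights).
// `acc` is the thread's float accumulator, declared outside this fragment.
uint32_t packed_w = W_packed[col * K_packed + k_idx];
#pragma unroll
for (int i = 0; i < 32; ++i) {
    // Activations are stored as bfloat16 and widened to float for accumulation
    float x_val = __bfloat162float(X[row * K + (k_idx * 32) + i]);
    // Isolate bit i with a shift and mask (the compiler can lower this to BFE)
    uint32_t bit = (packed_w >> i) & 1U;
    // Bit 1 encodes +1 (add), bit 0 encodes -1 (subtract)
    acc += bit ? x_val : -x_val;
}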
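A CPU reference is useful for checking a kernel like the one above. The sketch below is an assumption-laden stand-in, not the Qsbits implementation: it uses `float` in place of bfloat16, a hypothetical function name `packed_dot`, and it maps each bit to a sign arithmetically (2 * bit - 1) instead of selecting between `x_val` and `-x_val`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference: dot product of activations x with 1-bit packed weights.
// Requires x.size() == 32 * packed.size(); float stands in for bfloat16.
inline float packed_dot(const std::vector<float>& x,
                        const std::vector<std::uint32_t>& packed) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < packed.size(); ++k) {
        for (int i = 0; i < 32; ++i) {
            // Map bit {0, 1} to sign {-1, +1} arithmetically, with no branch.
            const int bit = static_cast<int>((packed[k] >> i) & 1u);
            const float sign = static_cast<float>(2 * bit - 1);
            acc += sign * x[k * 32 + i];
        }
    }
    return acc;
}
```

With 32 activations equal to 1.0, an all-ones word sums to 32, an all-zeros word sums to -32, and a half-set word cancels to 0, which makes quick sanity checks easy.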
2-Step Inertia Finite State Machine (FSM)
Because 1-bit weights cannot absorb the small, continuous floating-point updates produced by optimizers such as Adam, Qsbits uses an integer-based FSM. It acts as a low-pass filter: a state must build momentum before it flips a weight.
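One way to picture the two-step inertia rule is the sketch below. The exact transition table is not given in the source, so everything here is an assumption: a per-weight ternary inertia value in {-1, 0, +1}, a pulse in {-1, 0, +1} giving the direction the update wants the weight to move, and a commit only on the second consecutive pulse in the same direction. The names `WeightCell` and `apply_pulse` are hypothetical.

```cpp
// One binary weight plus its ternary inertia state.
struct WeightCell {
    int weight;   // +1 or -1
    int inertia;  // -1, 0, or +1: one pending pulse toward that sign
};

// Apply one pulse (-1, 0, or +1). Returns true if the weight changed sign.
inline bool apply_pulse(WeightCell& c, int pulse) {
    if (pulse == 0) {
        return false;  // no signal: weight and inertia are left as they are
    }
    const int sum = c.inertia + pulse;
    if (sum == 2 || sum == -2) {
        // Second agreeing pulse: commit the weight to that sign and reset.
        const bool flipped = (c.weight != pulse);
        c.weight = pulse;
        c.inertia = 0;
        return flipped;
    }
    // First pulse (sum is +1 or -1) or an opposing pulse that cancels (sum is 0).
    c.inertia = sum;
    return false;
}
```

Under this rule a stream of alternating pulses never flips the weight, which is the low-pass behavior described above; only two agreeing pulses in a row get through.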
Phased Sparse Training & Offloading
By limiting the active trainable parameters to a 1.5B subset per phase, Qsbits fits within an 8 GB VRAM budget, while the frozen 1-bit base weights are streamed sequentially from system RAM.
No OS Crashes
The total active execution memory in VRAM stays under 1 GB, which avoids the out-of-memory (OOM) errors typical of large-model training.
Latency Hiding
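Because 1-bit weights are extremely small (125 MB per 1B parameters), a PCIe transfer takes ~3.9 ms at roughly 32 GB/s, the nominal rate of a PCIe 4.0 x16 link. The transfer runs concurrently with the asynchronous CUDA compute kernel, so it adds no latency as long as the kernel runs at least that long.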
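The transfer estimate follows directly from the packed size and the link bandwidth. The sketch below is a back-of-the-envelope check under an assumed bandwidth of 32 GB/s (close to the nominal peak of a PCIe 4.0 x16 link); real sustained throughput is usually lower. The helper names are hypothetical.

```cpp
// Bytes needed to store `params` weights at 1 bit each.
inline double packed_weight_bytes(double params) {
    return params / 8.0;
}

// Time in milliseconds to move `bytes` over a link of `bytes_per_second`.
inline double transfer_ms(double bytes, double bytes_per_second) {
    return bytes / bytes_per_second * 1e3;
}

// A copy is fully overlapped only if the concurrent kernel runs at least as long.
inline bool transfer_hidden(double transfer_millis, double compute_millis) {
    return compute_millis >= transfer_millis;
}
```

For 1B parameters, `packed_weight_bytes(1e9)` is 125 MB and `transfer_ms(125e6, 32e9)` is about 3.9 ms. `transfer_hidden` makes the overlap condition explicit: a kernel shorter than the copy leaves part of the transfer exposed.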
Unified Portability
C++ dynamic dispatch allows transparent routing between NVIDIA (CUDA PTX BFE) and Intel (SYCL DPC++) hardware architectures without modifying the PyTorch Python frontend.
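As a rough illustration of what such dispatch could look like, the sketch below routes calls through a virtual interface chosen at runtime. It is not the Qsbits code: `PackedBackend`, `CudaBackend`, `SyclBackend`, and `make_backend` are hypothetical names, and both backends are CPU stubs standing in for real CUDA and SYCL kernel launches.

```cpp
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>

enum class Backend { Cuda, Sycl };

// Hypothetical interface that a Python-facing layer would call.
struct PackedBackend {
    virtual ~PackedBackend() = default;
    virtual std::string name() const = 0;
    // Extract bit i of a packed word; real backends would do this on device.
    virtual std::uint32_t extract_bit(std::uint32_t word, int i) const = 0;
};

// CPU stub standing in for a CUDA implementation.
struct CudaBackend final : PackedBackend {
    std::string name() const override { return "cuda"; }
    std::uint32_t extract_bit(std::uint32_t word, int i) const override {
        return (word >> i) & 1u;
    }
};

// CPU stub standing in for a SYCL implementation.
struct SyclBackend final : PackedBackend {
    std::string name() const override { return "sycl"; }
    std::uint32_t extract_bit(std::uint32_t word, int i) const override {
        return (word >> i) & 1u;
    }
};

// Select an implementation at runtime; callers only see PackedBackend.
inline std::unique_ptr<PackedBackend> make_backend(Backend b) {
    switch (b) {
        case Backend::Cuda: return std::make_unique<CudaBackend>();
        case Backend::Sycl: return std::make_unique<SyclBackend>();
    }
    throw std::invalid_argument("unknown backend");
}
```

The point of the design is that the frontend holds only a `PackedBackend` pointer, so supporting another device means adding one subclass and one `switch` case rather than touching the calling code.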
Hardware Efficiency Metrics
By bypassing floating-point multiply-accumulate (FMAD) pipelines, Qsbits achieves significant reductions in both computational complexity and energy consumption per operation.
Operation Cycle Complexity
2× Reduction in execution unit cycles.
Energy per Operation (pJ)
4.1× Compute energy efficiency improvement.