Qsbits: HyperSpeed, Low-Memory Deep Learning Training Framework

The "Memory Wall" Bottleneck

Standard deep learning training in 16-bit precision requires very large memory pools. A 90B-parameter model needs over 1 terabyte of memory to store weights, gradients, and optimizer states. Qsbits compresses this footprint by about 95%, making training on consumer hardware viable.

Training Memory Footprint Comparison (90B Model)

Impact: Qsbits reduces the total training state from ~1.08 TB to just 56.25 GB, eliminating the severe PCIe bottleneck (5.7 seconds per step) associated with standard hardware offloading.
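
The two totals above can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes the baseline stores a 16-bit weight, a 16-bit gradient, and two FP32 Adam moments per parameter (12 bytes in total), which is one common layout that yields ~1.08 TB for 90B parameters. The function names are hypothetical and not part of Qsbits.

```cpp
#include <cassert>

// Bytes needed for `params` parameters at `bits_per_param` bits each.
double footprint_bytes(double params, double bits_per_param) {
    assert(params >= 0.0 && bits_per_param >= 0.0);
    return params * bits_per_param / 8.0;
}

// ASSUMED baseline: 16-bit weight + 16-bit gradient + two FP32 Adam moments
// = 16 + 16 + 64 = 96 bits (12 bytes) per parameter.
double baseline_bytes(double params) { return footprint_bytes(params, 96.0); }

// Qsbits layout from the text: 1-bit weight + 2-bit gradient + 2-bit state.
double qsbits_bytes(double params) { return footprint_bytes(params, 5.0); }

// Fraction of the baseline footprint that is removed.
double reduction(double params) {
    assert(params > 0.0);
    return 1.0 - qsbits_bytes(params) / baseline_bytes(params);
}
```

For 90e9 parameters, `baseline_bytes` gives 1.08e12 bytes and `qsbits_bytes` gives 5.625e10 bytes (56.25 GB), a reduction of 1 - 5/96, or roughly 95%.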

Asymmetric Quantization Specification

Qsbits packs parameters into word-aligned 32-bit words. Weights are stored at exactly 1 bit each, while gradients and states are ternary (log2(3) ≈ 1.58 bits of information per value) and are stored in 2 bits each.

Weight Tensor (1-bit)

32 weights packed into one uint32_t (4 bytes).

Mapping: 0 → -1.0 | 1 → +1.0
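
As a rough illustration of this layout, the following host-side C++ sketch packs 32 weights into one word and decodes a single weight. Placing weight i at bit i (least significant bit first) is an assumption, chosen to agree with the `(packed_w >> i) & 1U` extraction in the kernel fragment in this document. `pack_weights` and `unpack_weight` are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Pack 32 weights, each -1.0f or +1.0f, into one 32-bit word.
// Bit value 0 encodes -1.0 and 1 encodes +1.0 (mapping from the text).
// ASSUMPTION: weight i is stored at bit i (LSB first).
uint32_t pack_weights(const std::array<float, 32>& w) {
    uint32_t word = 0;
    for (int i = 0; i < 32; ++i) {
        assert(w[i] == 1.0f || w[i] == -1.0f);
        if (w[i] > 0.0f) word |= (1u << i);
    }
    return word;
}

// Decode weight i from a packed word.
float unpack_weight(uint32_t word, int i) {
    assert(i >= 0 && i < 32);
    return ((word >> i) & 1u) ? 1.0f : -1.0f;
}
```

Packing and then unpacking returns the original signs, so the 4-byte word is a lossless store for 32 binary weights.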

Gradient/State Tensor (1.58-bit)

16 states packed into one uint32_t (2 bits per state).

Mapping: 00 → 0 | 01 → +1 | 10 → -1
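
The 2-bit mapping can be sketched the same way. In this illustrative C++ version, value i occupies bits 2i and 2i+1 of the word; that placement is an assumption, as are the helper names. The text does not assign the code 11, so the sketch never emits it and rejects it on decode.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 2-bit codes from the text: 00 -> 0, 01 -> +1, 10 -> -1.
// Code 11 is not assigned in the text; this sketch never emits it.
uint32_t encode_ternary(int v) {
    assert(v >= -1 && v <= 1);
    return v == 0 ? 0u : (v > 0 ? 1u : 2u);
}

int decode_ternary(uint32_t code) {
    assert(code <= 2u);
    return code == 0u ? 0 : (code == 1u ? 1 : -1);
}

// Pack 16 ternary values into one word.
// ASSUMPTION: value i occupies bits [2i, 2i+1] (LSB first).
uint32_t pack_ternary(const std::array<int, 16>& s) {
    uint32_t word = 0;
    for (int i = 0; i < 16; ++i) word |= encode_ternary(s[i]) << (2 * i);
    return word;
}

// Decode value i from a packed word.
int unpack_ternary(uint32_t word, int i) {
    assert(i >= 0 && i < 16);
    return decode_ternary((word >> (2 * i)) & 3u);
}
```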

Register-Level 16-Bit Math (CUDA BFE)

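Weights are never unpacked in VRAM. They are decoded only in registers: each bit is extracted with Bit-Field Extract (BFE) and selects whether the corresponding 16-bit (bfloat16) activation, widened to float in the fragment below, is added to or subtracted from the accumulator.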

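// Thread-level dot product of 32 elements (fragment from inside the kernel).
// `acc` is the thread's float accumulator, declared earlier in the kernel.
// X holds bfloat16 activations (row-major, K columns); W_packed holds
// K_packed 32-bit words per output column (presumably K / 32).
uint32_t packed_w = W_packed[col * K_packed + k_idx];
#pragma unroll
for (int i = 0; i < 32; ++i) {
    float x_val = __bfloat162float(X[row * K + (k_idx * 32) + i]);
    // Extract bit i with shift-and-mask, no branch (maps to a BFE instruction)
    uint32_t bit = (packed_w >> i) & 1U;
    // Add if the bit is 1 (weight +1), subtract if it is 0 (weight -1)
    acc += bit ? x_val : -x_val;
}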
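
For checking a kernel like the fragment above, a host-side reference is useful. The following C++ sketch applies the same add-or-subtract rule on the CPU. It is an illustration, not the Qsbits implementation: it uses plain `float` activations instead of bfloat16, handles a single row and column, and `dot_packed` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference: dot product of activations x (length 32 * words)
// with 1-bit weights packed 32 per word (bit 1 = +1, bit 0 = -1).
float dot_packed(const std::vector<float>& x,
                 const std::vector<uint32_t>& w_packed) {
    assert(x.size() == 32 * w_packed.size());
    float acc = 0.0f;
    for (std::size_t k = 0; k < w_packed.size(); ++k) {
        const uint32_t word = w_packed[k];
        for (int i = 0; i < 32; ++i) {
            const float x_val = x[k * 32 + i];
            const uint32_t bit = (word >> i) & 1u;  // same extraction as the kernel
            acc += bit ? x_val : -x_val;            // no multiply is needed
        }
    }
    return acc;
}
```

With 32 activations equal to 1.0, an all-ones word sums to 32, an all-zeros word to -32, and a half-set word to 0.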

2-Step Inertia Finite State Machine (FSM)

Because 1-bit weights cannot support continuous floating-point updates like Adam, Qsbits uses an integer-based FSM. It acts as a low-pass filter: a state must build momentum before it flips a weight.
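
One way such a two-step rule could look is sketched below in C++. The text only states that a weight needs two pushes to flip, so the details here are assumptions: the push direction is taken as the negative of the ternary gradient, a first push is only remembered, a second consecutive push in the same direction commits it, and a push in the opposite direction replaces the pending one. `Cell` and `fsm_step` are hypothetical names.

```cpp
#include <cassert>

// One weight with its inertia state. All values are small integers.
struct Cell {
    int weight;   // +1 or -1 (the 1-bit parameter)
    int inertia;  // -1, 0, or +1: direction of the pending push, 0 = none
};

// Apply one ternary gradient pulse g in {-1, 0, +1}.
// ASSUMED rule: push direction is -g (descent). The first push is stored;
// a second consecutive push in the same direction is committed to the weight.
void fsm_step(Cell& c, int g) {
    assert(g >= -1 && g <= 1);
    if (g == 0) return;        // no pulse: state unchanged
    const int push = -g;
    if (c.inertia == push) {   // second push in the same direction
        c.weight = push;       // changes the weight only if it differed
        c.inertia = 0;
    } else {
        c.inertia = push;      // first push, or a change of direction
    }
}
```

Under these assumptions, a weight at +1 survives a single positive gradient pulse and flips to -1 on the second, while alternating pulses never flip it, which is the low-pass behaviour described above.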

Interactive FSM Simulator (widget on the original page): a gradient pulse feeds the internal inertia state, which starts at 0; two pushes are required to flip the 1-bit weight, which starts at +1.

Phased Sparse Training & Offloading

By limiting the active trainable parameters to a 1.5B subset per phase, Qsbits keeps the active training state (weights, gradients, and FSM states) at 937.5 MB, well within an 8 GB VRAM budget, while the frozen 1-bit base weights are streamed sequentially from system RAM.

SYSTEM RAM (16 GB Limit)

90B Base Weights (1-bit): 11.25 GB
OS & System Overhead: ~4.00 GB

↓ PCIe Gen 4 Stream (125 MB per layer) ↓

GPU VRAM (8 GB Limit)

Active Weights (1-bit, 1.5B): 187.5 MB
Active Gradients (2-bit): 375.0 MB
Active FSM States (2-bit): 375.0 MB
Residual Free VRAM (Activations): ~7.06 GB
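
The figures in this layout follow directly from the bit widths. The C++ sketch below recomputes them using decimal units (1 GB = 1e9 bytes), which is what the numbers above imply. The type and function names are hypothetical.

```cpp
#include <cassert>

constexpr double kGB = 1e9;  // decimal gigabyte, matching the figures in the text

// Bytes for `params` values stored at `bits` bits each.
double packed_bytes(double params, int bits) {
    assert(params >= 0.0 && bits > 0);
    return params * bits / 8.0;
}

struct VramBudget {
    double weights, gradients, states;  // all in bytes
    double total() const { return weights + gradients + states; }
};

// Active subset: 1-bit weights, 2-bit gradients, 2-bit FSM states.
VramBudget active_budget(double active_params) {
    return {packed_bytes(active_params, 1),
            packed_bytes(active_params, 2),
            packed_bytes(active_params, 2)};
}

// VRAM left for activations, in GB, after the active training state.
double residual_gb(double vram_limit_gb, const VramBudget& b) {
    return (vram_limit_gb * kGB - b.total()) / kGB;
}
```

For a 1.5e9-parameter active subset this gives 187.5 MB + 375 MB + 375 MB = 937.5 MB, leaving about 7.06 GB of an 8 GB card. The same helper gives 11.25 GB for the 90B frozen base weights at 1 bit each.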

No OS Crashes

The active training state in VRAM (weights, gradients, and FSM states) totals 937.5 MB, strictly under 1 GB. This leaves about 7.06 GB for activations and avoids the Out-of-Memory (OOM) errors typical of large-model training.

Latency Hiding

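Because 1-bit weights are extremely small (125 MB per 1B parameters), each PCIe transfer takes ~3.9 ms, which corresponds to a link rate of roughly 32 GB/s. The transfer overlaps with the asynchronous CUDA compute kernel, so it is hidden whenever the kernel runs at least as long as the copy.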
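
The transfer time is a one-line calculation. The C++ sketch below reproduces it under the assumption of a sustained 32 GB/s link, which is the rate implied by 125 MB in ~3.9 ms; real PCIe throughput is usually lower than the theoretical peak. The kernel durations in the usage note are made-up placeholders, and the function names are hypothetical.

```cpp
#include <cassert>

// Time in milliseconds to move `bytes` over a link sustaining `bytes_per_s`.
double transfer_ms(double bytes, double bytes_per_s) {
    assert(bytes >= 0.0 && bytes_per_s > 0.0);
    return bytes / bytes_per_s * 1e3;
}

// A copy is hidden when it overlaps a kernel that runs at least as long.
bool is_hidden(double copy_ms, double kernel_ms) {
    return copy_ms <= kernel_ms;
}
```

`transfer_ms(125e6, 32e9)` evaluates to about 3.906 ms. With a hypothetical 5 ms kernel per layer the copy would be fully overlapped, while with a 1 ms kernel it would not, so the latency-hiding claim depends on the per-layer compute time.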

Unified Portability

C++ dynamic dispatch allows transparent routing between NVIDIA (CUDA PTX BFE) and Intel (SYCL DPC++) hardware architectures without modifying the PyTorch Python frontend.
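
A minimal sketch of what run-time dispatch of this kind could look like is shown below. Everything in it is hypothetical: the `Backend` interface, the class names, and the use of the device strings "cuda" and "xpu" are illustrative assumptions, and both backends are host-side stubs rather than real CUDA or SYCL kernels.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>

// Hypothetical backend interface; names are illustrative only.
struct Backend {
    virtual ~Backend() = default;
    virtual std::string name() const = 0;
    // Extract bit i of a packed word (stand-in for a device kernel).
    virtual uint32_t extract_bit(uint32_t word, int i) const = 0;
};

struct CudaBackend : Backend {
    std::string name() const override { return "cuda"; }
    // A real build would launch a CUDA kernel here; this stub runs on the host.
    uint32_t extract_bit(uint32_t word, int i) const override {
        return (word >> i) & 1u;
    }
};

struct SyclBackend : Backend {
    std::string name() const override { return "sycl"; }
    // A real build would submit a SYCL kernel here; this stub runs on the host.
    uint32_t extract_bit(uint32_t word, int i) const override {
        return (word >> i) & 1u;
    }
};

// Pick a backend at run time from a device string ("cuda" or "xpu" assumed).
std::unique_ptr<Backend> make_backend(const std::string& device) {
    assert(device == "cuda" || device == "xpu");
    if (device == "cuda") return std::make_unique<CudaBackend>();
    return std::make_unique<SyclBackend>();
}
```

Because callers only see the `Backend` interface, a Python binding layered on top would not need to change when the device string does.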

Hardware Efficiency Metrics

By bypassing floating-point multiply-accumulate (FMAD) pipelines, Qsbits achieves significant reductions in both computational complexity and energy consumption per operation.

Operation Cycle Complexity

2× Reduction in execution unit cycles.

Energy per Operation (pJ)

4.1× Compute energy efficiency improvement.