Unlocking AI Tokenomics: Real-World Impact from WekaIO Labs

By Maor Ben-Dayan, Co-founder and Chief Architect, WekaIO, Inc.

Recently, we discussed the evolving landscape of AI token economics, popularly referred to as ‘tokenomics’, and its connection to memory and storage efficiencies.

In our analysis, DeepSeek’s recent advancements in token processing demonstrated a 26x increase in efficiency, reducing the time to first token (TTFT) for a 128K-token prompt from 13 seconds to roughly 0.5 seconds, highlighting a global shift toward greater AI efficiency.

WEKA has been pursuing similar innovations, but theory only goes so far; here we present empirical results from our own lab experiments focused on radical tokenomics. To illustrate the advantages of extending GPU high-bandwidth memory with fast storage, we tested the WEKA Data Platform using NVIDIA Magnum IO GPUDirect Storage (GDS) in conjunction with a high-performance 8-node WEKApod. This setup delivered a 41x faster token prefill time for 105,000 tokens without compression, with significant gains at shorter context lengths as well.

Challenge: AI Token Processing and Memory Limitations
AI inference at scale requires innovations that address inherent memory limitations. Modern GPUs excel at processing large volumes of data in parallel, but their memory capacity is finite. As models grow more complex and context lengths increase, memory requirements often exceed what a single GPU can hold. GPUs become memory-bound, creating bottlenecks during token generation that are particularly noticeable in the decode phase of Large Language Models (LLMs), which depends on rapid data retrieval for effective input processing.

Moreover, GPUs cannot scale memory independently of compute: adding memory means adding more GPUs, which escalates costs without yielding proportional performance benefits. Many AI workloads exhibit this mismatch in resource utilization, with GPUs sitting idle as much as 60% of the time because memory, not processing power, is the limiting factor.
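
To make the memory pressure concrete, here is a rough back-of-the-envelope sketch (our illustration, not figures from the lab) that estimates the KV cache footprint of a long-context request, assuming Llama 3.1 70B-style parameters: 80 layers, 8 grouped-query KV heads, a head dimension of 128, and FP16 values. Exact numbers vary by model and serving stack.

```python
# Rough KV cache size estimate for a long-context request.
# Assumed (illustrative) Llama-3.1-70B-style architecture:
NUM_LAYERS = 80        # transformer layers
NUM_KV_HEADS = 8       # grouped-query attention KV heads
HEAD_DIM = 128         # dimension per head
BYTES_PER_ELEM = 2     # FP16, no quantization

def kv_cache_bytes(context_tokens: int, batch_size: int = 1) -> int:
    """Bytes needed to hold keys and values for one request."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K and V
    return per_token * context_tokens * batch_size

if __name__ == "__main__":
    for tokens in (50, 8_192, 105_000, 128_000):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>7} tokens -> ~{gib:6.1f} GiB of KV cache")
```

Under these assumptions, a 128K-token context works out to roughly 50 GiB of KV cache for a single request, a large share of one GPU’s HBM before model weights and activations are even counted, which is why memory rather than compute becomes the bottleneck.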

This raises the question: how can we decouple GPU compute from memory constraints to improve efficiency and lower costs?

Tackling AI’s Biggest Challenges at Scale
WEKA is actively addressing critical challenges within AI at scale, especially concerning training and inference. Large-scale AI training demands substantial computational resources alongside effective data pathways to manage ever-expanding data sets and intricate model structures. The demand for superior storage solutions and efficient GPU utilization necessitates innovative approaches to alleviate bottlenecks. Efficient deployment of AI models at scale presents its own difficulties, as inference workloads frequently contend with latency, memory limitations, and the need to adapt dynamically to varying workload demands. As model complexity increases, maintaining inference pipelines that can accommodate longer context lengths without unnecessary delays is vital.

Expanding context lengths in inference processes presents an emerging challenge, with models needing to handle considerably longer sequences, thereby intensifying demands on memory and computational resources.

In tests with LLaMA 3.1 70B without quantization, we observed that a large 100,000-token prompt could take approximately 24 seconds to prefill, significantly lengthening the time before output generation begins. Many current inference systems discard this prefill data after use, repeating the computational overhead on every model execution.

DeepSeek’s innovation illustrates a method for preserving the KV cache and reloading it as needed, markedly reducing redundant computation. This technique, which has seen uptake in various labs, delivers substantial performance and cost advantages. Nonetheless, while proprietary solutions abound, WEKA aims to bring this capability to enterprises in a way that is open and scalable.
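
To show the shape of the idea (a minimal sketch using Hugging Face Transformers, not WEKA’s or DeepSeek’s implementation; the model name, prompt, and cache path are placeholders), the snippet below runs the prefill once, persists the resulting KV cache, and reloads it later so decoding can continue without recomputing the long prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-70B-Instruct"          # placeholder model
CACHE_PATH = "/mnt/fast-storage/prompt_kv_cache.pt"  # hypothetical fast-storage mount

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

long_prompt = "..."  # the large (e.g., ~100K-token) prompt
prompt_ids = tok(long_prompt, return_tensors="pt").input_ids.to(model.device)

# 1) Run the expensive prefill once and persist the KV cache instead of discarding it.
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
torch.save(out.past_key_values, CACHE_PATH)

# 2) Later, or on another replica: reload the cache and continue decoding
#    without re-running the prefill over the long prompt.
past = torch.load(CACHE_PATH, map_location=model.device, weights_only=False)
generated = model.generate(prompt_ids, past_key_values=past, max_new_tokens=128)
print(tok.decode(generated[0][prompt_ids.shape[1]:], skip_special_tokens=True))
```

In production, torch.save and torch.load would be replaced by a much faster path that streams the cache to and from storage (for example over GDS), but the structure is the same: pay for prefill once, then reuse it.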

The pressing challenge remains: how quickly can we reload and use this KV cache at scale? WEKA’s expertise in high-performance storage and data management is critical here, enabling AI workloads to retain, access, and reuse inference data with minimal latency and unlocking new efficiencies for enterprise AI applications.

WEKA Vision: Purpose-Built for AI
WEKA is distinctively positioned to tackle these challenges. Rather than following the path of general-purpose storage solutions, it homes in on AI workloads, so its optimizations are purpose-built for performance, scalability, and efficiency.

WEKA’s architecture aligns reads and writes into GPU memory (through GDS) with the closest NIC, maximizing performance by minimizing unnecessary data transfers and latency. This creates an efficient pipeline that delivers data to AI models at unprecedented speed. Furthermore, our design splits I/O into highly parallel streams, ensuring rapid data access and processing for AI tasks.
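
As a rough illustration of this kind of data path (a sketch using the RAPIDS kvikio bindings for NVIDIA cuFile/GDS, not WEKA’s internal implementation; the mount point and chunk count are assumptions), the snippet below reads a large file straight into GPU memory, split into parallel requests:

```python
# Sketch: read a large KV cache file directly into GPU memory via
# GPUDirect Storage, using the RAPIDS kvikio bindings for NVIDIA cuFile.
# The path and chunk count are illustrative assumptions.
import os
import cupy as cp
import kvikio

CACHE_PATH = "/mnt/weka/prompt_kv_cache.bin"  # hypothetical WEKA mount
NUM_CHUNKS = 8                                # degree of read parallelism

def load_to_gpu(path: str, num_chunks: int = NUM_CHUNKS) -> cp.ndarray:
    """Read a file into GPU HBM with GDS, issuing parallel non-blocking reads."""
    size = os.path.getsize(path)
    buf = cp.empty(size, dtype=cp.uint8)          # destination buffer on the GPU
    chunk = (size + num_chunks - 1) // num_chunks
    with kvikio.CuFile(path, "r") as f:
        futures = [
            f.pread(buf[off:min(off + chunk, size)], file_offset=off)
            for off in range(0, size, chunk)
        ]
        assert sum(fut.get() for fut in futures) == size  # wait for all reads
    return buf

kv_bytes = load_to_gpu(CACHE_PATH)
print(f"Loaded {kv_bytes.nbytes / 2**30:.2f} GiB directly into GPU memory")
```

kvikio can also split a single read across its internal thread pool; the explicit chunking here simply makes the parallel-stream idea visible.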

Beyond storage optimization, WEKA also engages directly with inference engines. We are assessing how extending GPU memory with ultra-fast storage can dramatically improve token processing efficiency. Our experiments show that combining high-speed storage with intelligent data management lets AI workloads achieve higher throughput at lower inference cost, opening new avenues for enterprises looking to scale their AI infrastructure.

WEKA’s Proof of Concept: Pushing Tokenomics Even Further
To examine how augmenting GPU memory with ultra-fast storage improves token processing performance, and to showcase the capabilities of the WEKA architecture, we ran tests across a range of token counts and configurations on the following setup:

  • NVIDIA DGX H100
  • 8-node WEKApod equipped with PCIe Gen 5
  • NVIDIA Quantum-2 QM9700 64-port NDR 400Gb InfiniBand switches

Our evaluations revealed a remarkable 41x reduction in prefill time on LLaMA 3.1 70B, from 23.97 seconds to just 0.58 seconds, a significant advance against one of the primary bottlenecks in inference workloads. Of that 0.58 seconds, less than 0.2 seconds was spent on data transfer, indicating room to cut inference overhead even further. We have yet to see this level of efficiency matched by other inference optimization efforts. As AI models continue to scale their context lengths, such advances are essential to keeping inference both scalable and efficient.

Importantly, this test was conducted without compressing the KV cache or employing quantization, keeping the GPUs fully dedicated to inference. A major consideration for our enterprise clients is maximum accuracy, and this approach guarantees that precision is preserved. Notably, we observed substantial prefill time improvements even at smaller context sizes, down to 50 tokens.

Key Takeaways: Why This Matters for GenAI
Across varying token counts, these optimizations yield considerable efficiency gains for GPUs. Shaving roughly 24 seconds of prefill compute off each request frees GPUs to produce more output tokens instead of waiting on data initialization. Combined with rapid model loading over GDS, fast checkpointing during training, and quick resumption of inference tasks, organizations can pool their training and inference resources and move fluidly between the two. Consolidating workloads on a unified infrastructure minimizes hardware sprawl, simplifies management, and promotes optimal use of AI resources, improving cost efficiency and operational agility.
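
As a rough illustration of what that reclaimed prefill time is worth (our own back-of-the-envelope figures; the request volume and GPU-hour cost are assumptions, not lab results), consider:

```python
# Back-of-the-envelope estimate of GPU time reclaimed by skipping redundant prefill.
# The request volume and GPU-hour cost below are hypothetical assumptions.
PREFILL_SECONDS_SAVED = 23.97 - 0.58   # per long-context request, from the results above
REQUESTS_PER_DAY = 10_000              # assumed workload
GPU_HOUR_COST_USD = 2.50               # assumed blended cost per GPU-hour

gpu_hours_saved = PREFILL_SECONDS_SAVED * REQUESTS_PER_DAY / 3600
print(f"GPU-hours reclaimed per day: {gpu_hours_saved:,.0f}")
print(f"Approximate daily savings:   ${gpu_hours_saved * GPU_HOUR_COST_USD:,.0f}")
```

Even at modest request volumes, recovered prefill time translates directly into GPU capacity that can serve additional output tokens or additional users.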

[Figure: WEKA KV cache speed-up]

The Future of AI Tokenomics
In light of the growing adoption of AI technologies, efforts to optimize tokenomics will inevitably become a vital competitive advantage. Organizations that do not actively reduce token processing costs risk falling behind in a rapidly evolving market. Thanks to advancements in caching, storage, and latency reduction, WEKA is leading the charge in creating scalable, cost-effective, and accessible AI solutions, whether on-premises or in the public cloud. WEKA is committed to productizing these enhancements alongside our design partners, ensuring enterprises can exploit storage-accelerated AI to boost token processing efficiency on a large scale.

Resources:
Stay tuned for our next updates, and join us at NVIDIA GTC to see these game-changing results in action.
Learn more about WEKA and the evolving landscape of AI tokenomics.