Rise of the Inference Stack

While language models promise to fundamentally change how we use AI across all industries, actually serving them with low latency is challenging; inference is slow even on expensive hardware.

This inference problem has drawn the attention of many groups and organizations, all working toward fast and efficient inference that runs on a wide range of setups, from a single GPU to large distributed clusters, and that supports both language and multimodal models.

Until recently this stack really did not exist: you simply picked a library like vLLM and built your APIs on top of it. With the growing use of self-hosted models, however, a more sophisticated set of tools has begun to appear.

The importance of inference extends beyond serving responses in production; it is also vital during training. A strong inference system helps evaluate progress in realistic settings, and it is especially important in training loops where generating, scoring, and ranking responses determines the next training steps.
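
A best-of-n sampling step is one concrete example of this pattern. The sketch below is purely illustrative, not any particular library's API: policy.generate and reward_model.score are hypothetical stand-ins for a policy model served by an inference engine and a reward model that scores its outputs.

    # Illustrative only: a "generate, score, rank" step of the kind used in
    # RLHF-style training loops. `policy` and `reward_model` are hypothetical.
    def best_of_n_step(policy, reward_model, prompts, n=4):
        batch = []
        for prompt in prompts:
            # Fast inference matters here: n completions per prompt, every step.
            candidates = [policy.generate(prompt) for _ in range(n)]
            ranked = sorted(candidates, key=reward_model.score, reverse=True)
            batch.append((prompt, ranked[0]))  # keep the top-ranked response
        return batch  # fed into the next training update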

To achieve all this, you need a solid system that can handle AI inference at scale.

In simple terms, the AI Inference Stack is the software needed to run the inference process on language models.

Open Source Actors in the Model Inference Play

vLLM – A fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has grown into one of the most active open-source AI projects, built and maintained by a community of over 2,000 contributors from dozens of academic institutions and companies.
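
As a quick illustration of what the library looks like in practice, the snippet below uses vLLM's offline LLM entry point; the model name, prompt, and sampling settings are arbitrary examples, and details of the API may differ slightly between vLLM versions.

    from vllm import LLM, SamplingParams

    # Load the model onto the available GPU(s); the model name is just an example.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # vLLM batches and schedules prompts internally (continuous batching).
    outputs = llm.generate(["The inference stack matters because"], params)
    for out in outputs:
        print(out.outputs[0].text)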

SGLang – A high-performance serving framework for large language and multimodal models; essentially a vLLM alternative. A few other alternatives exist as well, such as HuggingFace Transformers, Guidance, LightLLM, and FlashInfer, but vLLM appears to be the dominant player.

llm-d – A Kubernetes-native, high-performance distributed LLM inference framework.

KServe – A unified platform for both generative and predictive AI inference on Kubernetes, aimed at self-hosted AI.

Anatomy of the Inference Stack

The Inference Engine

At the core of the stack is the Inference Engine, which runs the model and generates the outputs. This layer orchestrates the entire inference lifecycle, which includes:

  • Validation, tokenization and pre-processing of the input prompt
  • Loading the LLM weights onto the hardware
  • Executing forward passes on the neural network model
  • Scheduling queries for concurrent execution
  • Managing KV cache memory through different caching and memory-management techniques (PagedAttention, RadixAttention, continuous batching, etc.)
  • Consolidating the output with guided decoding

This layer is provided by libraries like vLLM and SGLang.
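
To make the lifecycle above concrete, here is a deliberately simplified toy sketch of a single request flowing through an engine. It is not vLLM or SGLang internals: model.prefill, model.decode_step, and the tokenizer are hypothetical stand-ins, and real engines interleave many requests and manage the KV cache with PagedAttention-style allocators.

    # Toy sketch of the lifecycle above; all objects are hypothetical stand-ins.
    def run_request(model, tokenizer, prompt, max_new_tokens=64):
        # 1. Validate, tokenize, and pre-process the input prompt.
        token_ids = tokenizer.encode(prompt)

        # 2./3. With the weights already resident on the accelerator, run the
        #       prefill forward pass over the whole prompt, filling the KV cache.
        kv_cache = model.prefill(token_ids)

        # 4./5. A real engine schedules many requests concurrently and manages
        #       the cache in paged blocks; here we handle a single request.
        for _ in range(max_new_tokens):
            logits, kv_cache = model.decode_step(token_ids[-1], kv_cache)
            next_id = int(logits.argmax())  # greedy sampling for simplicity
            token_ids.append(next_id)
            if next_id == tokenizer.eos_token_id:
                break

        # 6. Consolidate and detokenize the output.
        return tokenizer.decode(token_ids)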

The Web Serving Layer

The core inference engine is decoupled from the “frontend” web API server. Multiple replicas of the inference engine run as separate “backend” processes, and these communicate with the frontend API server process via a “Coordinator” process.
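
From a client's perspective only the frontend is visible, typically as an OpenAI-compatible HTTP API. The snippet below queries such an endpoint with the standard openai Python client; the base URL, API key, and model name are placeholders for a self-hosted deployment (vLLM's built-in server is one way to expose an endpoint like this).

    from openai import OpenAI

    # Point the standard OpenAI client at the self-hosted frontend API server.
    # base_url, api_key, and model are placeholders for your deployment.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="facebook/opt-125m",
        messages=[{"role": "user", "content": "What does an inference stack do?"}],
    )
    print(response.choices[0].message.content)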

If serving needs to be Kubernetes native, then solutions like llm-d come into the picture. llm-d brings advanced distributed inference capabilities. Most importantly, llm-d promotes a disaggregated inference process in which prompt processing (prefill) and token generation (decode) run as separate services on different pods. Because the two phases have different computational demands (prefill is compute bound, decode is memory bound), this separation allows each phase to be scaled and optimized independently.
It also extends the Kubernetes Gateway API, allowing smarter routing of incoming requests. Using real-time signals such as KV cache utilization and pod load, it sends each request to the best instance, increasing cache hits and balancing the workload across the cluster. This Kubernetes-native approach also provides the policy, security, and observability layer for generative AI inference.
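
To see why the two phases behave so differently, the purely conceptual NumPy toy below (not llm-d code) contrasts them: prefill attends over every prompt token in one large batched operation, while each decode step processes a single token but must re-read the entire, ever-growing KV cache.

    # Conceptual toy, single attention head: why prefill is compute bound and
    # decode is memory bound. All shapes and names are illustrative only.
    import numpy as np

    d, n_prompt, n_new = 64, 1024, 32
    rng = np.random.default_rng(0)

    def attention(q, k, v):
        # softmax(q @ k.T / sqrt(d)) @ v, with q: (m, d) and k, v: (n, d)
        scores = q @ k.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    # Prefill: one big pass over all prompt tokens (compute heavy), producing
    # the KV cache that decode will reuse.
    prompt = rng.standard_normal((n_prompt, d))
    k_cache, v_cache = prompt.copy(), prompt.copy()  # stand-ins for projected K/V
    _ = attention(prompt, k_cache, v_cache)

    # Decode: one token at a time; every step re-reads the whole KV cache
    # (memory heavy) and appends a single new entry to it.
    token = rng.standard_normal((1, d))
    for _ in range(n_new):
        out = attention(token, k_cache, v_cache)
        k_cache = np.vstack([k_cache, token])
        v_cache = np.vstack([v_cache, token])
        token = out  # stand-in for the next token's hidden state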
