System Optimizations for LLM Inferencing

Fast LLM inferencing requires enough GPU memory to hold the entire model, which leaves low-end GPUs inadequate or even unusable for many users. Our focus is on developing system methodologies and solutions that speed up LLM inferencing on such hardware.

To address the challenges of LLM inferencing on low-end GPUs, we propose a series of system optimizations to enhance performance and efficiency. These optimizations include:

- Offloading (sketched below)
- Data traffic optimization
- Neuron and weight redistribution
- Token batching and rescheduling (sketched below)
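
To make the offloading idea concrete, here is a minimal sketch in PyTorch of streaming one layer's weights onto the GPU at a time so a model larger than GPU memory can still run. The toy block, dimensions, and function names are illustrative assumptions, not our actual design.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for a transformer layer; the real architecture differs."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

def offloaded_forward(layers, x, device="cuda"):
    """Move one layer at a time to the GPU, run it, then evict it.

    Only a single layer's weights are resident on the GPU at any moment,
    trading PCIe transfer time for GPU memory.
    """
    x = x.to(device)
    for layer in layers:
        layer.to(device)          # upload this layer's weights
        with torch.no_grad():
            x = layer(x)          # compute on the device
        layer.to("cpu")           # evict weights to free GPU memory
    return x

if __name__ == "__main__":
    dim, n_layers = 1024, 8
    blocks = [ToyBlock(dim) for _ in range(n_layers)]   # weights live on CPU
    tokens = torch.randn(4, 16, dim)                    # (batch, seq, dim)
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    print(offloaded_forward(blocks, tokens, device=dev).shape)
```

In practice, the transfer of the next layer would be overlapped with the current layer's computation (data traffic optimization), but the memory-for-bandwidth trade-off is the same.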

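Token batching and rescheduling can be illustrated with a similar hedged sketch: a scheduler that admits waiting requests into the running batch as soon as earlier requests finish, rather than draining the whole batch first. The request format, fake decode step, and batch size are placeholders, not our actual scheduler.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    remaining_tokens: int          # tokens still to generate

def decode_step(batch):
    """Pretend to generate one token for every request in the batch."""
    for req in batch:
        req.remaining_tokens -= 1

def serve(requests, max_batch=4):
    waiting = deque(requests)
    running = []
    step = 0
    while waiting or running:
        # Rescheduling point: admit new requests whenever a slot frees up,
        # so the GPU batch stays full instead of waiting for stragglers.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        decode_step(running)
        step += 1
        for r in [r for r in running if r.remaining_tokens == 0]:
            print(f"step {step}: request {r.rid} finished")
        running = [r for r in running if r.remaining_tokens > 0]

if __name__ == "__main__":
    serve([Request(i, remaining_tokens=3 + i % 5) for i in range(8)])
```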