Double Sparsity reduces KV-cache memory access in LLM inference through post-training sparse attention; the algorithm was later merged into SGLang.
Post-Training Sparse Attention with Double Sparsity
Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng
Aug 1, 2024
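The summary above describes token-sparse attention that cuts KV-cache reads. As a minimal, hedged sketch (not SGLang's actual implementation; the function name, `heavy_channels` parameter, and top-k selection below are illustrative assumptions), the core idea can be shown as approximating attention scores from a few feature channels, then computing exact attention over only the top-k tokens:

```python
import numpy as np

def sparse_attention_sketch(q, K, V, heavy_channels, k_tokens):
    """Illustrative sketch of channel-guided token-sparse attention.

    Approximate each token's attention score using only a small subset
    of 'heavy' feature channels, keep the top-k tokens, and run exact
    attention over that subset so most of the KV cache is never read.
    """
    d = q.shape[-1]
    # Cheap approximate scores from a few channels of the key cache.
    approx_scores = K[:, heavy_channels] @ q[heavy_channels]
    # Indices of the k tokens with the highest approximate scores.
    topk = np.argsort(approx_scores)[-k_tokens:]
    # Exact scaled-dot-product attention restricted to those tokens.
    scores = K[topk] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[topk]

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((128, 64))
V = rng.standard_normal((128, 64))
out = sparse_attention_sketch(q, K, V,
                              heavy_channels=np.arange(8), k_tokens=16)
```

Here only 16 of 128 cached key/value rows are touched by the exact attention step, which is the source of the memory-access savings.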
