Post-Training Sparse Attention with Double Sparsity

Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng | Aug 1, 2024

DoubleSparse reduces KV-cache memory access in LLM inference by combining token sparsity (attending only to the most important cached tokens at each decoding step) with channel sparsity (using a small set of important feature channels, identified by offline calibration, to find those tokens). The algorithm was later merged into SGLang.
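
The sketch below illustrates the core idea for a single decoding step. It is a simplified, hedged example: the function name, its arguments, and the calibration stand-in are illustrative and do not reflect DoubleSparse's actual API or kernels. Approximate attention scores are computed from a few "label" channels of the keys, the top-k tokens under those scores are selected, and exact attention runs only over the selected tokens:

```python
import torch

def sparse_attention_step(q, K, V, label_channels, top_k):
    """Illustrative token-sparse attention guided by channel sparsity.

    q: (d,) query for the current decoding step
    K: (n, d) cached keys; V: (n, d) cached values
    label_channels: indices of a few important feature channels
        (in Double Sparsity these come from offline calibration)
    top_k: number of cached tokens to attend to exactly
    """
    d = q.shape[-1]
    # Channel sparsity: estimate attention scores from only the label
    # channels, so just a thin slice of K is read from memory.
    approx_scores = K[:, label_channels] @ q[label_channels]
    # Token sparsity: keep the top-k tokens under the approximate scores.
    idx = torch.topk(approx_scores, k=min(top_k, K.shape[0])).indices
    # Exact attention restricted to the selected tokens.
    scores = (K[idx] @ q) / d**0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ V[idx]

# Toy usage (random tensors; the channel pick is a stand-in for calibration):
n, d = 4096, 128
q, K, V = torch.randn(d), torch.randn(n, d), torch.randn(n, d)
labels = torch.topk(K.abs().mean(dim=0), k=16).indices
out = sparse_attention_step(q, K, V, labels, top_k=256)
```

Only the label-channel slice of K and the top-k rows of K and V are touched, which is where the memory-access savings come from; the actual implementation additionally stores the label channels contiguously in a separate label cache to avoid strided reads.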