Publications

Selected first-author publications are highlighted on the homepage. This page includes those papers together with additional collaborative work across LLM serving, sparse attention, and video generation.

OSDI 2026 · GPU Networking · Transport Layer

UCCL

An extensible software transport layer for GPU networking.

NeurIPS 2025 · Benchmark · World Models

WorldModelBench

Judging video generation models as world models.

NeurIPS 2025 Spotlight · Adaptive Sparsity · Long Context

Twilight

Adaptive attention sparsity with hierarchical top-p pruning.

ICML 2025 · Semantic Sparsity · Sparse Attention

HashAttention

Semantic sparsity for faster inference.

Sparse Attention · KV Cache · LLM Inference

Post-Training Sparse Attention with Double Sparsity

Sparse attention for reducing KV-cache bandwidth in LLM inference.

MLSys 2024 · LoRA Serving · CUDA Kernels

S-LoRA

Serving thousands of concurrent LoRA adapters.

Data Quality · Benchmark Contamination · Evaluation

Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

Decontamination and benchmark overlap analysis for language models.
