Publications
Selected first-author publications are highlighted on the homepage. This page lists those papers together with additional collaborative work across GPU networking, LLM serving, sparse attention, video generation, and model evaluation.

OSDI 2026
GPU Networking
Transport Layer
UCCL
An extensible software transport layer for GPU networking.

NeurIPS 2025
Benchmark
World Models
WorldModelBench
Judging video generation models as world models.

NeurIPS 2025 Spotlight
Adaptive Sparsity
Long Context
Twilight
Adaptive attention sparsity with hierarchical top-p pruning.

ICML 2025
Semantic Sparsity
Sparse Attention
HashAttention
Semantic sparsity for faster inference.

Sparse Attention
KV Cache
LLM Inference
Post-Training Sparse Attention with Double Sparsity
Sparse attention for reducing KV-cache bandwidth in LLM inference.

MLSys 2024
LoRA Serving
CUDA Kernels
S-LoRA
Serving thousands of concurrent LoRA adapters.

Data Quality
Benchmark Contamination
Evaluation
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
Decontamination and benchmark overlap analysis for language models.