Hi, my name is Shuo Yang.

I build full-stack machine learning systems.

I am a Ph.D. student in EECS at UC Berkeley, advised by Ion Stoica. I work on full-stack machine learning systems, from kernel optimization and efficient system design to text and multimodal algorithms, with the goal of making modern AI workloads efficient on real hardware.


About

I am a member of Sky Computing Lab and LMSYS. My work spans the full stack of machine learning systems: kernel optimization at the hardware-software boundary, efficient system design for large-scale inference and generation, and text and multimodal algorithms that benefit from those systems advances.

I am especially interested in algorithm-system co-design: building methods that are not only theoretically appealing but also practical and efficient when deployed at scale. Recent projects span LLM serving, sparse attention, exact GPU K-Means, and efficient video generation.

Recent highlights include the Amazon AI PhD Fellowship, a research scientist internship at Amazon Neuron Science, and an upcoming research internship at Meta.

Previously, I graduated from the ACM Honors Class at Shanghai Jiao Tong University.

Selected Publications

Flash-KMeans
Fastest K-Means · Kernel Optimization · 500 stars
Fast and memory-efficient exact K-Means designed as a systems primitive.
Quant VideoGen
Quantization · KV Cache · World Model
Long-video generation via 2-bit KV-cache quantization.
Sparse VideoGen2
NeurIPS 2025 Spotlight · Semantic Permutation · Video Generation
Semantic-aware permutation for efficient sparse attention in video generation.
Sparse VideoGen
ICML 2025 · Sparse Attention · Video Generation
Accelerating video diffusion transformers with spatial-temporal sparsity.
BlendServe
ASPLOS 2026 · Offline Inference · LLM Serving
Resource-aware batching for offline inference of autoregressive large models.