Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica | Jan 1, 2024

This work studies how rephrased versions of benchmark test samples can contaminate training data, and it motivates stronger dataset hygiene and evaluation practices for language models.
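The sketch below shows why rephrasing matters for contamination checks. It assumes a simple word-level n-gram overlap score as the decontamination test. This is a common string-matching baseline, not the authors' detector. The helper names `ngrams`, `ngram_overlap`, and `is_contaminated`, the trigram size, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
# Hypothetical illustration: a naive n-gram overlap contamination check.
# This is NOT the method from the paper; it is an assumed string-matching baseline.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(test: str, train: str, n: int = 3) -> float:
    """Fraction of the test sample's n-grams that also appear in the training sample."""
    test_grams = ngrams(test, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train, n)) / len(test_grams)

def is_contaminated(test: str, train: str, n: int = 3, threshold: float = 0.5) -> bool:
    """Flag a training sample when its overlap with the test sample meets the (assumed) threshold."""
    return ngram_overlap(test, train, n) >= threshold

test_sample = "what is the sum of two and three"
verbatim = "what is the sum of two and three"
rephrased = "compute three plus two"  # same question, different wording

print(ngram_overlap(test_sample, verbatim))   # 1.0
print(ngram_overlap(test_sample, rephrased))  # 0.0
```

The verbatim copy is caught, but the semantically equivalent rephrasing shares no trigrams with the test sample. As a result, a threshold on this score does not flag it.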