This work studies benchmark contamination under rephrasing and motivates better dataset hygiene and evaluation practices for language models.
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica
Jan 1, 2024
