Earlier this year, 55 researchers from 44 global institutions proposed GEM (Generation, Evaluation & Metrics), a new benchmark environment for Natural Language Generation. It evaluates models through an interactive result-exploration system, which gives a much clearer picture of model limitations & improvement opportunities without misrepresenting the complex interactions between individual measures.
This is a significant shift away from the widespread practice of using a single metric for #ModelEvaluation. While a single-score system has its advantages, it risks disregarding important considerations such as model size, fairness, and practical usability.