Comparing Deep Neural Network (DNN) models by their performance on unseen data is crucial for progress in the NLP field. However, these models have a large number of hyper-parameters and, being non-convex, their convergence point depends on the random values chosen at initialization and during training. Proper DNN comparison hence requires comparing their empirical score distributions on unseen data, rather than single evaluation scores as is standard for simpler, convex models. In this paper, we propose adapting to this problem a recently proposed test for the Almost Stochastic Dominance relation between two distributions. We define the criteria for a high-quality comparison method between DNNs, and show, both theoretically and through analysis of extensive experimental results with leading DNN models for sequence tagging tasks, that the proposed test meets all criteria while previously proposed methods fail to do so. We hope the test we propose here will set a new working practice in the NLP community.
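To make the idea of comparing score distributions rather than single scores concrete, here is a minimal Python sketch. It is not the test proposed in the paper: it only computes a point estimate of how strongly one model's empirical quantile function falls below the other's, with no significance test or confidence bound. The function name `violation_ratio`, the `grid_size` parameter, the 0.5 convention for identical samples, and the example scores are all assumptions made for illustration.

```python
import numpy as np


def violation_ratio(scores_a, scores_b, grid_size=1000):
    """Share of the squared quantile gap on which model A scores below model B.

    0.0 means A's empirical quantiles are never below B's (A dominates B),
    1.0 means the reverse. This is a point estimate only.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    # Evaluate both empirical quantile functions on a shared grid inside (0, 1).
    t = (np.arange(grid_size) + 0.5) / grid_size
    qa = np.quantile(a, t)
    qb = np.quantile(b, t)
    gap_sq = (qa - qb) ** 2
    total = gap_sq.sum()
    if total == 0.0:
        # Identical quantiles: neither model dominates (convention chosen here).
        return 0.5
    # Keep only the part of the gap where A is strictly worse than B.
    return float(gap_sq[qa < qb].sum() / total)


# Made-up dev-set scores from several random seeds (illustrative only).
model_a = [0.912, 0.905, 0.918, 0.909, 0.915]
model_b = [0.901, 0.910, 0.897, 0.906, 0.903]
print(violation_ratio(model_a, model_b))
```

A value near 0 suggests that model A is at least as good as model B across almost the whole score distribution, while a value near 0.5 suggests that neither model clearly dominates. A decision in practice would also need to account for sampling noise, which this sketch ignores.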
|Title of host publication||ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference|
|Publisher||Association for Computational Linguistics (ACL)|
|Number of pages||13|
|State||Published - 2020|
|Event||57th Annual Meeting of the Association for Computational Linguistics, ACL 2019 - Florence, Italy (Duration: 28 Jul 2019 → 2 Aug 2019)|
|Name||ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference|
|Conference||57th Annual Meeting of the Association for Computational Linguistics, ACL 2019|
|Period||28/07/19 → 2/08/19|
Bibliographical note
Publisher Copyright:
© 2019 Association for Computational Linguistics.
ASJC Scopus subject areas
- Language and Linguistics
- Computer Science (all)
- Linguistics and Language