CleanCoNLL

CleanCoNLL is a nearly noise-free dataset for named entity recognition (NER). Use it to train and evaluate your NER models!

The classic CoNLL-03 dataset is arguably the most commonly used dataset for evaluating named entity recognition (NER) approaches. However, prior work has found that many labels in this dataset are incorrect. This makes it impossible to use CoNLL-03 to evaluate NER models fairly.

With CleanCoNLL, we present a significantly improved version of CoNLL-03. We semi-automatically corrected over 7% of all NER labels in CoNLL-03. The resulting dataset is nearly noise-free.

You can use this resource to train and evaluate your state-of-the-art NER model!

Getting Started
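
This is not the official setup procedure, only a minimal sketch of how a CoNLL-style column file might be read once the dataset is available locally. It assumes one token per line with whitespace-separated columns, the token in the first column and the NER tag in the last, blank lines between sentences, and optional `-DOCSTART-` marker lines. The helper name `read_conll` and the toy sample are invented for illustration, so check the actual column layout of your files before relying on it.

```python
from typing import Iterable, Iterator, List, Tuple

# A sentence is a list of (token, ner_tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL-style column data.

    Assumption: the first column holds the token and the last column
    holds the NER tag; any columns in between are ignored.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    # Emit the last sentence if the data does not end with a blank line.
    if sentence:
        yield sentence


# Toy sample invented for illustration (not taken from the dataset).
sample = """-DOCSTART- O

Alice B-PER
visited O
Berlin B-LOC

Bob B-PER
Smith I-PER
stayed O
""".splitlines()

sentences = list(read_conll(sample))
for sent in sentences:
    print(sent)
```

To read a real file, pass an open file handle instead of `sample`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points at your local copy.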

Publication

CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset. Susanna Rücker and Alan Akbik. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023.