This dataset was extracted from SIGARRA, the information system of the University of Porto (UP). Each organic unit has its own domain and publishes its own academic news. We collected a sample of 1,000 news articles and manually annotated 905 of them using the Brat rapid annotation tool. The dataset consists of three files. The first is a CSV file containing the news articles published between 2016-12-14 and 2017-03-01. The second is a ZIP archive containing one directory per organic unit, with a text file and an annotations file for each news article. The third is an XML file containing the complete set of news articles in a format similar to that of the HAREM dataset. The dataset is particularly well suited to training named entity recognition models.
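As a rough illustration of how the per-article annotation files might be read, the sketch below parses text-bound entity lines in Brat's standoff format (`ID<TAB>TYPE START END<TAB>TEXT`, with `;` separating the fragments of a discontinuous span). The function name `parse_brat_entities`, the `Entity` record, and the entity labels in the sample are assumptions made for illustration; they are not taken from the dataset itself.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One text-bound annotation (hypothetical record type for this sketch)."""
    ann_id: str
    label: str
    spans: list[tuple[int, int]]  # character offsets into the article text
    text: str


def parse_brat_entities(ann_text: str) -> list[Entity]:
    """Extract entity annotations from the contents of a Brat .ann file."""
    entities = []
    for line in ann_text.splitlines():
        # Text-bound annotations have IDs starting with "T";
        # relations, events, and notes are skipped.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line, ignore it
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = type_and_offsets.partition(" ")
        spans = []
        # Discontinuous annotations list several "start end" pairs separated by ";".
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


if __name__ == "__main__":
    # Made-up sample; the labels are placeholders, not the dataset's tag set.
    sample = "T1\tOrganization 0 21\tUniversidade do Porto\n"
    for entity in parse_brat_entities(sample):
        print(entity.label, entity.spans, entity.text)
```

In practice, the same function could be applied to each annotations file in the ZIP archive, pairing the resulting offsets with the matching text file to build training examples.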