Technologies for Linguistic Analysis

The purpose of this research is to develop a methodology for the detection and categorisation of named entities, or proper names (PNs), into three categories: geographical place, person and organisation. The hypothesis is that the context in which the entity occurs (a window of n words before the target), as well as the components of the PN itself, may provide good estimators of the type of PN. To that end, we developed a supervised categorisation algorithm with a training phase in which the system receives a corpus already annotated by another named entity recognition and classification (NERC) system. In these experiments, that system was FreeLing, an open-source suite of language analysers, which was used to annotate the Spanish Wikipedia. During the training phase, the system learns to associate each entity category with the words of the context and with those of the PN itself. We evaluate the results on the CoNLL-2002 corpus and on a corpus of geopolitics from the Spanish edition of Le Monde Diplomatique, and compare them with those of some well-known NERC systems for Spanish.
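
The idea of associating entity categories with context words and name components can be illustrated with a minimal sketch. This is not the POL implementation (which is written in Perl); the function names, the LOC/PER/ORG labels, the toy sentences and the relative-frequency voting rule are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical category labels; the actual tag set used by POL may differ.
CATEGORIES = ("LOC", "PER", "ORG")


def features(tokens, start, end, n=3):
    """Features for the entity tokens[start:end]: the n words before it
    plus the words of the name itself, kept in separate feature spaces."""
    left = tokens[max(0, start - n):start]
    feats = [("ctx", w.lower()) for w in left]
    feats += [("name", w.lower()) for w in tokens[start:end]]
    return feats


def train(examples, n=3):
    """Count how often each feature co-occurs with each category.
    examples: iterable of (tokens, start, end, category)."""
    counts = defaultdict(Counter)
    for tokens, start, end, category in examples:
        for feat in features(tokens, start, end, n):
            counts[feat][category] += 1
    return counts


def classify(counts, tokens, start, end, n=3):
    """Each known feature votes for the categories it was seen with,
    weighted by relative frequency. Returns None if no feature is known."""
    scores = Counter()
    for feat in features(tokens, start, end, n):
        seen = counts.get(feat)
        if not seen:
            continue
        total = sum(seen.values())
        for category in CATEGORIES:
            scores[category] += seen[category] / total
    if not scores:
        return None
    return max(CATEGORIES, key=lambda c: scores[c])


# Toy, hand-made training data standing in for an automatically annotated corpus.
toy = [
    ("El presidente Juan Pérez habló ayer".split(), 2, 4, "PER"),
    ("Viajó a la ciudad de Valparaíso".split(), 5, 6, "LOC"),
    ("Trabaja en la empresa Acme S.A.".split(), 4, 6, "ORG"),
]
model = train(toy)

# "Temuco" was never seen, but its left context resembles that of a place.
print(classify(model, "La ciudad de Temuco".split(), 3, 4))  # LOC with this toy data
```

In this sketch, an unseen name can still be categorised from its left context alone, which is the intuition behind the hypothesis stated above; a real system would need far more data and a more careful scoring function.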

Web demo: http://www.tecling.com/pol

Source code: http://www.tecling.com/pol/source/sourcePol.zip

The archive contains:

  • config.pm: Configuration file. The user needs to adjust its values before execution.
  • poltrain.pl: Script used for training.
  • pol.pl: Script used for the actual processing of new data.
  • convertmodel.pl: Script used to convert the model produced by poltrain.pl to the model that pol.pl needs to work.
Comments in the scripts are currently available only in Spanish.
To train POL and build a new model, you need to have Perl's Storable module installed.

Corpus and models: so far, experiments have only been conducted on Spanish. Models for new languages will be added in the future. If you would like to help, contributions are welcome.

  • WikipediaFreeling.zip (2.6 GB): The training corpus, the Spanish Wikipedia tagged with FreeLing.
  • Model.zip: An example of a model produced after training and conversion, ready to be used with pol.pl.

These models were created on an x86_64 HP ProLiant machine with a GenuineIntel CPU at 1064.000 MHz running Linux (Ubuntu 14.04). If you have a different kind of machine (e.g., a desktop PC running Windows), you will probably need to create the models again using poltrain.pl.

Funding: This research is part of the Fondecyt Project 11140686: “Inducción automática de taxonomías de sustantivos generales y especializados a partir de corpus textuales desde el enfoque de la lingüística cuantitativa” (Automatic taxonomy induction from corpora for terminology and general vocabulary using quantitative measures). Lead researcher: Rogelio Nazar.

Related publications:

  • Nazar, R.; Arriagada, P. (2017). POL: un nuevo sistema para la detección y clasificación de nombres propios. Procesamiento del Lenguaje Natural, n. 58, pp. 13-20.

Related concepts: Named entities, proper names, text linguistics