Technologies for Linguistic Analysis

January 29, 2024
Seminar in Santiago de Compostela

Irene Renau and Rogelio Nazar gave a seminar talk, open to the public, at the Faculty of Philology of the Universidad de Santiago de Compostela, titled 'Desambiguación semántica y extracción de patrones verbales para la base de datos léxica Verbario' (Semantic disambiguation and verb pattern extraction for the Verbario lexical database), as part of Fondecyt project 1231594, led by Irene Renau and funded by ANID, Government of Chile.

It was a privilege to share with the audience the results of very recent work, on which we will be reporting further soon through this and other channels. The audience's reaction was enthusiastic and the feedback extensive and very useful. Thank you!

December 31, 2023
And we say goodbye to 2023 with yet another paper

Nice surprise to end the year: a new paper on modal operators (in Spanish):
Obreque, J.; Nazar, R. (2023). Detección de operadores modales: una primera exploración en castellano. Linguamatica. 15(2): 37-49.

ABSTRACT: This article presents a mixed methods approach — with emphasis on the quantitative side — for the detection and recording of modal operators. These units are defined as a broad and heterogeneous set of expressions used in written and oral communication to imprint the subjective vision of the writers/speakers in their own utterance. The present proposal is based on the exploitation of a parallel corpus to augment with quantitative means an initial list of examples obtained in a qualitative stage. The methodology is simple, effective, and language independent, although in this first test we focus on the Spanish language.

There is also a companion website (in English) with documentation, the code and data used in the experiments:

More on that later...

December 28, 2023
We have a new paper about Termout.org

Nazar, R.; Acosta, N. (2023). A Lightweight Statistical Method for Terminology Extraction. Journal of Computer-Assisted Linguistic Research, 7:43-59.

ABSTRACT: We propose a method for the task of automatic terminology extraction in the context of a larger project devoted to the automation of part of the tasks involved in the production of terminological databases. Terminology extraction is the key to drafting the macrostructure of a terminological resource (i.e., the list of entries), to which information can be later added at the microstructural level with grammatical or semantic information. To this end, we developed a statistical method that is conceptually simple compared to modern neural network approaches. It is a lightweight method because it is based on term dispersion and co-occurrence statistics that can be computed with basic hardware. For the evaluation, we experimented with corpora of lexicography and linguistics in English and Spanish of ca. 66 million tokens. Results improve baselines in almost 20%.

December 4, 2023
We released the regex Perl script

We published the Regex script, which we have used for years in classes to teach how to apply regular expressions in Perl. It is a very simple script that opens a text file and prints lines or segments based on a regular expression provided by the user. It is basically a grep, but you can easily adapt it to different situations.
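
The published script is written in Perl and is not reproduced here. As a rough illustration of the same idea, the following is a minimal Python sketch that prints either whole lines or only the matched segments. The names `matching_segments` and `grep_file` and the `whole_line` switch are assumptions made for this example; they are not taken from the Regex script itself.

```python
import re
from typing import Iterable, Iterator, List


def matching_segments(lines: Iterable[str], pattern: str,
                      whole_line: bool = True) -> Iterator[str]:
    """Yield matching lines, or only the matched segments if whole_line is False."""
    regex = re.compile(pattern)
    for line in lines:
        line = line.rstrip("\n")
        if whole_line:
            # grep-like behaviour: keep the line if the pattern occurs anywhere in it
            if regex.search(line):
                yield line
        else:
            # segment mode: yield every non-overlapping match found in the line
            for match in regex.finditer(line):
                yield match.group(0)


def grep_file(path: str, pattern: str, whole_line: bool = True) -> List[str]:
    """Hypothetical helper: apply matching_segments to a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return list(matching_segments(handle, pattern, whole_line))
```

For instance, `grep_file("corpus.txt", r"gat\w+", whole_line=False)` would return only the matched words from a (hypothetical) file `corpus.txt`, while the default call returns the full lines that contain them.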


November 28, 2023
We just finished the very first corpus linguistics course at PUCV.cl

The first edition of the Corpus Linguistics course, from the postgraduate program in Linguistic Studies (LIN1000 Lingüística de Corpus), finished successfully yesterday. It is the first of its kind in the history of Pontificia Universidad Católica de Valparaíso, and was taught by Irene Renau and Rogelio Nazar.

Seven students presented as many great papers. From right to left in the picture: Skarlett Ramirez studied the distribution of vocabulary of men and women in the lyrics of music genres; Annais Quintana studied the vocabulary richness of investiture speeches of Chilean presidents; Yvone Laines did a quantitative comparison of discursive properties between natural and artificial texts; Constanza Suy compared the metaphorical use of verbs in specialized and non-specialized discourse; Felipe Sánchez studied the semantic prosody of ethnicity-denoting nouns; Francisca Calderon studied the properties of online hate speech; and Javiera Ahumada investigated the vocabulary shared by media of different ideologies regarding specific political issues.

Great job, guys. We'd love to see these papers published soon.

November 22, 2023
WOPATEC is happening right now!

The 2023 edition of the Workshop for Automatic Text Processing (WOPATEC) is taking place right now at the Universidad Católica de la Santísima Concepción (Chile), co-located with the IV Congreso Internacional (ALES 2023) and in a hybrid format. Many researchers from Chile and other countries are presenting their NLP-related work.
More updates will be available at the website:
The picture shows some of the people who are physically present at the venue. Many others are watching and participating from the clouds...

November 10, 2023
Javier Obreque and Ignacio Lobos present their research on automated writing feedback tools

At the V Congreso en Docencia en Educación and I Congreso Latinoamericano y del Caribe de Innovación en Investigación en Educación Superior 2023, currently taking place at the Universidad de La Serena, Javier Obreque and Ignacio Lobos presented the first results of a research project that describes the scope of the feedback provided by automated writing feedback tools (Estilector among them) and compares it with that of a generative artificial intelligence. The ongoing project is funded by the Dirección de Innovación e Investigación Aplicada of Instituto Profesional Duoc UC, where both of them teach.
The photo shows the full team: Marjory Astudillo, Ignacio Lobos, Javier Obreque and Karin Arismendi.

November 3, 2023
The Routledge handbook of Spanish lexicography is out

The handbook «Lexicografía hispánica» (The Routledge Handbook of Spanish Lexicography), edited by Sergi Torner, Paz Battaner and Irene Renau, has just been published. It comprises 44 chapters by 60 authors from the Americas, Europe and Asia, covering the state of the art of Spanish-language lexicography, both theoretical and practical. A couple of members of the Tecling group are among the authors.

October 27, 2023
Javier Obreque's thesis wins the ALED international competition

We are tremendously proud to announce that Javier Obreque, a member of the Tecling group's old guard, has won first prize in the Anamaría Harvey 2023 award, the international competition of ALED (Asociación Latinoamericana de Estudios del Discurso), for his Master's thesis titled "Una propuesta metodológica para la detección de operadores modales en lengua castellana". The research sits at the crossroads between discourse analysis and computational linguistics. It was supervised by Rogelio Nazar, who now has two graduate students with international awards. We are going to have to open a bottle of sparkling wine.

October 20, 2023
New web interface for Poppins, our text classifier

This was on our to-do list for many, many years, but you know what they say: better late than never. We have a new web interface for Poppins, our dear old text classifier:


Poppins has been online since 2005, and one of the reasons it never became very popular is that its web interface was not very user-friendly. Now it's much easier to use and also better documented.

The program lets you classify documents with any criteria. You just have to train it, and as you may have guessed, training here means showing the program some examples of texts already classified in a number of categories.
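
This page does not describe how Poppins works internally, so the following is only a generic illustration of what "training with examples" means for a supervised text classifier. It is a minimal Python sketch that builds a word-count profile per category and assigns a new text to the most similar profile. The names `train` and `classify` and the cosine comparison of word counts are assumptions chosen for this example, not Poppins's actual algorithm.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def train(examples: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Build one word-count profile per category from (text, category) pairs."""
    profiles: Dict[str, Counter] = defaultdict(Counter)
    for text, category in examples:
        profiles[category].update(tokenize(text))
    return dict(profiles)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the category whose profile is most similar to the text."""
    vector = Counter(tokenize(text))
    return max(profiles, key=lambda category: cosine(vector, profiles[category]))
```

With a handful of labeled texts passed to `train`, a call such as `classify("a goal in the final match", profiles)` picks the category whose training examples share the most vocabulary with the new text. Real classifiers typically weight words and need far more training data than a toy example.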

We think it's pretty cool, and in all these years we haven't seen anything like it. Give it a try and tell us what you think. (We only changed the interface: the inner workings of the program continue to be exactly the same.)

Tools & demos

We have implemented different types of applications and most of them can be tested online. Take a look.

+ Compare: a simple script to compare two lists of words

+ Cryptoman: a script to generate cryptograms

+ Dismark: a multilingual taxonomy of discourse markers

+ Dsele: a model dictionary for ELE learners

+ Estilector: computer assisted writing for Spanish

+ GeNom: a program to detect the gender of proper nouns

+ HAT: a project for the treatment of polysemy in lexical taxonomies

+ Jaguar: a tool for statistical corpus analysis

+ Kind: a lexical taxonomy induction algorithm

+ Kwico: a concordancer for big corpora

+ Lealem: a reading pacer for parallel German-Spanish texts

+ Leafran: a reading pacer for parallel French-Spanish texts

+ Linguini: a language detector

+ Neven: a program to detect eventive nouns

+ POL: named entity recognition and classification

+ Poppins: a supervised text classifier (new interface!)

+ Porcus: an interface for various taggers and parsers for Spanish

+ pullPOS: a project for the detection of plurals in Spanish

+ Punkt: punctuation of discourse markers in Spanish

+ Randall: a list randomizer

+ Readeutsch: a reading pacer for parallel German-English texts

+ Regex: a Perl script for regular expressions (new!)

+ Sapo: a program to detect similarities between documents

+ Sicam: a program to analyze Spanish poetry

+ Termout: a terminology extraction system (new version!)

+ TEXT·A·GRAM: a program to analyze Spanish texts

+ Verbario: corpus pattern analysis in Spanish


This is the view from where we are located, at the Sausalito lagoon, a quiet and lovely place in Viña del Mar, Chile. Sunny days. Birds can be seen in the center of the lagoon.

As researchers, we are currently affiliated with:
Pontificia Universidad Católica de Valparaíso
Instituto de Literatura y Ciencias del Lenguaje

Av. El Bosque 1290, Viña del Mar, Chile

Upcoming Events
[UPDATED: January 30, 2024]

January and February 2024: Irene Renau and Rogelio Nazar are in Santiago de Compostela, Galicia, Spain, for a research stay with the Humboldt group (GI 1920) at Universidade de Santiago de Compostela.

Latest ideas & research projects

We are developing new projects in computational linguistics and natural language processing:

+ Fondecyt Regular (2023-2027): "Mapa de las metáforas conceptuales en sustantivos y verbos del español: un estudio de los patrones metafóricos basado en corpus" (Map of conceptual metaphors in Spanish nouns and verbs: a corpus-based study of metaphorical patterns). Lead researcher: Irene Renau. Co-researcher: Rogelio Nazar. Ref.: 1231594.

+ Fondecyt Regular (2019-2021): "Polisemia regular de los sustantivos del español: análisis semiautomático de corpus, caracterización y tipología" (Regular polysemy of nouns in Spanish: semiautomatic analysis of corpus, characterization and typology). Lead researcher: Irene Renau. Ref.: 1191204.

+ Fondecyt Regular (2019-2021): "Inducción automática de taxonomías de marcadores discursivos a partir de corpus multilingües" (Automatic induction of taxonomies of discourse markers from multilingual corpora). Lead researcher: Rogelio Nazar. Ref.: 1191481.

+ Ecos-Sud (International Project between Chile and France): "Inducción automática de taxonomías del español y el francés mediante técnicas cuantitativas y estadística de corpus" (Automatic induction of Spanish and French taxonomies using quantitative techniques and corpus statistics). Lead researcher: Irene Renau. Ref.: C16H02.

+ Fondecyt Regular: "Desarrollo de la competencia terminológica a lo largo de la inserción disciplinar" (Development of terminological competence over the course of disciplinary integration). Lead researcher: Sabela Fernández. Co-researcher: Rogelio Nazar. Ref.: 11121597.

+ See more.

Recent publications

+ Nazar, R.; Renau, I.; Robledo, H. (In press). Dismark and Text·a·Gram: Automatic identification and categorization of discourse markers in texts. In Proceedings of DISROM 2022 (Discourse Markers in Romance Languages, Craiova, 16-18 June 2022).

+ Obreque, J.; Nazar, R. (2023). Detección de operadores modales: una primera exploración en castellano. Linguamatica. 15(2): 37-49. PDF

+ Renau, Irene. (2023). A corpus-based study of semantic neology of the Covid-19 pandemic. Quaderns de Filologia: Estudis Lingüístics XXVIII: 55-76. PDF

+ Nazar, R. (2023). Extensión, variación y evolución del léxico español. In Battaner, P., Torner, S., Renau, I. Lexicografía hispánica / The Routledge Handbook of Spanish Lexicography. Cap. 14, pp. 204-218.

+ López-Hidalgo, B.; Renau, I.; Nazar, R. (2023). Correlación entre la metáfora orientacional BUENO ES ARRIBA / MALO ES ABAJO y polaridad positiva/negativa en verbos del español: un estudio con estadística de corpus. Humanidades Digitales, Corpus y Tecnología del Lenguaje. University of Groningen Press, pp. 307-323. PDF

+ Nazar, R. & Acosta, N. (2023). Termout: a tool for the semi-automatic creation of term databases. In Haddad, Amal; Terryn, Ayla; Mitkov, Ruslan; Rapp, Reinhard; Zweigenbaum, Pierre and Sharoff, Serge (eds.) Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC), INCOMA, Shoumen, Bulgaria, pp. 9-18. PDF

+ Nazar, R. & Renau, I. (2023). Estilector: un sistema de evaluación automática de la escritura académica en castellano. Revista Perspectiva Educacional, 62(2): 37-59. PDF

+ Robledo, H.; Nazar, R. (2023). A proposal for the inductive categorisation of parenthetical discourse markers in Spanish using parallel corpora. International Journal of Corpus Linguistics. http://doi.org/10.1075/ijcl.20017.rob

+ Renau, I.; Nazar, R. (2022). Towards a multilingual dictionary of discourse markers: automatic extraction of units from parallel corpus. In: Klosa-Kückelhaus, A.; Engelberg, S.; Möhrs, C.; Storjohann, P. Dictionaries and Society. Proceedings of the XX EURALEX International Congress, Mannheim: IDS-Verlag, pp. 262-272. PDF

+ Nazar, R.; Lindemann, D. (2022). Terminology extraction using co-occurrence patterns as predictors of semantic relevance. Proceedings of the TERM21 Workshop. Language Resources and Evaluation Conference (LREC 2022), Marseille, 20-25 June 2022, pp. 26-29. PDF

Solutions for text processing

It is critical for organizations to have the ability to process information automatically, and very often that information is contained in documents to be read by humans rather than machines. We have different methods for text processing depending on the goal.

We can help by teaching people how to automate their text processing routines. We can batch-process thousands of documents to extract information from them or to derive different types of statistics. We can also modify these documents, or generate databases or email correspondence based on information extracted from them. Anything that involves intelligent management of information can benefit from some degree of automation, which frees up time, effort and resources.

Tell us what your needs are and we will show you what we can do about them.