Tecling » The universe is not perfect, but it's working on it.
Technologies for Linguistic Analysis

November 6, 2022
Randall: a script to sort a list in random order

Sometimes our students need to sort things randomly and they usually don't have a simple method to do it, or when they do, it is something like a web page full of advertisements. Now, using this script, you can paste a list of words, lines or anything else and it will return the same material in random order.
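For readers who prefer to do this offline, the same idea can be sketched in a few lines of Python. This is only an illustration and not the code behind Randall; the function name shuffle_lines and the optional seed parameter are assumptions made for the example.

```python
import random

def shuffle_lines(text, seed=None):
    """Return the non-empty lines of `text` in random order."""
    # One item per line; blank lines are dropped
    items = [line.strip() for line in text.splitlines() if line.strip()]
    rng = random.Random(seed)  # passing a seed makes the order reproducible
    rng.shuffle(items)         # shuffles the list in place
    return items

if __name__ == "__main__":
    sample = "casa\nperro\ngato\nlibro"
    print("\n".join(shuffle_lines(sample)))
```

Leaving seed as None gives a different order on each run, which is usually what a student wants.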

November 2, 2022
We are already halfway through the seminar on Corpus-Based Lexicography

Today is the third day of the seminar on corpus-based lexicography at the Facultad de Filosofía y Letras of the Universidad Nacional de Cuyo, Mendoza. So far we have talked about what a corpus is and how to work with corpora in a lexicographic project. Today we will be talking about management systems for lexical and terminological databases.
Tomorrow, Thursday, we will dive into the automated processing of linguistic data.

October 27, 2022
The date of the Seminar on Corpus-Based Lexicography is approaching

Organized by the Universidad Nacional de Cuyo, in Mendoza, Argentina, the Seminario de Lexicografía Basada en Corpus will begin on October 31, 2022 at the Facultad de Filosofía y Letras and will continue all that week until November 4, 2022.
It will be taught by Irene Renau and Rogelio Nazar, and will offer an overview of methods and techniques for exploiting text corpora for lexicographic and terminological purposes.
It will be held in person only. For information and contact:
cursosposgrado@ffyl.uncu.edu.ar / +54 261 4494168

October 19, 2022
We are going mixlingual

Yes, after some thought we decided it is best to go mixlingual, i.e., we are going to mix content in Spanish and English. The decision process is very simple, as the flowchart shows: if what we have to say is relevant for a general, international audience, then we use English. In contrast, if the news is relevant only for local audiences or specifically directed at Spanish-speaking people, we will write in Spanish. We find this more reasonable than trying to offer translations of all content.

September 28, 2022
We have a new paper on Spanish orthography

How lovely is the smell of a just-printed paper in the morning! Today we woke up with the news that the following article is already published (in Spanish): Renau, I., Nazar, R. y Díaz, L. (2022). La Ortografía de la lengua española (2010) y su impacto en la prensa de cinco países hispanohablantes. Normas, 12, 91-109, doi: 10.7203/Normas.v12i1.25102

July 15, 2022
Six of our students presented their theses

Extremely talented young people working with us... This Friday, six of our students presented their theses, after a year's hard work. Pedro Bolbarán, Camila Pérez, Bahony Saavedra, Gabriela Cacciuttolo, Héctor Ramos and Javiera Silva are seen smiling at the camera after the defense, surrounding a very proud adviser. They were supposed to write undergraduate theses, but their work looks more like PhD theses! The manuscripts will soon be available online at the library of PUCV.cl

July 13, 2022
We presented a new paper at Euralex 2022

Irene Renau presented the paper "Towards a multilingual dictionary of discourse markers: automatic extraction of units from parallel corpus" at the EURALEX 2022 Congress, held in Mannheim, Germany. The talk described our project Dismark, the multilingual database of discourse markers, which is now in the process of becoming a dictionary.
The paper is available here.

June 20, 2022
We just presented a paper at Terminology in the 21st century

Today, at 16:45 Central European Time (that is, 10:45 Chilean time) Rogelio Nazar and David Lindemann presented the talk Terminology extraction using co-occurrence patterns as predictors of semantic relevance in the workshop on Terminology in the 21st century: many faces, many places (Term 21), co-located with LREC 2022 in Marseille, France.
The paper is available here.
The program of the workshop is available here, and the full Proceedings are available as well.

June 16, 2022
We delivered an online presentation at DISROM 7

Today, Irene Renau, Rogelio Nazar and Hernán Robledo presented the talk Automatic extraction of discourse markers from parallel corpus at the Discourse Markers in Romance Languages Conference (DISROM 7). The conference is taking place these days (16-18 June 2022) in Craiova, Romania, but it is also being broadcast online. There are many other interesting presentations that you can follow.

June 6, 2022
We have a new toy: Clusterre

This script is intended as a friendly interface to R's clustering function. It will create dendrograms from several (up to nine) lists of items, as well as the data matrix if you want to use it for something else. There must be only one item per line. Any other information will be ignored, as these are treated as binary values only. The result is the matrix and the dendrogram.
Happy clustering!
What's new? You can now name the objects using the first element in each list as header. Just remember to click on the checkbox if those are your intentions.
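To picture the binary data matrix mentioned above, here is a minimal Python sketch. It is not Clusterre's code, which relies on R. The function names are hypothetical, and the Jaccard distance is an assumption chosen for illustration, since the post does not say which distance the tool uses. We also assume that the objects being clustered are the lists and that the items act as presence/absence features.

```python
def binary_matrix(lists):
    """Build a presence/absence matrix: one row per list, one column per item.

    `lists` maps a list name to its items (one item per input line).
    Returns the sorted item vocabulary and a dict of name -> 0/1 vector.
    """
    sets = {name: set(values) for name, values in lists.items()}
    items = sorted(set().union(*sets.values()))
    matrix = {name: [1 if item in members else 0 for item in items]
              for name, members in sets.items()}
    return items, matrix

def jaccard_distance(u, v):
    """Jaccard distance between two 0/1 vectors of equal length."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    if either == 0:
        return 0.0  # two empty lists are treated as identical
    return 1.0 - both / either

if __name__ == "__main__":
    items, matrix = binary_matrix({
        "A": ["sol", "luna", "mar"],
        "B": ["sol", "luna", "rio"],
        "C": ["pan", "queso"],
    })
    for name, row in matrix.items():
        print(name, row)
    print(jaccard_distance(matrix["A"], matrix["B"]))
```

A table of such pairwise distances is the kind of input a hierarchical clustering routine needs in order to draw a dendrogram.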

May 7, 2022
MANDINGA is back!

Mandinga, our dear old word sense induction algorithm, is now back online, after many years forgotten. Given an input word, it tells you whether that unit is polysemous and, if so, it produces a list of the possible senses. Of course, it does not use any lexicographic resource. It does it all using only corpora and graph-based co-occurrence algorithms.
Update (May 9, 2022): At this moment the system is available in Spanish, English and French.
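As a rough idea of what graph-based sense induction can look like, here is a toy Python sketch. It is not Mandinga's algorithm, whose details are not described in this post. It assumes a simple strategy: link the context words that appear together in sentences containing the target word, and treat each connected component of that graph as a candidate sense. The function name, the whitespace tokenization and the stopword list are all assumptions made for the example.

```python
from collections import defaultdict

def induce_senses(sentences, target, stopwords=frozenset()):
    """Group the context words of `target` into candidate senses."""
    graph = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.lower().split())
        if target not in words:
            continue
        context = words - {target} - set(stopwords)
        # Link every pair of context words seen in the same sentence
        for word in context:
            graph[word].update(context - {word})
    # Each connected component of the graph is one candidate sense
    senses, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        senses.append(sorted(component))
    return senses

if __name__ == "__main__":
    corpus = [
        "the bank approved the loan",
        "the loan has high interest at the bank",
        "we sat on the river bank",
        "the river bank was muddy",
    ]
    stop = {"the", "has", "at", "we", "on", "was"}
    for sense in induce_senses(corpus, "bank", stop):
        print(sense)
```

Under this simplification, a word would be flagged as polysemous when more than one component is found. A real system needs large corpora and some form of frequency weighting to keep noise from merging or splitting senses.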

May 6, 2022
NEOPTER: identification of neologisms in a list of Spanish words

This is what happens when we have a lot of paperwork to do: we divert our efforts to the creation of new demos. Now we have Neopter, a little script that takes a list of Spanish words and identifies those that had not been used prior to 2012. Enjoy (in moderation).
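The basic idea of a date-based neologism filter can be sketched as follows. This is not Neopter's implementation. It assumes that we already have a table with the first year in which each word is attested in some dated reference corpus, and that unattested words count as neologisms. The function name and the table are hypothetical.

```python
def split_by_first_use(words, first_year, cutoff=2012):
    """Split `words` into (neologisms, established).

    `first_year` maps a lowercase word to the earliest year in which it is
    attested; a word with no entry is assumed to be a neologism.
    """
    neologisms, established = [], []
    for word in words:
        year = first_year.get(word.lower())
        if year is None or year >= cutoff:
            neologisms.append(word)    # not attested before the cutoff
        else:
            established.append(word)   # attested before the cutoff
    return neologisms, established
```

The hard part in practice is building a reliable first_year table, which requires a large corpus with trustworthy dates.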

April 30, 2022
GEOMOT: a script to find the distribution of Spanish words per country

We have a new product on display today. Geomot is a script that will accept one or more Spanish words and will tell you in which countries they are most frequently used. We find it useful for different types of Spanish lexicographic projects. The next step will be to do the same in Arabic, as this language shares with Spanish the phenomenon of wide geographical lexical variation. http://www.tecling.com/geomot
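One simple way to compare usage across countries is to normalize raw counts by the size of each country's corpus. The Python sketch below illustrates that idea only. It is not Geomot's code, and the function name, the input dictionaries and the per-million normalization are assumptions made for the example.

```python
def rank_countries(word_counts, corpus_sizes, per=1_000_000):
    """Rank countries by relative frequency of a word.

    `word_counts` maps a country to the number of occurrences of the word;
    `corpus_sizes` maps a country to the total number of tokens collected
    for it. Frequencies are expressed per `per` tokens.
    """
    rates = {
        country: word_counts.get(country, 0) * per / size
        for country, size in corpus_sizes.items()
        if size > 0  # skip countries with no data
    }
    return sorted(rates.items(), key=lambda pair: pair[1], reverse=True)
```

Normalizing matters because a country with a small corpus may show few raw occurrences of a word that is in fact very typical there.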

April 23, 2022
MORFOL: a new Spanish morphological analyzer

Morfol is a brand new morphological analyzer for Spanish that we are using for the categorization of terms and neologisms. It accepts a list of terms/words/neologisms (one per line) and then proceeds to classify them. It will produce the grammatical category, grammatical gender and also, by trying to identify prefixes and suffixes, the internal morphological structure. Give it a try and tell us what you think about it: http://www.tecling.com/morfol
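The affix identification step can be pictured with a deliberately naive Python sketch. It is not Morfol's method. The short affix lists, the function name and the minimum stem length are assumptions made for the example, and a real analyzer would need far larger inventories plus a way to reject false matches.

```python
# Tiny illustrative affix lists (real inventories are much larger)
PREFIXES = ("anti", "des", "super", "re")
SUFFIXES = ("mente", "ción", "ismo", "ista")

def segment(word, min_stem=3):
    """Split a word into (prefix, stem, suffix); missing parts are ''."""
    word = word.lower()
    # Take the first listed prefix that leaves a long enough remainder
    prefix = next((p for p in PREFIXES
                   if word.startswith(p) and len(word) - len(p) >= min_stem), "")
    rest = word[len(prefix):]
    # Same for the suffix, applied to what is left after the prefix
    suffix = next((s for s in SUFFIXES
                   if rest.endswith(s) and len(rest) - len(s) >= min_stem), "")
    stem = rest[:len(rest) - len(suffix)]
    return prefix, stem, suffix

if __name__ == "__main__":
    for w in ["antiterrorismo", "rápidamente", "casa"]:
        print(w, segment(w))
```

Plain string matching like this overgenerates (it would find a prefix "re" in "reloj", for instance), which is why checking candidate stems against a lexicon or corpus is needed in practice.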

Tools & demos

We have implemented different types of applications and most of them can be tested online. Take a look.

+ Compare: a simple script to compare two lists of words

+ Cryptoman: a script to generate cryptograms

+ Dismark: a multilingual taxonomy of discourse markers (new!)

+ Dsele: a model dictionary for ELE learners

+ Estilector: computer assisted writing for Spanish

+ GeNom: a program to detect the gender of proper nouns

+ HAT: a project for the treatment of polysemy in lexical taxonomies

+ Jaguar: a tool for statistical corpus analysis

+ Kind: a lexical taxonomy induction algorithm

+ Kwico: a concordancer for big corpora

+ Lealem: a reading pacer for parallel German-Spanish texts

+ Leafran: a reading pacer for parallel French-Spanish texts

+ Linguini: a language detector

+ Neven: a program to detect eventive nouns

+ POL: named entity recognition and classification

+ Poppins: a supervised text classifier

+ Porcus: an interface for various taggers and parsers for Spanish

+ pullPOS: a project for the detection of plurals in Spanish

+ Randall: a list randomizer (new)

+ Readeutsch: a reading pacer for parallel German-English texts

+ Sapo: a program to detect similarities between documents

+ Sicam: a program to analyze Spanish poetry

+ Termout: a terminology extraction system

+ Termoutling: an automatic linguistics glossary

+ TEXT·A·GRAM: a program to analyze Spanish texts

+ Verbario: corpus pattern analysis in Spanish


This is the view from where we are located, at the Sausalito lagoon, a quiet and lovely place in Viña del Mar, Chile. Sunny days. Birds can be seen in the center of the lagoon.

As researchers, we are currently affiliated with:
Pontificia Universidad Católica de Valparaíso
Instituto de Literatura y Ciencias del Lenguaje

Av. El Bosque 1290, Viña del Mar, Chile

Upcoming Events

31 October - 4 November, 2022 (THE EVENT HAS BEEN POSTPONED BY THE ORGANIZERS: The original dates were 31-19 August, 2022): Irene Renau and Rogelio Nazar will be teaching a postgraduate course titled "Lexicografía Basada en Corpus" (Corpus-based lexicography) at the Universidad Nacional de Cuyo (Mendoza, Argentina).

Latest ideas & research projects

We are developing new projects in computational linguistics and natural language processing:

+ Fondecyt Regular (2019-2021): "Polisemia regular de los sustantivos del español: análisis semiautomático de corpus, caracterización y tipología" (Regular polysemy of nouns in Spanish: semiautomatic corpus analysis, characterization and typology). Lead researcher: Irene Renau. Ref.: 1191204.

+ Fondecyt Regular (2019-2021): "Inducción automática de taxonomías de marcadores discursivos a partir de corpus multilingües" (Automatic induction of taxonomies of discourse markers from multilingual corpora). Lead researcher: Rogelio Nazar. Ref.: 1191481.

+ Ecos-Sud (International Project between Chile and France): "Inducción automática de taxonomías del español y el francés mediante técnicas cuantitativas y estadística de corpus" (Automatic induction of Spanish and French taxonomies using quantitative techniques and corpus statistics). Lead researcher: Irene Renau. Ref.: C16H02.

+ Fondecyt Regular: "Desarrollo de la competencia terminológica a lo largo de la inserción disciplinar" (Development of terminological competence throughout disciplinary integration). Lead researcher: Sabela Fernández. Co-researcher: Rogelio Nazar. Ref.: 11121597.

+ See more.

Recent publications

+ Renau, I.; Nazar, R. (2022). Towards a multilingual dictionary of discourse markers: automatic extraction of units from parallel corpus. In: Klosa-Kückelhaus, A.; Engelberg, S.; Möhrs, C.; Storjohann, P. Dictionaries and Society. Proceedings of the XX EURALEX International Congress, Mannheim: IDS-Verlag, pp. 262-272. PDF

+ Nazar, R.; Lindemann, D. (2022). Terminology extraction using co-occurrence patterns as predictors of semantic relevance. Proceedings of the TERM21 Workshop. Language Resources and Evaluation Conference (LREC 2022), Marseille, 20-25 June 2022, pp. 26-29. PDF

+ Nazar, R. (2021). "Inducción automática de una taxonomía multilingüe de marcadores discursivos: primeros resultados en castellano, inglés, francés, alemán y catalán". Procesamiento del Lenguaje Natural, núm 67, pp. 127-138. PDF

+ Nazar, R. (2021). "Automatic induction of a multilingual taxonomy of discourse markers". Iztok Kosem et al. (eds.) Electronic lexicography in the 21st century: post-editing lexicography. Lexical Computing CZ s.r.o., Brno, pages 440-454. PDF

+ Castro, A.; Nazar, R.; Renau, I. (2021). "New verbs and dictionaries: a method for the automatic detection of neology in Spanish verbs". International Journal of Lexicography, ...

+ Nazar, R.; Renau, I., Acosta, N., Robledo, H., Soliman, H., Zamora, S. (2021). "Corpus-Based Methods for Recognizing the Gender of Anthroponyms". Names: A Journal of Onomastics, vol. 69 num. 3. PDF

+ See more.

Solutions for text processing

It is critical for organizations to have the ability to process information automatically, and very often that information is contained in documents to be read by humans rather than machines. We have different methods for text processing depending on the goal.

We can help by teaching people how to automate their text processing routines. We can batch-process thousands of documents to extract information from them or to derive different types of statistics. We can also modify these documents, or generate databases or email correspondence based on information extracted from them. Anything that involves intelligent management of information can benefit from some degree of automation, which frees up time, effort and resources.

Tell us what your needs are and we will show you what we can do about them.