Tecling » The universe is not perfect, but it's working on it.
Technologies for Linguistic Analysis

June 14, 2024
We have a new version of Bifid

We have a new version of Bifid, our dear old statistical parallel corpus aligner.
Bifid is a program that takes a set of documents with their translations (an unaligned parallel corpus) and does several things:

  1. It separates the set of documents into the two languages
  2. It aligns each document with its translation
  3. It aligns the sentences in each pair of documents
  4. It extracts a bilingual vocabulary from the aligned sentences
  5. It exports results in CSV and TMX formats
  6. It imports tmx documents, in case you already have your corpus aligned at the sentence level and what you want is to obtain a bilingual vocabulary.
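
As a rough illustration of step 4, here is a minimal Python sketch of one common way to obtain a bilingual vocabulary from sentence-aligned text: scoring word pairs by the Dice coefficient of their co-occurrence. This is not Bifid's implementation. The function name `extract_vocabulary`, the `min_score` threshold and the toy sentences are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def extract_vocabulary(aligned_pairs, min_score=0.5):
    """Map each source word to its best-scoring target word.

    Hypothetical helper: scores word pairs with the Dice coefficient,
    2 * joint / (source_frequency + target_frequency), counting each
    word at most once per sentence.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    vocabulary = {}
    for (src, tgt), joint in pair_freq.items():
        score = 2 * joint / (src_freq[src] + tgt_freq[tgt])
        best_so_far = vocabulary.get(src, ("", 0.0))[1]
        # Keep only the strongest candidate; ties are broken arbitrarily.
        if score >= min_score and score > best_so_far:
            vocabulary[src] = (tgt, score)
    return vocabulary


# Invented toy corpus, already aligned at the sentence level.
pairs = [
    ("the house", "la casa"),
    ("the dog", "el perro"),
    ("a house", "una casa"),
]
vocabulary = extract_vocabulary(pairs)
print(vocabulary["house"])  # ('casa', 1.0)
```

With so little data most words end up with tied or unreliable scores; an approach like this only becomes useful on corpora with many aligned sentences.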

Bifid had been online since 2004 (yes, it's going to be 20 years now),
but lately its server had gone down and the tool was neglected.
But here it is now, again, restored to its former glory!
We are planning some kind of celebration for its 20th birthday (no one
remembers the actual date, so we will celebrate the whole year).
We will post updates on this soon.

Some (old) publications on the project:
Nazar, R. (2011). Parallel corpus alignment at the document, sentence and vocabulary levels.
Procesamiento del Lenguaje Natural, n. 47.

Nazar, R. (2012). Bifid: un alineador de corpus paralelo a nivel de documento, oración y vocabulario.
Linguamatica, vol. 4, no. 2.

May 31, 2024
We have a new version of Kind

We have a new version of Kind, our lexical taxonomy. This version is based on the Spanish Wiktionary, so at the moment it is only available in Spanish. Versions for other languages will be available here at some point in the future.

This database has 30912 different nouns as entries, which populate a relatively large top-ontology, with 525 semantic types.
Take a look:

You can, for instance, enter any arbitrary common single-noun or a list of them to obtain their hypernymy chains.
You can also navigate categories, for example árbol (tree), and see what kinds of trees we have.
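
To make the idea of a hypernymy chain concrete, here is a small Python sketch that follows "is a kind of" links from a noun up to the top of a taxonomy. The function `hypernymy_chain` and the three toy entries are invented for illustration; they are not taken from the Kind database, and Kind's own data format may differ.

```python
def hypernymy_chain(noun, hypernym_of, max_depth=50):
    """Follow hypernym links upwards, starting from `noun`.

    `hypernym_of` is a hypothetical mapping from each noun to its
    immediate hypernym; nouns without an entry end the chain.
    """
    chain = [noun]
    seen = {noun}
    while chain[-1] in hypernym_of and len(chain) <= max_depth:
        parent = hypernym_of[chain[-1]]
        if parent in seen:  # guard against cycles in the data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


# Toy data, invented for this example.
TOY_TAXONOMY = {
    "roble": "árbol",
    "árbol": "planta",
    "planta": "ser vivo",
}

print(" > ".join(hypernymy_chain("roble", TOY_TAXONOMY)))
# roble > árbol > planta > ser vivo
```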

A lot of work still needs to be done, though. For instance, the category sistema (system) still has some Borgesian vibes: as a list, it is ridiculously heterogeneous. Larger categories should be further subdivided in a logical order. Another limitation is that its only source text is the Wiktionary. It does not extract information from a corpus, as the older version did (still available, by the way).

This is part of a work in progress and there is no documentation yet. It was designed and developed by Rogelio Nazar with the help of Irene Renau, Daniel Mora and Nicolás Acosta. This Spanish taxonomy is just a single cog in a larger project directed by Irene Renau.

May 24, 2024
Irene Renau presented two talks in Murcia

Irene Renau made two presentations today at the X Congreso Internacional de Lexicografía Hispánica, held at Universidad de Murcia, Spain. This time, the title of the conference is «Variación y panhispanismo en lexicografía» (Variation and pan-Hispanism in lexicography). The title of the first talk was 'ChatGPT para la detección de metáforas en sustantivos del español' (ChatGPT for the detection of metaphors in Spanish nouns), which she coauthored with Eduardo Puraivan. The title of the second presentation was 'Extracción de patrones verbales para una base de datos léxica del castellano' (Extraction of verbal patterns for a lexical database of Spanish) and it was based on a paper, currently in progress, with coauthors Rogelio Nazar and Daniel Mora.

May 6, 2024
We started to clean up the Spanish Wiktionary

Compared to what it was 10 years ago, the Spanish Wiktionary is today an amazing resource that can be harnessed for different uses in computational linguistics and other fields. It really shows the power of collaboration, an example of what can be achieved by an army of writers from different backgrounds. However, as it stands on the web, it still bears the marks of its collaborative nature: most people creating this resource are not professional lexicographers, and many of the definitions still lack the touch of commercial dictionaries. Furthermore, it is plagued by elements that should not be there, like proper nouns or very specialized technical terminology. Most importantly, however, as a whole it shows a lack of cohesion and uniformity of style, a natural consequence of having been produced by a heterogeneous group of people. Given these circumstances, we embarked on the task of cleaning it up a little bit, to make it more useful for Spanish NLP applications.

[NEW UPDATE May 26, 2024]
Just when we thought we had finished, new things appeared and had to be cleaned. But now, yes, we think it should be good enough for many purposes:

http://tecling.com/wiktionary/cleanedupWiktSpa-26May2024.zip (2Mb)

The next step will be to prepare a taxonomy of nouns from this. We will report on that soon.
(By the way, this version only includes nouns: if you were here for the verbs, adjectives, etc., you will be disappointed.)

April 8, 2024
We have a new paper at EURALEX 2024!

The paper has the title 'Towards the automatic generation of a pattern-based dictionary of Spanish verbs' and is the result of a collaboration between Daniel Mora, Rogelio Nazar and Irene Renau. Irene will be presenting it in person in Cavtat, Croatia, where the conference takes place from 8 to 12 October 2024. On this occasion, the conference is organized by the Institute of Croatian Language and Linguistics.

Upcoming: April 3, 2024
Lecture by Carles Tebé at the opening of the PUCV Doctoral Program in Linguistics

The PUCV Doctoral Program in Linguistics opens its 2024 academic year with the lecture 'La terminología puntual en la traducción especializada: propuesta de sistematización' (Ad hoc terminology in specialized translation: a proposal for systematization), given by Prof. Carles Tebé Soriano.
Prof. Tebé earned his PhD in Applied Linguistics at Universitat Pompeu Fabra and has extensive experience in lexicography and terminology, both in the private sector and in academia. He was president of RITERM (Red Iberoamericana de Terminología) from 2014 to 2018 and is currently a professor at the Pontificia Universidad Católica de Chile.
His lecture will take place on Wednesday, April 3, at 17:30 in the Sala Híbrida (Auditorium) on the 6th floor of the ILCL building, Campus Sausalito.

March 20, 2024
Research Workshop on the Analysis of Metaphors in Corpora

We will hold a Research Workshop on the Analysis of Metaphors in Corpora, the first meeting of the whole group of researchers of Fondecyt Regular 1231594 (2023-2027), 'Mapa de las metáforas conceptuales en sustantivos y verbos del español: un estudio de los patrones metafóricos basado en corpus' (Map of conceptual metaphors in Spanish nouns and verbs: a corpus-based study of metaphorical patterns), directed by Irene Renau, with Rogelio Nazar as co-researcher and the participation of Benjamín López, Constanza Suy, Daniel Mora and Eduardo Puraivan.

January 29, 2024
Seminar in Santiago de Compostela

Irene Renau and Rogelio Nazar gave a seminar talk, open to the public, at the Facultad de Filología of the Universidad de Santiago de Compostela, with the title 'Desambiguación semántica y extracción de patrones verbales para la base de datos léxica Verbario' (Word sense disambiguation and extraction of verbal patterns for the lexical database Verbario), within the framework of Fondecyt project 1231594, directed by Irene Renau and funded by ANID, Government of Chile.

It was a privilege to share with the attendees the results of very recent work, on which we will report further soon through this and other channels. The audience's reaction was enthusiastic and the feedback was extensive and very useful. Thank you!

December 31, 2023
And we say goodbye to 2023 with yet another paper

Nice surprise to end the year: a new paper on modal operators (in Spanish):
Obreque, J.; Nazar, R. (2023). Detección de operadores modales: una primera exploración en castellano. Linguamatica. 15(2): 37-49.

ABSTRACT: This article presents a mixed methods approach — with emphasis on the quantitative side — for the detection and recording of modal operators. These units are defined as a broad and heterogeneous set of expressions used in written and oral communication to imprint the subjective vision of the writers/speakers in their own utterance. The present proposal is based on the exploitation of a parallel corpus to augment with quantitative means an initial list of examples obtained in a qualitative stage. The methodology is simple, effective, and language independent, although in this first test we focus on the Spanish language.

There is also a companion website (in English) with documentation, the code and data used in the experiments:

More on that later...

December 28, 2023
We have a new paper about Termout.org

Nazar, R.; Acosta, N. (2023). A Lightweight Statistical Method for Terminology Extraction. Journal of Computer-Assisted Linguistic Research, 7:43-59.

ABSTRACT: We propose a method for the task of automatic terminology extraction in the context of a larger project devoted to the automation of part of the tasks involved in the production of terminological databases. Terminology extraction is the key to drafting the macrostructure of a terminological resource (i.e., the list of entries), to which information can be later added at the microstructural level with grammatical or semantic information. To this end, we developed a statistical method that is conceptually simple compared to modern neural network approaches. It is a lightweight method because it is based on term dispersion and co-occurrence statistics that can be computed with basic hardware. For the evaluation, we experimented with corpora of lexicography and linguistics in English and Spanish of ca. 66 million tokens. Results improve baselines in almost 20%.

December 4, 2023
We released the regex Perl script

We published the Regex script, which we have used for years in classes to teach how to apply regular expressions in Perl. It is a very simple script that opens a text file and prints lines or segments based on a regular expression provided by the user. It is basically a grep, but you can easily adapt it to different situations.
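
For readers who do not use Perl, the same grep-like idea can be sketched in Python. This is not the released Regex script, only an illustration of what the paragraph above describes: read lines of text and return either the whole matching lines or just the matching segments. The names `matching_segments` and `grep_file` and the `segments_only` switch are assumptions made for this example.

```python
import re


def matching_segments(lines, pattern, segments_only=False):
    """Return matching lines, or only the matched segments.

    `lines` can be any iterable of strings, such as an open file.
    """
    regex = re.compile(pattern)
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if segments_only:
            # Collect every non-overlapping match found in the line.
            results.extend(m.group(0) for m in regex.finditer(line))
        elif regex.search(line):
            results.append(line)
    return results


def grep_file(path, pattern, segments_only=False, encoding="utf-8"):
    """Apply `matching_segments` to the text file at `path`."""
    with open(path, encoding=encoding) as handle:
        return matching_segments(handle, pattern, segments_only)


sample = ["el gato negro\n", "la casa\n", "los gatos\n"]
print(matching_segments(sample, r"gat\w+"))
# ['el gato negro', 'los gatos']
print(matching_segments(sample, r"gat\w+", segments_only=True))
# ['gato', 'gatos']
```

Keeping the matching logic separate from the file handling makes it easy to adapt the sketch to other inputs, such as standard input or a list of strings already in memory.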


Tools & demos

We have implemented different types of applications and most of them can be tested online. Take a look.

+ Bifid: a parallel corpus aligner (new interface!)

+ Compare: a simple script to compare two lists of words

+ Cryptoman: a script to generate cryptograms

+ Dismark: a multilingual taxonomy of discourse markers

+ Dsele: a model dictionary for ELE learners

+ Estilector: computer assisted writing for Spanish

+ GeNom: a program to detect the gender of proper nouns

+ HAT: a project for the treatment of polysemy in lexical taxonomies

+ Jaguar: a tool for statistical corpus analysis

+ Kind: a lexical taxonomy induction algorithm

+ Kwico: a concordancer for big corpora

+ Lealem: a reading pacer for parallel German-Spanish texts

+ Leafran: a reading pacer for parallel French-Spanish texts

+ Linguini: a language detector

+ Neven: a program to detect eventive nouns

+ POL: named entity recognition and classification

+ Poppins: a supervised text classifier

+ Porcus: an interface for various taggers and parsers for Spanish

+ pullPOS: a project for the detection of plurals in Spanish

+ Punkt: punctuation of discourse markers in Spanish

+ Randall: a list randomizer

+ Readeutsch: a reading pacer for parallel German-English texts

+ Regex: a Perl script for regular expressions

+ Sapo: a program to detect similarities between documents

+ Sicam: a program to analyze Spanish poetry

+ Termout: a terminology extraction system

+ TEXT·A·GRAM: a program to analyze Spanish texts

+ Verbario: corpus pattern analysis in Spanish


This is the view from where we are located, in the Sausalito lagoon, a quiet and lovely place in Viña del Mar, Chile. Sunny days. Birds can be seen in the center of the lagoon.

As researchers, we are currently affiliated with:
Pontificia Universidad Católica de Valparaíso
Instituto de Literatura y Ciencias del Lenguaje

Av. El Bosque 1290, Viña del Mar, Chile

Upcoming Events
[UPDATED: June 6, 2024]

    October 8-12, 2024: Irene Renau will be presenting a paper at EURALEX 2024, to be held in Cavtat, Croatia. The paper has the title 'Towards the automatic generation of a pattern-based dictionary of Spanish verbs' and is the result of a collaboration between her, Daniel Mora and Rogelio Nazar.

    November 8, 2024: at 16h Madrid time (GMT+1), 12h Chilean time (GMT-3), Irene Renau and Rogelio Nazar will present their research results at the II Seminario UAM: “Jornadas de lexicología y lexicografía del español: modelos, metodologías y herramientas” (Conference on Spanish lexicology and lexicography: models, methodologies and tools), an event organized by Rosario González, Beatriz Méndez, Elena de Miguel and Alberto Anula. More details will soon be available here.

Latest ideas & research projects

We are developing new projects in computational linguistics and natural language processing:

+ Fondecyt Regular (2023-2027): "Mapa de las metáforas conceptuales en sustantivos y verbos del español: un estudio de los patrones metafóricos basado en corpus" (Map of conceptual metaphors in Spanish nouns and verbs: a corpus-based study of metaphorical patterns). Lead researcher: Irene Renau. Co-researcher: Rogelio Nazar. Ref.: 1231594.

+ Fondecyt Regular (2019-2021): "Polisemia regular de los sustantivos del español: análisis semiautomático de corpus, caracterización y tipología" (Regular polysemy of nouns in Spanish: semiautomatic corpus analysis, characterization and typology). Lead researcher: Irene Renau. Ref.: 1191204.

+ Fondecyt Regular (2019-2021): "Inducción automática de taxonomías de marcadores discursivos a partir de corpus multilingües" (Automatic induction of taxonomies of discourse markers from multilingual corpora). Lead researcher: Rogelio Nazar. Ref.: 1191481.

+ Ecos-Sud (International Project between Chile and France): "Inducción automática de taxonomías del español y el francés mediante técnicas cuantitativas y estadística de corpus" (Automatic induction of Spanish and French taxonomies using quantitative techniques and corpus statistics). Lead researcher: Irene Renau. Ref.: C16H02.

+ Fondecyt Regular: "Desarrollo de la competencia terminológica a lo largo de la inserción disciplinar" (Development of terminological competence over the course of disciplinary insertion). Lead researcher: Sabela Fernández. Co-researcher: Rogelio Nazar. Ref.: 11121597.

Recent publications

+ Nazar, R.; Renau, I.; Robledo, H. (In press). Dismark and Text·a·Gram: Automatic identification and categorization of discourse markers in texts. In Proceedings of DISROM 2022 (Discourse Markers in Romance Languages, Craiova, 16-18 June 2022).

+ Obreque, J.; Nazar, R. (2023). Detección de operadores modales: una primera exploración en castellano. Linguamatica. 15(2): 37-49. PDF

+ Renau, Irene. (2023). A corpus-based study of semantic neology of the Covid-19 pandemic. Quaderns de Filologia: Estudis Lingüístics XXVIII: 55-76. PDF

+ Nazar, R. (2023). Extensión, variación y evolución del léxico español. In Battaner, P., Torner, S., Renau, I. Lexicografía hispánica / The Routledge Handbook of Spanish Lexicography. Ch. 14, pp. 204-218.

+ López-Hidalgo, B.; Renau, I.; Nazar, R. (2023). Correlación entre la metáfora orientacional BUENO ES ARRIBA / MALO ES ABAJO y polaridad positiva/negativa en verbos del español: un estudio con estadística de corpus. Humanidades Digitales, Corpus y Tecnología del Lenguaje. University of Groningen Press, pp. 307-323. PDF

+ Nazar, R. & Acosta, N. (2023). Termout: a tool for the semi-automatic creation of term databases. In Haddad, Amal; Terryn, Ayla; Mitkov, Ruslan; Rapp, Reinhard; Zweigenbaum, Pierre and Sharoff, Serge (eds.) Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC), INCOMA, Shoumen, Bulgaria, pp. 9-18. PDF

+ Nazar, R. & Renau, I. (2023). Estilector: un sistema de evaluación automática de la escritura académica en castellano. Revista Perspectiva Educacional, 62(2): 37-59. PDF

+ Robledo, H.; Nazar, R. (2023). A proposal for the inductive categorisation of parenthetical discourse markers in Spanish using parallel corpora. International Journal of Corpus Linguistics. http://doi.org/10.1075/ijcl.20017.rob

+ Renau, I.; Nazar, R. (2022). Towards a multilingual dictionary of discourse markers: automatic extraction of units from parallel corpus. In: Klosa-Kückelhaus, A.; Engelberg, S.; Möhrs, C.; Storjohann, P. Dictionaries and Society. Proceedings of the XX EURALEX International Congress, Mannheim: IDS-Verlag, pp. 262-272. PDF

+ Nazar, R; Lindemann, D. (2022). Terminology extraction using co-occurrence patterns as predictors of semantic relevance. Proceedings of the TERM21 Workshop. Language Resources and Evaluation Conference (LREC 2022), Marseille, 20-25 June 2022, pp. 26-29. PDF

Solutions for text processing

It is critical for organizations to have the ability to process information automatically, and very often that information is contained in documents to be read by humans rather than machines. We have different methods for text processing depending on the goal.

We can teach people how to automate their text processing routines. We can batch-process thousands of documents to extract information from them or to derive different types of statistics. We can also modify these documents, or generate databases or email correspondence based on information extracted from them. Anything that involves intelligent management of information can benefit from some degree of automation, which frees up time, effort and resources.

Tell us what your needs are and we will show you what we can do about them.