Tecling: Technologies for Linguistic Analysis

October 14, 2024

New paper on statistical models of discourse genres

Rogelio Nazar has just published a new paper in the SN Computer Science journal (Springer Nature):
Nazar, R. Statistical Modeling of Discourse Genres: The Case of the Opinion Column in Spanish. SN COMPUT. SCI. 5, 959 (2024). https://doi.org/10.1007/s42979-024-03329-8
The paper describes how the new version of Text·a·Gram can be used to explore some interesting quantitative characteristics of discourse genres. In particular, on this occasion the paper describes how different discourse mechanisms, such as discourse markers, deictics and modal operators, are distributed from the beginning to the end of a typical opinion column.
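As a rough illustration of what "distributed from beginning to end" can mean in practice, the sketch below splits a text into equal slices and counts marker occurrences in each slice. It is only a minimal sketch under stated assumptions: the `MARKERS` set and the function name `positional_distribution` are hypothetical, only single-word markers are handled, and none of this is taken from Text·a·Gram's actual code.

```python
# Hypothetical mini-lexicon, for illustration only (not taken from Dismark or Text·a·Gram).
MARKERS = {"however", "therefore", "moreover", "finally"}


def positional_distribution(text, markers=MARKERS, n_segments=4):
    """Count marker occurrences in each of n_segments equal slices of the text."""
    # Naive tokenization: split on whitespace and strip surrounding punctuation.
    tokens = [t.strip(".,;:!?()\"'¿¡").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    counts = [0] * n_segments
    for i, tok in enumerate(tokens):
        if tok in markers:
            # Map the token index to its slice (0 .. n_segments - 1).
            segment = min(i * n_segments // len(tokens), n_segments - 1)
            counts[segment] += 1
    return counts


sample = (
    "However the plan failed. It was bold and new. "
    "Moreover it was cheap. Therefore we stop here finally."
)
print(positional_distribution(sample, n_segments=3))
```

Averaging such per-slice counts over many texts of the same genre would give one simple picture of where in a text each kind of device tends to appear.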

October 4, 2024

Rogelio Nazar to give a talk on Text·a·Gram

Rogelio Nazar will give an online presentation this Tuesday, October 8, at 17 h, titled 'Text·a·Gram: métodos cuantitativos para el análisis del discurso' (Text·a·Gram: quantitative methods for discourse analysis). The event is organized by the IDI Research Group at the Universidad de las Américas. The goal is to present a line of research on the modeling of discourse genres, along with Text·a·Gram, a tool developed within that project that extracts descriptive statistics on the distribution of discourse markers, deictics and modal operators in Spanish. The talk will also serve to introduce a new version of the system, with new features.

Update:

The program's source code is now available for download from the project website:
https://www.tecling.com/textagram
We will very likely be updating this version of the code over the next few days.

September 5, 2024

Prof. Elisabetta Jezek in the Winter Seminars on Lexical Semantics 2024

Great talks in the Winter Seminars on Lexical Semantics 2024 by Prof. Elisabetta Jezek, from the University of Pavia. We had a room crowded with PhD, MA and undergrad students. We talked about syntax, semantics, word sense disambiguation and Corpus Pattern Analysis. We are thrilled to have Elisabetta with us these days at @ILCLPUCV!

September 5, 2024

Hernán Robledo presented his work at the University of London

Hernán presented his work today at the V Congreso Internacional RECoD:
https://recod.org/
held at Birkbeck, England. The talk is titled “Variantes formales de marcadores del discurso del español: exploraciones en tres géneros académicos” (Formal variants of Spanish discourse markers: explorations in three academic genres) and is part of the Fondecyt postdoctoral project ANID no. 3230617, sponsored by Irene Renau and the PUCV.
Well done, colleague! Now it is time to stroll through the streets of London...

August 22, 2024

Impressive turnout for the Python workshop

We were expecting a total of 3 or 4 people interested in the introductory Python workshop, and instead we had a full room. In fact, we had to turn 16 people away because there was no more space.
We will have to run a second edition of the workshop in the coming weeks to give those who signed up but were left without a place a chance to take part. We will announce the dates of this second edition soon.

August 17, 2024

TEXT·A·GRAM is back online

After some downtime due to a server change, TEXT·A·GRAM is back online.
The program is designed to scan texts at the following levels:

The referents of the text (the objects the text talks about)
Discourse markers (using the Dismark taxonomy)
Deixis (personal, temporal and spatial)
Modalization

August 8, 2024

Introductory Python workshop

As part of the PhD program in Linguistics at the Instituto de Literatura y Ciencias del Lenguaje of the Pontificia Universidad Católica de Valparaíso, Rogelio Nazar will teach an introductory Python workshop on Wednesdays August 21 and 28, 2024, from 15 to 17 h. The registration form and more details about the event can be found at this link.

August 2, 2024

Our collaborator Ana Castro ranks first for the ANID scholarship

Ana Castro, who has collaborated with this group on various projects since 2014 and has coauthored several articles and conference presentations with us, has ranked first nationwide for the ANID scholarship (Beca Chile), out of a total of 413 candidates. Ana will pursue the PhD in Spanish Philology, in the lexicology and lexicography research line, at the Universidad Autónoma de Barcelona. We are tremendously proud of this impressive achievement with a scholarship as competitive as this one. Way to go, Anita!

July 29, 2024

We will be presenting two papers at ICAI 2024

Today we woke up to the news that we have two papers accepted at the 7th International Conference on Applied Informatics (ICAI 2024), which this year will be held at the Universidad Andrés Bello, located in Viña del Mar, Chile (Quillota 980). One of the papers is titled 'Metaphor identification and interpretation in corpora with ChatGPT', by I. Renau, E. Puraivan and J. Riquelme, and the title of the other one is 'Statistical modeling of discourse genres: the case of the opinion column in Spanish', by R. Nazar. Both papers have been selected as 'best papers', which means that they will be published (ca. November 2024) in the SN Computer Science Journal (by Springer Nature).

July 21, 2024

We are updating this website

We have been doing some maintenance on our servers, a task that had been pending for years. We reinstalled the operating system and replaced the disk, which was old and frail. Now we are in the process of migrating every app to the new disk. This may, of course, produce some errors and 'file not found' messages. You can help by sending us an email if you encounter one of those.

July 11, 2024

Two talks by Irene Renau in less than 24 hours

Intense working day for Irene. She participated in a talk in honor of Humberto López Morales at the Chilean Academy of Language, and only a few hours later she delivered another talk at Phrasalex III, a workshop on lexicology and lexicography. She also paid tribute to the late Patrick Hanks, the famous lexicographer and a dear friend.

July 10, 2024

We have a new article on discourse markers

A new article of ours has been published in the latest issue of Revista Logos:

Alvarado, Camila; Nazar, Rogelio (2024). Detección de marcadores discursivos: el caso de los conectores causal-consecutivos y su polifuncionalidad. Logos: Revista de Lingüística, Filosofía y Literatura. 34(1): 293-308.
https://revistas.userena.cl/index.php/logos/issue/view/159

This research derives from Camila's undergraduate (Licenciatura) thesis. As a grant holder of the Fondecyt Regular project 1191481 (2019-2021), she carried out a study of discourse markers in Spanish. In particular, she focused on one specific type, causal connectives, and developed an algorithm to identify them. She also studied a method for detecting those cases that show polyfunctionality, and the results, evaluated in terms of precision and recall, are very promising.

July 1, 2024

We have a new version of Termout

We are again in some kind of productivity rush, and this time we have a new version of Termout, our terminology extraction system:
http://www.tecling.cl/cgi-bin/termout2024
Actually, what we have is more like a preview, as the only part currently available to the public is the terminology extraction part, but that is the most important part of the process.
Compared to its predecessor, this new version is blazingly fast, and performance evaluation (soon to be published) shows that it also has better precision and recall.
We are working very hard to get the rest of the functions ready and update the documentation. The old version (2023) will continue to exist for a while, until we finish the migration.
In the meantime, try the new term extractor and tell us about your experience by sending an email to rogelio dot nazar at pucv dot cl

June 28, 2024

Lucía Castillo gives a lecture on Open Science

Lucía Castillo, from the Universidad de Concepción, visited the PhD program in Linguistics at the Pontificia Universidad Católica de Valparaíso to give a lecture on the methods, techniques and practices currently available for doing what is known as Open Science, a working paradigm in which sharing the data of scientific research (inputs, results, and the code that generates them) is considered a fundamental part of professional honesty.
Lucía has kindly shared the slides of her presentation with us:
LCastilloCienciaAbierta2024.pdf

June 14, 2024

We have a new version of Bifid

We have a new version of Bifid, our dear old statistical parallel corpus aligner.
Bifid is a program that takes a set of documents with their translations (an unaligned parallel corpus) and does several things:

It separates the set of documents into the two languages

It aligns each document with its translation

It aligns the sentences in each pair of documents

It extracts a bilingual vocabulary from the aligned sentences
[NEW UPDATE: June 21, 2024]: The vocabulary now includes multiword units.

It exports results in CSV and TMX formats

It imports tmx documents, in case you already have your corpus aligned at the sentence level and what you want is to obtain a bilingual vocabulary.
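To give an idea of how a bilingual vocabulary can be obtained from sentences that are already aligned, here is a minimal sketch based on co-occurrence counts and the Dice coefficient. This is a generic textbook-style technique chosen for illustration, not necessarily the method Bifid implements; the function name `bilingual_vocabulary`, the `min_score` threshold and the toy sentence pairs are all assumptions.

```python
from collections import Counter
from itertools import product


def bilingual_vocabulary(pairs, min_score=0.5):
    """Map each source word to its best-scoring target word (Dice coefficient)."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        # Every source word co-occurs with every target word of the aligned sentence.
        joint.update(product(src_words, tgt_words))
    vocab = {}
    for (s, t), n in joint.items():
        dice = 2 * n / (src_freq[s] + tgt_freq[t])
        # Keep the candidate only if it beats the threshold and the current best.
        if dice >= min_score and dice > vocab.get(s, ("", 0.0))[1]:
            vocab[s] = (t, dice)
    return vocab


# Toy aligned corpus, invented for illustration.
aligned = [
    ("the house", "la casa"),
    ("the dog", "el perro"),
    ("red house", "casa roja"),
    ("red dog", "perro rojo"),
]
vocab = bilingual_vocabulary(aligned)
print(vocab["house"], vocab["dog"])
```

With so little data, frequent function words such as "the" end up with tied candidates; a real system would need far more text and some additional filtering.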

Bifid had been online since 2004 (yes, it's going to be 20 years now), but lately its server had gone down and it was neglected.
But here it is now, again, restored to its former glory!
We are planning some kind of celebration for its 20th birthday (no one remembers the actual date, so we will celebrate the whole year).
We will be posting updates on this soon.

Some (old) publications on the project:
Nazar, R. (2011). Parallel corpus alignment at the document, sentence and vocabulary levels.
Procesamiento del Lenguaje Natural, n. 47.
Nazar, R. (2012). Bifid: un alineador de corpus paralelo a nivel de documento, oración y vocabulario.
Linguamatica, vol. 4, no. 2.

May 31, 2024

We have a new version of Kind

We have a new version of Kind, our lexical taxonomy. This version is based on the Spanish Wiktionary, so at the moment it is only available in this language. Versions for some others will be available at some point in the future here.

This database has 30912 different nouns as entries, which populate a relatively large top-ontology, with 525 semantic types.
Take a look:
http://www.tecling.com/cgi-bin/kind/2024
You can, for instance, enter any arbitrary common single-noun or a list of them to obtain their hypernymy chains.
You can also navigate categories, for example here with árbol (tree):
http://www.tecling.com/cgi-bin/kind/2024/index.pl?input=%C3%A1rbol&act=tot
and see what kinds of trees we have.
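For readers unfamiliar with the idea, a hypernymy chain is simply the path from a noun up to the top of the taxonomy. The sketch below shows the concept on a toy child-to-parent table; the entries in `HYPERNYMS` and the function names are invented for illustration and are neither data nor code from Kind.

```python
# Toy child -> parent links, invented for illustration; not data from Kind.
HYPERNYMS = {
    "roble": "árbol",
    "pino": "árbol",
    "árbol": "planta",
    "planta": "ser vivo",
    "ser vivo": "entidad",
}


def hypernymy_chain(noun, hypernyms=HYPERNYMS):
    """Follow parent links from a noun up to the most general category."""
    chain = [noun]
    seen = {noun}
    while chain[-1] in hypernyms:
        parent = hypernyms[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the table
            break
        chain.append(parent)
        seen.add(parent)
    return chain


def hyponyms(category, hypernyms=HYPERNYMS):
    """List the direct children of a category, i.e. 'what kinds of X we have'."""
    return sorted(child for child, parent in hypernyms.items() if parent == category)


print(" > ".join(hypernymy_chain("roble")))
print(hyponyms("árbol"))
```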

A lot of work still needs to be done, though. For instance, the category sistema (system) still has some Borgesian vibes: as a list, it is ridiculously heterogeneous. Larger categories should be further subdivided in a logical order. Another limitation is that its only source text is the Wiktionary. It does not extract information from a corpus, as the older version did (still available, by the way).

This is part of a work in progress and there is no documentation yet. It was designed and developed by Rogelio Nazar with the help of Irene Renau, Daniel Mora and Nicolás Acosta. This Spanish taxonomy is just a single cog in a larger project directed by Irene Renau.

May 24, 2024

Irene Renau presented two talks in Murcia

Irene Renau made two presentations today at the X Congreso Internacional de Lexicografía Hispánica, held at Universidad de Murcia, Spain. This time, the title of the conference is «Variación y panhispanismo en lexicografía» (Variation and pan-Hispanism in lexicography). The title of the first talk was 'ChatGPT para la detección de metáforas en sustantivos del español' (ChatGPT for the detection of metaphors in Spanish nouns), which she coauthored with Eduardo Puraivan. The title of the second presentation was 'Extracción de patrones verbales para una base de datos léxica del castellano' (Extraction of verbal patterns for a lexical database of Spanish) and it was based on a paper, currently in progress, with coauthors Rogelio Nazar and Daniel Mora.

May 6, 2024

We started to clean up the Spanish Wiktionary

Compared to what it was 10 years ago, the Spanish Wiktionary is today an amazing resource that can be harnessed for different uses in computational linguistics and other fields. It really shows the power of collaboration, an example of what can be achieved by an army of writers from different backgrounds. However, as it stands on the web, it still bears the marks of its collaborative nature: most people creating this resource are not professional lexicographers, and many of the definitions still lack the polish of commercial dictionaries. Furthermore, it is plagued by elements that should not be there, like proper nouns or very specialized technical terminology. Most importantly, however, as a whole it shows a lack of cohesion and uniformity of style, a natural consequence of having been produced by a heterogeneous group of people. Given these circumstances, we embarked on the task of cleaning it up a little bit, to make it more useful for Spanish NLP applications.

[NEW UPDATE May 26, 2024]
Just when we thought we had finished, new things appeared and had to be cleaned. But now, yes, we think it should be good enough for many purposes:

http://tecling.com/wiktionary/cleanedupWiktSpa-26May2024.zip (2Mb)

The next step will be to prepare a taxonomy of nouns from this. We will be informing about that...
(by the way, this version only includes nouns: if you were here for the verbs or adjectives, etc., you will be disappointed).

Tools & demos

We have implemented different types of applications and most of them can be tested online. Take a look.

+ Bifid: a parallel corpus aligner

+ Compare: a simple script to compare two lists of words

+ Cryptoman: a script to generate cryptograms

+ Dismark: a multilingual taxonomy of discourse markers

+ Estilector: computer assisted writing for Spanish

+ GeNom: a program to detect the gender of proper nouns

+ Jaguar: a tool for statistical corpus analysis

+ Kind: a lexical taxonomy induction algorithm

+ Kwico: a concordancer for big corpora

+ Lealem: a reading pacer for parallel German-Spanish texts

+ Leafran: a reading pacer for parallel French-Spanish texts

+ Linguini: a language detector

+ Neven: a program to detect eventive nouns

+ POL: named entity recognition and classification

+ Poppins: a supervised text classifier

+ Porcus: an interface for various taggers and parsers for Spanish

+ pullPOS: a project for the detection of plurals in Spanish

+ Punkt: punctuation of discourse markers in Spanish

+ Randall: a list randomizer

+ Readeutsch: a reading pacer for parallel German-English texts

+ Regex: a Perl script for regular expressions

+ Sapo: a program to detect similarities between documents

+ Sicam: a program to analyze Spanish poetry

+ Termout: a terminology extraction system

+ TEXT·A·GRAM: a program to analyze Spanish texts

+ Verbario: corpus pattern analysis in Spanish

This is the view from where we are located, in the Sausalito lagoon, a quiet and lovely place in Viña del Mar, Chile. Sunny days. Birds can be seen in the center of the lagoon.

As researchers, we are currently affiliated with:

Instituto de Literatura y Ciencias del Lenguaje

Av. El Bosque 1290, Viña del Mar, Chile

Upcoming Events

[UPDATED: September 29, 2024]

October 8, 2024, at 17 h Chilean time: Rogelio Nazar will give an online talk for the IDI Research Group at the Universidad de las Américas, titled 'Text·a·Gram: métodos cuantitativos para el análisis del discurso' (Text·a·Gram: quantitative methods for discourse analysis). The goal is to present a line of research on the modeling of discourse genres and an open-source tool, developed within that project, that extracts descriptive statistics on the distribution of discourse markers, deictics and modal operators in Spanish.

October 8-12, 2024: Irene Renau will be presenting a paper at EURALEX 2024, to be held in Cavtat, Croatia. The paper is titled 'Towards the automatic generation of a pattern-based dictionary of Spanish verbs' and is the result of a collaboration between her, Daniel Mora and Rogelio Nazar.

October 24-26, 2024: Irene Renau and Rogelio Nazar will be presenting two papers at the 7th International Conference on Applied Informatics (ICAI 2024), which this year will be held at the Universidad Andrés Bello, located in Viña del Mar, Chile (Quillota 980).

November 8, 2024: at 16h Madrid time (GMT+1) or 12h Chilean time (GMT-3), Irene Renau and Rogelio Nazar will be presenting their research results at the II Seminario UAM: “Jornadas de lexicología y lexicografía del español: modelos, metodologías y herramientas” (Conference on Spanish lexicology and lexicography: models, methodologies and tools), an event organized by Rosario González, Beatriz Méndez, Elena de Miguel and Alberto Anula. The title of the presentation is 'La lingüística aplicada en acción: experimentos con herramientas para el procesamiento de texto' (Applied linguistics in action: experiments with text processing tools).


Latest ideas & research projects

We are developing new projects in computational linguistics and natural language processing:

+ Fondecyt Regular (2023-2027): "Mapa de las metáforas conceptuales en sustantivos y verbos del español: un estudio de los patrones metafóricos basado en corpus" (Map of conceptual metaphors in Spanish nouns and verbs: a corpus-based study of metaphorical patterns). Lead researcher: Irene Renau. Co-researcher: Rogelio Nazar. Ref.: 1231594.

+ Fondecyt Regular (2019-2021): "Polisemia regular de los sustantivos del español: análisis semiautomático de corpus, caracterización y tipología" (Regular polysemy of nouns in Spanish: semiautomatic corpus analysis, characterization and typology). Lead researcher: Irene Renau. Ref.: 1191204.

+ Fondecyt Regular (2019-2021): "Inducción automática de taxonomías de marcadores discursivos a partir de corpus multilingües" (Automatic induction of taxonomies of discourse markers from multilingual corpora). Lead researcher: Rogelio Nazar. Ref.: 1191481.

+ Ecos-Sud (International Project between Chile and France): "Inducción automática de taxonomías del español y el francés mediante técnicas cuantitativas y estadística de corpus" (Automatic induction of Spanish and French taxonomies using quantitative techniques and corpus statistics). Lead researcher: Irene Renau. Ref.: C16H02.

+ Fondecyt Regular: "Desarrollo de la competencia terminológica a lo largo de la inserción disciplinar" (Development of terminological competence over the course of disciplinary integration). Lead researcher: Sabela Fernández. Co-researcher: Rogelio Nazar. Ref.: 11121597.

+ See more.

Recent publications

+ Nazar, R.; Renau, I.; Robledo, H. (In press). Dismark and Text·a·Gram: Automatic identification and categorization of discourse markers in texts. In Proceedings of DISROM 2022 (Discourse Markers in Romance Languages, Craiova, 16-18 June 2022).

+ Obreque, J.; Nazar, R. (2023). Detección de operadores modales: una primera exploración en castellano. Linguamatica. 15(2): 37-49.

+ Renau, Irene. (2023). A corpus-based study of semantic neology of the Covid-19 pandemic. Quaderns de Filologia: Estudis Lingüístics XXVIII: 55-76.

+ Nazar, R. (2023). Extensión, variación y evolución del léxico español. In Battaner, P., Torner, S, Renau, I. Lexicografía hispánica / The Routledge Handbook of Spanish Lexicography. Cap. 14, pp. 204-218.

+ López-Hidalgo, B.; Renau, I.; Nazar, R. (2023). Correlación entre la metáfora orientacional BUENO ES ARRIBA / MALO ES ABAJO y polaridad positiva/negativa en verbos del español: un estudio con estadística de corpus. Humanidades Digitales, Corpus y Tecnología del Lenguaje. University of Groningen Press, pp. 307-323.

+ Nazar, R. & Acosta, N. (2023). Termout: a tool for the semi-automatic creation of term databases. In Haddad, Amal; Terryn, Ayla; Mitkov, Ruslan; Rapp, Reinhard; Zweigenbaum, Pierre and Sharoff, Serge (eds.) Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC), INCOMA, Shoumen, Bulgaria, pp. 9-18.

+ Nazar, R. & Renau, I. (2023). Estilector: un sistema de evaluación automática de la escritura académica en castellano. Revista Perspectiva Educacional, 62(2): 37-59.

+ Robledo, H.; Nazar, R. (2023). A proposal for the inductive categorisation of parenthetical discourse markers in Spanish using parallel corpora. International Journal of Corpus Linguistics. http://doi.org/10.1075/ijcl.20017.rob

+ Renau, I.; Nazar, R. (2022). Towards a multilingual dictionary of discourse markers: automatic extraction of units from parallel corpus. In: Klosa-Kückelhaus, A.; Engelberg, S.; Möhrs, C.; Storjohann, P. Dictionaries and Society. Proceedings of the XX EURALEX International Congress, Mannheim: IDS-Verlag, pp. 262-272.

+ Nazar, R.; Lindemann, D. (2022). Terminology extraction using co-occurrence patterns as predictors of semantic relevance. Proceedings of the TERM21 Workshop. Language Resources and Evaluation Conference (LREC 2022), Marseille, 20-25 June 2022, pp. 26-29.

Solutions for text processing

It is critical for organizations to have the ability to process information automatically, and very often that information is contained in documents to be read by humans rather than machines. We have different methods for text processing depending on the goal.

We can help by teaching people how to automate their text processing routines. We can batch-process thousands of documents to extract information from them or to derive different types of statistics. We can also modify these documents, or generate databases or email correspondence based on information extracted from them. Anything that involves intelligent management of information can benefit from different degrees of automation, which frees up time, effort and resources.

Tell us what your needs are and we will show you what we can do about them.


*» The universe is not perfect, but it's working on it.*