Bifid: Parallel corpus alignment at the document, sentence and vocabulary levels
Bifid is a program for parallel corpus alignment.
State of this project on January 17, 2021: Last year we had to interrupt this service because of security
issues detected on the server and our lack of time to solve them. We had to take
the server down until we had time for a complete overhaul of that piece of machinery.
In the meantime, we have also been planning to improve Bifid's software,
making it less computationally expensive and easier to install on other hardware.
Up to now, Bifid was too dependent on the Jaguar Project, which has problems of its own.
So we are integrating parts of Jaguar's code into Bifid and making
some other major changes, including the addition of preloaded information about different languages.
This is a significant departure from the original project, explained in these publications:
Nazar, R. (2011). Parallel corpus alignment at the document, sentence and vocabulary levels.
Procesamiento del Lenguaje Natural, n. 47.
Nazar, R. (2012). Bifid: un alineador de corpus paralelo a nivel de documento, oración y vocabulario.
Linguamatica, vol. 4, no. 2.
One of the interesting features of the original proposal was its aim of total
linguistic agnosticism. Ideally, we will try to maintain some functionality for
languages unknown to the system. But from a practical point of view,
it could be argued that there is no need for such agnosticism in the case
of well-known languages like English, Spanish, French, German and others.
Such knowledge would help Bifid make better decisions, and make them faster.
The situation on the ground today is the following:
We have considerably improved our ability to detect sentences, and we have
a new prototype to do just that:
We also developed a language detection algorithm that identifies
fragments written in languages other than the main one. We call it
In the coming days (or, probably, weeks!) we will be working on integrating all this into the
new version of Bifid.
If you have questions, feel free to send an email: rogelio dot nazar at pucv dot cl
Contact: rogelio.nazar at gmail.com
Related concepts: Parallel Corpus Alignment, Bilingual Vocabulary Extraction, Machine Translation, Computational Linguistics, Computational Lexicography
DSELE: a dictionary of Spanish verbs with 'se'
This proposal aims to improve students' academic writing skills through the creation, development and implementation of a web tool that helps detect the problems of style that can be found in drafts of academic work. It offers additional explanations, bibliographic support and online resources. The tool is not intended to correct grammatical or spelling errors, but problems such as words repeated close together in the text, poor vocabulary, the use of colloquialisms, the unequal structure of paragraphs, and so on. None of these issues can be detected by programs such as Word, and yet they are critical to academic achievement. Our proposal is not to create a mere "corrector", but a teaching tool that fosters independent learning, because the student can work on these aspects independently of classroom work, while also complementing it. The idea is that the tool will help students improve their writing while they are carrying out the task. In addition, the program encourages autonomy in the sense that it suggests solutions to the student but does not correct the text, so that it is the student who ultimately decides whether or not to apply the suggested changes.
GeNom: automatic detection of the gender of proper names is a project we were granted on June 20, 2017, funded by the Technology Prototypes track of the Innovation and Entrepreneurship 2017 Competition (Vicerrectoría de Investigación y Estudios Avanzados - Pontificia Universidad Católica de Valparaíso). The result is offered as a web service for batch processing of information for terminography or lexicography projects or for mailing purposes.
Abstract: This software is designed to automatically determine the gender of a list of names based on their co-occurrence with words and abbreviations in a large corpus. GeNom differs from other automatic name gender recognition software because it is based on natural language processing and does not rely on precompiled lists of first names, which quickly become outdated and cannot handle previously unseen names. GeNom uses corpora to address the problem because they make it possible to obtain real and up-to-date name-gender links, and it performs better than machine learning methods: 93% precision and 88% recall on a database of ca. 10,000 mixed names. This software can be used to conduct large-scale studies about gender, such as studies of gender bias, or for a variety of other NLP tasks, such as information extraction, machine translation, anaphora resolution and others. It is designed to work with Spanish names, as it works with a Spanish corpus, but it will be able to process names in other languages as well, provided that they use the same alphabet.
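The co-occurrence idea in the abstract can be illustrated with a minimal Python sketch. It is not GeNom's actual code: the cue lists, the function name guess_gender and the rule of looking only at the word immediately before the name are all assumptions made for illustration, since the description above does not say which words and abbreviations are used or how they are weighted.

```python
import re
from collections import Counter

# Hypothetical cue words; the description does not list the ones GeNom uses.
FEMALE_CUES = {"sra", "señora", "doña", "srta"}
MALE_CUES = {"sr", "señor", "don"}


def guess_gender(name, sentences):
    """Return 'F', 'M' or None, voting on cue words found right before the name."""
    votes = Counter()
    target = name.lower()
    for sentence in sentences:
        # \w+ keeps accented letters and drops punctuation such as the dot in "Sra."
        tokens = re.findall(r"\w+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok == target and i > 0:
                prev = tokens[i - 1]
                if prev in FEMALE_CUES:
                    votes["F"] += 1
                elif prev in MALE_CUES:
                    votes["M"] += 1
    if votes["F"] == votes["M"]:
        return None  # no evidence, or a tie
    return "F" if votes["F"] > votes["M"] else "M"
```

Because the decision comes from the corpus rather than from a fixed list, a name never seen before can still be classified as long as it appears next to a cue word somewhere in the text.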
Web demo: http://www.tecling.com/genom
Jaguar is a tool for corpus exploitation. This software can analyze textual corpora from a user or from the web and it is currently available as a web application as well as a Perl module. The functions available at this moment are: vocabulary analysis of corpora, concordance extraction, n-gram sorting and measures of association, distribution and similarity.
Jaguar is essentially a Perl module instantiated as a web application. A web application has the advantage of running on any platform without installation. With the module, however, users can build their own sequence of procedures, taking the output of one process as the input of another. The web interface has the limitation that only one procedure can be executed at a time, meaning that the output of a process has to be manually fed as input to the next process.
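To make the idea of chaining procedures concrete, here is a small sketch in Python rather than Perl. It does not use Jaguar's API: the function names tokenize, count_ngrams and pmi are hypothetical, and pointwise mutual information is just one common measure of association chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def count_ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pmi(bigrams, unigrams, total):
    """Pointwise mutual information per bigram, using total tokens as the sample size."""
    scores = {}
    for (w1, w2), f12 in bigrams.items():
        p12 = f12 / total
        p1 = unigrams[(w1,)] / total
        p2 = unigrams[(w2,)] / total
        scores[(w1, w2)] = math.log2(p12 / (p1 * p2))
    return scores


# Each step consumes the output of the previous one.
tokens = tokenize("new york is big and new york is old")
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)
scores = pmi(bigrams, unigrams, len(tokens))
```

In a script the three steps run in sequence without manual intervention, which is the advantage of the module over the web interface described above.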
The project is a full renovation and extension of the old "Jaguar Project" carried out at Universitat Pompeu Fabra in Barcelona from 2006 to 2012. The title of the current project is: "Jaguar: an open-source prototype for quantitative corpus analysis"
The results of this project will be officially presented in January 2017 at the university headquarters, at Av. Brasil #2950, Valparaíso, Chile.
We are also planning to offer an introductory workshop on the use of this tool in the summer of 2017, maybe in Valparaíso, maybe in Santiago, or maybe in both places. Drop us a line if you are interested.
KIND (aka The Taxonomy Project)
We designed a statistically-based
taxonomy induction algorithm that combines different strategies, none of which involves explicit linguistic knowledge. Although they are all
quantitative, the strategies we present differ in nature. Some of them are based on the computation of distributional
similarity coefficients which identify pairs of sibling words or co-hyponyms, while others are based on asymmetric co-occurrence and
identify pairs of parent-child words or hypernym-hyponym relations. A decision making process is then applied to combine the results
of the previous steps, and finally connect lexical units to a basic structure containing the most general categories of the language. We
evaluate the quality of the taxonomy both manually and also using Spanish Wordnet as a gold-standard. We estimate an average of
89.07% precision and 25.49% recall considering only the results which the algorithm presents with high degree of certainty, or 77.86%
precision and 33.72% recall considering all results.
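The two kinds of evidence described above can be sketched as follows. This is only an illustration under stated assumptions, not the published algorithm: the Jaccard coefficient stands in for the unspecified distributional similarity coefficients, the difference of conditional probabilities stands in for the asymmetric co-occurrence measure, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(sentences):
    """Count sentence frequency per word and sentence-level co-occurrence per word pair."""
    freq = Counter()
    pair = defaultdict(Counter)
    for sentence in sentences:
        words = set(sentence.lower().split())
        freq.update(words)
        for a, b in combinations(sorted(words), 2):
            pair[a][b] += 1
            pair[b][a] += 1
    return freq, pair


def jaccard(a, b, pair):
    """Symmetric overlap of the co-occurrence profiles of a and b (a sibling cue)."""
    ca, cb = set(pair[a]) - {b}, set(pair[b]) - {a}
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


def asymmetry(child, parent, freq, pair):
    """P(parent | child) - P(child | parent); positive values suggest child -> parent."""
    if not freq[child] or not freq[parent]:
        return 0.0
    both = pair[child][parent]
    return both / freq[child] - both / freq[parent]
```

A high symmetric score between two words points to a co-hyponym pair, while a clearly positive asymmetry points to a hyponym-hypernym pair. A decision step such as the one mentioned in the text would still be needed to reconcile the two signals.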
+ Nazar, R.; Balvet, A.; Ferraro, G.; Marín, R.; Renau, I. (2020). "Pruning and repopulating a lexical taxonomy: experiments in Spanish, English and French". Journal of Intelligent Systems, vol. 30 num. 1, pp. 376-394.
This project is part (or a "spin-off") of the Perl module Jaguar, which is currently ongoing with funding from the Innovation and Entrepreneurship 2016 Program of Pontificia Universidad Católica de Valparaíso, within the "Technological Prototypes" track.
KWiCo is a corpus indexing algorithm. It takes a corpus as input and produces a table with an index of the corpus, thus significantly reducing the time needed to retrieve concordances, especially when the corpus is very large.
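A minimal sketch of why an index speeds up concordance retrieval, written in Python under the assumption of a simple in-memory positional index. KWiCo's actual table format is not described here, and build_index and concordance are hypothetical names.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each lowercased token to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok.lower()].append(pos)
    return index


def concordance(word, tokens, index, width=3):
    """Return (left context, keyword, right context) for every occurrence of word."""
    lines = []
    for pos in index.get(word.lower(), []):
        left = " ".join(tokens[max(0, pos - width):pos])
        right = " ".join(tokens[pos + 1:pos + 1 + width])
        lines.append((left, tokens[pos], right))
    return lines
```

The corpus is scanned once to build the index. After that, each query jumps directly to the stored positions instead of reading the whole corpus again, which is where the time saving on very large corpora comes from.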
We present a study in the field of the automatic
detection of non-deverbal eventive nouns, which
are those nouns that designate events but have not
experienced a process of derivation from verbs, such
as fiesta (‘party’) or cóctel (‘cocktail’) and, for this
reason, do not present the typical morphological features
of deverbal nouns, such as -ción, -miento, and
are therefore more difficult to detect.
In the present research we continue and extend the
work initiated by Resnik (2010), who offers a number
of cues for the detection of this type of lexical unit. We
apply Resnik’s ideas and we also add new ones, among
them, the inductive analysis of the words that tend to
co-occur with eventive nouns in corpora, in order to
use them as predictors of this condition. Furthermore,
we simplify the classification algorithm considerably,
and we apply the experiments to a larger corpus, the
EsTenTen (Kilgarriff & Renau, 2013), comprising more
than 9 billion running words. Finally, we present
the first results of the automatic extraction of eventive
nouns from the corpus, among which we find plenty
perl neven.pl input.txt > result.htm
Termout.org is the first implementation of a new method for terminology extraction based on distributional analysis. The intuition behind the algorithm is that single or multi-word lexical units that refer to specialised concepts will show a characteristic co-occurrence pattern, described as a tendency to appear in the same contexts with other conceptually related terms. For example, the term fluoxetine will systematically appear in the same sentences with other related terms such as depression, serotonin reuptake inhibitor, obsessive–compulsive disorder and others. Of course, terms will co-occur with general vocabulary units as well, but not with the characteristic pattern found when a conceptual relation holds. Experimental evaluation of this method was conducted on a corpus of psychiatry journals from Spain and Latin America, and concluded that the results are significantly better than those of other methods.
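The intuition can be illustrated with a short Python sketch. It is not the Termout.org algorithm: the Dice coefficient over sentence-level co-occurrence is an assumption chosen for simplicity, dice_neighbours is a hypothetical name, and the sketch handles single-word units only.

```python
from collections import Counter


def dice_neighbours(target, sentences):
    """Rank words by the Dice coefficient of their sentence-level co-occurrence with target."""
    target = target.lower()
    sent_sets = [set(s.lower().split()) for s in sentences]
    # Number of sentences each word appears in.
    freq = Counter(w for s in sent_sets for w in s)
    # Number of sentences each word shares with the target.
    joint = Counter(w for s in sent_sets if target in s for w in s if w != target)
    scores = {w: 2 * f / (freq[target] + freq[w]) for w, f in joint.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A related term such as depression shares a large part of its occurrences with fluoxetine and scores high, whereas a general word that appears in many unrelated sentences is penalised by its overall frequency and scores low.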
The purpose of this research is to develop a methodology for the detection
and categorisation of named entities or proper names (PNs), in the categories of
geographical place, person and organisation. The hypothesis is that the context of
occurrence of the entity –a context window of n words before the target– as well as
the components of the PN itself may provide good estimators of the type of PN. To
that end, we developed a supervised categorisation algorithm, with a training phase
in which the system receives a corpus already annotated by another NERC system.
In these experiments, that system was the open-source suite of language
analysers FreeLing, annotating the corpus of the Spanish Wikipedia. During this
training phase, the system learns to associate the category of entity with words of
the context as well as those from the PN itself. We evaluate results with the CoNLL-2002 data
and also with a corpus of geopolitics from the journal Le Monde Diplomatique
in its Spanish edition, and compare the results with some well-known NERC systems
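The training idea described above, associating categories with context words and with the components of the name, can be sketched in Python as a simple voting classifier. This is an illustration only and not POL's implementation, which is written in Perl: the feature scheme, the vote weighting and the names features, train and classify are all assumptions.

```python
from collections import Counter, defaultdict


def features(context, name):
    """Words before the entity plus the entity's own components, tagged by origin."""
    return [f"ctx:{w.lower()}" for w in context] + [f"pn:{w.lower()}" for w in name.split()]


def train(examples):
    """Build feature -> category counts from (context_words, name, category) triples."""
    model = defaultdict(Counter)
    for context, name, category in examples:
        for feat in features(context, name):
            model[feat][category] += 1
    return model


def classify(model, context, name):
    """Sum each known feature's category proportions and return the top category, or None."""
    votes = Counter()
    for feat in features(context, name):
        counts = model.get(feat)
        if counts:
            total = sum(counts.values())
            for category, c in counts.items():
                votes[category] += c / total
    return votes.most_common(1)[0][0] if votes else None
```

In this sketch the annotated triples would come from the output of another NERC system, as in the training phase described above, rather than from manual annotation.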
To train POL to build a new model, you need to have Perl's Storable module installed.
These models were created
on an x86_64 HP Proliant machine with a GenuineIntel CPU at 1064.000 MHz running Linux (Ubuntu 14.04). If you have a different kind of machine (e.g., a desktop PC running Windows), then you will probably need to create the models again using poltrain.pl.
Poppins is a very simple and yet effective algorithm for document categorization.
Text categorization has become a very popular
issue in computational linguistics and it has developed to great complexity, motivating a large
body of literature.
Document categorization can be used in many scenarios. For instance,
an experiment on authorship attribution can be seen as a text categorization problem.
That is to say, each author represents a category and the
documents are the elements to be classified.
This system can be
used as a general purpose document classifier, for example by content instead of authorship,
because it only reproduces the criterion that it learned during the training phase.
This program is language independent because it uses purely mathematical
knowledge: an n-gram model of texts. It works in a very simple way and is therefore easy to
modify. In spite of its simplicity, this program is capable of classifying documents by author
with more than 90% accuracy.
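A minimal sketch of an n-gram document classifier in Python. It is not the Poppins code: the text above says only that an n-gram model is used, so the choice of character trigrams, cosine similarity and the function names below are assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Count character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def train(labelled_texts, n=3):
    """Merge the n-gram counts of all training texts that share a label."""
    profiles = {}
    for label, text in labelled_texts:
        profiles.setdefault(label, Counter()).update(ngram_profile(text, n))
    return profiles


def classify(text, profiles, n=3):
    """Return the label whose profile is most similar to the text."""
    doc = ngram_profile(text, n)
    return max(profiles, key=lambda label: cosine(doc, profiles[label]))
```

Nothing in the sketch depends on a particular language or on what the labels mean. Whether it classifies by author or by content depends only on how the training texts are labelled, which matches the point made above about reproducing the criterion learned in training.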
Verbario is our first attempt to extract lexical patterns using corpus statistics. A pattern is a structure that combines syntactic and semantic features and is linked to a conventional meaning of a word. This means, for example, that the verb to die does not have intrinsic meanings, but potential meanings which are activated by the context: in ‘His mother died when he was five’, the meaning of the verb differs from ‘His mother is dying to meet you’, due to collocational restrictions and syntactic differences. With the automatic analysis of thousands of concordances per verb, we can take a first step toward detecting these structures in corpora, a very time-consuming task for lexicographers. The average precision is around 50%. The next step to increase precision is to add a dependency parser to the system and to adjust the automatic taxonomy we have created for semantic labeling.