Wiki Wordnets

WordNet is a lexical database originally built for English. Wiki Wordnets is our ongoing effort to build Wordnets automatically from Wikipedias, Wiktionaries, and existing Wordnets. The resulting multilingual Wordnets are aligned through Princeton WordNet synsets. Automatically built Wordnets can contain errors; we estimate their accuracy at between 80% and 96%. We have both expanded existing Wordnets and built new ones from scratch. The Wordnets are built from Creative Commons licensed resources and are released under a Creative Commons license, so they can be used for any purpose, including commercial use, with appropriate attribution. This is ongoing research: we plan to further optimize the results for better recall and precision, and to add features such as automatically finding glosses for synsets in the target language.

Resources

Wordnets

The following Wordnets are automatically built from scratch. The file format is compatible with DEBVisDic.
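For orientation, a synset entry in DEBVisDic-compatible XML commonly follows a structure like the sketch below. The exact tags can vary between exports, and this particular entry (a Turkish synset aligned to a Princeton WordNet noun ID) is illustrative only:

```xml
<SYNSET>
  <ID>ENG30-02958343-n</ID>
  <POS>n</POS>
  <SYNONYM>
    <LITERAL>araba<SENSE>1</SENSE></LITERAL>
  </SYNONYM>
  <DEF>a motor vehicle with four wheels</DEF>
</SYNSET>
```

Because the ID references a Princeton WordNet synset, entries with the same ID in different language files are automatically aligned.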

Output Wordnets built from Wiktionary dumps 08/2017
Language Higher Accuracy Comprehensive

Aside from the languages above, the following table shows the WordNet base concept coverage of some of the Wordnets we have built. The XML files for these languages can also be generated using the tools we will provide together with the translation graph files. In the meantime, you can email us with your target language.

Wordnet Base Concept Coverage (%) Languages
88.2 German
80 Russian
> 75 Ido, Czech, Hungarian
> 65 Esperanto, Turkish, Kurdish
> 55 Ukrainian, Latin
> 50 Korean, Estonian, Occitan, Serbian
> 45 Chinese, Macedonian, Georgian
> 40 Vietnamese, Hindi, Armenian, Malay
> 35 Azerbaijani, Belarusian, Latvian, Kazakh, Swahili, Irish
> 30 Malagasy, Tagalog, Maori, Limburgish, Gaelic, Welsh, Yiddish
> 25 Turkmen, Faroese, Tajik, Khmer, Uzbek, Afrikaans, Interlingua, Tatar
> 20 Kyrgyz, Telugu, Mongolian, Urdu, Lao, Bengali, Burmese

The table below shows some of the Wordnets available from the Open Multilingual Wordnet website that we have expanded with our algorithm.

Coverage (%) and sense counts before and after expanding the Wordnets
ISO Language Coverage Before Coverage After Senses Before Senses After
dan Danish 81 84.7 4,476 31,313
ell Greek 57 83.55 18,049 41,272
fas Persian 41 63.75 17,759 26,124
fra French 92 96.67 59,091 66,143
heb Hebrew 27 57.34 5,448 23,710
ita Italian 83 91.96 35,001 54,739
glg Galician 36 63.43 19,312 26,355
spa Spanish 76 93.23 38,512 58,525
nld Dutch 67 88.9 30,177 49,280
pol Polish 54 84 33,826 47,623
por Portuguese 84 91.2 43,895 50,305
lit Lithuanian 35 60.1 9,462 23,541
slk Slovak 58 70.73 18,507 27,779
Graph Files

We also provide the igraph graph files, which contain all the languages and the inferred senses. After installing Python and igraph, the graph files can be processed as in the example below. The nodes carry the following attributes:

iGraph Attributes
Attribute Name: Description
lang: Language code of the word.
synsets: Synsets extracted from the Open Multilingual Wordnet project.
word: The word itself. Note that it is not unique, as a word in two different languages can have the same string.
name: Unique identifier for the node. The language code is prepended to the word, so a node can be retrieved with g.vs.find(name="tur:araba").
predsynsets: Predicted synsets for the word.
Example code for graph files

from igraph import Graph

g = Graph.Read_Picklez("5x1024_0.9.deeppredictedgraph155.picklez")
# Select the nodes of all Turkish words
turkish = g.vs.select(lang_eq="tur")
# Predicted synsets for the first Turkish word
print(turkish[0]["predsynsets"])
  • Graph file with parameter 0.9 (more accurate, less comprehensive)
  • Graph file with parameter 0.7 (less accurate, more comprehensive)

Science behind

Main Objective

WordNet is a valuable resource for many natural language processing and information retrieval tasks. It is organized as a taxonomy of hyponymy and hypernymy relations and also contains meronymy/holonymy relationships. The English WordNet, also known as Princeton WordNet, is a comprehensive resource; for other languages, however, a similarly extensive knowledge base is not always available. Building a Wordnet from scratch is a laborious task requiring many hours of work by expert linguists. Wiki-based knowledge bases, on the other hand, offer an extensive resource for many languages, but in a less structured form. This website presents our findings on how a Wordnet for a new language can be built using Wiktionaries, Wikipedias, and available Wordnets.

Translation Graph

By acquiring the translations in all Wiktionaries, we build a large, massively multilingual translation graph. To extract the translations, we extended the Java Wiktionary Library with a Turkish Wiktionary parser, and we use wikt2dict to parse other language editions of Wiktionary. We have also experimented with PanLex translations (the results in our articles exclude PanLex translations) and with automated methods for increasing the number of translations, such as triangulation. This translation graph is enriched with senses from the Open Multilingual Wordnet project. We are also working on integrating Wikipedia-embedding-based graphs.
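As a rough illustration of triangulation (not our production code, and with made-up toy data), the idea is: if two words in different languages both translate to the same pivot word, a direct translation edge between them can be inferred. A minimal sketch over a dictionary-based translation graph:

```python
# Illustrative sketch of translation triangulation on a toy graph.
# Node names follow the "lang:word" convention used in our graph files.
from collections import defaultdict

# Known translation pairs (toy data)
edges = {
    ("deu:Auto", "eng:car"),
    ("tur:araba", "eng:car"),
    ("tur:araba", "eng:automobile"),
}

# Build an undirected adjacency map
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def triangulate(graph):
    """Infer new edges between words that share a pivot neighbour
    but belong to different languages and are not yet connected."""
    inferred = set()
    for pivot, neighbours in graph.items():
        for a in neighbours:
            for b in neighbours:
                if (a < b and b not in graph[a]
                        and a.split(":")[0] != b.split(":")[0]):
                    inferred.add((a, b))
    return inferred

print(triangulate(adj))  # {('deu:Auto', 'tur:araba')}
```

Triangulation trades precision for coverage: a polysemous pivot word can link words that are not actually translations, which is one reason inferred edges need filtering.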

Synset Learning

To build a Wordnet for a new language, the words in that language must be associated with the correct synsets. We have devised multiple algorithms for learning these associations. The details of our algorithms will be made available as soon as our articles are publicly available.
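While our actual algorithms are not yet published, a naive baseline conveys the task: propagate known synset labels (e.g. from Open Multilingual Wordnet) along translation edges to unlabelled words. The sketch below uses hypothetical toy data:

```python
# Naive baseline sketch: predict synsets for unlabelled words by
# copying the synsets of their direct translation neighbours.
from collections import defaultdict

# Toy translation edges between "lang:word" nodes
edges = [("eng:car", "tur:araba"), ("eng:car", "deu:Auto")]

# Synsets known for some words (Princeton WordNet offset IDs)
synsets = {"eng:car": {"02958343-n"}}

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def propagate(adj, synsets):
    """Predict synsets for unlabelled words from labelled neighbours."""
    predicted = defaultdict(set)
    for word, neighbours in adj.items():
        if word in synsets:
            continue  # already labelled
        for n in neighbours:
            predicted[word] |= synsets.get(n, set())
    return dict(predicted)

print(propagate(adj, synsets))
```

This baseline inherits every ambiguity of the translation graph (a polysemous English word passes all of its synsets to its translations), which is exactly the problem our learning algorithms are designed to address.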

Publications

  • Uğur Sopaoğlu, Gönenç Ercan. Evaluation of Semantic Relatedness Measures for Turkish Language. CICLing 2016 (to appear)
  • Gönenç Ercan, Farid Haziyev. Synset Expansion on Translation Graph for Automatic Wordnet Construction. Information Processing and Management (under review)
  • Gönenç Ercan, Zeynep Şanlı. Türkçe Vikisözlüğün Doğal Dil İşlemede Kullanımları [Uses of the Turkish Wiktionary in Natural Language Processing]. Gazi Bilişim Dergisi (in preparation)

Team

Gönenç Ercan

Assistant Professor. Principal Investigator of the project.

Farid Haziyev

Graduate Student. Contributed to the graph expansion algorithm, the evaluation procedures, and the creation of the XML files. Worked on graph-based community detection methods for automatic Wordnet construction. Master's thesis topic: using Wikipedia to construct Wordnets automatically.

Zeynep Sanli

Graduate Student. Contributed to the Wiktionary parser, reading the Open Multilingual Wordnets, and supervised hypernymy/hyponymy detection. Master's thesis topic: learning hypernym and hyponym relationships from cross-lingual word embeddings.

Mustafa Köse

Graduate Student. Contributed to lexicon and translation learning from Wikipedia resources and to triangulation-based approaches for inferring new translations. Ph.D. topic: to be decided.

Merve Selçuk Şimşek

Graduate Student. Contributed to parsing translations, coded the Turkish Wiktionary parser, and worked on filtering the translation graph.

Enes Soylu

Undergraduate Student. Contributed to a Node.js-based collaborative Wordnet editing tool and a web-based word embedding visualization tool using d3.js.