Wiki Wordnets

WordNet is a lexical database originally built for English. Wiki Wordnets is our ongoing effort to build Wordnets automatically from Wikipedias, Wiktionaries, and existing Wordnets. The resulting multilingual Wordnets are aligned through Princeton WordNet synsets. Automatically built Wordnets can contain errors; we estimate their accuracy at between 80% and 96%. We have both expanded existing Wordnets and built new ones from scratch. The Wordnets are built from Creative Commons licensed resources and are released under a Creative Commons license, so they can be used for any purpose, including commercial use, with appropriate attribution. This is ongoing research: we plan to further optimize the results for better recall and precision, and to add features such as automatically finding glosses for synsets in the target language.

Resources

Wordnets

The following Wordnets are automatically built from scratch. The file format is compatible with DEBVisDic.
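For orientation, a synset entry in DEBVisDic-compatible XML commonly follows a structure like the sketch below. The exact tags can vary between exports, and this particular entry (a Turkish synset aligned to a Princeton WordNet noun ID) is illustrative only:

```xml
<SYNSET>
  <ID>ENG30-02958343-n</ID>
  <POS>n</POS>
  <SYNONYM>
    <LITERAL>araba<SENSE>1</SENSE></LITERAL>
  </SYNONYM>
  <DEF>a motor vehicle with four wheels</DEF>
</SYNSET>
```

Because the ID references a Princeton WordNet synset, entries with the same ID in different language files are automatically aligned.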

Output Wordnets built from Wiktionary dumps 08/2017
Language Higher Accuracy Comprehensive

Aside from the languages above, the following table shows the WordNet base concept coverage of some of the Wordnets we have built. The XML files for these languages can also be generated using the tools we will provide together with the translation graph files. In the meantime, you can email us with your target language.

Wordnet Base Concept Coverage (%) Languages
88.2 German
80 Russian
> 75 Ido, Czech, Hungarian
> 65 Esperanto, Turkish, Kurdish
> 55 Ukrainian, Latin
> 50 Korean, Estonian, Occitan, Serbian
> 45 Chinese, Macedonian, Georgian
> 40 Vietnamese, Hindi, Armenian, Malay
> 35 Azerbaijani, Belarusian, Latvian, Kazakh, Swahili, Irish
> 30 Malagasy, Tagalog, Maori, Limburgish, Gaelic, Welsh, Yiddish
> 25 Turkmen, Faroese, Tajik, Khmer, Uzbek, Afrikaans, Interlingua, Tatar
> 20 Kyrgyz, Telugu, Mongolian, Urdu, Lao, Bengali, Burmese

The table below shows some of the Wordnets available from the Open Multilingual Wordnet website that we have expanded with our algorithm.

Coverage (%) and sense counts before and after expanding the Wordnets
ISO Language Coverage Before Coverage After Senses Before Senses After
dan Danish 81 84.7 4,476 31,313
ell Greek 57 83.55 18,049 41,272
fas Persian 41 63.75 17,759 26,124
fra French 92 96.67 59,091 66,143
heb Hebrew 27 57.34 5,448 23,710
ita Italian 83 91.96 35,001 54,739
glg Galician 36 63.43 19,312 26,355
spa Spanish 76 93.23 38,512 58,525
nld Dutch 67 88.9 30,177 49,280
pol Polish 54 84 33,826 47,623
por Portuguese 84 91.2 43,895 50,305
lit Lithuanian 35 60.1 9,462 23,541
slk Slovak 58 70.73 18,507 27,779
Graph Files

We also provide the igraph graph files, which contain all the languages and the inferred senses. After installing Python and igraph, the graph files can be processed as in the example below. The nodes carry the following attributes:

iGraph Attributes
Attribute Name: Description
lang: Language code of the word.
synsets: Synsets extracted from the Open Multilingual Wordnet project.
word: The word itself. Note that it is not unique, as a word in two different languages can have the same string.
name: Unique identifier for the node. The language code is prepended to the word, so a node can be retrieved with g.vs.find(name="tur:araba").
predsynsets: Predicted synsets for the word.
Example code for graph files

from igraph import Graph

g = Graph.Read_Picklez("5x1024_0.9.deeppredictedgraph155.picklez")
# Select the nodes of all Turkish words
turkish = g.vs.select(lang_eq="tur")
# Predicted synsets for the first Turkish word
print(turkish[0]["predsynsets"])
  • Graph file with parameter 0.9 (more accurate, less comprehensive)
  • Graph file with parameter 0.7 (less accurate, more comprehensive)

Science behind

Main Objective

WordNet is a valuable resource for many natural language processing and information retrieval tasks. It is organized as a taxonomy of hyponymy and hypernymy relations and also contains meronymy/holonymy relationships. The English WordNet, also known as Princeton WordNet, is a comprehensive resource; for other languages, however, a similarly extensive knowledge base is not always available. Building a Wordnet from scratch is a laborious task requiring many hours of work by expert linguists. Wiki-based knowledge bases, on the other hand, offer an extensive resource for many languages, but in a less structured form. This website presents our findings on how a Wordnet for a new language can be built using Wiktionaries, Wikipedias, and available Wordnets.

Translation Graph

By acquiring the translations in all Wiktionaries, we build a large, massively multilingual translation graph. To extract the translations, we extended the Java Wiktionary Library with a Turkish Wiktionary parser, and we use wikt2dict to parse other language editions of Wiktionary. We have also experimented with PanLex translations (the results in our articles exclude PanLex translations) and with automated methods for increasing the number of translations, such as triangulation. This translation graph is enriched with senses from the Open Multilingual Wordnet project. We are also working on integrating Wikipedia-embedding-based graphs.
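As a rough illustration of triangulation (not our production code, and with made-up toy data), the idea is: if two words in different languages both translate to the same pivot word, a direct translation edge between them can be inferred. A minimal sketch over a dictionary-based translation graph:

```python
# Illustrative sketch of translation triangulation on a toy graph.
# Node names follow the "lang:word" convention used in our graph files.
from collections import defaultdict

# Known translation pairs (toy data)
edges = {
    ("deu:Auto", "eng:car"),
    ("tur:araba", "eng:car"),
    ("tur:araba", "eng:automobile"),
}

# Build an undirected adjacency map
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def triangulate(graph):
    """Infer new edges between words that share a pivot neighbour
    but belong to different languages and are not yet connected."""
    inferred = set()
    for pivot, neighbours in graph.items():
        for a in neighbours:
            for b in neighbours:
                if (a < b and b not in graph[a]
                        and a.split(":")[0] != b.split(":")[0]):
                    inferred.add((a, b))
    return inferred

print(triangulate(adj))  # {('deu:Auto', 'tur:araba')}
```

Triangulation trades precision for coverage: a polysemous pivot word can link words that are not actually translations, which is one reason inferred edges need filtering.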

Synset Learning

To build a Wordnet for a new language, the words in that language must be associated with the correct synsets. We have devised multiple algorithms for learning these associations. The details of our algorithms will be made available as soon as our articles are publicly available.
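While our actual algorithms are not yet published, a naive baseline conveys the task: propagate known synset labels (e.g. from Open Multilingual Wordnet) along translation edges to unlabelled words. The sketch below uses hypothetical toy data:

```python
# Naive baseline sketch: predict synsets for unlabelled words by
# copying the synsets of their direct translation neighbours.
from collections import defaultdict

# Toy translation edges between "lang:word" nodes
edges = [("eng:car", "tur:araba"), ("eng:car", "deu:Auto")]

# Synsets known for some words (Princeton WordNet offset IDs)
synsets = {"eng:car": {"02958343-n"}}

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def propagate(adj, synsets):
    """Predict synsets for unlabelled words from labelled neighbours."""
    predicted = defaultdict(set)
    for word, neighbours in adj.items():
        if word in synsets:
            continue  # already labelled
        for n in neighbours:
            predicted[word] |= synsets.get(n, set())
    return dict(predicted)

print(propagate(adj, synsets))
```

This baseline inherits every ambiguity of the translation graph (a polysemous English word passes all of its synsets to its translations), which is exactly the problem our learning algorithms are designed to address.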

Publications

  • Uğur Sopaoğlu, Gönenç Ercan. Evaluation of Semantic Relatedness Measures for Turkish Language. CICLing 2016 (to appear)
  • Gönenç Ercan, Farid Haziyev. Synset Expansion on Translation Graph for Automatic Wordnet Construction. Information Processing and Management (under review)
  • Gönenç Ercan, Zeynep Şanlı. Türkçe Vikisözlüğün Doğal Dil İşlemede Kullanımları [Uses of the Turkish Wiktionary in Natural Language Processing]. Gazi Bilişim Dergisi (in preparation)

Team

Gönenç Ercan

Assistant Professor. Principal Investigator of the project.

Farid Haziyev

Graduate Student. Contributed to the graph expansion algorithm, the evaluation procedures, and the creation of the XML files. Worked on graph-based community detection methods for automatic Wordnet construction. Master's thesis topic: using Wikipedia to construct Wordnets automatically.

Zeynep Sanli

Graduate Student. Contributed to the Wiktionary parser, reading the Open Multilingual Wordnets, and supervised hypernymy/hyponymy detection. Master's thesis topic: learning hypernym and hyponym relationships from cross-lingual word embeddings.

Mustafa Köse

Graduate Student. Contributed to lexicon and translation learning from Wikipedia resources and to triangulation-based approaches for inferring new translations. Ph.D. topic: to be decided.

Merve Selçuk Şimşek

Graduate Student. Contributed to parsing translations, coded the Turkish Wiktionary parser, and worked on filtering the translation graph.

Enes Soylu

Undergraduate Student. Contributed to a Node.js-based collaborative Wordnet editing tool and a web-based word embedding visualization tool using d3.js.