Wordnet is a lexical database originally built for English. Wiki Wordnets is our ongoing effort to build Wordnets automatically from Wikipedias, Wiktionaries and available Wordnets. The multilingual Wordnets are aligned through the Princeton Wordnet synsets. The automatically built Wordnets can contain errors, with an estimated accuracy between 80% and 96%. We have expanded some existing Wordnets and built others from scratch. The Wordnets are built from Creative Commons licensed resources and are provided under a Creative Commons license. They can be used for any purpose, including commercial use, with appropriate attribution. This is an ongoing research project, in which we plan to further optimize the results for better recall and precision and to add features such as automatically finding glosses for synsets in the target language.
The following Wordnets are built automatically from scratch. The file format is compatible with DEBVisDic.
Language | Higher Accuracy | Comprehensive |
---|---|---|
Aside from the languages above, the following table shows the Wordnet base concept coverage of some of the Wordnets we built. The XML files for these languages can also be generated using the tools we will provide together with the translation graph files. In the meantime, you can email us with your target language.
Wordnet Base Concept Coverage (%) | Languages |
---|---|
88.2 | German |
80 | Russian |
> 75 | Ido, Czech, Hungarian |
> 65 | Esperanto, Turkish, Kurdish |
> 55 | Ukrainian, Latin |
> 50 | Korean, Estonian, Occitan, Serbian |
> 45 | Chinese, Macedonian, Georgian |
> 40 | Vietnamese, Hindi, Armenian, Malay |
> 35 | Azerbaijani, Belarusian, Latvian, Kazakh, Swahili, Irish |
> 30 | Malagasy, Tagalog, Maori, Limburgish, Gaelic, Welsh, Yiddish |
> 25 | Turkmen, Faroese, Tajik, Khmer, Uzbek, Afrikaans, Interlingua, Tatar |
> 20 | Kirghiz, Telugu, Mongolian, Urdu, Lao, Bengali, Burmese |
The table below shows some of the Wordnets available from the Open Multilingual Wordnet website that we expanded with our algorithm. The columns give the base concept coverage (%) and the number of synsets covered, before and after expansion.
ISO | Language | Coverage Before (%) | Coverage After (%) | Synsets Before | Synsets After |
---|---|---|---|---|---|
dan | Danish | 81 | 84.7 | 4,476 | 31,313 |
ell | Greek | 57 | 83.55 | 18,049 | 41,272 |
fas | Persian | 41 | 63.75 | 17,759 | 26,124 |
fra | French | 92 | 96.67 | 59,091 | 66,143 |
heb | Hebrew | 27 | 57.34 | 5,448 | 23,710 |
ita | Italian | 83 | 91.96 | 35,001 | 54,739 |
glg | Galician | 36 | 63.43 | 19,312 | 26,355 |
spa | Spanish | 76 | 93.23 | 38,512 | 58,525 |
nld | Dutch | 67 | 88.9 | 30,177 | 49,280 |
pol | Polish | 54 | 84 | 33,826 | 47,623 |
por | Portuguese | 84 | 91.2 | 43,895 | 50,305 |
lit | Lithuanian | 35 | 60.1 | 9,462 | 23,541 |
slk | Slovak | 58 | 70.73 | 18,507 | 27,779 |
We also provide the igraph graph files, which contain all the languages and the inferred senses. After installing Python and python-igraph, the graph files can be processed as shown in the example below. The nodes carry the following attributes:
Attribute Name | Description |
---|---|
lang | Language code of the word. |
synsets | Synsets extracted from the Open Multilingual Wordnet project. |
word | The word itself. Note that it is not unique, since a word in two different languages can have the same string. |
name | Unique identifier for the node: the language code prepended to the word. You can retrieve a node with `g.vs.find(name="tur:araba")`. |
predsynsets | Predicted synsets for the word. |
```python
from igraph import Graph

# Load the compressed pickled translation graph
g = Graph.Read_Picklez("5x1024_0.9.deeppredictedgraph155.picklez")

# Select all nodes for Turkish words
vs = [v for v in g.vs if v["lang"] == "tur"]

# Predicted synsets of the first Turkish word
print(vs[0]["predsynsets"])
```
Wordnet is a valuable resource for many natural language processing and information retrieval tasks. It is organized as a taxonomy of hypernymy and hyponymy relations and also contains meronymy/holonymy relationships. The English Wordnet, also known as Princeton Wordnet, is a comprehensive resource; for other languages, however, a similarly extensive knowledge base is not always available. Building a Wordnet from scratch is a laborious task, requiring many hours of work by expert linguists. Wiki-based knowledge bases, on the other hand, offer an extensive resource for many languages, but in a less structured form. This website presents our findings on how a Wordnet for a new language can be built using Wiktionaries, Wikipedias and available Wordnets.
By acquiring the translations in all Wiktionary editions, we build a large, massively multilingual translation graph. For extracting the translations, the Java Wiktionary Library is extended with a Turkish Wiktionary parser, and we use wikt2dict to parse the other language editions. We also experimented with Panlex translations (the results in our articles exclude Panlex translations) and with automated methods for increasing the number of translations, such as triangulation: if words in two different languages are both translations of the same word in a third, pivot language, a translation between them can be inferred. The translation graph is enriched with senses from the Open Multilingual Wordnet project. We are also working on integrating Wikipedia embedding based graphs.
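As a minimal sketch of the triangulation idea (an illustration, not our production code; the function name `triangulate` and the `min_pivots` threshold are hypothetical), assume an undirected igraph translation graph with the `lang` and `name` node attributes described above. Two unconnected words in different languages that share enough common translations are proposed as a new translation pair:

```python
from collections import Counter
from itertools import combinations
from igraph import Graph

def triangulate(g, min_pivots=2):
    """Propose new translation edges: two unconnected words in
    different languages that share at least `min_pivots` common
    translations (pivots) are inferred to translate each other."""
    support = Counter()
    for pivot in g.vs:
        for u, w in combinations(pivot.neighbors(), 2):
            if u["lang"] != w["lang"] and not g.are_connected(u.index, w.index):
                support[tuple(sorted((u.index, w.index)))] += 1
    return [pair for pair, votes in support.items() if votes >= min_pivots]

# Toy example: eng:car - deu:Auto - tur:araba implies eng:car - tur:araba
g = Graph(3)
g.vs["name"] = ["eng:car", "deu:Auto", "tur:araba"]
g.vs["lang"] = ["eng", "deu", "tur"]
g.add_edges([(0, 1), (1, 2)])
for a, b in triangulate(g, min_pivots=1):
    print(g.vs[a]["name"], "<->", g.vs[b]["name"])
```

Requiring more than one pivot reduces the errors that polysemous pivot words would otherwise introduce.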
To build a Wordnet for a new language, the words in that language must be associated with the correct synsets. We have devised multiple algorithms for learning these associations; their details will be available as soon as our articles are publicly available.
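For intuition only, here is a naive baseline (emphatically not one of our algorithms, whose details will appear in the articles): a word inherits any synset on which several of its direct translations in the graph agree. The function name and the `min_votes` threshold are illustrative.

```python
from collections import Counter

def baseline_predict_synsets(g, vertex_name, min_votes=2):
    """Naive baseline: predict synsets for a word by voting over the
    known synsets of its direct translations in the graph."""
    v = g.vs.find(name=vertex_name)
    votes = Counter()
    for nbr in v.neighbors():
        for synset in nbr["synsets"] or []:
            votes[synset] += 1
    return [s for s, n in votes.items() if n >= min_votes]

# e.g. baseline_predict_synsets(g, "tur:araba")
```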
Graduate Student: Contributed to the graph expansion algorithm, evaluation procedures, and creation of the XML files. Worked on graph-based community detection methods for automatic Wordnet construction. Master's thesis topic: using Wikipedia to construct Wordnets automatically.
Graduate Student: Contributed to the Wiktionary parser, reading the Open Multilingual Wordnets, and supervised hypernymy/hyponymy detection. Master's thesis topic: learning hypernym and hyponym relationships from cross-lingual word embeddings.
Graduate Student: Contributed to lexicon and translation learning from Wikipedia resources, and to triangulation-based approaches for inferring new translations. Ph.D. topic: to be decided.
Graduate Student: Contributed to parsing translations, coded the Turkish Wiktionary parser, and filtered the translation graph.
Undergraduate Student: Contributed to a Node.js-based Wordnet collaborative editing tool and a web-based word embedding visualization tool using d3.js.