
The ability of word embeddings to represent semantic relations between words as spatial distances is called the semantic property of the embedding set. The primary downside of such vector representations is that they quantify only the relative semantics between words. Moreover, in linear algebraic terms, the vector space spanned by each embedding model is different: for the same corpus, the vectors come out differently each time a different training algorithm is used. Hence, experiments are carried out with the three most popular embedding algorithms: Word2Vec, GloVe, and FastText. Recently developed contextual embedding methods such as Contextualized Word Vectors (CoVe), Embeddings from Language Models (ELMo), and Bidirectional Encoder Representations from Transformers (BERT) do not support the transfer of knowledge across languages, particularly from resource-rich to resource-poor languages, as they rely on strong baseline architectures specific to each task. In this paper, cross-lingual embedding is accomplished by mapping the vectors from one language’s embedding space into that of the other through a transfer function. The trained cross-lingual model, Transfer Function-based Generated Embedding (TFGE), synthesizes new vectors for unknown words by transfer learning with a minimal seed dictionary (five thousand words) from a resource-rich source language (English) to a resource-poor target language (Tamil). Multiple experiments with various methodologies are carried out to obtain target word vectors for the English–Tamil language pair.
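
As a rough illustration of the mapping idea (not the exact TFGE architecture, which uses deep learning), the sketch below fits a linear transfer function between two embedding spaces from a seed dictionary of aligned word pairs via a closed-form least-squares solution. The function names, dimensions, and the random vectors standing in for real English and Tamil embeddings are illustrative assumptions.

```python
# Minimal sketch: learn a linear transfer function W that maps source-language
# (English) vectors onto the target-language (Tamil) embedding space, using a
# small seed dictionary of aligned word pairs. The paper's deep-learning
# transfer function is approximated here by a closed-form least-squares fit.
import numpy as np

def fit_transfer_function(src_vecs, tgt_vecs):
    """src_vecs, tgt_vecs: (n_pairs, dim) arrays of seed-dictionary vectors.
    Returns W such that src_vecs @ W approximates tgt_vecs."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def project(src_vec, W):
    """Map a single source-language vector into the target embedding space."""
    return src_vec @ W

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
dim, n_pairs = 100, 5000                   # e.g. a 5,000-word seed dictionary
src = rng.normal(size=(n_pairs, dim))
tgt = rng.normal(size=(n_pairs, dim))
W = fit_transfer_function(src, tgt)
generated_target_vec = project(src[0], W)  # synthesized target-side vector
```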

The overwhelming success of deep learning networks, in terms of state-of-the-art accuracy and excellent benchmark results compared with extant statistical and other traditional methods, was driven by the features these networks automatically extract from the corpus. These neural features are now what we call word vectors or word embeddings. In addition, the process is completely unsupervised. Subsequently, all NLP tasks were effectively rebooted using deep neural networks. The first-ever word embedding model learned word vector representations by predicting the next word in a sequence within a local context window, using a neural network model. With the evolution of word embeddings, which are densely distributed vector representations of words, a quantitative representation of semantics became possible: the vectors translate relative semantics between words into spatial positions. Furthermore, we demonstrate the usability of the generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that these are not the only possible applications.
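
To make the notion of unsupervised, distance-encoded semantics concrete, the sketch below trains dense word vectors on a toy corpus with gensim’s Word2Vec (used here as a stand-in, not the historical first embedding model) and reads relative semantics off as cosine similarity. The toy sentences and hyperparameters are illustrative only and assume the gensim library is available.

```python
# Illustrative sketch: unsupervised training of dense word vectors from raw
# text, with relative semantics surfacing as spatial (cosine) similarity.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
    ["dogs", "and", "cats", "are", "animals"],
]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# Semantic relations between words are read off as distances in vector space.
print(model.wv.similarity("king", "queen"))
print(model.wv.most_similar("dogs", topn=2))
```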

We empirically demonstrate that, with a bilingual dictionary of a thousand words and a correspondingly small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. The transferability of the proposed model to other target languages was assessed using pre-trained Word2Vec embeddings for Hindi and Chinese. A novel evaluation paradigm was devised to assess the effectiveness of the generated embeddings, using the original embeddings as ground truth.
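
One plausible instantiation of such a ground-truth comparison is sketched below (the exact metrics of the proposed evaluation paradigm may differ): each generated vector is compared with the original vector of the same word via cosine similarity, together with a precision@1 check on nearest neighbours. The function names and the assumption of identically ordered vocabularies are ours.

```python
# Sketch: score generated embeddings against the original ("ground-truth")
# embeddings by mean per-word cosine similarity and by the fraction of words
# whose nearest ground-truth neighbour is the word itself (precision@1).
import numpy as np

def cosine_matrix(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def evaluate(generated, ground_truth):
    """generated, ground_truth: (n_words, dim) arrays with matching row order."""
    sims = cosine_matrix(generated, ground_truth)
    mean_cos = float(np.mean(np.diag(sims)))
    p_at_1 = float(np.mean(np.argmax(sims, axis=1) == np.arange(len(generated))))
    return mean_cos, p_at_1

# Toy usage with noisy copies standing in for generated vectors.
rng = np.random.default_rng(1)
gt = rng.normal(size=(1000, 100))
gen = gt + 0.1 * rng.normal(size=gt.shape)
print(evaluate(gen, gt))
```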

Linguists have traditionally focused on qualitative comparisons of the semantics of different languages. The concept of word embedding in Natural Language Processing (NLP) has opened a felicitous opportunity to quantify linguistic semantics. Evaluating semantic interpretation across disparate language pairs such as English and Tamil is an even more formidable task than for closely related languages such as the Slavic family. Multilingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe, and FastText.
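
As an example of how such a projection can serve a downstream multilingual task, the sketch below performs a minimal bilingual dictionary induction step: a source-language vector is projected into the target space with a transfer matrix W and the nearest target word by cosine similarity is returned. The linear projection, variable names, and placeholder vocabulary are assumptions for illustration, not the exact pipeline proposed here.

```python
# Sketch: bilingual dictionary induction (BDI) once a transfer function is
# available. Project a source (English) vector into the target (Tamil) space
# and retrieve the closest target word by cosine similarity.
import numpy as np

def induce_translation(src_vec, W, tgt_matrix, tgt_words):
    """Return the target word whose vector is nearest (cosine) to src_vec @ W."""
    proj = src_vec @ W
    proj = proj / np.linalg.norm(proj)
    tgt_norm = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    return tgt_words[int(np.argmax(tgt_norm @ proj))]

# Toy usage with random stand-ins for real embeddings and a learned W.
rng = np.random.default_rng(2)
dim = 100
W = rng.normal(size=(dim, dim))
tgt_matrix = rng.normal(size=(5, dim))
tgt_words = [f"tamil_word_{i}" for i in range(5)]  # placeholder vocabulary
print(induce_translation(rng.normal(size=dim), W, tgt_matrix, tgt_words))
```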
