Word2vec
Word2vec is a group of related models that are used to produce so-called word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words: the network is shown a word and must guess which words occurred in adjacent positions in an input text. The order of the context words is not important (bag-of-words assumption).[1]
After training, word2vec models can be used to map each word to a vector of typically several hundred elements, which represent that word's relation to other words. This vector is the neural network's hidden layer.[2]
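To illustrate the point that the embedding is the hidden layer, the following minimal NumPy sketch (toy vocabulary, untrained random weights, all names chosen purely for illustration) shows that with a one-hot input the hidden-layer activation is simply one row of the input weight matrix; training only adjusts those rows.

```python
import numpy as np

# Toy vocabulary and dimensions; sizes are illustrative only.
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 4              # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))    # input->hidden weights (one row per word)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# With a one-hot input, the hidden layer equals a single row of W_in,
# so that row is the word's embedding once the network is trained.
word_index = vocab.index("cat")
hidden = one_hot(word_index, V) @ W_in
assert np.allclose(hidden, W_in[word_index])
```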
Word2vec relies on either skip-grams or continuous bag of words (CBOW) to create neural word embeddings. It was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has subsequently been analysed and explained by other researchers,[3][4] and a Bayesian version of the algorithm has been proposed as well.[5]
Skip grams and CBOW
Skip-grams are word windows from which one word is excluded; they can be seen as n-grams with gaps. With skip-grams, given a window of n words around a word w, word2vec predicts the contextual words c, i.e. it models the probability p(c | w). Conversely, CBOW predicts the current word given the context in the window, i.e. it models p(w | c).
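To make the notation concrete, the sketch below (tiny vocabulary, random untrained vectors, names chosen for illustration) computes the skip-gram probability p(c | w) as a softmax over dot products between an "input" vector for w and "output" vectors for every candidate context word; the actual model replaces this full softmax with hierarchical softmax or negative sampling for efficiency. CBOW swaps the roles: the context vectors are averaged and used to score the centre word.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 4

rng = np.random.default_rng(1)
W_in = rng.normal(size=(V, D))    # "input" vectors v_w
W_out = rng.normal(size=(V, D))   # "output" (context) vectors u_c

def p_context_given_word(w, c):
    """Skip-gram softmax: p(c | w) = exp(u_c . v_w) / sum_c' exp(u_c' . v_w)."""
    v_w = W_in[vocab.index(w)]
    scores = W_out @ v_w                   # one score per candidate context word
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs[vocab.index(c)]

print(p_context_given_word("cat", "sat"))
```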
Extensions
An extension of word2vec to construct embeddings from entire documents (rather than the individual words) has been proposed.[6] This extension is called paragraph2vec or doc2vec and has been implemented in the C, Python [7][8] and Java/Scala [9] tools (see below), with the Java and Python versions also supporting inference of document embeddings on new, unseen documents.
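As a usage sketch of the Python implementation referenced above,[7][8] the following shows Gensim's Doc2Vec, including inference of an embedding for an unseen document; parameter names follow Gensim 4.x, and the corpus and tags are invented for illustration.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus; each document receives a tag.
corpus = [
    TaggedDocument(words=["word2vec", "learns", "word", "embeddings"], tags=["d0"]),
    TaggedDocument(words=["doc2vec", "extends", "this", "to", "documents"], tags=["d1"]),
]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

trained_vec = model.dv["d1"]     # embedding of a training document
new_vec = model.infer_vector(    # embedding inferred for an unseen document
    ["embeddings", "for", "an", "unseen", "document"]
)
```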
Word Vectors for Bioinformatics: BioVectors
An extension of word vectors to n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications has been proposed by Asgari and Mofrad.[10] Named bio-vectors (BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of machine learning in proteomics and genomics. The results presented in [10] suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
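The core preprocessing idea is to treat n-grams of a sequence as "words" and the sequence as a "sentence", then train word2vec as usual. The sketch below uses overlapping 3-grams for brevity (a simplification of the paper's splitting scheme); the sequences and Gensim parameters are illustrative, not taken from the paper.

```python
from gensim.models import Word2Vec

def to_ngrams(sequence, n=3):
    """Split a biological sequence into overlapping n-grams, treated as 'words'."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# Illustrative amino-acid strings (not real proteins).
sequences = ["MKTAYIAKQR", "MKTLLVAAGR", "GAVLIPFYWS"]
sentences = [to_ngrams(s, 3) for s in sequences]

model = Word2Vec(sentences, vector_size=25, window=5, min_count=1, sg=1)
vec = model.wv["MKT"]   # embedding of a 3-gram
```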
Analysis
The reasons for successful word embedding learning in the word2vec framework are poorly understood. Goldberg and Levy point out that the word2vec objective function causes words that occur in similar contexts to have similar embeddings (as measured by cosine similarity) and note that this is in line with J. R. Firth's distributional hypothesis. They also note that this explanation is "very hand-wavy".
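For reference, the cosine similarity used as the measure here is the dot product of two embedding vectors normalized by their lengths; a minimal sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: a.b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.1, 0.9])
b = np.array([0.3, 0.0, 0.8])
print(cosine_similarity(a, b))
```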
Levy et al. (2015)[11] showed that much of the superior performance of word2vec and similar embeddings in downstream tasks is not a result of the models per se, but of the choice of specific hyperparameters. Transferring these hyperparameters to more 'traditional' approaches yields similar performance in downstream tasks.
Implementations
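A minimal usage sketch for one widely used implementation, Gensim's Word2Vec (parameter names follow Gensim 4.x; the corpus is invented and the hyperparameters are illustrative rather than recommendations):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects the skip-gram architecture; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["cat"]                        # the word's embedding
similar = model.wv.most_similar("cat", topn=3)  # nearest words by cosine similarity
```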
See also
- Autoencoder
- Document-term matrix
- Feature extraction
- Feature learning
- Language modeling § Neural net language models
- Vector space model
- Thought vector
References
- ↑ Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.
- ↑ Mikolov, Tomas; et al. "Efficient Estimation of Word Representations in Vector Space" (PDF). Retrieved 2015-08-14.
- ↑ Goldberg, Yoav; Levy, Omer. "word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method" (PDF). Retrieved 2015-08-14.
- ↑ Řehůřek, Radim. Word2vec and friends (Youtube video). Retrieved 2015-08-14.
- ↑ Barkan, Oren (2015). "Bayesian Neural Word Embedding". arXiv:1603.06571 [cs.CL].
- ↑ Le, Quoc; et al. "Distributed Representations of Sentences and Documents." (PDF). Retrieved 2016-02-18.
- ↑ "Doc2Vec tutorial using Gensim". Retrieved 2015-08-02.
- ↑ "Doc2vec for IMDB sentiment analysis". Retrieved 2016-02-18.
- ↑ "Doc2Vec and Paragraph Vectors for Classification". Retrieved 2016-01-13.
- ↑ Asgari, Ehsaneddin; Mofrad, Mohammad R.K. (2015). "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics". PLoS ONE 10 (11): e0141287.
- ↑ Levy, Omer; Goldberg, Yoav; Dagan, Ido (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics.