Word2vec
Word2vec is a group of related models that are used to produce so-called word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words: the network is shown a word and must guess which words occurred in adjacent positions in an input text. The order of the context words is not important (bag-of-words assumption).[1]
After training, word2vec models can be used to map each word to a vector of typically several hundred elements, which represent that word's relation to other words. This vector is the neural network's hidden layer.[2]
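To illustrate the point that the embedding is the hidden layer, the following minimal NumPy sketch (toy vocabulary, untrained random weights, all names chosen purely for illustration) shows that with a one-hot input the hidden-layer activation is simply one row of the input weight matrix; training only adjusts those rows.

```python
import numpy as np

# Toy vocabulary and dimensions; sizes are illustrative only.
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 4              # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))    # input->hidden weights (one row per word)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# With a one-hot input, the hidden layer equals a single row of W_in,
# so that row is the word's embedding once the network is trained.
word_index = vocab.index("cat")
hidden = one_hot(word_index, V) @ W_in
assert np.allclose(hidden, W_in[word_index])
```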
Word2vec relies on either skip-grams or continuous bag of words (CBOW) to create neural word embeddings. It was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has subsequently been analysed and explained by other researchers,[3][4] and a Bayesian version of the algorithm has been proposed as well.[5]
Skip grams and CBOW
Skip-grams are word windows from which one word is excluded; they can be seen as n-grams with gaps. With skip-grams, given a window of n words around a word w, word2vec predicts the contextual words c, i.e. it models the probability p(c | w). Conversely, CBOW predicts the current word given the context in the window, i.e. it models p(w | c).
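To make the notation concrete, the sketch below (tiny vocabulary, random untrained vectors, names chosen for illustration) computes the skip-gram probability p(c | w) as a softmax over dot products between an "input" vector for w and "output" vectors for every candidate context word; the actual model replaces this full softmax with hierarchical softmax or negative sampling for efficiency. CBOW swaps the roles: the context vectors are averaged and used to score the centre word.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 4

rng = np.random.default_rng(1)
W_in = rng.normal(size=(V, D))    # "input" vectors v_w
W_out = rng.normal(size=(V, D))   # "output" (context) vectors u_c

def p_context_given_word(w, c):
    """Skip-gram softmax: p(c | w) = exp(u_c . v_w) / sum_c' exp(u_c' . v_w)."""
    v_w = W_in[vocab.index(w)]
    scores = W_out @ v_w                   # one score per candidate context word
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs[vocab.index(c)]

print(p_context_given_word("cat", "sat"))
```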
Extensions
An extension of word2vec to construct embeddings from entire documents (rather than the individual words) has been proposed.[6] This extension is called paragraph2vec or doc2vec and has been implemented in the C, Python [7][8] and Java/Scala [9] tools (see below), with the Java and Python versions also supporting inference of document embeddings on new, unseen documents.
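As a usage sketch of the Python implementation referenced above,[7][8] the following shows Gensim's Doc2Vec, including inference of an embedding for an unseen document; parameter names follow Gensim 4.x, and the corpus and tags are invented for illustration.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus; each document receives a tag.
corpus = [
    TaggedDocument(words=["word2vec", "learns", "word", "embeddings"], tags=["d0"]),
    TaggedDocument(words=["doc2vec", "extends", "this", "to", "documents"], tags=["d1"]),
]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

trained_vec = model.dv["d1"]     # embedding of a training document
new_vec = model.infer_vector(    # embedding inferred for an unseen document
    ["embeddings", "for", "an", "unseen", "document"]
)
```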
Word Vectors for Bioinformatics: BioVectors
An extension of word vectors to n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications has been proposed by Asgari and Mofrad.[10] Named bio-vectors (BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of machine learning in proteomics and genomics. The results presented in [10] suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
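The core preprocessing idea is to treat n-grams of a sequence as "words" and the sequence as a "sentence", then train word2vec as usual. The sketch below uses overlapping 3-grams for brevity (a simplification of the paper's splitting scheme); the sequences and Gensim parameters are illustrative, not taken from the paper.

```python
from gensim.models import Word2Vec

def to_ngrams(sequence, n=3):
    """Split a biological sequence into overlapping n-grams, treated as 'words'."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# Illustrative amino-acid strings (not real proteins).
sequences = ["MKTAYIAKQR", "MKTLLVAAGR", "GAVLIPFYWS"]
sentences = [to_ngrams(s, 3) for s in sequences]

model = Word2Vec(sentences, vector_size=25, window=5, min_count=1, sg=1)
vec = model.wv["MKT"]   # embedding of a 3-gram
```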
Analysis
The reasons for successful word embedding learning in the word2vec framework are poorly understood. Goldberg and Levy point out that the word2vec objective function causes words that occur in similar contexts to have similar embeddings (as measured by cosine similarity) and note that this is in line with J. R. Firth's distributional hypothesis. They also note that this explanation is "very hand-wavy".
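For reference, the cosine similarity used as the measure here is the dot product of two embedding vectors normalized by their lengths; a minimal sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: a.b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.1, 0.9])
b = np.array([0.3, 0.0, 0.8])
print(cosine_similarity(a, b))
```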
Levy et al. (2015)[11] showed that much of the superior performance of word2vec and similar embeddings in downstream tasks is not a result of the models per se, but of the choice of specific hyperparameters. Transferring these hyperparameters to more 'traditional' approaches yields similar performance in downstream tasks.
Implementations
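A minimal usage sketch for one widely used implementation, Gensim's Word2Vec (parameter names follow Gensim 4.x; the corpus is invented and the hyperparameters are illustrative rather than recommendations):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects the skip-gram architecture; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["cat"]                        # the word's embedding
similar = model.wv.most_similar("cat", topn=3)  # nearest words by cosine similarity
```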
See also
- Autoencoder
- Document-term matrix
- Feature extraction
- Feature learning
- Language modeling § Neural net language models
- Vector space model
- Thought vector
References
- ↑ Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.
- ↑ Mikolov, Tomas; et al. "Efficient Estimation of Word Representations in Vector Space" (PDF). Retrieved 2015-08-14.
- ↑ Goldberg, Yoav; Levy, Omer. "word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method" (PDF). Retrieved 2015-08-14.
- ↑ Řehůřek, Radim. Word2vec and friends (Youtube video). Retrieved 2015-08-14.
- ↑ Barkan, Oren (2015). "Bayesian Neural Word Embedding". arXiv:1603.06571 [cs.CL].
- ↑ Le, Quoc; et al. "Distributed Representations of Sentences and Documents." (PDF). Retrieved 2016-02-18.
- ↑ "Doc2Vec tutorial using Gensim". Retrieved 2015-08-02.
- ↑ "Doc2vec for IMDB sentiment analysis". Retrieved 2016-02-18.
- ↑ "Doc2Vec and Paragraph Vectors for Classification". Retrieved 2016-01-13.
- ↑ Asgari, Ehsaneddin; Mofrad, Mohammad R.K. (2015). "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics". PLoS ONE 10 (11): e0141287.
- ↑ Levy, Omer; Goldberg, Yoav; Dagan, Ido (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics.