# NLP - Word Embedding and Sequence Analysis

• Introduction
• Word2Vec
• Preliminary
• Projection Matrix
• Word Encoding
• Model Architecture and Training
• GloVe Model: Global Vectors for Word Representation
• Skip-Gram with Negative Sampling
• Sequence Analysis
• Deep Walk

### Introduction

One of the preliminary steps in many NLP tasks is to find a latent representation of the words in a given vocabulary. Most of the time, this latent representation takes the form of a vector in $$\mathbb{R}^d$$.

There are many methods in the literature for constructing such a latent representation. In this post, we present the following three methods:

• word2vec
• GloVe model
• Skip-Gram with Negative Sampling (SGNS)

In the last section, we will present a very interesting application of NLP techniques.

### Word2Vec

#### Preliminary

##### Projection Matrix

A projection matrix is used to project a vector from a higher dimension to a lower dimension. This can be expressed as matrix multiplication. In the neural network context, the projected vector is the layer that follows the input layer.

For example, suppose our input has four dimensions, represented by a vector $$x = (x_1, x_2, x_3, x_4)$$. We can also think of it as an input layer of 4 nodes, as illustrated in the figure below. Suppose the projection has 3 dimensions, represented by a vector $$y = (y_1, y_2, y_3)$$, or equivalently a hidden layer of 3 nodes in a neural network. The weights in this very simple neural network are the elements of the projection matrix: collecting them in a $$4 \times 3$$ matrix $$W$$, we have $$y = xW$$.

##### Word Encoding

In the paper, 1-of-V encoding is used. Let $$V$$ denote the vocabulary. Then a word can be represented by a vector of dimension $$|V|$$ in which exactly one coordinate is 1 and all other coordinates are 0.

Suppose that vocabulary $$V$$ contains only 3 words: A, B, and C. We can then use the following representation:

$$\begin{eqnarray} A & = (1, 0, 0) \\ B & = (0, 1, 0) \\ C & = (0, 0, 1) \end{eqnarray}$$
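
To connect the encoding with the projection matrix, here is a minimal NumPy sketch. It assumes the row-vector convention (the projection matrix has one row per word), and the embedding dimension of 2 and the random weights are hypothetical values chosen only for illustration. It shows that projecting a 1-of-V vector simply selects one row of the matrix.

```python
import numpy as np

vocab = ["A", "B", "C"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the 1-of-V vector for `word`."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Hypothetical projection matrix with |V| = 3 rows and d = 2 columns.
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), 2))

b = one_hot("B")
projected = b @ W  # identical to reading row 1 of W
```

Because of this, implementations usually store the projection matrix as a lookup table indexed by word id rather than performing the multiplication.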

#### Model Architecture and Training

Latent representation is a key concept in autoencoders as well. The idea is to train a neural network to predict its own input. By doing so, the network learns the hidden structure of the inputs. word2vec uses the same idea. The question is then what to predict. There are many choices, but two common ones are

• given a context, try to predict the missing word (CBOW)
• given a word, try to predict the context (Skip-gram)
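
The two prediction tasks can be made concrete by listing the training pairs each one produces. The sketch below is only illustrative: it assumes the context is a symmetric window of `window` tokens on each side, and the function names `cbow_pairs` and `skipgram_pairs` are hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs: predict the word from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs: predict each context word from the word."""
    return [
        (target, context_word)
        for context, target in cbow_pairs(tokens, window)
        for context_word in context
    ]

sentence = "the quick brown fox jumps".split()
cbow = cbow_pairs(sentence, window=1)
skipgram = skipgram_pairs(sentence, window=1)
```

With `window=1`, the CBOW pair for "quick" has context `["the", "brown"]`, while skip-gram splits it into the two pairs `("quick", "the")` and `("quick", "brown")`.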

Context generally means the surrounding words.

### GloVe Model: Global Vectors for Word Representation

As we saw in the previous section, word2vec is similar to an autoencoder: the target is the probability of seeing a word in a given context.

The idea behind GloVe is different. The latent representation of the words must satisfy some conditions that are specific to the data set. The question then becomes how to set the weights so that these conditions are met as closely as possible.

Let $$X_{ij}$$ denote the number of times word $$j$$ occurs in the context of word $$i$$. In the paper GloVe: Global Vectors for Word Representation, the authors argue that the word vectors $$w_i$$, the context vectors $$\tilde{w}_j$$, and the bias terms $$b_i$$ and $$\tilde{b}_j$$ should satisfy

$$w_{i}^{T}\tilde{w}_j + b_i + \tilde{b}_j = \log(X_{ij})$$

Translating this into an optimization problem, we minimize the weighted least-squares objective

$$J = \sum_{i,j=1}^{|V|} f(X_{ij})(w_{i}^{T}\tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}))^2$$

where the weighting function $$f$$ is

$$\begin{eqnarray} f(x) = \begin{cases} (x/x_{max})^{\alpha} & \textrm{if}\; x < x_{max} \\ 1 & \textrm{otherwise} \end{cases} \end{eqnarray}$$
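
As a rough illustration of how this objective could be evaluated, here is a NumPy sketch. The toy co-occurrence matrix, the embedding dimension, and the random parameters are assumptions made up for the example; the defaults `x_max=100` and `alpha=0.75` are the values reported in the GloVe paper. The sum is restricted to pairs with $$X_{ij} > 0$$, since $$f(0) = 0$$ and those terms contribute nothing.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) from the objective."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Evaluate J over all pairs (i, j) with a nonzero co-occurrence count."""
    i, j = np.nonzero(X)
    x = X[i, j]
    # Row-wise dot products w_i^T w~_j plus the two bias terms.
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return np.sum(weight(x) * (pred - np.log(x)) ** 2)

# Toy co-occurrence counts for a hypothetical 3-word vocabulary.
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])

rng = np.random.default_rng(0)
d = 2  # embedding dimension, chosen arbitrarily
W = rng.normal(size=(3, d))
W_tilde = rng.normal(size=(3, d))
b = np.zeros(3)
b_tilde = np.zeros(3)

loss = glove_loss(X, W, W_tilde, b, b_tilde)
```

Training then amounts to minimizing `loss` with respect to `W`, `W_tilde`, `b`, and `b_tilde` using any gradient-based optimizer.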

### Skip-Gram with Negative Sampling

This section is copied directly from the paper Neural Word Embedding as Implicit Matrix Factorization.

### Sequence Analysis

For many NLP models, the only input required is a corpus. A corpus is a collection of documents, which consist of sentences, which in turn consist of words. There is nothing special about words: they are just labels/tokens. NLP techniques do not require any concept of a language. To some extent, they are sequence analysis techniques. What really matters are

• labels/tokens
• sequences of labels/tokens
• collection of sequences
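
As a small illustration, the sketch below treats a collection of event logs exactly like a corpus of sentences. The event names and the helper functions `build_vocab` and `encode` are hypothetical; the point is only that nothing in the preprocessing depends on the tokens being words.

```python
def build_vocab(sequences):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for sequence in sequences:
        for token in sequence:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sequences, vocab):
    """Replace every token by its integer id."""
    return [[vocab[token] for token in sequence] for sequence in sequences]

# Hypothetical user-session logs: each session is a sequence of events.
logs = [
    ["login", "view", "logout"],
    ["login", "purchase", "view", "logout"],
]
vocab = build_vocab(logs)
encoded = encode(logs, vocab)
```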

This is one of the reasons why NLP has broad applications. Genes are sequences of DNA; logs are sequences of events. There is no fundamental difference between genes (or logs) and languages. In the next subsection, we will present a very interesting application of NLP: it can be used to learn a latent representation of vertices in a graph.

#### Deep Walk

Deep Walk is an interesting application of NLP techniques. The objective is to learn a latent representation of the nodes in a network. Intuitively, this latent representation should be closely related to the structure of the network, so the question is how to capture this information. The idea is ingenious: to capture the local structure around a node, we can generate random walks starting from that node. Each random walk is a sequence of visited nodes, and we can apply NLP techniques to these sequences.
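
The walk-generation step can be sketched in a few lines of Python. This is a simplified illustration, not the reference implementation: the graph is assumed to be an adjacency-list dictionary, walks are uniform over neighbors with a fixed length, and the toy graph and parameter values are made up.

```python
import random

def random_walk(graph, start, length, rng):
    """Walk `length` nodes from `start`, picking a uniformly random neighbor each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node; each walk plays the role of a sentence."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)  # visit start nodes in random order on each pass
        for node in nodes:
            walks.append(random_walk(graph, node, length, rng))
    return walks

# Toy undirected graph as adjacency lists (hypothetical example).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
walks = generate_walks(graph, walks_per_node=2, length=5)
```

The resulting `walks` list has the same shape as a tokenized corpus, so it can be passed to a word-embedding model such as skip-gram, with node ids playing the role of words.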