Implementing the GloVe Model with PyTorch

Hello everyone! In today's post we will implement GloVe (Global Vectors for Word Representation), a great model for learning word embeddings, using PyTorch. Developed at Stanford, its main idea is to leverage the matrix of word co-occurrences to extract “dimensions of meaning”, producing word vectors that naturally capture word semantics and are therefore useful representations for many tasks, such as word analogies. You can find more information in the original paper here.

For this project, we will use the text8 dataset (which can be downloaded here). This dataset is a dump of cleaned Wikipedia text. More details here.

First, we just import the necessary libs.

from collections import Counter, defaultdict

import numpy as np

import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from sklearn.manifold import TSNE

Next we will create a dataset class to manage feature extraction and batch generation. To feed the GloVe model we must build a co-occurrence matrix over all words. We will generate a vocabulary of ~190k words. A dense matrix would have 190,000² entries, which is ~36 billion. At 32 bits (4 bytes) per value, that is ~144 GB (about 135 GiB) of memory, far too much to store and process. To handle this we can leverage the fact that most entries of this matrix are 0, so we only need to store the non-zero values, which drastically reduces the memory required.
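To make the arithmetic above concrete, here is a small back-of-the-envelope check. It is only a sketch: the vocabulary size is the rounded ~190k figure from the text, and the variable names are illustrative.

```python
# Rough memory estimate for a dense co-occurrence matrix.
vocab_size = 190_000     # rounded vocabulary size quoted in the text
bytes_per_entry = 4      # one 32-bit value per matrix cell

dense_entries = vocab_size ** 2
dense_bytes = dense_entries * bytes_per_entry
dense_gb = dense_bytes / 1e9       # decimal gigabytes
dense_gib = dense_bytes / 2**30    # binary gibibytes

print(f"entries: {dense_entries:,}")
print(f"size: {dense_gb:.1f} GB ({dense_gib:.1f} GiB)")
```

The sparse representation used below avoids this cost because it only stores the (i, j, value) triples for pairs that actually co-occur.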

class GloveDataset:
    def __init__(self, text, n_words=200000, window_size=5):
        self._window_size = window_size
        self._tokens = text.split(" ")[:n_words]

        # Count word frequencies; ids are assigned from most to least frequent
        word_counter = Counter()
        word_counter.update(self._tokens)
        self._word2id = {w:i for i, (w,_) in enumerate(word_counter.most_common())}
        self._id2word = {i:w for w, i in self._word2id.items()}
        self._vocab_len = len(self._word2id)

        self._id_tokens = [self._word2id[w] for w in self._tokens]

        self._create_coocurrence_matrix()

        print("# of words: {}".format(len(self._tokens)))
        print("Vocabulary length: {}".format(self._vocab_len))

    def _create_coocurrence_matrix(self):
        # Sparse storage: only pairs that actually co-occur get an entry
        cooc_mat = defaultdict(Counter)

        for i, w in enumerate(self._id_tokens):
            start_i = max(i - self._window_size, 0)
            end_i = min(i + self._window_size + 1, len(self._id_tokens))

            for j in range(start_i, end_i):
                if i != j:
                    c = self._id_tokens[j]
                    # Closer context words contribute more (1 / distance)
                    cooc_mat[w][c] += 1 / abs(j-i)

        self._i_idx = list()
        self._j_idx = list()
        self._xij = list()

        #Create indexes and x values tensors
        for w, cnt in cooc_mat.items():
            for c, v in cnt.items():
                self._i_idx.append(w)
                self._j_idx.append(c)
                self._xij.append(v)

        self._i_idx = torch.LongTensor(self._i_idx).cuda()
        self._j_idx = torch.LongTensor(self._j_idx).cuda()
        self._xij = torch.FloatTensor(self._xij).cuda()

    def get_batches(self, batch_size):
        #Generate random idx
        rand_ids = torch.LongTensor(np.random.choice(len(self._xij), len(self._xij), replace=False))

        for p in range(0, len(rand_ids), batch_size):
            batch_ids = rand_ids[p:p+batch_size]
            yield self._xij[batch_ids], self._i_idx[batch_ids], self._j_idx[batch_ids]

dataset = GloveDataset(open("text8").read(), 10000000)
# of words: 10000000 
Vocabulary length: 189075
Wall time: 2min 8s
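To see what the 1/distance weighting in `_create_coocurrence_matrix` produces, the same loop can be run on a toy sentence. This is a standalone sketch in plain Python: `build_cooccurrence` and the example sentence are hypothetical names introduced for illustration, not part of the class above.

```python
from collections import Counter, defaultdict

def build_cooccurrence(token_ids, window_size):
    """Sparse co-occurrence counts, weighted by 1 / distance between words."""
    cooc = defaultdict(Counter)
    for i, w in enumerate(token_ids):
        start = max(i - window_size, 0)
        end = min(i + window_size + 1, len(token_ids))
        for j in range(start, end):
            if i != j:
                cooc[w][token_ids[j]] += 1 / abs(j - i)
    return cooc

tokens = "the cat sat on the mat".split()
# Ids are assigned by frequency, as in the dataset class
word2id = {w: i for i, (w, _) in enumerate(Counter(tokens).most_common())}
ids = [word2id[t] for t in tokens]

cooc = build_cooccurrence(ids, window_size=2)
n_nonzero = sum(len(row) for row in cooc.values())

print("vocab size:", len(word2id))
print("non-zero entries:", n_nonzero, "of", len(word2id) ** 2)
print("X[the][sat] =", cooc[word2id["the"]][word2id["sat"]])
```

Here "sat" is two positions away from both occurrences of "the", so it receives 0.5 twice. On a real corpus the fraction of non-zero entries is far smaller than in this tiny example, which is why the sparse layout pays off.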

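Here we create the class for our GloVe model. The loss function described in the original paper is:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

The forward pass computes the model term $w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j$ (the part highlighted in yellow in the original figure).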

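EMBED_DIM = 300  # assumed value; the embedding size is not shown in the post

class GloveModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(GloveModel, self).__init__()
        # Separate vectors and biases for focal (i) and context (j) words
        self.wi = nn.Embedding(num_embeddings, embedding_dim)
        self.wj = nn.Embedding(num_embeddings, embedding_dim)
        self.bi = nn.Embedding(num_embeddings, 1)
        self.bj = nn.Embedding(num_embeddings, 1)

        # Init (assumed ranges): uniform vectors in [-1, 1], zero biases
        self.wi.weight.data.uniform_(-1, 1)
        self.wj.weight.data.uniform_(-1, 1)
        self.bi.weight.data.zero_()
        self.bj.weight.data.zero_()

    def forward(self, i_indices, j_indices):
        w_i = self.wi(i_indices)
        w_j = self.wj(j_indices)
        b_i = self.bi(i_indices).squeeze(1)
        b_j = self.bj(j_indices).squeeze(1)
        # Dot product plus both biases, one value per (i, j) pair
        x = torch.sum(w_i * w_j, dim=1) + b_i + b_j
        return x

glove = GloveModel(dataset._vocab_len, EMBED_DIM)
glove.cuda()  # the dataset tensors live on the GPU, so the model must too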
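For a single word pair, the forward pass reduces to a dot product plus two biases. The plain-Python sketch below works one pair by hand; `glove_score` and the numbers are hypothetical and only illustrate what the batched PyTorch code computes for each row.

```python
def glove_score(w_i, w_j, b_i, b_j):
    """Model term for one (i, j) pair: dot(w_i, w_j) + b_i + b_j."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return dot + b_i + b_j

# Toy 2-dimensional vectors and scalar biases (made-up values)
w_i = [1.0, 2.0]
w_j = [0.5, -1.0]
score = glove_score(w_i, w_j, b_i=0.1, b_j=0.2)
print(round(score, 4))
```

During training this score is pushed toward log(X_ij), the log of the co-occurrence value for that pair.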

We must define a function to compute the weighting term f(Xij) of the loss function as described in the paper, where f(x) = (x / x_max)^alpha if x < x_max and 1 otherwise:

def weight_func(x, x_max, alpha):
    wx = (x/x_max)**alpha
    wx = torch.min(wx, torch.ones_like(wx))
    return wx.cuda()  
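As a quick sanity check of the weighting term, here is a scalar version in plain Python. It is a sketch for intuition only: `glove_weight` is a hypothetical helper, and the default values match the X_MAX and ALPHA settings used later in this post.

```python
def glove_weight(x, x_max=100, alpha=0.75):
    """f(x) = (x / x_max) ** alpha, capped at 1 for x >= x_max."""
    return min((x / x_max) ** alpha, 1.0)

for x in (1, 10, 100, 1000):
    print(x, round(glove_weight(x), 4))
```

Rare pairs get a small weight, while every pair at or above x_max gets the same weight of 1, so very frequent pairs cannot dominate the loss.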

The loss function described in the GloVe paper is a weighted mean squared error. PyTorch 1.0 does not ship an implementation of it, so we must write it ourselves. A good practice is to reuse any piece of this function that is already implemented, so we benefit from any optimizations it may have:

def wmse_loss(weights, inputs, targets):
    loss = weights * F.mse_loss(inputs, targets, reduction='none')
    return torch.mean(loss).cuda()
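The same computation can be worked by hand on two pairs. This plain-Python sketch uses a hypothetical helper `weighted_mse` and made-up numbers. Note that, like `wmse_loss`, it averages over the batch, whereas the paper writes the objective as a sum; the two differ only by a constant factor for a fixed batch size.

```python
def weighted_mse(weights, predictions, targets):
    """Mean of weight * (prediction - target) ** 2 over all pairs."""
    terms = [w * (p - t) ** 2 for w, p, t in zip(weights, predictions, targets)]
    return sum(terms) / len(terms)

weights = [1.0, 0.5]        # f(X_ij) for each pair
predictions = [0.0, 2.0]    # model outputs
targets = [1.0, 0.0]        # log(X_ij) for each pair

loss = weighted_mse(weights, predictions, targets)
print(loss)
```

The first pair contributes 1.0 * 1 and the second 0.5 * 4, so the larger error on the second pair is partly discounted by its lower weight.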

Although our configuration differs from the original paper's (for example, the dataset), we will use the same optimizer and learning rate it describes.

optimizer = optim.Adagrad(glove.parameters(), lr=0.05)
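For readers unfamiliar with Adagrad, the sketch below shows a simplified scalar version of its update rule: each parameter's step is scaled down by the square root of its accumulated squared gradients. This is an illustration only. `adagrad_step` and the toy objective are hypothetical, and PyTorch's implementation has extra options (such as learning-rate decay and weight decay) that are ignored here.

```python
import math

def adagrad_step(theta, grad, grad_sq_sum, lr=0.05, eps=1e-10):
    """One simplified Adagrad update for a single scalar parameter."""
    grad_sq_sum += grad ** 2
    theta -= lr * grad / (math.sqrt(grad_sq_sum) + eps)
    return theta, grad_sq_sum

theta, acc = 1.0, 0.0
history = []
for step in range(3):
    grad = 2.0 * theta          # gradient of the toy objective theta ** 2
    theta, acc = adagrad_step(theta, grad, acc)
    history.append(theta)
    print(step, round(theta, 4))
```

Because the accumulator only grows, the effective step size shrinks over time, and parameters that receive gradients rarely (rare words, in our case) keep comparatively larger steps.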

Now we can write our training loop. The ALPHA and X_MAX parameters are set according to the paper. We also save the model state at the end of every epoch.

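N_EPOCHS = 100
BATCH_SIZE = 2048  # assumed value; the batch size is not shown in the post
X_MAX = 100
ALPHA = 0.75
CHECKPOINT_PATH = "glove_text8.pt"  # hypothetical filename

n_batches = int(len(dataset._xij) / BATCH_SIZE)

loss_values = list()

for e in range(1, N_EPOCHS+1):

    batch_i = 0
    for x_ij, i_idx, j_idx in dataset.get_batches(BATCH_SIZE):
        batch_i += 1

        optimizer.zero_grad()  # clear gradients from the previous step
        outputs = glove(i_idx, j_idx)
        weights_x = weight_func(x_ij, X_MAX, ALPHA)
        loss = wmse_loss(weights_x, outputs, torch.log(x_ij))
        loss.backward()
        optimizer.step()
        loss_values.append(loss.item())

        if batch_i % 100 == 0:
            print("Epoch: {}/{} \t Batch: {}/{} \t Loss: {}".format(e, N_EPOCHS, batch_i, n_batches, np.mean(loss_values[-20:])))
    print("Saving model...")
    torch.save(glove.state_dict(), CHECKPOINT_PATH)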
Epoch: 1/100 Batch: 100/10726 Loss: 1.1235822647809983
Epoch: 1/100 Batch: 200/10726 Loss: 1.0464201807975768
Epoch: 1/100 Batch: 300/10726 Loss: 1.0292260229587555
Epoch: 1/100 Batch: 400/10726 Loss: 0.9683106660842895
Epoch: 1/100 Batch: 500/10726 Loss: 0.9407412618398666
Epoch: 1/100 Batch: 600/10726 Loss: 0.9253258764743805
Epoch: 1/100 Batch: 700/10726 Loss: 0.922967490553855
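The batch counter in the log comes from `get_batches`, which shuffles all non-zero entries once per epoch and slices them into chunks. The plain-Python sketch below mirrors that logic on toy sizes (the helper `iter_batches` and the sizes are hypothetical). It also shows that `int(n / batch_size)` rounds down, so the generator can yield one more, smaller batch than the total shown in the log.

```python
import random

def iter_batches(n_items, batch_size, seed=0):
    """Yield shuffled index batches; the last one may be smaller."""
    ids = list(range(n_items))
    random.Random(seed).shuffle(ids)
    for p in range(0, n_items, batch_size):
        yield ids[p:p + batch_size]

n_items, batch_size = 10, 4
batches = list(iter_batches(n_items, batch_size))
reported = int(n_items / batch_size)   # same formula as n_batches

print("batch sizes:", [len(b) for b in batches])
print("batches yielded:", len(batches), "reported total:", reported)
```

Every index appears exactly once per epoch, so each non-zero co-occurrence entry is visited once regardless of the batch size.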

Here we sum the two embedding matrices (as recommended by the original paper) to improve results. We then plot a t-SNE projection of the 300 most frequent words to validate our word embeddings.

emb_i = glove.wi.weight.cpu().data.numpy()
emb_j = glove.wj.weight.cpu().data.numpy()

emb = emb_i + emb_j

top_k = 300

tsne = TSNE(metric='cosine', random_state=123)
embed_tsne = tsne.fit_transform(emb[:top_k, :])

fig, ax = plt.subplots(figsize=(14, 14))
for idx in range(top_k):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(dataset._id2word[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)

Checking which words end up close together, we can say that our model performs pretty well! It clusters the direction words north, south, west, east and even central. It also places words next to their plural forms, like system/systems and language/languages.
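Besides eyeballing the plot, closeness can be checked numerically with cosine similarity. The sketch below uses plain Python and made-up 3-dimensional vectors; `cosine_similarity`, `nearest_neighbors` and `toy_vectors` are hypothetical names, and the values are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, k=1):
    """Return the k words whose vectors are most similar to `word`."""
    query = vectors[word]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Made-up vectors standing in for rows of the summed embedding matrix
toy_vectors = {
    "north": [0.9, 0.1, 0.0],
    "south": [0.8, 0.2, 0.1],
    "system": [0.0, 0.9, 0.4],
    "systems": [0.1, 0.8, 0.5],
}

print(nearest_neighbors("north", toy_vectors))
print(nearest_neighbors("system", toy_vectors))
```

With the trained model, a similar lookup could be built from the rows of `emb` and the `dataset._id2word` mapping, assuming both are still in memory.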

And that's it. I hope you enjoyed this implementation, and if you have any questions or comments please leave them below. I will be happy to answer!
