# 👋 Welcome!

Welcome to the first exercise of the **Deep Learning for Natural Language Processing (DLNLP)** EPFL course.

Please work through it until the exercise session on September 27th, where we will discuss a possible solution and our findings.

This notebook is divided into two parts:
* **Practical**: where we will see how to load, display, manipulate, and evaluate pre-trained word embeddings
* **Exercise**: where you will be asked to prepare a dataset in order to train new word embeddings. Then, within a classification task, you will be asked to write a procedure that compares these new word embeddings with the pre-trained ones. The same comparison will have to be done with randomly initialized word embeddings

# Python3 environment requirements

This notebook was tested with Python 3.10.13 and the following library versions:

* vecto==0.2.21
* numpy==1.26.0
* chainer==7.8.1
* matplotlib==3.8.0
* scikit-learn==1.3.0

In [None]:
import os
import random
import numpy as np

def seed(
    seed = 1810
):
    random.seed(seed)
    np.random.seed(seed)

# set seed for reproducibility
SEED = 1810
seed(SEED)

# Introduction

[_vecto_](http://vecto.space/) is a Python library for conveniently working with vector space models. In contrast to [_gensim_](https://radimrehurek.com/gensim/), which is all about speed, _vecto_ is designed for scientific purposes: It is an attempt at establishing a standardized word embedding evaluation benchmark.

[_vecto_](http://vecto.space/) makes it easy to work with pretrained word embeddings in different formats. Some models can be downloaded from [here](https://vecto.readthedocs.io/en/docs/tutorial/getting_vectors.html#pre-trained-vsms).

In this notebook we will consider the 25-dimensional SkipGram model trained on Wikipedia (i.e., _word_linear_sg_25d_). Make sure to download, extract, and save it in `./data/word_linear_sg_25d`.

First, let's load the pre-trained SkipGram vector embeddings from the downloaded file and normalize them for faster similarity computations.

In [None]:
# import the load from directory function from the vecto library
from vecto.embeddings import load_from_dir

# load the pre-trained word embeddings
skipgram_model_wikipedia = load_from_dir('./data/word_linear_sg_25d')

# normalize the word embeddings
skipgram_model_wikipedia.normalize()

We can inspect the embedding for a particular word:

In [None]:
# retrieve the 'dog' word embedding
skipgram_model_wikipedia.get_vector('dog')

Let's also do a sanity check on the norm of the vector:

In [None]:
# compute the norm of the 'dog' word embedding
dog_word_embedding_norm = np.linalg.norm(skipgram_model_wikipedia.get_vector('dog'))

# assert that the norm is equal to 1
assert np.isclose(dog_word_embedding_norm, 1.0)

We can also look for the most 'similar' word by formulating the following equation:

$$most\_similar(u) = \arg \max\limits_{w \in V} \frac{u \cdot w}{\lVert u \rVert \lVert w \rVert}$$

which can be interpreted as follows: for each word $w$ in the vocabulary $V$, we compute the cosine similarity with the input word $u$. We then return the argument that maximizes ($argmax$) the cosine similarity.

**Nota bene**: $cosine\_distance = 1 - cosine\_similarity$
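
To make the formula concrete, here is a minimal sketch on a toy vocabulary. The three vectors and the helper names `cosine_similarity` and `toy_most_similar` are made up for illustration and are not part of _vecto_ or of the SkipGram model.

```python
import numpy as np

# Toy 3-dimensional embeddings (made-up values, not taken from the SkipGram model)
toy_vocab = ["dog", "cat", "car"]
toy_matrix = np.array([
    [0.9, 0.1, 0.0],  # dog
    [0.8, 0.3, 0.1],  # cat
    [0.0, 0.2, 0.9],  # car
])

def cosine_similarity(u, w):
    # dot product divided by the product of the Euclidean norms
    return float(np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w)))

def toy_most_similar(u):
    # cosine similarity between u and every row of the toy vocabulary matrix
    sims = toy_matrix @ u / (np.linalg.norm(toy_matrix, axis=1) * np.linalg.norm(u))
    # argmax over the vocabulary, as in the formula above
    return toy_vocab[int(np.argmax(sims))], sims

query = toy_matrix[toy_vocab.index("dog")]
best_word, sims = toy_most_similar(query)
print(best_word, sims)

# cosine distance = 1 - cosine similarity
dog_cat_distance = 1.0 - cosine_similarity(query, toy_matrix[toy_vocab.index("cat")])
print(dog_cat_distance)
```

Because the query vector is itself a row of the matrix, the argmax returns the query word; real nearest-neighbour searches usually exclude it.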

In [None]:
# let's retrieve the most similar words to 'dog'
skipgram_model_wikipedia.get_most_similar_words('dog')

As expected, the word most similar to 'dog' is the word 'dog' itself, since the cosine distance between two identical vectors is $0$, which corresponds to a cosine similarity of $1$.

# Visualizing the vector space
By visualizing the embedding space, we can hopefully find out more about how relations between words are encoded in the vector space.
To that end, we reduce the space to two dimensions via [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) and plot the embeddings of some words in the 2D space.

In [None]:
# import libraries
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

#----------------------------------------------------------------------------------------------------

# perform PCA
def fit_pca(
    model
):
    
    # the pca_model will preserve only the first two principal components
    pca_model = PCA(
        n_components = 2
    )
    
    # perform PCA on the model's word embeddings matrix
    pca_model.fit(
        X = model.matrix
    )
    
    # return the pca_model
    return pca_model

#---------------------------------------------------------------------------------------------------

# plot words
def wordlist_2dplot(
    model,
    pca_model,
    word_list
):
    
    # convert the list of words to their respective word embeddings
    word_vecs = np.vstack([model.get_vector(w) for w in word_list])
    
    # project the word embeddings to the 2D subspace
    reduced_wordembs = pca_model.transform(word_vecs)
    
    # plot each projected word embedding
    fig, ax = plt.subplots()
    ax.scatter(reduced_wordembs[:, 0], reduced_wordembs[:, 1])
    for i, n in enumerate(word_list):
        ax.annotate(n, (reduced_wordembs[i, 0], reduced_wordembs[i, 1]))

In [None]:
# list of words
word_list = ["japan", "tokyo",
             "england", "london",
             "dog", "dogs",
             "mouse", "mice"]

# perform PCA
pca_model = fit_pca(
    model = skipgram_model_wikipedia
)

# plot the list of words
wordlist_2dplot(
    model = skipgram_model_wikipedia,
    pca_model = pca_model,
    word_list = word_list
)

# Word embedding algebra
The above visualization shows that some of the relations between words are encoded in their relative positions: The difference vector of mouse (singular) and mice (plural) is similar to that of dog (singular) and dogs (plural).
This has led to interesting phenomena where analogies can be detected via vector algebra in the embedding space:
$$mice - mouse + dog \approx dogs$$
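
As a worked illustration of this vector arithmetic, the sketch below uses hand-made 2D vectors in which the second coordinate plays the role of a "plural" direction. The vectors and the helper `toy_analogy` are assumptions made for this example only; real embeddings are far noisier.

```python
import numpy as np

# Made-up 2D embeddings: first axis ~ "which animal", second axis ~ "plural"
toy = {
    "mouse": np.array([1.0, 0.0]),
    "mice":  np.array([1.0, 1.0]),
    "cat":   np.array([2.0, 0.0]),
    "cats":  np.array([2.0, 1.0]),
    "dog":   np.array([3.0, 0.0]),
    "dogs":  np.array([3.0, 1.0]),
}

def toy_analogy(a, b, c):
    # target vector x' = b - a + c
    target = toy[b] - toy[a] + toy[c]
    best_word, best_sim = None, -np.inf
    for word, vec in toy.items():
        # the query words themselves are excluded from the search
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# mice - mouse + dog lands exactly on the toy vector for "dogs"
answer = toy_analogy("mouse", "mice", "dog")
print(answer)
```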

In [None]:
# perform 3CosAdd method and return top_k candidates
def threeCosAdd(
    a,
    b,
    c,
    model,
    top_k = 3
):
    """
    Solve the analogy question a : b :: c : x for x by computing x' = b - a + c and searching for the word x whose
    word embedding is closest to x' in the embedding space.
    Usually, the query words a, b, c are excluded from this search.
    """
    x = model.get_most_similar_words(model.get_vector(b) - model.get_vector(a) + model.get_vector(c))
    candidates = [(word, similarity) for word, similarity in x if word not in [a, b, c]] 
    return candidates[:top_k]

In [None]:
print(*threeCosAdd('tokyo', 'japan', 'london', skipgram_model_wikipedia), sep = "\n")

In [None]:
print(*threeCosAdd('king', 'male', 'queen', skipgram_model_wikipedia), sep = "\n")

In [None]:
print(*threeCosAdd('one', 'two', 'three', skipgram_model_wikipedia), sep = "\n")

In [None]:
print(*threeCosAdd('north', 'south', 'east', skipgram_model_wikipedia), sep = "\n")

In [None]:
print(*threeCosAdd('batman', 'hero', 'joker', skipgram_model_wikipedia), sep = "\n")

In [None]:
print(*threeCosAdd('switzerland', 'swiss', 'netherlands', skipgram_model_wikipedia), sep = "\n")

In [None]:
print(*threeCosAdd('dog', 'dogs', 'mouse', skipgram_model_wikipedia), sep = "\n")

In [None]:
print(*threeCosAdd('car', 'driver', 'horse', skipgram_model_wikipedia), sep = "\n")

In [None]:
print(*threeCosAdd('tom', 'cat', 'jerry', skipgram_model_wikipedia), sep = "\n")

As we can observe, this model is not perfect.

# Evaluating word embeddings
There are two general approaches to evaluating word embedding models: intrinsic and extrinsic tasks. For an overview, see [Bakarov, 2018](https://arxiv.org/abs/1801.09536).

Analogy tasks are intrinsic evaluation tasks because they require no resources apart from the trained word embeddings.

[_vecto_](http://vecto.space/) is already equipped with an analogy-based benchmarking procedure. For example, we can evaluate the 25-dimensional SkipGram model trained on Wikipedia that we have used so far on a sample of a dataset such as the [Bigger Analogy Test Set (BATS)](http://vecto.space/projects/BATS).

This corresponds to writing only a few lines of code:

In [None]:
# import the analogy benchmark
from vecto.benchmarks.analogy import Benchmark as AnalogyBenchmark

# import the Dataset class 
from vecto.data import Dataset

# load the dataset
bats_dataset = Dataset("./data/BATS_sample")

# define the analogy benchmark
analogy_benchmark = AnalogyBenchmark(method = "3CosAdd")

# run the analogy benchmark for the skipgram model on the bats dataset and collect the results
results = analogy_benchmark.run(skipgram_model_wikipedia, bats_dataset)

In [None]:
# print results
print("experiment setup (category) \t result (accuracy)")
print("-"*55)
for r in results:
    print(r['experiment_setup']['category'], '\t', r['result']['accuracy'])

The above analogy test uses the vector-offset based 3CosAdd method we implemented earlier. While this method is still popular, it is important to note that [there are other possible ways of solving the analogy task](https://arxiv.org/abs/1801.09536) and that [it is not clear if vector-offset based methods work at all](https://hackingsemantics.xyz/2019/analogies/).

Another type of evaluation is the word embedding models' correlation with human judgment when evaluating semantic similarity of words. For example, for 'cup' and 'mug' we expect relatively high cosine similarity, whereas for 'cup' and 'airplane' it should be lower.
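
A minimal sketch of such a comparison is shown below. It assumes Spearman's rank correlation as the agreement measure, which is a common choice but is not prescribed by the text above. The human ratings and the toy embeddings are illustrative values only, not taken from any real dataset or model.

```python
import numpy as np

# Hypothetical human similarity ratings on a 0-10 scale (illustrative values only)
human_ratings = {
    ("cup", "mug"): 9.0,
    ("cup", "plate"): 6.0,
    ("cup", "airplane"): 1.0,
}

# Made-up toy embeddings standing in for a real model
toy_emb = {
    "cup":      np.array([1.0, 0.2, 0.0]),
    "mug":      np.array([0.9, 0.3, 0.1]),
    "plate":    np.array([0.6, 0.7, 0.1]),
    "airplane": np.array([0.0, 0.1, 1.0]),
}

def cosine(u, w):
    return float(np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w)))

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the ranks
    # (this simple double-argsort ranking assumes there are no ties)
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

pairs = list(human_ratings)
model_scores = [cosine(toy_emb[a], toy_emb[b]) for a, b in pairs]
human_scores = [human_ratings[p] for p in pairs]

# rho close to 1 means the model ranks the pairs in the same order as the annotators
rho = spearman(model_scores, human_scores)
print(rho)
```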

## Exercise: Extrinsic Evaluation

For extrinsic evaluation, the word embeddings are used as a part of a larger neural network model for some downstream task such as text classification.

In this exercise, we would like you to use aggregated word embeddings as features to a linear classifier.

You will explore different ways of pre-training the word embeddings as well as aggregating them.

First, you will prepare a movie review sentiment classification dataset, which is then used to train a word embedding model.

Given trained word embeddings, you will combine the embeddings of all words in a review to yield a representation of the review.

Finally, you will use these as features to a linear classifier.

### Part 1: Preparing the dataset

The attached `./data/reviews.txt` contains 25,000 movie reviews, one per line, and the accompanying `./data/labels.txt` contains their sentiment label (positive or negative) in the same order.

Load the dataset and transform each review into a list of words.

In this exercise, you don't need to worry about tokenization and preprocessing yet - just splitting the review text by whitespace is okay.

Split the data into training set, development set, and test set.

In [None]:
# dataset paths
reviews_path = "./data/reviews.txt"
labels_path = "./data/labels.txt"

#----------------------------------------------------------------------------------------------------

# [TODO]: write code here

#----------------------------------------------------------------------------------------------------
# [SOLUTION]: remove before actual release

# import libraries
from sklearn.model_selection import train_test_split

# define test and development split size
TEST_SIZE = 0.1
DEV_SIZE = 0.2

# read reviews file
with open(reviews_path) as f:
    reviews = f.readlines()

# read labels file
with open(labels_path) as f:
    labels = f.readlines()
    
# tokenize reviews
reviews = [review.strip().split(sep = ' ') for review in reviews]

# turn labels into booleans (True for positive, False for negative)
labels = [label.strip() == "positive" for label in labels]

train_reviews, test_reviews, train_labels, test_labels = train_test_split(reviews, labels, test_size = TEST_SIZE, shuffle = True)
train_reviews, dev_reviews, train_labels, dev_labels = train_test_split(train_reviews, train_labels, test_size = DEV_SIZE)

### Part 2: Training a word embedding model

Use [_vecto_](http://vecto.space/) to train a SkipGram word embedding model using the [arguments here](https://vecto.readthedocs.io/en/docs/tutorial/training_vectors.html#word2vec).

The documentation is not up to date, but you can check the [_train_word2vec.py_](https://github.com/vecto-ai/vecto/blob/docs/vecto/embeddings/train_word2vec.py) script for more details about the input arguments.

**Nota bene**: `path_corpus` and `path_out` are the only two arguments that have `required = True`. This means that you are required to specify at least those two things:
* `path_corpus`: path to the training corpus. Before calling the `train` function, the training data created in the previous part will have to be moved to disk. It will be your job to make sure that this requirement is met. Make sure you have one sentence per line.
* `path_out`: you will need to specify an output path where the trained model will be saved.

Make sure to use the same configuration as for the [25-dimensional Skipgram model available here](https://vecto.readthedocs.io/en/docs/tutorial/getting_vectors.html#pre-trained-vsms) we used before, as you will later compare against it. The only parameter that should not be copied from the configuration is `epoch` which should be set to 1. 

In [None]:
# pre-trained model configuration
print("pre-trained model configuration")
print("-"*35)
!cat ./data/word_linear_sg_25d/metadata.json

In [None]:
# import the train function
from vecto.embeddings.train_word2vec import train

#----------------------------------------------------------------------------------------------------

# the above train function requires some arguments as input
# you can create those arguments by writing something like
# args = Namespace(argument_1 = argument_1_value, 
#                  argument_2 = argument_2_value)
class Namespace:
    def __init__(
        self,
        **kwargs
    ):
        self.__dict__.update(kwargs)
        
#----------------------------------------------------------------------------------------------------

# [TODO]: write code here

# we provide you the default parameters of the train function as reported in the train_word2vec.py script
# you will need to change only a few parameters
args = Namespace(gpu = -1,
                 dimensions = 100,
                 context_type = 'linear',
                 context_representation = 'word',
                 window = 2,
                 batchsize = 1000,
                 epoch = 1, # [Nota bene]: keep epoch = 1
                 model = 'skipgram',
                 language = 'eng',
                 subword = 'none',
                 negative_size = 5,
                 min_gram = 1,
                 max_gram = 5,
                 out_type = 'ns',
                 path_vocab = '',
                 path_word2chars = '',
                 path_vocab_ngram_tokens = '',
                 path_corpus = None,
                 path_out = None,
                 test = False,
                 verbose = False)

# [Nota bene]: depending on the resources, this can take some time to run, so please be patient
# you will then need to call the train function as follows:
#train(args)

#----------------------------------------------------------------------------------------------------
# [SOLUTION]: remove before actual release

import os
import shutil

def to_file(
    reviews,
    path
):
    # write one review per line, with tokens separated by a single space
    with open(path, 'w') as out_f:
        for r in reviews:
            print(" ".join(r), file = out_f)

def remove_dir(
    dir_path
):
    shutil.rmtree(dir_path)
    return

def create_dir(
    dir_path
):
    if os.path.exists(dir_path):
        remove_dir(dir_path)
    os.makedirs(dir_path)
    return

args.dimensions = 25

dir_name = "{}_{}_{}_{}d_reviews".format(args.context_representation, args.context_type, args.model, args.dimensions)
create_dir(dir_name)

to_file(train_reviews, f"./{dir_name}/reviews.txt")

args = Namespace(gpu = -1,
                 dimensions = 25,
                 context_type = 'linear',
                 context_representation = 'word',
                 window = 2,
                 batchsize = 1000,
                 epoch = 1,
                 model = 'skipgram',
                 language = 'eng',
                 subword = 'none',
                 negative_size = 5,
                 min_gram = 1,
                 max_gram = 5,
                 out_type = 'ns',
                 path_vocab = '',
                 path_word2chars = '',
                 path_vocab_ngram_tokens = '',
                 path_corpus = f'./{dir_name}/',
                 path_out = f'./{dir_name}/',
                 test = False,
                 verbose = True)

train(args)

If everything went OK, you should now have a folder named `ep_001` inside the `path_out` you used. This is the path of the model you trained.

You can use the same procedure we used in the 'Introduction' section to load it and normalize the word embeddings.

In [None]:
# [TODO]: write code here

#----------------------------------------------------------------------------------------------------
# [SOLUTION]: remove before actual release

skipgram_model_reviews = load_from_dir(f'./{dir_name}/ep_001')
skipgram_model_reviews.normalize()

### Part 3: Word embedding aggregation models

Aggregated word embeddings often serve as a strong baseline for more sophisticated neural network models, which we will explore in future exercises. Before the latter became ubiquitous, a lot of research focused on the question of how best to combine the word embeddings of the words in a sentence to reflect the meaning of their composition [(e.g., in this paper)](https://www.aclweb.org/anthology/P08-1028/). In this exercise, you will explore three very simple variants.

Your task is to write three functions that each receive a vecto word embedding model and a list of texts (i.e., a list of lists of words) as input, and return a list of embeddings as output (representing each input text). 

* the first function aggregates the words in a text by summing their embeddings
* sometimes it helps to normalize the sum by the length of the text, i.e., by the number of words, which constitutes the second function
* the third function aggregates the word embeddings via element-wise multiplication.
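
The three operations can be illustrated on a tiny made-up example before you write the full functions. The matrix below stands in for the embeddings of a three-word text; its values are arbitrary and are not taken from any model.

```python
import numpy as np

# Made-up embeddings of a three-word text (one row per word, three dimensions)
word_vectors = np.array([
    [0.5, -0.5, 1.0],
    [0.5,  0.5, 0.0],
    [1.0,  2.0, 2.0],
])

# 1) sum of the word embeddings
summed = word_vectors.sum(axis=0)

# 2) sum normalized by the number of words, i.e. the mean
averaged = word_vectors.sum(axis=0) / word_vectors.shape[0]

# 3) element-wise product of the word embeddings
multiplied = word_vectors.prod(axis=0)

print(summed, averaged, multiplied)
```

Note that a single zero coordinate in any word sets that coordinate of the product to zero, whereas the sum and the mean are unaffected.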

In [None]:
# [TODO]: write code here

#----------------------------------------------------------------------------------------------------
# [SOLUTION]: remove before actual release

def aggregate_embeddings(
    model,
    texts,
    neutral_elem = 0,
    op = np.add,
    return_length = False
):
    dim = model.metadata["dimensions"]
    
    text_embeddings = []
    lengths = []
    for t in texts:
        text_representation = np.empty(dim)
        text_representation.fill(neutral_elem)
        cnt = 0
        
        for w in t:
            try:
                word_embedding = model.get_vector(w)
                text_representation = op(text_representation, word_embedding)
                cnt = cnt + 1
            except Exception:
                # skip words that are not in the model's vocabulary
                pass
    
        text_embeddings.append(text_representation)
        lengths.append(cnt)
    
    if return_length:
        return text_embeddings, np.array(lengths).astype(np.float32)
    return text_embeddings

# the three aggregation functions
add_embeddings = lambda model, texts: aggregate_embeddings(model, texts, neutral_elem = 0, op = np.add)
multiply_embeddings = lambda model, texts: aggregate_embeddings(model, texts, neutral_elem = 1., op = np.multiply)

def average_embeddings(model,texts):
    embeddings, lengths = aggregate_embeddings(model, texts, neutral_elem = 0, op = np.add, return_length = True)
    embeddings = embeddings / lengths.reshape(-1,1)
    return embeddings

#print(add_embeddings(skipgram_model_wikipedia, train_reviews)[:10])

#print(multiply_embeddings(skipgram_model_wikipedia, train_reviews)[:10])

#print(average_embeddings(skipgram_model_wikipedia, train_reviews)[:10]) 

### Part 4: Train and evaluate a linear classifier

Write a function that receives a word embedding model and an aggregation function and trains a logistic regression classifier on the movie review dataset based on the aggregated word embeddings.

Make sure that the word embeddings are not finetuned in the process.

You can use any library such as PyTorch or sklearn.

Use the development set to select hyperparameters on a suitable grid.

In [None]:
# [TODO]: write code here

#----------------------------------------------------------------------------------------------------
# [SOLUTION]: remove before actual release

from sklearn.linear_model import LogisticRegression
from itertools import product

def train_model(
    model,
    aggregation
):
    # turn reviews into embeddings
    train_embeddings = aggregation(model, train_reviews)
    dev_embeddings = aggregation(model, dev_reviews)
    
    final_model = None
    max_score = -np.inf
    # grid over the inverse regularization strength C
    C = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]
    for hyperparams in product(C):
        
        lr = LogisticRegression(C = hyperparams[0], max_iter = 100000)
        lr.fit(train_embeddings, train_labels)
        
        # keep the classifier with the best development-set accuracy
        val_score = lr.score(dev_embeddings, dev_labels)
        if val_score > max_score:
            max_score = val_score
            final_model = lr
    return final_model

### Part 5: Report and discuss results for different models

We would like to understand the impact of some of the model components.

First, we are interested in the effect of pretraining our word embeddings on the movie reviews vs. pretraining on Wikipedia (i.e., using the model from the beginning of the notebook).

Second, we are interested in the effect of the aggregation function (sum, mean, element-wise multiplication).

Report the results of all possible combinations in a 2x3 table and discuss your observations.

In [None]:
# [TODO]: write code here

#----------------------------------------------------------------------------------------------------
# [SOLUTION]: remove before actual release

for model, aggregation in product([("Wiki", skipgram_model_wikipedia), ("Reviews", skipgram_model_reviews)],
                                  [("add", add_embeddings), ("multiply", multiply_embeddings), ("average", average_embeddings)]):
    lr = train_model(model[1], aggregation[1])
    test_embeddings = aggregation[1](model[1], test_reviews)
    score = lr.score(test_embeddings, test_labels)
    print("WE model '{}' \t with aggregation '{}' \t achieves score \t {:.3f}".format(model[0], aggregation[0], score))

#### [SOLUTION]

| Word Embeddings / Aggregation | Add  | Multiply | Average |
|-------------------------------|------|----------|---------|
| Wikipedia                     | 70.7 | 50.0     | 71.6    |
| Reviews                       | 63.7 | 50.0     | 63.2    |

Main observations:
* 'multiply' is much worse than 'add' or 'average'
* word embeddings trained on reviews are not competitive with the pretrained model

Pretraining on your own data is not always worth it. Generally, you need a big corpus to effectively train word embeddings, otherwise you are better off using a model pretrained on a large, general corpus.

Why is 'multiply' so bad?
Many element-wise multiplications cause the values to vanish.
Renormalizing the embeddings before logistic regression could mitigate the problem.
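
The vanishing effect can be sketched numerically. The example below uses random unit vectors as a stand-in for normalized word embeddings and assumes a review length of 200 words; both are assumptions for illustration, not properties of the actual dataset or model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 25, 200  # 25 dimensions as in the notebook; 200 words is an assumed review length

# random unit vectors standing in for normalized word embeddings
vectors = rng.normal(size=(n_words, dim))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# every coordinate of a unit vector has magnitude below 1,
# so multiplying hundreds of them drives each coordinate towards 0
product = np.prod(vectors, axis=0)
total = np.sum(vectors, axis=0)

largest_product_entry = np.max(np.abs(product))
largest_sum_entry = np.max(np.abs(total))
print("largest |entry| of the product:", largest_product_entry)
print("largest |entry| of the sum:    ", largest_sum_entry)

# back-of-the-envelope check: a typical coordinate magnitude of 0.2, multiplied 200 times
print(0.2 ** 200)
```

With features this small, the logistic regression sees inputs that are practically all zeros, which is consistent with chance-level accuracy on a balanced dataset.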

In [None]:
#----------------------------------------------------------------------------------------------------
# [SOLUTION]: remove before actual release

print("Computing norms of embeddings...")
print("Add:", np.linalg.norm(np.array(add_embeddings(skipgram_model_wikipedia, train_reviews)), axis = 1).mean())
print("Multiply:", np.linalg.norm(np.array(multiply_embeddings(skipgram_model_wikipedia, train_reviews)), axis = 1).mean())
print("Average:", np.linalg.norm(np.array(average_embeddings(skipgram_model_wikipedia, train_reviews)), axis = 1).mean())