CS224N Assignment 1, Section 1

Runtime environment requirements

# All Import Statements Defined Here
# Note: Do not add to this list.
# All the dependencies you need can be installed by running .
# ----------------

import sys
assert sys.version_info[0] == 3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)
# ----------------

 

Question 1.1: Implement distinct_words [code] (2 points)

Write a method to work out the distinct words (word types) that occur in the corpus. You can do this with for loops, but it's more efficient to do it with Python list comprehensions; in particular, a list comprehension is a concise way to flatten a list of lists. If you're not familiar with Python list comprehensions in general, the official Python tutorial has more information.

You may find it useful to use Python sets to remove duplicate words.
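
For instance, a comprehension flattens a list of lists in one pass, and a set removes the duplicates (a small illustrative sketch, not part of the assignment scaffold):

nested = [["all", "that", "glitters"], ["all", "gold"]]
flat = [word for sublist in nested for word in sublist]  # ['all', 'that', 'glitters', 'all', 'gold']
unique = set(flat)                                       # {'all', 'that', 'glitters', 'gold'}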

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1

    # ------------------
    # Write your implementation here.
    # Flatten the corpus into one list of tokens, deduplicate with a set,
    # then sort for a deterministic ordering.
    raw_corpus_words = [word for document in corpus for word in document]
    corpus_words = sorted(set(raw_corpus_words))
    num_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, num_corpus_words
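
A quick sanity check on a toy corpus (the two test sentences here are hypothetical, chosen to mirror the example in the next question's docstring):

test_corpus = ["START All that glitters isn't gold END".split(" "),
               "START All's well that ends well END".split(" ")]
test_words, test_num = distinct_words(test_corpus)
print(test_words)  # the 10 distinct word types, in sorted order
print(test_num)    # 10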

Question 1.2: Implement compute_co_occurrence_matrix [code] (3 points)

Write a method that constructs a co-occurrence matrix for a certain window size n (with a default of 4), considering words n before and n after the word in the center of the window. Here, we start to use numpy (np) to represent vectors, matrices, and tensors. If you're not familiar with NumPy, there's a NumPy tutorial in the second half of this cs231n Python NumPy tutorial.

 

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurrence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    # Map each word to its row/column index, matching distinct_words' ordering.
    M = np.zeros((num_words, num_words), dtype=np.int32)
    word2Ind = {word: i for i, word in enumerate(words)}

    for sent in corpus:
        for p, center in enumerate(sent):
            ci = word2Ind[center]  # row index of the center word

            # Count words up to window_size positions before the center...
            for w in sent[max(0, p - window_size):p]:
                M[ci, word2Ind[w]] += 1

            # ...and up to window_size positions after it.
            for w in sent[p + 1:p + 1 + window_size]:
                M[ci, word2Ind[w]] += 1

    # ------------------

    return M, word2Ind
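
Reusing the toy corpus from Question 1.1 as a sanity check (expected counts worked out by hand; with window_size=1, "that" in the first document co-occurs once with "All" and once with "glitters"):

M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
print(M_test[word2Ind_test["that"], word2Ind_test["glitters"]])  # 1
print(M_test[word2Ind_test["that"], word2Ind_test["All"]])       # 1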

  

Question 1.3: Implement reduce_to_k_dim [code] (1 point)

Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings.

Note: All of numpy, scipy, and scikit-learn (sklearn) provide some implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use sklearn.decomposition.TruncatedSVD.

 

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.
    # fit_transform returns U * S directly, matching the docstring above.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    # ------------------

    print("Done.")
    return M_reduced
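
Continuing the toy example, the reduced matrix should have one k-dimensional row per word:

M_test_reduced = reduce_to_k_dim(M_test, k=2)
print(M_test_reduced.shape)  # (10, 2): one 2-dimensional embedding per word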

Question 1.4: Implement plot_embeddings [code] (1 point)

Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (plt).

For this example, you may find it useful to adapt existing scatter-plot code. In the future, a good way to make a plot is to look at the Matplotlib gallery, find a plot that looks somewhat like what you want, and adapt the code they give.

 

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.

        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensional word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    # Plot each requested word as a red 'x' and label the point with the word.
    for w in words:
        x = M_reduced[word2Ind[w]][0]
        y = M_reduced[word2Ind[w]][1]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, w)
    plt.show()
    # ------------------
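
A minimal sanity check with hand-made 2D "embeddings" (the points and labels here are made up purely for illustration):

M_fake = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
word2Ind_fake = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
plot_embeddings(M_fake, word2Ind_fake, ['a', 'b', 'c', 'd'])  # four labelled corner points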

Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)

Now we will put together all the parts you have written! We will compute the co-occurrence matrix with a fixed window of 4 over the Reuters "crude" corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U*S, so we normalize the returned vectors so that all the vectors will appear around the unit circle (therefore closeness is directional closeness). Note: The line of code below that does the normalizing uses the NumPy concept of broadcasting. If you don't know about broadcasting, check out Computation on Arrays: Broadcasting by Jake VanderPlas.
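
To see the broadcasting at work: dividing an (n, k) matrix by an (n, 1) column of row norms rescales each row to unit length, as in this small sketch:

X = np.array([[3.0, 4.0], [0.0, 2.0]])
lengths = np.linalg.norm(X, axis=1)       # shape (2,)
X_unit = X / lengths[:, np.newaxis]       # (2, 2) / (2, 1) broadcasts row-wise
print(np.linalg.norm(X_unit, axis=1))     # [1. 1.]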

Run the cell below to produce the plot. It'll probably take a few seconds to run. What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have? Note: "bpd" stands for "barrels per day" and is a commonly used abbreviation in crude oil topic articles.
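
The cell calls read_corpus, which the full notebook defines before this section. A minimal sketch of what it might look like, assuming the Reuters "crude" category and the START/END tokens from the imports above:

def read_corpus(category="crude"):
    """ Read files from the specified Reuters category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))]
            + [END_TOKEN] for f in files]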

 

# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)