I split my Assignment 1 solutions into four parts, covering problems 1, 2, 3, and 4 respectively. This part contains the solution to problem 3.
3. word2vec (40 points + 5 bonus)
(a). (3 points) Assume you are given a predicted word vector $v_c$ corresponding to the center word $c$ for skip-gram, and word prediction is made with the softmax function found in word2vec models

$$\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{W} \exp(u_w^T v_c)} \tag{4}$$

where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \dots, W$) are the "output" word vectors for all words in the vocabulary. Assume cross entropy cost is applied to this prediction and word $o$ is the expected word (the $o$-th element of the one-hot label vector is one), derive the gradients with respect to $v_c$.
Hint: It will be helpful to use notation from question 2. For instance, letting $\hat{y}$ be the vector of softmax predictions for every word, $y$ as the expected word vector, and the loss function

$$J_{\text{softmax-CE}}(o, v_c, U) = CE(y, \hat{y}) \tag{5}$$

where $U = [u_1, u_2, \dots, u_W]$ is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.
Solution: Let the word-vector dimension be $n_{\text{dim}}$, and take all word vectors to be column vectors, so $v_c$ has shape $n_{\text{dim}} \times 1$ and $U$ has shape $n_{\text{dim}} \times W$. Also write $\theta = U^T v_c$, so that $\hat{y} = \text{softmax}(\theta)$. By the result of question 2(b) (writing $y$ for the one-hot label vector, as in the hint):

$$\frac{\partial J_{\text{softmax-CE}}}{\partial v_c} = \frac{\partial J_{\text{softmax-CE}}}{\partial \theta} \cdot \frac{\partial \theta}{\partial v_c} = U (\hat{y} - y)$$
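As a quick sanity check, this result can be verified numerically. Below is a minimal numpy sketch (my own illustration with made-up dimensions, not the assignment's starter code) comparing the analytic gradient $U(\hat{y} - y)$ with a centered finite-difference estimate:

```python
import numpy as np

def softmax_ce_loss(v_c, U, o):
    """J = -log softmax(U^T v_c)[o]; v_c is (n_dim,), U is (n_dim, W)."""
    theta = U.T @ v_c                      # (W,)
    theta = theta - theta.max()            # shift for numerical stability
    return -(theta[o] - np.log(np.exp(theta).sum()))

# random test case (hypothetical sizes)
rng = np.random.default_rng(0)
n_dim, W, o = 5, 10, 3
v_c = rng.standard_normal(n_dim)
U = rng.standard_normal((n_dim, W))

# analytic gradient: U (y_hat - y)
theta = U.T @ v_c
y_hat = np.exp(theta - theta.max())
y_hat /= y_hat.sum()
y = np.zeros(W)
y[o] = 1.0
grad_analytic = U @ (y_hat - y)

# centered finite-difference estimate
eps = 1e-6
grad_numeric = np.array([
    (softmax_ce_loss(v_c + eps * np.eye(n_dim)[i], U, o)
     - softmax_ce_loss(v_c - eps * np.eye(n_dim)[i], U, o)) / (2 * eps)
    for i in range(n_dim)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)
```

This is the same finite-difference idea that the gradient checker in the coding part is built on.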
(b). (3 points) As in the previous part, derive gradients for the "output" word vectors $u_w$ (including $u_o$).
Solution: As in (a), let $\theta = U^T v_c$. Then:

$$\frac{\partial J_{\text{softmax-CE}}}{\partial U_{ij}} = \sum_k \frac{\partial J_{\text{softmax-CE}}}{\partial \theta_k} \frac{\partial \theta_k}{\partial U_{ij}} = \sum_k (\hat{y} - y)\big|_k \frac{\partial \theta_k}{\partial U_{ij}}$$

where $\theta_k = u_k^T v_c$, so

$$\frac{\partial \theta_k}{\partial U_{ij}} = \begin{cases} v_i & j = k \\ 0 & j \neq k \end{cases}$$

with $v_i$ denoting the $i$-th element of $v_c$. Then:

$$\sum_k (\hat{y} - y)\big|_k \frac{\partial \theta_k}{\partial U_{ij}} = (\hat{y} - y)\big|_j \, v_i = \big[ v_c (\hat{y} - y)^T \big]\big|_{i,j}$$

Therefore:

$$\frac{\partial J_{\text{softmax-CE}}}{\partial U} = v_c (\hat{y} - y)^T$$
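Running in the same session as the sketch from (a) (it reuses softmax_ce_loss, v_c, U, o, y_hat, y, and eps defined there), the matrix gradient $v_c(\hat{y}-y)^T$ can be checked entry by entry:

```python
# check dJ/dU = v_c (y_hat - y)^T, shape (n_dim, W)
gradU_analytic = np.outer(v_c, y_hat - y)

gradU_numeric = np.zeros_like(U)
for i in range(n_dim):
    for j in range(W):
        E = np.zeros_like(U)
        E[i, j] = eps                       # perturb one entry at a time
        gradU_numeric[i, j] = (softmax_ce_loss(v_c, U + E, o)
                               - softmax_ce_loss(v_c, U - E, o)) / (2 * eps)
assert np.allclose(gradU_analytic, gradU_numeric, atol=1e-5)
```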
(c). (6 points) Repeat parts (a) and (b) assuming we are using the negative sampling loss for the predicted vector $v_c$, and the expected output word is $o$. Assume that $K$ negative samples (words) are drawn, and they are $1, \dots, K$, respectively, for simplicity of notation ($o \notin \{1, \dots, K\}$). Again, for a given word $o$, denote its output vector as $u_o$. The negative sampling loss function in this case is

$$J_{\text{neg-sample}}(o, v_c, U) = -\log(\sigma(u_o^T v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^T v_c)) \tag{6}$$

where $\sigma(\cdot)$ is the sigmoid function.

After you've done this, describe with one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e. the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).

Note: the cost function here is the negative of what Mikolov et al. had in their original paper, because we are doing a minimization instead of maximization in our code.
Solution: Let $S$ be the set of the $K$ sampled indices.

$$\frac{\partial J_{\text{neg-sample}}}{\partial v_c} = -\frac{\partial \log \sigma(u_o^T v_c)}{\partial v_c} - \sum_{i \in S} \frac{\partial \log \sigma(-u_i^T v_c)}{\partial v_c} = \big[\sigma(u_o^T v_c) - 1\big] u_o - \sum_{i \in S} \big[\sigma(-u_i^T v_c) - 1\big] u_i$$

$$\frac{\partial J_{\text{neg-sample}}}{\partial u_w} = -\frac{\partial \log \sigma(u_o^T v_c)}{\partial u_w} - \sum_{i \in S} \frac{\partial \log \sigma(-u_i^T v_c)}{\partial u_w} = \begin{cases} \big[\sigma(u_o^T v_c) - 1\big] v_c & w = o \\ \big[1 - \sigma(-u_w^T v_c)\big] v_c & w \in S \\ 0 & w \neq o \text{ and } w \notin S \end{cases}$$

Equation (6) is much faster to compute than (5) because the softmax-CE loss needs dot products with all $W$ output vectors in the vocabulary, whereas negative sampling only touches the $K + 1$ vectors that were actually sampled:

$$\frac{\text{runtime of softmax-CE}}{\text{runtime of negative sampling loss}} = \frac{O(W)}{O(K)}$$

(I am not sure whether this is stated precisely; corrections from experts are welcome.)
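Both gradients translate almost line for line into numpy. Here is a minimal sketch under my own conventions (output vectors stored as the columns of U, the sampled indices $S$ passed as a list neg_idx); it is not the assignment's reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_grads(v_c, U, o, neg_idx):
    """Gradients of J_neg-sample with respect to v_c and the output vectors.

    v_c: (n_dim,); U: (n_dim, W) with output vectors as columns;
    o: index of the expected word; neg_idx: the K sampled indices S.
    """
    s_o = sigmoid(U[:, o] @ v_c)
    grad_v = (s_o - 1.0) * U[:, o]           # [sigma(u_o^T v_c) - 1] u_o
    grad_U = np.zeros_like(U)
    grad_U[:, o] = (s_o - 1.0) * v_c
    for i in neg_idx:
        s_i = sigmoid(-U[:, i] @ v_c)
        grad_v += (1.0 - s_i) * U[:, i]      # = -[sigma(-u_i^T v_c) - 1] u_i
        grad_U[:, i] += (1.0 - s_i) * v_c
    return grad_v, grad_U
```

Accumulating into grad_U with += also handles the case where the same negative index is drawn more than once.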
(d). (8 points) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words $[\text{word}_{c-m}, \dots, \text{word}_{c-1}, \text{word}_c, \text{word}_{c+1}, \dots, \text{word}_{c+m}]$, where $m$ is the context size. Denote the "input" and "output" word vectors for $\text{word}_k$ as $v_k$ and $u_k$ respectively.

Hint: feel free to use $F(o, v_c)$ (where $o$ is the expected word) as a placeholder for the $J_{\text{softmax-CE}}(o, v_c, \dots)$ or $J_{\text{neg-sample}}(o, v_c, \dots)$ cost functions in this part; you'll see that this is a useful abstraction for the coding part. That is, your solution may contain terms of the form $\frac{\partial F(o, v_c)}{\partial \dots}$.

Recall that for skip-gram, the cost for a context centered around $c$ is

$$J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m}) = \sum_{-m \le j \le m, j \neq 0} F(w_{c+j}, v_c) \tag{7}$$

where $w_{c+j}$ refers to the word at the $j$-th index from the center.

CBOW is slightly different. Instead of using $v_c$ as the predicted vector, we use $\hat{v}$ defined below. For (a simpler variant of) CBOW, we sum up the input word vectors in the context

$$\hat{v} = \sum_{-m \le j \le m, j \neq 0} v_{c+j} \tag{8}$$

then the CBOW cost is

$$J_{\text{CBOW}}(\text{word}_{c-m \dots c+m}) = F(w_c, \hat{v}) \tag{9}$$

Note: To be consistent with the $\hat{v}$ notation such as for the code portion, for skip-gram $\hat{v} = v_c$.
Solution: Let $v_k$ and $u_k$ be the input and output vectors for word $k$, respectively.

For skip-gram:

$$\frac{\partial J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m})}{\partial v_k} = \sum_{-m \le j \le m, j \neq 0} \frac{\partial F(w_{c+j}, v_c)}{\partial v_k}$$

$$\frac{\partial J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m})}{\partial u_k} = \sum_{-m \le j \le m, j \neq 0} \frac{\partial F(w_{c+j}, v_c)}{\partial u_k}$$

where $w_{c+j}$ is the one-hot vector of the word at offset $j$ from the center. (Note that each term $\frac{\partial F(w_{c+j}, v_c)}{\partial v_k}$ is zero unless $k = c$, since $F$ depends only on the input vector $v_c$.)

For CBOW:

$$\frac{\partial J_{\text{CBOW}}(\text{word}_{c-m \dots c+m})}{\partial v_k} = \frac{\partial F(w_c, \hat{v})}{\partial v_k} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} \cdot \frac{\partial \hat{v}}{\partial v_k} = \begin{cases} \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} & k \in \{c-m, \dots, c-1, c+1, \dots, c+m\} \\ 0 & \text{otherwise} \end{cases}$$

$$\frac{\partial J_{\text{CBOW}}(\text{word}_{c-m \dots c+m})}{\partial u_k} = \frac{\partial F(w_c, \hat{v})}{\partial u_k}$$

P.S.: This answer feels surprisingly simple; why is it worth 8 points?
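To make the $F$ abstraction concrete, here is how the skip-gram case might be accumulated in code. The signature of F (returning the cost plus gradients with respect to $\hat{v}$ and $U$) is an assumption of mine that mirrors parts (a) to (c); this is a sketch, not the actual starter code:

```python
import numpy as np

def skipgram_cost_and_grads(center, context_idx, V, U, F):
    """Sum F(w_{c+j}, v_c) over the context window, as in equation (7).

    V: (W, n_dim) input vectors as rows; U: (n_dim, W) output vectors.
    F(target, v_hat, U) -> (cost, grad_v_hat, grad_U) is an assumed
    signature standing in for softmax-CE or negative sampling.
    """
    v_hat = V[center]                        # for skip-gram, v_hat = v_c
    cost = 0.0
    grad_V = np.zeros_like(V)
    grad_U = np.zeros_like(U)
    for t in context_idx:                    # indices c+j, j != 0
        c, gv, gU = F(t, v_hat, U)
        cost += c
        grad_V[center] += gv                 # all of dJ/dv lands on v_c
        grad_U += gU
    return cost, grad_V, grad_U
```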
(e)(f)(g)(h): see the code; omitted here.
Attached is a plot from my training run, i.e., the figure produced after running q3_run.py; there is a thread on reddit discussing how to judge whether this plot looks reasonable: