I split my Assignment 1 solutions into four parts, covering problems 1, 2, 3, and 4 respectively. This part contains the solution to problem 3.
3. word2vec (40 points + 5 bonus)
(a). (3 points) Assume you are given a predicted word vector $v_c$ corresponding to the center word $c$ for skip-gram, and word prediction is made with the softmax function found in word2vec models

$$\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{W} \exp(u_w^T v_c)} \tag{4}$$

where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \dots, W$) are the "output" word vectors for all words in the vocabulary. Assume cross entropy cost is applied to this prediction and word $o$ is the expected word (the $o$-th element of the one-hot label vector is one), derive the gradients with respect to $v_c$.
Hint: It will be helpful to use notation from question 2. For instance, letting $\hat{y}$ be the vector of softmax predictions for every word, $y$ as the expected word vector, and the loss function

$$J_{\text{softmax-CE}}(o, v_c, U) = CE(y, \hat{y}) \tag{5}$$

where $U = [u_1, u_2, \dots, u_W]$ is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.
Solution: Let the word-vector dimension be $n_{\text{dim}}$, and take all word vectors to be column vectors, so $v_c$ has shape $n_{\text{dim}} \times 1$ and $U$ has shape $n_{\text{dim}} \times W$. Also write $\theta = U^T v_c$, so that $\hat{y} = \text{softmax}(\theta)$. By the result of question 2(b) (writing $y$ for the one-hot label vector, as in the hint):

$$\frac{\partial J_{\text{softmax-CE}}}{\partial v_c} = \frac{\partial J_{\text{softmax-CE}}}{\partial \theta} \cdot \frac{\partial \theta}{\partial v_c} = U (\hat{y} - y)$$
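As a quick sanity check, this result can be verified numerically. Below is a minimal numpy sketch (my own illustration with made-up dimensions, not the assignment's starter code) comparing the analytic gradient $U(\hat{y} - y)$ with a centered finite-difference estimate:

```python
import numpy as np

def softmax_ce_loss(v_c, U, o):
    """J = -log softmax(U^T v_c)[o]; v_c is (n_dim,), U is (n_dim, W)."""
    theta = U.T @ v_c                      # (W,)
    theta = theta - theta.max()            # shift for numerical stability
    return -(theta[o] - np.log(np.exp(theta).sum()))

# random test case (hypothetical sizes)
rng = np.random.default_rng(0)
n_dim, W, o = 5, 10, 3
v_c = rng.standard_normal(n_dim)
U = rng.standard_normal((n_dim, W))

# analytic gradient: U (y_hat - y)
theta = U.T @ v_c
y_hat = np.exp(theta - theta.max())
y_hat /= y_hat.sum()
y = np.zeros(W)
y[o] = 1.0
grad_analytic = U @ (y_hat - y)

# centered finite-difference estimate
eps = 1e-6
grad_numeric = np.array([
    (softmax_ce_loss(v_c + eps * np.eye(n_dim)[i], U, o)
     - softmax_ce_loss(v_c - eps * np.eye(n_dim)[i], U, o)) / (2 * eps)
    for i in range(n_dim)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)
```

This is the same finite-difference idea that the gradient checker in the coding part is built on.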
(b). (3 points) As in the previous part, derive gradients for the "output" word vectors $u_w$ (including $u_o$).
Solution: As in (a), let $\theta = U^T v_c$. Then:

$$\frac{\partial J_{\text{softmax-CE}}}{\partial U_{ij}} = \sum_k \frac{\partial J_{\text{softmax-CE}}}{\partial \theta_k} \frac{\partial \theta_k}{\partial U_{ij}} = \sum_k (\hat{y} - y)\big|_k \frac{\partial \theta_k}{\partial U_{ij}}$$

where $\theta_k = u_k^T v_c$, so

$$\frac{\partial \theta_k}{\partial U_{ij}} = \begin{cases} v_i & j = k \\ 0 & j \neq k \end{cases}$$

with $v_i$ denoting the $i$-th element of $v_c$. Then:

$$\sum_k (\hat{y} - y)\big|_k \frac{\partial \theta_k}{\partial U_{ij}} = (\hat{y} - y)\big|_j \, v_i = \big[ v_c (\hat{y} - y)^T \big]\big|_{i,j}$$

Therefore:

$$\frac{\partial J_{\text{softmax-CE}}}{\partial U} = v_c (\hat{y} - y)^T$$
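Running in the same session as the sketch from (a) (it reuses softmax_ce_loss, v_c, U, o, y_hat, y, and eps defined there), the matrix gradient $v_c(\hat{y}-y)^T$ can be checked entry by entry:

```python
# check dJ/dU = v_c (y_hat - y)^T, shape (n_dim, W)
gradU_analytic = np.outer(v_c, y_hat - y)

gradU_numeric = np.zeros_like(U)
for i in range(n_dim):
    for j in range(W):
        E = np.zeros_like(U)
        E[i, j] = eps                       # perturb one entry at a time
        gradU_numeric[i, j] = (softmax_ce_loss(v_c, U + E, o)
                               - softmax_ce_loss(v_c, U - E, o)) / (2 * eps)
assert np.allclose(gradU_analytic, gradU_numeric, atol=1e-5)
```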
(c). (6 points) Repeat parts (a) and (b) assuming we are using the negative sampling loss for the predicted vector $v_c$, and the expected output word is $o$. Assume that $K$ negative samples (words) are drawn, and they are $1, \dots, K$, respectively, for simplicity of notation ($o \notin \{1, \dots, K\}$). Again, for a given word $o$, denote its output vector as $u_o$. The negative sampling loss function in this case is

$$J_{\text{neg-sample}}(o, v_c, U) = -\log(\sigma(u_o^T v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^T v_c)) \tag{6}$$

where $\sigma(\cdot)$ is the sigmoid function.

After you've done this, describe with one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e. the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).

Note: the cost function here is the negative of what Mikolov et al. had in their original paper, because we are doing a minimization instead of maximization in our code.
Solution: Let $S$ be the set of the $K$ sampled indices.

$$\frac{\partial J_{\text{neg-sample}}}{\partial v_c} = -\frac{\partial \log \sigma(u_o^T v_c)}{\partial v_c} - \sum_{i \in S} \frac{\partial \log \sigma(-u_i^T v_c)}{\partial v_c} = \big[\sigma(u_o^T v_c) - 1\big] u_o - \sum_{i \in S} \big[\sigma(-u_i^T v_c) - 1\big] u_i$$

$$\frac{\partial J_{\text{neg-sample}}}{\partial u_w} = -\frac{\partial \log \sigma(u_o^T v_c)}{\partial u_w} - \sum_{i \in S} \frac{\partial \log \sigma(-u_i^T v_c)}{\partial u_w} = \begin{cases} \big[\sigma(u_o^T v_c) - 1\big] v_c & w = o \\ \big[1 - \sigma(-u_w^T v_c)\big] v_c & w \in S \\ 0 & w \neq o \text{ and } w \notin S \end{cases}$$

Equation (6) is much faster to compute than (5) because the softmax-CE loss needs dot products with all $W$ output vectors in the vocabulary, whereas negative sampling only touches the $K + 1$ vectors that were actually sampled:

$$\frac{\text{runtime of softmax-CE}}{\text{runtime of negative sampling loss}} = \frac{O(W)}{O(K)}$$

(I am not sure whether this is stated precisely; corrections from experts are welcome.)
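Both gradients translate almost line for line into numpy. Here is a minimal sketch under my own conventions (output vectors stored as the columns of U, the sampled indices $S$ passed as a list neg_idx); it is not the assignment's reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_grads(v_c, U, o, neg_idx):
    """Gradients of J_neg-sample with respect to v_c and the output vectors.

    v_c: (n_dim,); U: (n_dim, W) with output vectors as columns;
    o: index of the expected word; neg_idx: the K sampled indices S.
    """
    s_o = sigmoid(U[:, o] @ v_c)
    grad_v = (s_o - 1.0) * U[:, o]           # [sigma(u_o^T v_c) - 1] u_o
    grad_U = np.zeros_like(U)
    grad_U[:, o] = (s_o - 1.0) * v_c
    for i in neg_idx:
        s_i = sigmoid(-U[:, i] @ v_c)
        grad_v += (1.0 - s_i) * U[:, i]      # = -[sigma(-u_i^T v_c) - 1] u_i
        grad_U[:, i] += (1.0 - s_i) * v_c
    return grad_v, grad_U
```

Accumulating into grad_U with += also handles the case where the same negative index is drawn more than once.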
(d). (8 points) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words $[\text{word}_{c-m}, \dots, \text{word}_{c-1}, \text{word}_c, \text{word}_{c+1}, \dots, \text{word}_{c+m}]$, where $m$ is the context size. Denote the "input" and "output" word vectors for $\text{word}_k$ as $v_k$ and $u_k$ respectively.

Hint: feel free to use $F(o, v_c)$ (where $o$ is the expected word) as a placeholder for the $J_{\text{softmax-CE}}(o, v_c, \dots)$ or $J_{\text{neg-sample}}(o, v_c, \dots)$ cost functions in this part; you'll see that this is a useful abstraction for the coding part. That is, your solution may contain terms of the form $\frac{\partial F(o, v_c)}{\partial \dots}$.

Recall that for skip-gram, the cost for a context centered around $c$ is

$$J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m}) = \sum_{-m \le j \le m, j \neq 0} F(w_{c+j}, v_c) \tag{7}$$

where $w_{c+j}$ refers to the word at the $j$-th index from the center.

CBOW is slightly different. Instead of using $v_c$ as the predicted vector, we use $\hat{v}$ defined below. For (a simpler variant of) CBOW, we sum up the input word vectors in the context

$$\hat{v} = \sum_{-m \le j \le m, j \neq 0} v_{c+j} \tag{8}$$

then the CBOW cost is

$$J_{\text{CBOW}}(\text{word}_{c-m \dots c+m}) = F(w_c, \hat{v}) \tag{9}$$

Note: To be consistent with the $\hat{v}$ notation such as for the code portion, for skip-gram $\hat{v} = v_c$.
Solution: Let $v_k$ and $u_k$ be the input and output vectors for word $k$, respectively.

For skip-gram:

$$\frac{\partial J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m})}{\partial v_k} = \sum_{-m \le j \le m, j \neq 0} \frac{\partial F(w_{c+j}, v_c)}{\partial v_k}$$

$$\frac{\partial J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m})}{\partial u_k} = \sum_{-m \le j \le m, j \neq 0} \frac{\partial F(w_{c+j}, v_c)}{\partial u_k}$$

where $w_{c+j}$ is the one-hot vector of the word at offset $j$ from the center. (Note that each term $\frac{\partial F(w_{c+j}, v_c)}{\partial v_k}$ is zero unless $k = c$, since $F$ depends only on the input vector $v_c$.)

For CBOW:

$$\frac{\partial J_{\text{CBOW}}(\text{word}_{c-m \dots c+m})}{\partial v_k} = \frac{\partial F(w_c, \hat{v})}{\partial v_k} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} \cdot \frac{\partial \hat{v}}{\partial v_k} = \begin{cases} \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} & k \in \{c-m, \dots, c-1, c+1, \dots, c+m\} \\ 0 & \text{otherwise} \end{cases}$$

$$\frac{\partial J_{\text{CBOW}}(\text{word}_{c-m \dots c+m})}{\partial u_k} = \frac{\partial F(w_c, \hat{v})}{\partial u_k}$$

P.S.: This answer feels surprisingly simple; why is it worth 8 points?
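To make the $F$ abstraction concrete, here is how the skip-gram case might be accumulated in code. The signature of F (returning the cost plus gradients with respect to $\hat{v}$ and $U$) is an assumption of mine that mirrors parts (a) to (c); this is a sketch, not the actual starter code:

```python
import numpy as np

def skipgram_cost_and_grads(center, context_idx, V, U, F):
    """Sum F(w_{c+j}, v_c) over the context window, as in equation (7).

    V: (W, n_dim) input vectors as rows; U: (n_dim, W) output vectors.
    F(target, v_hat, U) -> (cost, grad_v_hat, grad_U) is an assumed
    signature standing in for softmax-CE or negative sampling.
    """
    v_hat = V[center]                        # for skip-gram, v_hat = v_c
    cost = 0.0
    grad_V = np.zeros_like(V)
    grad_U = np.zeros_like(U)
    for t in context_idx:                    # indices c+j, j != 0
        c, gv, gU = F(t, v_hat, U)
        cost += c
        grad_V[center] += gv                 # all of dJ/dv lands on v_c
        grad_U += gU
    return cost, grad_V, grad_U
```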
(e)(f)(g)(h): see the code; omitted here.
Attached is a plot from my training run, i.e., the figure produced after running q3_run.py; there is a thread on reddit discussing how to judge whether this plot looks reasonable: