CS224d Assignment1 Solutions, Part (2/4)

I have split the Assignment1 solutions into four parts, covering problems 1, 2, 3, and 4 respectively. This part contains the solution to problem 2.

2. Neural Network Basics (30 points)

(a). (3 points) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e. in some expression where only $\sigma(x)$, but not $x$, is present). Assume that the input $x$ is a scalar for this question. Recall, the sigmoid function is

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

Solution:

$$\sigma'(x) = -(1 + e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) = \sigma(x)\left[1 - \sigma(x)\right]$$

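A quick way to convince yourself of this identity is a finite-difference check. The sketch below assumes numpy and uses illustrative function names of my own choosing (it is not the assignment's starter code):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    # Gradient expressed purely in terms of the function value s = sigma(x)
    return s * (1.0 - s)

# Central-difference check at a few scalar points
eps = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    analytic = sigmoid_grad(sigmoid(x))
    assert abs(numeric - analytic) < 1e-8
```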

(b). (3 points) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e. find the gradients with respect to the softmax input vector $\theta$, when the prediction is made by $\hat{y} = \mathrm{softmax}(\theta)$. Remember the cross entropy function is

$$CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i) \tag{3}$$

where $y$ is the one-hot label vector, and $\hat{y}$ is the predicted probability vector for all classes. (Hint: you might want to consider the fact that many elements of $y$ are zeros, and assume that only the k-th dimension of $y$ is one.)

Solution: Following the hint, assume that the k-th entry of $y$ is 1 and all the others are 0, i.e. $y_k = 1$. Then:

$$CE(y, \hat{y}) = -y_k \log(\hat{y}_k) = -\log(\hat{y}_k)$$

For the $i$-th element $\theta_i$ of $\theta$, we have:
$$\frac{\partial\, CE(y, \hat{y})}{\partial \theta_i} = -\frac{\partial}{\partial \theta_i}\log\frac{e^{\theta_k}}{\sum_j e^{\theta_j}} = -\frac{\partial}{\partial \theta_i}\left(\theta_k - \log\sum_j e^{\theta_j}\right) = \frac{\partial \log\sum_j e^{\theta_j}}{\partial \theta_i} - \frac{\partial \theta_k}{\partial \theta_i} = \begin{cases}\hat{y}_i & i \neq k \\ \hat{y}_i - 1 & i = k\end{cases}$$

Therefore,
$$\frac{\partial\, CE(y, \hat{y})}{\partial \theta} = \hat{y} - y$$

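As a quick sanity check, here is a minimal numpy sketch that compares the analytic gradient $\hat{y} - y$ against a central-difference estimate; the function names `softmax` and `cross_entropy` are illustrative, not necessarily those of the assignment's starter code:

```python
import numpy as np

def softmax(theta):
    # Shift by the max for numerical stability; theta is a 1-D vector here
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

rng = np.random.RandomState(0)
theta = rng.randn(5)
y = np.zeros(5); y[2] = 1.0          # one-hot label, k = 2

analytic = softmax(theta) - y        # the result derived above: y_hat - y

# Central-difference estimate of each component
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps; tm[i] -= eps
    numeric[i] = (cross_entropy(y, softmax(tp)) - cross_entropy(y, softmax(tm))) / (2 * eps)

assert np.allclose(numeric, analytic, atol=1e-6)
```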

(c). (6 points) Derive the gradients with respect to the inputs $x$ to a one-hidden-layer neural network (that is, find $\frac{\partial J}{\partial x}$ where $J$ is the cost function for the neural network). The neural network employs a sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot label vector is $y$, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as the shorthand for the sigmoid gradient, and feel free to define any variables whenever you see fit.)
[Figure: one-layer perceptron]
Recall that the forward propagation is as follows

$$h = \mathrm{sigmoid}(xW_1 + b_1), \qquad \hat{y} = \mathrm{softmax}(hW_2 + b_2)$$

Note that here we're assuming that the input vector (thus the hidden variables and output probabilities) is a row vector, to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we are applying it to each of the elements of that vector. $W_i$ and $b_i$ ($i = 1, 2$) are the weights and biases, respectively, of the two layers.

Solution: Let the k-th entry of $y$ be 1 and all the others be 0, i.e. $y_k = 1$. Then:

$$J = -y_k \log(\hat{y}_k) = -\log(\hat{y}_k)$$

Let $\theta_2 = hW_2 + b_2$, so that $\hat{y} = \mathrm{softmax}(\theta_2)$. Write the $i$-th element of $\theta_2$ as $\theta^{(2)}_i$ and the entry of $W_2$ in row $i$, column $j$ as $W^{(2)}_{ij}$. Then:
$$\frac{\partial J}{\partial h_i} = \sum_j \frac{\partial J}{\partial \theta^{(2)}_j}\frac{\partial \theta^{(2)}_j}{\partial h_i} = \sum_j (\hat{y}_j - y_j)\, W^{(2)}_{ij} = \left.(\hat{y} - y)W_2^T\right|_i$$

Here $\frac{\partial \theta^{(2)}_j}{\partial h_i} = W^{(2)}_{ij}$; indeed, using the Einstein summation convention, $\theta^{(2)}_j = h_i W^{(2)}_{ij} + b^{(2)}_j$, from which $\frac{\partial \theta^{(2)}_j}{\partial h_i} = W^{(2)}_{ij}$ follows. Moreover, $\frac{\partial J}{\partial \theta_2} = \hat{y} - y$ is exactly the result of part (b).

Let $\theta_1 = xW_1 + b_1$, so that $h = \sigma(\theta_1)$. Write the $i$-th element of $\theta_1$ as $\theta^{(1)}_i$ and the entry of $W_1$ in row $i$, column $j$ as $W^{(1)}_{ij}$. Then:

$$\frac{\partial J}{\partial \theta^{(1)}_i} = \sum_j \frac{\partial J}{\partial h_j}\frac{\partial h_j}{\partial \theta^{(1)}_i} = \frac{\partial J}{\partial h_i}\frac{\partial h_i}{\partial \theta^{(1)}_i} = \left.(\hat{y} - y)W_2^T\right|_i \cdot \left.\sigma'(\theta_1)\right|_i$$

Likewise:
$$\frac{\partial J}{\partial x_i} = \sum_j \frac{\partial J}{\partial \theta^{(1)}_j}\frac{\partial \theta^{(1)}_j}{\partial x_i} = \sum_j \frac{\partial J}{\partial \theta^{(1)}_j}\, W^{(1)}_{ij} = \left.\left((\hat{y} - y)W_2^T \circ \sigma'(\theta_1)\right)W_1^T\right|_i$$

Here $\circ$ denotes the elementwise (Hadamard) product. (A small gripe: only 6 points for such a tedious derivation feels rather stingy.)

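The whole chain can be checked numerically. The sketch below is a minimal numpy implementation of the forward pass and of the backward formulas derived above, with illustrative layer sizes and variable names (it is not the assignment's starter code), verified against a central-difference gradient on $x$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

rng = np.random.RandomState(1)
Dx, H, Dy = 4, 3, 5                      # illustrative sizes
x = rng.randn(1, Dx)                     # row vector, as in the assignment
W1, b1 = rng.randn(Dx, H), rng.randn(1, H)
W2, b2 = rng.randn(H, Dy), rng.randn(1, Dy)
y = np.zeros((1, Dy)); y[0, 2] = 1.0     # one-hot label

def forward(x):
    theta1 = x.dot(W1) + b1
    h = sigmoid(theta1)
    y_hat = softmax(h.dot(W2) + b2)
    return theta1, h, y_hat

# Backward pass following the derivation above
theta1, h, y_hat = forward(x)
delta2 = y_hat - y                           # dJ/dtheta2
delta1 = delta2.dot(W2.T) * h * (1.0 - h)    # dJ/dtheta1 = (y_hat - y) W2^T ∘ sigma'(theta1)
grad_x = delta1.dot(W1.T)                    # dJ/dx

# Central-difference gradient check on x
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(Dx):
    xp, xm = x.copy(), x.copy()
    xp[0, i] += eps; xm[0, i] -= eps
    Jp = -np.sum(y * np.log(forward(xp)[2]))
    Jm = -np.sum(y * np.log(forward(xm)[2]))
    numeric[0, i] = (Jp - Jm) / (2 * eps)

assert np.allclose(numeric, grad_x, atol=1e-6)
```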

(d). (2 points) How many parameters are there in this neural network, assuming the input is $D_x$-dimensional, the output is $D_y$-dimensional, and there are $H$ hidden units?
Solution: $W_1$ has shape $D_x \times H$, $b_1$ has shape $1 \times H$, $W_2$ has shape $H \times D_y$, and $b_2$ has shape $1 \times D_y$. In total there are $D_x H + H + H D_y + D_y$ parameters.

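To make the count concrete, here is a tiny sketch with hypothetical sizes (the numbers are only for illustration):

```python
# Hypothetical sizes, just to make the parameter count concrete
Dx, H, Dy = 10, 5, 3
n_params = Dx * H + H + H * Dy + Dy      # W1 + b1 + W2 + b2
assert n_params == (Dx + 1) * H + (H + 1) * Dy
print(n_params)                          # -> 73 for these sizes
```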

(e)(f)(g). See the code; omitted here.