CS224d-Lecture8
Date: 2021-01-13
Tags: machine learning, nlp
Language Model
A language model computes the probability of a sequence of words.
Useful for machine translation:
word ordering
- p(the cat is small) > p(small the is cat)
word choice
- p(walking home after school) > p(walking house after school)
Traditional Language Model
Conditional probabilities over a fixed window of size n.
Assumption: each word depends only on the previous n−1 words:
$$P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$
n-gram
- bigram
$$p(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)}$$
- trigram
$$p(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}$$
n-gram models consume a huge amount of memory: every distinct n-gram observed needs its own count, as the sketch below makes concrete.
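A minimal Python sketch of the count-based estimates above, using a made-up toy corpus; the growing count tables show where the memory goes:

```python
from collections import defaultdict

# Toy corpus, made up purely for illustration.
corpus = [
    ["the", "cat", "is", "small"],
    ["the", "dog", "is", "small"],
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        unigram_counts[w1] += 1
        bigram_counts[(w1, w2)] += 1

def bigram_prob(w1, w2):
    """p(w2 | w1) = count(w1, w2) / count(w1), as in the formula above."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("the", "cat"))  # 0.5: "the" precedes "cat" in 1 of 2 sentences
```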
RNN
- weights are tied across every time step
- conditions on all previous words, not just a fixed window
- RAM requirement scales only with the number of words
$$h_t = \sigma\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$$
$$\hat{y}_t = \mathrm{softmax}\left(W^{(S)} h_t\right)$$
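A minimal numpy sketch of these two equations, taking σ to be the sigmoid and feeding one-hot word vectors; all dimensions and weight scales here are arbitrary illustrations:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hidden size and vocabulary size are arbitrary illustrations.
D_h, V = 4, 10
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(D_h, D_h))  # hidden-to-hidden
W_hx = rng.normal(scale=0.1, size=(D_h, V))    # input-to-hidden
W_s = rng.normal(scale=0.1, size=(V, D_h))     # hidden-to-output

def rnn_step(h_prev, x_t):
    """h_t = sigmoid(W_hh h_{t-1} + W_hx x_t); y_t = softmax(W_s h_t)."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))
    return h_t, softmax(W_s @ h_t)

h = np.zeros(D_h)
x = np.zeros(V)
x[3] = 1.0                   # one-hot vector for word index 3
h, y = rnn_step(h, x)
print(y.sum())               # distribution over the vocabulary sums to 1
```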
Training an RNN is hard
vanishing / exploding gradient problem
The total error is the sum of the errors at each time step:
$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$$
$$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \cdot \frac{\partial y_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_k} \cdot \frac{\partial h_k}{\partial W}$$
where
$$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$$
Writing the hidden state with the nonlinearity made explicit,
$$h_t = W f(h_{t-1}) + W^{(hx)} x_{[t]}$$
each factor in the product becomes
$$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W^{\top} \, \mathrm{diag}\!\left(f'(h_{j-1})\right)$$
Bounding the norms of the two factors by $\beta_W$ and $\beta_h$,
$$\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \le \left\| W^{\top} \right\| \cdot \left\| \mathrm{diag}\!\left(f'(h_{j-1})\right) \right\| \le \beta_W \beta_h$$
$$\left\| \frac{\partial h_t}{\partial h_k} \right\| = \left\| \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right\| \le \left(\beta_W \beta_h\right)^{t-k}$$
Because of the exponent t−k, this quantity can very quickly become extremely large or extremely small.
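A rough numerical illustration of that bound, assuming a sigmoid f (so f′ ≤ 0.25); the dimensions and weight scales are arbitrary:

```python
import numpy as np

# Multiply 30 Jacobians W^T diag(f'(h)) together and watch the norm:
# small recurrent weights shrink it toward 0, large ones blow it up.
rng = np.random.default_rng(0)
D_h = 50
for scale in (0.05, 2.0):
    W = rng.normal(scale=scale, size=(D_h, D_h))
    J = np.eye(D_h)
    for _ in range(30):                          # 30 steps back in time
        fprime = rng.uniform(0.0, 0.25, D_h)     # sigmoid' never exceeds 0.25
        J = J @ (W.T * fprime)                   # equals W^T @ diag(fprime)
    print(scale, np.linalg.norm(J))              # ~0 vs. astronomically large
```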
The vanishing gradient problem means that inputs from many steps back have almost no influence on the current training update.
Exploding gradients → clip the gradient norm (see the sketch below)
Vanishing gradients → better weight initialization + ReLUs
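A minimal sketch of gradient norm clipping (the threshold value is an arbitrary illustration):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale grad so its L2 norm never exceeds threshold.
    The threshold of 5.0 is an arbitrary illustration."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])   # norm 50, far above the threshold
print(clip_gradient(g))      # rescaled to norm 5 -> [3. 4.]
```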
The output softmax over the full vocabulary is huge and slow.
Bidirectional RNN: a second RNN reads the sequence right-to-left, so each position sees both past and future context.
Deep bidirectional RNN: stack several bidirectional layers, each taking the lower layer's hidden states as input.
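A minimal sketch of the bidirectional idea: two independent RNNs, one left-to-right and one right-to-left, with hidden states concatenated per position. All names and sizes are illustrative; a deep version would stack such layers:

```python
import numpy as np

def rnn_pass(xs, W_hh, W_hx, reverse=False):
    """Run a simple sigmoid RNN over the sequence, optionally right-to-left."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in (reversed(xs) if reverse else xs):
        h = 1.0 / (1.0 + np.exp(-(W_hh @ h + W_hx @ x)))
        hs.append(h)
    return hs[::-1] if reverse else hs   # realign backward states to positions

# Arbitrary sizes for illustration: hidden 4, input 10, sequence length 3.
rng = np.random.default_rng(0)
D_h, D_x, T = 4, 10, 3
params = lambda: (rng.normal(scale=0.1, size=(D_h, D_h)),
                  rng.normal(scale=0.1, size=(D_h, D_x)))
xs = [rng.normal(size=D_x) for _ in range(T)]

fwd = rnn_pass(xs, *params())                 # left-to-right states
bwd = rnn_pass(xs, *params(), reverse=True)   # right-to-left states
# Each position now carries both past and future context.
h_bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(h_bi[0].shape)  # (8,) = forward + backward hidden state
```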
F1 metric
$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
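A tiny worked example with made-up counts:

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts of true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=8, fp=2, fn=4))  # precision 0.8, recall ~0.667 -> F1 ~0.727
```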