CS224d-Lecture8

Language Model

A language model computes the probability of a sequence of words:
  • $P(w_1, w_2, \dots, w_T)$
Useful for machine translation:
word ordering
  • p(the cat is small) > p(small the is cat)
word choice
  • p(walking home after school) > p(walking house after school)

Traditional Language Model

Conditional probability, with a window of size n.

assumption

$$P(w_1, w_2, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$
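For example, with $n = 3$ the sentence from above factorizes into trigram terms:

$$P(\text{the cat is small}) \approx P(\text{the})\,P(\text{cat}\mid\text{the})\,P(\text{is}\mid\text{the},\text{cat})\,P(\text{small}\mid\text{cat},\text{is})$$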

n-gram
  • bigram: $P(w_2 \mid w_1) = \dfrac{\text{count}(w_1, w_2)}{\text{count}(w_1)}$
  • trigram: $P(w_3 \mid w_1, w_2) = \dfrac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}$
    n-gram models consume a large amount of memory
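A minimal sketch of the count-based bigram estimate above; the toy corpus is a made-up example:

```python
from collections import Counter

# toy corpus; any tokenized text would do
corpus = "the cat is small . the cat is cute .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """P(w2 | w1) = count(w1, w2) / count(w1), the MLE estimate."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("the", "cat"))   # 1.0: "the" is always followed by "cat" here
print(bigram_prob("is", "small"))  # 0.5
```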

RNN

  • weights are tied (shared) across all time steps
  • the prediction is conditioned on all previous words
  • RAM requirement scales only with the number of words

$$h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$$
$$\hat{y}_t = \mathrm{softmax}\!\left(W^{(S)} h_t\right)$$
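A minimal NumPy sketch of the two equations above; the dimensions and the random initialization are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, D, V = 50, 100, 10000         # hidden size, word-vector size, vocab size (assumed)
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.01, size=(H, H))  # recurrent weights, shared across time steps
W_hx = rng.normal(scale=0.01, size=(H, D))  # input-to-hidden weights
W_s  = rng.normal(scale=0.01, size=(V, H))  # hidden-to-output (softmax) weights

h_prev = np.zeros(H)
x_t = rng.normal(size=D)                    # word vector for the current word

h_t = sigmoid(W_hh @ h_prev + W_hx @ x_t)   # h_t = sigma(W^(hh) h_{t-1} + W^(hx) x_t)
y_hat = softmax(W_s @ h_t)                  # y_hat_t = softmax(W^(S) h_t)
print(y_hat.shape, y_hat.sum())             # (10000,) ~1.0
```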

Training an RNN is hard
vanishing / exploding gradient problem

total error

$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$$

$$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t}\,\frac{\partial y_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,\frac{\partial h_k}{\partial W}$$

where
$$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$$



Since the hidden state is computed as
$$h_t = W f(h_{t-1}) + W^{(hx)} x_t$$


$$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W^{T}\,\mathrm{diag}\!\left(f'(h_{j-1})\right)$$

$$\left\|\frac{\partial h_j}{\partial h_{j-1}}\right\| \le \left\|W^{T}\right\|\,\left\|\mathrm{diag}\!\left(f'(h_{j-1})\right)\right\| \le \beta_W \beta_h$$

$$\left\|\frac{\partial h_t}{\partial h_k}\right\| = \left\|\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right\| \le \left(\beta_W \beta_h\right)^{t-k}$$

This product can become very large or very small very quickly.
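A tiny numerical illustration: the factors 0.9 and 1.1 below stand in for $\beta_W \beta_h$ and are made-up values.

```python
# if beta_W * beta_h < 1 the bound shrinks geometrically; if > 1 it blows up
for steps in (10, 50, 100):
    print(steps, 0.9 ** steps, 1.1 ** steps)
# 10   0.349       2.59
# 50   0.00515     117.4
# 100  0.0000266   13780.6
```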

The vanishing gradient problem means that words many time steps in the past have almost no influence on the current prediction.
exploding gradient -> clip the gradient
vanishing gradient -> initialization + ReLUs
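A minimal sketch of the gradient-clipping fix for exploding gradients mentioned above; the threshold value is an arbitrary example:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale grad so that its L2 norm is at most `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])   # norm 50, well above the threshold
print(clip_gradient(g))      # [3. 4.] -> norm 5
```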
softmax over the full vocabulary is huge and slow
  • class-based prediction trick
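A rough sketch of the class-based factorization $P(w \mid h) = P(\mathrm{class}(w) \mid h)\,P(w \mid \mathrm{class}(w), h)$; the class assignment and the sizes below are made-up assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

H, V, C = 50, 10000, 100                      # hidden size, vocab size, number of classes (assumed)
words_per_class = V // C                      # assume equal-sized classes for simplicity
rng = np.random.default_rng(0)
W_class = rng.normal(scale=0.01, size=(C, H))                  # scores over classes
W_word = rng.normal(scale=0.01, size=(C, words_per_class, H))  # scores over words within each class

def word_prob(h, word_id):
    c = word_id // words_per_class            # class of the word
    j = word_id % words_per_class             # index of the word inside its class
    p_class = softmax(W_class @ h)[c]         # P(class | h): softmax over C classes
    p_word = softmax(W_word[c] @ h)[j]        # P(word | class, h): softmax over ~V/C words
    return p_class * p_word                   # two small softmaxes instead of one over V

h = rng.normal(size=H)
print(word_prob(h, 4242))
```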
Bidirectional RNN
  • words both before and after the current position influence the prediction at that position
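A sketch of the standard bidirectional formulation: the forward and backward passes keep separate weights, and the output combines both hidden states ($[\,\cdot\,;\,\cdot\,]$ is concatenation):

$$\overrightarrow{h}_t = f\!\left(\overrightarrow{W} x_t + \overrightarrow{V}\,\overrightarrow{h}_{t-1} + \overrightarrow{b}\right)$$
$$\overleftarrow{h}_t = f\!\left(\overleftarrow{W} x_t + \overleftarrow{V}\,\overleftarrow{h}_{t+1} + \overleftarrow{b}\right)$$
$$\hat{y}_t = g\!\left(U\,[\overrightarrow{h}_t; \overleftarrow{h}_t] + c\right)$$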
Deep bidirectional RNN
F1 metric

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 · precision · recall / (precision + recall)
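A minimal check of the formulas above; the TP/FP/FN counts are made-up numbers:

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=8, fp=2, fn=4))   # precision 0.8, recall ~0.667, F1 ~0.727
```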