Chinese Named Entity Recognition with Stanford Word Segmenter and Stanford Named Entity Recognizer (NER)

Date: 2021-01-19

1. Word Segmentation

http://nlp.stanford.edu/software/segmenter.shtml

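This is Stanford University's word segmenter. It requires JDK 1.8 or later. Download stanford-segmenter-2014-10-26 from the link above and unpack it. The data directory contains two gzip-compressed model files, ctb.gz and pku.gz. CTB is trained on the Penn Chinese Treebank from the University of Pennsylvania, and PKU is trained on data provided by Peking University. You can also train your own model; a training example is available at http://nlp.stanford.edu/software/trainSegmenter-20080521.tar.gz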

2. NER

http://nlp.stanford.edu/software/CRF-NER.shtml

Stanford NER is implemented in Java and recognizes PERSON, ORGANIZATION, and LOCATION entities. Research that uses this software should cite the following paper:

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.

The paper can be downloaded from:

http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf

The NER page offers two archives for download: stanford-ner-2014-10-26 and stanford-ner-2012-11-11-chinese.

Unpack both. By default, NER handles English; processing Chinese takes some extra steps.

Included with Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.

3 class:	Location, Person, Organization
4 class:	Location, Person, Organization, Misc
7 class:	Time, Location, Organization, Person, Money, Percent, Date

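As listed above, 3-class, 4-class, and 7-class models are provided for English, but there is no equivalent set for Chinese. An online demo is available at http://nlp.stanford.edu:8080/ner/ if you want to try it out.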

3. Using the Segmenter and NER

Create a new Java Project in Eclipse and set it up as follows. Copy the segmenter's data directory to the project root. Copy everything unpacked from stanford-ner-2012-11-11-chinese into a classifiers folder and place that folder in the project root. Add stanford-segmenter-3.5.0, stanford-ner-3.5.0.jar, and stanford-ner.jar to the classpath. Finally, download stanford-corenlp-full-2014-10-31 from http://nlp.stanford.edu/software/corenlp.shtml and add the unpacked stanford-corenlp-3.5.0 to the classpath as well.
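Before running the demos, it can help to confirm that the model files ended up where the code expects them. The following is a minimal sketch and not part of the original project: `ModelLayoutCheck` is a hypothetical helper, and the three relative paths are the ones the demo classes in this post load.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which model files the demos expect but cannot find. */
class ModelLayoutCheck {

    // Relative paths used by the demo code in this post.
    static final String[] REQUIRED = {
        "data/ctb.gz",
        "data/dict-chris6.ser.gz",
        "classifiers/chinese.misc.distsim.crf.ser.gz"
    };

    /** Returns the required paths that are not regular files under projectRoot. */
    static List<String> missing(Path projectRoot) {
        List<String> result = new ArrayList<>();
        for (String rel : REQUIRED) {
            if (!Files.isRegularFile(projectRoot.resolve(rel))) {
                result.add(rel);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pass the project root as the first argument, or default to the working directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missing(root);
        if (missing.isEmpty()) {
            System.out.println("All model files found under " + root.toAbsolutePath());
        } else {
            System.out.println("Missing: " + missing);
        }
    }
}
```

Running it from the Eclipse project root should print either a confirmation or the list of paths that still need to be copied.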

The "Chinese NER" note on the Stanford NER page says:

We also provide Chinese models built from the Ontonotes Chinese named entity data. There are two models, one using distributional similarity clusters and one without. These are designed to be run on word-segmented Chinese. So, if you want to use these on normal Chinese text, you will first need to run Stanford Word Segmenter or some other Chinese word segmenter, and then run NER on the output of that!

The note is clear: the output of Chinese word segmentation must be used as the input to NER, and only then can the named entities be recognized.
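The order described in that note can be sketched as a two-stage pipeline. This is only an illustration under stated assumptions: `SegmentThenTag` is a hypothetical class, not a Stanford API, and the two stages are passed in as plain functions so the sketch runs without any model files. In the real project the stages would be the segmenter's `segmentString` and the classifier's `classifyWithInlineXML`, both used in the demo classes in this post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper illustrating the segment-then-tag order; not a Stanford API. */
class SegmentThenTag {

    /**
     * Runs the segmenter on raw text, joins the tokens with single spaces,
     * and hands the space-separated string to the tagger.
     */
    static String run(String rawText,
                      Function<String, List<String>> segmenter,
                      Function<String, String> tagger) {
        List<String> tokens = segmenter.apply(rawText);
        String spaced = String.join(" ", tokens);
        return tagger.apply(spaced);
    }

    public static void main(String[] args) {
        // Stand-in stages: a whitespace splitter and a tagger that wraps one known token.
        Function<String, List<String>> splitter =
                s -> Arrays.asList(s.trim().split("\\s+"));
        Function<String, String> tagger =
                s -> s.replace("Stanford", "<ORG>Stanford</ORG>");
        System.out.println(run("I visited   Stanford today", splitter, tagger));
    }
}
```

Keeping the stages as parameters also makes it easy to swap in another Chinese word segmenter, which the note explicitly allows.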

For easier testing, this demo uses junit-4.10.jar. The code follows.

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

/**
 * <p>
 * ClassName ExtractDemo
 * </p>
 * <p>
 * Description Loads the NER module
 * </p>
 *
 * @author wangxu
 * <p>
 * Date 2015-01-08 14:53:45
 * </p>
 * @version V1.0.0
 */
public class ExtractDemo {
    // Shared across instances so the model is loaded from disk only once.
    private static AbstractSequenceClassifier<CoreLabel> ner;

    public ExtractDemo() {
        initNer();
    }

    public void initNer() {
        // Chinese model copied from stanford-ner-2012-11-11-chinese into classifiers/
        String serializedClassifier = "classifiers/chinese.misc.distsim.crf.ser.gz";
        if (ner == null) {
            ner = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
        }
    }

    /** Tags a word-segmented sentence and returns it with inline XML entity tags. */
    public String doNer(String sent) {
        return ner.classifyWithInlineXML(sent);
    }

    public static void main(String[] args) {
        // The input must already be word-segmented (tokens separated by spaces).
        String str = "我 去 吃飯 ， 告訴 李強 一聲 。";
        ExtractDemo extractDemo = new ExtractDemo();
        System.out.println(extractDemo.doNer(str));
        System.out.println("Complete!");
    }
}

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import org.apache.commons.io.FileUtils;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

/**
 * <p>
 * ClassName ZH_SegDemo
 * </p>
 * <p>
 * Description Chinese word segmentation with Stanford CoreNLP
 * </p>
 *
 * @author wangxu
 * <p>
 * Date 2015-01-08 13:56:54
 * </p>
 * @version V1.0.0
 */
public class ZH_SegDemo {
    public static CRFClassifier<CoreLabel> segmenter;

    static {
        // Segmenter configuration
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        segmenter = new CRFClassifier<CoreLabel>(props);
        // Load the model trained on the Penn Chinese Treebank (CTB)
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        segmenter.flags.setProperties(props);
    }

    public static String doSegment(String sent) {
        // segmentString returns a List<String>. Iterate it directly: casting the
        // Object[] from toArray() to String[] can throw ClassCastException.
        List<String> words = segmenter.segmentString(sent);
        StringBuilder buf = new StringBuilder();
        for (String s : words) {
            buf.append(s).append(' ');
        }
        System.out.println("segmented res: " + buf.toString());
        return buf.toString();
    }

    public static void main(String[] args) {
        try {
            // Read the raw article, segment it, then run NER on the segmented text.
            String readFileToString = FileUtils.readFileToString(new File("澳門141人食物中毒與進食「問題生蠔」有關.txt"));
            String doSegment = doSegment(readFileToString);
            System.out.println(doSegment);

            ExtractDemo extractDemo = new ExtractDemo();
            System.out.println(extractDemo.doNer(doSegment));

            System.out.println("Complete!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Make sure to run this on JDK 1.8 or later. The final output looks like this:

loading dictionaries from data/dict-chris6.ser.gz...Done. Unique words in ChineseDictionary is: 423200

done [23.2 sec].

serDictionary=data/dict-chris6.ser.gz

sighanCorporaDict=data

inputEncoding=UTF-8

sighanPostProcessing=true

INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false

INFO: TagAffixDetector: building TagAffixDetector from data/dict/character_list and data/dict/in.ctb

Loading character dictionary file from data/dict/character_list

Loading affix dictionary from data/dict/in.ctb

segmented res: 2008年 9月 9日新華網 9月 8日信息：（記者張家偉）澳門特區政府衛生局疾病預防及控制中心 8 日表示，目前累計有 141 人在本地自助餐廳進食後出現食物中毒症狀，其中大部分與進食「問題生蠔」有關。衛生局最早在 3 日公佈說，有 14 名來自三個羣體的港澳人士 8月 27日至 30日期間在澳門金沙酒店用餐後出現不適，患者陸續出現發熱、嘔吐和腹瀉等類諾沃克樣病毒感染的症狀。初步調查顯示，「上述情況可能和進食生蠔有關」。

2008年 9月 9日新華網 9月 8日信息：（記者張家偉）澳門特區政府衛生局疾病預防及控制中心 8 日表示，目前累計有 141 人在本地自助餐廳進食後出現食物中毒症狀，其中大部分與進食「問題生蠔」有關。衛生局最早在 3 日公佈說，有 14 名來自三個羣體的港澳人士 8月 27日至 30日期間在澳門金沙酒店用餐後出現不適，患者陸續出現發熱、嘔吐和腹瀉等類諾沃克樣病毒感染的症狀。初步調查顯示，「上述情況可能和進食生蠔有關」。

Loading classifier from E:\workspaces\EclipseEE4.4\aaaaaa\classifiers\chinese.misc.distsim.crf.ser.gz ... done [6.8 sec].

<MISC>2008年 9月 9日新華網 9月 8日</MISC> 信息：（記者 <PERSON>張家偉</PERSON> ） <GPE>澳門</GPE> <LOC>特區</LOC> <ORG>政府衛生局疾病預防及控制中心</ORG> <MISC>8 日</MISC> 表示，目前累計有 141 人在本地自助餐廳進食後出現食物中毒症狀，其中大部分與進食「問題生蠔」有關。 <ORG>衛生局</ORG> 最早在 3 日公佈說，有 14 名來自 <MISC>三</MISC> 個羣體的 <GPE>港澳</GPE> 人士 <MISC>8月 27日至 30日</MISC> 期間在 <GPE>澳門</GPE> 金沙酒店用餐後出現不適，患者陸續出現發熱、嘔吐和腹瀉等類諾沃克樣病毒感染的症狀。初步調查顯示，「上述情況可能和進食生蠔有關」。

Complete!
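The inline XML above is easy to read but awkward to consume from code. As a rough sketch, the tagged spans can be pulled out with a regular expression. `InlineXmlEntities` is a hypothetical helper, not part of Stanford NER, and it assumes that tags are uppercase, never nested, and that the text itself contains no angle brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts {label, text} pairs from inline-XML NER output. */
class InlineXmlEntities {

    // Matches <TAG>text</TAG> where the closing tag repeats the opening one.
    private static final Pattern ENTITY =
            Pattern.compile("<([A-Z]+)>(.*?)</\\1>", Pattern.DOTALL);

    /** Returns the entities in order of appearance; each element is {label, text}. */
    static List<String[]> extract(String inlineXml) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return entities;
    }

    public static void main(String[] args) {
        // Each argument is treated as one tagged string; prints "LABEL<TAB>text" per entity.
        for (String arg : args) {
            for (String[] e : extract(arg)) {
                System.out.println(e[0] + "\t" + e[1]);
            }
        }
    }
}
```

Applied to the output above, this should yield pairs such as PERSON with 張家偉 and GPE with 澳門, which can then be grouped or counted by label.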

Reposted from:

http://blog.csdn.net/sparkexpert/article/details/49497231

http://blog.csdn.net/jdbc/article/details/51382262

http://blog.csdn.net/haoji007/article/details/52788676