Learning NLP: Lesson One

Date: 2019-01-15 10:16:01

Keywords: nlp



To do a good job, you must first sharpen your tools.

1. Install nltk with pip:

    pip install nltk

2. From the command line, start Python and run:

    import nltk
    nltk.download('punkt')

Raw text has to pass through several stages before it can be processed. The main ones are usually:
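1. Text cleaning: remove characters that are not needed, for example by calling BeautifulSoup's get_text to strip HTML, or by removing non-ASCII characters.
2. Sentence splitting: turn a large block of raw text into a series of sentences. In programming terms, one string is split into several strings. You can split on "." or "。", or use the sentence splitter that ships in nltk.tokenize (usage: from nltk.tokenize import sent_tokenize).
3. Tokenization: the smallest unit a machine works with is the word, so after sentence splitting each sentence is split further into a series of meaningful words. How complex tokenization needs to be depends on the application. Many tokenizers are available, such as split, word_tokenize, and regexp_tokenize.
4. Stemming: a fairly crude, rule-based pruning process. For example, eating and eaten share the root eat, so during processing I simply treat both of them as eat.
5. Lemmatization: covers all the inflected forms of a root. It uses the current context to map a word to the base form it should take (usage: from nltk.stem import WordNetLemmatizer).
6. Stop word removal: words that carry little meaning, such as the, a, and an, are removed. Stop word lists are usually curated by hand, though some are generated automatically from a given corpus. NLTK includes stop word lists for 22 languages.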
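
To make the data flow between these stages concrete, here is a minimal sketch that uses only the Python standard library. Every function in it (`clean_text`, `split_sentences`, `tokenize`, `toy_stem`, `preprocess`) is a hypothetical stand-in written for illustration, not an NLTK API, and the sample input and stop word list are made up. The stemmer is a toy suffix stripper rather than a real stemming algorithm, and lemmatization is left out because it needs a dictionary such as WordNet.

```python
import re

# Hypothetical sample input; not taken from the article's database.
RAW_HTML = "<p>The cat is eating. A fish was eaten!</p>"

# Tiny hand-written stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was"}


def clean_text(html):
    """Stage 1: drop tags and non-ASCII characters (crude stand-in for get_text)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return text.encode("ascii", errors="ignore").decode("ascii")


def split_sentences(text):
    """Stage 2: split after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Stage 3: lowercase the sentence and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def toy_stem(word):
    """Stage 4: strip a few suffixes, keeping at least three letters."""
    for suffix in ("ing", "en", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    tokens = []
    for sentence in split_sentences(clean_text(html)):
        tokens.extend(tokenize(sentence))
    # Stage 6 runs before stage 4 here so the stop list can hold unstemmed words.
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess(RAW_HTML))  # ['cat', 'eat', 'fish', 'eat']
```

In real use, each stand-in would be replaced by the corresponding NLTK function named in the list, which handles far more cases than these few rules.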

Based on the points above, the Python code involved is:

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    from nltk.tokenize import sent_tokenize, blankline_tokenize, word_tokenize
    import nltk
    import pymysql


    def mysql_select():
        # Open the database connection
        db = pymysql.connect(host="localhost", user="root", password="root",
                             database="csdn", charset="utf8")
        # Get a cursor with cursor()
        cursor = db.cursor()
        cursor.execute("SELECT * FROM `article_info` ORDER BY RAND() LIMIT 1")
        # Fetch the selected row
        result = cursor.fetchall()
        db.close()
        return result


    str_text = mysql_select()
    # Text cleaning: only the content column is needed
    str_text = str_text[0]
    # Get content
    str_text = str_text[3]
    # Clean the text by stripping the HTML
    soup = BeautifulSoup(str_text, 'lxml')
    str_text = soup.get_text()
    # print("Text cleaning result: " + str_text)
    # Sentence splitting
    text_list = sent_tokenize(str_text)
    # Tokenization, applied to every sentence
    word_list = []
    # Use NLTK's built-in word tokenizer
    for sentence in text_list:
        item_list = word_tokenize(sentence)
        word_list.extend(item_list)
    result_1_word_list = []
    for word in word_list:
        blank_list = blankline_tokenize(word)
        result_1_word_list.extend(blank_list)
    # To inspect the tokenization result:
    # for item in result_1_word_list:
    #     print(item)
    # Remove stop words (here, only brackets and braces)
    stop_words = [word.strip().lower() for word in ['{', '}', '(', ')', ']', '[']]
    # tok.lower() must be called; comparing the bound method itself never matches
    clean_tokens = [tok for tok in result_1_word_list
                    if len(tok) > 1 and (tok.lower() not in stop_words)]
    token_nltk_result = nltk.FreqDist(clean_tokens)
    for k, v in token_nltk_result.items():
        print(str(k) + " : " + str(v))
    token_nltk_result.plot(10, cumulative=True)
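
The counting step at the end of the script can also be tried without a MySQL table. The sketch below is an assumption-laden variant, not the article's method: `SAMPLE_TEXT` and `count_tokens` are hypothetical names, the sample sentence is made up, and `wordpunct_tokenize` is swapped in for `sent_tokenize` plus `word_tokenize` because it is a plain regular-expression tokenizer that needs no downloaded model.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample text standing in for the `content` column.
SAMPLE_TEXT = "NLTK makes NLP easy. NLTK is a Python library [for NLP] too."

# Brackets and braces to drop, mirroring the list used in the script.
PUNCTUATION = {"{", "}", "(", ")", "[", "]"}


def count_tokens(text):
    """Tokenize, drop one-character tokens and brackets, then count the rest."""
    tokens = wordpunct_tokenize(text)
    clean = [t for t in tokens if len(t) > 1 and t.lower() not in PUNCTUATION]
    return FreqDist(clean)


freq = count_tokens(SAMPLE_TEXT)
for token, count in freq.most_common(3):
    print(f"{token} : {count}")
```

Since the length filter already discards single-character tokens, the bracket list only matters for longer tokens; a fuller stop word list would do more work at this step.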