ABCDEFGHIJKLMNO
1
jieba, THULAC, SnowNLP, pynlpir, CoreNLP, pyLTP, spaCy, gensim, iepy, NLTK, HTMLParser, wordnet/hownet, TextBlob, AllenNLP
2
star118594872159941426239879856092626590447111414
3
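Feature overview
Supports three word-segmentation modes.
Builds a directed acyclic graph (DAG) of every possible word that the Chinese characters in a sentence can form, using an efficient word-graph scan over a prefix dictionary.
Uses dynamic programming to find the maximum-probability path, i.e. the best segmentation according to word frequency.
For out-of-vocabulary words, uses an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm.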
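The dictionary-based steps listed above (build a DAG of candidate words, then choose the maximum-probability path by dynamic programming) can be sketched as follows. This is a minimal illustration, not the library's implementation: the dictionary `FREQ`, its frequencies, and the function names are all made up for the example, and unknown characters simply fall back to single-character words instead of going through an HMM.

```python
import math

# Toy prefix dictionary with invented frequencies (hypothetical data).
FREQ = {
    "北京": 50, "大学": 40, "北京大学": 30, "大学生": 20, "学生": 25,
    "北": 2, "京": 2, "大": 3, "学": 3, "生": 5,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of all dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # unknown character: treat it as a one-character word
    return dag


def best_route(sentence, dag):
    """Right-to-left dynamic programming over the DAG.

    route[i] = (best log-probability of sentence[i:], end index of the first word).
    """
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1)) - math.log(TOTAL)
             + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def segment(sentence):
    """Follow the stored end indices from left to right to recover the words."""
    route = best_route(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("北京大学生"))
```

With these toy frequencies the path "北京" + "大学生" scores higher than "北京大学" + "生", which shows how word frequency, rather than longest match, decides the split.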
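Trained on the corpus assembled by Tsinghua University, currently the world's largest manually segmented and part-of-speech-tagged Chinese corpus (about 58 million characters), which gives the model strong tagging ability.
High accuracy. On the standard Chinese Treebank (CTB5) dataset, the toolkit reaches an F1 of 97.3% for word segmentation and 92.9% for part-of-speech tagging, comparable to the best methods on that dataset.
Fairly fast. Joint segmentation and part-of-speech tagging runs at 300KB/s, about 150,000 characters per second. Segmentation alone reaches 1.3MB/s.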
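Inspired by TextBlob,
but does not use NLTK: all algorithms are implemented from scratch, and some pre-trained dictionaries are bundled. Note that the program works on unicode text throughout, so decode your input to unicode before use.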
the base forms of words,
their parts of speech, whether they are names of companies, people, etc.,
normalize dates, times, and numeric quantities,
mark up the structure of sentences in terms of phrases and syntactic dependencies,
indicate which noun phrases refer to the same entities,
indicate sentiment,
extract particular or open-class relations between entity mentions,
get the quotes people said.
Written in Java.
Multilingual; supports multithreading,
with a configurable number of threads.
In NLTK, WordNet can be used directly, but spaCy does not appear to support it.
Providing a consistent API for diving
into common natural language processing (NLP) tasks.
Stands on the giant shoulders of NLTK and Pattern,
and plays nicely with both
An NLP research library, built on PyTorch,
for developing state-of-the-art deep learning
models on a wide variety of linguistic tasks.
4
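Word segmentation
Precise mode tries to cut the sentence as accurately as possible and suits text analysis; full mode scans out every word that can be formed in the sentence, which is very fast but cannot resolve ambiguity; search-engine mode re-segments long words on top of precise mode to improve recall and suits search-engine indexing.
Supports word segmentation.
This model was trained on the People's Daily word-segmentation corpus.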
+++++
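The difference between full mode and search-engine mode described in this row can be illustrated with a toy dictionary. This is a rough sketch under stated assumptions: `DICT`, `full_mode`, and `search_mode` are hypothetical names, the precise-mode output is passed in by hand, and the real tools may handle single characters and unknown words differently.

```python
# Toy dictionary (hypothetical data for illustration only).
DICT = {"中国", "科学", "学院", "科学院", "中国科学院"}


def full_mode(sentence):
    """Emit every dictionary word found anywhere in the sentence (no disambiguation)."""
    n = len(sentence)
    return [
        sentence[i:j]
        for i in range(n)
        for j in range(i + 1, n + 1)
        if sentence[i:j] in DICT
    ]


def search_mode(precise_words):
    """Re-split long words from a precise segmentation to improve recall.

    For each word, emit its 2-character and then 3-character substrings that
    are dictionary words, followed by the word itself.
    """
    out = []
    for word in precise_words:
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    gram = word[i:i + size]
                    if gram in DICT:
                        out.append(gram)
        out.append(word)
    return out


print(full_mode("中国科学院"))
print(search_mode(["中国科学院"]))
```

Both modes return overlapping words, so neither yields a single consistent segmentation; they trade ambiguity resolution for coverage, which is what a search index usually wants.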
5
pipeline?+?+++
6
Keyword extraction+(TF-IDF, TextRank)+(TextRank)+(TextRank)?+?+
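As a reminder of what the TF-IDF entry in this row refers to, here is a minimal keyword-ranking sketch. It assumes documents are already tokenized and computes IDF from the small corpus passed in; libraries often ship a precomputed IDF table instead. The function name `tfidf_keywords` is hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the tokens of docs[doc_index] by TF-IDF.

    docs is a list of token lists; IDF is estimated from docs itself.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


corpus = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["cat", "dog", "play"],
]
print(tfidf_keywords(corpus, 0, top_k=2))
```

Here "mat" outranks "cat" even though "cat" is more frequent in the document, because "mat" appears in no other document.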
7
Named entity recognition?+++
8
Dependency parsing?+
+(transition-based
neural network parser,
with Weibo data added)
+
9
Semantic role labeling?++(Bi-LSTM SRL model)++
10
Part-of-speech tagging++
+(TnT 3-gram HMM)
++(with Weibo data added)++
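The HMM entry in this row names a trigram (3-gram) tagger. The sketch below shows the same decoding idea with a simpler first-order (bigram) HMM and the Viterbi algorithm; it is a simplification, not TnT itself. All probabilities and names (`TAGS`, `START_P`, `TRANS_P`, `EMIT_P`, `viterbi`) are invented for illustration, and the input is assumed to be a non-empty word list.

```python
import math

# Toy model parameters (hypothetical values).
TAGS = ["N", "V"]
START_P = {"N": 0.7, "V": 0.3}
TRANS_P = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
EMIT_P = {
    "N": {"dogs": 0.5, "cats": 0.4, "bark": 0.1},
    "V": {"bark": 0.7, "chase": 0.3},
}


def viterbi(words, tags=TAGS, start_p=START_P, trans_p=TRANS_P,
            emit_p=EMIT_P, floor=1e-6):
    """Return the most probable tag sequence under a first-order HMM."""

    def lp(p):
        # Log-probability with a small floor for unseen events.
        return math.log(p if p > 0 else floor)

    # best[t][tag] = (log-prob of the best path ending in tag at t, previous tag)
    best = [{
        tag: (lp(start_p.get(tag, 0)) + lp(emit_p[tag].get(words[0], 0)), None)
        for tag in tags
    }]
    for word in words[1:]:
        prev = best[-1]
        column = {}
        for tag in tags:
            score, back = max(
                (prev[p][0] + lp(trans_p[p].get(tag, 0)), p) for p in tags
            )
            column[tag] = (score + lp(emit_p[tag].get(word, 0)), back)
        best.append(column)

    # Backtrace from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


print(viterbi(["dogs", "chase", "cats"]))
```

A trigram tagger conditions each transition on the two previous tags, which enlarges the state space but leaves the dynamic-programming structure unchanged.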
11
Topic Modelling?+(Naive Bayes)+++
12
Text summarization?+(TextRank)++
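TextRank-based summarization, as named in this row, ranks sentences by running a PageRank-style iteration over a sentence-similarity graph. The sketch below is one common formulation, assuming pre-tokenized, non-empty sentences and the word-overlap similarity normalized by log lengths; the function names and parameter defaults are hypothetical, and individual libraries may use a different similarity measure.

```python
import math


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences (lists of tokens) with weighted PageRank."""
    n = len(sentences)
    sets = [set(s) for s in sentences]

    # Edge weight: shared words, normalized by the log of both sentence lengths.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            denom = math.log(len(sentences[i])) + math.log(len(sentences[j]))
            if denom > 0:
                weights[i][j] = len(sets[i] & sets[j]) / denom

    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, top_k=1):
    """Return the top_k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    chosen = sorted(range(len(sentences)), key=lambda i: -scores[i])[:top_k]
    return [sentences[i] for i in sorted(chosen)]


sents = [["a", "b"], ["a", "b", "c", "d"], ["c", "d"], ["e", "f"]]
print(summarize(sents, top_k=1))
```

The second sentence shares words with two others, so it accumulates the highest score and is selected as the one-sentence summary.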
13
Sentiment analysis?+++
14
Text similarity?+(BM25)?+
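For reference, BM25 as mentioned in this row scores a document against a query by combining term frequency, inverse document frequency, and length normalization. The sketch below uses a standard Okapi BM25 variant with a "+1" inside the logarithm to keep IDF non-negative; the function name and the parameter defaults `k1=1.5`, `b=0.75` are assumptions, and a given library may use different constants or a different IDF form.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document (list of tokens) against the query tokens."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n

    # Document frequency of each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["cat", "sat", "mat"], ["dog", "barked"], ["cat", "cat", "dog"]]
print(bm25_scores(["cat"], docs))
```

The third document ranks first because it contains the query term twice at the same length as the first document; the second document scores zero.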
15
URL https://github.com/fxsjy/jieba
https://github.com/thunlp/THULAC-Python
https://github.com/stanfordnlp/CoreNLP
https://github.com/HIT-SCIR/ltp
https://github.com/nltk/nltk
https://github.com/sloria/TextBlob/
https://github.com/allenai/allennlp
16
msr_test(560KB)
17
Time 0.26s 0.62s 3.21s
18
Precision 0.814 0.877 0.867
19
Recall 0.809 0.899 0.896
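Precision and recall for word segmentation are usually computed over word spans: a predicted word counts as correct only if both of its boundaries match a gold word. The sketch below shows that calculation under the assumption that gold and predicted segmentations cover the same text; the function names are hypothetical, and the scoring script behind the figures in this table may differ in detail.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold, predicted):
    """Return (precision, recall, F1) of predicted words against gold words."""
    gold_spans, pred_spans = to_spans(gold), to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = ["北京", "大学生", "很", "多"]
pred = ["北京", "大学", "生", "很多"]
print(prf(gold, pred))
```

In this example only "北京" matches on both boundaries, so one of four predicted words and one of four gold words are correct.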