Table of Contents
Author | Wei Wu (victor.wuv@gmail.com) |
Date | 2019-09-30 10:37:28 |
1 NLP workflow:
1.1 Sentiment analysis
A sentiment value determines attitude, which may be a judgment or evaluation, affective state or the intended emotional communication. The sentiment value is labeled as 0 (negative), 1 (neutral) or 2 (positive).
GS can read stock news and report from our database, then it analyze the sentiment of the text content, finally giving a sentiment value on each date.
We can use this sentiment value as a factor for stocks in the multi-factor analysis.
At the same time, we can use this technique to give a sentiment value of the current web page user are browsing.
- Input: selected document content
- Output: sentiment value time series.
1.1.1 Use cases
- Map stock name related to the news/report statement input into its sentiment value, then we can set them as factors, which can be applied to
- Compute average sentiment values in time series, presenting sentiment of overall public opinions, which can be applied to
Defining and predicting public opinions for investors/analysts who cares (more efficient and probably more accurate than human’s instinct judgement)
Eg 短期交易者, dispersion (data and function variance), 与market opinion相对/同做trading
- Source is important, diverse and complete enough: possible standard - a database can make completely random sampling
- News sentiment analysis, together with keyword extraction, see below
1.1.2 Steps of generating sentiment values in GS
- Load report summaries from a relational database, here we use Juyuan MySQL database (see below screenshot).
Hyperparameters:
input | description |
---|---|
db_name | database name |
table_name | database table |
tbl_columns | list of column names to select from SQL table (only used when reading a table) |
t_filter_col_name | column name of date |
t_values | filter by date |
Output: An analyst reports dataframe, each report is presented in a row
- Apply SnowNLP library’s sentiment method, which is a bayesian classifier.
Then we get each report’s sentiment value.
- Calculate the average sentiment value by dividing the total number of reports on each date.
Output is time series.
1.2 Word2vec
Word2vec is a group of related models for word embeddings. We provide a specific process of how to use it to find similar words.
Input: selected document content Output: word embedding model
1.2.1 Use cases
- Searching relevant papers: providing similar keywords of papers to the paper title we already have
‘The Decision Tree Approach to Stock Selection’: find other papers by looking for similar keywords of ‘decision tree classifier’, ‘random forest’, ‘stock prediction’…
- Node2vec for GS tasks
- Policy instance
1.2.2 Steps of finding similar words in GS
- Select corpus, which can be an user interested domain from wikipedia, analyst reports and research files. Assume GS having a category tree of default corpus and there is also an API for users to insert their own sources.
- Extract all the related pages by manually selecting a category/keyword, including its subcategories such as depth level. This forms a skill instance.
- Output:
A skill instance, which can be shown as a node array as below
- Translate the skill instance into graph nodes, then extract all GIDs from the nodes
Function: skill2graph Graph nodes
- Output: node gid
- Extract nodes of read only documents
- Input: node GID from above
- Function: lib.gftIO.gs_call.get_nodes_binary_data
- Text data cleaning
remove punctuation, remove stopwords, tokenize
- Train the word2vec model by gensim lib
Hyperparameters
input | description |
---|---|
Window size | moving window size. |
Dimension | word2vec array dimension. |
Minimum occurrence | minimum occurence of vocabulary. |
- Output: trained word2vec model
- Cache this model as a policy variable
- Find similar words using function ‘model.most_similar()’ from gensim lib
eg. 10 most similar words of ‘股票’.
- Evaluation
Currently using human judgement
Factor Graph?
Other possible methods: http://www.aclweb.org/anthology/D15-1036
1.3 Information Extraction
1.3.1 relation extraction
从文本中抽取两个实体之间的投资关系。
下表为投资这一大类关系所包含的相似关系。
设立 | 增资 | 入股 | 收购 | 并购 | 换股 |
---|---|---|---|---|---|
成立 | 受让 | 现金出资 | 要约收购 | 海外并购 | 转股 |
发起设立 | 扩股 | 携手 | 拟收购 | 重组 | 交换 |
组建 | 扩股 | 间接持有 | 并表 | 整合 | 配股 |
新设 | 占股 | 所持 | 过户 | 兼并 | |
出资 | 转让给 | 联手 | 收购了 | 业务整合 | |
共同出资 | 认缴 | 正式成为 | 资产收购 | 借壳上市 | |
全资 | 定向增发 | 转让给 | 通过收购 | ||
参股 | 参股 | ||||
入驻 | |||||
创投 |