
1. NLP workflow:

Author	Wei Wu (victor.wuv@gmail.com)
Date	2019-09-30 10:37:28

1 NLP workflow:


1.1 Sentiment analysis

A sentiment value captures the attitude expressed in a text, which may be a judgment or evaluation, an affective state, or an intended emotional communication. Each text is labeled 0 (negative), 1 (neutral) or 2 (positive).

GS reads stock news and reports from our database, analyzes the sentiment of the text content, and produces a sentiment value for each date.

This sentiment value can be used as a stock factor in multi-factor analysis.

The same technique can also assign a sentiment value to the web page a user is currently browsing.

Input: selected document content
Output: sentiment value time series.

1.1.1 Use cases

Map each stock named in a news or report statement to that statement's sentiment value, then use these values as factors, which can be applied to
1. Factor model - univariate test
2. Stock selection - pick stocks labeled as 2
Compute the average sentiment value as a time series to represent overall public opinion, which can be applied to

Defining and predicting public opinion for the investors and analysts who care about it (more efficient and probably more accurate than human instinct)

E.g. short-term traders, dispersion (data and function variance), trading against or with market opinion
The source matters: it should be diverse and complete enough. A possible standard is a database from which completely random samples can be drawn
News sentiment analysis, together with keyword extraction, see below

1.1.2 Steps of generating sentiment values in GS

Load report summaries from a relational database; here we use the Juyuan MySQL database (see the screenshot below).

Hyperparameters:

input	description
db_name	database name
table_name	database table
tbl_columns	list of column names to select from SQL table (only used when reading a table)
t_filter_col_name	column name of date
t_values	filter by date

Output: a dataframe of analyst reports, with one report per row
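The loading step can be sketched as a parameterized query built from the inputs in the table above. This is only an illustration under stated assumptions: the standard-library `sqlite3` module stands in for the Juyuan MySQL database, `t_values` is assumed to be a list of dates, and the table, column names and rows in the demo are hypothetical.

```python
import re
import sqlite3

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_query(table_name, tbl_columns, t_filter_col_name, t_values):
    """Return (sql, params) selecting tbl_columns for rows whose date is in t_values."""
    if not t_values:
        raise ValueError("t_values must contain at least one date")
    # Identifiers cannot be bound as parameters, so validate them explicitly.
    for name in [table_name, t_filter_col_name, *tbl_columns]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    # "?" is the sqlite3 placeholder; MySQL drivers commonly use "%s" instead.
    placeholders = ", ".join("?" for _ in t_values)
    sql = (
        f"SELECT {', '.join(tbl_columns)} FROM {table_name} "
        f"WHERE {t_filter_col_name} IN ({placeholders}) "
        f"ORDER BY {t_filter_col_name}"
    )
    return sql, list(t_values)


def load_reports(conn, table_name, tbl_columns, t_filter_col_name, t_values):
    """Run the query and return one dict per report (one row per report)."""
    sql, params = build_query(table_name, tbl_columns, t_filter_col_name, t_values)
    rows = conn.execute(sql, params).fetchall()
    return [dict(zip(tbl_columns, row)) for row in rows]


# Hypothetical demo data standing in for the analyst report table.
conn = sqlite3.connect(":memory:")  # db_name would select the real database
conn.execute("CREATE TABLE reports (report_date TEXT, stock TEXT, summary TEXT)")
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?)",
    [
        ("2019-09-27", "A", "profit grew strongly"),
        ("2019-09-30", "A", "guidance was cut"),
        ("2019-09-30", "B", "results in line"),
    ],
)
reports = load_reports(
    conn, "reports", ["report_date", "summary"], "report_date", ["2019-09-30"]
)
```

The list of dicts plays the role of the dataframe here; with pandas available, the same rows could be passed to a `DataFrame` constructor.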

Apply the sentiment method of the SnowNLP library, which is a Bayesian classifier.

This gives a sentiment value for each report.

Calculate the average sentiment value for each date by dividing the sum of that date's sentiment values by the number of reports on that date.

The output is a time series.
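A minimal sketch of the scoring and averaging steps follows. It rests on several assumptions that are not stated in this document: the scorer returns a value between 0 and 1 (as SnowNLP's `sentiments` property does, to the best of our knowledge), the thresholds that turn a score into the 0/1/2 label are illustrative, and the daily value is the mean of the labels. A toy lookup table replaces the real scorer so that the example runs without SnowNLP.

```python
from collections import defaultdict


def to_label(score, low=0.4, high=0.6):
    """Map a score in [0, 1] to 0 (negative), 1 (neutral) or 2 (positive).

    The thresholds are illustrative assumptions, not values used by GS.
    """
    if score < low:
        return 0
    if score > high:
        return 2
    return 1


def daily_average(reports, score_fn):
    """Average the sentiment labels of all reports that share a date."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for date, text in reports:
        totals[date] += to_label(score_fn(text))
        counts[date] += 1
    # Sorting the date strings works because they are ISO dates (YYYY-MM-DD).
    return [(date, totals[date] / counts[date]) for date in sorted(totals)]


# Stand-in scorer for the demo; with SnowNLP installed this could be
# `lambda text: SnowNLP(text).sentiments`.
toy_scores = {"strong growth": 0.9, "in line": 0.5, "profit warning": 0.1}
reports = [
    ("2019-09-27", "strong growth"),
    ("2019-09-30", "in line"),
    ("2019-09-30", "profit warning"),
]
series = daily_average(reports, toy_scores.get)
```

Passing the scorer as an argument keeps the averaging logic independent of the classifier, so the Bayesian classifier could be swapped for another model without touching the time-series step.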

1.2 Word2vec

Word2vec is a group of related models for word embeddings. This section describes a step-by-step process for using it to find similar words.

Input: selected document content
Output: word embedding model

1.2.1 Use cases

Searching for relevant papers: suggest keywords similar to those of a paper title we already have

For ‘The Decision Tree Approach to Stock Selection’, find other papers by looking up keywords similar to ‘decision tree classifier’, ‘random forest’, ‘stock prediction’…
Node2vec for GS tasks
Policy instance

1.2.2 Steps of finding similar words in GS

Select a corpus, which can be a domain the user is interested in from Wikipedia, analyst reports or research files. We assume GS has a category tree of default corpora and an API that lets users insert their own sources.
Extract all related pages by manually selecting a category or keyword, including its subcategories down to a chosen depth level. This forms a skill instance.
Output:

A skill instance, which can be shown as a node array as below

Translate the skill instance into graph nodes, then extract all GIDs from the nodes

Function: skill2graph Graph nodes

Output: node gid
Extract the nodes of read-only documents
Input: node GID from above
Function: lib.gftIO.gs_call.get_nodes_binary_data
Text data cleaning

remove punctuation, remove stopwords, tokenize
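The cleaning step can be illustrated with a short standard-library sketch. It assumes whitespace-delimited text and a tiny hypothetical stopword list, and it applies stopword removal to the tokens after splitting. Chinese text has no whitespace word boundaries, so a word segmenter (for example the jieba library, which is an assumption and not named in this document) would be needed in place of `split()`.

```python
import re

# Tiny illustrative stopword list; a real one would be loaded from a file.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and"}


def clean(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    # \w matches Unicode letters, digits and "_"; every other
    # non-space character is replaced by a space.
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in no_punct.split() if tok not in stopwords]


tokens = clean("The price of the stock is rising, and analysts agree!")
```

Each cleaned document becomes a list of tokens, which is the input shape that word2vec training expects.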

Train the word2vec model using the gensim library

Hyperparameters

input	description
Window size	size of the moving context window
Dimension	dimension of each word2vec vector
Minimum occurrence	minimum number of occurrences for a word to enter the vocabulary

Output: trained word2vec model
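As a sketch of how the three hyperparameters above might map onto gensim's `Word2Vec` constructor: `window` and `min_count` are gensim keyword arguments, and the embedding size is `vector_size` in gensim 4.x (`size` in 3.x). The helper names `word2vec_kwargs` and `train` are hypothetical, and the values shown are placeholders rather than the settings used in GS.

```python
def word2vec_kwargs(window_size, dimension, min_occurrence, gensim_major=4):
    """Translate the hyperparameters into gensim Word2Vec keyword arguments."""
    # gensim 4.x names the embedding size `vector_size`; 3.x called it `size`.
    size_key = "vector_size" if gensim_major >= 4 else "size"
    return {"window": window_size, size_key: dimension, "min_count": min_occurrence}


def train(sentences, **hyperparams):
    """Train a model on tokenized sentences (a list of token lists)."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    return Word2Vec(sentences=sentences, **hyperparams)


# Placeholder values for illustration only.
params = word2vec_kwargs(window_size=5, dimension=100, min_occurrence=5)
```

With gensim installed, `train(tokenized_docs, **params)` would return the trained model, where `tokenized_docs` is the cleaned corpus as a list of token lists.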
Cache this model as a policy variable
Find similar words using the function ‘model.most_similar()’ from the gensim library

E.g. the 10 words most similar to ‘股票’ (stock).
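To make the similarity lookup concrete, here is a pure-Python sketch of the underlying idea: rank all other words by cosine similarity to the query word. It is not gensim's implementation, and the three-dimensional vectors are hand-made toy values rather than the output of a trained model. In recent gensim versions the equivalent call is, as far as we know, `model.wv.most_similar(word, topn=10)`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made 3-dimensional toy embeddings, not output of a trained model.
toy_vectors = {
    "stock": [0.9, 0.1, 0.0],
    "share": [0.8, 0.2, 0.1],
    "bond": [0.5, 0.5, 0.0],
    "weather": [0.0, 0.1, 0.9],
}
neighbours = most_similar(toy_vectors, "stock", topn=2)
```

In this toy example "share" ranks ahead of "bond" because its vector points in nearly the same direction as the vector for "stock".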

Evaluation

Evaluation currently relies on human judgement

Factor Graph?

Other possible methods: http://www.aclweb.org/anthology/D15-1036

1.3 Information Extraction

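1.3.1 Relation extraction

Extract the investment relation between two entities from text.

The table below lists the similar relations grouped under the broad category of investment relations.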

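设立 (establish)	增资 (capital increase)	入股 (take an equity stake)	收购 (acquire)	并购 (merger and acquisition)	换股 (share swap)
成立 (found)	受让 (receive by transfer)	现金出资 (cash contribution)	要约收购 (tender offer)	海外并购 (overseas M&A)	转股 (share conversion)
发起设立 (establish by promotion)	扩股 (share expansion)	携手 (join hands)	拟收购 (plan to acquire)	重组 (restructure)	交换 (exchange)
组建 (set up)	扩股 (share expansion)	间接持有 (hold indirectly)	并表 (consolidate into financial statements)	整合 (integrate)	配股 (rights issue)
新设 (newly establish)	占股 (hold a share of equity)	所持 (held by)	过户 (transfer of ownership)	兼并 (merge)
出资 (contribute capital)	转让给 (transfer to)	联手 (team up)	收购了 (acquired)	业务整合 (business integration)
共同出资 (jointly contribute capital)	认缴 (subscribe capital)	正式成为 (officially become)	资产收购 (asset acquisition)	借壳上市 (backdoor listing)
全资 (wholly owned)	定向增发 (private placement)	转让给 (transfer to)	通过收购 (through acquisition)
参股 (hold a minority stake)		参股 (hold a minority stake)
入驻 (move in)
创投 (venture capital)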
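One simple way to use such a keyword table is a trigger-word lookup, sketched below. This is a hypothetical baseline and not the extraction method used in GS: it assumes the two entities have already been identified, treats the first row of the table as the category name for each column, reproduces only a few keywords, and ignores the direction of the relation. The example sentence and company names are made up.

```python
# Trigger keyword -> relation category (assumed to be the column's first row).
# Only a few entries from the table are reproduced here.
KEYWORD_TO_CATEGORY = {
    "设立": "设立", "成立": "设立", "共同出资": "设立",
    "增资": "增资", "定向增发": "增资",
    "入股": "入股", "现金出资": "入股",
    "收购": "收购", "要约收购": "收购",
    "并购": "并购", "重组": "并购",
    "换股": "换股", "配股": "换股",
}


def find_investment_relation(sentence, entity_a, entity_b):
    """Return (entity_a, category, entity_b) if both entities and a trigger occur."""
    if entity_a not in sentence or entity_b not in sentence:
        return None
    # Try longer keywords first so "要约收购" wins over its substring "收购".
    for keyword in sorted(KEYWORD_TO_CATEGORY, key=len, reverse=True):
        if keyword in sentence:
            return (entity_a, KEYWORD_TO_CATEGORY[keyword], entity_b)
    return None


# Made-up sentence: "Company A plans a tender offer for all shares of Company B".
triple = find_investment_relation("甲公司拟要约收购乙公司全部股份", "甲公司", "乙公司")
```

A keyword lookup like this misses paraphrases and cannot tell which entity is the investor, so it is best read as a starting point for labeling or for seeding a learned extractor.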
