Table of Contents

Author Wei Wu (victor.wuv@gmail.com)
Date 2019-09-30 10:37:28

1 NLP workflow:

1.1 Sentiment analysis

A sentiment value determines attitude, which may be a judgment or evaluation, affective state or the intended emotional communication. The sentiment value is labeled as 0 (negative), 1 (neutral) or 2 (positive).

GS can read stock news and report from our database, then it analyze the sentiment of the text content, finally giving a sentiment value on each date.

We can use this sentiment value as a factor for stocks in the multi-factor analysis.

At the same time, we can use this technique to give a sentiment value of the current web page user are browsing.

  • Input: selected document content
  • Output: sentiment value time series.

1.1.1 Use cases

  1. Map stock name related to the news/report statement input into its sentiment value, then we can set them as factors, which can be applied to
    1. Factor model - univariate test
    2. Stock selection - pick stocks labeled as 2
  2. Compute average sentiment values in time series, presenting sentiment of overall public opinions, which can be applied to

    Defining and predicting public opinions for investors/analysts who cares (more efficient and probably more accurate than human’s instinct judgement)

    Eg 短期交易者, dispersion (data and function variance), 与market opinion相对/同做trading

  3. Source is important, diverse and complete enough: possible standard - a database can make completely random sampling
  4. News sentiment analysis, together with keyword extraction, see below

1.1.2 Steps of generating sentiment values in GS

  1. Load report summaries from a relational database, here we use Juyuan MySQL database (see below screenshot).

Hyperparameters:

input description
db_name database name
table_name database table
tbl_columns list of column names to select from SQL table (only used when reading a table)
t_filter_col_name column name of date
t_values filter by date

Output: An analyst reports dataframe, each report is presented in a row

  1. Apply SnowNLP library’s sentiment method, which is a bayesian classifier.

Then we get each report’s sentiment value.

  1. Calculate the average sentiment value by dividing the total number of reports on each date.

Output is time series.

1.2 Word2vec

Word2vec is a group of related models for word embeddings. We provide a specific process of how to use it to find similar words.

Input: selected document content Output: word embedding model

1.2.1 Use cases

  1. Searching relevant papers: providing similar keywords of papers to the paper title we already have

    ‘The Decision Tree Approach to Stock Selection’: find other papers by looking for similar keywords of ‘decision tree classifier’, ‘random forest’, ‘stock prediction’…

  2. Node2vec for GS tasks
  3. Policy instance

1.2.2 Steps of finding similar words in GS

  1. Select corpus, which can be an user interested domain from wikipedia, analyst reports and research files. Assume GS having a category tree of default corpus and there is also an API for users to insert their own sources.
  2. Extract all the related pages by manually selecting a category/keyword, including its subcategories such as depth level. This forms a skill instance.
  3. Output:

A skill instance, which can be shown as a node array as below

  1. Translate the skill instance into graph nodes, then extract all GIDs from the nodes

Function: skill2graph Graph nodes

  • Output: node gid
  • Extract nodes of read only documents
  • Input: node GID from above
  • Function: lib.gftIO.gs_call.get_nodes_binary_data
  • Text data cleaning

remove punctuation, remove stopwords, tokenize

  1. Train the word2vec model by gensim lib

Hyperparameters

input description
Window size moving window size.
Dimension word2vec array dimension.
Minimum occurrence minimum occurence of vocabulary.
  • Output: trained word2vec model
  • Cache this model as a policy variable
  • Find similar words using function ‘model.most_similar()’ from gensim lib

eg. 10 most similar words of ‘股票’.

  1. Evaluation

Currently using human judgement

Factor Graph?

Other possible methods: http://www.aclweb.org/anthology/D15-1036

1.3 Information Extraction

1.3.1 relation extraction

从文本中抽取两个实体之间的投资关系。

下表为投资这一大类关系所包含的相似关系。

设立 增资 入股 收购 并购 换股
成立 受让 现金出资 要约收购 海外并购 转股
发起设立 扩股 携手 拟收购 重组 交换
组建 扩股 间接持有 并表 整合 配股
新设 占股 所持 过户 兼并  
出资 转让给 联手 收购了 业务整合  
共同出资 认缴 正式成为 资产收购 借壳上市  
全资 定向增发 转让给 通过收购    
参股   参股      
入驻          
创投          

1.3.2 knowledge extraction from scientific publications for ease paper reading