
These notes are research and project oriented; the pace is adjusted based on progress.

1 Machine Learning and Data Mining

1.1 Problems

1.1.1 Classification

Classification is a very important part of machine learning. Its goal is to use certain features of known samples to decide which known class a new sample belongs to. Classification is also called supervised classification: based on samples from known training regions, features are computed and selected, and a discriminant function is built to classify new samples. Its counterpart, unsupervised classification, is also known as cluster analysis.

1.1.2 Clustering

Cluster analysis is a technique for statistical data analysis that is widely used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Clustering partitions similar objects, via static classification, into different groups or subsets, so that members of the same subset share similar attributes, such as shorter spatial distance in a coordinate system.

1.1.3 Regression

Regression analysis is a statistical method for analyzing data. Its purpose is to understand whether two or more variables are related, and the direction and strength of the relationship, and to build a mathematical model so that particular variables can be observed to predict the variable the researcher is interested in. More concretely, regression analysis helps us understand how the dependent variable changes when only one independent variable varies. In general, regression analysis lets us estimate the conditional expectation of the dependent variable given the independent variables. Regression analysis models the relationship between a dependent (response) variable \(Y\) and independent (explanatory) variables \(X\). Simple linear regression uses a single independent variable \(X\); multiple regression uses more than one independent variable \(X_1, X_2, \ldots, X_i\).

  1. Models

    Linear regression · Simple regression · Ordinary least squares · Polynomial regression · General linear model

    Generalized linear model · Discrete choice · Logistic regression · Multinomial logit · Mixed logit · Probit · Multinomial probit · Ordered models · Ordered probit · Poisson regression

    Multilevel (hierarchical) linear models · Fixed effects · Random effects · Mixed models

    Nonlinear regression · Nonparametric · Semiparametric · Robust · Quantile regression · Isotonic regression · Principal components · Least angle · Local · Segmented · Errors-in-variables

  2. Estimation

    Least squares · Ordinary least squares · Linear · Partial least squares regression · Total · Generalized · Weighted · Nonlinear · Non-negative · Iteratively reweighted · Ridge regression · LASSO

    Least absolute deviations · Bayesian · Bayesian multivariate

  3. Background

    Regression model validation · Mean and predicted response · Errors and residuals · Goodness of fit · Studentized residuals · Gauss–Markov theorem

  4. Anomaly detection

    In data mining, anomaly detection is the identification of items, events, or observations that do not match an expected pattern or the other items in a dataset. Anomalous items often translate into problems such as bank fraud, structural defects, medical issues, or textual errors. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions. In particular, when detecting abuse and network intrusion, the interesting objects are often not rare objects but unexpected bursts of activity. Such a pattern does not follow the common statistical definition of an anomaly as a rare object, so many anomaly detection methods (especially unsupervised ones) will fail on such data unless it has been aggregated appropriately. In contrast, cluster analysis algorithms may be able to detect the micro-clusters formed by these patterns.

    There are three broad categories of anomaly detection algorithms. Unsupervised anomaly detection methods detect anomalies in unlabeled test data under the assumption that the majority of the instances in the dataset are normal, by looking for the instances that fit the remainder of the data least well. Supervised anomaly detection methods require a dataset labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherent class imbalance of anomaly detection). Semi-supervised anomaly detection methods build a model representing normal behavior from a given normal training dataset and then test the likelihood of a test instance being generated by the learned model. (A minimal sketch of the unsupervised case follows below.) Related topics: association rules, reinforcement learning, structured prediction, feature learning, online learning, semi-supervised learning, grammar induction.
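As a hedged illustration of the unsupervised case (not part of the original notes), the sketch below uses scikit-learn's IsolationForest to flag the instances that fit the rest of the data least well; the synthetic data and parameters are assumptions for demonstration.

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: mostly "normal" points plus a few injected outliers (assumed for illustration).
rng = np.random.RandomState(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([normal, outliers])

# Unsupervised anomaly detection: fit on unlabeled data, assuming most instances are normal.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X)   # +1 = inlier, -1 = flagged anomaly

print("flagged anomalies:", np.sum(labels == -1))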

1.2 Supervised Learning (Classification · Regression)

1.2.1 Decision Trees

In decision theory (for example, in risk management), a decision tree consists of a decision graph and possible outcomes (including resource costs and risks), and is used to create a plan for reaching a goal. A decision tree is a special tree structure built to support decision making. It is a decision support tool that uses a tree-like graph or model of decisions, including chance event outcomes, resource costs, and utility; it is one way to display an algorithm. Decision trees are commonly used in operations research, especially in decision analysis, to help identify the strategy most likely to reach a goal. If, in practice, decisions have to be made online without complete knowledge, a decision tree should be paralleled by a probability model as the best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.

1.2.2 Ensemble Methods (Bagging, Boosting, Random Forests)

1.2.3 k-NN

1.2.4 Linear Regression

1.2.5 Naive Bayes

1.2.6 Logistic Regression

1.2.7 Perceptron

1.2.8 Support Vector Machine (SVM)

1.2.9 Relevance Vector Machine (RVM)

1.3 Clustering

1.3.1 BIRCH

1.3.2 Hierarchical

1.3.3 k-means

1.3.4 Expectation-Maximization (EM)

1.3.5 DBSCAN

1.3.6 OPTICS

1.3.7 Mean Shift

1.4 Dimensionality Reduction

1.4.1 Factor Analysis

1.4.2 CCA

1.4.3 ICA

1.4.4 LDA

1.4.5 NMF

1.4.6 PCA

  • Data preprocessing

Feature scaling / mean normalization: compute \[\mu_j=\frac{1}{m} \sum_{i=1}^m x_j^{(i)}\] and replace each \(x_j^{(i)}\) with \(x_j^{(i)}-\mu_j\) (optionally also dividing by the standard deviation or range of feature \(j\)).
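A minimal NumPy sketch of this preprocessing step (the array values are assumptions for illustration):

import numpy as np

# Feature matrix: m samples (rows) by n features (columns); values assumed for illustration.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

mu = X.mean(axis=0)                 # mu_j = (1/m) * sum_i x_j^(i)
sigma = X.std(axis=0)               # per-feature standard deviation
X_scaled = (X - mu) / sigma         # mean normalization plus feature scaling

print(X_scaled.mean(axis=0))        # approximately zero per feature
print(X_scaled.std(axis=0))         # approximately one per feature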

1.4.7 LASSO

PCA vs. Lasso: PCA is unsupervised, while Lasso is supervised.

1.4.8 t-SNE

1.5 Structured Prediction

1.5.1 Probabilistic Graphical Models (Bayesian Networks, CRF, HMM)

1.6 Anomaly Detection

1.6.1 k-NN

1.6.2 Local Outlier Factor

1.7 Neural Networks

1.7.1 Autoencoders

1.7.2 Deep Learning

1.7.3 Multilayer Perceptron

1.7.4 RNN

1.7.5 Restricted Boltzmann Machine

1.7.6 SOM

1.7.7 CNN

1.8 Theory

1.8.1 Bias/Variance Dilemma

1.8.2 Computational Learning Theory

1.8.3 Empirical Risk Minimization

1.8.4 PAC Learning

1.8.5 Statistical Learning

1.8.6 VC Theory

2 Week 1

2.3 What is machine learning

Study of algorithms that

  • improve their performance P
  • at some task T
  • with experience E

The experience E typically comes from data split into a training data set, a validation data set, and a test data set.

2.4 Well-defined machine learning problem

  • supervised learning

Fitting some data to a function or function approximation.

  • unsupervised learning

Figuring out what the data is without any feedback. For instance, if we were given many data points, we could group them by similarity, or perhaps determine which variables are more informative than others.

The hypothesis space is the set of candidate functions mapping inputs to outputs: \[\mathcal{H} = \{h \mid h: X \to Y\}\]

2.4.1 Top-Down Induction of Decision Trees

  • A ← the best decision attribute for the next node.
  • Assign A as the decision attribute for the node.
  • For each value of A, create a new descendant of the node.
  • Sort the training examples to the leaf nodes.
  • If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes.

2.4.2 Entropy

Entropy \(H(X)\) of a random variable \(X\): \[H(X) = -\sum_{i} P(X=i)\,\log_2 P(X=i)\]
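A minimal sketch of how entropy drives the attribute choice in the top-down induction above; the toy dataset and helper names are assumptions for illustration.

import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_i P(X=i) * log2 P(X=i), estimated from label frequencies."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute (ID3-style)."""
    base = entropy(labels)
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[attribute], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset) for subset in splits.values())
    return base - remainder

# Toy weather-style data (assumed): pick the attribute with the highest information gain.
rows = [{"outlook": "sunny", "windy": False}, {"outlook": "rain", "windy": True},
        {"outlook": "sunny", "windy": True}, {"outlook": "overcast", "windy": False}]
labels = ["no", "no", "no", "yes"]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)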

2.5 Course logistics

  • Linear Regression
  • Logistic Regression
  • Neural Networks
  • Support Vector Machines
  • K-means Clustering
  • Principal Components Analysis
  • Anomaly Detection
  • Collaborative Filtering
  • Object Recognition

2.6 Model Representation

To establish notation for future use, we’ll use \(x^{(i)}\) to denote the “input” variables (living area in this example), also called input features, and \(y^{(i)}\) to denote the “output” or target variable that we are trying to predict (price). A pair \((x^{(i)}, y^{(i)})\) is called a training example, and the list of \(m\) training examples \(\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}\) is called a training set.
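To connect this notation to the model, the hypothesis for linear regression with one variable can be written as \[h_\theta(x) = \theta_0 + \theta_1 x,\] with parameters \(\theta_0\) and \(\theta_1\) chosen by minimizing the cost function of the next section.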

2.7 Cost Function

The cost surface \(J(\theta_0,\theta_1)\) is bowl-shaped; its contour plot is that bowl projected onto the 2D parameter plane. A contour plot is a graph that contains many contour lines; a contour line of a two-variable function has a constant value at all points on the same line.
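Concretely, the squared-error cost used for linear regression here is \[J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2,\] i.e. half the mean of the square losses listed below.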

  • A loss function is usually a function defined on a data point, prediction, and label, and measures the penalty (see the code sketch after this list). For example:
    • square loss \(\ell(f(x_i|\theta), y_i) = (f(x_i|\theta) - y_i)^2\), used in linear regression
    • hinge loss \(\ell(f(x_i|\theta), y_i) = \max(0, 1 - f(x_i|\theta)\,y_i)\), used in SVMs
    • 0/1 loss \(\ell(f(x_i|\theta), y_i) = 1 \iff f(x_i|\theta) \neq y_i\), used in theoretical analysis and in the definition of accuracy
  • A cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization). For example:
    • Mean Squared Error \(\mathrm{MSE}(\theta) = \frac{1}{N}\sum_{i=1}^{N}(f(x_i|\theta) - y_i)^2\)
    • the SVM cost function \(\mathrm{SVM}(\theta) = \|\theta\|^2 + C\sum_{i=1}^{N}\xi_i\) (there are additional constraints connecting the \(\xi_i\) with \(C\) and with the training set)
  • An objective function is the most general term for any function that you optimize during training. For example, the probability of generating the training set under a maximum likelihood approach is a well-defined objective function, but it is neither a loss function nor a cost function (although an equivalent cost function could be defined). For example:
    • MLE is a type of objective function (which you maximize)
    • divergence between classes can be an objective function, but it is hardly a cost function unless you define something artificial, such as 1 − divergence, and name it a cost
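A minimal NumPy sketch contrasting a per-point loss with a dataset-level cost (the array values are assumptions for illustration):

import numpy as np

def square_loss(prediction, label):
    """Per-point loss: penalty for a single prediction."""
    return (prediction - label) ** 2

def hinge_loss(score, label):
    """Per-point hinge loss for labels in {-1, +1}, as used by SVMs."""
    return max(0.0, 1.0 - score * label)

def mse_cost(predictions, labels):
    """Dataset-level cost: mean of the square losses over the training set."""
    return np.mean((predictions - labels) ** 2)

# Example values (assumed for illustration).
predictions = np.array([2.5, 0.0, 2.1])
labels = np.array([3.0, -0.5, 2.0])
print(square_loss(predictions[0], labels[0]))   # loss on one point
print(mse_cost(predictions, labels))            # cost over the whole set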

2.8 Linear Regression with One Variable

  • <2017-08-14 Mon>

Linear regression with one variable is also called simple regression. The X variable is called the predictor (independent variable), and Y is called the response (dependent variable).

  • some statistics (formulas are summarized after this list):
    • t-statistic: how many standard errors the estimated coefficient \(\hat{\beta}_1\) is away from 0.
    • p-value: the probability of observing a t-statistic at least this extreme under the null hypothesis that \(\beta_1 = 0\).
    • \(R^2\): the proportion of the variance in the response explained by the regression.
    • RSS: residual sum of squares.
    • RSE: residual standard error.
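As a hedged reference (standard simple-regression formulas, not transcribed from this course), with \(n\) observations, fitted values \(\hat{y}_i\), and mean response \(\bar{y}\):

\[\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-2}}, \qquad R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad t = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}.\]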

2.9 Linear Algebra Review

  • Vector
  • Matrix

3 Week 2 Linear Regression with Multiple Variables

  • Gradient Descent:

Gradient descent takes the derivative of the cost function (the slope of the tangent line at the current point) and uses it as the direction to move in. We repeatedly take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, called the learning rate.

  • Algorithm:

\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)\] Update simultaneously: \[temp_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1)\] \[temp_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1)\] \[\theta_0 := temp_0, \quad \theta_1 := temp_1\]

  • normalization: with multiple features, mean-normalize and scale the inputs first so that gradient descent converges faster; the general per-parameter update is then

\[\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}\]

(a minimal code sketch follows)
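A minimal NumPy sketch of batch gradient descent for linear regression with the squared-error cost; the data, learning rate, and iteration count are assumptions for illustration.

import numpy as np

# Toy data: y is roughly 4 + 3x (values assumed for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([4.1, 6.9, 10.2, 12.8, 16.1])
m = len(y)

X = np.column_stack([np.ones(m), x])      # add the intercept column for theta_0
theta = np.zeros(2)
alpha = 0.05                              # learning rate

for _ in range(2000):
    predictions = X @ theta               # h_theta(x) for every training example
    gradient = (X.T @ (predictions - y)) / m
    theta = theta - alpha * gradient      # simultaneous update of all theta_j

print(theta)                              # approximately [4, 3]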

4 Week 5

4.1 Neural Networks: Learning

5 Week 6

5.1 Advice for Applying Machine Learning

5.2 Machine Learning System Design

6 Week 7

6.1 Support Vector Machine

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.

6.1.1 Maximal Margin Classifier

6.1.2 Support Vector Classifiers

6.1.3 Support Vector Machines

from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example dataset (digits) and train/test split; the original notes left the data undefined.
digits = load_digits()
training_X, test_X, training_y, test_y = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

svm_model = svm.SVC(gamma=0.01, C=100.)
svm_model.fit(training_X, training_y)
predicts = svm_model.predict(test_X)
accuracy_score(test_y, predicts)

7 Week 8

7.1 Unsupervised Learning

7.2 Dimensionality Reduction

8 Week 9

8.1 Anomaly Detection

8.2 Recommender Systems

9 Week 10

9.1 Large Scale Machine Learning

10 Week 11

10.1 Application Examples

10.1.1 Natural language processing (NLP)

  • Automatic summarization
  • Automatic taxonomy construction
  • Dialog system
  • Grammar checker
  • Language recognition
    • Handwriting recognition
    • Optical character recognition
    • Speech recognition
  • Machine translation
  • Question answering
  • Speech synthesis
  • Text mining
  • Term frequency–inverse document frequency (tf–idf)
  • Text simplification

10.1.2 Pattern recognition

  • Facial recognition system
  • Handwriting recognition
  • Image recognition
  • Optical character recognition
  • Speech recognition

10.2 Week 1/2

10.2.1 Logistic Regression

10.2.2 Regularization

10.2.3 Classification

  1. Linear Discriminant Analysis
  2. Comparison

10.2.4 Decision tree and random forest

10.2.5 Boosting

10.2.6 XGBoost

10.2.7 SVM

10.2.8 Clustering

10.2.9 Bayesian Network

10.2.10 LDA

10.2.11 HMM

10.2.12 Neural Network

11 Bayesian network

12 Tree-Based Methods

12.1 Decision Trees

How to use entropy for classification: greedily choose the split that reduces entropy as fast as possible.

Information gain: the degree to which the information provided by a feature A reduces the uncertainty about the class X.

12.2 Bagging, Random Forests, Boosting

13 Unsupervised Learning

13.1 Principal Components Analysis

  • first principal component
  • second principal component

13.2 Clustering Methods

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to some predesignated criterion or criteria, while observations drawn from different clusters are dissimilar. Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated for example by internal compactness (similarity between members of the same cluster) and separation between different clusters. Other methods are based on estimated density and graph connectivity. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis.

13.2.1 K-Means Clustering

Example application: clustering stocks to find which group could possibly move up in the next trading period (a minimal sketch follows).
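A minimal scikit-learn sketch of this idea, clustering assets by their recent return vectors; the synthetic returns and the number of clusters are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic daily returns for 6 assets over 30 days (assumed for illustration).
rng = np.random.RandomState(0)
returns = rng.normal(loc=0.001, scale=0.02, size=(6, 30))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
group_labels = kmeans.fit_predict(returns)   # assets with similar return patterns share a label
print(group_labels)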

13.2.2 K-Nearest Neighbors(KNN)

13.2.3 Hierarchical Clustering

14 Resampling Methods

14.1 Cross-Validation

14.2 The Bootstrap

15 Google Natural Language Processing

16 Outline of Machine Learning

17 Scikit-learn API

17.1 basics of the API

  1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
In[6]: from sklearn.linear_model import LinearRegression
  2. Choose model hyperparameters by instantiating this class with desired values.
In[7]: model = LinearRegression(fit_intercept=True)
model
Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,
normalize=False)
  3. Arrange data into a features matrix and target vector following the discussion from before.

In[8]: X = x[:, np.newaxis]
X.shape
Out[8]: (50, 1)
  4. Fit the model to your data by calling the fit() method of the model instance.
In[9]: model.fit(X, y)
Out[9]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,
normalize=False)
In[10]: model.coef_
Out[10]: array([ 1.9776566])
In[11]: model.intercept_
Out[11]: -0.90331072553111635
  5. Apply the model to new data:
In[12]: xfit = np.linspace(-1, 11)
In[13]: Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)

• For supervised learning, we often predict labels for unknown data using the predict() method.
• For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
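For the unsupervised case, a minimal sketch using PCA's transform() method; the random data and number of components are assumptions for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.rand(50, 3)                 # 50 samples, 3 features (assumed for illustration)

pca = PCA(n_components=2)           # choose the model and its hyperparameters
pca.fit(X)                          # learn the principal components
X_2d = pca.transform(X)             # transform: project the data onto 2 components
print(X_2d.shape)                   # (50, 2)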

17.2 feature engineering

17.2.1 Imputation of Missing Data

The sophisticated approaches tend to be very application-specific, and we won’t dive into them here. For a baseline imputation approach, using the mean, median, or most frequent value, Scikit-Learn provides the Imputer class:

In[15]: from sklearn.preprocessing import Imputer
imp = Imputer(strategy='mean')
X2 = imp.fit_transform(X)
X2
Out[15]: array([[ 4.5, 0. , 3. ],
[ 3. , 7. , 9. ],
[ 3. , 5. , 2. ],
[ 4. , 5. , 6. ],
[ 8. , 8. , 1. ]])

17.2.2 Feature Pipelines

  1. Impute missing values using the mean
  2. Transform features to quadratic
  3. Fit a linear regression
To streamline this type of processing pipeline, Scikit-Learn provides a pipeline object,
which can be used as follows:
In[17]: from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, PolynomialFeatures
from sklearn.linear_model import LinearRegression
model = make_pipeline(Imputer(strategy='mean'),
PolynomialFeatures(degree=2),
LinearRegression())

This pipeline looks and acts like a standard Scikit-Learn object, and will apply all the specified steps to any input data.

In[18]: model.fit(X, y) # X with missing values, from above
print(y)
print(model.predict(X))
[14 16 -1 8 -5]
[ 14. 16. -1. 8. -5.]

18 Machine Learning Project Checklist for Portfolio Optimization

This checklist can guide you through your Machine Learning projects. There are eight main steps:

  1. Frame the problem and look at the big picture.
  2. Get the data.
  3. Explore the data to gain insights.
  4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
  5. Explore many different models and short-list the best ones.
  6. Fine-tune your models and combine them into a great solution.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.

18.1 Frame the Problem and Look at the Big Picture

  1. Define the objective in business terms.

Portfolio optimization is the process of choosing the proportions of various assets to be held in a portfolio, in such a way as to make the portfolio better than any other according to some criterion. The criterion will combine, directly or indirectly, considerations of the expected value of the portfolio's rate of return as well as of the return's dispersion and possibly other measures of financial risk.

  2. How will your solution be used?

Often, portfolio optimization takes place in two stages: optimizing weights of asset classes to hold, and optimizing weights of assets within the same asset class. An example of the former would be choosing the proportions placed in equities versus bonds, while an example of the latter would be choosing the proportions of the stock sub-portfolio placed in stocks X, Y, and Z. Equities and bonds have fundamentally different financial characteristics and have different systematic risk and hence can be viewed as separate asset classes; holding some of the portfolio in each class provides some diversification, and holding various specific assets within each class affords further diversification. By using such a two-step procedure one eliminates non-systematic risks both on the individual asset and the asset class level.

  3. What are the current solutions/workarounds (if any)?

Markowitz Modern Portfolio Theory, the Black-Litterman model, and Risk Parity.

  4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?

Reinforcement learning, trained offline.

  5. How should performance be measured?

Return (percentage change) and risk (standard deviation of the return). In addition to the traditional measure, standard deviation, or its square (variance), which are not robust risk measures, other measures include the Sortino ratio and CVaR (Conditional Value at Risk). A minimal code sketch of the basic measures follows.
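A minimal NumPy sketch of these performance measures for a fixed-weight portfolio; the synthetic returns, weights, and annualization factor are assumptions for illustration.

import numpy as np

# Synthetic daily returns for 3 assets over 252 trading days (assumed for illustration).
rng = np.random.RandomState(0)
asset_returns = rng.normal(loc=0.0004, scale=0.01, size=(252, 3))
weights = np.array([0.5, 0.3, 0.2])          # portfolio weights, summing to 1

portfolio_returns = asset_returns @ weights  # daily portfolio return series
annual_return = portfolio_returns.mean() * 252
annual_risk = portfolio_returns.std() * np.sqrt(252)   # standard deviation of the return
print(annual_return, annual_risk)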

  6. Is the performance measure aligned with the business objective?

Often portfolio optimization is done subject to constraints.

  7. What would be the minimum performance needed to reach the business objective?
  8. What are comparable problems? Can you reuse experience or tools?
  9. Is human expertise available?
  10. How would you solve the problem manually?
  11. List the assumptions you (or others) have made so far.
  12. Verify assumptions if possible.

18.2 Get the Data

Note: automate as much as possible so you can easily get fresh data.

  1. List the data you need and how much you need.
  2. Find and document where you can get that data.
  3. Check how much space it will take.
  4. Check legal obligations, and get authorization if necessary.
  5. Get access authorizations.
  6. Create a workspace (with enough storage space).
  7. Get the data.
  8. Convert the data to a format you can easily manipulate (without changing the data itself).
  9. Ensure sensitive information is deleted or protected (e.g., anonymized).
  10. Check the size and type of data (time series, sample, geographical, etc.).
  11. Sample a test set, put it aside, and never look at it (no data snooping!).

18.3 Explore the Data

Note: try to get insights from a field expert for these steps.

  1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).

  2. Create a Jupyter notebook to keep a record of your data exploration.
  3. Study each attribute and its characteristics:

• Name • Type (categorical, int/float, bounded/unbounded, text, structured, etc.) • % of missing values • Noisiness and type of noise (stochastic, outliers, rounding errors, etc.) • Possibly useful for the task? • Type of distribution (Gaussian, uniform, logarithmic, etc.)

  4. For supervised learning tasks, identify the target attribute(s).
  5. Visualize the data.
  6. Study the correlations between attributes.
  7. Study how you would solve the problem manually.
  8. Identify the promising transformations you may want to apply.
  9. Identify extra data that would be useful (go back to “Get the Data” above).
  10. Document what you have learned.

18.4 Prepare the Data

Notes:
  • Work on copies of the data (keep the original dataset intact).
  • Write functions for all data transformations you apply, for five reasons:
    • So you can easily prepare the data the next time you get a fresh dataset
    • So you can apply these transformations in future projects
    • To clean and prepare the test set
    • To clean and prepare new data instances once your solution is live
    • To make it easy to treat your preparation choices as hyperparameters

  1. Data cleaning:

• Fix or remove outliers (optional). • Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).

  2. Feature selection (optional):

• Drop the attributes that provide no useful information for the task.

  3. Feature engineering, where appropriate:

• Discretize continuous features. • Decompose features (e.g., categorical, date/time, etc.). • Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.). • Aggregate features into promising new features.

  4. Feature scaling: standardize or normalize features.

18.5 Short-List Promising Models

Notes: • If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests). • Once again, try to automate these steps as much as possible.

  1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
  2. Measure and compare their performance.

• For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds (see the cross-validation sketch after this list).

  3. Analyze the most significant variables for each algorithm.
  4. Analyze the types of errors the models make.

• What data would a human have used to avoid these errors?

  5. Have a quick round of feature selection and engineering.
  6. Have one or two more quick iterations of the five previous steps.
  7. Short-list the top three to five most promising models, preferring models that make different types of errors.
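A minimal scikit-learn sketch of the cross-validation step above; the dataset, model, and number of folds are assumptions for illustration.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()
model = RandomForestClassifier(n_estimators=100, random_state=0)

# N-fold cross-validation (N = 5 assumed); report mean and standard deviation across folds.
scores = cross_val_score(model, digits.data, digits.target, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())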

18.6 Fine-Tune the System

Notes: • You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning. • As always, automate what you can. See “Practical Bayesian Optimization of Machine Learning Algorithms,” J. Snoek, H. Larochelle, and R. Adams (2012).

  1. Fine-tune the hyperparameters using cross-validation.

• Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or with the median value? Or just drop the rows?). • Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams).

  2. Try ensemble methods. Combining your best models will often perform better than running them individually.
  3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

Don’t tweak your model after measuring the generalization error: you would just start overfitting the test set.

18.7 Present Your Solution

  1. Document what you have done.
  2. Create a nice presentation.

• Make sure you highlight the big picture first.

  3. Explain why your solution achieves the business objective.
  4. Don’t forget to present interesting points you noticed along the way.

• Describe what worked and what did not. • List your assumptions and your system’s limitations.

  5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).

18.8 Launch!

  1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
  2. Write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops.

• Beware of slow degradation too: models tend to “rot” as data evolves. • Measuring performance may require a human pipeline (e.g., via a crowdsourcing service). • Also monitor your inputs’ quality (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale). This is particularly important for online learning systems.

  3. Retrain your models on a regular basis on fresh data (automate as much as possible).

19 Machine Learning Project Checklist for a Natural Language Processing Knowledge Graph Recommendation Agent