
Author weiwu (victor.wuv@gmail.com)
Date 2018-07-16 19:30:44

1 Basic prerequisites

1.1 Maximum likelihood estimation

  • Summary

Maximum likelihood is a general and powerful technique for learning statistical models, i.e. fitting the parameters to data. The maximum likelihood parameters are the ones under which the observed data has the highest probability. It is widely used in practice, and techniques such as Bayesian parameter estimation are closely related to maximum likelihood.

  • Context

This concept has the prerequisites:

  • random variables
  • independent random variables (The data are generally assumed to be independent draws from a distribution.)
  • optimization problems (Maximum likelihood is formulated as an optimization problem.)
  • Gaussian distribution (Fitting a Gaussian distribution is an instructive example of maximum likelihood estimation.)

1.2 Terminology / jargon

  • Activation

An activation, or activation function, for a neural network is defined as the mapping of the input to the output via a non-linear transform function at each “node”, which is simply a locus of computation within the net. Each layer in a neural net consists of many nodes, and the number of nodes in a layer is known as its width.

Activation algorithms are the gates that determine, at each node in the net, whether and to what extent to transmit the signal the node has received from the previous layer. A combination of weights (coefficients) and biases works on the input data from the previous layer to determine whether that signal surpasses a given threshold and is deemed significant. Those weights and biases are slowly updated as the neural net minimizes its error; i.e. the nodes' activation levels change in the course of learning. Deeplearning4j includes activation functions such as sigmoid, relu, tanh and ELU. These activation functions allow neural networks to make complex boundary decisions for features at various levels of abstraction.

The hyperbolic tangent function tanh(z) = 2σ(2z) − 1:

The hyperbolic tangent function, like the logistic function, is S-shaped, continuous, and differentiable, but its output ranges from −1 to 1 (instead of 0 to 1 in the case of the logistic function), which tends to make each layer's output more or less normalized (i.e., centered around 0) at the beginning of training. This often helps speed up convergence.

The ReLU function ReLU(z) = max(0, z).

It is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent bounce around). However, in practice it works very well and has the advantage of being fast to compute. Most importantly, the fact that it does not have a maximum output value also helps reduce some issues during Gradient Descent.
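To make these definitions concrete, here is a minimal Java sketch of the three activation functions discussed above; the class and method names are ours, and the cross-check uses the tanh(z) = 2σ(2z) − 1 identity from the text.

public class Activations {
    // Logistic sigmoid: squashes z into (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Hyperbolic tangent: squashes z into (-1, 1), roughly zero-centered.
    static double tanh(double z) {
        return Math.tanh(z);
    }

    // Rectified linear unit: zero below the threshold, linear above it.
    static double relu(double z) {
        return Math.max(0.0, z);
    }

    public static void main(String[] args) {
        double z = 0.5;
        System.out.println("sigmoid(0.5) = " + sigmoid(z));
        // Cross-check the identity tanh(z) = 2*sigmoid(2z) - 1 from the text.
        System.out.println("tanh(0.5)    = " + tanh(z) + " vs " + (2 * sigmoid(2 * z) - 1));
        System.out.println("relu(-0.3)   = " + relu(-0.3) + ", relu(0.3) = " + relu(0.3));
    }
}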

  • Adadelta

Adadelta is an updater, or learning algorithm, related to gradient descent. Unlike SGD, which applies the same learning rate to all parameters of the network, Adadelta adapts the learning rate per parameter.

AdaDeltaUpdater in Deeplearning4j
ADADELTA: An Adaptive Learning Rate Method

  • Adagrad

Adagrad, short for adaptive gradient, is an updater or learning algorithm that adjusts the learning rate for each parameter in the net by monitoring the squared gradients in the course of learning. It is a substitute for SGD, and can be useful when processing sparse data.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

  • Adam

Adam (Gibson) co-created Deeplearning4j. :) Adam is also an updater, similar to rmsprop, which uses a running average of the gradient’s first and second moment plus a bias-correction term.

Adam: A Method for Stochastic Optimization

  • Affine Layer

Affine is a fancy word for a fully connected layer in a neural network. “Fully connected” means that all the nodes of one layer connect to all the nodes of the subsequent layer. A restricted Boltzmann machine, for example, is a fully connected layer. Convolutional networks use affine layers interspersed with both their namesake convolutional layers (which create feature maps based on convolutions) and downsampling layers, which throw out a lot of data and only keep the maximum value. “Affine” derives from the Latin affinis, which means bordering or connected with. Each connection, in an affine layer, is a passage whereby input is multiplied by a weight and added to a bias before it accumulates with all other inputs at a given node, the sum of which is then passed through an activation function: e.g. output = activation(weight*input+bias), or y = f(w*x+b).
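A minimal sketch of the affine computation output = activation(weight*input + bias) for one fully connected layer; the weights, inputs, and the choice of sigmoid are illustrative, not taken from any particular network.

public class AffineLayer {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // y = f(W x + b): every input feeds every output node (fully connected).
    static double[] forward(double[][] W, double[] x, double[] b) {
        double[] y = new double[b.length];
        for (int j = 0; j < b.length; j++) {
            double sum = b[j];
            for (int i = 0; i < x.length; i++) {
                sum += W[j][i] * x[i];   // weight * input, accumulated at node j
            }
            y[j] = sigmoid(sum);         // pass the accumulated sum through the activation
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] W = {{0.2, -0.5, 0.1}, {0.7, 0.3, -0.2}};  // 2 nodes, 3 inputs (made-up values)
        double[] x = {1.0, 0.5, -1.0};
        double[] b = {0.0, 0.1};
        double[] y = forward(W, x, b);
        System.out.println(y[0] + ", " + y[1]);
    }
}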

  • AlexNet

AlexNet is a deep convolutional network named after Alex Krizhevsky, a former student of Geoff Hinton's at the University of Toronto, now at Google. AlexNet was used to win ILSVRC 2012, and foretold a wave of deep convolutional networks that would set new records in image recognition. AlexNet is now a standard architecture: it contains five convolutional layers, three of which are followed by max-pooling (downsampling) layers, and two fully connected (affine) layers, all of which end in a softmax layer. Here is Deeplearning4j's implementation of AlexNet.

ImageNet Classification with Deep Convolutional Neural Networks

  • Attention Models

Attention models “attend” to specific parts of an image in sequence, one after another. By relying on a sequence of glances, they capture visual structure, much like the human eye is believed to function with foveation. This visual processing, which relies on a recurrent network to process sequential data, can be contrasted with other machine vision techniques that process a whole image in a single, forward pass.

DRAW: A Recurrent Neural Network For Image Generation.

An attention-based system can be described as having three components:

  1. A process that “reads” raw data (such as source words in a source sentence), and converts them into distributed representations, with one feature vector associated with each word position.

  2. A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.

  3. A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight).

  • Autoencoder

Autoencoders are at the heart of representation learning. They encode input, usually by compressing large vectors into smaller vectors that capture their most significant features; that is, they are useful for data compression (dimensionality reduction) as well as data reconstruction for unsupervised learning. A restricted Boltzmann machine is a type of autoencoder, and in fact, autoencoders come in many flavors, including Variational Autoencoders, Denoising Autoencoders and Sequence Autoencoders. Variational autoencoders have replaced RBMs in many labs because they produce more stable results. Denoising autoencoders provide a form of regularization by introducing Gaussian noise into the input, which the network learns to ignore in search of the true signal.

Auto-Encoding Variational Bayes
Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
Semi-supervised Sequence Learning

  • Backpropagation

To calculate the gradient that relates weights to error, that is, the error contribution of each neuron after a batch of data (in image recognition, multiple images) is processed, we use a technique known as backpropagation, which is also referred to as the backward pass of the network. It is used by the gradient descent optimization algorithm to adjust the weights of neurons via the gradient of the loss function. Backpropagation is a repeated application of the chain rule of calculus for partial derivatives. The first step is to calculate the derivatives of the objective function with respect to the output units; these are then propagated backward, from the output of the last hidden layer to its input, then from that input to the weights between the last hidden layer and the penultimate hidden layer, and so on toward the input layer.

Assumptions: Two assumptions must be made about the form of the error function. The first is that it can be written as an average \(E = \frac{1}{n}\sum_{x} E_x\) over error functions \(E_x\) for the \(n\) individual training examples \(x\). The reason for this assumption is that the backpropagation algorithm calculates the gradient of the error function for a single training example, which needs to be generalized to the overall error function. The second assumption is that it can be written as a function of the outputs from the neural network.
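To make the chain-rule bookkeeping concrete, here is a deliberately tiny sketch: a single sigmoid unit trained on one example with squared error. The values and learning rate are arbitrary; a real network repeats this calculation layer by layer, as described above.

public class TinyBackprop {
    public static void main(String[] args) {
        double x = 1.5, target = 0.0;   // one training example (arbitrary values)
        double w = 0.8, b = 0.1;        // initial weight and bias
        double lr = 0.5;                // learning rate

        for (int step = 0; step < 5; step++) {
            // Forward pass: y = sigmoid(w*x + b)
            double z = w * x + b;
            double y = 1.0 / (1.0 + Math.exp(-z));
            double error = 0.5 * (y - target) * (y - target);

            // Backward pass (chain rule):
            // dE/dy = (y - target), dy/dz = y*(1-y), dz/dw = x, dz/db = 1
            double dEdz = (y - target) * y * (1 - y);
            double dEdw = dEdz * x;
            double dEdb = dEdz;

            // Gradient-descent update
            w -= lr * dEdw;
            b -= lr * dEdb;
            System.out.printf("step %d  error %.4f%n", step, error);
        }
    }
}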


A special form of backpropagation is called backpropagation through time, or BPTT, which is specifically useful for recurrent networks analyzing text and time series. With BPTT, each time step of the RNN is the equivalent of a layer in a feed-forward network. To backpropagate over many time steps, BPTT can be truncated for the purpose of efficiency. Truncated BPTT limits the time steps over which error is propagated.

  • Batch Normalization

Batch Normalization does what it says: it normalizes mini-batches as they're fed into a neural-net layer. Batch normalization has two potential benefits: it can accelerate learning because it allows you to employ higher learning rates, and it also regularizes that learning.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Overview of mini-batch gradient descent (U. Toronto)

  • Bidirectional Recurrent Neural Networks

A Bidirectional RNN is composed of two RNNs that process data in opposite directions. One reads a given sequence from start to finish; the other reads it from finish to start. Bidirectional RNNs are employed in NLP for translation problems, among other use cases. Deeplearning4j’s implementation of bidirectional Graves LSTMs is here.

Bidirectional Recurrent Neural Networks

  • Binarization

The process of transforming data into a set of zeros and ones. An example would be thresholding a grayscale image, transforming its pixels from the 0-255 range to values of 0 or 1.

  • Boltzmann Machine

“A Boltzmann machine learns internal (not defined by the user) concepts that help to explain (that can generate) the observed data. These concepts are captured by random variables (called hidden units) that have a joint distribution (statistical dependencies) among themselves and with the data, and that allow the learner to capture highly non-linear and complex interactions between the parts (observed random variables) of any observed example (like the pixels in an image). You can also think of these higher-level factors or hidden units as another, more abstract, representation of the data. The Boltzmann machine is parametrized through simple two-way interactions between every pair of random variable involved (the observed ones as well as the hidden ones).” - Yoshua Bengio

  • Channel

Channel is a word used when speaking of convolutional networks. ConvNets treat color images as volumes; that is, an image has height, width and depth. The depth is the number of channels, which coincides with how the colors are encoded. RGB images have three channels, for red, green and blue respectively.

  • Class

Used in classification, a class refers to a label applied to a group of records sharing similar characteristics.

  • Confusion Matrix

Also known as an error matrix or contingency table. Confusion matrices allow you to see whether your algorithm is systematically confusing two labels, by contrasting your net's predictions against a benchmark.

  • Contrastive Divergence

“Contrastive divergence is a recipe for training undirected graphical models (a class of probabilistic models used in machine learning). It relies on an approximation of the gradient (a good direction of change for the parameters) of the log-likelihood (the basic criterion that most probabilistic learning algorithms try to optimize) based on a short Markov chain (a way to sample from probabilistic models) started at the last example seen. It has been popularized in the context of Restricted Boltzmann Machines (Hinton & Salakhutdinov, 2006, Science), the latter being the first and most popular building block for deep learning algorithms.” ~Yoshua Bengio

  • Convolutional Network (CNN)

Convolutional networks are deep neural networks that are currently the state of the art in image processing, setting new accuracy records every year on widely accepted benchmark contests like ImageNet.

From the Latin convolvere, “to convolve” means to roll together. For mathematical purposes, a convolution is the integral measuring how much two functions overlap as one passes over the other. Think of a convolution as a way of mixing two functions by multiplying them: a fancy form of multiplication.

Imagine a tall, narrow bell curve standing in the middle of a graph. The integral is the area under that curve. Imagine near it a second bell curve that is shorter and wider, drifting slowly from the left side of the graph to the right. The product of those two functions’ overlap at each point along the x-axis is their convolution. So in a sense, the two functions are being “rolled together.”

  • Cosine Similarity

It turns out two vectors are just 66% of a triangle, so let’s do a quick trig review.

Trigonometric functions like sine, cosine and tangent are ratios that use the lengths of the sides of a right triangle (opposite, adjacent and hypotenuse) to compute the shape's angles. By feeding the side lengths into those ratios (sine = opposite/hypotenuse, cosine = adjacent/hypotenuse, tangent = opposite/adjacent), we can also recover the angles at which those sides intersect. Remember SOH-CAH-TOA?

Differences between word vectors, as they swing around the origin like the arms of a clock, can be thought of as differences in degrees.

And similar to ancient navigators gauging the stars by a sextant, we will measure the angular distance between words using something called cosine similarity. You can think of words as points of light in a dark canopy, clustered together in constellations of meaning.

To find that distance knowing only the word vectors, we need the equation for vector dot multiplication (multiplying two vectors to produce a single, scalar value).

In Java, you can think of the formula to measure cosine similarity like this:

public static double cosineSimilarity(double[] vectorA, double[] vectorB) {
    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < vectorA.length; i++) {
        dotProduct += vectorA[i] * vectorB[i];  // accumulate the dot product
        normA += Math.pow(vectorA[i], 2);       // squared length of vectorA
        normB += Math.pow(vectorB[i], 2);       // squared length of vectorB
    }
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

Cosine is the angle attached to the origin, which makes it useful here. (We normalize the measurements so they come out as percentages, where 1 means that two vectors are equal, and 0 means they are perpendicular, bearing no relation to each other.)

  • Data Parallelism and Model Parallelism

Training a neural network on a very large dataset requires some form of parallelism, of which there are two types: data parallelism and model parallelism.

Let's say you have a very large image dataset of 1,000,000 faces. Those faces can be divided into batches of 10, and then 10 separate batches can be dispatched simultaneously to 10 different convolutional networks, so that 100 instances can be processed at once. The 10 different CNNs would then train on a batch, calculate the error on that batch, and update their parameters based on that error. Then, using parameter averaging, the 10 CNNs would update a central, master CNN that would take the average of their updated parameters. This process would repeat until the entire dataset has been exhausted. For more information, please see our page on iterative reduce.

Model parallelism is another way to accelerate neural net training on very large datasets. Here, instead of sending batches of faces to separate neural networks, let’s imagine a different kind of image: an enormous map of the earth. Model parallelism would divide that enormous map into regions, and it would park a separate CNN on each region, to train on only that area and no other. Then, as each enormous map was peeled off the dataset to train the neural networks, it would be broken up and different patches of it would be sent to train on separate CNNs. No parameter averaging necessary here.
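A toy sketch of the parameter-averaging step described above: each worker holds its own copy of the parameters after training on its batch, and the master copy is replaced by their element-wise average. The worker values are made up for illustration.

import java.util.Arrays;

public class ParameterAveraging {
    // Average the parameter vectors of all workers into a master vector.
    static double[] average(double[][] workerParams) {
        int n = workerParams[0].length;
        double[] master = new double[n];
        for (double[] params : workerParams) {
            for (int i = 0; i < n; i++) master[i] += params[i];
        }
        for (int i = 0; i < n; i++) master[i] /= workerParams.length;
        return master;
    }

    public static void main(String[] args) {
        // Three workers, each with its own updated copy of two parameters.
        double[][] workers = {{0.9, -0.2}, {1.1, -0.1}, {1.0, -0.3}};
        System.out.println(Arrays.toString(average(workers)));  // approximately [1.0, -0.2]
    }
}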

  • Data Science

Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference.

  • Deep-Belief Network (DBN)

A deep-belief network is a stack of restricted Boltzmann machines, each of which is itself a feed-forward autoencoder; the stack learns to reconstruct its input layer by layer, greedily. Pioneered by Geoff Hinton and crew. Because a DBN is deep, it learns a hierarchical representation of input. Because DBNs learn to reconstruct that data, they can be useful in unsupervised learning.

A fast learning algorithm for deep belief nets

  • Deep Learning

Deep Learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics.

Deep learning discovers intricate structure in large datasets by using the back-propagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about dramatic improvements in processing images, video, speech and audio, while recurrent nets have shone on sequential data such as text and speech.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep learning methods are representation learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. - LeCun, Bengio & Hinton, "Deep Learning" (Nature, 2015)

  • Distant supervision

Distant supervision (远监督) is closely related to weak supervision: the dataset's labels are unreliable (they may be incorrect, come from multiple conflicting sources, be incomplete, or cover only part of the data). Learning problems in which the supervision signal is incomplete or imprecise are collectively referred to as weakly supervised learning.

  • Distributed Representations

The Nupic community has a good explanation of distributed representations here. Other good explanations can be found on this Quora page.

  • Downpour Stochastic Gradient Descent

Downpour stochastic gradient descent is an asynchronous stochastic gradient descent procedure, employed by Google among others, that expands the scale and increases the speed of training deep-learning networks.

  • Dropout

Dropout is a regularization technique for neural networks (the dropout rate is a hyperparameter). Like all regularization techniques, its purpose is to prevent overfitting. Dropout randomly makes nodes in the neural network "drop out" by setting them to zero, which encourages the network to rely on other features that act as signals. That, in turn, creates more generalizable representations of data.

Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Recurrent Neural Network Regularization

  • DropConnect

DropConnect is a generalization of Dropout for regularizing large fully-connected layers within neural networks. Dropout sets a randomly selected subset of activations to zero at each layer. DropConnect, in contrast, sets a randomly selected subset of weights within the network to zero.

Regularization of Neural Networks using DropConnect

  • Euclidean space

In geometry, Euclidean space encompasses the two-dimensional Euclidean plane, the three-dimensional space of Euclidean geometry, and certain other spaces.

One way to think of the Euclidean plane is as a set of points satisfying certain relationships, expressible in terms of distance and angle.

  • Embedding

An embedding is a representation of input, or an encoding. For example, a neural word embedding is a vector that represents that word. The word is said to be embedded in vector space. Word2vec and GloVe are two techniques used to train word embeddings to predict a word’s context. Because an embedding is a form of representation learning, we can “embed” any data type, including sounds, images and time series.

  • Epoch vs. Iteration

In machine-learning parlance, an epoch is a complete pass through a given dataset. That is, by the end of one epoch, your neural network – be it a restricted Boltzmann machine, convolutional net or deep-belief network – will have been exposed to every record, or example, within the dataset once. Not to be confused with an iteration, which is simply one update of the neural net model's parameters. Many iterations can occur before an epoch is over. Epoch and iteration are only synonymous if you update your parameters once for each pass through the whole dataset; if you update using mini-batches, they mean different things. Say your data has 2 minibatches: A and B. .numIterations(3) performs training like AAABBB, while 3 epochs looks like ABABAB.

  • Extract, transform, load (ETL)

Data is loaded from disk or other sources into memory with the proper transforms such as binarization and normalization. Broadly, you can think of a data pipeline as the process of gathering data from disparate sources and locations, putting it into a form that your algorithms can learn from, and then placing it in a data structure that they can iterate through.

  • f1 Score

The f1 score is a number between zero and one that summarizes how well the network performed during training, with 1 being the best score and zero the worst. Roughly, it balances how many of the net's positive guesses are correct against how many of the actual positives it finds.

F1 = 2 * ((precision * recall) / (precision + recall))

Accuracy measures how often you get the right answer, but it can be misleading when classes are imbalanced. For example, if you have 100 fruit – 99 apples and 1 orange – and your model predicts that all 100 items are apples, then it is 99% accurate. But that model failed to identify the difference between apples and oranges. f1 scores help you judge whether a model is actually doing well at classifying when you have an imbalance in the categories you're trying to tag.

An f1 score is an average of both precision and recall. More specifically, it is a type of average called the harmonic mean, which tends to be less than the arithmetic or geometric means. Recall answers: “Given a positive example, how likely is the classifier going to detect it?” It is the ratio of true positives to the sum of true positives and false negatives.

Precision answers: “Given a positive prediction from the classifier, how likely is it to be correct ?” It is the ratio of true positives to the sum of true positives and false positives.

For f1 to be high, both recall and precision of the model have to be high.

  • Recall rate

An example: suppose we have 60 positive samples and 40 negative samples, and we want to retrieve all of the positives. The system returns 50 results, of which only 40 are true positives. Computing the metrics above: TP (positives predicted as positive) = 40; FN (positives predicted as negative) = 20; FP (negatives predicted as positive) = 10; TN (negatives predicted as negative) = 30. Accuracy = correct predictions / all predictions = (TP+TN)/(TP+FN+FP+TN) = 70%. Precision = TP/(TP+FP) = 80%. Recall = TP/(TP+FN) = 2/3.
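A small sketch that plugs the counts above (TP = 40, FN = 20, FP = 10, TN = 30) into the accuracy, precision, recall and f1 formulas.

public class F1Score {
    public static void main(String[] args) {
        double tp = 40, fn = 20, fp = 10, tn = 30;   // counts from the example above

        double accuracy  = (tp + tn) / (tp + tn + fp + fn);       // 0.70
        double precision = tp / (tp + fp);                        // 0.80
        double recall    = tp / (tp + fn);                        // ~0.667
        double f1        = 2 * (precision * recall) / (precision + recall);

        System.out.printf("accuracy %.2f precision %.2f recall %.2f f1 %.2f%n",
                accuracy, precision, recall, f1);
    }
}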

  • Feed-Forward Network

A neural network that takes the initial input and triggers the activation of each layer of the network successively, without circulating. Feed-forward nets contrast with recurrent and recursive nets in that feed-forward nets never let the output of one node circle back to the same or previous nodes.

  • Gaussian Distribution

A Gaussian, or normal, distribution is a continuous probability distribution that represents the probability that any given observation will occur at different points of a range. Visually, it resembles what's usually called a bell curve.

  • Global Vectors (GloVe)

GloVe is an alternative to Tomas Mikolov's word2vec algorithms for creating neural word embeddings. It was first presented at EMNLP 2014 by Jeffrey Pennington, Richard Socher and Christopher Manning of Stanford's NLP group. Deeplearning4j's implementation of GloVe is here.

GloVe: Global Vectors for Word Representation

  • Gradient Descent

The gradient is a derivative, which you will know from differential calculus. That is, it’s the ratio of the rate of change of a neural net’s parameters and the error it produces, as it learns how to reconstruct a dataset or make guesses about labels. The process of minimizing error is called gradient descent. Descending a gradient has two aspects: choosing the direction to step in (momentum) and choosing the size of the step (learning rate).

Since MLPs are, by construction, differentiable operators, they can be trained to minimise any differentiable objective function using gradient descent. The basic idea of gradient descent is to find the derivative of the objective function with respect to each of the network weights, then adjust the weights in the direction of the negative slope. -Graves
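A minimal sketch of gradient descent on a one-dimensional error surface, E(w) = (w − 3)^2, showing the two aspects named above: stepping in the direction of the negative slope, scaled by the learning rate. The target value and learning rate are arbitrary.

public class GradientDescent1D {
    public static void main(String[] args) {
        double w = 0.0;          // initial parameter
        double lr = 0.1;         // learning rate: the step size

        for (int i = 0; i < 50; i++) {
            double grad = 2 * (w - 3.0);  // dE/dw for E(w) = (w - 3)^2
            w -= lr * grad;               // step in the direction of the negative slope
        }
        System.out.println(w);            // converges toward 3.0
    }
}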

  • Generative model

In statistical classification, including machine learning, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):

Given an observable variable X and a target variable Y, a generative model is a statistical model of the joint probability distribution on X × Y, \(P(X, Y)\);

a discriminative model is a model of the conditional probability of the target Y given an observation x, symbolically \(P(Y \mid X = x)\); and

classifiers computed without using a probability model are also referred to loosely as "discriminative".

  • Gradient Clipping

Gradient Clipping is one way to solve the problem of exploding gradients. Exploding gradients arise in deep networks when the gradients associating weights with the net's error become too large. Exploding gradients are frequently encountered in RNNs dealing with long-term dependencies. One way to clip gradients is to normalize them when the L2 norm of a parameter vector surpasses a given threshold.

  • Epoch

An Epoch is a complete pass through all the training data. A neural network is trained until the error rate is acceptable, and this will often take multiple passes through the complete data set.

Note: an iteration is when parameters are updated, and is typically less than a full pass. For example, if the batch size is 100 and the data size is 1,000, an epoch will have 10 iterations. If trained for 30 epochs, there will be 300 iterations.

  • Graphical Models

A directed graphical model is another name for a Bayesian net, which represents the probabilistic relationships between the variables represented by its nodes.

  • Gated Recurrent Unit (GRU)

A GRU is a pared-down LSTM. GRUs rely on gating mechanisms to learn long-range dependencies while sidestepping the vanishing gradient problem. They include reset and update gates to decide when to update the GRU's memory at each time step.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

  • Highway Networks

Highway networks are an architecture introduced by Srivastava, Greff and Schmidhuber to let information flow unhindered across many layers on so-called "information highways." The architecture uses gating units that learn to regulate the flow of information through the net. Highway networks with hundreds of layers can be trained directly using SGD, which means they can support very deep architectures.

Highway Networks

  • Hyperplane

“A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts. What does that mean intuitively?

First think of the real line. Now pick a point. That point divides the real line into two parts (the part above that point, and the part below that point). The real line has 1 dimension, while the point has 0 dimensions. So a point is a hyperplane of the real line.

Now think of the two-dimensional plane. Now pick any line. That line divides the plane into two parts (“left” and “right” or maybe “above” and “below”). The plane has 2 dimensions, but the line has only one. So a line is a hyperplane of the 2d plane. Notice that if you pick a point, it doesn’t divide the 2d plane into two parts. So one point is not enough.

Now think of a 3d space. Now to divide the space into two parts, you need a plane. Your plane has two dimensions, your space has three. So a plane is the hyperplane for a 3d space.

OK, now we’ve run out of visual examples. But suppose you have a space of n dimensions. You can write down an equation describing an n-1 dimensional object that divides the n-dimensional space into two pieces. That’s a hyperplane.” -Quora

  • International Conference on Learning Representations

ICLR, pronounced “I-clear”. An important conference. See representation learning.

  • International Conference on Machine Learning

ICML, or the International Conference on Machine Learning, is a well-known and well-attended machine-learning conference.

  • ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

The ImageNet Large Scale Visual Recognition Challenge is the formal name for the ImageNet competition, a yearly contest held to solicit and evaluate the best techniques in image recognition. Deep convolutional architectures have driven error rates on the ImageNet competition from 30% to less than 5%, which means they now have human-level accuracy.

  • Iteration

An iteration is an update of weights after analysing a batch of input records. See Epoch for clarification.

  • Learning rate

Neural networks are often trained by gradient descent on the weights. This means at each iteration we use backpropagation to calculate the derivative of the loss function with respect to each weight and subtract it from that weight. However, if you actually try that, the weights will change far too much each iteration, which will make them “overcorrect” and the loss will actually increase/diverge. So in practice, people usually multiply each derivative by a small value called the “learning rate” before they subtract it from its corresponding weight.

\[W := W - \eta \nabla_W L\]

This η here is called the learning rate.

  • LeNet

LeNet, developed by Yann LeCun in the 1990s, was one of the earliest convolutional networks. Google's GoogLeNet architecture, named in homage to LeNet, is a deep convolutional network that won ILSVRC in 2014 and introduced techniques for paring down the size of a CNN, thus increasing computational efficiency.

Going Deeper with Convolutions

  • Long Short-Term Memory Units (LSTM)

LSTMs are a form of recurrent neural network invented in the 1990s by Sepp Hochreiter and Juergen Schmidhuber, and now widely used for image, sound and time series analysis, because they help solve the vanishing gradient problem by using memory gates. Alex Graves made significant improvements to the LSTM with what is now known as the Graves LSTM, which Deeplearning4j implements here.

  • Latent Semantic Indexing (LSI)

LSI (also known as Latent Semantic Analysis, LSA) learns latent topics by performing a matrix decomposition (SVD) on the term-document matrix.

  • LDA

LDA (latent Dirichlet allocation) is a generative probabilistic model that assumes a Dirichlet prior over the latent topics.

  • Log-Likelihood

Log likelihood is related to the statistical idea of the likelihood function. Likelihood is a function of the parameters of a statistical model. “The probability of some observed outcomes given a set of parameter values is referred to as the likelihood of the set of parameter values given the observed outcomes.”

  • Maximum Likelihood Estimation

"Say you have a coin and you're not sure it's "fair." So you want to estimate the "true" probability it will come up heads. Call this probability P, and code the outcome of a coin flip as 1 if it's heads and 0 if it's tails. You flip the coin four times and get 1, 0, 0, 0 (i.e., 1 heads and 3 tails). What is the likelihood that you would get these outcomes, given P? Well, the probability of heads is P, as we defined it above. That means the probability of tails is (1 - P). So the probability of 1 heads and 3 tails is P * (1 - P)^3 [Edit: We call this the "likelihood" of the data]. If we "guess" that the coin is fair, that's saying P = 0.5, so the likelihood of the data is L = .5 * (1 - .5)^3 = .0625. What if we guess that P = 0.45? Then L = .45 * (1 - .45)^3 = ~.075. So P = 0.45 is actually a better estimate than P = 0.5, because the data are "more likely" to have occurred if P = 0.45 than if P = 0.5. At P = 0.4, the likelihood is 0.4 * (1 - 0.4)^3 = .0864. At P = 0.35, the likelihood is 0.35 * (1 - 0.35)^3 = .096. In this case, it turns out that the value of P that maximizes the likelihood is P = 0.25. So that's our "maximum likelihood" estimate for P. In practice, max likelihood is harder to estimate than this (with predictors and various assumptions about the distribution of the data and error terms), but that's the basic concept behind it." –u/jacknbox

So in a sense, probability is treated as an unseen, internal property of the data. A parameter. And likelihood is a measure of how well the outcomes recorded in the data match our hypothesis about their probability; i.e. our theory about how the data is produced. The better our theory of the data’s probability, the higher the likelihood of a given set of outcomes.
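A sketch that reproduces the coin-flip numbers quoted above: the likelihood L(P) = P(1 − P)^3 of one head and three tails, evaluated on a grid of candidate values of P.

public class CoinMLE {
    public static void main(String[] args) {
        double bestP = 0, bestL = -1;
        for (double p = 0.0; p <= 1.0; p += 0.01) {
            // Likelihood of 1 head and 3 tails given heads-probability p
            double likelihood = p * Math.pow(1 - p, 3);
            if (likelihood > bestL) { bestL = likelihood; bestP = p; }
        }
        System.out.printf("MLE estimate P = %.2f, likelihood = %.4f%n", bestP, bestL);
        // Prints P = 0.25, matching the worked example (and the 1-in-4 observed frequency of heads).
    }
}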

  • Model

In neural networks, the model is the collection of weights and biases that transform input into output. A neural network is a set of algorithms that update models such that the models guess with less error as they learn. A model is a symbolic, logical or mathematical machine whose purpose is to deduce output from input. If a model's assumptions are correct, then one must necessarily believe its conclusions. Neural networks produce trained models that can be deployed to process, classify, cluster and make predictions about data.

  • MNIST

MNIST is the “hello world” of deep-learning datasets. Everyone uses MNIST to test their neural networks, just to see if the net actually works at all. MNIST contains 60,000 training examples and 10,000 test examples of the handwritten numerals 0-9. These images are 28x28 pixels, which means they require 784 nodes on the first input layer of a neural network. MNIST is available for download here.

  • Model Score

As your model trains, the goal is to improve its "score," that is, to reduce the overall error rate. The webui will present a graph of the score for each iteration. For text-based console output of the score as the model trains, you would use ScoreIterationListener.

  • Nesterov’s Momentum

Momentum, of which Nesterov's momentum is a well-known variant, influences the speed of learning. It causes the model to converge faster to a point of minimal error. Momentum adjusts the size of the next step, the weight update, based on the previous step's gradient; that is, it accumulates the gradient's history. Before each new step, a provisional gradient is calculated by taking partial derivatives from the model, and the hyperparameters are applied to it to produce a new gradient. Momentum influences the gradient your model uses for the next step.

Nesterov's Momentum Updater in Deeplearning4j

  • Multilayer Perceptron

MLPs are perhaps the oldest form of deep neural network. They consist of multiple, fully connected feedforward layers. Examples of Deeplearning4j’s multilayer perceptrons can be seen here.

  • Neural Machine Translation

Neural machine translation maps one language to another using neural networks. Typically, recurrent neural networks are used to ingest a sequence from the input language and output a sequence in the target language.

Sequence to Sequence Learning with Neural Networks

  • Neural Network architecture

The two main families are feed-forward networks and recurrent / recursive networks. Feed-forward networks include networks with fully connected layers, such as the multilayer perceptron, as well as networks with convolutional and pooling layers. All of these networks act as classifiers, but each with different strengths.

  • Noise-Contrastive Estimations (NCE)

Noise-contrastive estimation offers a balance of computational and statistical efficiency. It is used to train classifiers with many classes in the output layer. It replaces the softmax probability density function with an approximation of the maximum likelihood estimator that is cheaper to compute.

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
Learning word embeddings efficiently with noise-contrastive estimation

  • Nonlinear Transform Function

A function that maps input on a nonlinear scale such as sigmoid or tanh. By definition, a nonlinear function’s output is not directly proportional to its input.

  • Normalization

The process of transforming the data to span a range from 0 to 1.

  • Object-Oriented Programming (OOP)

While deep learning and object-oriented programming don't necessarily go together, Deeplearning4j is written in Java following the principles of OOP. In object-oriented programming, you create so-called objects, which are generally abstract nouns representing a part in a larger symbolic machine (e.g. in Deeplearning4j, the object class DataSetIterator traverses across datasets and feeds parts of those datasets into another process, iteratively, piece by piece).

DataSetIterator is actually the name of a class of object. In any particular object-oriented program, you would create a particular instance of that general class, calling it, say, 'iter', like this:

DataSetIterator iter = ...;   // instantiate a concrete implementation here

Every object is really just a data structure that combines fields containing data and methods that act on the data in those fields.

The way you talk about those fields and methods is with the dot operator ., and parentheses () that contain parameters. For example, if you wrote iter.next(5), then you’d be telling the DataSetIterator to go across a dataset processing 5 instances of that data (say 5 images or records) at a time, where next is the method you call, and 5 is the parameter you pass into it.

You can learn more about DataSetIterator and other classes in Deeplearning4j in our Javadoc.

  • Objective Function

Also called a loss function or a cost function, an objective function defines what success looks like when an algorithm learns. It is a measure of the difference between a neural net’s guess and the ground truth; that is, the error. Measuring that error is a precondition to updating the neural net in such a way that its guesses generate less error. The error resulting from the loss function is fed into backpropagation in order to update the weights and biases that process input in the neural network.

  • One-Hot Encoding

Used in classification and bag of words. The label for each example is all 0s, except for a 1 at the index of the actual class to which the example belongs. For BOW, the one represents the word encountered.

Below is an example of one-hot encoding for the words of the phrase "The quick brown fox".
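A minimal Java sketch of that encoding; the vocabulary ordering and class name are ours, chosen for illustration.

import java.util.Arrays;
import java.util.List;

public class OneHot {
    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("The", "quick", "brown", "fox");

        for (String word : vocab) {
            int[] encoding = new int[vocab.size()];
            encoding[vocab.indexOf(word)] = 1;   // 1 at the word's index, 0 everywhere else
            System.out.println(word + " -> " + Arrays.toString(encoding));
        }
    }
}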

  • Pooling

Pooling, max pooling and average pooling are terms that refer to downsampling or subsampling within a convolutional network. Downsampling is a way of reducing the amount of data flowing through the network, and therefore decreasing the computational cost of the network. Average pooling takes the average of several values. Max pooling takes the greatest of several values. Max pooling is currently the preferred type of downsampling layer in convolutional networks.
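A minimal sketch of 2x2 max pooling and average pooling over a small 4x4 feature map; the input values are arbitrary.

public class Pooling {
    // Downsample with a 2x2 window and stride 2, keeping either the max or the mean.
    static double[][] pool(double[][] in, boolean max) {
        int h = in.length / 2, w = in[0].length / 2;
        double[][] out = new double[h][w];
        for (int i = 0; i < h; i++) {
            for (int j = 0; j < w; j++) {
                double a = in[2 * i][2 * j],     b = in[2 * i][2 * j + 1];
                double c = in[2 * i + 1][2 * j], d = in[2 * i + 1][2 * j + 1];
                out[i][j] = max ? Math.max(Math.max(a, b), Math.max(c, d))
                                : (a + b + c + d) / 4.0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] fm = {{1, 3, 2, 0}, {4, 2, 1, 1}, {0, 1, 5, 6}, {2, 2, 7, 8}};
        System.out.println(java.util.Arrays.deepToString(pool(fm, true)));   // max pooling
        System.out.println(java.util.Arrays.deepToString(pool(fm, false)));  // average pooling
    }
}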

  • Probability Density

Probability densities are used in unsupervised learning, with algorithms such as autoencoders, VAEs and GANs.

“A probability density essentially says “for a given variable (e.g. radius) what, at that particular value, is the likelihood of encountering an event or an object (e.g. an electron)?” So if I’m at the nucleus of a atom and I move to, say, one Angstrom away, at one Angstrom there is a certain likelihood I will spot an electron. But we like to not just ask for the probability at one point; we’d sometimes like to find the probability for a range of points: What is the probability of finding an electron between the nucleus and one Angstrom, for example. So we add up (“integrate”) the probability from zero to one Angstrom. For the sake of convenience, we sometimes employ “normalization”; that is, we require that adding up all the probabilities over every possible value will give us 1.00000000 (etc).” –u/beigebox

  • Probability Distribution

“A probability distribution is a mathematical function and/or graph that tells us how likely something is to happen.

So, for example, if you're rolling two dice and you want to find the likelihood of each possible number you can get, you could make a chart that looks like this. As you can see, you're most likely to get a 7, then a 6, then an 8, and so on. The numbers on the left are the percent of the time where you'll get that value, and the ones on the right are a fraction (they mean the same thing, just different forms of the same number). The way that you use the distribution to find the likelihood of each outcome is this:

There are 36 possible ways for the two dice to land. There are 6 combinations that get you 7, 5 that get you 6/8, 4 that get you 5/9, and so on. So, the likelihood of each one happening is the number of possible combinations that get you that number divided by the total number of possible combinations. For 7, it would be 6/36, or 1/6, which you’ll notice is the same as what we see in the graph. For 8, it’s 5/36, etc. etc.

The key thing to note here is that the sum of all of the probabilities will equal 1 (or, 100%). That’s really important, because it’s absolutely essential that there be a result of rolling the two die every time. If all the percentages added up to 90%, what the heck is happening that last 10% of the time?

So, for more complex probability distributions, the way that the distribution is generated is more involved, but the way you read it is the same. If, for example, you see a distribution that looks like this, you know that you’re going to get a value of μ 40% (corresponding to .4 on the left side) of the time whenever you do whatever the experiment or test associated with that distribution.

The percentages in the shaded areas are also important. Just like earlier when I said that the sum of all the probabilities has to equal 1 or 100%, the area under the curve of a probability distribution has to equal 1, too. You don’t need to know why that is (it involves calculus), but it’s worth mentioning. You can see that the graph I linked is actually helpfully labeled; the reason they do that is to show you that you what percentage of the time you’re going to end up somewhere in that area.

So, for example, about 68% of the time, you’ll end up between -1σ and 1σ.” –u/corpuscle634

  • Reconstruction Entropy

After applying Gaussian noise, a kind of statistical white noise, to the data, this objective function punishes the network for any result that is not closer to the original input. That signal prompts the network to learn different features in an attempt to reconstruct the input better and minimize error.

  • Rectified Linear Units

Rectified linear units, or ReLU, are a non-linear activation function widely applied in neural networks because they deal well with the vanishing gradient problem. They can be expressed as f(x) = max(0, x), where activation is set to zero if the output does not surpass a minimum threshold, and activation increases linearly above that threshold.

Rectifier Nonlinearities Improve Neural Network Acoustic Models
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Rectified Linear Units Improve Restricted Boltzmann Machines
Incorporating Second-Order Functional Knowledge for Better Option Pricing

  • Recurrent Neural Networks

"A multilayer perceptron (MLP) can only map from input to output vectors, whereas an RNN can in principle map from the entire history of previous inputs to each output. Indeed, the equivalent result to the universal approximation theory for MLPs is that an RNN with a sufficient number of hidden units can approximate any measurable sequence-to-sequence mapping to arbitrary accuracy (Hammer, 2000). The key point is that the recurrent connections allow a 'memory' of previous inputs to persist in the network's internal state, which can then be used to influence the network output. The forward pass of an RNN is the same as that of an MLP with a single hidden layer, except that activations arrive at the hidden layer from both the current external input and the hidden layer activations one step back in time." -Graves

  • Recursive Neural Networks

Recursive neural networks learn data with structural hierarchies, such as text arranged grammatically, much like recurrent neural networks learn data structured by its occurrence in time. Their chief use is in natural-language processing, and they are associated with Richard Socher of Stanford's NLP lab.

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

  • Reinforcement Learning

Reinforcement learning is a branch of machine learning that is goal oriented; that is, reinforcement learning algorithms have as their objective to maximize a reward, often over the course of many decisions. Unlike supervised deep learning, reinforcement learning does not learn from explicitly labeled examples; it learns from reward signals.

  • Representation Learning

Representation learning is learning the best representation of input. A vector, for example, can “represent” an image. Training a neural network will adjust the vector’s elements to represent the image better, or lead to better guesses when a neural network is fed the image. The neural net might train to guess the image’s name, for instance. Deep learning means that several layers of representations are stacked atop one another, and those representations are increasingly abstract; i.e. the initial, low-level representations are granular, and may represent pixels, while the higher representations will stand for combinations of pixels, and then combinations of combinations, and so forth.

  • Residual Networks (ResNet)

Microsoft Research used deep Residual Networks to win ImageNet in 2015. ResNets create "shortcuts" across several layers (deep ResNets have more than 150 layers), allowing the net to learn so-called residual mappings. ResNets are similar to networks with highway layers, although the ResNet shortcuts are data-independent rather than gated. Microsoft Research created ResNets by generating different deep networks automatically and relying on hyperparameter optimization.

Deep Residual Learning for Image Recognition

  • Restricted Boltzmann Machine (RBM)

Restricted Boltzmann machines are Boltzmann machines that are constrained to feed input forward symmetrically, which means all the nodes of one layer must connect to all the nodes of the subsequent layer. Stacked RBMs are known as a deep-belief network, and are used to learn how to reconstruct data layer by layer. Introduced by Geoff Hinton, RBMs were partially responsible for the renewed interest in deep learning that began circa 2006. In many labs, they have been replaced with more stable layers such as Variational Autoencoders.

A Practical Guide to Training Restricted Boltzmann Machines

  • RMSProp

RMSProp is an optimization algorithm like Adagrad. In contrast to Adagrad, it relies on a decay term to prevent the learning rate from decreasing too rapidly.

Optimization Algorithms (Stanford) An overview of gradient descent optimization algorithms

  • Score

Measurement of the overall error rate of the model. The score of the model will be displayed graphically in the webui, or it can be displayed in the console by using ScoreIterationListener.

  • Serialization

Serialization is how you translate data structures or object state into storable formats. Deeplearning4j’s nets are serialized, which means they can operate on devices with limited memory.

  • Skipgram

The prerequisite to a definition of skipgrams is one of ngrams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. A unigram represents one “item,” a bigram two, a trigram three and so forth. Skipgrams are ngrams in which the items are not necessarily contiguous. This can be illustrated best with a few examples. Skipping is a form of noise, in the sense of noising and denoising, which allows neural nets to better generalize their extraction of features. See how skipgrams are implemented in Word2vec.

  • Softmax

Softmax is a function used as the output layer of a neural network that classifies input. It converts vectors into class probabilities. Softmax normalizes the vector of scores by first exponentiating and then dividing by a constant.

A Scalable Hierarchical Distributed Language Model
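A minimal sketch of the softmax computation described above: exponentiate each score, then divide by the sum. Subtracting the maximum score first is a common numerical-stability refinement assumed here, not something stated in the text.

public class Softmax {
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);   // for numerical stability

        double sum = 0.0;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);   // exponentiate each score
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;  // normalize to probabilities
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(softmax(new double[]{2.0, 1.0, 0.1})));
        // The outputs are positive and sum to 1, so they can be read as class probabilities.
    }
}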

  • Stochastic Gradient Descent

Stochastic Gradient Descent is the variant of gradient descent most commonly used to minimize the loss function during network training; rather than computing the gradient over the entire dataset, it estimates it from randomly sampled examples or mini-batches.

Stochastic is simply a synonym for “random.” A stochastic process is a process that involves a random variable, such as randomly initialized weights. Stochastic derives from the Greek word stochazesthai, “to guess or aim at”. Stochastic processes describe the evolution of, say, a random set of variables, and as such, they involve some indeterminacy – quite the opposite of having a precisely predicted processes that are deterministic, and have just one outcome.

The stochastic element of a learning process is a form of search. Random weights represent a hypothesis, an attempt, or a guess that one tests. The results of that search are recorded in the form of a weight adjustment, which effectively shrinks the search space as the parameters move toward a position of less error.

Neural-network gradients are calculated using backpropagation. SGD is usually used with minibatches, such that parameters are updated based on the average error generated by the instances of a whole batch.

  • Support Vector Machine

While support-vector machines are not neural networks, they are an important algorithm that deserves explanation:

An SVM is just trying to draw a line through your training points. So it's just like regular old linear regression except for the following three details: (1) there is an epsilon parameter that means "If the line fits a point to within epsilon then that's good enough; stop trying to fit it and worry about fitting other points." (2) there is a C parameter and the smaller you make it the more you are telling it to find "non-wiggly lines". So if you run SVR and get some crazy wiggly output that's obviously not right you can often make C smaller and it will stop being crazy. And finally (3) when there are outliers (e.g. bad points that will never fit your line) in your data they will only mess up your result a little bit. This is because SVR only gets upset about outliers in proportion to how far away they are from the line it wants to fit. This is contrasted with normal linear regression which gets upset in proportion to the square of the distance from the line. Regular linear regression worries too much about these bad points. TL;DR: SVR is trying to draw a line that gets within epsilon of all the points. Some points are bad and can't be made to get within epsilon and SVR doesn't get too upset about them whereas other regression methods flip out. Reddit

  • Tensors

Here is an example of tensor along dimension (TAD):

  • Vanishing Gradient Problem

The vanishing gradient problem is a challenge that confronts backpropagation over many layers. Backpropagation establishes the relationship between a given weight and the error of a neural network. It does so through the chain rule of calculus, calculating how the change in a given weight along a gradient affects the change in error. However, in very deep neural networks, the gradient that relates the weight change to the error change can become very small. So small that updates in the net's parameters hardly change the net's guesses and error; so small, in fact, that it is difficult to know in which direction the weight should be adjusted to diminish error. Non-linear activation functions such as sigmoid and tanh make the vanishing gradient problem particularly difficult, because the activation function tapers off at both ends. This has led to the widespread adoption of rectified linear units (ReLU) for activations in deep nets. It was in seeking to solve the vanishing gradient problem that Sepp Hochreiter and Juergen Schmidhuber invented a form of recurrent network called an LSTM in the 1990s. The inverse of the vanishing gradient problem, in which the gradient is impossibly small, is the exploding gradient problem, in which the gradient is impossibly large (i.e. changing a weight has too much impact on the error).

On the difficulty of training recurrent neural networks

  • Transfer Learning

Transfer learning is when a system can recognize and apply knowledge and skills learned in previous domains or tasks to novel domains or tasks. That is, if a model is trained on image data to recognize one set of categories, transfer learning applies if that same model is capable, with minimal additional training, of recognizing a different set of categories. For example, trained on 1,000 celebrity faces, a transfer learning model can be taught to recognize members of your family by swapping in another output layer with the nodes "mom", "dad", "elder brother", "younger sister" and training that output layer on the new classifications.

  • Vector

Word2vec and other neural networks represent input as vectors.

A vector is a data structure with at least two components, as opposed to a scalar, which has just one. For example, a vector can represent velocity, an idea that combines speed and direction: wind velocity = (50mph, 35 degrees North East). A scalar, on the other hand, can represent something with one value like temperature or height: 50 degrees Celsius, 180 centimeters.

Therefore, we can represent two-dimensional vectors as arrows on an x-y graph, with the coordinates x and y each representing one of the vector’s values.

Two vectors can relate to one another mathematically, and similarities between them (and therefore between anything you can vectorize, including words) can be measured with precision.

As you can see, these vectors differ from one another in both their length, or magnitude, and in their angle, or direction. The angle is what concerns us here.

  • VGG

VGG is a deep convolutional architecture that achieved top results in the 2014 ImageNet competition. A VGG architecture is composed of 16–19 weight layers and uses small (3x3) convolutional filters. Deeplearning4j's implementations of two VGG architectures are here.

Very Deep Convolutional Networks for Large-Scale Image Recognition

  • Word2vec

Tomas Mikolov’s neural networks, known as Word2vec, have become widely used because they help produce state-of-the-art word embeddings. Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned. Deeplearning4j implements a distributed form of Word2vec for Java and Scala, which works on Spark with GPUs.

  • Weight decay

When training neural networks, it is common to use "weight decay," where after each update, the weights are multiplied by a factor slightly less than 1. This prevents the weights from growing too large, and can be seen as gradient descent on a quadratic regularization term.

Weight decay is an example of a regularization method.
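A minimal sketch of the multiplicative weight decay described above: after each gradient step the weight is shrunk by a factor slightly less than 1. The loss, learning rate and decay coefficient are made up for illustration.

public class WeightDecay {
    public static void main(String[] args) {
        double w = 2.0;                  // a single weight, for illustration
        double lr = 0.1, lambda = 0.01;  // learning rate and decay coefficient

        for (int step = 0; step < 3; step++) {
            double grad = gradientOfLoss(w);    // hypothetical gradient of the data loss
            w -= lr * grad;                     // ordinary gradient step
            w *= (1.0 - lr * lambda);           // weight decay: shrink the weight toward zero
        }
        System.out.println(w);
    }

    // Stand-in for the real loss gradient; here we pretend the loss is (w - 1)^2.
    static double gradientOfLoss(double w) {
        return 2 * (w - 1.0);
    }
}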

  • The \(L_2\) norm of the weights isn't necessarily a good regularizer for neural nets. Some more principled alternatives include:
  • Tikhonov regularization, which rewards invariance to noise in the inputs (go to concept)
  • Tangent propagation, which rewards invariance to irrelevant transformations of the inputs such as translation and scaling (go to concept)

Early stopping is another strategy to prevent overfitting in neural nets.

  • Xavier Initialization

The Xavier initialization is based on the work of Xavier Glorot and Yoshua Bengio in their paper “Understanding the difficulty of training deep feedforward neural networks.” An explanation can be found here. Weights should be initialized in a way that promotes “learning”. The wrong weight initialization will make gradients too large or too small, and make it difficult to update the weights. Small weights lead to small activations, and large weights lead to large ones. Xavier weight initialization considers the distribution of output activations with regard to input activations. Its purpose is to maintain the same distribution of activations, so they aren’t too small (mean zero but with small variance) or too large (mean zero but with large variance). DL4J’s implementation of Xavier weight initialization aligns with the Glorot and Bengio paper: Nd4j.randn(order, shape).muli(FastMath.sqrt(2.0 / (fanIn + fanOut))), where fanIn(k) is the number of units sending input to k, and fanOut(k) is the number of units receiving output from k.
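
A numpy sketch of the same Glorot/Xavier scheme (the DL4J line above uses Nd4j); fan_in and fan_out are the layer's input and output sizes:

import numpy as np

def xavier_init(fan_in, fan_out, rng):
    scale = np.sqrt(2.0 / (fan_in + fan_out))      # same factor as in the Nd4j call above
    return rng.standard_normal((fan_in, fan_out)) * scale

rng = np.random.default_rng(0)
W = xavier_init(256, 128, rng)
print(W.std())   # roughly sqrt(2 / (256 + 128))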

2 Deep learning

Deep learning is a branch of machine learning: a family of algorithms that attempt to model high-level abstractions in data using multiple processing layers composed of complex structures or multiple nonlinear transformations. Deep learning is an approach to machine learning based on learning representations of the data. An observation (for example, an image) can be represented in many ways, such as a vector of per-pixel intensity values, or more abstractly as a set of edges or regions of particular shapes, and some representations make it easier to learn a task (for example, face recognition or facial expression recognition) from examples. A benefit of deep learning is that it replaces hand-crafted features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction. The goal of representation learning is to find better representations and to build better models that learn these representations from large-scale unlabeled data. The representations are loosely inspired by advances in neuroscience and by our understanding of information processing and communication patterns in the nervous system, such as neural coding, which tries to characterize the relationship between stimuli and neuronal responses, and the relationships among the electrical activity of neurons in the brain.

2.1 Definitions

Deep learning is a class of machine learning algorithms that:

  • use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input.
  • learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis, face recognition or facial expression recognition) manners.
  • learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts.
  • use some form of gradient descent for training via backpropagation.

Layers that have been used in deep learning include hidden layers of an artificial neural network and sets of propositional formulas.

2.2 why deep?

There are functions you can compute with a small L-layer deep neural network that shallower networks require exponentially more hidden units to compute.

For example, consider the parity function \[y = x_1 \oplus x_2 \oplus \cdots \oplus x_n.\] A deep neural network can compute it with relatively few hidden units arranged in \(O(\log n)\) layers, whereas a shallow (single-hidden-layer) network needs on the order of \(O(2^n)\) hidden units.
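
A toy sketch of the counting argument: computing the parity as a balanced tree of pairwise XORs uses about n gates in roughly log2(n) layers, whereas a one-hidden-layer network essentially has to account for all 2^n input patterns.

from functools import reduce
import operator

def parity_tree(bits):
    # reduce pairwise, layer by layer: the number of layers grows like log2(n)
    layer, depth = list(bits), 0
    while len(layer) > 1:
        leftover = [layer[-1]] if len(layer) % 2 else []
        layer = [a ^ b for a, b in zip(layer[0::2], layer[1::2])] + leftover
        depth += 1
    return layer[0], depth

bits = [1, 0, 1, 1, 0, 1, 0, 1]
print(parity_tree(bits))           # (parity, number of XOR layers) -> (1, 3)
print(reduce(operator.xor, bits))  # same parity, computed flat, as a check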

3 How to learn

  • Step 1: Learn the foundations of machine learning
  • Step 2: Go deep into deep learning
  • Step 3: Choose one area and go further into it
    • Computer Vision
    • Natural Language Processing (NLP)
    • Memory networks (RNN-LSTM)
    • Deep reinforcement learning (DRL)
    • Generative models (GAN)
  • Step 4: Build projects

4 NLP

4.1 Application: Where can DL be applied to NLP tasks? (DL algorithms and their NLP usage)

4.1.1 Neural Network (NN): feed-forward propagation

  • POS, NER, Chunking
  • Entity and Intent Extraction

4.1.2 Recurrent Neural Networks (RNN)

  • Language Modeling and Generating Text
  • Machine Translation
  • Question Answering System
  • Image Captioning
  • Generating Image Descriptions

4.1.3 Recursive Neural Networks

  • Parsing Sentences
  • Sentiment Analysis
  • Paraphrase Detection
  • Relation Classification
  • Object Detection

4.1.4 Convolutional Neural Network (CNN)

  • Sentence / Text Classification
  • Relation Extraction and Classification
  • Sentiment classification
  • Spam Detection or Topic Categorization
  • Classification of Search Queries
  • Semantic relation extraction

5 Critical questions

How: an overview picture. Why: assumptions, hypotheses (formulas), representations, why use this, why it works, cons, boundaries. Examples, comparisons, results, conclusion.

5.0.1 Traditional machine learning interview topics:

1. The meaning of bias and variance; in the context of ensemble methods, which method reduces bias and which reduces variance? https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/ 2. Differences and connections between LR and SVM. SVM: a non-probabilistic binary linear classifier that can perform nonlinear classification effectively via kernel methods. Try to maximize the margin between the closest support vectors geometrically. Instead of assuming a probabilistic model, we're trying to find a particular optimal separating hyperplane, where we define "optimality" in the context of the support vectors.

LR:Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution.

(figure: lr_vs_svm.jpg)

3. Differences and connections between GBDT and AdaBoost.

4. Derive SVM by hand; SVM may be small, but it contains all the essential pieces.

5. Differences and connections between PCA and LDA, with derivations.

6. The principle and purpose of whitening.

7. Given an algorithm, e.g. LR, what are its model, evaluation criterion, and optimization method?

  • A regression model suffers from multicollinearity; how do you solve this problem?

We can first remove one of the collinear variables; compute the VIF (variance inflation factor) and take corresponding measures; or, to avoid losing information, use regularization methods such as ridge regression and lasso regression.

5.0.2 Deep learning interview topics:

  • Does the number of nodes in the output layer depend on the number of categories?

If there are many categories, the output weight matrix will be very large. Hierarchical softmax can be used here to reduce that cost while still producing class probabilities in (0, 1).

  • NN loss function

(figure: cost_function.png) \[J(\Theta)=\sum_i \Big(y^{(i)}-\big(1+e^{-\Theta^T x^{(i)}}\big)^{-1}\Big)^2, \qquad \min_{\Theta} J(\Theta)\] We need code to compute \(J(\Theta)\) and \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\).

1. Derive backpropagation by hand.
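
A numpy sketch (toy inputs, single-layer case) of the two quantities the note asks code for: the squared-error-of-sigmoid cost \(J(\Theta)\) above and its gradient with respect to \(\Theta\).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    p = sigmoid(X @ theta)            # (1 + e^{-theta^T x})^{-1} for every example
    J = np.sum((y - p) ** 2)          # J(theta) = sum_i (y_i - p_i)^2
    dJ_dp = -2.0 * (y - p)            # chain rule, outer term
    dp_dz = p * (1.0 - p)             # sigmoid derivative
    grad = X.T @ (dJ_dp * dp_dz)      # dJ / dtheta
    return J, grad

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])   # toy inputs with a bias column
y = np.array([1.0, 0.0, 1.0])
print(cost_and_grad(np.zeros(2), X, y))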

2. Causes of vanishing/exploding gradients, and how to address them. As the number of layers in a neural network grows, the optimization objective becomes more and more prone to getting stuck in local optima, and these "traps" drift further and further away from the true global optimum.

As the number of layers increases, the "vanishing gradient" phenomenon becomes more severe. Concretely, we often use the sigmoid as the neuron's activation function. For a signal of magnitude 1, every layer the backpropagated gradient passes through attenuates it to 0.25 of its previous value. With many layers, the gradient decays exponentially and the lower layers receive essentially no effective training signal.

Hinton used pre-training to alleviate the local-optimum problem, and to overcome vanishing gradients, activation functions such as ReLU and maxout replaced the sigmoid, forming the basic shape of today's DNNs. More recently, highway networks and deep residual learning have further mitigated vanishing gradients, pushing the number of layers to an unprecedented one hundred-plus (deep residual learning: 152 layers).

In a fully connected DNN, every neuron in one layer can connect to every neuron in the layer above, and the potential problem this brings is an explosion in the number of parameters. Suppose the input is a 1K*1K-pixel image and a hidden layer has 1M nodes; that single layer alone has 10^12 weights to train, which not only overfits easily but also very easily falls into a local optimum.

3. The principle of batch normalization (BN) and its relation to whitening. http://blog.csdn.net/fate_fjh/article/details/53375881 Batch Normalization is a training optimization method proposed by Google; see the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". Personally, I think the role of the BN layer is to speed up learning; the other advantages mentioned in the paper are by-products of this one. Few explanations online cover BN in detail; most explain only the principle without describing how it is actually used, so here BN is explained from three angles: what, why, and how.

  • What is BN

Normalization means standardizing (rescaling, normalizing) the data; "batch" refers to a mini-batch; put together, it is mini-batch standardization.

  • Why BN

The problems it addresses are vanishing and exploding gradients.

Regarding vanishing gradients, take the sigmoid function as an example: the sigmoid squashes its output into [0, 1].

In fact, once x grows beyond a certain magnitude, the sigmoid's output barely changes.

If the input is large, the corresponding slope is small. That slope (gradient) acts as the effective learning rate of the weights during backpropagation, so the following problem arises: in a deep network, if the activations are large, the gradients are small and learning is slow. Suppose every layer's local gradient is at most the maximum value of 0.25 and the network has n layers; because of the chain rule, the first layer's gradient is smaller than 0.25 to the power n, so it learns slowly, whereas the last layer only needs one derivative of itself, so its gradient is larger and it learns quickly. The consequence in a very deep network is that the shallow layers barely learn and their weights barely change, while the last few layers keep learning; in the end the last few layers can essentially represent the whole network, and the depth loses its meaning.

Regarding exploding gradients, by the chain rule the gradient of the first layer's bias = activation slope 1 × weight 1 × activation slope 2 × ... × activation slope (n-1) × weight (n-1) × activation slope n. If every activation slope is at its maximum of 0.25 but every layer's weight is 100, the gradient grows exponentially.

  • How to use BN
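
A minimal numpy sketch of the batch-normalization forward pass (gamma, beta, and eps here are illustrative values; in a real network gamma and beta are learned): per mini-batch, normalize each feature to zero mean and unit variance, then scale and shift.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mean of this mini-batch
    var = x.var(axis=0)                    # per-feature variance of this mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # learned scale and shift

batch = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))   # 32 samples, 4 features
out = batch_norm_forward(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))        # roughly 0 and 1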

4. Methods to prevent overfitting (regularization):

  • weight decay:

\[J(w)= \mathrm{MSE}_{\mathrm{train}}+\lambda\, w^T w\] where \(\lambda\) is a value chosen ahead of time that controls the strength of our preference for smaller weights. When \(\lambda=0\), we impose no preference, and a larger \(\lambda\) forces the weights to become smaller. Minimizing \(J(w)\) results in a choice of weights that make a tradeoff between fitting the training data and being small.

To choose \(\lambda\), split the data into a training set and a validation set and search over a grid such as \[\lambda \in \{10^{-2}, 10^{-1.5}, 10^{-1}, \ldots, 10, 10^{1.5}, 10^{2}\}\]
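
A sketch of that search (synthetic data; scikit-learn's Ridge is assumed available and stands in for the L2-penalized model): fit once per candidate \(\lambda\) and keep the value with the lowest validation error.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)          # toy regression data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

lambdas = [10 ** p for p in (-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2)]
scores = {lam: mean_squared_error(y_val, Ridge(alpha=lam).fit(X_tr, y_tr).predict(X_val))
          for lam in lambdas}
best = min(scores, key=scores.get)
print(best, scores[best])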

  • restrict parameter values

putting extra constraints on models such as adding restrictions on the parameter values.

  • soft constraint

add extra terms in the objective function that can be thought of as corresponding to a soft constraint on the parameter values.

5. Differences and connections among DNN, CNN, and RNN. DNN is the broad family; a CNN is a typical network that is deep in space, while an RNN is a network that is deep in time.

To overcome vanishing gradients, activation functions such as ReLU and maxout replaced the sigmoid, forming the basic shape of today's DNNs.

Images contain inherent local patterns (contours, edges, a person's eyes, nose, mouth, and so on) that can be exploited, so it is natural to combine ideas from image processing with neural network techniques. This is where the convolutional neural network (CNN) comes in. In a CNN, not all neurons in adjacent layers are directly connected; instead they are connected through a "convolution kernel" acting as an intermediary. The same kernel is shared across the whole image, and the image still preserves its original spatial relationships after the convolution operation.

The temporal order in which samples appear is very important for applications such as natural language processing, speech recognition, and handwriting recognition. To meet this need, another network structure appeared: the recurrent neural network (RNN).

6. The relationship between machine learning and deep learning.

7. How does batch size affect convergence speed?

5.0.3 Optimization interview topics:

1. Differences and connections among SGD, momentum, RMSProp, and Adam.

2. Why doesn't deep learning use second-order optimization?

3. Lagrange multipliers, the dual problem, and the KKT conditions.

  • Coding interview topics: sorting, two pointers, DP, greedy, divide and conquer, recursion, backtracking, strings, trees, linked lists, tries, BFS, DFS, and so on.
  • The first interview style casts a wide net with well-worn DL questions that have no standard answer, for example: what do you do about overfitting? What do you do about skewed (imbalanced) samples?

Dropout, data augmentation, weight decay (which weight decay variants are commonly used? how are the decayed weights handled?), L1, L2. You are asked to compare the two kinds of weight decay: why are there two, and what are the differences? For example, if you say L2 is used because L1 is not differentiable at zero, you are immediately asked about Smooth L1. If you explain all of that, you are asked why weight decay can alleviate overfitting to some extent; and if you bring up L0 and sparsity, the follow-up is why sparsity is effective.

Implement matrix multiplication with MapReduce. NLP-related encoding questions (CBOW vs. Skip-gram). Pros and cons of different activation functions. Gradient Boosting questions. Random Forest questions. The dimension of the SVM Gaussian kernel. Analyze text with regex. How to read JSON and clean the data with Python/R. Implement Monte Carlo in C++. Coding: walk a maze with DFS.

Which DL libraries have you used? What are the current state-of-the-art DL models? How do you deal with the diminishing (vanishing) gradient problem? How do you handle text documents and images at the same time? How do you prevent overfitting? How do you pre-train a model? Could you deploy an existing model yourself on servers using distributed computing?

What techniques are there for addressing network overfitting? Why can dropout address overfitting? What is the idea behind batch normalization? What do you do when classes are imbalanced? How does the anchor-box approach in object detection differ from sliding-window detection in AdaBoost face detection? What is the difference between tracking and detection?

  • How many frameworks have you used?

https://deeplearning4j.org/cn/compare-dl4j-torch7-pylearn#tensorflow

  • Lua

Torch and PyTorch

  • Python frameworks

Theano and its ecosystem, TensorFlow, Caffe, Caffe2, CNTK, Chainer, DSSTNE, DyNet, Keras, Paddle; analyze their strengths and weaknesses.

5.0.4 Here we divide the basic skills a machine learning engineer needs to master into five categories:

1. Computer science fundamentals and programming ability 2. Probability and statistics 3. Data modeling and evaluation 4. Applying machine learning algorithms and libraries 5. Software engineering and system design

(1) Computer science fundamentals and programming ability

▲ How do you determine whether a linked list contains a cycle?

▲ Given two elements of a binary search tree, find their lowest common ancestor.

▲ Write a stack permutation function.

▲ How do you derive the time complexity of comparison-based sorting algorithms? Can you prove it?

▲ How do you find the shortest path between two nodes in a weighted graph? What if some of the weights are negative?

▲ Find all palindromic substrings of a string.

For all of these problems, you should be able to derive the time and space complexity of your approach and, as far as possible, solve the problem with the lowest complexity.

Only through extensive practice do these different types of problems become second nature, so that in an interview you can quickly give an effective solution.

Common platforms for preparing for algorithm interviews include LintCode, LeetCode, and Interview Cake.

(2) Probability and statistics

▲ Given the average heights of the men and of the women in a population, what is the average height of the whole population?

▲ A recent survey claims that in Italy 1/3 of cars are Ferraris, and half of those are red. If you see a red car approaching on an Italian street, how likely is it to be a Ferrari?

▲ You want to find the best spot on a website to place an ad. You can choose the ad's font size (small, medium, large) and its position (top, middle, bottom). At minimum, how many page visits (n) and ad clicks (m) do you need before you can say with 95% confidence that one of the designs is better than all the others?

Many machine learning algorithms are built on probability theory and statistics, so it is very important to have a clear grasp of these fundamentals, and you should also be able to connect the abstract formulas to real situations.

(3) Data modeling and evaluation

▲ A dairy farmer is trying to understand the factors that affect milk quality. He records the daily temperature (30-40°C), humidity (60-90%), feed consumption (2000-2500 kg), and milk yield (500-1000 liters).

Suppose the task is to predict the daily milk yield. How would you process the data and build a model?

What type of machine learning problem is this?

▲ Your company is developing a facial-expression recognition system. It takes a 1920*1080 high-definition image as input and, given that image, tells the user which of the following emotional states the face in the image is in: neutral, happy, sad, angry, or afraid. If there is no face in the image, the system must be able to recognize that case.

What type of machine learning problem is this?

If each pixel is represented by 3 values (RGB), how large is the raw dimensionality of the input data? Is there a way to reduce it?

How would you encode the system's output? Why?

▲ Weather data collected over the past few centuries show temperatures rising and falling cyclically. For data like this (a series of annual average temperature values), how would you build a model to predict the average temperature over the next 5 years?

▲ Your job is to collect articles from around the world and merge similar articles from different sources into a single article. How would you design such a system? Which machine learning techniques would you use?

(4) Applying machine learning algorithms and libraries

▲ While training a single-hidden-layer neural network on a given data set, you notice that the weights fluctuate a lot between iterations (changing wildly, often swinging between positive and negative values). Which parameter do you need to adjust to fix this problem?

▲ What quantity is the training of a support vector machine essentially optimizing?

▲ LASSO regression uses the L1 norm as its penalty term, while ridge regression uses the L2 norm. Which of the two is more likely to produce a sparse model (one in which some coefficients are exactly 0)? LASSO

▲ While testing backpropagation in a 10-layer neural network, you find that the weights of the first three layers do not change at all, and the weights of the next few layers (4-6) change very slowly. Why is this, and how do you fix it?

▲ You have some data on European wheat output, including annual rainfall (R, inches), average altitude (A, meters), and wheat yield (O, kg per square kilometer). After a rough analysis you believe the yield is related to the square of the rainfall and the logarithm of the average altitude, i.e. \(O = \beta_0 + \beta_1 R^2 + \beta_2 \log_e(A)\). Can you estimate the coefficients (β) with a linear regression model?
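
A sketch of that last question (synthetic numbers; scikit-learn assumed available): the model is still linear in the coefficients, so ordinary linear regression works once the inputs are transformed into the features R² and log(A).

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
R = rng.uniform(10, 40, size=100)                  # annual rainfall (inches), made up
A = rng.uniform(100, 2000, size=100)               # average altitude (meters), made up
O = 50 + 0.3 * R**2 + 20 * np.log(A) + rng.normal(0, 5, size=100)   # toy yield

X = np.column_stack([R**2, np.log(A)])             # transformed design matrix
model = LinearRegression().fit(X, O)
print(model.intercept_, model.coef_)               # approximately beta0, beta1, beta2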

You can get to know a wide variety of problems and their subtleties by taking part in data science and machine learning competitions. Enter more of these competitions and try applying different machine learning models.

(5) Software engineering and system design

▲ You run an e-commerce website. When a user clicks on a product's details, you must recommend 5 products the user may be interested in, based on the features of the products they bought in the past, and show them at the bottom of the page. What servers and databases do you need to implement this feature? Assuming they are available, write a program to obtain these 5 recommended products.

▲ What data would you collect from an online video site (such as YouTube) to estimate user engagement and video popularity? Engagement: comments, shares, view counts. Popularity: total views, average views per day.

▲ A very simple spam detection system works as follows: it processes one email at a time, counts the frequency of each distinct word (term frequency), and then compares these frequencies with the frequencies of emails previously labeled as spam/normal. To scale this system up to handle a large volume of email, can you design a MapReduce scheme that runs on a cluster of machines?

▲ You want to visualize user activity in real time, like a heat map. What components/servers/APIs do you need on the client side and the server side to implement this feature?

(6) Closing remarks

Many people preparing for machine learning interviews immerse themselves in preparing for the technical questions, but rarely think about why this company has this position, why the company wants to use machine learning as a solution, why they are interested in you, and so on.

round1:

1: Time complexity of the convolution operation in a CNN;

2: Describe Random Forest and GBT;

3: (Seeing the Kaggle project on my resume) Why does XGBoost work so well?

4: LeetCode;

round2:

1: Engineering background;

2: Level of familiarity with Python;

3: LeetCode;

round3:

1: Introduce your project

2: What was the hardest part of the project?

3: What part of the project gave you the greatest sense of achievement?

4: What in your life has given you the greatest sense of achievement?

5: How many times a day do you open our app?

No comment;

— Ride-hailing company —

1: Derive the LSTM structure; why is it better than a plain RNN?

You need to explain the LSTM structure: the input and forget gates, the cell state, the hidden state, and so on. I originally answered that it prevents vanishing and exploding gradients; a reader pointed out that it cannot prevent exploding gradients, which is a fair correction, thanks;

2: Why do gradients vanish or explode?

Answer: omitted

3: Why did the autoencoder you used work better than the LSTM?

Answer: I said it was mainly an issue with the randomly initialized word embeddings; the autoencoder's sentence representation is a bag-of-words approach, which loses word order but preserves the physical meaning; (?)

4: How do you address overfitting?

Answer: dropout, regularization, batch normalization;

5: Why does dropout address overfitting? What are the principles of L1 and L2 regularization? Why can L1 regularization drive parameters to exactly 0? Why can batch normalization prevent vanishing/exploding gradients?

Answer: omitted

6: How to address model underfitting:

Answer: I mentioned sample reweighting from curriculum learning and increasing model complexity, plus some feature engineering; they then asked about common feature engineering methods;

7: (My resume listed VAE, GAN, and RL; I had oversold myself.) What do VAE and GAN have in common? Explain how GAN or reinforcement learning could be applied to your work;

Answer: omitted

Traditional machine learning

1: Derive the SVM dual problem;

2: Describe the random forest algorithm, plus the bias-variance decomposition formula;

3: The essential difference between HMM and CRF;

4: The essential difference between the frequentist and Bayesian schools;

5: Commonly used optimization methods;

6: The physical meaning of a matrix determinant (the determinant measures how much the linear transformation corresponding to the matrix stretches space, i.e. the ratio of an object's volume after the transformation to its volume before);

7: Dynamically predict the ride demand in each region;

My impression of the ride-hailing company was very good; both HR and the interviewers had a great attitude, including a final phone call with the team lead and an invitation to visit the office to settle things; throughout the process HR answered every question I had;

Once I even rescheduled the ride-hailing company's interview in order to go to the news-app company mentioned earlier, and HR was still very accommodating;

In the end I had already decided to go to Shenzhen, so not being able to join the ride-hailing company was a bit of a pity;

Moreover, the ride-hailing company's questions were very professional, all ML algorithms from start to finish, with no mindless LeetCode; I had neither the time nor the desire to grind LeetCode again just for an interview;

— Phone company —

round1:

1: LSTM-related questions;

2: Write k-means in Python;

3: Can't remember

round2:

1: Business-related questions

2: LeetCode

3: A concrete business problem, namely how to quantify the similarity between sentences;

round3:

1: Can't remember

The phone company was busy with a product launch and only notified me of the follow-up interview a week after the first round, so I flatly declined;

They asked almost nothing about deep learning; the one or two questions they did ask suggested they barely use it, and the rest was LeetCode;

The phone company's smart-home work looks quite promising; the interviewers had a good attitude;

— Search company —

Three rounds

1: How do you detect spam in text?

2: (Data structures) merging trees;

3: Domain knowledge related to the job;

4: In Python, how do you convert a hexadecimal number to binary?

5: A question about MySQL keys;

6: On Linux, how do you merge two files by columns?

7: How MapReduce works (basic questions, since my resume didn't mention MapReduce);

8: Ideas about NLP;

9: Career plan: specialist track or leadership track;

10: If given an offer, would you join this company right away?

Honestly, the search company was the most straightforward: after one afternoon of interviews, with no dithering at all, they gave a verbal offer;

If I were staying in Beijing, it would definitely be my first choice;

Later they asked which other companies I was interviewing with and which I would choose if given offers; I said this one. At the time I didn't expect the two Shenzhen companies to come through as well, so I feel quite guilty; given their attitude alone I should have joined this company;

It really wasn't like the rumors online; and the manager in the final round was one of the nicest people I met, someone who felt dependable;

— Big tech company —

1: Reverse a linked list

2: How do you place 100 million documents on 100 machines to compute pairwise similarity?

3: How do you sort 4 billion records with 2 GB of memory?

4: Traverse a tree

5: How HMMs work

5.0.5 data science

https://towardsdatascience.com/data-science-and-machine-learning-interview-questions-3f6207cf040b

What’s the trade-off between bias and variance? What is gradient descent? Explain over- and under-fitting and how to combat them? How do you combat the curse of dimensionality? What is regularization, why do we use it, and give some examples of common methods? Explain Principal Component Analysis (PCA)? Why is ReLU better and more often used than Sigmoid in Neural Networks? What is data normalization and why do we need it? I felt this one would be important to highlight. Data normalization is very important preprocessing step, used to rescale values to fit in a specific range to assure better convergence during backpropagation. In general, it boils down to subtracting the mean of each data point and dividing by its standard deviation. If we don’t do this then some of the features (those with high magnitude) will be weighted more in the cost function (if a higher-magnitude feature changes by 1%, then that change is pretty big, but for smaller features it’s quite insignificant). The data normalization makes all features weighted equally. Explain dimensionality reduction, where it’s used, and it’s benefits? Dimensionality reduction is the process of reducing the number of feature variables under consideration by obtaining a set of principal variables which are basically the important features. Importance of a feature depends on how much the feature variable contributes to the information representation of the data and depends on which technique you decide to use. Deciding which technique to use comes down to trial-and-error and preference. It’s common to start with a linear technique and move to non-linear techniques when results suggest inadequate fit. Benefits of dimensionality reduction for a data set may be: (1) Reduce the storage space needed (2) Speed up computation (for example in machine learning algorithms), less dimensions mean less computing, also less dimensions can allow usage of algorithms unfit for a large number of dimensions (3) Remove redundant features, for example no point in storing a terrain’s size in both sq meters and sq miles (maybe data gathering was flawed) (4) Reducing a data’s dimension to 2D or 3D may allow us to plot and visualize it, maybe observe patterns, give us insights (5) Too many features or too complex a model can lead to overfitting. How do you handle missing or corrupted data in a dataset? You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value. In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method. Explain this clustering algorithm? I wrote a popular article on the The 5 Clustering Algorithms Data Scientists Need to Know explaining all of them in detail with some great visualizations. How would you go about doing an Exploratory Data Analysis (EDA)? The goal of an EDA is to gather some insights from the data before applying your predictive model i.e gain some information. Basically, you want to do your EDA in a coarse to fine manner. We start by gaining some high-level global insights. Check out some imbalanced classes. Look at mean and variance of each class. Check out the first few rows to see what it’s all about. Run a pandas df.info() to see which features are continuous, categorical, their type (int, float, string). 
Next, drop unnecessary columns that won’t be useful in analysis and prediction. These can simply be columns that look useless, one’s where many rows have the same value (i.e it doesn’t give us much information), or it’s missing a lot of values. We can also fill in missing values with the most common value in that column, or the median. Now we can start making some basic visualizations. Start with high-level stuff. Do some bar plots for features that are categorical and have a small number of groups. Bar plots of the final classes. Look at the most “general features”. Create some visualizations about these individual features to try and gain some basic insights. Now we can start to get more specific. Create visualizations between features, two or three at a time. How are features related to each other? You can also do a PCA to see which features contain the most information. Group some features together as well to see their relationships. For example, what happens to the classes when A = 0 and B = 0? How about A = 1 and B = 0? Compare different features. For example, if feature A can be either “Female” or “Male” then we can plot feature A against which cabin they stayed in to see if Males and Females stay in different cabins. Beyond bar, scatter, and other basic plots, we can do a PDF/CDF, overlayed plots, etc. Look at some statistics like distribution, p-value, etc. Finally it’s time to build the ML model. Start with easier stuff like Naive Bayes and Linear Regression. If you see that those suck or the data is highly non-linear, go with polynomial regression, decision trees, or SVMs. The features can be selected based on their importance from the EDA. If you have lots of data you can use a Neural Network. Check ROC curve. Precision, Recall How do you know which Machine Learning model you should use? While one should always keep the “no free lunch theorem” in mind, there are some general guidelines. I wrote an article on how to select the proper regression model here. This cheatsheet is also fantastic! Why do we use convolutions for images rather than just FC layers? This one was pretty interesting since it’s not something companies usually ask. As you would expect, I got this question from a company focused on Computer Vision. This answer has 2 parts to it. Firstly, convolutions preserve, encode, and actually use the spatial information from the image. If we used only FC layers we would have no relative spatial information. Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation in-variance, since each convolution kernel acts as it’s own filter/feature detector. What makes CNNs translation invariant? As explained above, each convolution kernel acts as it’s own filter/feature detector. So let’s say you’re doing object detection, it doesn’t matter where in the image the object is since we’re going to apply the convolution in a sliding window fashion across the entire image anyways. Why do we have max-pooling in classification CNNs? Again as you would expect this is for a role in Computer Vision. Max-pooling in a CNN allows you to reduce computation since your feature maps are smaller after the pooling. You don’t lose too much semantic information since you’re taking the maximum activation. There’s also a theory that max-pooling contributes a bit to giving CNNs more translation in-variance. Check out this great video from Andrew Ng on the benefits of max-pooling. Why do segmentation CNNs typically have an encoder-decoder style / structure? 
The encoder CNN can basically be thought of as a feature extraction network, while the decoder uses that information to predict the image segments by “decoding” the features and upscaling to the original image size. What is the significance of Residual Networks? The main thing that residual connections did was allow for direct feature access from previous layers. This makes information propagation throughout the network much easier. One very interesting paper about this shows how using local skip connections gives the network a type of ensemble multi-path structure, giving features multiple paths to propagate throughout the network. What is batch normalization and why does it work? Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. The idea is then to normalize the inputs of each layer in such a way that they have a mean output activation of zero and standard deviation of one. This is done for each individual mini-batch at each layer i.e compute the mean and variance of that mini-batch alone, then normalize. This is analogous to how the inputs to networks are standardized. How does this help? We know that normalizing the inputs to a network helps it learn. But a network is just a series of layers, where the output of one layer becomes the input to the next. That means we can think of any layer in a neural network as the first layer of a smaller subsequent network. Thought of as a series of neural networks feeding into each other, we normalize the output of one layer before applying the activation function, and then feed it into the following layer (sub-network). How would you handle an imbalanced dataset? I have an article about this! Check out #3 :) Why would you use many small convolutional kernels such as 3x3 rather than a few large ones? This is very well explained in the VGGNet paper. There are 2 reasons: First, you can use several smaller kernels rather than few large ones to get the same receptive field and capture more spatial context, but with the smaller kernels you are using less parameters and computations. Secondly, because with smaller kernels you will be using more filters, you’ll be able to use more activation functions and thus have a more discriminative mapping function being learned by your CNN. Do you have any other projects that would be related here? Here you’ll really draw connections between your research and their business. Is there anything you did or any skills you learned that could possibly connect back to their business or the role you are applying for? It doesn’t have to be 100% exact, just somehow related such that you can show that you will be able to directly add lots of value. Explain your current masters research? What worked? What didn’t? Future directions? Same as the last question!

5.0.6 analytics

https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-asked-at-startups-in-machine-learning-data-science/

Interview Questions on Machine Learning Q1. You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

Answer: Processing high dimensional data on a limited memory machine is a strenuous task, and your interviewer would be fully aware of that. Following are the methods you can use to tackle such a situation:

Since we have lower RAM, we should close all other applications in our machine, including the web browser, so that most of the memory can be put to use. We can randomly sample the data set. This means, we can create a smaller data set, let’s say, having 1000 variables and 300000 rows and do the computations. To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated variables. For numerical variables, we’ll use correlation. For categorical variables, we’ll use chi-square test. Also, we can use PCA and pick the components which can explain the maximum variance in the data set. Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible option. Building a linear model using Stochastic Gradient Descent is also helpful. We can also apply our business understanding to estimate which all predictors can impact the response variable. But, this is an intuitive approach, failing to identify useful predictors might result in significant loss of information. Note: For point 4 & 5, make sure you read about online learning algorithms & Stochastic Gradient Descent. These are advanced methods.

Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?

Answer: Yes, rotation (orthogonal) is necessary because it maximizes the difference between variance captured by the component. This makes the components easier to interpret. Not to forget, that’s the motive of doing PCA where, we aim to select fewer components (than features) which can explain the maximum variance in the data set. By doing rotation, the relative location of the components doesn’t change, it only changes the actual coordinates of the points.

If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select more number of components to explain variance in the data set.

Know more: PCA

Q3. You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

Answer: This question has enough hints for you to start thinking! Since, the data is spread across median, let’s assume it’s a normal distribution. We know, in a normal distribution, ~68% of the data lies in 1 standard deviation from mean (or mode, median), which leaves ~32% of the data unaffected. Therefore, ~32% of the data would remain unaffected by missing values.

Q4. You are given a data set on cancer detection. You’ve build a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

Answer: If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance because 96% (as given) might only be predicting the majority class correctly, but our class of interest is the minority class (4%), which is the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine class-wise performance of the classifier. If the minority class performance is found to be poor, we can undertake the following steps:

We can use undersampling, oversampling or SMOTE to make the data balanced. We can alter the prediction threshold value by doing probability calibration and finding an optimal threshold using the AUC-ROC curve. We can assign weights to classes such that the minority classes get larger weights. We can also use anomaly detection. Know more: Imbalanced Classification

Q5. Why is naive Bayes so ‘naive’ ?

Answer: naive Bayes is so ‘naive’ because it assumes that all of the features in a data set are equally important and independent. As we know, these assumptions are rarely true in real world scenarios.

Q6. Explain prior probability, likelihood and marginal likelihood in context of naiveBayes algorithm?

Answer: Prior probability is nothing but, the proportion of dependent (binary) variable in the data set. It is the closest guess you can make about a class, without any further information. For example: In a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and 0 (not spam) is 30%. Hence, we can estimate that there are 70% chances that any new email would be classified as spam.

Likelihood is the probability of classifying a given observation as 1 in presence of some other variable. For example: The probability that the word ‘FREE’ is used in previous spam message is likelihood. Marginal likelihood is, the probability that the word ‘FREE’ is used in any message.

Q7. You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you tried a time series regression model and got higher accuracy than the decision tree model. Can this happen? Why?

Answer: Time series data is known to possess linearity. On the other hand, a decision tree algorithm is known to work best at detecting non-linear interactions. The reason the decision tree failed to provide robust predictions is that it couldn't map the linear relationship as well as a regression model did. Therefore, we learned that a linear regression model can provide robust predictions if the data set satisfies its linearity assumptions.

Q8. You are assigned a new project which involves helping a food delivery company save more money. The problem is, company’s delivery team aren’t able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, they end up delivering food for free. Which machine learning algorithm can save them?

Answer: You might have started hopping through the list of ML algorithms in your mind. But, wait! Such questions are asked to test your machine learning fundamentals.

This is not a machine learning problem. This is a route optimization problem. A machine learning problem consists of three things:

There exists a pattern. You cannot solve it mathematically (even by writing exponential equations). You have data on it. Always look for these three factors to decide if machine learning is a tool to solve a particular problem.

Q9. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?

Answer: Low bias occurs when the model's predicted values are near the actual values. In other words, the model becomes flexible enough to mimic the training data distribution. While that sounds like a great achievement, do not forget that such a flexible model has no generalization capabilities. It means that when this model is tested on unseen data, it gives disappointing results.

In such situations, we can use a bagging algorithm (like random forest) to tackle the high variance problem. Bagging algorithms divide a data set into subsets made with repeated randomized sampling. Then, these samples are used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression).

Also, to combat high variance, we can:

Use regularization techniques, where higher model coefficients get penalized, hence lowering model complexity. Use the top n features from the variable importance chart. Maybe, with all the variables in the data set, the algorithm is having difficulty finding the meaningful signal.

Q10. You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

Answer: Chances are, you might be tempted to say No, but that would be incorrect. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.

For example: You have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit twice the variance that it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variables, which is misleading.

Q11. After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, neither of models could perform better than benchmark score. Finally, you decided to combine those models. Though, ensembled models are known to return high accuracy, but you are unfortunate. Where did you miss?

Answer: As we know, ensemble learners are based on the idea of combining weak learners to create strong learners. But these learners provide a superior result when the combined models are uncorrelated. Since we have used 5 GBM models and got no accuracy improvement, it suggests that the models are correlated. The problem with correlated models is that all the models provide the same information.

For example: If model 1 has classified User1122 as 1, there are high chances model 2 and model 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners are built on the premise of combining weak uncorrelated models to obtain better predictions.

Q12. How is kNN different from kmeans clustering?

Answer: Don’t get misled by ‘k’ in their names. You should know that the fundamental difference between these two algorithms is that kmeans is unsupervised in nature and kNN is supervised in nature. kmeans is a clustering algorithm. kNN is a classification (or regression) algorithm.

kmeans algorithm partitions a data set into clusters such that a cluster formed is homogeneous and the points in each cluster are close to each other. The algorithm tries to maintain enough separability between these clusters. Due to unsupervised nature, the clusters have no labels.

kNN algorithm tries to classify an unlabeled observation based on its k (can be any number ) surrounding neighbors. It is also known as lazy learner because it involves minimal training of model. Hence, it doesn’t use training data to make generalization on unseen data set.

Q13. How is True Positive Rate and Recall related? Write the equation.

Answer: True Positive Rate = Recall. Yes, they are equal, with the formula TP / (TP + FN).

Know more: Evaluation Metrics

Q14. You have built a multiple regression model. Your model R² isn't as good as you wanted. For improvement, you remove the intercept term and your model R² becomes 0.8 from 0.3. Is it possible? How?

Answer: Yes, it is possible. We need to understand the significance of the intercept term in a regression model. The intercept term shows the model prediction without any independent variable, i.e. the mean prediction. The formula is \(R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2\), where \(\hat{y}\) is the predicted value and \(\bar{y}\) is the mean of \(y\).

When the intercept term is present, the R² value evaluates your model with respect to the mean model. In the absence of the intercept term, the comparison against the mean is dropped and the denominator becomes \(\sum_i y_i^2\), i.e. \(R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i y_i^2\); with this larger denominator the ratio becomes smaller than it should be, resulting in a higher R².
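
A numpy sketch of that effect on synthetic numbers: the no-intercept R² compares the model against zero rather than against the mean, so when y has a large mean the denominator balloons and the reported R² is inflated.

import numpy as np

rng = np.random.default_rng(0)
y = 100 + rng.normal(size=50)                     # target with a large mean
y_pred = y.mean() + 0.5 * rng.normal(size=50)     # a deliberately mediocre prediction

ss_res = np.sum((y - y_pred) ** 2)
r2_with_intercept = 1 - ss_res / np.sum((y - y.mean()) ** 2)
r2_no_intercept   = 1 - ss_res / np.sum(y ** 2)   # denominator measured from zero
print(round(r2_with_intercept, 3), round(r2_no_intercept, 3))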

Q15. After analyzing the model, your manager has informed that your regression model is suffering from multicollinearity. How would you check if he’s true? Without losing any information, can you still build a better model?

Answer: To check multicollinearity, we can create a correlation matrix to identify & remove variables having correlation above 75% (deciding a threshold is subjective). In addition, we can calculate VIF (variance inflation factor) to check the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. Also, we can use tolerance as an indicator of multicollinearity.

But, removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Also, we can add some random noise in correlated variable so that the variables become different from each other. But, adding noise might affect the prediction accuracy, hence this approach should be carefully used.

Know more: Regression

Q16. When is Ridge regression favorable over Lasso regression?

Answer: You can quote ISLR’s authors Hastie, Tibshirani who asserted that, in presence of few variables with medium / large sized effect, use lasso regression. In presence of many variables with small / medium sized effect, use ridge regression.

Conceptually, we can say, lasso regression (L1) does both variable selection and parameter shrinkage, whereas Ridge regression only does parameter shrinkage and end up including all the coefficients in the model. In presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least square estimates have higher variance. Therefore, it depends on our model objective.

Know more: Ridge and Lasso Regression

Q17. Rise in global average temperature led to decrease in number of pirates around the world. Does that mean that decrease in number of pirates caused the climate change?

Answer: After reading this question, you should have understood that this is a classic case of “causation and correlation”. No, we can’t conclude that decrease in number of pirates caused the climate change because there might be other factors (lurking or confounding variables) influencing this phenomenon.

Therefore, there might be a correlation between global average temperature and the number of pirates, but based on this information we can't say that pirates died out because of the rise in global average temperature.

Know more: Causation and Correlation

Q18. While working on a data set, how do you select important variables? Explain your methods.

Answer: Following are the methods of variable selection you can use:

Remove the correlated variables prior to selecting important variables Use linear regression and select variables based on p values Use Forward Selection, Backward Selection, Stepwise Selection Use Random Forest, Xgboost and plot variable importance chart Use Lasso Regression Measure information gain for the available set of features and select top n features accordingly.

Q19. What is the difference between covariance and correlation?

Answer: Correlation is the standardized form of covariance.

Covariances are difficult to compare. For example: if we calculate the covariances of salary ($) and age (years), we’ll get different covariances which can’t be compared because of having unequal scales. To combat such situation, we calculate correlation to get a value between -1 and 1, irrespective of their respective scale.

Q20. Is it possible to capture the correlation between a continuous and a categorical variable? If yes, how?

Answer: Yes, we can use ANCOVA (analysis of covariance) technique to capture association between continuous and categorical variables.

Q21. Both being tree based algorithm, how is random forest different from Gradient boosting algorithm (GBM)?

Answer: The fundamental difference is, random forest uses bagging technique to make predictions. GBM uses boosting techniques to make predictions.

In the bagging technique, a data set is divided into n samples using randomized sampling. Then, using a single learning algorithm, a model is built on each sample. Later, the resultant predictions are combined using voting or averaging. Bagging is done in parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher, so that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached.

Random forest improves model accuracy by reducing variance (mainly). The trees grown are uncorrelated to maximize the decrease in variance. On the other hand, GBM improves accuracy by reducing both bias and variance in a model.

Know more: Tree based modeling

Q22. Running a binary classification tree algorithm is the easy part. Do you know how tree splitting takes place, i.e. how the tree decides which variable to split on at the root node and at succeeding nodes?

Answer: A classification tree makes decisions based on the Gini index and node entropy. In simple words, the tree algorithm finds the best possible feature which can divide the data set into the purest possible child nodes.

The Gini index says: if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure. We can calculate Gini as follows:

Calculate Gini for sub-nodes using the formula: the sum of squares of the probabilities of success and failure, \(p^2 + q^2\). Calculate Gini for the split using the weighted Gini score of each node of that split. Entropy is the measure of impurity, given (for a binary class) by

\[\text{Entropy} = -p\log_2 p - q\log_2 q\]

Here p and q are the probabilities of success and failure respectively in that node. Entropy is zero when a node is homogeneous. It is maximum when both classes are present in a node at 50% – 50%. Lower entropy is desirable.
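
A small sketch of the two impurity measures described above, for a binary node with success probability p (and q = 1 - p):

import numpy as np

def gini_score(p):
    q = 1 - p
    return p ** 2 + q ** 2                  # the p^2 + q^2 score above: 1 for a pure node

def entropy(p):
    q = 1 - p
    return -sum(x * np.log2(x) for x in (p, q) if x > 0)   # 0 for a pure node, 1 at 50/50

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(gini_score(p), 3), round(entropy(p), 3))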

Q23. You’ve built a random forest model with 10000 trees. You got delighted after getting training error as 0.00. But, the validation error is 34.23. What is going on? Haven’t you trained your model perfectly?

Answer: The model has overfitted. A training error of 0.00 means the classifier has mimicked the training data patterns to such an extent that those patterns are not available in the unseen data. Hence, when this classifier was run on an unseen sample, it couldn't find those patterns and returned predictions with higher error. In random forest, this happens when we use a larger number of trees than necessary. Hence, to avoid this situation, we should tune the number of trees using cross validation.

Q24. You’ve got a data set to work with having p (no. of variables) > n (no. of observations). Why is OLS a bad option to work with? Which techniques would be best to use? Why?

Answer: In such high dimensional data sets, we can’t use classical regression techniques, since their assumptions tend to fail. When p > n, we can no longer calculate a unique least square coefficient estimate, the variances become infinite, so OLS cannot be used at all.

To combat this situation, we can use penalized regression methods like lasso, LARS, ridge which can shrink the coefficients to reduce variance. Precisely, ridge regression works best in situations where the least square estimates have higher variance.

Among other methods include subset regression, forward stepwise regression.

Q25. What is a convex hull? (Hint: Think SVM)

Answer: In case of linearly separable data, convex hull represents the outer boundaries of the two group of data points. Once convex hull is created, we get maximum margin hyperplane (MMH) as a perpendicular bisector between two convex hulls. MMH is the line which attempts to create greatest separation between two groups.

Q26. We know that one hot encoding increases the dimensionality of a data set, but label encoding doesn't. How?

Answer: Don’t get baffled at this question. It’s a simple question asking the difference between the two.

Using one hot encoding, the dimensionality (i.e. the number of features) in a data set gets increased because it creates a new variable for each level present in categorical variables. For example: let’s say we have a variable ‘color’. The variable has 3 levels namely Red, Blue and Green. One hot encoding the ‘color’ variable will generate three new variables, Color.Red, Color.Blue and Color.Green, containing 0 and 1 values.

In label encoding, the levels of a categorical variable get encoded as 0 and 1, so no new variable is created. Label encoding is mainly used for binary variables.
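
A pandas sketch of the difference (pandas assumed available; the 'color' column is the toy example from the answer):

import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

one_hot = pd.get_dummies(df["color"], prefix="Color")   # 3 new columns: Color_Blue, Color_Green, Color_Red
label = df["color"].astype("category").cat.codes        # a single integer-coded column
print(one_hot.columns.tolist())
print(label.tolist())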

Q27. What cross validation technique would you use on time series data set? Is it k-fold or LOOCV?

Answer: Neither.

In a time series problem, k-fold can be troublesome because there might be some pattern in year 4 or 5 which is not in year 3. Resampling the data set will separate these trends, and we might end up validating on past years, which is incorrect. Instead, we can use a forward chaining strategy with 5 folds as shown below:

fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]
where 1,2,3,4,5,6 represents “year”.
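
A sketch of the same forward-chaining scheme using scikit-learn's TimeSeriesSplit (assumed available); the six "years" are just the indices 0..5:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)            # one observation per "year"
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("training", (train_idx + 1).tolist(), "test", (test_idx + 1).tolist())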

Q28. You are given a data set in which some variables have more than 30% missing values. Let's say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?

Answer: We can deal with them in the following ways:

Assign a unique category to the missing values; who knows, the missing values might decipher some trend. We can remove them outright. Or, we can sensibly check their distribution with the target variable, and if we find any pattern, we'll keep those missing values and assign them a new category while removing the others.

Q29. ‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?

Answer: The basic idea for this kind of recommendation engine comes from collaborative filtering.

Collaborative filtering algorithms consider “user behavior” for recommending items. They exploit the behavior of other users and items in terms of transaction history, ratings, selection and purchase information. Other users' behavior and preferences over the items are used to recommend items to new users. In this case, features of the items are not known.

Know more: Recommender System

Q30. What do you understand by Type I vs Type II error ?

Answer: Type I error is committed when the null hypothesis is true and we reject it, also known as a ‘False Positive’. Type II error is committed when the null hypothesis is false and we accept it, also known as ‘False Negative’.

In the context of confusion matrix, we can say Type I error occurs when we classify a value as positive (1) when it is actually negative (0). Type II error occurs when we classify a value as negative (0) when it is actually positive(1).

Q31. You are working on a classification problem. For validation purposes, you’ve randomly sampled the training data set into train and validation. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you get shocked after getting poor test accuracy. What went wrong?

Answer: In the case of a classification problem, we should always use stratified sampling instead of random sampling. Random sampling doesn't take into consideration the proportion of target classes. On the contrary, stratified sampling helps to maintain the distribution of the target variable in the resulting samples as well.

Q32. You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?

Answer: Tolerance (1 / VIF) is used as an indicator of multicollinearity. It is an indicator of the percent of variance in a predictor which cannot be accounted for by the other predictors. Large values of tolerance are desirable.

We will consider adjusted R² as opposed to R² to evaluate model fit because R² increases irrespective of improvement in prediction accuracy as we add more variables. But adjusted R² only increases if an additional variable improves the accuracy of the model; otherwise it stays the same. It is difficult to commit to a general threshold value for adjusted R² because it varies between data sets. For example: a gene mutation data set might result in a lower adjusted R² and still provide fairly good predictions, as compared to a stock market data set where a lower adjusted R² implies that the model is not good.

Q33. In k-means or kNN, we use euclidean distance to calculate the distance between nearest neighbors. Why not manhattan distance ?

Answer: We don’t use manhattan distance because it calculates distance horizontally or vertically only. It has dimension restrictions. On the other hand, euclidean metric can be used in any space to calculate distance. Since, the data points can be present in any dimension, euclidean distance is a more viable option.

Example: Think of a chess board, the movement made by a bishop or a rook is calculated by manhattan distance because of their respective vertical & horizontal movements.

Q34. Explain machine learning to me like a 5 year old.

Answer: It’s simple. It’s just like how babies learn to walk. Every time they fall down, they learn (unconsciously) & realize that their legs should be straight and not in a bend position. The next time they fall down, they feel pain. They cry. But, they learn ‘not to stand like that again’. In order to avoid that pain, they try harder. To succeed, they even seek support from the door or wall or anything near them, which helps them stand firm.

This is how a machine works & develops intuition from its environment.

Note: The interviewer is only trying to test whether you have the ability to explain complex concepts in simple terms.

Q35. I know that a linear regression model is generally evaluated using Adjusted R² or F value. How would you evaluate a logistic regression model?

Answer: We can use the following methods:

Since logistic regression is used to predict probabilities, we can use AUC-ROC curve along with confusion matrix to determine its performance. Also, the analogous metric of adjusted R² in logistic regression is AIC. AIC is the measure of fit which penalizes model for the number of model coefficients. Therefore, we always prefer model with minimum AIC value. Null Deviance indicates the response predicted by a model with nothing but an intercept. Lower the value, better the model. Residual deviance indicates the response predicted by a model on adding independent variables. Lower the value, better the model. Know more: Logistic Regression

Q36. Considering the long list of machine learning algorithm, given a data set, how do you decide which one to use?

Answer: You should say that the choice of machine learning algorithm depends solely on the type of data. If you are given a data set which exhibits linearity, then linear regression would be the best algorithm to use. If you are given images or audio, then a neural network would help you to build a robust model.

If the data comprises non-linear interactions, then a boosting or bagging algorithm should be the choice. If the business requirement is to build a model which can be deployed, then we'll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM etc.

In short, there is no one master algorithm for all situations. We must be scrupulous enough to understand which algorithm to use.

Q37. Do you suggest that treating a categorical variable as continuous variable would result in a better predictive model?

Answer: For better predictions, categorical variable can be considered as a continuous variable only when the variable is ordinal in nature.

Q38. When does regularization becomes necessary in Machine Learning?

Answer: Regularization becomes necessary when the model begins to overfit / underfit. This technique introduces a cost term for bringing in more features with the objective function. Hence, it tries to push the coefficients for many variables to zero and thereby reduce the cost term. This helps to reduce model complexity so that the model can become better at predicting (generalizing).

Q39. What do you understand by Bias Variance trade off?

Answer: The error emerging from any model can be broken down into three components mathematically. These components are:

\[\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]

Bias error is useful to quantify how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model which keeps missing important trends. Variance, on the other side, quantifies how much the predictions made on the same observation differ from each other. A high variance model will over-fit on your training population and perform badly on any observation beyond training.

Q40. OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the statement.

Answer: OLS and Maximum likelihood are the methods used by the respective regression methods to approximate the unknown parameter (coefficient) value. In simple words,

Ordinary least squares (OLS) is a method used in linear regression which approximates the parameters resulting in the minimum distance between actual and predicted values. Maximum likelihood helps in choosing the values of the parameters which are most likely to have produced the observed data.

5.0.7 essential questions

https://svrtechnologies.com/machine-learning-lnterview-questions/new-51-machine-learning-lnterview-questions

  1. What is machine learning?

Answer: In answering this question, try to show your understanding of the broad applications of machine learning, as well as how it fits into AI. Put it into your own words, but convey your understanding that machine learning is a form of AI that automates data analysis to enable computers to learn and adapt through experience to do specific tasks without explicit programming.

  1. Which one would you prefer to choose – model accuracy or model performance ?

Answer: Model accuracy is just a subset of model performance but is not the be-all and end-all of model performance. This question is asked to test your knowledge on how well you can make a perfect balance between model accuracy and model performance.

  1. How will you set the threshold for a credit card fraud detection model?

Answer:


  1. Give a drawback of Gradient descent. ?

Answer:

It does not always converge to the same point, as in some cases it reaches a local minimum instead of the global optimum.

  1. When should one use Mean absolute error over Root mean square error as a performance measure for regression problems ?

Answer:

When we have many outliers in the data, Mean absolute error is a better choice.

  1. What are the three stages to build any model in Machine learning ?

Answer:

There are 3 stages to building a model in machine learning. Those are:

Model building: choose a suitable algorithm for the model and train it according to the requirements of your problem. Model testing: check the accuracy of the model on the test data. Applying the model: make the required changes after testing and apply the final model we have at the end.

  1. Why is it important for the royal society to be doing a project about machine learning ?

Answer:

I think it is very important for the Royal Society to do a project on machine learning, so that people realize how much impact machine learning is going to have in the future. Some people have not even heard of machine learning yet; that is going to change in our society in the near future. It is about addressing where people are right now while the world moves forward with these cutting-edge technologies. I think it's all about transparency: we need to communicate the potential of these things and where we can go once we learn them. It's all about looking into the future to make predictions.

  1. How do we know which machine learning algorithm is better for us to solve our problem ?

Answer:

If we care about accuracy, we can test different algorithms and cross-validate them to see whether we are getting good accuracy or not. If the problem has a small training dataset, we should use models with low variance and high bias; if the problem has a large training dataset, we should use models with high variance and low bias. If we follow these guidelines, we will easily get to know which algorithm is better suited to the problem.



  1. How will you explain machine learning to a layperson in an easily comprehensible manner ?

Answer:

Machine learning is a kind of technology that enables the computer-based machines and systems to make decisions based on prior experience with an activity, with the intent of improving its performance continuously. This can be understood through multiple examples, such as:

Imagine a curious kid who sticks his palm onto something hot, feels the pain, and learns from that experience not to do it again.
You have observed that obese people are more prone to heart disease than thinner people, so you decide to stay slim to reduce your risk of heart disease; after going through a lot of information on the topic, you come up with a general rule of classification.
Suppose you are playing blackjack and, based on the sequence of cards you have seen, you decide whether to hit or not; you decide on your course of action from prior experience and from observing what happens.
In the same way, machines also learn, with the aid of technology.

  1. How will you choose the most appropriate machine learning algorithm for your classification problem ?

Answer:

If accuracy has to be given priority when deciding on a machine learning algorithm, the best way to go about it is to test a couple of different algorithms (trying different parameters within each) and choose the one that best meets the requirement. As a rule of thumb, choose a classification algorithm based on the size of your training set: if the training set is small, low variance/high bias classifiers like Naïve Bayes are beneficial, while for large training sets high variance/low bias classifiers like k-nearest neighbours serve the purpose best.
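A minimal sketch of this idea, assuming scikit-learn and its bundled iris dataset are available, cross-validates a low variance/high bias classifier (naïve Bayes) against a higher-variance one (k-nearest neighbours) and reports their mean accuracies:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "naive_bayes": GaussianNB(),                    # low variance / high bias
    "knn": KNeighborsClassifier(n_neighbors=5),     # higher variance / lower bias
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)     # 5-fold cross-validation accuracy
    print(name, scores.mean())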


  1. What is backpropagation in machine learning ?

Answer:

The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.
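A minimal NumPy sketch of the forward/backward idea, using a hypothetical one-hidden-layer network with sigmoid activations and squared error (not any particular library's implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 4 examples, 2 inputs, 1 target each (values are made up)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(3, 1))

for _ in range(5000):
    # forward pass: compute (and keep) the activations of each layer
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)

    # backward pass: partial derivative of the squared error w.r.t. each weight
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # gradient descent update
    W2 -= 0.5 * h.T @ d_out
    W1 -= 0.5 * X.T @ d_h

print(out.round(2))   # predictions after training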

  1. What is candidate sampling in machine learning ?

Answer:

A training-time optimization in which a probability is calculated for all the positive labels (using, for example, softmax) but only for a random sample of the negative labels. For example, if we have an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for the beagle and dog class outputs, in addition to a random subset of the remaining classes (cat, lollipop, fence).

  1. What is classification threshold in machine learning ?

Answer: A scalar-valued criterion applied to a model’s predicted score in order to separate the positive class from the negative class. It is used when mapping logistic regression results to binary classification.
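For instance (a minimal NumPy sketch; the scores and the 0.5 threshold are chosen arbitrarily):

import numpy as np

predicted_scores = np.array([0.12, 0.47, 0.51, 0.86, 0.30])
threshold = 0.5                              # classification threshold

labels = (predicted_scores >= threshold).astype(int)
print(labels)                                # [0 0 1 1 0]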

  1. What is Naive Bayes classifier ?

Answer: Naïve Bayes is an extensively used algorithm for classification tasks, and it has proved effective in textual data analysis. It is a foundational machine learning algorithm: it uses conditional probability (Bayes’ theorem), together with a strong independence assumption between features, to estimate in advance how probable each class is.
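A minimal sketch of naïve Bayes on text, assuming scikit-learn and a tiny made-up spam/ham sample:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "cheap pills offer", "meeting at noon", "lunch tomorrow?"]
labels = ["spam", "spam", "ham", "ham"]          # hypothetical labels

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)               # bag-of-words counts

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["cheap money offer"])))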

  1. What is Bias-Variance trade-off in machine learning ?

Answer: The bias-variance trade-off is the dilemma of minimizing errors that stem from two different sources at the same time. Bias arises from preconceived assumptions built into the learning algorithm, while variance measures how far a set of predictions is spread from its average value. Trading off between these two aspects is central to tuning any machine learning algorithm.

  1. What is the difference between artificial intelligence and machine learning ?

Answer: Machine learning: designing and developing algorithms whose behaviour is learned from empirical data is known as machine learning.

Artificial intelligence: in addition to machine learning, it also covers other aspects like knowledge representation, natural language processing, planning, robotics etc.


  1. What is deep learning ?

Answer: This might or might not apply to the job you’re going after, but your answer will help to show you know more than just the technical aspects of machine learning. Deep learning is a subset of machine learning. It refers to using multi-layered neural networks to process data in increasingly complex ways, enabling the software to train itself to perform tasks like speech and image recognition through exposure to vast amounts of data, so that the machine continually improves its ability to recognize and process information. Layers of neural networks stacked on top of each other for use in deep learning are called deep neural networks.

  1. What is Genetic Programming ?

Answer: Genetic programming is one of the techniques used in machine learning, based on the principles of evolution: the model tests candidate solutions and selects the best choice among a set of results.

  1. Why is Naïve Bayes machine learning algorithm naïve ?

Answer: The Naïve Bayes machine learning algorithm is considered naïve because the assumptions the algorithm makes are virtually impossible to find in real-life data. Conditional probability is calculated as a pure product of the individual probabilities of the components. This means the algorithm assumes that the presence or absence of a specific feature of a class is not related to the presence or absence of any other feature (absolute independence of features), given the class variable. For instance, a fruit may be considered to be a banana if it is yellow, long and about five inches in length. However, even if these features depend on each other or on the existence of other features, a naïve Bayes classifier assumes that all of these properties contribute independently to the probability that the fruit is a banana. The assumption that all features in a dataset are equally important and independent rarely holds in real-world scenarios.

  1. You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why ?

Answer: This question has enough hints for you to start thinking! Since the data are spread around the median, let’s assume a normal distribution. In a normal distribution, about 68% of the data lies within one standard deviation of the mean (or mode, or median, which coincide), which leaves about 32% outside that range. Therefore, about 32% of the data would remain unaffected by the missing values.
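The 68% figure comes from the normal distribution’s one-standard-deviation interval, which can be checked numerically (assuming SciPy is available):

from scipy.stats import norm

within_one_sd = norm.cdf(1) - norm.cdf(-1)   # probability mass within ±1 standard deviation
print(within_one_sd)                         # ≈ 0.6827, so ≈ 32% of the data lies outside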

  1. What’s a Fourier transform ?

Answer: A Fourier transform is a generic method to decompose generic functions into a superposition of symmetric functions. Or as this more intuitive tutorial puts it, given a smoothie, it’s how we find the recipe. The Fourier transform finds the set of cycle speeds, amplitudes and phases to match any time signal. A Fourier transform converts a signal from time to frequency domain — it’s a very common way to extract features from audio signals or other time series such as sensor data.
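A minimal NumPy sketch: build a signal from two known frequencies and recover them with the FFT (the 5 Hz and 12 Hz components are made up for the example):

import numpy as np

fs = 100.0                                   # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)                  # one second of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.abs(np.fft.rfft(signal))       # magnitude of each frequency component
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

print(freqs[spectrum.argsort()[-2:]])        # the two dominant frequencies: ~12 Hz and ~5 Hz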

  1. You are given a dataset where the number of variables (p) is greater than the number of observations (n) (p>n). Which is the best technique to use and why ?

Answer:

When the number of variables is greater than the number of observations, we have a high-dimensional dataset. In such cases it is not possible to compute a unique least squares coefficient estimate. Penalized regression methods like LARS, Lasso or Ridge tend to work well under these circumstances, as they shrink the coefficients to reduce variance. Ridge regression tends to work best whenever the least squares estimates have high variance.
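A minimal sketch with scikit-learn on synthetic data where p > n (the dimensions and signal are chosen arbitrarily):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 50, 200                                      # more variables than observations
X = rng.normal(size=(n, p))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=n)   # only the first variable matters

model = Ridge(alpha=1.0).fit(X, y)                  # L2 penalty shrinks the coefficients
print(model.coef_[:3])                              # first coefficient dominates; the rest are shrunk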

  1. When will you use classification over regression ?

Answer: Classification is about identifying group membership, while regression involves predicting a response. Both techniques are related to prediction: classification predicts membership of a class, whereas regression predicts a value from a continuous set. Classification is preferred over regression when the model needs to return the belongingness of data points to specific explicit categories (for instance, when you want to find out whether a name is male or female, rather than how correlated it is with male and female names).


  1. If a highly positively skewed variable has missing values and we replace them with mean, do we underestimate or overestimate the values ?

Answer: Since in positively skewed data the mean is greater than the median, we overestimate the value of the missing observations.
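A quick NumPy check with an (arbitrarily chosen) exponential sample, which is positively skewed:

import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=100_000)   # positively skewed data

print(np.mean(sample), np.median(sample))           # mean (~2.0) exceeds median (~1.39)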

  1. What does linear in ‘linear regression’ actually mean ?

Answer: It implies that the dependent variable should be a linear function of the parameters. For the same reason, polynomial regression is classified as linear even though it fits a non-linear relationship between the dependent and independent variable.
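A minimal scikit-learn sketch: the model below is non-linear in x but is still fitted by ordinary linear regression, because it is linear in the coefficients of the expanded features (the quadratic data are made up):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2          # quadratic relationship

X_poly = PolynomialFeatures(degree=2).fit_transform(x)    # columns: 1, x, x^2
model = LinearRegression().fit(X_poly, y)                  # linear in the coefficients

print(model.coef_, model.intercept_)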

  1. What type of learning is needed when the system needs to adapt to rapidly changing data ?

Answer: Online learning. Because in Online learning each learning step is fast and cheap, and the system can be trained by feeding data instances sequentially.

  1. What kind of problems lend themselves to machine learning ?

Answer: Machine learning has become such a big deal largely because of big data: we now have access to so much data that machines can interact with it, so problems that involve exploiting very large amounts of data are where machine learning makes great progress. One of the big challenges for artificial intelligence is computer vision, something humans do incredibly well.

Example: when humans look at a picture they interpret it very well; for computers it is very difficult.

That is because we have traditionally tried to program such abilities from the bottom up, but now we can expose an algorithm to many pictures and it learns as it goes. So the ability of a machine to view its environment and interpret it is an area where a lot of progress can be made. More generally, machine learning is successful wherever the data is: for example, recommendations on the internet, or navigation, where every drive feeds new information into the system and adapts it to improve. Likewise, healthcare is one of the biggest fields, because a machine can study far more data than doctors could ever study or maintain.

  1. What is false positive and false negative in terms of machine learning ?

Answer: Suppose you perform some task, experiment or test. If the actual output of the test is negative but you predicted it as positive, that case is a false positive. A false negative is exactly the opposite: the actual output is positive but you predicted it as negative.

  1. What do you mean by parametric models? Also, give some examples of them ?

Answer: Parametric models are models with a limited (fixed) number of parameters; to predict new data you only need to know the parameters of the model. Examples of such models include logistic regression, linear regression, and linear SVMs.

  1. What is a neural network and what are some advantages and disadvantages of such a network ?

Answer: In the information technology field, a neural network is basically a system of hardware and/or software patterned after the neurons in the human brain; it constitutes an important part of deep learning. The greatest advantage of neural networks is that they lead to performance breakthroughs on unstructured data such as audio, video, and images, and their high flexibility enables them to learn patterns that no other ML algorithm can manage. The disadvantages are that they need a huge volume of training data to work effectively, and that picking the right architecture is difficult because their internal layers are hard to interpret.

  1. What is sigmoid function in Machine learning ?

Answer: A function that maps logistic or multinomial regression output (log odds) to probabilities, returning a value between 0 and 1.
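In code this is simply (a minimal NumPy sketch):

import numpy as np

def sigmoid(z):
    # maps any real-valued log-odds score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # ≈ [0.018, 0.5, 0.982]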


  1. What is batch size in machine learning ?

Answer: The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and inference.

  1. What is bucketing in machine learning ?

Answer: Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
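A minimal sketch of those three temperature bins with NumPy (the sample temperatures are made up):

import numpy as np

temperatures = np.array([3.4, 14.9, 15.1, 27.0, 30.1, 49.8])
bin_edges = [15.1, 30.1]                        # bucket 0: < 15.1, bucket 1: 15.1-30.0, bucket 2: >= 30.1

buckets = np.digitize(temperatures, bin_edges)  # assigns each value to a bucket index
print(buckets)                                  # [0 0 1 1 2 2]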

  1. What is checkpoint in machine learning ?

Answer: Data that captures the state of the variables of a model at a particular time. Checkpoints enable exporting model weights, as well as performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption). Note that the graph itself is not included in a checkpoint.

  1. What is collaborative filtering in machine learning ?

Answer: Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems.
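A minimal user-based sketch with NumPy and a made-up ratings matrix (rows = users, columns = items, 0 = unrated):

import numpy as np

ratings = np.array([[5., 4., 0., 1.],
                    [4., 5., 1., 0.],
                    [1., 0., 5., 4.]])           # hypothetical user-item ratings

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0                                        # recommend for user 0
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0                                  # ignore self-similarity

# predict user 0's scores from the similarity-weighted ratings of the other users
predicted = sims @ ratings / sims.sum()
print(predicted)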

  1. What is supervised and unsupervised machine learning ?

Answer:

In supervised machine learning, the algorithm is trained on a labelled dataset: the inputs are fed in together with the desired outputs, so the model is steadily guided to learn the mapping from the readily available data.

Unsupervised machine learning refers to a process where the machine analyses unlabelled input on its own; the outcomes are not known in advance, so the algorithm has to discover the structure in the data by itself.

  1. How to choose notable variables while working on a data set ?

Answer:

Removing correlated variables is the first step before selecting variables, since correlation reduces the unique information each variable contributes. Other important tools, such as linear regression (looking at coefficient significance), Random Forest feature importance and Lasso regression, are key to selecting variables in a machine learning process.
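A minimal scikit-learn sketch of the Lasso approach on synthetic data (only the first two features actually matter in this made-up example):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)        # features with non-zero coefficients
print(selected)                               # typically [0 1]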

  1. What is ‘Training set’ and ‘Test set’ ?

Answer:

Training set: the set of data used to discover potentially predictive relationships in areas of information science such as machine learning. It is the set of examples given to the learner.

Test set: the set of examples held back from the learner, used to test the accuracy of the hypotheses the learner generates.

  1. What is the standard approach to supervised learning ?

Answer:

Splitting the set of examples into a training set and a test set is the standard approach to supervised learning.


  1. What is Model Selection in Machine Learning ?

Answer: The process of choosing among diverse mathematical models used to describe the same data set is known as model selection. It is applied in statistics, data mining and machine learning.

  1. Explain the two components of a Bayesian logic program ?

Answer: A Bayesian logic program consists of two components. The first component is a logical one; it consists of a set of Bayesian clauses, which capture the qualitative structure of the domain. The second component is a quantitative one; it encodes the quantitative information about the domain.

  1. List some use cases where classification machine learning algorithms can be used ?

Answer:

Natural language processing (the best example is spoken language understanding)
Market segmentation
Text categorization (spam filtering)
Bioinformatics (classifying proteins according to their function)
Fraud detection
Face detection

  1. How much data will you allocate for your training, validation and test sets ?

Answer:

There is no exact answer to this question, but there needs to be a balance when allocating data to the training, validation and test sets.

If you make the training set too small, the estimated model parameters may have high variance; if the test set is too small, the estimate of model performance becomes unreliable. A general rule of thumb is an 80:20 train/test split, after which the training set can be further split to obtain a validation set.
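A minimal sketch of that 80:20 split, followed by a further split of the training portion to get a validation set (scikit-learn and its iris data assumed; the fractions are just the rule of thumb above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# carve a validation set out of the training portion (here 25% of it, i.e. 20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 90, 30, 30 for the 150-row iris data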

  1. What is the most frequent metric to assess model accuracy for classification problems ?

Answer:

Percent Correct Classification (PCC) measures the overall accuracy irrespective of the kind of errors that are made; all errors are considered to have the same weight.

  1. “People who bought this, also bought….” recommendations on Amazon are a result of which machine learning algorithm ?

Answer:

Recommender systems usually implement the collaborative filtering machine learning algorithm that considers user behaviour for recommending products to users. Collaborative filtering machine learning algorithms exploit the behaviour of users and products through ratings, reviews, transaction history, browsing history, selection and purchase information.

  1. Name some feature extraction techniques used for dimensionality reduction ?

Answer:
Independent Component Analysis (ICA)
Principal Component Analysis (PCA)
Kernel-based Principal Component Analysis (Kernel PCA)
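A minimal PCA sketch with scikit-learn, reducing the four-dimensional iris features to two components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)              # project onto the top 2 principal components

print(X_reduced.shape)                        # (150, 2)
print(pca.explained_variance_ratio_)          # share of variance captured by each component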


  1. What kind of problems does regularization solve ?

Answer: Regularization is used to address overfitting problems as it penalizes the loss function by adding a multiple of an L1 (LASSO) or an L2 (Ridge) norm of your weights vector w.

  1. Why is Manhattan distance not used in kNN machine learning algorithm to calculate the distance between nearest neighbours ?

Answer: Manhattan distance has restrictions on dimensions and measures distance only along axis-aligned (vertical or horizontal) directions. Euclidean distance is the better option in kNN for calculating the distance between nearest neighbours, because the data points can be represented in any space without any such restriction.
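For two points, the two distances compare as below (a minimal NumPy sketch with made-up coordinates):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

manhattan = np.sum(np.abs(a - b))      # |1-4| + |2-6| = 7, axis-aligned path
euclidean = np.linalg.norm(a - b)      # sqrt(3^2 + 4^2) = 5, straight-line path

print(manhattan, euclidean)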

  1. Why do we convert categorical variables into factor? Which function is used in R to perform the same ?

Answer: Most Machine learning algorithms require numbers as input. On converting categorical values to factors we get numerical values and also we don’t have to deal with dummy variables.

We can use both factor() and as.factor() to convert variables to factors.

  1. What is Standardization and Normalisation? Give one advantage of each over the other ?

Answer: Both are feature scaling techniques.

Standardization is less affected by outliers as compared to Normalisation.

Standardization doesn’t bound values to a specific range, which may be a problem for some algorithms that expect inputs bounded within a given range.
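A minimal scikit-learn sketch contrasting the two on made-up data with an outlier:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # last value is an outlier

standardized = StandardScaler().fit_transform(X)      # zero mean, unit variance, unbounded
normalised = MinMaxScaler().fit_transform(X)          # squeezed into [0, 1]; the outlier dominates

print(standardized.ravel())
print(normalised.ravel())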

  1. How is machine learning used at the moment ?

Answer: As far as I know, many people already use machine learning in their everyday life. When you engage with the internet you express your preferences, likes and dislikes through your searches; all of this is picked up by cookies on your computer, and from it we can evaluate a user’s behaviour and help improve the user’s progress through the internet. Navigation is another example, where machine learning and optimization techniques are used to find routes between places. Health is the field where I think people are going to engage with machine learning the most in the near future.

Example: Watson is already being used in healthcare. It looks at scans and body data and tries to recognize the symptoms of cancer. These are some of the ways machine learning is used at the moment.


  • tensor

A tensor consists of a set of primitive values shaped into an array of any number of dimensions. A tensor's rank is its number of dimensions. Here are some examples of tensors:

3 # a rank 0 tensor; a scalar with shape []
[1., 2., 3.] # a rank 1 tensor; a vector with shape [3]
[[1., 2., 3.], [4., 5., 6.]] # a rank 2 tensor; a matrix with shape [2, 3]
[[[1., 2., 3.]], [[7., 8., 9.]]] # a rank 3 tensor with shape [2, 1, 3]

More generally, a tensor is a mathematical object analogous to, but more general than, a vector, represented by an array of components that are functions of the coordinates of a space.

  • synapse

The link between neurons; in a neural network, a synapse corresponds to the weighted activation link between nodes in adjacent layers.