Table of Contents

1 Basic

  • tensor:

a generalization of the concept of a vector

2 Structure

2.1 The Computational Graph

A TensorFlow program is typically split into two parts: the first part builds a computation graph (this is called the construction phase), and the second part runs it (this is the execution phase). The construction phase typically builds a computation graph representing the ML model and the computations required to train it.

node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0) # also tf.float32 implicitly
print(node1, node2)
sess = tf.Session()
print(sess.run([node1, node2]))
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b  # + provides a shortcut for tf.add(a, b)

To make the model trainable, we need to be able to modify the graph to get new outputs with the same input. Variables allow us to add trainable parameters to a graph. They are constructed with a type and initial value:

W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W*x + b

Constants are initialized when you call tf.constant, and their value can never change. By contrast, variables are not initialized when you call tf.Variable. To initialize all the variables in a TensorFlow program, you must explicitly call a special operation as follows:

init = tf.global_variables_initializer()
sess.run(init)
y = tf.placeholder(tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
print(sess.run(loss, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))

A variable is initialized to the value provided to tf.Variable but can be changed using operations like tf.assign.

fixW = tf.assign(W, [-1.])
fixb = tf.assign(b, [1.])
sess.run([fixW, fixb])
print(sess.run(loss, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))

2.2 tf.train API

import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
for i in range(1000):
  sess.run(train, {x: x_train, y: y_train})

# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))

2.3 Implementing Gradient Descent

When using Gradient Descent, remember that it is important to first normalize the input feature vectors, or else training may be much slower.

2.4 Saving and Restoring Models

Once you have trained your model, you should save its parameters to disk so you can come back to it whenever you want, use it in another program, compare it to other models, and so on. Moreover, you probably want to save checkpoints at regular inter‐ vals during training so that if your computer crashes during training you can con‐ tinue from the last checkpoint rather than start over from scratch. TensorFlow makes saving and restoring a model very easy. Just create a Saver node at the end of the construction phase (after all variable nodes are created); then, in the execution phase, just call its save() method whenever you want to save the model, passing it the session and path of the checkpoint file.

3 Visualizing the Graph and Training Curves Using TensorBoard

The first step is to tweak your program a bit so it writes the graph definition and some training stats—for example, the training error (MSE)—to a log directory that TensorBoard will read from. You need to use a different log directory every time you run your program, or else TensorBoard will merge stats from different runs, which will mess up the visualizations. The simplest solution for this is to include a time‐ stamp in the log directory name. Add the following code at the beginning of the pro‐ gram:

from datetime import datetime
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)
mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

Next you need to update the execution phase to evaluate the msesummary node regularly during training (e.g., every 10 mini-batches). This will output a summary that you can then write to the events file using the filewriter. Here is the updated code:

for batch_index in range(n_batches):
    X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
    if batch_index % 10 == 0:
        summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
        step = epoch * n_batches + batch_index
        file_writer.add_summary(summary_str, step)
    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
[...]
file_writer.close()
  1. activate virtualenv.
  2. start tensorboard server.
tensorboard --logdir tf_logs/ --host 192.168.1.199 --port 8008
  1. open browser, go to http://192.168.1.199:8008/

4 TensorBoard: Embedding Visualization

4.1 Load data

Step 1: Load a TSV file of vectors. Example of 3 vectors with dimension 4: 0.1\t0.2\t0.5\t0.9 0.2\t0.1\t5.0\t0.2 0.4\t0.1\t7.0\t0.8

Step 2 (optional): Load a TSV file of metadata. Example of 3 data points and 2 columns. Note: If there is more than one column, the first row will be parsed as column labels. Pokémon\tSpecies Wartortle\tTurtle Venusaur\tSeed Charmeleon\tFlame

4.2 Projections

The Embedding Projector has three methods of reducing the dimensionality of a data set: two linear and one nonlinear. Each method can be used to create either a two- or three-dimensional view.

A straightforward technique for reducing dimensions is Principal Component Analysis (PCA). The Embedding Projector computes the top 10 principal components. The menu lets you project those components onto any combination of two or three. PCA is a linear projection, often effective at examining global geometry.

A popular non-linear dimensionality reduction technique is t-SNE. The Embedding Projector offers both two- and three-dimensional t-SNE views. Layout is performed client-side animating every step of the algorithm. Because t-SNE often preserves some local structure, it is useful for exploring local neighborhoods and finding clusters. Although extremely useful for visualizing high-dimensional data, t-SNE plots can sometimes be mysterious or misleading. See this great article for how to use t-SNE effectively.

5 Deep learning architecture

Tensorflow is a mathmatical library including neural network, Keras is a framework based on Tensorflow.

  • Lower Level: This is where frameworks like Tensorflow, MXNet, Theano, and PyTorch sit. This is the level where mathematical operations like Generalized Matrix-Matrix multiplication and Neural Network primitives like Convolutional operations are implemented.
  • Higher Level: This is where frameworks like Keras sit. At this Level, the lower level primitives are used to implement Neural Network abstraction like Layers and models. Generally, at this level other helpful APIs like model saving and model training are also implemented.

6 Recurrent Neural Networks (RNN)

6.1 output formular

\[y_{(t)}=\psi(x_{(t)}^t * w_x + y_{(t-1)}^t * w_y + b)\]

For the whole layer: \[Y_{(t)}=\psi(X_{(t)}^t * W_x + Y_{(t-1)}^t * W_y + b)\] \[=\psi([X_{(t)} Y_{(t-1)}]W+b)\] with \(W=[W_x W_y]^t\)

6.2 hidden state

In general a cell’s state at time step t, denoted h(t) (the “h” stands for “hidden”), is a function of some inputs at that time step and its state at the previous time step: h(t) = f(h(t–1), x(t)). Its output at time step t, denoted y(t), is also a function of the previous state and the current inputs.

6.3 structure

  • seq to seq: stock price.
  • seq to vector: sentiment.
  • vector to seq: caption of image.
  • delayed seq to seq: translation, Encoder-Decoder.

6.4 basic RNNs

n_inputs = 3
n_neurons = 5
X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])
Wx = tf.Variable(tf.random_normal(shape=[n_inputs, n_neurons],dtype=tf.float32))
Wy = tf.Variable(tf.random_normal(shape=[n_neurons,n_neurons],dtype=tf.float32))
b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))
Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)
init = tf.global_variables_initializer()

# feed data
import numpy as np
# Mini-batch: instance 0,instance 1,instance 2,instance 3
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t = 1
with tf.Session() as sess:
  init.run()
  Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})

6.5 Static Unrolling Through Time

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, [X0, X1], dtype=tf.float32)
Y0, Y1 = output_seqs

The staticrnn() function calls the cell factory’s _call_() function once per input, creating two copies of the cell (each containing a layer of five recurrent neurons), with shared weights and bias terms, and it chains them just like we did earlier.

The staticrnn() function returns two objects.

The first is a Python list containing the output tensors for each time step.

The second is a tensor containing the final states of the network. When you are using basic cells, the final state is simply equal to the last output.

6.6 Dynamic Unrolling Through Time

The dynamicrnn() function uses a whileloop() operation to run over the cell the appropriate number of times, and you can set swapmemory=True if you want it to swap the GPU’s memory to the CPU’s memory during backpropagation to avoid OOM errors. Conveniently, it also accepts a single tensor for all inputs at every time step (shape [None, nsteps, ninputs]) and it outputs a single tensor for all out‐ puts at every time step (shape [None, nsteps, nneurons]); there is no need to stack, unstack, or transpose. The following code creates the same RNN as earlier using the dynamicrnn() function. It’s so much nicer!

6.7 Training RNNs

To train an RNN, the trick is to unroll it through time (like we just did) and then simply use regular backpropagation. This strategy is called backpro‐ pagation through time (BPTT).

6.8 The difficulty of Training over Many Time Steps

  • good parameter initialization
  • nonsaturating activation functions (e.g., ReLU),
  • Batch Normalization,
  • Gradient Clip‐
  • ping, and faster optimizers.

6.9 Long Short Term Memory(LSTM)

lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)

LSTM cells manage two state vectors, and for performance reasons they are kept separate by default. LSTM_cell.png its state is split in two vectors: h(t) and c(t) (“c” stands for “cell”). You can think of h(t) as the short-term state and c(t) as the long-term state.

As the long-term state c(t–1) traverses the network from left to right, you can see that it first goes through a forget gate, dropping some memories, and then it adds some new memories via the addition operation (which adds the memories that were selected by an input gate). The result c(t) is sent straight out, without any further transformation. So, at each time step, some memories are dropped and some memories are added. Moreover, after the addition operation, the long-term state is copied and passed through the tanh func‐ tion, and then the result is filtered by the output gate. This produces the short-term state h(t) (which is equal to the cell’s output for this time step y(t)).

The main layer is the one that outputs g(t). It has the usual role of analyzing the current inputs x(t) and the previous (short-term) state h(t–1). In a basic cell, there is nothing else than this layer, and its output goes straight out to y(t) and h(t). In con‐ trast, in an LSTM cell this layer’s output does not go straight out, but instead it is partially stored in the long-term state.

The three other layers are gate controllers. Since they use the logistic activation function, their outputs range from 0 to 1. As you can see, their outputs are fed to element-wise multiplication operations, so if they output 0s, they close the gate, and if they output 1s, they open it. Specifically: — The forget gate (controlled by f(t)) controls which parts of the long-term state should be erased. — The input gate (controlled by i(t)) controls which parts of g(t) should be added to the long-term state (this is why we said it was only “partially stored”). — Finally, the output gate (controlled by o(t)) controls which parts of the longterm state should be read and output at this time step (both to h(t)) and y(t). In short, an LSTM cell can learn to recognize an important input (that’s the role of the input gate), store it in the long-term state, learn to preserve it for as long as it is needed (that’s the role of the forget gate), and learn to extract it whenever it is needed. This explains why they have been amazingly successful at capturing long-term pat‐ terns in time series, long texts, audio recordings, and more.

6.10 Gated Recurrent Unit(GRU)

The main simplifications are: • Both state vectors are merged into a single vector h(t). • A single gate controller controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed. If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a fre‐ quent variant to the LSTM cell in and of itself. • There is no output gate; the full state vector is output at every time step. How‐ ever, there is a new gate controller that controls which part of the previous state will be shown to the main layer.