The goal of this assignment is to explore regularization techniques. The original notebook can be found here

Import libraries

# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
from tqdm import tnrange
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

Load NotMNIST dataset

First reload the data we generated in 1_notmnist.ipynb.

pickle_file = 'datasets/notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    X_train = save['train_dataset']
    Y_train = save['train_labels']
    X_valid = save['valid_dataset']
    Y_valid = save['valid_labels']
    X_test = save['test_dataset']
    Y_test = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', X_train.shape, Y_train.shape)
    print('Validation set', X_valid.shape, Y_valid.shape)
    print('Test set', X_test.shape, Y_test.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)

Reformat dataset

Reformat into a shape that's more adapted to the models we're going to train:

data as a flat matrix,
labels as float 1-hot encodings.

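As in the previous notebook, this reformat operation differs from the one suggested by the original notebook: both arrays are transposed so that each example is a column, giving data of shape (features, examples) and labels of shape (classes, examples).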

image_size = 28
num_labels = 10


def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32).T
    
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32).T
    return dataset, labels


X_train, Y_train = reformat(X_train, Y_train)
X_valid, Y_valid = reformat(X_valid, Y_valid)
X_test, Y_test = reformat(X_test, Y_test)
print('Training set', X_train.shape, Y_train.shape)
print('Validation set', X_valid.shape, Y_valid.shape)
print('Test set', X_test.shape, Y_test.shape)

Training set (784, 200000) (10, 200000)
Validation set (784, 10000) (10, 10000)
Test set (784, 10000) (10, 10000)
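
To see what the broadcasting comparison inside reformat produces, here is a minimal sketch on a made-up array of four labels and three classes. Both the toy labels and the class count are assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical toy input: 4 labels drawn from 3 classes.
toy_labels = np.array([0, 2, 1, 2])
toy_num_labels = 3

# Compare every class index against every label: the result has shape (4, 3),
# one row per example.
one_hot_rows = (np.arange(toy_num_labels) == toy_labels[:, None]).astype(np.float32)

# Transpose so that each example becomes a column, as reformat() does.
one_hot_cols = one_hot_rows.T

print(one_hot_cols.shape)  # (3, 4)
print(one_hot_cols)
```

Each column contains a single 1.0 in the row of its class, which is the layout the loss and accuracy computations below rely on.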

Using Accuracy as Default Metric

As we saw earlier, the classes in this dataset are balanced,
so accuracy alone is sufficient for evaluating our model on this classification task.

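def accuracy(predictions, labels):
    # both arrays have shape (num_labels, num_examples): one column per example
    correct = np.argmax(predictions, axis=0) == np.argmax(labels, axis=0)
    # percentage of columns whose predicted class matches the label
    return np.sum(correct) / labels.shape[1] * 100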
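
As a quick sanity check of the column-per-example convention, the sketch below applies the same computation to a hand-made batch of four examples and three classes. The function is repeated so the snippet runs on its own, and the toy arrays are assumptions for illustration.

```python
import numpy as np


def accuracy(predictions, labels):
    # argmax over axis 0 picks the class of each column (example)
    correct = np.argmax(predictions, axis=0) == np.argmax(labels, axis=0)
    return np.sum(correct) / labels.shape[1] * 100


# Hypothetical predictions: columns 0, 1 and 2 are right, column 3 is wrong.
toy_predictions = np.array([[0.7, 0.1, 0.2, 0.3],
                            [0.2, 0.8, 0.3, 0.6],
                            [0.1, 0.1, 0.5, 0.1]])
toy_labels = np.array([[1.0, 0.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 0.0]])

print(accuracy(toy_predictions, toy_labels))  # 75.0
```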

3-layer NN as base model

To compare training with and without regularization, we will use a slightly more complex neural network with three hidden layers as our base model, with ReLU as the activation function.
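
A minimal NumPy sketch of the forward pass for such a network, assuming the column-per-example layout used in this notebook. The layer sizes are small stand-ins rather than the real ones, and the sqrt(2 / fan_in) weight scale is an assumption that is commonly paired with ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [6, 5, 4, 3, 2]  # hypothetical input, three hidden layers, output
batch = 7

# Weights have shape (n_out, n_in); biases have shape (n_out, 1) and
# broadcast across the columns of the batch.
weights = [rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]


def forward(X):
    A = X
    # hidden layers: affine transform followed by ReLU
    for W, b in zip(weights[:-1], biases[:-1]):
        A = np.maximum(W @ A + b, 0.0)
    # output layer returns raw logits, one column per example
    return weights[-1] @ A + biases[-1]


logits = forward(rng.normal(size=(sizes[0], batch)))
print(logits.shape)  # (2, 7)
```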

Hyper parameters

# hyper parameters
learning_rate = 1e-2
lamba = 1e-3
keep_prob = 0.5
batch_size = 128
num_steps = 501
n0 = image_size * image_size # input size
n1 = 1024 # first hidden layer
n2 = 512 # second hidden layer
n3 = 256 # third hidden layer
n4 = num_labels # output size

Build model

# build a model that lets us choose between different optimization mechanisms
def model(lamba=0, learning_rate=learning_rate,
          keep_prob=1, learning_decay=False,
          batch_size=batch_size, num_steps=num_steps, n1=n1, n2=n2, n3=n3):
    print(
    """
    Train 3-layer NN with following settings:
    Regularization lambda: {}
    Learning rate: {}
    learning_decay: {}
    keep_prob: {}
    Batch_size: {}
    Number of steps: {}
    n1, n2, n3: {}, {}, {}""".format(lamba, learning_rate,
                                     learning_decay, keep_prob,
                                     batch_size, num_steps, n1, n2, n3))
    
    # construct computation graph
    graph = tf.Graph()
    with graph.as_default():
        # placeholder for mini-batch when training 
        X = tf.placeholder(tf.float32, shape=(n0, batch_size))
        Y = tf.placeholder(tf.float32, shape=(num_labels, batch_size))
        global_step = tf.Variable(0)

        # use all valid/test set
        tf_X_valid = tf.constant(X_valid)
        tf_X_test = tf.constant(X_test)

        # initialize weights, biases
        # notice that we have three hidden
        # layers, so we now have W1, b1 ... W4, b4
        W1 = tf.Variable(tf.truncated_normal([n1, n0], stddev=np.sqrt(2.0 / n0)))
        W2 = tf.Variable(tf.truncated_normal([n2, n1], stddev=np.sqrt(2.0 / n1)))
        W3 = tf.Variable(tf.truncated_normal([n3, n2], stddev=np.sqrt(2.0 / n2)))
        W4 = tf.Variable(tf.truncated_normal([n4, n3], stddev=np.sqrt(2.0 / n3)))
        b1 = tf.Variable(tf.zeros([n1, 1]))
        b2 = tf.Variable(tf.zeros([n2, 1]))
        b3 = tf.Variable(tf.zeros([n3, 1]))
        b4 = tf.Variable(tf.zeros([n4, 1]))


        # training computation
        Z1 = tf.matmul(W1, X) + b1
        A1 = tf.nn.relu(Z1) if keep_prob == 1 else tf.nn.dropout(tf.nn.relu(Z1), keep_prob)
        Z2 = tf.matmul(W2, A1) + b2
        A2 = tf.nn.relu(Z2) if keep_prob == 1 else tf.nn.dropout(tf.nn.relu(Z2), keep_prob)
        Z3 = tf.matmul(W3, A2) + b3
        A3 = tf.nn.relu(Z3) if keep_prob == 1 else tf.nn.dropout(tf.nn.relu(Z3), keep_prob)
        Z4 = tf.matmul(W4, A3) + b4

        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                labels=tf.transpose(Y), logits=tf.transpose(Z4)))
        if lamba:
            loss += lamba * \
            (tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2) + tf.nn.l2_loss(W3) + tf.nn.l2_loss(W4))

        # optimizer
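        if learning_decay:
            # note: the decayed schedule starts from a fixed base rate of 0.5
            # and ignores the learning_rate argument
            learning_rate = tf.train.exponential_decay(0.5, global_step, 5000, 0.80, staircase=True)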
            optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
        else:
            optimizer = (tf.train
                         .GradientDescentOptimizer(learning_rate).minimize(loss))

        # valid / test prediction
        Y_pred = tf.nn.softmax(Z4, dim=0)
        Y_vaild_pred = tf.nn.softmax(
            tf.matmul(W4, tf.nn.relu(
                tf.matmul(W3, tf.nn.relu(
                    tf.matmul(W2, tf.nn.relu(
                        tf.matmul(W1, tf_X_valid) + b1)) + b2)) + b3)) + b4, dim=0)
        Y_test_pred = tf.nn.softmax(
            tf.matmul(W4, tf.nn.relu(
                tf.matmul(W3, tf.nn.relu(
                    tf.matmul(W2, tf.nn.relu(
                        tf.matmul(W1, tf_X_test) + b1)) + b2)) + b3)) + b4, dim=0)
    
    # define training
    with tf.Session(graph=graph) as sess:
        # initialize parameters
        tf.global_variables_initializer().run()
        print("Initialized")
        for step in tnrange(num_steps):

            # pick the offset of the next mini-batch within the training data,
            # wrapping around when the end is reached
            offset = (step * batch_size) % (Y_train.shape[1] - batch_size)
            batch_X = X_train[:, offset:(offset + batch_size)]
            batch_Y = Y_train[:, offset:(offset + batch_size)]

            # train model
            _, l, batch_Y_pred = sess.run(
                [optimizer, loss, Y_pred], feed_dict={X: batch_X, Y: batch_Y})

            if (step % 200 == 0):
                print('Minibatch loss at step {}: {:.3f}. batch acc: {:.1f}%, Valid acc: {:.1f}%.'\
                      .format(step, l,
                              accuracy(batch_Y_pred, batch_Y),
                              accuracy(Y_vaild_pred.eval(), Y_valid)))

        print('Test acc: {:.1f}%'.format(accuracy(Y_test_pred.eval(), Y_test)))
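
The offset arithmetic in the training loop can be checked on its own. The sketch below reuses the training-set size and batch size shown earlier; batch_offset is a hypothetical helper name, not part of the notebook.

```python
num_examples = 200000  # training set size reported above
batch_size = 128


def batch_offset(step):
    # same arithmetic as the training loop
    return (step * batch_size) % (num_examples - batch_size)


for step in (0, 1, 1561, 1562):
    print(step, batch_offset(step))
```

Every slice stays inside the array, and after one pass the offsets wrap around with a shift, so later passes use slightly different batch boundaries.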

Train model without regularization

model(learning_rate=0.5, num_steps=1601)

    Train 3-layer NN with following settings:
    Regularization lambda: 0
    Learning rate: 0.5
    learning_decay: False
    keep_prob: 1
    Batch_size: 128
    Number of steps: 1601
    n1, n2, n3: 1024, 512, 256
Initialized

Minibatch loss at step 0: 2.374. batch acc: 14.1%, Valid acc: 28.4%.
Minibatch loss at step 200: 0.600. batch acc: 82.0%, Valid acc: 84.9%.
Minibatch loss at step 400: 0.429. batch acc: 89.8%, Valid acc: 85.8%.
Minibatch loss at step 600: 0.372. batch acc: 87.5%, Valid acc: 85.7%.
Minibatch loss at step 800: 0.454. batch acc: 89.1%, Valid acc: 87.7%.
Minibatch loss at step 1000: 0.374. batch acc: 87.5%, Valid acc: 88.1%.
Minibatch loss at step 1200: 0.251. batch acc: 91.4%, Valid acc: 88.8%.
Minibatch loss at step 1400: 0.397. batch acc: 89.8%, Valid acc: 89.0%.
Minibatch loss at step 1600: 0.470. batch acc: 82.0%, Valid acc: 88.9%.

Test acc: 94.2%

L2 regularization

Introduce and tune L2 regularization for the models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor t using nn.l2_loss(t). The right amount of regularization should improve your validation / test accuracy.
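
As a small worked example of the penalty, the sketch below re-implements the quantity that tf.nn.l2_loss computes (half the sum of squared entries) in NumPy and adds it to a loss value. The weight matrices, the data loss of 0.4 and the value of lam are made-up numbers for illustration only.

```python
import numpy as np


def l2_loss(t):
    # same definition as tf.nn.l2_loss: sum(t ** 2) / 2
    return np.sum(np.square(t)) / 2.0


# Hypothetical weights and data loss.
W_a = np.array([[1.0, -2.0],
                [0.5, 0.0]])
W_b = np.array([[3.0],
                [-1.0]])
data_loss = 0.4
lam = 1e-3

# The penalty is summed over all weight matrices and scaled by lambda.
penalty = l2_loss(W_a) + l2_loss(W_b)
total_loss = data_loss + lam * penalty

print(penalty)
print(total_loss)
```

Biases are left out of the penalty here, matching the model above, which only regularizes W1 to W4.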

# for lamda in [1 / 10 ** i for i in list(np.arange(1, 4))]:
#     model(lamba=lamda)
    
model(lamba=0.1, learning_rate=0.01)

    Train 3-layer NN with following settings:
    Regularization lambda: 0.1
    Optimizer: sgd
    Learning rate: 0.01
    Batch_size: 128
    Number of steps: 501
    n1, n2: 512, 256
Initialized

Minibatch loss at step 0: 22969.777. batch acc: 9.4%, Valid acc: 19.3%.
Minibatch loss at step 200: 13876.185. batch acc: 74.2%, Valid acc: 75.2%.
Minibatch loss at step 400: 9266.566. batch acc: 78.1%, Valid acc: 74.3%.

Test acc: 81.4%

Case of overfitting

Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

model(num_steps=10)

    Train 3-layer NN with following settings:
    Regularization lambda: 0
    Learning rate: 0.01
    learning_decay: False
    keep_prob: 1
    Batch_size: 128
    Number of steps: 10
    n1, n2, n3: 1024, 512, 256
Initialized

Minibatch loss at step 0: 2.442. batch acc: 8.6%, Valid acc: 11.4%.

Test acc: 20.7%
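
The call above sees only ten batches, but it also stops after a single pass over them, so the network has little chance to memorize them. One way to keep training for many steps on the same few batches, sketched here as an assumption and not as something the model function does, is to cycle the offset over a fixed number of batches. The names num_batches and restricted_offset are hypothetical.

```python
batch_size = 128
num_batches = 3  # hypothetical: only the first 3 batches are ever used


def restricted_offset(step):
    # cycle through the first num_batches batches, however large step gets
    return (step % num_batches) * batch_size


offsets = [restricted_offset(step) for step in range(7)]
print(offsets)  # [0, 128, 256, 0, 128, 256, 0]
```

Using such an offset in the training loop with a large num_steps would show the training examples to the network repeatedly, which is the setting where a gap between batch accuracy and validation accuracy is expected to appear.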

Dropout

Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides nn.dropout() for that, but you have to make sure it's only inserted during training.
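
The train-only behaviour can be sketched in NumPy. Like tf.nn.dropout, the sketch scales the surviving activations by 1 / keep_prob so their expected value is unchanged; the training flag and the function signature are assumptions made for this illustration.

```python
import numpy as np


def dropout(activations, keep_prob, training, rng):
    # at evaluation time (or with keep_prob == 1) activations pass through
    if not training or keep_prob == 1:
        return activations
    # keep each unit with probability keep_prob, then rescale the survivors
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
A = np.ones((4, 5))

train_out = dropout(A, keep_prob=0.5, training=True, rng=rng)
eval_out = dropout(A, keep_prob=0.5, training=False, rng=rng)

print(train_out)  # entries are either 0.0 or 2.0
print(eval_out)   # identical to A
```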

What happens to our extreme overfitting case?

model(num_steps=10, keep_prob=0.5)

    Train 3-layer NN with following settings:
    Regularization lambda: 0
    Learning rate: 0.01
    learning_decay: False
    keep_prob: 0.5
    Batch_size: 128
    Number of steps: 10
    n1, n2, n3: 1024, 512, 256
Initialized

Minibatch loss at step 0: 2.784. batch acc: 7.0%, Valid acc: 10.0%.

Test acc: 17.3%

Boost performance by using Multi-layer NN

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is 97.1%.

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

global_step = tf.Variable(0)  # count the number of steps taken.
learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
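
The schedule behind tf.train.exponential_decay is base_rate * decay_rate ** (step / decay_steps), with the exponent rounded down when staircase=True. The plain-Python sketch below is a hypothetical re-implementation for illustration; it uses the arguments passed inside the model function (0.5, 5000, 0.80).

```python
def exponential_decay(base_rate, step, decay_steps, decay_rate, staircase=False):
    # staircase=True decays in discrete jumps every decay_steps steps
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_rate * decay_rate ** exponent


for step in (0, 1500, 5000, 10000):
    print(step, exponential_decay(0.5, step, 5000, 0.80, staircase=True))
```

With these settings the rate stays at 0.5 for the first 5000 steps, so a run of 1501 steps never reaches the first decay.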

model(learning_decay=True, num_steps=1501, lamba=0, keep_prob=1)

    Train 3-layer NN with following settings:
    Regularization lambda: 0
    Learning rate: 0.01
    learning_decay: True
    keep_prob: 1
    Batch_size: 128
    Number of steps: 1501
    n1, n2, n3: 1024, 512, 256
Initialized

Minibatch loss at step 0: 2.395. batch acc: 12.5%, Valid acc: 37.0%.

Minibatch loss at step 200: 0.589. batch acc: 82.0%, Valid acc: 84.7%.
Minibatch loss at step 400: 0.409. batch acc: 89.1%, Valid acc: 86.2%.
Minibatch loss at step 600: 0.396. batch acc: 88.3%, Valid acc: 86.5%.
Minibatch loss at step 800: 0.435. batch acc: 88.3%, Valid acc: 87.6%.
Minibatch loss at step 1000: 0.407. batch acc: 85.2%, Valid acc: 88.5%.
Minibatch loss at step 1200: 0.262. batch acc: 91.4%, Valid acc: 88.9%.
Minibatch loss at step 1400: 0.411. batch acc: 87.5%, Valid acc: 88.8%.

Test acc: 94.3%


Regularization for Multi-layer Neural Networks in Tensorflow