Neural Networks

For part 1 of the homework you will write your own implementation of the backpropagation algorithm for training a neural network, along with a few other features such as activations and batch normalization. You are required to do this assignment in Python 3. Do not use any autodiff toolboxes (PyTorch, TensorFlow, Keras, etc.). The only library you are permitted, and encouraged, to use for vectorizing your computation is Numpy.

In the file we provide you with classes whose methods you have to fill in. The autograder tests will compare the outputs of your methods and the attributes of your classes with a reference solution. The design of your code is therefore largely fixed; however, you still have a lot of freedom in your implementation. We recommend that you look through all of the problems before attempting the first one. We also recommend that you complete the problems in order, as the difficulty increases and questions often rely on the completion of previous questions.


A number of activation classes, including Sigmoid, Tanh and ReLU, must be implemented. Each of these comes in the form of a class with a forward method and a derivative method, both of which you must implement. The Identity activation has been implemented for you as an example. All of these classes derive from Activation, which acts as an abstract base class / interface.

You only need to implement the two methods for each class. No helper functions or additional classes should be necessary. Keep your code as concise as possible and leverage Numpy as much as possible. Be mindful that you are working with matrices which have a given shape as well as a set of values.

Softmax Cross Entropy

For this homework, we will be using the softmax cross entropy loss detailed in the appendix of this writeup. Use the LogSumExp trick to ensure numerical stability. Implement the methods specified in the class definition SoftmaxCrossEntropy. This class inherits from the base Criterion class. Again, no helper methods should be needed, and you should strive to keep your code as simple as possible. Use 1e-8 for eps.


Implement the softmax cross entropy operation on a batch of output vectors. Hint: Add a class attribute to keep track of intermediate values necessary for the backward computation.

  1. Input shapes:
    1. x: (batch size, 10)
    2. y: (batch size, 10)
  2. Output Shape:
    1. out: (batch size,)
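As an illustration of the LogSumExp trick, here is a minimal sketch written as a standalone function rather than as the SoftmaxCrossEntropy class itself. The function name and return values are hypothetical, the labels y are assumed to be one-hot, and the sketch does not use eps; the exact formulation to match is the one in the appendix.

```python
import numpy as np

def softmax_cross_entropy_forward(x, y):
    """Hypothetical helper: x are logits (batch, classes), y are one-hot labels."""
    # LogSumExp trick: subtract the row-wise max before exponentiating
    # so that np.exp never overflows.
    a = np.max(x, axis=1, keepdims=True)
    log_sum_exp = a + np.log(np.sum(np.exp(x - a), axis=1, keepdims=True))
    log_softmax = x - log_sum_exp
    softmax = np.exp(log_softmax)  # an intermediate worth saving for backward
    loss = -np.sum(y * log_softmax, axis=1)  # one loss value per example
    return loss, softmax

# Large logits would overflow a naive exp(x) / sum(exp(x)).
x = np.array([[1000.0, 1000.0], [0.0, 0.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])
loss, softmax = softmax_cross_entropy_forward(x, y)
```

The loss has one entry per example, matching the (batch size,) output shape above.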


Perform the 'backward' pass of the softmax cross entropy operation using intermediate values saved in the forward pass.

  1. Output shapes:
    1. out: (batch size, 10)
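For one-hot labels, the derivative of the per-example softmax cross entropy loss with respect to the logits is the softmax output minus the labels. The sketch below assumes this per-example convention (no averaging over the batch); the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the class axis.
    shifted = x - np.max(x, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_cross_entropy_backward(sm, y):
    """Hypothetical helper: sm is the softmax saved in forward, y is one-hot."""
    return sm - y

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))
y = np.eye(10)[rng.integers(0, 10, size=4)]  # one-hot labels
grad = softmax_cross_entropy_backward(softmax(x), y)
```

Each row of the gradient sums to zero, because both the softmax output and the one-hot label sum to one.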

Multi-Layer Perceptron (MLP)

In this section of the homework, you will be implementing a Multi-Layer Perceptron with an API similar to popular automatic differentiation libraries like PyTorch, which you will be allowed and encouraged to use in the second part of the homework. A template of the MLP class is provided. Go through the functions of the given MLP class thoroughly and make sure you understand what each function in the class does, so that you can create a generic implementation that supports an arbitrary number of layers, types of activations and network sizes. This section is purely descriptive and for reference as you tackle the next problems. Nothing to implement... yet.

The parameters for the MLP class are:

  1. input_size: The size of each individual data example.
  2. output_size: The number of outputs.
  3. hiddens: A list with the number of units in each hidden layer.
  4. activations: A list of Activation objects for each layer.
  5. weight_init_fn: A function applied to each weight matrix before training.
  6. bias_init_fn: A function applied to each bias vector before training.
  7. criterion: A Criterion object to compute the loss and its derivative.
  8. lr: The learning rate.
  9. momentum: Momentum scale (should be 0.0 until completing 3.6).
  10. num_bn_layers: Number of BatchNorm layers, counted from the input side (should be 0 until completing 3.5).

The attributes of the MLP class are:

  1. @W: The list of weight matrices.
  2. @dW: The list of weight matrix gradients.
  3. @b: A list of bias vectors.
  4. @db: A list of bias vector gradients.
  5. @bnlayers: A list of BatchNorm objects (should be None until completing 3.5).

The methods of the MLP class are:

  1. forward: Forward pass. Accepts a mini-batch of data and returns a batch of output activations.
  2. backward: Backward pass. Accepts ground truth labels and computes gradients for all parameters.
    Hint: Use state stored in activations during forward pass to simplify your code.
  3. zerograds: Set all gradient terms to 0.
  4. step: Apply gradients computed in backward to the parameters.
  5. train (already implemented): Set the mode of the network to train.
  6. eval (already implemented): Set the mode of the network to evaluation.

Note: MLP methods train and eval will be useful in 3.5.

Note: Pay attention to the data structures being passed into the constructor and the class attributes specified initially.

Sample constructor call: MLP(784, 10, [64, 64, 32], [Sigmoid(), Sigmoid(), Sigmoid(), Identity()], weight_init_fn, bias_init_fn, SoftmaxCrossEntropy(), 0.008, momentum=0.9, num_bn_layers=0)

Weight Initialization

Good initialization has been shown to make a great difference in the training process. Implement two basic parameter initialization functions:

  1. random_normal_weight_init: Accepts two ints specifying the number of inputs and units. Returns an appropriately shaped ndarray with each entry initialized with an independent sample from the standard normal distribution.
  2. zeros_bias_init: Accepts the number of bias units and returns an appropriately shaped ndarray with each entry initialized to zero.
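A minimal sketch of the two initializers follows. The argument names d0, d1 and d are hypothetical, and the exact array shapes the autograder expects (for example, whether the bias is of shape (d,) or (1, d)) are an assumption here; check them against the handout code.

```python
import numpy as np

def random_normal_weight_init(d0, d1):
    # One independent standard normal sample per entry,
    # with d0 inputs and d1 units.
    return np.random.randn(d0, d1)

def zeros_bias_init(d):
    # Assumed shape (d,); the reference solution may use (1, d) instead.
    return np.zeros(d)

W0 = random_normal_weight_init(784, 64)
b0 = zeros_bias_init(64)
```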

Linear Classifier

Implement a linear classifier (MLP with no hidden layers) using the MLP class. We suggest you read through the entire assignment before you start implementing this part. For this problem, you will have to fill in parts of the MLP class and pass the following parameters:

  1. input_size: The size of each individual data example.
  2. output_size: The number of outputs.
  3. hiddens: An empty list because there are no hidden layers.
  4. activations: A single Identity activation.
  5. weight_init_fn: A weight initializer function. You will write one for local testing; the autograder will pass in its own.
  6. bias_init_fn: A bias initializer function. You will write one for local testing; the autograder will pass in its own.
  7. criterion: A SoftmaxCrossEntropy object.

Consider that a linear transformation with an identity activation is simply a linear model.

You will have to implement the forward and backward functions of the MLP class so that it can at least work as a linear classifier. All parameters should be maintained as the class attributes specified in the original handout code (e.g. self.W, self.b).

The step function also needs to be implemented, as it will be invoked after every backward pass to update the parameters of the network (the gradient descent algorithm). After all parts are filled in, train the model for 100 epochs with a batch size of 100 and try visualizing the training curves using the provided utilities.
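To make the forward, backward and step sequence concrete, here is a standalone sketch of one gradient descent update for a linear classifier on random data. It is not the MLP class: the function name is hypothetical, and averaging the gradients over the batch is an assumed convention that may differ from the reference solution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 784))          # one mini-batch of inputs
y = np.eye(10)[rng.integers(0, 10, size=100)]  # one-hot labels

W = rng.standard_normal((784, 10))
b = np.zeros((1, 10))
lr = 0.008

def loss_and_grads(x, y, W, b):
    # Forward: linear transformation followed by softmax cross entropy.
    z = x @ W + b
    z = z - z.max(axis=1, keepdims=True)
    log_sm = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss = -(y * log_sm).sum(axis=1).mean()
    # Backward: gradient w.r.t. the logits, averaged over the batch (assumption).
    dz = (np.exp(log_sm) - y) / x.shape[0]
    dW = x.T @ dz
    db = dz.sum(axis=0, keepdims=True)
    return loss, dW, db

loss_before, dW, db = loss_and_grads(x, y, W, b)
# Step: plain gradient descent update.
W -= lr * dW
b -= lr * db
loss_after, _, _ = loss_and_grads(x, y, W, b)
```

With a small enough learning rate, the loss on the same batch should go down after the update, which is a quick sanity check for your own implementation.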


Linear models are cool, but we are here to build neural networks.

Implement all of the non-linearities listed below. A sample implementation of the Identity activation is provided for reference.

The output of the activation should be stored in the self.state variable of the class. The self.state variable should then be used to calculate the derivative during the backward pass. You have to implement the forward and derivative functions for the following classes. The autograder will test these implementations individually.

  1. Sigmoid
  2. Hyperbolic Tangent (Tanh)
  3. Rectified Linear unit (ReLU)
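The sketch below shows one way the three activations could store their output in self.state and reuse it in derivative. The Activation base class here is a hypothetical stand-in for the one in the handout, whose exact interface may differ.

```python
import numpy as np

class Activation:
    """Hypothetical stand-in for the handout's base class."""
    def __init__(self):
        self.state = None

    def __call__(self, x):
        return self.forward(x)

class Sigmoid(Activation):
    def forward(self, x):
        self.state = 1.0 / (1.0 + np.exp(-x))
        return self.state

    def derivative(self):
        # d/dx sigmoid(x) = s * (1 - s), expressed via the stored output.
        return self.state * (1.0 - self.state)

class Tanh(Activation):
    def forward(self, x):
        self.state = np.tanh(x)
        return self.state

    def derivative(self):
        # d/dx tanh(x) = 1 - tanh(x)^2
        return 1.0 - self.state ** 2

class ReLU(Activation):
    def forward(self, x):
        self.state = np.maximum(x, 0.0)
        return self.state

    def derivative(self):
        # 1 where the output is positive, 0 elsewhere.
        return (self.state > 0).astype(float)
```

All three derivatives can be written purely in terms of the stored output, which is why saving it in self.state is enough.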

Hidden Layers

Update the MLP class, which previously supported only a single fully connected layer (a linear model), so that it can now support an arbitrary number of hidden layers, each with an arbitrary number of units.

Specifically, the hiddens argument of the MLP class will no longer be assumed to be an empty list and can be of arbitrary length (arbitrary number of layers) and contain arbitrary positive integers (arbitrary number of units in each layer). The implementation should apply the Activation objects, passed as activations, to their respective layers. For example, activations[0] should be applied to the activity of the first hidden layer. While at risk of being pedantic, here is a clarification of the expected arguments to be passed to the MLP constructor once it can support arbitrary numbers of layers with an arbitrary assortment of activation functions.

  1. input_size: The size of each individual data example.
  2. output_size: The number of outputs.
  3. hiddens: A list of layer sizes (number of units per layer).
  4. activations: A list of Activation objects to be applied after each linear transformation respectively.
  5. weight_init_fn: A weight initializer function. You will write one for local testing; the autograder will pass in its own.
  6. bias_init_fn: A bias initializer function. You will write one for local testing; the autograder will pass in its own.
  7. criterion: A SoftmaxCrossEntropy object.
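One way to think about an arbitrary number of layers is to chain the sizes [input_size] + hiddens + [output_size] into consecutive pairs, one pair per weight matrix. The sketch below uses hypothetical standalone functions and plain callables (np.tanh and an identity lambda) as stand-ins for the Activation objects.

```python
import numpy as np

def build_params(input_size, output_size, hiddens):
    # Consecutive pairs of sizes give the shape of each weight matrix.
    sizes = [input_size] + list(hiddens) + [output_size]
    W = [np.random.randn(d_in, d_out) for d_in, d_out in zip(sizes[:-1], sizes[1:])]
    b = [np.zeros((1, d_out)) for d_out in sizes[1:]]
    return W, b

def forward(x, W, b, activations):
    out = x
    # activations[i] is applied after the i-th linear transformation.
    for W_i, b_i, act in zip(W, b, activations):
        out = act(out @ W_i + b_i)
    return out

W, b = build_params(784, 10, [64, 64, 32])
activations = [np.tanh, np.tanh, np.tanh, lambda z: z]  # stand-ins only
out = forward(np.zeros((5, 784)), W, b, activations)
```

With three hidden layers there are four weight matrices and four activations, which matches the sample constructor call shown earlier.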

Batch Normalization (BatchNorm)

In this problem you will implement the Batch Normalization technique from Ioffe and Szegedy [2015]. Implement the BatchNorm class and make sure that all provided class attributes are set correctly. Apply the BatchNorm transformation to the first num_bn_layers layers, where num_bn_layers is specified as an argument to MLP. For the sake of the autograder tests, you can assume that batch norm will be applied to networks with sigmoid non-linearities.
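As a rough sketch of the training-mode transformation only, the function below normalizes each feature over the batch and then applies a learned scale and shift. The function name, the argument names gamma and beta, and the eps value are assumptions; the running statistics needed for evaluation mode and the backward pass are deliberately left out.

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-8):
    """Hypothetical helper: x has shape (batch, features)."""
    mean = x.mean(axis=0, keepdims=True)   # per-feature batch mean
    var = x.var(axis=0, keepdims=True)     # per-feature batch variance
    norm = (x - mean) / np.sqrt(var + eps)
    return gamma * norm + beta             # learned scale and shift

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4)) * 3.0 + 5.0
out = batchnorm_forward_train(x, gamma=np.ones((1, 4)), beta=np.zeros((1, 4)))
```

With gamma set to one and beta set to zero, each output feature has approximately zero mean and unit variance over the batch.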


Modify the step function in the MLP class to include momentum in your gradient descent. The momentum value will be passed as a parameter. Your function should run for the specified number of epochs and return the resulting weights. Instead of calling the step function after completing your backward pass, you will have to call the momentum function to update the parameters. Also, remember to invoke the zerograds function after each batch.
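A common formulation of gradient descent with momentum keeps a velocity term per parameter. The sketch below assumes that formulation; the function name and sign convention are hypothetical and may not match the reference solution exactly.

```python
import numpy as np

def momentum_step(W, dW, velocity, lr, momentum):
    # The velocity accumulates a decaying sum of past gradient steps.
    velocity = momentum * velocity - lr * dW
    return W + velocity, velocity

W = np.array([1.0])
v = np.zeros(1)
dW = np.array([1.0])  # pretend the gradient stays constant

W, v = momentum_step(W, dW, v, lr=0.1, momentum=0.9)  # W = 0.9,  v = -0.1
W, v = momentum_step(W, dW, v, lr=0.1, momentum=0.9)  # W = 0.71, v = -0.19
```

With momentum set to 0.0 this reduces to the plain gradient descent update.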

Note: Please ensure that you shuffle the training set after each epoch. Generate a list of indices, shuffle it with np.random.shuffle, and perform a gather operation on the data using these indices.
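The index-based shuffle can be sketched as follows on a toy data set. Shuffling one index array and using it to gather both inputs and labels keeps each example paired with its label; the array names are illustrative.

```python
import numpy as np

x = np.arange(10).reshape(5, 2)  # toy inputs: row i is [2i, 2i + 1]
y = np.arange(5)                 # toy labels: label i belongs to row i

idx = np.arange(len(x))
np.random.shuffle(idx)           # shuffles the indices in place

# Gather with the same indices so inputs and labels stay aligned.
x_shuf, y_shuf = x[idx], y[idx]
```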

Training Statistics

In this problem you will be passed the MNIST data set (provided as a 3-tuple of {trainset, valset, testset}, each in the same format described in 1) and a series of network and training parameters. From these, you will compute the following quantities:

  1. training_losses: A Numpy ndarray containing the average training loss for each epoch.
  2. training_errors: A Numpy ndarray containing the classification error on the training set at each epoch (Hint: You should not be evaluating performance on the training set after each epoch, but compute an approximation of the training classification error for each training batch).
  3. validation_losses: A Numpy ndarray containing the average validation loss for each epoch.
  4. validation_errors: A Numpy ndarray containing the classification error on the validation set after each epoch.

Fill in the function get training stats that, given a network, a dataset, a number of epochs and a batch size, trains the network and returns those quantities. Then execute the provided test file, which will run get training stats on MNIST for a given network and plot the resulting arrays into files. Hand in those image files along with your code on Autolab.
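For the error statistics, one possible approach is to count misclassified examples per batch and divide the accumulated count by the number of examples seen in the epoch. The sketch below assumes one-hot labels and uses a hypothetical helper name.

```python
import numpy as np

def batch_error_count(outputs, y_onehot):
    # Number of examples whose predicted class differs from the label.
    preds = np.argmax(outputs, axis=1)
    targets = np.argmax(y_onehot, axis=1)
    return int(np.sum(preds != targets))

outputs = np.array([[2.0, 0.1], [0.3, 0.9], [1.5, 0.2]])  # network outputs
labels = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])   # one-hot labels

wrong = batch_error_count(outputs, labels)
error_rate = wrong / len(outputs)
```

Summing these counts over the training batches of an epoch gives the approximate training error mentioned in the hint, without a separate pass over the training set.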

Note: You will not be autograded on that part. Instead we will grade you manually by looking at your code and plots. We will check that your code and the statistics you obtain make sense. Contrary to the other questions, we do not require an exact match with the solution, only that you run your forward, backward and step functions in the right order, and use the partitions of the dataset correctly (train and val at least), with shuffling of the training set between epochs.

Note: We provide examples of the plots that you should obtain. If you get similar or better values at the end of training (loss around 0.3, error around 0.1), then you are very likely to get all the points. If your values are a bit higher, but the errors and losses still drop significantly during training, it is possible that you will get all the points as well.