{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Optimization of Neural Networks - Part 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Main Contents For today\n", "1. GD vs. SGD vs. Mini-Batch GD\n", "2. Newton's Method, RMSProp, Momentum, Nesterov's Accelerated Gradient, Adam\n", "3. Initialization (Xavier, Kaiming)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Current State\n", "\n", "1. Data\n", "1. We have finalized our initial neural network architecture to train\n", "1. How will my neural network learn?\n", " 1. Minimize the loss function with respect to the network parameters\n", " 1. Calculus to rescue -> Iterative approach -> Gradient Descent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Batch Gradient Descent vs. Stochastic Gradient Descent vs. Mini-batch Gradient Descent\n", "\n", "Batch Gradient Descent\n", "1. Mathematically provable and guranteed to converge to global minimum for convex surfaces with a sufficiently small learning rate (lr <= 1/L for L-smooth surface)\n", "2. Tends to be slow since we only make a step after looking through all the data\n", "3. Not online\n", "\n", "Stochastic Gradient Descent\n", "1. Faster. Frequent updates but high variance. Large swings in parameter values\n", "2. Online\n", "3. Convergence can become an issue but with appropriate learning rate scheduling, convergence behaviour is close to that of Batch Gradient Descent\n", "4. Opportunity to jump to a better local minimas. Think exploration vs. exploitation\n", "\n", "Mini-batch Gradient Descent\n", "1. More stable convergence as parameters update variance reduces\n", "2. Can be as fast as SGD due to parallelization\n", "3. Searches through a larger part of the parameter space ( based on empirical data )" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
" ], "text/plain": [ "