00:28:07 Anon. Git: Does Adam use GD or SGD in any way?
00:28:38 Anon. Normal Distribution: Is the handout posted?
00:29:42 Anxiang Zhang (TA): Will post after recitation
00:30:07 Anon. Array: Is there any reason not to do all three in parallel and use the one with the best results?
00:31:12 Anon. Windows: If you did all three in parallel, wouldn't the bottleneck still be batch GD, so this would be equivalent to batch GD?
00:31:29 Anxiang Zhang (TA): @Nicky Nocerino: GD is computationally intensive and SGD is fast.
00:31:43 Anxiang Zhang (TA): Also, SGD can achieve a better overall convergence rate
00:32:00 Anon. Array: Got it, thanks
00:33:11 Anon. Array: Would it make any sense to start with SGD or mini-batch and increase the batch size over time to make convergence easier?
00:33:56 Anxiang Zhang (TA): Haven't experimented with it or seen related papers, but it's worth a try :P
00:34:37 Anxiang Zhang (TA): The motivation for batch GD is parallel computing. SGD is not parallelizable.
00:40:26 Anon. Git: Isn't Newton's method more computationally expensive, though?
00:40:34 Anxiang Zhang (TA): True
00:40:40 Anon. Git: Thanks
00:40:55 Anon. Boolean: Are second-order methods generally less popular than the first-order ones?
00:41:33 Anxiang Zhang (TA): To the best of my knowledge, yes
00:41:42 Reshmi Ghosh (TA): I think so too
00:42:42 Anxiang Zhang (TA): Second-order methods are mostly a research topic, like exploring how to approximate the Hessian in a faster way.
00:43:38 Anon. Boolean: Is it because it is usually inefficient to compute the Hessian?
00:46:07 Anxiang Zhang (TA): Yes.
00:46:34 Anon. Array: Are Newton's method and RMSprop mutually exclusive?
00:49:15 Anon. Dropout (for NNs): That's really cool
00:57:23 Reshmi Ghosh (TA): That's your cue for p2, folks :P
01:00:11 Anon. Array: Is there an optimal way to initialize, or just randomly?
01:00:39 Anxiang Zhang (TA): No optimal method in the deep learning world
01:00:44 Anxiang Zhang (TA): -.-
01:01:02 Reshmi Ghosh (TA): XD
01:01:25 Anon. Array: Is there a better way then, I guess? If constant is bad, what are better ways?
01:02:04 Anon. Boolean: Kaiming initialization is one of them, I believe
01:02:47 Anon. Array: Thanks
01:04:32 Anon. Array: I'm a little confused why a similar distribution across layers increases efficiency
01:04:57 Anon. Residual: In what case would you want to use Kaiming over Xavier, or vice versa?
01:05:27 Anon. Windows: Why doesn't the Xavier version's error decrease at all?
01:05:34 Anon. Windows: For the bottom picture
01:06:38 Anon. ResNet50: https://pytorch.org/docs/stable/nn.init.html <- there is something like Kushal said, but it might have to be tweaked to work well
01:08:29 Anon. NLLLoss: You can't have negative distributions
01:08:45 Anon. ResNet50: We don't want the values to explode either (e.g., we'd face underflow or numeric overflow issues)
01:10:05 Anon. Boolean: But isn't it the case that vanishing gradients should be handled by normalization or a better activation function?
01:11:33 Kushal Saharan (TA): @Alvin: Xavier's error fails to decrease only for a very deep neural net, as its analysis doesn't consider activation functions and is therefore poorer for deeper networks
01:12:08 Kushal Saharan (TA): Also, you might want to stick with Kaiming for almost all cases
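A minimal sketch of the Kaiming vs. Xavier choice discussed above, using the torch.nn.init module linked at 01:06:38; the network shape and the choice of ReLU here are placeholder assumptions for illustration, not values from the recitation:

```python
import torch.nn as nn

def init_weights(module):
    # Kaiming (He) init accounts for ReLU-family activations; Xavier (Glorot)
    # init assumes roughly linear/tanh-like activations, which is one reason
    # it can stall on very deep ReLU networks.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        # Alternative: nn.init.xavier_normal_(module.weight)
        nn.init.zeros_(module.bias)

# Placeholder MLP; layer sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(40, 1024), nn.ReLU(),
    nn.Linear(1024, 71),
)
model.apply(init_weights)
```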
01:13:45 Anon. Array: Are there downsides to having too low a learning rate, other than being slow?
01:14:09 Anon. LR: You might not reach the minima
01:14:15 Kushal Saharan (TA): @Yiwei: Vanishing gradients are handled using multiple techniques, and having a great initialization helps in that process. Therefore you'd still want to use things like BatchNorm to alleviate vanishing/exploding gradient issues
01:14:39 Anon. Boolean: Thanks
01:14:46 Anon. Array: @Ryan: might not reach it in a given time, or might not reach it ever?
01:15:16 Anon. LR: Ever; if it's too small you could get stuck near a local minimum
01:15:23 Kushal Saharan (TA): It may reach a minimum, sure
01:15:55 Anon. Git: Are dropout layers considered regularization? And if so, are they allowed in our NN?
01:16:05 Kushal Saharan (TA): But you will not be able to explore the parameter space
01:16:24 Kushal Saharan (TA): Yes, dropout is considered regularization
01:16:30 Reshmi Ghosh (TA): Yes! Dropouts are allowed. But you need to be careful about the setting (depth) in which you use them
01:16:32 Kushal Saharan (TA): They are allowed in your NN
01:16:49 Reshmi Ghosh (TA): Especially how you set the parameter 'p' value
01:17:36 Anon. Git: Thank you!
01:19:12 Reshmi Ghosh (TA): The dropout documentation in PyTorch is very nice
01:19:35 Reshmi Ghosh (TA): You can Google to learn more about how to implement it correctly
01:19:47 Anon. Action Potential: Does dropout add to the computational graph? If so, how?
01:20:16 Anon. Git: I think it would remove from the computational graph, if anything.
01:20:42 Anon. Git: But maybe not
01:21:27 Kushal Saharan (TA): What do you mean by 'add to the computational graph'?
01:22:09 Kushal Saharan (TA): There are no parameters to learn in dropout
01:22:28 Anon. Boolean: Does the problem of vanishing gradients count as a type of overfitting?
01:22:31 Kushal Saharan (TA): Consider it to be a layer that switches neurons on and off during the training stage
01:22:57 Anon. Action Potential: Is the dropout operation added to the graph, and are its effects considered during backprop?
01:23:00 Reshmi Ghosh (TA): Dropout simply ignores some neurons depending on your p value
01:23:19 Kushal Saharan (TA): Yes, it is considered during backprop
01:23:40 Anon. DBN: Will the slides be posted on the website?
01:23:46 Reshmi Ghosh (TA): Yes
01:24:27 Kushal Saharan (TA): Vanishing gradients can occur even without overfitting
01:24:50 Jacob Lee (TA): Go team!!!!!!
01:25:15 Reshmi Ghosh (TA): Woohoo Jacob! :P
01:25:24 Jacob Lee (TA): 👏🏻👏🏻👏🏻 Yeah!!!!
01:26:34 Anon. Boolean: Also, for dropout, is it equivalent to randomly setting some entries in the parameter matrices of each layer to zero during training?
01:27:02 Reshmi Ghosh (TA): What are some of these entries?
01:28:01 Reshmi Ghosh (TA): Entire outputs of neurons are either set to 0 or not, randomly, based on the 'p' you set during training
01:28:24 Kushal Saharan (TA): During evaluation, in order to incorporate this effect, you scale all outputs appropriately
01:28:39 Anon. Sodium Ion: So which order should we use again? Is either fine?
01:28:47 Anon. Sodium Ion: Sorry, didn't catch the last part
01:28:47 Kushal Saharan (TA): There is another way to handle dropout, but this is one of the ways
01:29:11 Anon. Adam: I think ReLU after BN
01:29:17 Anon. Sodium Ion: Cool, thx
01:29:18 Anon. Boolean: Thank you
01:29:37 Anon. is_available(): Are we allowed to use dropout in hw1p2?
01:29:42 Reshmi Ghosh (TA): Read more, because if you interchange BN and dropout, you may run into an issue :P
01:29:46 Kushal Saharan (TA): Yes
01:29:47 Reshmi Ghosh (TA): Yes @shentong
01:29:52 Anon. is_available(): Thank you.
01:29:57 Jacob Lee (TA): The order of ReLU and BN depends on the problem space
01:30:03 Kushal Saharan (TA): Also, you will find different papers doing different things as far as the order of ReLU and BN is concerned
01:30:08 Reshmi Ghosh (TA): Yes
01:30:40 Reshmi Ghosh (TA): I think we all have done that @anxiang
01:30:42 Reshmi Ghosh (TA): :P
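A minimal sketch of one hidden block in the Linear -> BN -> ReLU -> Dropout order discussed above; the width and p value are placeholders, and note that PyTorch's nn.Dropout is the "inverted" variant that rescales at training time rather than at evaluation:

```python
import torch
import torch.nn as nn

# One hidden block; 256 units and p=0.2 are illustrative, not recommendations.
block = nn.Sequential(
    nn.Linear(256, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(p=0.2),  # each neuron's output is zeroed with probability p during training
)

x = torch.randn(8, 256)

block.train()           # dropout active; surviving outputs are scaled by 1/(1-p)
y_train = block(x)

block.eval()            # dropout is a no-op; BatchNorm uses its running statistics
y_eval = block(x)
```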
01:31:01 Anon. ResNet50: How would normalization be used with word embedders in NLP tasks? After the embedding layer? Or is this advice (of normalizing input data) more applicable to other types of tasks?
01:31:38 Jacob Lee (TA): hw3 and hw4 will be a lot of this :)
01:32:10 Reshmi Ghosh (TA): Hint: we are talking about the 1st linear layer
01:32:13 Anon. ResNet34: Part of feature engineering?
01:32:18 Anon. Windows: Just removes input data?
01:32:56 Anon. Array: So some features are noise?
01:33:21 Jacob Lee (TA): ^ Ya, in the same way that a few dead pixels doesn't make an image uninterpretable
01:34:08 Anon. Boolean: Does momentum count as a learning rate scheduler?
01:34:17 Reshmi Ghosh (TA): That's the clue, people; you may experiment with CosineAnnealingLR if you would like
01:34:39 Kushal Saharan (TA): Umm, momentum is part of the optimization algorithm and not really a learning rate scheduler
01:35:54 Reshmi Ghosh (TA): Momentum is a different concept; it will be covered in class, but it's something good to know about for p2
01:36:01 Anon. Boolean: But if you use an adaptive learning method, is it still helpful to introduce a learning rate scheduler?
01:36:55 Jacob Lee (TA): Yes
01:39:29 Anon. Boolean: Is it the case that the learning rate scheduler is responsible for setting the learning rate at the beginning of every epoch, while adaptive optimization like momentum adjusts the learning rate using past data points during each epoch?
01:40:35 Jacob Lee (TA): 👏🏻👏🏻👏🏻👏🏻👏🏻🍾🎉👍🏻👌🏻
01:40:46 Anon. Array: 👏🏻👏🏻
01:41:44 Reshmi Ghosh (TA): Thank you, Jacob, for your undivided support
01:41:59 Jacob Lee (TA): 😎
01:42:16 Anon. Boolean: Thank you
01:43:17 Anon. ResNet50: 👏🏻
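A minimal sketch of the distinction raised in the last few questions, assuming SGD with momentum combined with the CosineAnnealingLR scheduler mentioned above; the model, data, and hyperparameters are placeholders, not recommended settings:

```python
import torch
import torch.nn as nn

model = nn.Linear(40, 71)              # placeholder model
criterion = nn.CrossEntropyLoss()
# Momentum lives inside the optimizer's update rule and acts on every batch.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# The scheduler only changes the learning rate, typically once per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):
    x = torch.randn(32, 40)            # dummy batch
    y = torch.randint(0, 71, (32,))    # dummy labels
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()                   # momentum-based parameter update
    scheduler.step()                   # cosine-annealed learning rate for the next epoch
```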