08:10:19 Anon. Baseline: I have never heard of it, but do NNs ever get trained with a non-derivative-based optimization technique like the genetic algorithm?
08:10:55 Anti (TA): I'm not sure that's very prevalent, but there is certainly work done in such areas
08:11:14 Anon. Baseline: Thank you, sorry!
08:11:59 Anon. CTC: Any recommended papers on what Jacob just mentioned?
08:12:10 Anon. CTC: @critical importance of loss functions
08:12:18 Anon. CTC: cool
08:16:18 Anti (TA): @Dennis an example of a gradient-free way of training a network: https://arxiv.org/abs/2005.05955
08:16:44 Anon. Recall Capacity: Why do we pass logits instead of predicted labels to the loss function?
08:17:14 Anon. Phoneme: Cross-entropy needs probabilities @Yiwei
08:17:48 Anon. Sanger's Rule: softmax -> nll_loss
08:18:45 Anon. Recall Capacity: What do you mean by validation here?
08:18:51 Anon. Baseline: @Anti Thank you very much!
08:19:25 Anon. Recall Capacity: @Daniel Mo Thank you
08:19:44 Anti (TA): Validation accuracy is the accuracy on data that we haven't trained the network on.
08:36:57 Anon. Transpose: Can you explain what taking the derivative w.r.t. an operation means?
08:51:53 Anon. Connectionist: Do we need to implement broadcasting in other functions like mul and matmul?
08:54:53 Anon. Phoneme: Function Div should be a scalar operation, right?
08:55:28 Anon. Phoneme: gotcha, thanks
08:55:40 Anon. Hinton: Can we use numpy methods in our MyTorch functions?
08:56:02 Anon. PackedSequence: ^ I'm confused about this as well
08:56:11 Jacob Lee (TA): ^ It depends is the answer
08:56:20 Jacob Lee (TA): If you want to add to the comp graph, you shouldn't; you should use the operations you defined
08:56:36 Jacob Lee (TA): But when defining the operations, you can use numpy all you want
08:57:12 Anon. PackedSequence: ohhh ok that makes sense
09:01:28 Anon. Phoneme: When going backward through Add, Sub, Mul, and Div, we assume they are all element-wise ops, so the partial derivatives would be similar to scalar derivatives?
09:02:11 Jacob Lee (TA): Oh in that sense, yeah
09:02:28 Jacob Lee (TA): But once broadcasting is added it's a little more complicated
09:02:55 Anon. Phoneme: Thanks! Kind of see the purpose of unbroadcast here :)
09:06:33 Anon. Weight: dL/dc * dc/da
09:08:20 Anon. Weight: Dot products will give a scalar result
09:08:48 Anon. Membrane: Yeah, so why would we use the dot product for matmul?
09:08:58 Anon. Membrane: That really confuses me
09:08:59 Jacob Lee (TA): You don't necessarily need to use the dot product specifically
09:09:10 Jacob Lee (TA): Remember, the dot product is a special case of matrix multiplication
09:09:24 Anon. Membrane: Anti just said we cannot use the * product for matmul's backward
09:09:28 Anon. Membrane: Then what should we use?
09:09:33 Jacob Lee (TA): That's elementwise multiplication
09:09:54 Anon. Recall Capacity: What does "out" refer to?
09:10:13 Anon. Membrane: Yeah, we cannot use elementwise multiplication for matmul's backward, right?
09:10:27 Jacob Lee (TA): Yeah, it wouldn't necessarily be defined
09:10:32 Jacob Lee (TA): Elementwise requires same-shape matrices
09:11:22 Anon. Membrane: Then what kind of multiplication should we use (if we cannot use elementwise) for matmul's backward?
09:11:33 Anon. Weight: Can you tell us why the shape of the gradient is the shape of the corresponding transpose?
09:11:39 Jacob Lee (TA): I can't necessarily give you the answer directly
09:11:40 Anon. Boolean: What does the note "*except in pytorch, hw1p1" mean?
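To make the unbroadcast discussion above concrete, here is a minimal numpy sketch of a broadcast-aware elementwise Add with its backward pass. The names (unbroadcast, add_forward, add_backward) and the bare-function style are illustrative assumptions, not the homework's actual MyTorch API; the idea is only that the upstream gradient must be reduced back to each input's original shape.

    import numpy as np

    def unbroadcast(grad, shape):
        # Undo numpy broadcasting: sum over axes that broadcasting added,
        # then over axes that were size 1 in the original input.
        while grad.ndim > len(shape):
            grad = grad.sum(axis=0)
        for axis, size in enumerate(shape):
            if size == 1:
                grad = grad.sum(axis=axis, keepdims=True)
        return grad

    def add_forward(a, b):
        return a + b  # elementwise, numpy broadcasting applies

    def add_backward(grad_out, a, b):
        # For c = a + b, dc/da = dc/db = 1 elementwise, so each input's
        # gradient is grad_out reduced to that input's original shape.
        return unbroadcast(grad_out, a.shape), unbroadcast(grad_out, b.shape)

    a = np.ones((3, 4))
    b = np.ones((1, 4))              # broadcast along the first axis
    grad_out = np.ones((3, 4))
    grad_a, grad_b = add_backward(grad_out, a, b)
    print(grad_a.shape, grad_b.shape)  # (3, 4) (1, 4)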
09:12:11 Jacob Lee (TA): On why the gradient is transposed, it's because of convention
09:12:44 Jacob Lee (TA): The derivative is a row vector, the gradient is a column vector
09:13:16 Anon. Membrane: Regarding "I can't give you the answer directly": any resource I can refer to now, before your update for the matmul part?
09:13:19 Anon. Weight: Ohh, okay thanks!
09:13:26 Jacob Lee (TA): The slides, once I release them
09:13:31 Jacob Lee (TA): with the updates
09:13:44 Anon. Membrane: ok
09:14:06 Jacob Lee (TA): Also in the recitation slides we uploaded a few days ago
09:14:20 Anon. Phoneme: For W it should be X^T
09:14:21 Jacob Lee (TA): In the links I provided, there's some info about it
09:14:39 Anon. Membrane: You mean the slides of the last recitation?
09:15:11 Jacob Lee (TA): yea
09:15:21 Jacob Lee (TA): no
09:15:28 Jacob Lee (TA): As in, we uploaded the slides of this recitation early
09:17:47 Anon. Recall Capacity: So PyTorch does not differentiate between row and column vectors?
09:17:56 Jacob Lee (TA): It does
09:18:19 Jacob Lee (TA): Broadcasting makes operations neater
09:18:23 Jacob Lee (TA): but they are different in torch
09:18:39 Anon. Recall Capacity: Thanks
09:19:36 Jacob Lee (TA): We'll discuss scanning MLPs in more depth in the CNN lectures
09:19:39 Jacob Lee (TA): Convolutional Nets
09:21:09 Anon. Connectionist: In the previous example, if x's dimensions were K*M, would we do the unbroadcasting in calculating the gradient w.r.t. b1?
09:21:23 Jacob Lee (TA): Yea
09:23:48 Anon. Python: On slide 8, what's the difference between loss and lossfunc in the partial derivatives?
09:24:06 Jacob Lee (TA): The loss function is a function, the loss is a value
09:24:23 Jacob Lee (TA): Same difference as f(x) and y
09:25:42 Anon. Python: Would dLoss/dLossfunc always be just a bunch of ones?
09:26:25 Jacob Lee (TA): I'm not sure actually; I'll ask Anti after this q
09:26:41 Anon. Python: thanks
09:33:59 Anon. Args: Are those two parts of the hw independent..?
09:34:11 Anon. Args: Ty!
09:34:39 Anon. Neuron: Just to be clear, there will not be a quiz this week?
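On the matmul-backward thread ("For W it should be X^T"): a minimal numpy sketch of the standard result for a forward pass Y = X @ W. Each input's gradient is a matrix multiplication of the upstream gradient with the transpose of the other operand, which is why elementwise multiplication doesn't apply and why the gradient shapes line up with the transposes. The names and shapes here are illustrative, not taken from the course slides.

    import numpy as np

    # Forward: Y = X @ W, with X of shape (N, M) and W of shape (M, K).
    def matmul_backward(grad_Y, X, W):
        # dL/dX = dL/dY @ W^T  -> shape (N, M), same as X
        # dL/dW = X^T @ dL/dY  -> shape (M, K), same as W
        grad_X = grad_Y @ W.T
        grad_W = X.T @ grad_Y
        return grad_X, grad_W

    X = np.random.randn(5, 3)
    W = np.random.randn(3, 2)
    grad_Y = np.ones((5, 2))           # placeholder upstream gradient
    grad_X, grad_W = matmul_backward(grad_Y, X, W)
    print(grad_X.shape, grad_W.shape)  # (5, 3) (3, 2)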