00:22:42 Anon. CasCor: We will update the bias with the weights right?
00:22:54 Anon. CrossEntropyLoss: Yes
00:24:37 jinhyun1@andrew.cmu.edu (TA): yup
00:24:46 Anon. Kernel: yes
00:30:44 Anon. Sum-Product: What do you mean by a divergence function?
00:31:02 Anon. is_leaf: why should divergence be positive if not 0?
00:31:09 Anon. VC Dimension: is this to be distinguished from the divergence in 3d calculus
00:31:16 Anon. Attractor: Positive and negative sum to 0?
00:31:17 Anon. Kernel Trick: Why use divergence rather than magnitude of the difference, or squared difference, for example, to measure error?
00:31:34 jinhyun1@andrew.cmu.edu (TA): @yiwei it’s a function that measures the difference between your neural network function and the true function
00:31:49 Anon. Variance: g(x) is the expected output?
00:31:53 Mansi Anand (TA): @Yiwei, divergence is a measure of how far your output deviates from the truth.
00:32:22 Anon. Dendritic Spine: I think you can use squared difference as divergence function
00:32:23 jinhyun1@andrew.cmu.edu (TA): @David divergence can be all of those things - it’s just a general expression that you can fill in with your wanted loss function
00:32:34 jinhyun1@andrew.cmu.edu (TA): @vaidehi no, g is Ground truth
00:32:51 Anon. Sum-Product: So it does not refer to the divergence operator in vector calculus?
00:32:58 jinhyun1@andrew.cmu.edu (TA): @aditya generally, we minimize loss to be 0
00:32:58 Anon. VC Dimension: ^
00:33:02 jinhyun1@andrew.cmu.edu (TA): @ran what do you mean?
00:33:05 jinhyun1@andrew.cmu.edu (TA): @yiwei yes, they are different
00:33:12 Anon. Kernel Trick: Ah, ok. Thanks
00:33:14 Anon. Sum-Product: Thank you
00:34:24 Anon. is_leaf: thank you
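The exchange above says a divergence is any non-negative function measuring the gap between the network output f(x) and the ground truth g(x), and that squared difference is one valid choice. A minimal sketch (illustrative names, not lecture code):

```python
def squared_divergence(f_x, g_x):
    """Squared difference between network output f(x) and ground truth g(x).

    Non-negative everywhere, and zero exactly when f(x) == g(x),
    which is what makes it a valid divergence."""
    return (f_x - g_x) ** 2

# The raw difference f(x) - g(x) can be negative, so errors could cancel;
# squaring (or taking an absolute value) prevents that.
print(squared_divergence(0.8, 1.0))   # small gap -> small penalty
print(squared_divergence(-1.0, 1.0))  # sign disagreement still penalized
```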
00:34:29 Anon. Center Loss: Are there rule of thumb how many samples we need to have?
00:34:56 jinhyun1@andrew.cmu.edu (TA): @Rui the “samples” in the real world will be a dataset
00:35:10 Anon. Center Loss: right, I mean how many data points?
00:35:20 jinhyun1@andrew.cmu.edu (TA): And there are guidelines on how large of a dataset you need for training some model to some test accuracy, etc but no, there generally isn’t a rule of thumb
00:35:28 jinhyun1@andrew.cmu.edu (TA): okay that was contradictory
00:35:28 Mansi Anand (TA): As many as you can get is the ideal, but it depends on how many you can collect in the real world.
00:35:37 Anon. Kernel: @Rui, pretty much as many as you can. The more you have, the better you can approximate the function
00:35:53 jinhyun1@andrew.cmu.edu (TA): in theory, there are guidelines, but making a dataset usually is constrained by other factors like time
00:36:00 Anon. ResNet50: Integral of divergence is just difference in the area under the curves for g(x) and f(x) right?
00:36:16 jinhyun1@andrew.cmu.edu (TA): Not exactly
00:36:31 jinhyun1@andrew.cmu.edu (TA): Divergence is a measure of the difference between the two functions
00:36:31 Anon. EC2: why do we take average of divergence?
00:36:45 jinhyun1@andrew.cmu.edu (TA): not necessarily the actual difference (because that would be both negative and positive at different points)
00:36:47 Anon. Thalamus: @Sai I guess it also involves abs(f(x)-g(x))
00:37:04 jinhyun1@andrew.cmu.edu (TA): Along with a lot of other functions too
00:37:24 Anon. Callback: Taking the average is the most typical way to account for all data points.
00:37:25 jinhyun1@andrew.cmu.edu (TA): @Jui We would like to take the integral, but we have a finite dataset, so we just take the average divergence over the dataset
00:37:43 Anon. ResNet50: Yeah so that essentially is an absolute difference in area under the curves right? @Daniel
00:37:47 Mansi Anand (TA): @jui because we have to measure the area or get an estimate of how much divergence we have
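As the TAs note above, with a finite dataset the integral is replaced by the average divergence over the samples. A minimal sketch, using squared difference as the divergence (function and variable names are illustrative):

```python
def empirical_risk(f, g, xs):
    """Average divergence between f and g over a finite sample of inputs.

    Stands in for the integral we cannot compute, since we only
    observe the true function g at the sampled points."""
    return sum((f(x) - g(x)) ** 2 for x in xs) / len(xs)

# Toy example: network f slightly mis-approximates the truth g = x^2.
f = lambda x: 1.1 * x * x          # slightly-off model
g = lambda x: x * x                # ground truth
xs = [0.0, 0.5, 1.0, 1.5, 2.0]    # finite dataset
print(empirical_risk(f, g, xs))
```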
00:38:12 Anon. Softmax: Is there a lower bound on the number of samples needed?
00:38:54 Anon. Thalamus: @Posholi depends on your target function
00:39:11 Anon. Args: @Posholi that is a learnability problem. They call it PAC learning. We might see it later in this class
00:39:15 jinhyun1@andrew.cmu.edu (TA): https://towardsdatascience.com/calculating-sample-size-using-vc-dimensions-a-machine-learning-way-748abbe1b1e4
00:39:23 Anon. Alpha: @Posholi in Probably Approximately Correct (PAC) learning you can determine the lower bound
00:39:30 jinhyun1@andrew.cmu.edu (TA): a quick introduction to # of samples, if you want
00:39:32 Anon. Softmax: Ok. Thank you!
00:39:53 Anon. Kernel Trick: So we assume an architecture capable of learning a function. Any tips on actually finding such an architecture? Or will that come later
00:40:51 Mansi Anand (TA): You guessed it right; it will come later.
00:41:15 Anon. Callback: @David https://arxiv.org/abs/1901.00434
00:41:31 Anon. Callback: You can refer to "capacity of NN"
00:43:51 Anon. YOLOv5: There is a new research topic called “Neural Architecture Search”, which one lab at CMU works on. You could check it out; it’s a cutting-edge research topic.
00:47:38 Anon. batch_first: same as x vector
00:47:42 Anon. Thalamus: same as x
00:48:40 Anon. Hodgkin-Huxley: -x
00:48:41 Anon. Caffe: -x
00:49:04 Anon. Kaiming: -x
00:49:11 Anon. Asynchronous Update: Opposite of x
00:49:12 Anon. Ion pump: -x
00:49:28 Anon. Leakage: -x
00:52:28 Anon. Ion pump: x
00:55:52 Mansi Anand (TA): 5 secs
00:56:04 Anon. hello_world.py: when a vector x is misclassified, why are we updating w using w + x instead of w + 0.5x or w + 2x? Is that just a rule of thumb?
00:56:06 Anon. Callback: Here is the proof for the lemma:http://www.cs.columbia.edu/~mcollins/courses/6998-2012/notes/perc.converge.pdf
00:56:06 Anon. Hodgkin-Huxley: Should we worry about normalization of W?
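The update asked about above can be written with an explicit learning rate; on linearly separable data, any positive scale works. A sketch (illustrative code, not from the lecture):

```python
def perceptron_step(w, x, y, lr=1.0):
    """One perceptron update: if x (with label y in {+1, -1}) is
    misclassified by sign(w . x), move w toward y * x.

    Starting from w = 0, using lr = c just scales every intermediate
    w by c, and sign(w . x) is unchanged by rescaling -- so lr = 1
    (i.e. w + x) is simply the most convenient choice."""
    if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

w = [0.0, 0.0]
w = perceptron_step(w, [1.0, 2.0], +1)  # misclassified (w . x == 0) -> update
print(w)  # [1.0, 2.0]
```

The same scaling argument is why normalizing W is unnecessary for correctness: it changes the magnitude of W but not the sign of W · X.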
00:56:26 Anon. Kernel: Poll was so quick:-(
00:56:37 Anon. Args: Btw does the poll have any weight towards our final grades?
00:56:45 Mansi Anand (TA): No
00:57:32 Anon. Kaiming: no
00:57:35 Anon. Attractor: no
00:57:36 Anon. Hodgkin-Huxley: no
00:57:45 Anon. Node of Ranvier: Why don’t we just collect all the x and -x and use the average of those as the weight?
00:59:21 Anon. ResNet50: No
00:59:22 Anon. print(‘Hello world!’): no
00:59:24 Anon. Dendritic Spine: No
00:59:25 Anon. Asynchronous Update: NO
00:59:26 Anon. Deep Dream: no
00:59:26 Anon. PDF: no
00:59:28 Anon. Hodgkin-Huxley: no
00:59:28 Anon. RNN: no
00:59:28 Anon. Kaiming: no
00:59:30 Anon. pdb.set_trace(): @Feng-Guang I feel like you could probably come up with a case where that doesn't work. Like if you had 50 positive points and 1 negative point, the average would probably be pretty far off
00:59:33 Anon. Voltage-gate: no
00:59:35 Anon. Leakage: no
00:59:47 Anon. Center Loss: @feng, the perceptron algo may converge faster than using all data points
01:00:55 Anon. LTI: What are "the lowest units"?
01:01:00 jinhyun1@andrew.cmu.edu (TA): You can come up with a scenario where that doesn’t yield the right separation. For instance, 50 positive points grouped together, with 1 positive point far off, average would be heavily weighted towards the 50
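The counterexample above can be checked numerically. A sketch with made-up numbers (50 clustered positives, one positive outlier, 50 negatives), comparing the summed ±x weight against the perceptron:

```python
# Dataset: 50 positives clustered at (1, 10), one positive outlier at
# (1, -10), and 50 negatives at (-1, 0). It is linearly separable:
# w = (1, 0) classifies everything correctly with sign(w . x).
pos = [(1.0, 10.0)] * 50 + [(1.0, -10.0)]
neg = [(-1.0, 0.0)] * 50
data = [(x, +1) for x in pos] + [(x, -1) for x in neg]

def dot(w, x):
    return w[0] * x[0] + w[1] * x[1]

# Weight built by summing x for positives and -x for negatives:
# it is dominated by the 50 clustered positives.
w_avg = [sum(y * x[i] for x, y in data) for i in range(2)]
print(dot(w_avg, (1.0, -10.0)) < 0)  # True: the positive outlier is misclassified

# The perceptron, by contrast, converges on this separable data.
w = [0.0, 0.0]
for _ in range(200):  # the mistake bound here is about 101, so this is plenty
    mistakes = 0
    for x, y in data:
        if y * dot(w, x) <= 0:
            w = [w[0] + y * x[0], w[1] + y * x[1]]
            mistakes += 1
    if mistakes == 0:
        break
print(all(y * dot(w, x) > 0 for x, y in data))  # True
```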
01:01:18 jinhyun1@andrew.cmu.edu (TA): Lowest units I think just refers to the lowest level perceptron nodes
01:01:22 jinhyun1@andrew.cmu.edu (TA): @hongyuan
01:01:33 Anon. Node of Ranvier: Yeah I got it
01:01:58 Anon. Convolution: How do we determine how many neurons we need in the hidden layer, if we do not know the truth function?
01:02:33 jinhyun1@andrew.cmu.edu (TA): in the real world, you just try different sizes
01:02:47 Mansi Anand (TA): that is a network design choice.
01:03:21 Anon. Sparse Matrix: How would we know how many decision boundaries we would need in general?
01:05:06 jinhyun1@andrew.cmu.edu (TA): You generally don't
01:05:27 Anon. Callback: Same. You need to try
01:05:59 Anon. Linear Layer: Relabeling is exp(number of instances), right?
01:06:28 Anon. Asynchronous Update: 2^n
01:12:21 Anon. Markov Chain: no
01:12:22 Anon. Ion pump: no
01:12:22 Anon. Asynchronous Update: no
01:12:31 Anon. Ion pump: nope
01:12:35 Anon. Leakage: no
01:12:54 Anon. Undirected Edge: How do you determine if classes are separable in practice without visualizing the points?
01:12:57 Anon. Kernel: right
01:17:02 Anon. Dendritic Spine: Can we use the absolute value of w * x as the distance?
01:18:09 Anon. CasCor: That's not easily differentiable, I believe
01:18:38 Anon. Thalamus: That is what quadratic loss does iirc
01:18:55 Anon. Asynchronous Update: 1
01:18:58 Anon. Attractor: 1
01:21:07 Anon. Kirchhoff: Previously we added the vector of the misclassified training point to the weight vector to get a better boundary. That gave us a "direction" for optimising the boundary, so why do we need this current approach? Is this for when the classes are not linearly separable?
01:22:55 jinhyun1@andrew.cmu.edu (TA): We will see soon, but we want a new solution because the other one was exponential w.r.t. the number of samples
01:22:59 Anon. Sum-Product: What is the horizontal axis here?
01:23:10 jinhyun1@andrew.cmu.edu (TA): just x
01:23:12 Mansi Anand (TA): in reality you would not know exactly where you need to wiggle the boundary or by how much. Also, there might be many points pulling in different directions like that.
01:23:20 Anon. Max Pool: the input data, assuming its 1-dimensional
01:23:29 Anon. Callback: @Jean I think you'd better train a classification model to check that
01:24:28 Anon. Callback: @Ke W * X is already an angular distance if both W and X are normalized; otherwise it is scaled by their magnitudes.
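A quick check of the claim above: for unit-norm vectors, the dot product equals the cosine of the angle between them (illustrative snippet, not lecture code):

```python
import math

def unit(v):
    """Scale a vector to unit norm."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

w = unit([3.0, 4.0])
x = unit([4.0, 3.0])
# For unit vectors, w . x == cos(angle between them).
cos_wx = sum(a * b for a, b in zip(w, x))
angle = math.degrees(math.acos(cos_wx))
print(round(cos_wx, 4), round(angle, 2))  # 0.96 16.26
```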
01:32:44 Anon. LTI: What is P(X) here?
01:33:02 Anon. Autograd: probability of seeing X
01:33:21 Anon. LTI: Makes sense. Thanks Mitchell
01:33:28 Anon. Autograd: :-)
01:35:12 Anon. Hodgkin-Huxley: Does empirical estimate converge to expected when N -> infty?
01:35:22 Anon. Hodgkin-Huxley: This seems believable if sampled from P(X)
01:35:31 Anon. Sum-Product: So essentially the empirical risk is the “discretized” divergence (on sample points)?
01:35:46 Anon. Tanh: basically
01:36:15 Anon. Tanh: between your expected output and your actual one
01:37:02 Mansi Anand (TA): @Yiwei yes
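On the earlier question of whether the empirical estimate converges to the expectation as N grows: yes, provided the samples are drawn from P(X), by the law of large numbers. A quick simulation (illustrative choices of model, truth, and P(X)):

```python
import random

random.seed(0)  # reproducible sampling

f = lambda x: 1.1 * x   # model
g = lambda x: x         # ground truth

def mc_risk(n):
    """Empirical risk: average squared divergence over n samples
    drawn from P(X) = Uniform(0, 1)."""
    xs = [random.random() for _ in range(n)]
    return sum((f(x) - g(x)) ** 2 for x in xs) / n

# True expected divergence: E[(0.1 X)^2] = 0.01 * E[X^2] = 0.01 / 3
expected = 0.01 / 3
for n in (10, 1000, 100000):
    print(n, abs(mc_risk(n) - expected))  # the gap typically shrinks as n grows
```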