00:22:42	Anon. CasCor:	We will update the bias with the weights right?
00:22:54	Anon. CrossEntropyLoss:	Yes
00:24:37	jinhyun1@andrew.cmu.edu (TA):	yup
00:24:46	Anon. Kernel:	yes
00:30:44	Anon. Sum-Product:	What do you mean by a divergence function?
00:31:02	Anon. is_leaf:	why should divergence be positive if not 0?
00:31:09	Anon. VC Dimension:	is this to be distinguished from the divergence in 3d calculus
00:31:16	Anon. Attractor:	Positive and negative sum to 0?
00:31:17	Anon. Kernel Trick:	Why use divergence rather than magnitude of the difference, or squared difference, for example, to measure error?
00:31:34	jinhyun1@andrew.cmu.edu (TA):	@yiwei it’s a functino that measures the difference between your neural network function and the true function
00:31:49	Anon. Variance:	g(x) is the expected output?
00:31:53	Mansi Anand (TA):	@Yiwei, divergence is the measure of expansion from truth.
00:32:22	Anon. Dendritic Spine:	I think you can use squared difference as divergence function
00:32:23	jinhyun1@andrew.cmu.edu (TA):	@David divergence can be all of those things - it’s just a general expression that you can fill in with your wanted loss function
00:32:34	jinhyun1@andrew.cmu.edu (TA):	@vaidehi no, g is Ground truth
00:32:51	Anon. Sum-Product:	So it does not refer to the divergence operator in vector calculus?
00:32:58	jinhyun1@andrew.cmu.edu (TA):	@aditya generally, we minimize loss to be 0
00:32:58	Anon. VC Dimension:	^
00:33:02	jinhyun1@andrew.cmu.edu (TA):	@ran what do you mean?
00:33:05	jinhyun1@andrew.cmu.edu (TA):	@yiwei yes, they are different
00:33:12	Anon. Kernel Trick:	Ah, ok. Thanks
00:33:14	Anon. Sum-Product:	Thank you
00:34:24	Anon. is_leaf:	thank you
00:34:29	Anon. Center Loss:	Are there rule of thumb how many samples we need to have?
00:34:56	jinhyun1@andrew.cmu.edu (TA):	@Rui the “samples” in the real world will be a dataset
00:35:10	Anon. Center Loss:	right, I mean how many data points?
00:35:20	jinhyun1@andrew.cmu.edu (TA):	And there are guidelines on how large of a dataset you need for training some model to some test accuracy, etc but no, there generally isn’t a rule of thumb
00:35:28	jinhyun1@andrew.cmu.edu (TA):	okay that was contradictory
00:35:28	Mansi Anand (TA):	as many as you can is the desire. but depends how many can you get in real world.
00:35:37	Anon. Kernel:	@Rui, pretty much as much as you can. They more you have, the more you can approximate the function
00:35:53	Anon. Kernel:	the better you can approximate*
00:35:53	jinhyun1@andrew.cmu.edu (TA):	in theory, there are guidelines, but making a dataset usually is constrained by other factors like time
00:36:00	Anon. ResNet50:	Integral of divergence is just difference in the area under the curves for g(x) and f(x) right?
00:36:16	jinhyun1@andrew.cmu.edu (TA):	Not exactly
00:36:31	jinhyun1@andrew.cmu.edu (TA):	Divergence is a measure of the difference between the two functions
00:36:31	Anon. EC2:	why do we take average of divergence?
00:36:45	jinhyun1@andrew.cmu.edu (TA):	not necessairly the actual difference (because that would be both negative and positive at different points)
00:36:47	Anon. Thalamus:	@Sai I guess it also involves abs(f(x)-g(x))
00:37:04	jinhyun1@andrew.cmu.edu (TA):	Along with a lot of other functions too
00:37:24	Anon. Callback:	Taking average is the most typical manner to consider all data points.
00:37:25	jinhyun1@andrew.cmu.edu (TA):	@Jui We would like to take the integral, but we have a finite dataset, so we just take the average divergence over the dataset
00:37:43	Anon. ResNet50:	Yeah so that  essentially is an absolute difference in area under the curves right? @Daniel
00:37:47	Mansi Anand (TA):	@jui because we have to measure the area or get an estimate how much divergence do we have
00:38:12	Anon. Softmax:	Is there a lower bound on the number of samples needed?
00:38:54	Anon. Thalamus:	@Posholi depends on your target function
00:39:02	Anon. Kaiming:	======================='
00:39:11	Anon. Args:	@Posholi that is a learnability problem. They call it PAC learning. We might see it later on this class
00:39:15	jinhyun1@andrew.cmu.edu (TA):	https://towardsdatascience.com/calculating-sample-size-using-vc-dimensions-a-machine-learning-way-748abbe1b1e4
00:39:23	Anon. Alpha:	@POsholi in Probably approximate learning you can determine the lower bound
00:39:30	jinhyun1@andrew.cmu.edu (TA):	a quick introduciton for # of samples if you want
00:39:32	Anon. Softmax:	Ok. Thank you!
00:39:53	Anon. Kernel Trick:	So we assume an architecture capable of learning a function. Any tips on finding actually finding such an architecture? Or will that come later
00:40:51	Mansi Anand (TA):	guessed it right, will come later.
00:41:15	Anon. Callback:	@David https://arxiv.org/abs/1901.00434
00:41:31	Anon. Callback:	You can refer to "capacity of NN"
00:43:51	Anon. YOLOv5:	There is a new research topic called “Neural Architecture Search”, which is one lab in CMU. Could check it out. It’s cutting-edge research topic.
00:47:38	Anon. batch_first:	same as x vector
00:47:42	Anon. Thalamus:	same as x
00:48:40	Anon. Hodgkin-Huxley:	-x
00:48:41	Anon. Caffe:	-x
00:49:04	Anon. Kaiming:	-x
00:49:11	Anon. Asynchronous Update:	Opposite of x
00:49:12	Anon. Ion pump:	-x
00:49:28	Anon. Leakage:	-x
00:52:28	Anon. Ion pump:	x
00:55:52	Mansi Anand (TA):	5 secs
00:56:04	Anon. hello_world.py:	when a vector x is misclassified, why we are updating w using x + w instead of w + 0.5x or w + 2x, is that jsut a rule of thumb?
00:56:06	Anon. Callback:	Here is the proof for the lemma:http://www.cs.columbia.edu/~mcollins/courses/6998-2012/notes/perc.converge.pdf
00:56:06	Anon. Hodgkin-Huxley:	Should we worry about normalization of W?
00:56:26	Anon. Kernel:	Poll was so quick:-(
00:56:37	Anon. Args:	Btw does the poll has any weight towards our final grades?
00:56:45	Mansi Anand (TA):	No
00:57:32	Anon. Kaiming:	no
00:57:35	Anon. Attractor:	no
00:57:36	Anon. Hodgkin-Huxley:	no
00:57:45	Anon. Node of Ranvier:	Why don’t we just collect all the x and -x and use the average of those as the weight?
00:59:21	Anon. ResNet50:	No
00:59:22	Anon. print(‘Hello world!’):	no
00:59:24	Anon. Dendritic Spine:	No
00:59:25	Anon. Asynchronous Update:	NO
00:59:26	Anon. Deep Dream:	no
00:59:26	Anon. PDF:	no
00:59:28	Anon. Hodgkin-Huxley:	no
00:59:28	Anon. RNN:	no
00:59:28	Anon. Kaiming:	no
00:59:30	Anon. pdb.set_trace():	@Feng-Guang I feel like you could probably come up with a case where that doesn't work. Like if you had 50 positive classes and 1 negative class, the average would probably be pretty far off
00:59:33	Anon. Voltage-gate:	no
00:59:35	Anon. Leakage:	no
00:59:47	Anon. Center Loss:	@feng, the perceptron algo may converge faster than using all data points
01:00:55	Anon. LTI:	What are "the lowest units"?
01:01:00	jinhyun1@andrew.cmu.edu (TA):	You can ccome up with a scenario where that doesn’t yield the right separation. For instance, 50 positive points grouped together, with 1 positive point far off, average would be heavily weighted towards the 50
01:01:18	jinhyun1@andrew.cmu.edu (TA):	Lowest units I think just refers to the lowest level perceptron nodes
01:01:22	jinhyun1@andrew.cmu.edu (TA):	@hongyuan
01:01:33	Anon. Node of Ranvier:	Yeah I got it
01:01:58	Anon. Convolution:	How do we determine how many neurons we need in the hidden layer, if we do not know the truth function?
01:02:33	jinhyun1@andrew.cmu.edu (TA):	in the real world, you just try different sizes
01:02:47	Mansi Anand (TA):	that is a network design choice.
01:03:21	Anon. Sparse Matrix:	How would we know how many decision boundaries we would need in general?
01:05:06	jinhyun1@andrew.cmu.edu (TA):	You generally don't
01:05:27	Anon. Callback:	Same. You need to try
01:05:59	Anon. Linear Layer:	Relabeling is exp(number of instances), right?
01:06:28	Anon. Asynchronous Update:	2^n
01:12:21	Anon. Markov Chain:	no
01:12:22	Anon. Ion pump:	no
01:12:22	Anon. Asynchronous Update:	no
01:12:31	Anon. Ion pump:	nope
01:12:35	Anon. Leakage:	no
01:12:54	Anon. Undirected Edge:	How do you determine if classes are separable in practice without visualizing the points?
01:12:57	Anon. Kernel:	right
01:17:02	Anon. Dendritic Spine:	Can we use the absolute of w * x as the distance
01:18:09	Anon. CasCor:	Thats not easily differentiable I believe
01:18:38	Anon. Thalamus:	That is what quadratic loss does iirc
01:18:55	Anon. Asynchronous Update:	1
01:18:58	Anon. Attractor:	1
01:21:07	Anon. Kirchhoff:	Previously we added the vector of the misclassified training poing to the weight vector to get a better boundary. That gave us a "direction" for optimising the boundary, so why do we need this current approach? is this when the classes are not linearly separable?
01:22:55	jinhyun1@andrew.cmu.edu (TA):	We will see soon, but we want a new solution because the other one was exponential w.r.t samples
01:22:59	Anon. Sum-Product:	What is the horizontal axis here?
01:23:10	jinhyun1@andrew.cmu.edu (TA):	just x
01:23:12	Mansi Anand (TA):	in reality you would not have an exact idea that where do I need to wiggle and how much do I need to. Also, there might be many points in different directions like that.
01:23:20	Anon. Max Pool:	the input data, assuming its 1-dimensional
01:23:29	Anon. Callback:	@Jean I think you'd better train a classification model to check that
01:24:28	Anon. Callback:	@Ke W * X is already angular distance if both W and X are normalized or it does not matter.
01:32:44	Anon. LTI:	What is P(X) here?
01:33:02	Anon. Autograd:	probability of seeing X
01:33:21	Anon. LTI:	Makes sense. Thanks Mitchell
01:33:28	Anon. Autograd:	:-
01:35:12	Anon. Hodgkin-Huxley:	Does empirical estimate converge to expected when N -> infty?
01:35:22	Anon. Hodgkin-Huxley:	This seems believable if sampled from P(X)
01:35:31	Anon. Sum-Product:	So essentially the empirical risk is the “discretized” divergence (on sample points)?
01:35:46	Anon. Tanh:	basically
01:36:15	Anon. Tanh:	btn ur expected out and ur actual one
01:37:02	Mansi Anand (TA):	@Yiwei yes