00:36:42 Anon. Voltage-gate: Yes
00:37:02 Anon. Electrochemical gradient: Today’s slide has not been posted right?
00:37:22 Reshmi Ghosh (TA): Nope
00:37:37 Reshmi Ghosh (TA): We’ll do it after the lecture!
00:37:59 Anon. Electrochemical gradient: Thanks!
00:43:51 Anon. CasCor: it’s only a sampling?
00:44:05 Anon. Deep Blue: the threshold matters?
00:44:06 Anon. comicstrip: might not be linearly separable
00:44:08 Anon. Linus: this could be skewed by an unfair distribution of points
00:44:13 Anon. Imagenet: it’ll only reduce the average error length
00:44:15 Anon. Capacitance: Some might be further away?
00:44:20 Anon. Imagenet: not the classification error
00:44:38 Anon. CUDAError: more points on one side could skew the boundary
00:44:46 Anon. Capacitance: It will move in that direction
00:44:47 Anon. ICA: It will flatten/tend towards the majority
00:44:48 Anon. comicstrip: will be pulled to the right
00:44:51 Anon. Gradient: you will miss many points
00:46:13 Anon. CMU: Yes
00:46:15 Anon. Soma: yes
00:46:15 Anon. Decoder: yes
00:46:35 Anon. Capacitance: no
00:46:36 Anon. batch_first: yes
00:46:42 Anon. D33p_M1nd: yes
00:46:43 Anon. Gradient: no
00:46:53 Anon. Transformer: yes
00:47:08 Anon. batch_first: yes
00:47:08 Anon. Array: yes
00:47:09 Anon. Kalman Filter: yes
00:47:10 Anon. Electrochemical gradient: yes
00:47:16 Anon. Phoneme: yes
00:47:31 Anon. D33p_M1nd: it will fail
00:48:21 Anon. comicstrip: can we ignore correctly labeled samples in backprop?
00:48:40 Anon. comicstrip: like we do in the perceptron
00:48:44 Anon. Uniform Distribution: good
00:48:45 Anon. D33p_M1nd: good
00:48:45 Anon. Capacitance: Good
00:48:47 Anon. Phoneme: bad
00:48:48 Anon. Andy: good
00:48:48 Anon. Decoder: good
00:48:50 Anon. Electrochemical gradient: Good behavior
00:48:52 Anon. Eta: Bad? Doesn’t this make the perceptron more vulnerable to outliers?
00:48:52 Anon. batch_first: bad
00:48:56 Anon. CUDAError: good
00:48:57 Anon. Kernel Trick: bad
00:49:13 Anon. Electrochemical gradient: Helps tolerate noise
00:51:20 Reshmi Ghosh (TA): Poll folks
00:51:58 Anon. comicstrip: so would backprop be better for probabilistic data and the perceptron better for deterministic data?
00:53:43 Anon. YOLOv6: what's the question? lost that part
00:55:24 Anon. LeNet: Why would the weights go to infinity?
00:59:14 Anon. Potassium Ion: non-global minima + saddle points
01:01:08 Anon. ICA: How large is large for these networks?
01:01:20 Anon. ICA: Approximately
01:02:10 Anon. Derivative: is there any particular reason why there’s a huge time gap between your selection of papers?
01:03:00 Anon. D33p_M1nd: Isn't the loss surface dependent on the training data? If yes, how does network size increase the number of saddle points, etc.?
01:03:28 Anon. Deep Blue: Thanks to Nvidia :P
01:03:32 Reshmi Ghosh (TA): lol
01:09:34 Anon. Deep Blue: Do we say a network has "converged" when the loss function is minimized, or in terms of minimized classification error?
01:09:35 Anon. EC2: How can we know the optimal x* beforehand?
01:11:16 Anon. ICA: Based on the sign of a
01:11:23 Anon. ICA: +
01:11:24 Anon. Phoneme: +
01:11:26 Anon. Phoneme: +
01:11:27 Anon. Electrochemical gradient: >0
01:12:31 Anon. Indifferentiable: @Tarang For a convex loss function, that is correct. Actually, most loss objectives are non-convex.
01:12:36 Anon. Andy: Not really
01:19:43 Anon. Cerebellum: second derivative?
01:19:50 Anon. Phoneme: second derivative inverse
01:19:51 Anon. LeNet: Inverse Hessian
01:20:16 Reshmi Ghosh (TA): @Mitch and @Anantananda do you have questions?
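[Editor's note: to make the step-size exchange above (the "second derivative inverse" / "Inverse Hessian" answers, and the quadratic discussion that follows) concrete, here is a minimal worked sketch. The 1-D quadratic E(w) = (w − w*)² is assumed from the slides being discussed; the slides themselves are not part of this chat.]

```latex
% Sketch, assuming the 1-D quadratic E(w) = (w - w^*)^2 from the lecture slides.
\[
E'(w) = 2\,(w - w^*), \qquad E''(w) = 2 .
\]
% A gradient step with learning rate \eta:
\[
w_{\text{new}} \;=\; w - \eta\,E'(w) \;=\; w - 2\eta\,(w - w^*),
\]
% so choosing \eta_{\text{opt}} = 1/E''(w) = 1/2 lands exactly on w^* in one step.
% In the multivariate case the analogous step is the Newton step
\[
\Delta w \;=\; -\,\mathbf{H}^{-1}\,\nabla E(w),
\]
% which is exact only for a quadratic loss; for a general loss it is only the
% optimum of the local quadratic approximation (as noted in chat around 01:22--01:24).
```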
01:20:32 Reshmi Ghosh (TA): @ananyananda** sorry
01:21:31 Anon. Derivative: depends
01:21:44 Anon. Derivative: it may jitter or even diverge if it’s too big
01:21:55 Anon. LeNet: No
01:22:03 Anon. Derivative: for this function, 2(w-w*)
01:22:17 Anon. Potassium Ion: So just to confirm, is the optimal step size normally the value of the inverse Hessian?
01:22:36 Anon. LeNet: Only in this quadratic case
01:23:55 Anon. Linus: but wouldn't that only be the optimum for a quadratic approximation and not the actual function?
01:27:06 Anon. C++: a11 and a22 should not be squared, right?
01:27:11 Anon. ICA: ^
01:27:19 Anon. Capacitance: ^
01:27:22 Anon. Potassium Ion: ^
01:31:15 Anon. Synapse: Steep
01:31:16 Anon. Cerebellum: the steeper direction?
01:33:16 Anon. Deep Blue: In practice, if we use a step that's proportional to the derivative in that dimension (as we learned in the previous few classes), it will ensure that we don't need to maintain separate steps per dimension, right?
01:34:15 Anon. Potassium Ion: So there is an optimal step size based on the derivative in each dimension, which means the overall optimal step size (lr) is one that falls between n_opt & 2n_opt across all dimensions?
01:34:48 Anon. Indifferentiable: It will be covered soon
01:35:06 Anon. Linus: layer-wise learning rate?
01:35:41 Anon. XOR Gate: Is this the reason why some algorithms require normalization?
01:36:45 Anon. LeNet: Adaptive step size
01:36:52 Anon. Synapse: Gradually decrease the lr
01:36:58 Anon. Curse of Dimensionality: Something similar to simulated annealing
01:39:09 Reshmi Ghosh (TA): Poll up folks
01:39:22 Anon. Scheduler: what does the second one mean?
01:39:24 Anon. Electrochemical gradient: What is always a bad thing?
01:48:21 Anon. Refractory Period: What’s the alpha here?
01:49:02 Anon. XOR Gate: I think it’s a hyper-param
01:49:10 Anon. EC2: does it mean Rprop cannot jump out of a local minimum?
01:50:49 Anon. Axon: why is it called Rprop?
01:51:07 Reshmi Ghosh (TA): Resilient propagation
01:52:26 Anon. LeNet: What do we mean by dimension-independent learning rates?
01:53:57 Anon. Phoneme: 0
01:56:39 Anon. Eta: how do you choose beta?
01:57:58 Anon. Linus: ^ and does that even have to be decayed over time?
02:01:24 Anon. LeNet: What do we mean by dimension-independent learning rates?
02:01:54 Reshmi Ghosh (TA): @anurag and @asish if we don’t get back to you now
02:01:58 Reshmi Ghosh (TA): get back to us on Piazza
02:02:11 Reshmi Ghosh (TA): @yiwei I will post your question too
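[Editor's note: for the Rprop questions near the end (01:48–01:53, "What's the alpha here" / "why is it called Rprop?" / "dimension-independent learning rates"), here is a simplified sketch of the basic idea: each dimension keeps its own step size, adapted using only the sign of the gradient. The function name `rprop_update`, the factors `eta_plus`/`eta_minus` (the role the "alpha" in chat refers to), and the toy quadratic are illustrative choices, not values from the lecture; full Rprop variants also revert or skip the update when the gradient sign flips.]

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One simplified Rprop ("resilient propagation") update.

    Each dimension keeps its own step size step[i], adapted using only the
    *sign* of the gradient, never its magnitude:
      - same sign as last time  -> grow the step (multiply by eta_plus)
      - sign flipped (overshot) -> shrink the step (multiply by eta_minus)
    eta_plus / eta_minus here are commonly cited defaults, not the lecture's values.
    """
    sign_change = grad * prev_grad                      # >0 same sign, <0 flipped
    step = np.where(sign_change > 0, step * eta_plus,
                    np.where(sign_change < 0, step * eta_minus, step))
    step = np.clip(step, step_min, step_max)
    w = w - np.sign(grad) * step                        # move by the step, against the gradient's sign
    return w, step

# Toy usage: E(w) = a1*w1^2 + a2*w2^2 with very different curvatures per dimension.
a = np.array([0.1, 10.0])
w = np.array([5.0, 5.0])
step = np.full_like(w, 0.1)
prev_grad = np.zeros_like(w)
for _ in range(50):
    grad = 2 * a * w
    w, step = rprop_update(w, grad, prev_grad, step)
    prev_grad = grad
print(w)  # both coordinates end up near 0 despite the curvature mismatch
```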