18:40:49 Anon. Northumberland: yep
18:44:19 Anon. Murdoch: yes
18:44:36 Anon. Bartlett: nope
18:44:43 Anon. Morewood: everything is ok, prof
18:44:54 Anon. Flash: Everything is clear, Prof
18:46:56 Anon. SpyKid2: Iran
18:47:07 Anon. SpyKid2: No
18:48:51 Anon. Flash: Unsupervised learning; like clustering, etc.
18:49:55 Anon. SpyKid2: what's the difference between the validation set and the test set?
18:50:06 Anon. Friendship: Will we be training a network where we go from a lower dimension to multiple dimensions? Like from a latent variable
18:50:12 Anon. SpyKid2: thanks
18:52:37 Anon. Ellsworth: Change in function output w.r.t. a small change in input
18:52:39 Anon. Drax: 'rate of change of'
18:52:41 Anon. WonderWoman: rate of change
18:52:44 Anon. N.Craig: rate of change of a function
18:52:45 Anon. S. Highland: change in one variable with respect to another
18:52:51 Anon. Bellefonte: Rate of change in a variable w.r.t. another variable
18:52:51 Anon. Morewood: is a steep
18:53:03 Anon. Ellsworth: Limit as a small perturbation goes to zero
18:53:18 Anon. Bellefonte: Limit as a change in the variable tends to infinity?
18:53:20 Anon. Morewood: is the rate of growth, which we can gather from the limit
18:53:57 Anon. Morewood: df/dw
18:56:00 Anon. Tech: activation functions?
18:58:11 Anon. Ellsworth: scalar
18:58:12 Anon. Morewood: vector
18:58:20 Anon. Ellsworth: vector
18:58:21 Anon. Drax: vector
18:58:22 Anon. Penn: vector
18:58:22 Anon. SpyKid2: vector
18:58:26 Anon. Penn: D
18:58:27 Anon. SpyKid2: D
18:58:29 Anon. Vision: D
18:58:44 Anon. Drax: row vector
18:58:45 Anon. Ellsworth: Alpha is a row vector of dimension D
18:58:46 Anon. Loki: 1xD?
18:58:49 Anon. SpyKid2: 1xD
18:58:47 Anon. WonderWoman: row vector
18:58:52 Anon. Murdoch: row vector
18:58:54 Anon. Morewood: vector
18:58:55 Anon. P.J. McArdle: Row vector
18:59:06 Anon. Atom: row vector, 1xD
19:00:42 Anon. Ellsworth: The vector of all partial derivatives
19:00:52 Anon. Batman: partial derivatives
19:01:49 Anon. Morewood: is a derivation of a vector
19:03:08 Anon. Ellsworth: Direction in which the function is increasing the most
19:03:09 Anon. Loki: direction of steepest increase?
19:03:20 Anon. Murdoch: Steepest direction of increase
19:03:21 Anon. SpyKid2: Steepest change
19:03:22 Anon. Batman: direction of greatest change
19:05:06 Anon. S.Craig: yes
19:05:08 Anon. Murdoch: yes
19:05:09 Anon. Friendship: yes
19:05:11 Anon. SpyKid2: yes
19:05:17 Anon. Atom: yes
19:05:57 Anon. Morewood: how can we escape from local minima here?
19:05:57 Anon. Friendship: opposite
19:06:00 Anon. SpyKid2: minus gradient
19:06:01 Anon. P.J. McArdle: opposite
19:14:47 Anon. Ellsworth: Parabola-ish
19:16:14 Anon. IronMan: Is this the same quiz as before?
19:19:12 Anon. Morewood: momentum?
19:20:50 Anon. SpyKid2: what if we step out of the minima?
19:22:08 Anon. Morewood: what happens when the step is high?
19:22:48 Anon. Tech: gets to the minima faster, but might never reach the "valley", if that makes sense
19:23:12 Anon. Morewood: what about a lower step?
19:23:25 Anon. Friendship: Takes more time to get to the minimum
19:23:44 Anon. Friendship: Training time will increase
19:23:47 Anon. Grandview: May get stuck in local minima
19:23:49 Anon. Morewood: is there a formula to find the step?
19:23:59 Anon. Flash: It will take finer steps towards the local minima, but it will take more time to get there.
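A minimal sketch (not from the lecture) of the update being discussed: move opposite the gradient, scaled by a learning rate ("step"). The quadratic function and the learning-rate values are illustrative assumptions, chosen to show what the chat describes: a small step converges slowly but surely, while too large a step overshoots and never settles into the "valley".

    # Gradient descent on f(w) = (w - 3)^2, whose derivative is df/dw = 2*(w - 3).
    def gradient_descent(lr=0.1, steps=50):
        w = 0.0                      # arbitrary starting point
        for _ in range(steps):
            grad = 2.0 * (w - 3.0)   # df/dw at the current w
            w = w - lr * grad        # step in the direction of steepest decrease
        return w

    print(gradient_descent(lr=0.1))   # converges near the minimum at w = 3
    print(gradient_descent(lr=1.1))   # step too large: the iterates overshoot and diverge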
19:25:38 Anon. S.Craig: How do we check if we converged at a local minimum or the global minimum?
19:25:46 Anon. Morewood: is the step changeable?
19:25:58 Anon. Murdoch: The loss function is minimized
19:26:01 Anon. Phillips: maybe we can re-initialize and try different starting points?
19:26:05 Anon. Tech: yes
19:26:10 Anon. Tech: random initialization can help with that
19:26:26 Anon. Atom: we can also use momentum
19:27:23 Anon. Morewood: or other optimizers like Adam, RMSprop, ...
19:28:13 Anon. Morewood: prof, what's the cost function here?
19:28:19 Anon. Morewood: is it MSE?
19:30:06 Anon. GreenArrow: Bhiksha will introduce the cost function we will use here very soon
19:30:12 Anon. GreenArrow: it would be KL, I believe
19:32:36 Anon. Thor: cross-entropy loss for classification?
19:33:49 Anon. Morewood: where can we use them?
19:34:06 Anon. GreenArrow: "cross entropy loss for classification?" Yes
19:37:45 Anon. SpyKid2: can we say a hidden layer is a single function that has multiple inputs and multiple outputs?
19:38:07 Anon. Morewood: how can this function help us?
19:38:09 Anon. Tech: I've seen it that way, with the activation function included
19:38:43 Anon. Morewood: yeah, usually it is used in the last layer of the network
19:38:56 Anon. Loki: so for vector activations, does the whole layer act as one unit that takes in a vector and produces a vector?
19:39:12 Anon. GreenArrow: "can we say hidden layer is a single function that has multiple input and multiple output?" Why do you think it is a single function?
19:39:46 Anon. Atom: we can use a softmax activation in the last layer for multi-class classification
19:40:03 Anon. GreenArrow: yes
19:40:10 Anon. SpyKid2: can we say hidden layer is a single function that has multiple input and multiple output?
19:40:45 Anon. Tech: oh, I think they mean hidden layer 1 can be represented as h1 = f(x1, ..., xn; w1), and the activation would be z1 = activation_func(h1), and so on. Is this acceptable notation, or is it preferred to stay purely in a vectorized form?
19:40:54 Anon. Morewood: yeah Ehsan, I think so
19:42:12 Anon. Tech: yes
19:42:13 Anon. SpyKid2: thank you
19:50:10 Anon. SpyKid2: we have infinite classes in the world; can we use some kind of combination between classes? For example, can we say a horse is 80 percent cat + 20 percent dog?
19:51:20 Anon. Phillips: ^ you can look into topic modeling and models like LDA
19:52:14 Anon. Tech: what about using the feature vectors in a manner similar to GloVe word embeddings? Is that a thing?
19:58:38 Anon. Tech: So this is like cross-entropy loss?
19:59:09 Anon. IronWoman: The equation is the same, so yeah, I guess.
19:59:22 Anon. Darlington: it is a special case of cross-entropy
19:59:24 Anon. Atom: is the loss function being differentiable at all times always a necessary condition?
19:59:37 Anon. GreenArrow: cross-entropy is different from KL divergence, but can be calculated using KL divergence
19:59:38 Anon. Murdoch: KL divergence is supposed to be the cross entropy
19:59:49 Anon. Tech: oh ok, thanks
20:01:24 Anon. GreenArrow: "is loss function being differentiable at all time always a necessary condition?" If it is not differentiable, you probably cannot backprop the gradient backward
20:02:56 Anon. IronWoman: -1
20:03:11 Anon. IronWoman: Same, -1
20:03:13 Anon. S.Craig: 1
20:03:14 Anon. IronWoman: 1
20:09:20 Anon. Morewood: -log(1-y)
20:10:10 Anon. S.Craig: High negative
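A small sketch (illustrative, not the lecture's code) of the output layer the chat is describing: a softmax over the last layer's outputs, with a cross-entropy loss. With a one-hot target, the cross-entropy reduces to -log(probability of the correct class), which is the -log term that comes up in the exchange above. The logits and label below are assumed values.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)            # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([2.0, 0.5, -1.0])   # assumed raw outputs of the last layer
    target = np.array([1.0, 0.0, 0.0])    # one-hot label for class 0

    probs = softmax(logits)
    cross_entropy = -np.sum(target * np.log(probs))
    print(probs, cross_entropy)           # equals -log(probs[0]) for this one-hot target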
20:12:22 Anon. SpyKid2: the difference between the validation set and the test set
20:13:20 Anon. SpyKid2: can we say the training set is for training our model and the validation set is for training our hyperparameters?
20:13:40 Anon. S. Aiken: Also, we use the val set for early stopping
20:14:14 Anon. GreenArrow: "can we say training set is for training our model and validation set is for training our hyper parameters?" You can understand it in this way
20:14:24 Anon. S. Aiken: We are using softmax before KL div, right?
20:14:36 Anon. SpyKid2: can we say training set is for training our model and validation set is for training our hyper parameters?
20:15:09 Anon. GreenArrow: I prefer the word "tuning" rather than "training"
20:15:14 Anon. Tech: the val set can be used during training to predict high variance on the test set, to tune hyperparameters, or to select the best models
20:15:30 Anon. Atom: so our loss function should always be a smooth function, right?
20:15:43 Anon. GreenArrow: "training" hyperparameters sounds weird. You cannot train them; you tune/adjust them
20:17:32 Anon. Wanda: We said the gradient is the transpose of the derivative. I had previously used the terms gradient and derivative interchangeably. What is the relevance of the transpose?
20:18:41 Anon. SpyKid2: ^ by the word "training" I meant: knowing which ones we should use
20:19:20 Anon. GreenArrow: yeah, then you can understand it in this way
20:21:50 Anon. Atom: what is the difference between cross entropy and KL divergence?
20:24:46 Anon. Schenley: cross entropy = entropy + KL divergence
20:27:13 Anon. Tech: Thank you, professor
20:27:15 Anon. Beechwood: Thank you!
20:27:15 Anon. Schenley: Thank you
20:27:17 Anon. Loki: Thank you, prof!
20:27:20 Anon. Baum: Thank you
20:27:22 Anon. S.Craig: Thank you
20:27:22 Anon. Firestorm: Thank you!
20:27:23 Anon. Atom: thank you 😇
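A quick numerical check (with assumed distributions, not from the lecture) of the identity given at 20:24:46, cross-entropy H(p, q) = entropy H(p) + KL(p || q):

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])   # "true" distribution (assumed for illustration)
    q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution (assumed)

    entropy_p = -np.sum(p * np.log(p))
    cross_entropy = -np.sum(p * np.log(q))
    kl_pq = np.sum(p * np.log(p / q))

    print(cross_entropy, entropy_p + kl_pq)   # the two values match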