07:58:14 From Jacob Lee (TA): good luck today friends
07:58:23 From Jacob Lee (TA): 👌🏻👌🏻👌🏻👌🏻
08:00:03 From Anon. Cable Theory: Could the host go to Zoom settings and mute participants upon joining?
08:00:18 From Jacob Lee (TA): remember to record!
08:01:10 From Anon. Array: view present
08:01:19 From Anon. Array: no
08:01:27 From Anon. Sequence: Not currently sharing
08:06:37 From Anon. YOLOv3: Where can we find the recitation slides?
08:06:56 From Anon. Decoder: If perceptrons are single-layer neural nets, then how is a multi-layer perceptron different from a neural net?
08:07:08 From jinhyun1@andrew.cmu.edu (TA): @Ran They’ll be up right after this recitation ends
08:07:10 From Anon. Deep Dream: perceptron is a type of neural net
08:07:12 From Anon. Deep Dream: I think
08:07:23 From jinhyun1@andrew.cmu.edu (TA): Yes
08:07:42 From Anon. is_available(): Are these slides on the course website?
08:07:49 From jinhyun1@andrew.cmu.edu (TA): So "neural network" is an umbrella term that includes MLPs (which are just fully connected layers like these), CNNs, RNNs, etc.
08:07:50 From Jacob Lee (TA): they will be
08:08:09 From jinhyun1@andrew.cmu.edu (TA): @Yueqing Sorry, they are not up yet
08:08:19 From Anon. LTI: I guess the word “perceptron” usually refers to one neuron with input, weight, and activation
08:08:30 From jinhyun1@andrew.cmu.edu (TA): That’s right
08:08:32 From Anon. is_available(): Okay, thanks :)
08:09:10 From Anon. Decoder: thanks!
08:09:53 From Anon. Kaggle: Is lr the learning rate?
08:09:58 From jinhyun1@andrew.cmu.edu (TA): Yes
08:09:59 From Anon. Fast-RCNN: yes
08:10:02 From Anon. is_available(): yes
08:10:02 From Anon. Python: yeah
08:10:09 From Anon. Kaggle: Thank you!
08:10:58 From Jacob Lee (TA): 👏🏻👏🏻👏🏻
08:11:57 From Anon. Gradient Flow: How would running multiple test cases stop us from getting stuck in a local minimum?
08:12:29 From Jacob Lee (TA): @bajram the gradient space will be slightly different each time, so there'll be different minima/maxima
08:12:33 From Anon. Array: If you initialize your position in many locations, you have a greater chance of finding the global minimum
08:13:40 From Anon. Gradient Flow: Ok thanks
08:15:42 From Anon. Eta: thinking about overhead, are operations using the gpu faster for smaller matrices too?
08:15:49 From Anon. Uniform Distribution: can you initialize the tensor directly within gpu memory in order to avoid the transfer time?
08:16:31 From Anon. ReduceLROnPlateau: can we create tensors on the GPU by default?
08:16:41 From Anon. ReduceLROnPlateau: my bad, asked the same question
08:16:44 From Jacob Lee (TA): @uma i think it depends, but it's generally the same; transferring between cpu and gpu is a big time bottleneck however
08:17:06 From Anon. Uniform Distribution: Okay, thank you!
08:17:09 From Anon. Activation Function: Can we individually select which variables in the code run on the gpu while others stay off it?
08:17:09 From Reshmi Ghosh (TA): Yes you can, but you need to move the data back to numpy to generate results
08:17:11 From Jacob Lee (TA): :)
08:17:13 From Anon. Eta: @Jacob Lee thank you!
08:17:22 From Reshmi Ghosh (TA): And save them locally
08:17:58 From Anon. Graph: is there a disadvantage to always moving every tensor we make to the gpu?
08:18:25 From Anon. Deep Dream: what's the difference between a.cuda() and a.to("cuda")?
08:19:04 From Jacob Lee (TA): ^ i think they're the same, but google to verify that
08:19:19 From Antioch Sanders (TA): @Lawrence Chen you may have more limited memory on the GPU
08:19:25 From Reshmi Ghosh (TA): “cuda” is the string denoting the device name
08:20:25 From Anon. Deep Dream: ok so apparently using the to method is faster and pytorch recommends it. the cuda method is for backward compatibility with older versions of pytorch, i'm pretty sure
08:20:44 From Jacob Lee (TA): ^ Thanks :)
08:20:45 From Anon. Validate: Thanks Joseph!
08:20:52 From Anon. ReduceLROnPlateau: thank you
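The device questions above (creating a tensor directly in GPU memory vs. moving it there, and a.to("cuda") vs. a.cuda()) come down to where the tensor is allocated. A minimal sketch, assuming CUDA is available; the shapes and variable names are made up for illustration:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Created on the CPU, then copied over; .to(device) and .cuda() both do this copy.
a = torch.randn(1024, 1024).to(device)

# Allocated directly in GPU memory, so there is no separate host-to-device transfer.
b = torch.randn(1024, 1024, device=device)

# Compute on the GPU, then move back to the CPU to hand the result to numpy.
c = (a @ b).cpu().numpy()
```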
08:21:04 From Anon. Fast RCNN: How do we print that out?
08:21:15 From Anon. Fast RCNN: The GPU usage
08:21:18 From Anon. Fast RCNN: Thanks!
08:21:27 From Anon. CMU: How do we check whether the memory is increasing or not?
08:21:53 From Jacob Lee (TA): ^ it depends on what platform you're on; if you're on aws you can use various bash commands
08:22:08 From Tony Qin (TA): nvidia-smi
08:22:29 From Reshmi Ghosh (TA): If you are on google colab
08:22:30 From Jacob Lee (TA): in colab/Jupyter notebooks there's some cuda command that's convenient, but i forget what it was
08:22:32 From Jacob Lee (TA): oh
08:22:39 From Reshmi Ghosh (TA): A bar in the top right header shows memory usage
08:22:45 From Anon. Voltage-gate: I think it’s torch.cuda.memory_allocated()
08:22:50 From Anon. Voltage-gate: If you want to do it programmatically
08:23:04 From Reshmi Ghosh (TA): I think Edward is right
08:23:12 From Anon. Fast RCNN: Thanks all! Appreciate it!
08:23:47 From Anon. CMU: Thank you
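A small sketch of the two memory checks mentioned above, torch.cuda.memory_allocated() in Python and nvidia-smi from a shell; the device index and unit conversion are just one way to print it:

```python
import torch

if torch.cuda.is_available():
    # Bytes currently held by tensors on GPU 0.
    print(torch.cuda.memory_allocated(0) / 1024**2, "MiB allocated")
    # Peak allocation since the program started (or since the last reset).
    print(torch.cuda.max_memory_allocated(0) / 1024**2, "MiB peak")

# From a terminal (or a notebook cell prefixed with "!"):
#   nvidia-smi
```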
08:24:13 From Anon. ReduceLROnPlateau: should we include error numbers on piazza to make it easier for other students to find the same error?
08:24:44 From Jacob Lee (TA): sure, wouldn't hurt
08:25:17 From Reshmi Ghosh (TA): @matias as long as you are posting on the right FAQ thread, sure!
08:25:20 From Reshmi Ghosh (TA): :)
08:25:33 From Anon. AlphaGo: Would you recommend doing all computation on the CPU initially for one learning iteration, then moving to the GPU once no errors show up?
08:25:44 From Jacob Lee (TA): ^ yes
08:25:57 From Anon. VC Dimension: Are these slides available online somewhere?
08:26:04 From Jacob Lee (TA): and while developing on cpu, do it locally; aws time is expensive and colab restricts how much time you can use the gpu for
08:26:33 From Reshmi Ghosh (TA): I would say it depends? If the data is large, it might take a long time on cpu
08:26:34 From Jacob Lee (TA): ^ we'll upload them later, sorry for the delay
08:26:45 From Anon. Fast RCNN: Is it possible to finish our homework just using Colab as opposed to using amazon aws instances? Just curious
08:26:53 From Reshmi Ghosh (TA): Use AWS
08:27:06 From Reshmi Ghosh (TA): Colab doesn’t have free GPUs anymore
08:27:13 From Jacob Lee (TA): o rly :(
08:27:17 From Anon. Fast RCNN: Oh nooo
08:27:19 From Reshmi Ghosh (TA): Although Colab Pro is cheaper in some ways
08:27:26 From Anon. Python: wait what?
08:27:34 From Reshmi Ghosh (TA): You pay $10 monthly to access Colab Pro
08:27:36 From Anon. Python: seriously?
08:27:44 From Anon. Fast RCNN: I thought Colab had free GPUs but less storage
08:27:45 From Anon. Deep Dream: el big oof
08:27:50 From Reshmi Ghosh (TA): But CMU doesn’t provide credits for Colab!
08:28:05 From Reshmi Ghosh (TA): @Qiyun that changed this May :-(
08:28:23 From Anon. MXNet: can we just train the model locally?
08:28:26 From Reshmi Ghosh (TA): It does provide free GPUs with a ram size of 12 GB
08:28:28 From Anon. Cerebellum: Wait, Colab does
08:28:35 From Anon. Alpha: How would we apply the credit given by CMU for AWS?
08:28:43 From Reshmi Ghosh (TA): Beyond that you have to access Colab Pro
08:28:44 From Jacob Lee (TA): @yi shen it will take many centuries for the later homeworks
08:28:49 From Anon. Bias: Do we have to do all the assignments on aws, or can we do them on a local machine with a GPU?
08:28:56 From Antioch Sanders (TA): training locally is a good idea for testing
08:29:00 From Reshmi Ghosh (TA): You can use a local machine
08:29:08 From Anon. Bias: Thank you
08:29:17 From Anon. MXNet: @Jacob Lee, will an RTX 3080 be faster than an AWS T4?
08:29:22 From Anon. Sequence: Note on local training: watch how much of your memory is being used; it can be really easy to consume all of the memory on your computer before any training starts (potentially freezing your computer)
08:29:25 From Tony Qin (TA): Zequn, it’s pretty simple. There’s a place to redeem credits. Just search for it
08:29:27 From Reshmi Ghosh (TA): We will give you AWS credits before every hw
08:29:38 From Reshmi Ghosh (TA): Given that you have completed hw0p1
08:29:40 From Reshmi Ghosh (TA): :-)
08:29:48 From Tony Qin (TA): Yes, the 3080 is faster than a T4. In fact, the 2080 Ti is faster. I think even the 2080 is faster
08:29:52 From Anon. LTI: Haha go get a 3080 @Yi Shen
08:29:55 From Anon. Alpha: Got it, thank you!
08:30:00 From Tony Qin (TA): Get the 3090
08:30:01 From Anon. Fast RCNN: Thanks @Reshimi !
08:30:15 From Anon. Eta: for our hw assignments, generally, how much faster is it to use a gpu vs. training locally?
08:30:23 From Anon. Fast RCNN: Sry I meant Reshmi
08:30:29 From Reshmi Ghosh (TA): Lol isssokay
08:30:32 From Anon. MXNet: @Tony I want to buy a 3090, but I am poor hahaha
08:30:42 From Anon. is_available(): lol
08:30:52 From Jacob Lee (TA): @uma big difference, sometimes like 21 hours for one epoch locally, 15 minutes on gpu
08:30:56 From Reshmi Ghosh (TA): @Uma a lot faster!
08:31:14 From Reshmi Ghosh (TA): I have heard of cases taking more than 21 hours @Jacob
08:31:16 From Reshmi Ghosh (TA): :P
08:31:41 From Anon. Eta: oh woah … thanks for the replies @jacob @reshmi
08:31:41 From Tony Qin (TA): Some of your training will take >5 hours on GPU. Can’t imagine what it would be on CPU
08:31:53 From Anon. RBM: hey, how much difference would there be if I trained on my local machine's gpu rather than on aws?
08:32:04 From Tony Qin (TA): Ananyananda, what’s your gpu?
08:32:05 From Anon. Voltage-gate: Would we do data augmentation in the dataset, or is there some way to do it in the dataloader?
08:32:09 From Reshmi Ghosh (TA): Oh yes, hw4 will take really long! Also, depending on your model architecture, hw2 may also take long
08:32:44 From Jacob Lee (TA): @edward do it in the dataset; the dataloader has to process on the fly before feeding into the model, whereas the dataset does preprocessing, so the time is frontloaded
08:32:56 From Anon. RBM: 1080 Ti nvidia
08:32:58 From Reshmi Ghosh (TA): The dataloader is generally used to load the data; the dataset establishes the getitem method for easy indexing
08:33:02 From Reshmi Ghosh (TA): Do it in the dataset
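To make the dataset-vs-dataloader advice above concrete, here is a minimal Dataset sketch that frontloads preprocessing in __init__ and leaves batching to the DataLoader. The class name, synthetic data, sizes, and the normalization step are illustrative, not the actual homework pipeline:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """Frontloads preprocessing in __init__ so __getitem__ stays cheap."""

    def __init__(self, features, labels):
        # Expensive one-time work (loading, normalization) happens here, up front.
        features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
        self.x = torch.from_numpy(features).float()
        self.y = torch.from_numpy(labels).long()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        # Runs on the fly in the DataLoader's worker processes; keep it light.
        # Per-sample random augmentation, if any, would also go here.
        return self.x[idx], self.y[idx]

# Synthetic stand-in data; swap in the real features/labels for the homework.
train_set = MyDataset(np.random.randn(1000, 40), np.random.randint(0, 10, size=1000))
# On Windows, wrap this in an `if __name__ == "__main__":` guard or use num_workers=0.
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)
x_batch, y_batch = next(iter(loader))   # (64, 40) features, (64,) labels
```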
08:34:07 From Tony Qin (TA): A 1080 Ti should be fine. It will be faster than a T4. The only downside is it has less vram than the T4, which has 16GB.
08:34:24 From Anon. Decoder: why does efficiency decrease if we have too many workers?
08:34:26 From Anon. Cerebellum: doesn't windows have an issue with num_workers?
08:34:30 From Anon. Matrix: how about my intel uhd graphics 620 /s
08:34:54 From Anon. RBM: by vram you mean system ram?
08:34:54 From Tony Qin (TA): Yes, windows has an issue with num_workers. Don’t use windows. I tried for a long time to get it to work, but couldn’t. Just get arch lol
08:34:58 From Anon. Gradient Flow: what is num_workers doing?
08:35:01 From Anon. Deep Dream: what about in wsl?
08:35:07 From Tony Qin (TA): vram is the memory on the gpu
08:35:13 From Jacob Lee (TA): @joseph i wouldn't run serious code in wsl, it's not really meant for that
08:35:24 From Anon. Fast RCNN: num_workers is for parallelism I believe; if set to 1 then it's sequential @Bajram
08:35:32 From Anon. Bidirectional: when you say windows has a problem with num_workers, does this apply if we do everything in aws?
08:35:33 From Anon. Gradient Flow: Oh right
08:35:39 From Tony Qin (TA): num_workers is the number of worker processes that load data for the GPU, which greatly affects the speed of training
08:35:56 From Tony Qin (TA): You can use AWS on windows no problem, since the computation is done on the AWS machine
08:35:56 From Jacob Lee (TA): @zong no, aws would be whatever OS you have on your AMI
08:35:56 From Anon. RBM: Oh okay, thanks @Tony Qin
08:36:08 From Anon. MXNet: @Tony Even a 1080 Ti is faster than a T4? So can I understand it as: if we have a 1080 or better GPU on our local machine, then we don't need to use AWS anymore? I have connection issues with AWS right now, so I wanna buy a new GPU and do my HW on my local machine
08:36:53 From Jacob Lee (TA): ^ keep in mind memory is important for model quality; it allows you to store larger models and use larger batch sizes
08:37:10 From Jacob Lee (TA): larger batch sizes -> better, smoother training because the gradient is smoother
08:37:16 From Tony Qin (TA): Yi Shen, kind of. Speed is one thing, but also pay attention to vram. I wouldn't suggest working with any less than 10GB of vram. In terms of ram, I would suggest 32GB. If your system is better than that, then your local machine will be fine, probably faster than a g4dn.xlarge
08:37:40 From Anon. MXNet: @Tony, thank you so much!
08:38:29 From Anon. RBM: I guess I'll stick with aws, I don't have that much vram. But do we need that much vram for hw1 itself?
08:39:19 From Tony Qin (TA): Ananyananda, for hw1 you don’t need much vram. For hw2 and beyond, you will
08:39:26 From Tony Qin (TA): I’ll bring back the GPU thread, post questions there
08:39:34 From Anon. RBM: Cool, thanks!
08:41:24 From Anon. Graph: what’s the purpose of super().__init__()?
08:41:41 From Jacob Lee (TA): ^ runs the init method of the parent class
08:41:50 From Anon. Graph: Why do we need to do that?
08:41:57 From Anon. is_available(): inheritance
08:42:19 From Anon. MXNet: you may want to check what python inheritance is
08:42:24 From Anon. LTI: otherwise you may not be able to call functions from nn.Module
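A minimal nn.Module subclass to make the super().__init__() point concrete: calling the parent constructor is what sets up parameter registration, .train()/.eval(), and the rest of the module machinery. The class name and layer sizes are made up:

```python
import torch.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self, in_size, hidden_size, num_classes):
        super().__init__()  # run nn.Module's __init__ so submodules/params get registered
        self.layers = nn.Sequential(
            nn.Linear(in_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes),  # raw logits, no softmax here
        )

    def forward(self, x):
        # nn.CrossEntropyLoss applies log-softmax + NLL itself,
        # so the network just outputs unnormalized scores.
        return self.layers(x)

model = SimpleMLP(40, 256, 71)  # made-up sizes
```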
08:43:23 From Jacob Lee (TA): Btw, this recitation is directly useful for hw1p1 and hw1p2. In hw1p1 you'll be implementing some of this :)
08:43:35 From Jacob Lee (TA): In hw1p2 you'll be applying this info to train a model on a large dataset
08:44:36 From Anon. Baseline: Do you have to train -> eval -> train again?
08:44:52 From Jacob Lee (TA): if you're training again afterwards, then yes
08:46:45 From Anon. DBN: | || || |_
08:46:53 From Anon. Voltage-gate: How would we run softmax on eval then?
08:46:56 From Anon. LTI: In the example we're repeatedly putting data from CPU to GPU in every epoch; can we do it beforehand and only access it in the training process?
08:46:58 From Anon. Voltage-gate: Or can we directly just use NLL loss?
08:47:03 From Jacob Lee (TA): l o s s
08:47:17 From Jacob Lee (TA): ^ Edward li there's not really a reason to for hw1p2; you might as well just use xeloss
08:47:18 From Anon. Fast RCNN: The TA was saying that CrossEntropyLoss includes softmax and which other function again? Thanks!
08:47:21 From Tony Qin (TA): When putting data from CPU to GPU, do it in the training loop
08:47:38 From Jacob Lee (TA): the writeup for hw1p1 will talk about what cross entropy loss is composed of in torch
08:47:51 From Anon. LTI: @Tony any reason for that? Really curious
08:48:00 From Jacob Lee (TA): otherwise go to the torch documentation for cross entropy loss; they have an explanation, but it's a little shallow
08:48:03 From Anon. Uniform Distribution: So this means that we don't need to apply the cross entropy activation function to the hidden layer before the output layer and can directly go ahead with cross-entropy?
08:48:05 From Antioch Sanders (TA): @Qiyun Chen logsoftmax and nllloss
08:48:20 From Anon. Fast RCNN: Thanks! @Jacob @Antioch
08:48:25 From Anon. Uniform Distribution: sorry, softmax*
08:48:48 From Tony Qin (TA): Daniel, if you do it in the data loader, it will cause issues when num_workers>1. Trying to put data onto the GPU from multiple workers at the same time causes issues, I think
08:48:58 From Jacob Lee (TA): @sai that's correct
08:49:28 From Anon. LTI: Gotcha. Thanks Tony
08:49:30 From Anon. Uniform Distribution: Thanks @Jacob
08:49:38 From Anon. MXNet: may I ask what content quiz 1 will cover?
08:49:49 From Jacob Lee (TA): ^ lec 1 and lec 2
08:49:58 From Anon. MXNet: cool, thx
08:50:07 From Anon. Array: do we get 3 chances for all quizzes?
08:50:10 From Jacob Lee (TA): yes
08:51:27 From Anon. Deep Dream: is it possible to give the optimizer only some of the model's parameters, and if so, how?
08:51:39 From Anon. PyTorch: would it make sense to have the learning rate be (number of training samples)^-1?
08:52:07 From Jacob Lee (TA): model.parameters() returns an iterator over the params (model.named_parameters() gives name/param pairs), which i suppose you could manually filter through
08:52:17 From Jacob Lee (TA): otherwise you should set requires_grad to False for that param
08:52:26 From Tony Qin (TA): Learning rate is usually set to 1e-3 or 1e-4, gradually decreasing over time. If the learning rate were (number of training samples)^-1, the model might train too slowly
08:52:32 From Anon. LTI: @Joseph maybe you can write your own optimizer by inheritance
08:52:47 From Anon. Activation Function: Please explain zeroing the gradient after the optimizer step
08:53:16 From Anon. ReduceLROnPlateau: can you explain train() vs eval() again, and something about turning train() OFF and ON?
08:53:46 From Anon. Validate: you want to clear the gradients after backprop so that the next time you start accumulating gradients, the gradients from last time don't carry over
08:53:48 From Jacob Lee (TA): train() and eval() both manipulate a Boolean "is_train". when "is_train == True", the model is in train mode
08:54:08 From Jacob Lee (TA): ^ satoru is correct; you'll be implementing this in hw1p1 too
08:54:16 From Anon. ReduceLROnPlateau: thank you
08:54:23 From Antioch Sanders (TA): certain layers behave differently depending on is_train
08:54:27 From Anon. Voltage-gate: Is there any case where we don't want to zero gradients?
08:54:50 From Anon. Deep Dream: @Daniel selecting a subset of the model's parameters seems like it'd be part of a specific model class rather than the optimizer to me, but perhaps i'm misinformed
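Pulling together the threads above (moving batches to the GPU inside the loop, zero_grad()/backward()/step(), and model.train() vs model.eval()), here is a bare-bones training loop sketch. The tiny synthetic model and data are stand-ins so it runs end to end; they are not the homework starter code:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tiny stand-ins; replace with the real model and datasets.
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
train_loader = DataLoader(TensorDataset(torch.randn(512, 40), torch.randint(0, 10, (512,))),
                          batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(torch.randn(128, 40), torch.randint(0, 10, (128,))),
                        batch_size=64)

criterion = nn.CrossEntropyLoss()          # log-softmax + NLL in one call, takes raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    model.train()                          # train-mode behavior (dropout, batchnorm stats)
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)  # move each batch inside the loop
        optimizer.zero_grad()              # clear gradients accumulated by the last step
        loss = criterion(model(x), y)
        loss.backward()                    # compute gradients
        optimizer.step()                   # update parameters using those gradients

    model.eval()                           # switch layers to inference behavior
    correct = total = 0
    with torch.no_grad():                  # no gradient bookkeeping during validation
        for x, y in val_loader:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.size(0)
    print(f"epoch {epoch}: val acc {correct / total:.3f}")
```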
08:55:24 From Anon. Mask-RCNN: So essentially loss.backward() computes the gradient while optimizer.step() tunes the parameters using the gradient?
08:55:30 From Anon. LTI: @Joseph you're gonna pass parameters to the optimizer anyway, so I guess it doesn't matter
08:55:48 From Anon. Boltzmann: In which part do we initialize the parameters?
08:55:48 From Anon. LTI: And the update process should be handled by the optimizer, logically
08:55:53 From Jacob Lee (TA): i'd still recommend just selecting params
08:56:10 From Jacob Lee (TA): params are generally initialized when you create the module object
08:56:27 From Jacob Lee (TA): @yiwei exactly right
08:56:35 From Anon. Cable Theory: That would be batch SGD
08:57:33 From Anon. Fast-RCNN: Not yet
08:57:34 From Anon. Cable Theory: nope
08:57:35 From Anon. Depolarization: no
08:57:36 From Jacob Lee (TA): we'll cover it briefly in hw1p1
08:57:37 From Anon. Uniform Distribution: Nope
08:57:39 From Jacob Lee (TA): writeup
08:58:02 From Anon. Alpha: Do we need to build the Dataset and training loop every time we train a DL model? Or is there some higher-level API/function (like scikit-learn) we can use at a later stage?
08:58:39 From Jacob Lee (TA): ^ it depends; libraries like torchvision often have higher-level apis. i think this year you're allowed to use those
08:58:55 From Jacob Lee (TA): (for hw2p2)
08:59:05 From Jacob Lee (TA): i'm pretty sure for hw1p2 you're required to write your own dataloader/dataset
08:59:14 From Anon. Mask-RCNN: So I guess in this case nn.Linear probably has some options for different initialization schemes?
08:59:17 From Anon. Alpha: ok, thanks!!
08:59:58 From Jacob Lee (TA): @yiwei as far as i know, options aren't really built in... in the past i'd create the object, then modify the params manually with some np or torch function
08:59:59 From Anon. MXNet: @Tony you just mentioned that you suggest training the model with 32G of VRAM. But I just found that even the latest GPUs only have around 20G? Have I misunderstood anything?
09:00:08 From Jacob Lee (TA): i think by default they do kaiming init? not sure...
09:00:21 From Tony Qin (TA): Yi, 10GB vram, 32GB normal ram. Post further questions on the GPU FAQ
09:00:39 From Anon. MXNet: Oh, I got it. hahaha
09:01:00 From Anon. Validate: @Yiwei I think some of the NN modules have options for those, but you can always init parameters for any module manually
09:01:33 From Anon. Mask-RCNN: Thanks
09:01:36 From Jacob Lee (TA): https://pytorch.org/docs/stable/nn.init.html
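On the initialization question, the nn.init page linked above has helpers you can apply after constructing the layers. A small sketch of manually re-initializing the Linear layers of a model with Kaiming init; the model, the init_weights helper, and the choice of nonlinearity are illustrative:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))

def init_weights(m):
    # Apply Kaiming (He) init to every Linear layer; leave other modules alone.
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

model.apply(init_weights)  # .apply() recursively visits every submodule
```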
09:02:51 From Anon. ReduceLROnPlateau: if model.train() sets "is_train == True", then what part of the code sets "is_train == False"?
09:02:58 From Jacob Lee (TA): model.eval()
09:03:22 From Anon. PyTorch: how do we save and load trained models?
09:03:32 From Anon. ReduceLROnPlateau: then could we put model.train() outside the for-epoch loop?
09:03:44 From Jacob Lee (TA): saving/loading models is for convenience/backup; you don't have to, but you'll definitely want to
09:03:55 From Jacob Lee (TA): @matias yes
09:04:07 From Anon. ReduceLROnPlateau: @Jacob thank you
09:04:11 From Jacob Lee (TA): for saving/loading models i'd google; there are specific methods for it
09:04:19 From Anon. is_available(): I wondered what the output of model.eval() is? Can we customize it?
09:04:30 From Jacob Lee (TA): there's no output, it's an in-place operation
09:06:13 From Anon. Mask-RCNN: If the is_train variable is set to false, then the parameters of the model cannot be changed?
09:06:34 From Jacob Lee (TA): not unless you change them manually yourself
09:06:39 From Anon. MXNet: when are we expected to have access to the recording of this recitation? Will the FAQ part be recorded as well?
09:07:02 From Jacob Lee (TA): yes
09:08:06 From Anon. Sequence: I have a question about using a package for the homework assignments; where should I post that on Piazza?
09:08:31 From Jacob Lee (TA): my advice: make sure to have fun
09:08:55 From Anon. Dot Product: spot instances?
09:09:01 From Anon. Bidirectional: no it was saving
09:09:02 From Anon. Bidirectional: ur model
09:09:33 From Anon. ReduceLROnPlateau: save the model and the weights?
09:10:11 From Anon. Fast-RCNN: Can you also post some info on why and how to use spot instances?
09:11:02 From Anon. Uniform Distribution: Can you make a piazza post on good practices in training? Thanks
09:11:10 From Jacob Lee (TA): 👏🏻👏🏻👏🏻
09:11:11 From Anon. Fast RCNN: Thank you so much!
09:11:14 From Anon. Fast-RCNN: Thank you!
09:11:14 From Anon. Autolab: Thank you!
09:11:21 From Anon. ReduceLROnPlateau: thank you TAs
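Since saving the model and its weights came up at the end, a minimal checkpoint sketch using torch.save/torch.load with state_dicts. The file name, the stored keys, and saving the optimizer state are just one common convention, not the course's required format:

```python
import torch
import torch.nn as nn

model = nn.Linear(40, 10)                 # stand-in for your real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Save: store state_dicts (weights and optimizer buffers), not the whole objects.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 5},
           "checkpoint.pth")              # placeholder path

# Load: rebuild the objects first, then restore their state into them.
ckpt = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
model.eval()                              # or model.train() to resume training
```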