Problem with tutorial 3 code

Michal Hartstein
Problem with tutorial 3 code

Hi,

I'm trying to use the code from tutorial 3 (where we classified Malaria images) as a basis for answering HW3. 

Even without changing anything in the code, I always get runtime errors that I don't understand how to solve:

  • CUDA error: device-side assert triggered

or

  • Cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:265

 

I tried changing the code so that it does not run on the GPU/CUDA at all, but that doesn't seem to work either.

Can you please guide me on how to solve this issue?

Thanks,

Michal

Sanmay Ganguly
Hi Michal, 

Is the code running fine on the CPU?

At the start of your notebook you can add a cell that sets the CUDA_LAUNCH_BLOCKING environment variable (note that a bare "CUDA_LAUNCH_BLOCKING=1" line in a cell is just a Python assignment and has no effect):

    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

and execute the cell before anything touches CUDA. This makes kernel launches synchronous, so the traceback points at the operation that actually failed rather than at a later, unrelated line.

Can you check this and report whether you still see the crash?
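Since you mentioned that running without the GPU didn't work for you, here is a minimal sketch of how to run one batch entirely on the CPU (model_cnn, loss_func and train_loader are the tutorial's names; adapt them to your notebook). On the CPU, the opaque device-side assert usually turns into a readable Python error:

    import torch

    device = torch.device("cpu")             # ignore CUDA entirely
    model_cnn = model_cnn.to(device)         # move the model (back) to the CPU

    data, target = next(iter(train_loader))  # grab a single batch
    data, target = data.to(device), target.to(device)

    output = model_cnn(data)                 # a bad label/shape now raises a clear error
    loss = loss_func(output, target)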

-- Sanmay

Jan Kadlec
Hi,

I was getting the same error. When I used the command you are proposing, my error changed to:

 

    RuntimeError                              Traceback (most recent call last)
    <ipython-input-41-392f6679389c> in <module>()
         16         # move tensors to GPU if CUDA is available
         17         if torch.cuda.is_available():
    ---> 18             data, target = data.cuda(), target.cuda()
         19         # clear the gradients of all optimized variables
         20         optimizer.zero_grad()

    RuntimeError: CUDA error: device-side assert triggered

 

So far I have found that the problem could be with indexing, but I have not yet found how to fix it.**

Jan

 

Btw, the other error is:

    RuntimeError                              Traceback (most recent call last)
    <ipython-input-43-392f6679389c> in <module>()
         13     ###################
         14     model_cnn.train() ## --- set the model to train mode -- ##
    ---> 15     for data, target in train_loader:
         16         # move tensors to GPU if CUDA is available
         17         if torch.cuda.is_available():

    /usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __next__(self)
        615             batch = self.collate_fn([self.dataset[i] for i in indices])
        616             if self.pin_memory:
    --> 617                 batch = pin_memory_batch(batch)
        618             return batch
        619

    /usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in pin_memory_batch(batch)
        243         return {k: pin_memory_batch(sample) for k, sample in batch.items()}
        244     elif isinstance(batch, container_abcs.Sequence):
    --> 245         return [pin_memory_batch(sample) for sample in batch]
        246     else:
        247         return batch

    /usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in <listcomp>(.0)
        243         return {k: pin_memory_batch(sample) for k, sample in batch.items()}
        244     elif isinstance(batch, container_abcs.Sequence):
    --> 245         return [pin_memory_batch(sample) for sample in batch]
        246     else:
        247         return batch

    /usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in pin_memory_batch(batch)
        237 def pin_memory_batch(batch):
        238     if isinstance(batch, torch.Tensor):
    --> 239         return batch.pin_memory()
        240     elif isinstance(batch, string_classes):
        241         return batch

    RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:265

 

 

** please see these links:

https://lernapparat.de/debug-device-assert/

https://discuss.pytorch.org/t/runtimeerror-cuda-runtime-error-59/750/15

https://github.com/pytorch/pytorch/issues/4144
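In case it helps others: if the indexing hypothesis is right, a quick sanity check on the labels can confirm it before any CUDA call. This is only a sketch; num_classes = 2 is my assumption for the Malaria tutorial (infected/uninfected), so adjust it to your setup:

    import torch

    num_classes = 2  # assumption: binary classification in the Malaria tutorial

    for _, target in train_loader:
        # CrossEntropyLoss/NLLLoss expect labels in [0, num_classes - 1];
        # a label outside this range is a classic cause of cuda error 59
        assert target.min().item() >= 0, "negative label found"
        assert target.max().item() < num_classes, "label >= num_classes found"
    print("all labels are in range")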

Dor Korn
More problems with the tutorial 3 code

Hi,

I am getting the following error when running the training:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-26-05905ce95d7b> in <module>()
         20         optimizer.zero_grad()
         21         # forward pass: compute predicted outputs by passing inputs to the model
    ---> 22         output = model_cnn(data)
         23         # calculate the batch loss
         24         loss = loss_func(output, target)

    3 frames

    /usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py in forward(self, input)
        318     def forward(self, input):
        319         return F.conv2d(input, self.weight, self.bias, self.stride,
    --> 320                         self.padding, self.dilation, self.groups)
        321
        322

    RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

I tried the solution you gave and it didn't work. Please help.

Thank you,

Dor

Stav Nahum
Hi Michal,

I got the same problem, and if I remember correctly I managed to fix it by restarting my server (I work on Google Cloud Platform) and making sure everything is on the GPU (CUDA).

 

Let me know if it works out for you,

Stav

Jonathan Shlomi
Dor,

You need to put the model on the GPU before starting the training, with model.cuda().

The error says the input is torch.cuda.FloatTensor but the weight is torch.FloatTensor, which means your data made it to the GPU while the model did not.
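Something along these lines before the training loop should work (a sketch using the tutorial's names; the key point is that the model and the data must live on the same device):

    import torch

    if torch.cuda.is_available():
        model_cnn = model_cnn.cuda()   # weights become torch.cuda.FloatTensor

    for data, target in train_loader:
        if torch.cuda.is_available():
            data, target = data.cuda(), target.cuda()  # inputs now match the weights
        output = model_cnn(data)       # no more FloatTensor vs cuda.FloatTensor mismatch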

To the others with THCCachingHostAllocator errors: that seems to be the Colab instance running out of memory, so try reducing your batch sizes (see the sketch below). But to be honest, I can't be sure without seeing your code, so please come by the office if the problem is not resolved.
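For example, a minimal sketch of rebuilding the loader with a smaller batch (batch_size=16 and the train_data name are my assumptions; use whatever your notebook defines):

    from torch.utils.data import DataLoader

    # smaller batches reduce both GPU memory use and the pinned host memory
    # that THCCachingHostAllocator allocates; pin_memory=False is a further fallback
    train_loader = DataLoader(train_data, batch_size=16,
                              shuffle=True, pin_memory=False)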