7

Warning: my custom GPT model crashed after training on 10,000 images

I was building a model to sort local wildlife photos from my trail cams in Boulder. After about 8 hours of training on a set of 10,000 images, the whole thing just froze and corrupted the dataset. I had to restart from a backup I made 2 days ago. Has anyone else had a training job fail that far in, and what did you do to fix it?
3 comments

the_piper 23d ago
That "corrupted the dataset" part used to make me think it was a hardware issue, but I've seen it happen from bad image files too.
4
annanguyen 22d ago
Ugh, that's the worst feeling. I feel your pain; losing that much progress is a huge setback. @the_piper has a point about bad files. Maybe one weird corrupted image from a trail cam messed it all up. I had a model crash once because of a single broken PNG that looked fine to me. Now I run a quick cleanup script to check file integrity before any big training job, and it saves so much headache.
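For anyone who wants to try a pre-training check like this, here is a rough sketch of what one could look like. It is not the actual script mentioned above, just an illustration using only the Python standard library. The function names (`looks_intact`, `find_suspect_images`) and the choice to cover only JPEG and PNG are assumptions made for the example. It checks that each file starts and ends with the bytes its format expects, which catches empty and truncated files but not every kind of damage.

```python
from pathlib import Path

# Start/end byte patterns for the two formats covered in this sketch.
# Assumption: the photos are JPEG or PNG; other formats are reported as suspect.
JPEG_START = b"\xff\xd8"
JPEG_END = b"\xff\xd9"
PNG_START = b"\x89PNG\r\n\x1a\n"
PNG_END = b"IEND\xaeB`\x82"  # IEND chunk type followed by its fixed CRC


def looks_intact(path):
    """Cheap check: known header present and expected trailer bytes at the end."""
    data = Path(path).read_bytes()
    if data.startswith(JPEG_START):
        # Some devices pad with zero bytes after the end marker, so strip them.
        return data.rstrip(b"\x00").endswith(JPEG_END)
    if data.startswith(PNG_START):
        return data.endswith(PNG_END)
    return False  # empty file or unrecognized header


def find_suspect_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return a sorted list of image paths under folder that fail the check."""
    suspects = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            if not looks_intact(path):
                suspects.append(path)
    return suspects
```

Calling `find_suspect_images("trail_cam_photos")` (the folder name is made up) would list files worth inspecting or setting aside before a long run. A file can pass this check and still fail to decode, so opening each image with an image library is a stricter test if one is available, for example Pillow's `Image.open(path).verify()`.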
2
sageross 15d ago
That "corrupted the dataset" thing is just like when one bad file ruins a whole project at work. You have to check everything first now.
-1