How to scrape Lightshot and why you shouldn't use it

Web Scraping & Machine learning

A scraper designed to crawl through the image sharing site, using a CNN to classify images and OCR to extract text.


If you’ve reached this page because you just want to scrape the site (Lightshot, to use its official name), then look no further than the GitHub link above. But be warned: the legality of web scraping is questionable, and some sites may not be happy if you bombard them with requests for images.

The problem

It’s well known that sequential IDs are a bad thing on websites. They allow scrapers to easily traverse every item on the site, and if those items contain sensitive information that’s particularly bad, because a scraper can collect a massive amount of information with next to no effort.

Instead, IDs should be long, random and unpredictable. A good standard is a UUID e.g. 9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d.
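As a rough illustration of what "long, random and unpredictable" can look like in practice, here is a minimal sketch using only Python's standard library. The function names new_uuid_id and new_token_id are made up for this example; the uuid and secrets modules are real.

```python
import secrets
import uuid


def new_uuid_id() -> str:
    """Return a random (version 4) UUID as a 36-character string."""
    return str(uuid.uuid4())


def new_token_id(num_bytes: int = 16) -> str:
    """Return a URL-safe token built from num_bytes of cryptographic randomness."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Each run prints different values; neither ID reveals anything about its neighbours.
    print(new_uuid_id())
    print(new_token_id())
```

Either approach gives an ID that says nothing about which IDs came before or after it, which is the property that matters here.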

Lightshot, however, has IDs of the following structure: bghwg3. That’s a 6-character alphanumeric string, and a computer can generate every possible variation in seconds. What makes this worse is that these IDs are sequential, which means the image with ID abcdef was uploaded after abcdee, which was uploaded after abcded, abcdec, abcdeb… and so on.
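To put a number on how small that ID space is, here is a quick sketch. It assumes the alphabet is the digits 0-9 plus lowercase a-z (36 symbols), which matches the example ID but is not confirmed anywhere; ALPHABET, id_space_size and all_ids are names invented for this example.

```python
import itertools
import string

# Assumption: IDs use digits and lowercase letters only (36 symbols).
ALPHABET = string.digits + string.ascii_lowercase
ID_LENGTH = 6


def id_space_size(alphabet: str = ALPHABET, length: int = ID_LENGTH) -> int:
    """Number of distinct IDs of the given length over the alphabet."""
    return len(alphabet) ** length


def all_ids(alphabet: str = ALPHABET, length: int = ID_LENGTH):
    """Lazily yield every possible ID in alphabet order."""
    for combo in itertools.product(alphabet, repeat=length):
        yield "".join(combo)


if __name__ == "__main__":
    print(id_space_size())                       # 36 ** 6 = 2176782336
    print(list(itertools.islice(all_ids(), 3)))  # ['000000', '000001', '000002']
    print(2 ** 122)                              # random bits in a version 4 UUID
```

Under that assumption there are roughly 2.2 billion possible IDs, compared with 2^122 possible values for a random UUID.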

You can test this: upload an image to Lightshot and you’ll get a URL ending in a 6-character ID. Increase the last number/letter of that ID by one and you’ll get the image that someone else uploaded right after you uploaded your own.
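"Increasing the last number/letter by one" is just adding one in base 36. A minimal sketch, assuming digits sort before lowercase letters (the real ordering is not documented here) and using a made-up helper name, next_id:

```python
import string

# Assumption: digits come before letters in the site's ordering.
ALPHABET = string.digits + string.ascii_lowercase


def next_id(current: str, alphabet: str = ALPHABET) -> str:
    """Return the ID that follows `current`, carrying leftwards like a counter."""
    chars = list(current)
    for pos in range(len(chars) - 1, -1, -1):
        idx = alphabet.index(chars[pos])  # raises ValueError on unknown symbols
        if idx + 1 < len(alphabet):
            chars[pos] = alphabet[idx + 1]
            return "".join(chars)
        chars[pos] = alphabet[0]  # wrap this position and carry to the left
    # Every position wrapped, so the ID grows by one character.
    return alphabet[1] + "".join(chars)


if __name__ == "__main__":
    print(next_id("abcdee"))  # abcdef
    print(next_id("abcdez"))  # abcdf0
```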

Building a scraper

There were two stages to the scraper, a collection stage and a processing stage.


As mentioned above, the collection stage is really simple. Pick a starting code, scrape the HTML to find the image URL, download it, generate the next code, scrape, download, next code… This alone should not be possible on any website; however, it can be combined with other techniques to extract information far more efficiently.
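The loop described above can be sketched roughly as follows. This is not the published scraper: the page layout (an og:image meta tag holding the image URL), the fetch_page callable and the delay value are all assumptions made for illustration, and the download step is left out. Fetching is injected as a function so the sketch runs without touching any real site.

```python
import time
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator, Optional, Tuple


class ImageUrlParser(HTMLParser):
    """Remembers the first <meta property="og:image"> content (assumed layout)."""

    def __init__(self) -> None:
        super().__init__()
        self.image_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.image_url is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:image":
            self.image_url = attr_map.get("content")


def extract_image_url(html: str) -> Optional[str]:
    """Return the image URL found in the page, or None if there is none."""
    parser = ImageUrlParser()
    parser.feed(html)
    return parser.image_url


def collect(
    codes: Iterable[str],
    fetch_page: Callable[[str], Optional[str]],  # hypothetical: code -> HTML or None
    delay_seconds: float = 1.0,
) -> Iterator[Tuple[str, str]]:
    """Yield (code, image_url) pairs, pausing between requests."""
    for code in codes:
        html = fetch_page(code)
        if html is not None:
            url = extract_image_url(html)
            if url:
                yield code, url
        time.sleep(delay_seconds)  # avoid bombarding the site with requests


if __name__ == "__main__":
    # Fake pages stand in for real responses.
    pages = {"aaaaa1": '<meta property="og:image" content="https://example.invalid/1.png">'}
    print(list(collect(["aaaaa1", "aaaaa2"], pages.get, delay_seconds=0)))
```

The pause between requests is there because of the warning at the top of this post: a tight loop with no delay is exactly the kind of bombardment sites object to.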


It’s all well and good having a collection of 1,000,000 random screenshots that people have taken without realising anyone other than a few trusted individuals would see them, but how do we actually do anything useful with all of this data? Well, we can eyeball it first and see what sort of images people upload. There were a few main types of images that people uploaded, ordered by frequency:

These three types of image are fairly easy to distinguish and being able to separate them automatically would be very useful. After manually categorising two or three thousand images, I was able to train a CNN to classify any image into one of these three categories with pretty decent accuracy.

But we can go further: what about the content of text images? Since we can identify images containing text with decent accuracy, we can reliably run OCR on just those images to extract any text from them. Once every image has some associated text, we can index and search the images pretty efficiently, looking for particular keywords that may be of interest. I’m sure you can think of some keywords that could have catastrophic results.
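The "index and search" step is a standard inverted index: a map from each word to the set of images whose text contains it. The sketch below shows only that generic idea with harmless sample strings; it is not the unpublished CNN/OCR component, and tokenize, build_index and search are names invented for this example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(text_by_image: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of image IDs whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for image_id, text in text_by_image.items():
        for word in tokenize(text):
            index[word].add(image_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the image IDs whose text contains every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # Placeholder strings standing in for OCR output.
    sample = {
        "img1": "Quarterly sales report",
        "img2": "Sales meeting notes",
        "img3": "Holiday photo",
    }
    index = build_index(sample)
    print(sorted(search(index, "sales")))         # ['img1', 'img2']
    print(sorted(search(index, "sales report")))  # ['img1']
```

Lookups cost about the same no matter how many images are indexed, which is why keyword search over a million screenshots is cheap once the text exists.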


This should not be possible. People believe screenshots uploaded here are private. I originally undertook this project as a demonstration to Skillbrains (the company that runs Lightshot) of what could be done as a result of their ID system, but they refused to respond and have yet to update their system. I opted not to publish the CNN/OCR component of this project because of the many ethical implications involved. The scraper itself is fairly harmless and may eventually encourage Skillbrains to make changes to their system if enough people become aware of this issue, so feel free to check it out at the GitHub link above. There’s also a precollected dataset of screenshots on Kaggle, linked above as well.


