We write an asynchronous image parser and scraper in Python with a graphical interface

The image for the article was created by Microsoft Designer

In this article, we will build a desktop application that saves a requested number of pictures to disk. Since there will be a lot of pictures, we will use Python's asynchrony to perform the I/O operations concurrently. Let's see how the requests and aiohttp libraries differ. We will also create two additional program threads so that the blocking behaviour of the Python interpreter (the GIL) does not freeze the graphical interface.

Instead of a thousand words…

To better understand what I’m talking about, I’ll just show you what the result should be and how it will work:

Before writing the program, we need to define the main classes, each of which performs one strictly defined function and is limited to its own task.

Graphical interface classes

We will have a separate GUI class; let's call it UI – the main window of the program. The window contains two different frames, and we represent each frame as its own class:

  • The SearchFrame class will be responsible for entering a search query that will be used to search for pictures.

  • The ScraperFrame class will be responsible for displaying the number of images available for download, choosing a path to save them, and selecting their size and count. This class also implements the download/save progress bar and the information field.

We implement these classes using the standard Python library – tkinter.
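
For orientation, here is a minimal sketch of how the main window and its two frames could be wired together with tkinter. The widget set and layout are my own assumptions for illustration, not the exact code of the application:

import tkinter as tk
from tkinter import ttk


class SearchFrame(ttk.Frame):
    """Frame with the search query entry (simplified sketch)."""

    def __init__(self, master):
        super().__init__(master)
        self.query_entry = ttk.Entry(self)
        self.query_entry.pack(fill='x', padx=10, pady=10)


class ScraperFrame(ttk.Frame):
    """Frame with the save path, size/count controls and a progress bar (sketch)."""

    def __init__(self, master):
        super().__init__(master)
        self.progress = ttk.Progressbar(self, maximum=100)
        self.progress.pack(fill='x', padx=10, pady=10)


class UI(tk.Tk):
    """Main window that stacks the two frames."""

    def __init__(self):
        super().__init__()
        self.title('Image scraper')
        self.search_frame = SearchFrame(self)
        self.search_frame.pack(fill='x')
        self.scraper_frame = ScraperFrame(self)
        self.scraper_frame.pack(fill='both', expand=True)


if __name__ == '__main__':
    UI().mainloop()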

We have decided on the graphical interface classes. Now we need to figure out how to search for, download and save the pictures.

PictureLinksParser parsing class

Choice of photo hosting

First, we need to choose a photo hosting site to take the pictures from and analyze how exactly the photos are rendered there. To begin with, we disable JavaScript and look at the site's behavior in the browser developer tools.

Disabling JavaScript via the Google Chrome developer tools.

If all the content disappears when we refresh the page, we are most likely dealing with a single-page application (SPA). Parsing such sites requires heavier Python tooling: for example, Scrapy with the Splash rendering service, or, even worse, Selenium. Scrapy is a great tool, but not for our case. Remember the KISS principle? That is why we are looking for a site where JavaScript has little impact on the content.

My choice settled on the website flickr.com. The only problem with this site is that it has no page pagination: new pictures are loaded as you scroll the feed, so without scrolling we get no more than 25 pictures. A little later in the article I will show a simple trick to bypass this limitation.

Choosing a library for parsing

The simplest HTML parser for Python is Beautiful Soup, and this library is quite enough for our task.

Choosing a library for web requests

Everyone knows that there is a requests library. The problem with this library is that it is blocking: while a request is being executed and the data is being retrieved, the Global Interpreter Lock (GIL) is held. The GIL prevents a Python process from executing more than one bytecode instruction at any given time. You may ask: why is requests used at all, then? For a single web request the GIL is invisible. But imagine that we have 1000 such requests: until all 1000 of them have completed, the rest of the program is blocked. Non-blocking libraries were created to solve this problem; one example is aiohttp, which can also make web requests.

And here I have to convey one important point: for a single web request, aiohttp gains us nothing over requests. Aiohttp only pays off when we run multiple web requests concurrently.
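
To make this concrete, here is a minimal sketch (the URLs are placeholders) of the only scenario where aiohttp pays off: several requests handed to asyncio.gather and executed concurrently within one client session:

import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> int:
    """Downloads one URL and returns the HTTP status."""
    async with session.get(url) as response:
        await response.read()
        return response.status


async def main() -> None:
    # example.com is a placeholder; any list of picture URLs works the same way
    urls = ['https://example.com'] * 10
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*(fetch(session, u) for u in urls))
    print(statuses)


asyncio.run(main())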

The PictureLinksParser class will make only one web request to retrieve the HTML document, but another class will run many web requests concurrently, so we install only aiohttp: there is no need for an extra requests dependency just for one web request. The parsing algorithm is as follows:

    def parse_html_to_get_links(self, html: str) -> None:
        """Parses HTML and adds links to the array."""
        # PHOTO_CONTAINER (tag name) and PHOTO_CLASS (CSS class) are
        # constants defined elsewhere in the module.
        soup = BeautifulSoup(html, 'lxml')
        box = soup.find_all(PHOTO_CONTAINER, class_=PHOTO_CLASS)
        for tag in box:
            img_tag = tag.find('img')
            src_value = img_tag.get('src')
            self.add_links('https:' + src_value)

    async def get_html(self) -> None:
        """Downloads HTML with picture links."""
        async with aiohttp.ClientSession() as session:
            async with session.get(self.url) as response:
                html = await response.text()
        self.parse_html_to_get_links(html)

And here comes the first drawback of aiohttp compared to requests. Requests has the simplest possible interface: you call get() with the page address and you get the page back. With aiohttp we create a client session, which is a runtime environment for making HTTP requests and managing connections, and we use asynchronous context managers so that the HTTP session is opened and closed correctly.
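
For comparison, here is the same single-page fetch written both ways (example.com is just a placeholder address):

import asyncio

import aiohttp
import requests

# requests: one blocking call, no session management needed
html_sync = requests.get('https://example.com').text


# aiohttp: an explicit client session plus async context managers
async def get_page(url: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


html_async = asyncio.run(get_page('https://example.com'))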

Conclusions:

  • if the program has only one request (getting a token, getting one web page), then we use the requests library.

  • if the program needs to perform many requests at the same time – we use aiohttp (or another non-blocking library).

PictureScraperSaver class for scraping pictures

Once the PictureLinksParser class has built a set of links, we need to visit those addresses and save the pictures to disk.

The set() collection in Python

set is a standard data type that everyone knows: a very fast collection built on hash tables that holds only unique elements. The peculiarity of our photo hosting is that when the HTML document is requested again, some completely new links appear that were not present in the previous response. Thanks to the set, when we repeat a query with the same keyword, we simply add the links to our collection, and its size grows by exactly the number of new unique links.
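
A tiny illustration of that behaviour (the add_links name mirrors the parser method above; the link values are made up):

links: set[str] = set()


def add_links(url: str) -> None:
    """Duplicates are silently ignored; only new links grow the set."""
    links.add(url)


# first request for the keyword
for url in ('https://live.staticflickr.com/a.jpg',
            'https://live.staticflickr.com/b.jpg'):
    add_links(url)

# repeated request: one link is old, one is new
for url in ('https://live.staticflickr.com/b.jpg',
            'https://live.staticflickr.com/c.jpg'):
    add_links(url)

print(len(links))  # 3: the set grew only by the number of new unique links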

We perform web requests competitively

So, we have a set of links. Now we need to walk through them and save the pictures to disk. I implemented it in the following way (for better readability I did not use a list comprehension):

    async def _save_image(self, session: ClientSession, url: str) -> None:
        """Asynchronously downloads the image and saves it on disk."""
        try:
            response = await session.get(url)
            if response.status == HTTPStatus.OK:
                image_data = await response.read()
                pic_file = f'{self.picture_name}{self.completed_requests}'
                with open(f'{self.save_path}/{pic_file}.jpg', 'wb') as file:
                    file.write(image_data)
                logging.info(f'Successfully saved picture {url}')
            else:
                logging.error(
                    f'Error while processing picture: status {response.status}'
                )
        except Exception as e:
            logging.exception(f'Error while downloading {url}: {e}')
        self.completed_requests += 1
        # Notify the GUI (progress bar) every refresh_rate completed requests
        if self.completed_requests % self.refresh_rate == 0 or \
                self.completed_requests == self.total_requests:
            self.callback(self.completed_requests, self.total_requests)

    async def _make_requests(self) -> None:
        """Concurrently schedules the download of all requested URLs."""
        async with ClientSession() as session:
            reqs = []
            for _ in range(self.total_requests):
                current_link = self.links_array.pop()
                reqs.append(self._save_image(session, current_link))
            await asyncio.gather(*reqs)

Let's start with the _make_requests coroutine. We submit for concurrent execution only as many links as the number of pictures specified in the graphical interface, stored in the self.total_requests attribute. The set's pop() method removes an arbitrary element, which we hand off for downloading and saving, and asyncio.gather then downloads the images from the corresponding URLs concurrently.

As for the _save_image coroutine, it is even simpler. We follow the image link, check that the response status is 200 (OK), and then save the content we read using the standard open() function in binary write mode at the specified path. Events are logged at every stage.

We transfer the parser and scraper to additional threads

The problem is that parsing and scraping may involve long-running operations that block the graphical interface. In practice it looks like this: while 1000 requests are being executed, the GUI is blocked and stops responding, and the operating system offers to kill the process, considering it "hung".

To prevent this from happening, we use multithreading for the I/O operations: we create two additional threads and run an asynchronous event loop in each of them.

What we have:

  • main thread: GUI

  • additional thread #1: parser

  • additional thread #2: scraper

It remains only to make the parser and scraper classes thread-safe. This is achieved with two asyncio calls (a minimal sketch of how they are used follows the list):

  1. The call_soon_threadsafe method takes a plain Python function (not a coroutine) and thread-safely schedules it to run on the next iteration of the event loop.

  2. The run_coroutine_threadsafe function accepts a coroutine, submits it for execution in a thread-safe manner, and immediately returns a Future object that gives access to the coroutine's result.
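
Here is a minimal sketch of this mechanism outside the real classes: a worker thread runs its own event loop, while the main thread submits work to it thread-safely. The function and coroutine names are illustrative only:

import asyncio
import threading
import time


def start_background_loop(loop: asyncio.AbstractEventLoop) -> None:
    """Runs an asyncio event loop forever inside a worker thread."""
    asyncio.set_event_loop(loop)
    loop.run_forever()


async def scrape() -> str:
    """Stands in for the real parser/scraper coroutines."""
    await asyncio.sleep(1)
    return 'done'


loop = asyncio.new_event_loop()
threading.Thread(target=start_background_loop, args=(loop,), daemon=True).start()

# Submit a coroutine from the main (GUI) thread; a concurrent.futures.Future comes back
future = asyncio.run_coroutine_threadsafe(scrape(), loop)
print(future.result())  # a real GUI would poll or use a callback instead of blocking

# Schedule a plain callable on the loop in a thread-safe way
loop.call_soon_threadsafe(lambda: print('scheduled from another thread'))
time.sleep(0.1)  # give the loop thread a moment to run the scheduled callback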

View the full program code

Since the article does not give a complete picture of the application, I suggest checking out the full code in my GitHub repository.

In the repository description you will find a link to the exe version of the program, so you can play with it a little.

Possible uses

You can safely reuse this code in your own projects. For example, suppose you want to build an online service that returns a zip archive of images in response to a user's query. You will not have to worry much about server configuration: one processor core is enough, because the multithreading and asynchrony here live inside a single process and consume only that process's memory.

The application has been manually tested on Windows 11 and Ubuntu 22.04.
