Just parse anything using the Rust language (well… or just download the file)

It is known that Rust is a type-safe programming language whose code is checked by the cargo package manager before it is built. On the one hand, this is good: there is less chance of a sudden failure at the most inopportune moment. On the other hand, code that performs a single function can end up very complex and sometimes unreadable.

Here is an example of a function that parses the currency names from the floatrates site and prints them (please write in the comments how many minutes it took you to understand this chain of calls):

// main.rs
use reqwest::blocking::get;
use scraper::{Html, Selector};

fn main() {
    let url = "http://www.floatrates.com/json-feeds.html";
    let response = get(url).expect("The link does not load.");
    let body = response.text().unwrap();
    let document = Html::parse_document(&body);

    document
        .select(
            &Selector::parse("div.bk-json-feeds>div.body>ul>li>a")
                .unwrap()
        )
        .map(|element| element.inner_html())
        .collect::<Vec<String>>()
        .iter()
        .for_each(|title| println!("{title}"));
}

In this article, I want to talk about my new library that greatly simplifies parsing in Rust. Happy reading!
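
As a preview, here is a sketch of roughly what the floatrates example above could look like with scr, using the Scraper::from_http and get_all_text methods described later in the article (the URL and selector are the same as above):

// main.rs (a sketch using the scr API shown below)
fn main() {
    // scr adds the "http://" scheme itself, so it is omitted here
    let scraper = scr::Scraper::from_http("www.floatrates.com/json-feeds.html").unwrap();

    // the same selector as in the reqwest + scraper example above
    for title in scraper.get_all_text("div.bk-json-feeds>div.body>ul>li>a").unwrap() {
        println!("{title}");
    }
}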

Disclaimer

Please do not be too hard on me for mistakes in terminology, comments, or code syntax: I have only been programming in Rust for two months.

However, I do have programming experience (a year of Python), so I understand well what I am writing about here.

I will be very grateful if you write in the comments what should be fixed in the crate :)

Thanks to reloginn for helping with the development of version 1.0.

Initializing the Scraper structure

use reqwest::{blocking, Error};
use scraper::{ElementRef, Html, Selector, error::SelectorErrorKind};

/// Creates a new instance
fn instance(url: &str) -> Result<Scraper, Error> {
    let response = blocking::get(url)
        .expect("Сайт не загружается")
        .text()
        .expect("Код сайта не возвращается");

    Ok(Scraper { document: Html::parse_document(&response) })
}

/// A simple parser
/// // ...
pub struct Scraper {
    document: Html,
}

Instead of separately sending a request, getting the page's HTML from the result, and initializing a scraper::Html object, you can simply call let scraper = scr::Scraper::new("scrapeme.live/shop/").unwrap() (and yes, you do not need to type “https://” or “http://”). The structure keeps a scraper::Html inside itself for further parsing. Here is what goes on under the hood of the initializer:

// scr::scraping
// ...
impl Scraper {
    /// Creates a new parser instance
    /// <i>from the site URL **(without https://)**</i>
    pub fn new(url: &str) -> Result<Scraper, Error> {
        instance(format!("https://{url}").as_str())
    }

    /// Creates a new parser instance
    /// <i>from the site URL **(without http://)**</i>
    pub fn from_http(url: &str) -> Result<Scraper, Error> {
        instance(format!("http://{url}").as_str())
    }
    // ...
}
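
A quick usage sketch (the URL is just the example from above; both constructors return Result<Scraper, reqwest::Error>):

// a usage sketch, not part of the crate itself
fn main() {
    // fetched via https:// under the hood
    let scraper = scr::Scraper::new("scrapeme.live/shop/").unwrap();
    // fetched via http:// under the hood
    let scraper_http = scr::Scraper::from_http("scrapeme.live/shop/").unwrap();
}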

It is also possible to initialize the structure using an HTML page fragment:

// scr::scraping
// ...
impl Scraper {
    // ...
    /// Creates a new parser instance
    /// <i>from a **fragment** of the site's HTML</i>
    pub fn from_fragment(fragment: &str) -> Result<Scraper, SelectorErrorKind> {
        Ok(Scraper { document: Html::parse_fragment(fragment) })
    }
    // ...
}
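
A short sketch with a hypothetical HTML fragment (the fragment itself is made up for illustration):

// a usage sketch with a made-up HTML fragment
fn main() {
    let fragment = r#"<ul><li><a href="/usd">US Dollar</a></li></ul>"#;
    let scraper = scr::Scraper::from_fragment(fragment).unwrap();
    // `scraper` can now be queried exactly like one built from a URL
}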

Getting elements

Behind the call scraper.get_els("path#to>elements").unwrap() hides the selection of elements using the scraper::Selector structure and the conversion of the result into a Vec<scraper::ElementRef>.

// scr::scraping
// ...
impl Scraper {
    // ...
    /// Gets all matching elements
    pub fn get_els(&self, sel: &str) -> Result<Vec<ElementRef>, SelectorErrorKind> {
        let elements = self.document
            .select(&Selector::parse(sel).expect("Can not parse"))
            .collect::<Vec<ElementRef>>();

        Ok(elements)
    }

    /// Gets the first matching element
    pub fn get_el(&self, sel: &str) -> Result<ElementRef, SelectorErrorKind> {
        let element = *self.get_els(sel)
            .unwrap()
            .get(0)
            .unwrap();

        Ok(element)
    }
    // ...
}
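
A usage sketch (the selector is the floatrates one from the introduction; scraper is assumed to be an already-initialized scr::Scraper):

// a usage sketch; `scraper` is an already-initialized scr::Scraper
let links = scraper.get_els("div.bk-json-feeds>div.body>ul>li>a").unwrap(); // Vec<ElementRef>
let first = scraper.get_el("div.bk-json-feeds>div.body>ul>li>a").unwrap();  // the first ElementRef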

Getting the text (inner_html) and attributes of elements

You can get the text or an attribute either of a single element or of all the elements selected with scraper.get_els("path#to>elements").unwrap():

// scr::scraping
// ...
impl Scraper {
    // ...
    /// Gets the text of an element
    /// ...
    pub fn get_text_once(&self, sel: &str) -> Result<String, SelectorErrorKind> {
        let text = self.get_el(sel)
            .unwrap()
            .inner_html();

        Ok(text)
    }

    /// Gets the text of all elements
    /// ...
    pub fn get_all_text(&self, sel: &str) -> Result<Vec<String>, SelectorErrorKind> {
        let text = self.get_els(sel)
            .unwrap()
            .iter()
            .map(|element| element.inner_html())
            .collect();

        Ok(text)
    }

    /// Gets an attribute of an element
    /// ...
    pub fn get_attr_once(&self, sel: &str, attr: &str) -> Result<&str, SelectorErrorKind> {
        let attr = self.get_el(sel)
            .unwrap()
            .value()
            .attr(attr)
            .expect("Can not do");

        Ok(attr)
    }

    /// Gets an attribute of all elements
    /// ...
    pub fn get_all_attr(&self, sel: &str, attr: &str) -> Result<Vec<&str>, SelectorErrorKind> {
        let attrs = self.get_els(sel)
            .unwrap()
            .iter()
            .map(|element| element.value().attr(attr).expect("Can not do"))
            .collect();

        Ok(attrs)
    }
}
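
A usage sketch of the four methods (again assuming an initialized scraper; the "li>a" selector and the href attribute are just illustrative examples):

// a usage sketch; `scraper` is an already-initialized scr::Scraper
let first_title = scraper.get_text_once("li>a").unwrap();         // String (inner_html of the first match)
let all_titles  = scraper.get_all_text("li>a").unwrap();          // Vec<String>
let first_href  = scraper.get_attr_once("li>a", "href").unwrap(); // &str
let all_hrefs   = scraper.get_all_attr("li>a", "href").unwrap();  // Vec<&str>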

Downloading files (the FileLoader structure)

This is a simple file downloader: a single call is enough. The structure's source code can be found in the repository.

Known problem

Unfortunately, it is not yet possible to return an instance of the scr::FileLoader structure (error E0515).
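
For context, error E0515 means that a function tries to return a reference to (or a value borrowing from) a local variable, which is dropped when the function returns. A minimal illustration, not related to the scr internals:

// a minimal illustration of E0515 (intentionally does not compile)
fn get_dangling_reference() -> &'static i32 {
    let x = 0;
    &x // error[E0515]: cannot return reference to local variable `x`
}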

Plans for version 2.0.0

For now, scr works only synchronously, which makes the crate rather slow at times. Therefore, after some time I will implement an async version (most likely as a separate feature or module).

That’s all for now. I really hope I have piqued your interest. If you like, you can contribute to the library by forking the repository, making some changes, and sending them to me via a pull request.
