Parse anything using the Rust language (well… or just download the file)
Rust is known to be a type-safe programming language whose code is checked by the compiler (driven by the package manager cargo) before it is built. On the one hand, this is good: there is less chance of a sudden failure at the most inopportune moment. On the other hand, code that performs a single task can end up being very complex and sometimes unreadable.
Here is an example of a program that parses the currency names from the floatrates site and prints them (please write in the comments how many minutes it took you to understand this chain of calls):
// main.rs
use reqwest::blocking::get;
use scraper::{Html, Selector};

fn main() {
    let url = "http://www.floatrates.com/json-feeds.html";
    let response = get(url).expect("The page does not load.");
    let body = response.text().unwrap();
    let document = Html::parse_document(&body);
    document
        .select(
            &Selector::parse("div.bk-json-feeds>div.body>ul>li>a")
                .unwrap()
        )
        .map(|element| element.inner_html())
        .collect::<Vec<String>>()
        .iter()
        .for_each(|title| println!("{title}"));
}
In this article, I want to talk about my new library that greatly simplifies parsing in Rust. Happy reading!
Disclaimer
Please don't be too harsh about mistakes in terminology, comments, or code syntax: I've only been programming in Rust for two months.
However, I do have programming experience (a year of Python), so I understand perfectly well what I'm writing about here.
I would be very grateful if you wrote in the comments what needs to be fixed in the crate)
Thanks to reloginn for help in developing version 1.0.
Initializing the Scraper structure
use reqwest::{blocking, Error};
use scraper::{ElementRef, Html, Selector, error::SelectorErrorKind};

/// Create a new instance
fn instance(url: &str) -> Result<Scraper, Error> {
    let response = blocking::get(url)
        .expect("The site does not load")
        .text()
        .expect("The site's source code is not returned");
    Ok(Scraper { document: Html::parse_document(&response) })
}

/// A simple parser
/// // ...
pub struct Scraper {
    document: Html,
}
Instead of separately sending a request, extracting the page source from the result, and initializing a scraper::Html object, you can simply call let scraper = scr::Scraper::new("scrapeme.live/shop/").unwrap() (and yes, you do not need to type "https://" or "http://"). The structure keeps a scraper::Html inside itself for further parsing. Here's what's going on under the hood of the initializer:
// scr::scraping
// ...
impl Scraper {
    /// create a new parser instance,
    /// using the site address **(without https://)**
    pub fn new(url: &str) -> Result<Scraper, Error> {
        instance(format!("https://{url}").as_str())
    }

    /// create a new parser instance,
    /// using the site address **(without http://)**
    pub fn from_http(url: &str) -> Result<Scraper, Error> {
        instance(format!("http://{url}").as_str())
    }
    // ...
}
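For illustration, here is a minimal usage sketch (assuming the scr crate is listed as a dependency in Cargo.toml; the shop address is the demo site mentioned above, the floatrates address comes from the introduction):
// a minimal usage sketch, not taken from the crate's docs
use scr::Scraper;

fn main() {
    // "https://" is prepended automatically
    let _shop = Scraper::new("scrapeme.live/shop/").unwrap();
    // for plain-HTTP sites there is a separate constructor
    let _rates = Scraper::from_http("www.floatrates.com/json-feeds.html").unwrap();
}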
It is also possible to initialize the structure using an HTML page fragment:
// scr::scraping
// ...
impl Scraper {
    // ...
    /// create a new parser instance,
    /// using a **fragment** of the page's code
    pub fn from_fragment(fragment: &str) -> Result<Scraper, SelectorErrorKind> {
        Ok(Scraper { document: Html::parse_fragment(fragment) })
    }
    // ...
}
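A hedged sketch of how this could be used (the HTML fragment below is made up for illustration):
// parsing a hand-written HTML fragment instead of a live page
use scr::Scraper;

fn main() {
    let fragment = r#"<ul><li><a href="/usd">US Dollar</a></li></ul>"#;
    let _scraper = Scraper::from_fragment(fragment).unwrap();
    // the instance can then be queried exactly like one built from a URL
}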
Getting elements
The call scraper.get_els("path#to>elements").unwrap() hides the selection of elements via the scraper::Selector structure and the conversion of the result into a Vec<scraper::ElementRef>.
// scr::scraping
// ...
impl Scraper {
    // ...
    /// get elements
    pub fn get_els(&self, sel: &str) -> Result<Vec<ElementRef>, SelectorErrorKind> {
        let elements = self.document
            .select(&Selector::parse(sel).expect("Can not parse"))
            .collect::<Vec<ElementRef>>();
        Ok(elements)
    }

    /// get a single element
    pub fn get_el(&self, sel: &str) -> Result<ElementRef, SelectorErrorKind> {
        let element = *self.get_els(sel)
            .unwrap()
            .get(0)
            .unwrap();
        Ok(element)
    }
    // ...
}
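For example (the selector below is an assumption made purely for illustration):
use scr::Scraper;

fn main() {
    let scraper = Scraper::new("scrapeme.live/shop/").unwrap();
    // every element matching the selector
    let links = scraper.get_els("ul.products>li>a").unwrap();
    println!("found {} elements", links.len());
    // only the first matching element
    let first = scraper.get_el("ul.products>li>a").unwrap();
    println!("{}", first.inner_html());
}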
Getting the text (inner_html) and attributes of the element(s)
You can get the text or an attribute of either one element or several elements obtained with scraper.get_els("path#to>elements").unwrap():
// scr::scraping
// ...
impl Scraper {
    // ...
    /// get the text of an element
    /// ...
    pub fn get_text_once(&self, sel: &str) -> Result<String, SelectorErrorKind> {
        let text = self.get_el(sel)
            .unwrap()
            .inner_html();
        Ok(text)
    }

    /// get the text of all elements
    /// ...
    pub fn get_all_text(&self, sel: &str) -> Result<Vec<String>, SelectorErrorKind> {
        let text = self.get_els(sel)
            .unwrap()
            .iter()
            .map(|element| element.inner_html())
            .collect();
        Ok(text)
    }

    /// get an attribute of an element
    /// ...
    pub fn get_attr_once(&self, sel: &str, attr: &str) -> Result<&str, SelectorErrorKind> {
        let attr = self.get_el(sel)
            .unwrap()
            .value()
            .attr(attr)
            .expect("Can not do");
        Ok(attr)
    }

    /// get an attribute of all elements
    /// ...
    pub fn get_all_attr(&self, sel: &str, attr: &str) -> Result<Vec<&str>, SelectorErrorKind> {
        let attrs = self.get_els(sel)
            .unwrap()
            .iter()
            .map(|element| element.value().attr(attr).expect("Can not do"))
            .collect();
        Ok(attrs)
    }
}
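Putting the four methods together, a hedged usage sketch might look like this (again, the selectors are assumptions made for illustration):
use scr::Scraper;

fn main() {
    let scraper = Scraper::new("scrapeme.live/shop/").unwrap();
    // inner_html of the first matching element
    let title = scraper.get_text_once("ul.products>li>a>h2").unwrap();
    // inner_html of every matching element
    let titles = scraper.get_all_text("ul.products>li>a>h2").unwrap();
    // one attribute value, and the same attribute for every match
    let link = scraper.get_attr_once("ul.products>li>a", "href").unwrap();
    let links = scraper.get_all_attr("ul.products>li>a", "href").unwrap();
    println!("{title} -> {link} ({} titles, {} links)", titles.len(), links.len());
}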
Downloading files (the FileLoader structure)
This is a simple file downloader: a single call is enough. The structure's source code is available in the crate's repository.
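FileLoader's actual code is not reproduced here; purely as an idea of what a one-call blocking download involves, a hypothetical sketch (not the crate's API) built on reqwest and std::fs could look like this:
// NOT scr's code: a hypothetical sketch of a blocking file download
use std::{fs::File, io::Write};

fn download(url: &str, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // fetch the whole body into memory, then write it to disk
    let bytes = reqwest::blocking::get(url)?.bytes()?;
    File::create(path)?.write_all(&bytes)?;
    Ok(())
}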
Known problem
Unfortunately, it is not yet possible to return an instance of the scr::FileLoader structure (error E0515).
Plans for version 2.0.0
So far, scr works only synchronously, which makes the crate quite slow at times. So, after some time, I will implement async support (most likely as a separate feature or module).
That’s all for now. I really hope I have piqued your interest. If you like, you can contribute to the library by forking the repository, making your changes, and sending them to me via a pull request.