In my free time, I develop the “Awakari” service, whose idea is to filter interesting events from an unlimited number of different sources. In this article, I will talk about ways to get publicly available information on the Internet beyond RSS feeds and Telegram channels.
The first and most obvious type of Awakari source is RSS feeds. The service relieves users of the need to visit every possible feed, channel and site again and again in search of what interests them. Instead, the service performs a reverse search and sends relevant messages (for example, to Telegram). However, a huge amount of useful information on the Internet remains inaccessible this way.
On the one hand, various services profit more when users consume content only within their own applications and platforms. This is one of the reasons why RSS support has been declining for the last 10-20 years. In addition, many services purposefully prevent automated metadata collection. None of this has been a secret for a long time.
On the other hand, there is the exact opposite direction:
Schema.org is a joint initiative to develop a unified scheme for semantic markup in HTML5. The initiative was launched on June 2, 2011 by the creators of the largest search engines, Google, Yahoo! and Microsoft, and on November 1, 2011, the Russian company Yandex joined it. The main goal of schema.org is to help web developers create quality metadata that improves search quality. Metadata on sites using the schemas described on schema.org can be analyzed by search engines, helping the latter to better “understand” the content of web resources. For example, with the help of micro-markup, webmasters can mark up a product page, indicating the price, the overall rating of the product and other data that Google will display in the search results.
There is nothing surprising in the spread of micro-markup.
Services will still provide machine-readable, structured data, if only to rank higher in Google search results. Detailed statistics on the use of different micro-markup options are also available.
According to these statistics, two options currently lead on the Internet: JSON-LD and Microdata. Both formats describe structures of a certain type. The most common types are:
How it looks in practice:
In this example, the extracted structure of type “Product” contains various attributes, some of which are themselves structures, such as “Offers”, which in turn contains attributes such as “price”. When the structure is converted to a message, all attribute names are lowercased and the names of nested attributes are concatenated with the names of their parents. This is due to the restrictions on CloudEvents attribute names. As a result, the product price attribute in Awakari is named “offersprice”.
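The naming rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not Awakari's actual code: the function name `flatten_attributes`, the sample product, and the choice to skip JSON-LD keywords such as `@type` are all hypothetical. The sketch relies on the fact that CloudEvents attribute names may contain only lowercase ASCII letters and digits.

```python
import re


def flatten_attributes(node, prefix=""):
    """Flatten a nested structure into lowercase, concatenated attribute names."""
    flat = {}
    for key, value in node.items():
        if key.startswith("@"):
            continue  # assumption: JSON-LD keywords (@type, @context) are skipped
        # Keep only characters allowed in CloudEvents attribute names: a-z and 0-9.
        name = prefix + re.sub(r"[^a-z0-9]", "", key.lower())
        if isinstance(value, dict):
            # Nested structure: its attribute names are appended to the parent name.
            flat.update(flatten_attributes(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical JSON-LD fragment of type "Product".
product = {
    "@type": "Product",
    "name": "Example phone",
    "offers": {"@type": "Offer", "price": "999.00", "priceCurrency": "USD"},
}
attrs = flatten_attributes(product)
print(attrs["offersprice"])  # 999.00
```

Here the nested “price” ends up under the name “offersprice”, and “priceCurrency” becomes “offerspricecurrency”.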
There are also other ways to mark up data semantically, for example:
The HTML5 “article” tag usually also contains a title, date, image, etc.
All of these source options are currently supported by the Awakari service. Supporting them also requires deciding how to convert the data into unique messages. Suppose a source is polled once an hour. The service has to tell what has already been sent for processing from what is new. To do this, each extracted structure (a JSON-LD piece, an item in an RSS feed, or the content of an “article” tag) must be unambiguously converted into a message. After that, a hash of the message’s significant attributes is computed and used to decide whether the message has been seen before.
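The deduplication step described above might look roughly like the following sketch. It is a minimal illustration under assumptions: which attributes count as “significant” (here `title`, `summary` and `url`), the use of SHA-256, and the in-memory `seen` set are all hypothetical choices, not a description of how Awakari actually stores its state.

```python
import hashlib
import json


def message_digest(message, significant=("title", "summary", "url")):
    """Hash only the significant attributes so volatile ones do not matter."""
    subset = {k: message[k] for k in significant if k in message}
    # Canonical serialization: sorted keys make the hash independent of key order.
    canonical = json.dumps(subset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_new(messages, seen):
    """Return the messages whose digest is not yet in `seen`, and remember them."""
    fresh = []
    for message in messages:
        digest = message_digest(message)
        if digest not in seen:
            seen.add(digest)
            fresh.append(message)
    return fresh


seen = set()
first_poll = [{"title": "A", "url": "u1"}, {"title": "B", "url": "u2"}]
# The second poll repeats "A" with a changed insignificant attribute and adds "C".
second_poll = [{"title": "A", "url": "u1", "fetched": "later"}, {"title": "C", "url": "u3"}]

print(len(filter_new(first_poll, seen)))                    # 2
print([m["title"] for m in filter_new(second_poll, seen)])  # ['C']
```

Hashing only selected attributes means that a structure whose timestamp or view counter changed between polls is still recognized as already sent.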
Thus, many more sources can now be turned into a “news feed”, such as blogs, stores, message boards and more.
How it works
To take advantage of the new features in Awakari, you can create a subscription with a condition on, for example, the product price:
For simplicity, this example omits the product name and the currency. In real cases it is better to use a group of conditions that also includes the required product name, for example, iPhone. With such a subscription you will receive everything from online stores that is priced below 1000, in whatever units. But you can always refine the subscription conditions based on the attributes of the received messages:
WebSub is a protocol, not a transmission format. It was originally intended to improve the RSS/Atom experience. The point is that the client does not need to download the feed content periodically: it is enough to give the hub a webhook address, and updates are delivered there when they become available. This minimizes the delay in receiving news on the one hand and eliminates “idle” requests on the other. Awakari now also automatically detects whether a source supports the WebSub protocol and subscribes to its updates. An example of such a source:
The screenshot above also has graphs showing the frequency of messages received from a given source for each minute of the week. In the future, I plan to use these statistics to automatically adapt the update period for the sources that are updated by polling (i.e. those that do not support WebSub). In addition, there is an idea to integrate with ActivityPub, which will further expand the set of supported data sources.
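To make the WebSub part more concrete, here is a rough sketch of the two steps a subscriber performs: discovering the hub and preparing the subscription request. It assumes the source advertises its hub and topic in an HTTP `Link` header with `rel="hub"` and `rel="self"`, and that `rel` is the first parameter after each URL. The function names and all example URLs are hypothetical, and nothing is actually sent over the network.

```python
import re
from urllib.parse import urlencode


def discover_websub(link_header):
    """Return (hub_url, topic_url) found in an HTTP Link header, or None."""
    rels = {}
    # Simplified parsing: assumes rel is the first parameter after the URL.
    for url, rel in re.findall(r'<([^>]+)>\s*;\s*rel="?([^";,]+)"?', link_header):
        for name in rel.split():
            rels.setdefault(name.lower(), url)
    if "hub" in rels and "self" in rels:
        return rels["hub"], rels["self"]
    return None  # no WebSub support advertised: fall back to polling


def subscription_form(topic_url, callback_url):
    """Build the form-encoded body that a subscriber POSTs to the hub."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
    })


# Hypothetical header returned by a source that supports WebSub.
header = '<https://hub.example.com/>; rel="hub", <https://blog.example.com/feed>; rel="self"'
hub, topic = discover_websub(header)
body = subscription_form(topic, "https://reader.example.com/callback")
print(hub)
print(body)
```

The body would be sent to the hub as `application/x-www-form-urlencoded`. The hub then verifies the subscription by calling the callback address before it starts delivering updates there.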
If you are looking for events in Mykolaiv: https://city-afisha.com/afisha/