Studying and comparing Moscow Exchange market data. Our first Python parser

The Informational and Statistical Server of the Moscow Exchange (ISS) is a service that provides a variety of exchange information in real time, as well as trading results and statistical data.

The main capabilities of ISS:

  • Receiving streaming data on the progress of trading.

  • Viewing and exporting trading results.

  • Access to historical data on trading results, prices and other indicators.

  • Downloading lists of all instruments, trading modes and their groups.

  • Monitoring market information in various sections.

    Real-time data on trading progress and trading results are available only with a paid subscription.

The Moscow Exchange website has a dedicated section, “Software interface to ISS”, which contains the Developer’s Guide (v1.4), the Metadata Description and the Methods Description.

These documents are the right place to start studying ISS. Incidentally, the Rules for the Use of Exchange Information of the Moscow Exchange are clearly defined and well laid out in a presentation.

Requests to ISS are formed as URLs, with or without parameters; the list of all available URLs is given on the Methods Description page.

You can receive information in the following formats: XML, CSV, JSON, HTML.
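The format is selected by the file extension in the request URL. For example, the global reference endpoint used in the listing below can be requested in any of the four representations:

https://iss.moex.com/iss/index.xml
https://iss.moex.com/iss/index.json
https://iss.moex.com/iss/index.csv
https://iss.moex.com/iss/index.html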

Now let’s solve an example retrieval task. In the screenshot, the first block, “engines”, is marked with a red frame.

Let’s take a look at this page with the .xml extension:

In XML, HTML and JSON responses, each block of information is wrapped in a data tag and contains two sections: metadata (describing the type and length of the data fields) and the market information itself. We will put this information into tabular form: the metadata section will define the columns, and the market information will fill the rows of the table.
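Schematically, a response looks something like this (an abridged sketch based on the engines block; the exact columns, types and values vary by endpoint):

<document>
  <data id="engines">
    <metadata>
      <columns>
        <column name="id" type="int32" />
        <column name="name" type="string" />
        <column name="title" type="string" />
      </columns>
    </metadata>
    <rows>
      <row id="1" name="stock" title="..." />
    </rows>
  </data>
</document>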

Below I offer, for your consideration and use, a program listing that can extract information from almost any link on the ISS methods page.

The general logic of the program is as follows:

  • Loading the data.

  • Parsing the XML.

  • Building DataFrames (tables).

  • Saving to an SQLite database.

  • Saving (duplicating) to a CSV file with a .txt extension.

The program uses the following libraries:

os – working with the file system: creating directories, building file paths, etc.

sqlite3 – working with SQLite databases: connections, SQL queries, transactions, etc.

xml.etree.ElementTree (ET) – parsing XML into a tree of elements for convenient access from Python code.

pandas (pd) – working with tabular data (DataFrame), its processing and analysis; reading and writing CSV and other formats.

requests – sending HTTP requests and receiving responses from web resources; used for web scraping and working with APIs. Web scraping is obtaining data from the web by extracting it from pages.

import os  # module for working with the file system
import sqlite3  # module for working with SQLite databases
import xml.etree.ElementTree as ET  # module for parsing XML

import pandas as pd  # module for working with data
import requests  # module for HTTP requests

url = "https://iss.moex.com/iss/index.xml"  # url of the XML data source - ISS global reference data
# url = "https://iss.moex.com/iss/securities/SBER/indices.xml"  # url - list of indices that include the SBER security
# url = "https://iss.moex.com/iss/securities/IMOEX.xml"  # url - specification of the MOEX Russia Index instrument

response = requests.get(url)  # request the data at the url

root = ET.fromstring(response.content)  # parse the XML from the response

dfs = {}  # dictionary for storing the DataFrames

for data in root.findall("./data"):  # find all data elements
    id = data.attrib["id"]  # get the id attribute

    columns = [
        c.attrib["name"] for c in data.findall("./metadata/columns/column")
    ]  # list of columns
    df = pd.DataFrame(columns=columns)  # create an empty DataFrame with the given columns
    rows = data.findall("./rows/row")  # find the rows

    for row in rows:
        row_data = dict(
            zip(columns, [row.attrib.get(c) for c in columns])
        )  # dictionary with the row's data
        df = pd.concat(
            [df, pd.DataFrame([row_data])], ignore_index=True
        )  # append the row

    dfs[id] = df  # store the DataFrame in the dictionary

# next, save the data to the database and to CSV
if not os.path.exists("moexmetadata"):
    os.mkdir("moexmetadata")

conn = sqlite3.connect("moexmetadata.db")

for id, df in dfs.items():
    df.to_sql(id, conn, if_exists="replace", index=False)

    # if the table has an id column, index it; IF NOT EXISTS and a
    # per-table index name avoid clashes between tables
    if "id" in df.columns:
        cursor = conn.cursor()
        cursor.execute(f'CREATE INDEX IF NOT EXISTS "idx_{id}_id" ON "{id}" (id)')
        conn.commit()

    # duplicate the data to a file
    filename = f"{id}.txt"
    path = os.path.join("moexmetadata", filename)
    df.to_csv(path, sep="\t", index=False)

conn.close()

We launch the program; it finishes in a few seconds. As needed, we can comment out the first URL, uncomment the second, run it again, and so on.

A “moexmetadata” folder with numerous files has appeared in the program directory.

Here is an example of the contents.

We have also created a database with tables.

And if you dive in deeply, you could in theory establish logical relations between the tables, provided, of course, that you can first make sense of all this “chaos” of information. But do you really need all of it?

Let’s move on.

According to the Developer’s Guide, for some queries a special “cursor” data block is also provided at the end of the response.

For example, at http://iss.moex.com/iss/history/engines/stock/markets/shares/securities/MOEX it looks like this:
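An abridged, illustrative sketch of the cursor block (the actual numbers depend on when you run the query):

<data id="history.cursor">
  <rows>
    <row INDEX="0" TOTAL="3846" PAGESIZE="100" />
  </rows>
</data>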

Here INDEX is the start of the current data set within the output (a row number), set in the query by the start parameter; TOTAL is the total amount of data (number of rows) available for this request; PAGESIZE is the size of the current data set in the output (number of rows), set in the query by the limit parameter (the default is 100; possible values are 50, 20, 10, 5, 1).

That is, to download the full (obviously large) amount of data, you need to fetch sequentially:
http://iss.moex.com/iss/history/engines/stock/markets/shares/securities/MOEX
http://iss.moex.com/iss/history/engines/stock/markets/shares/securities/MOEX?start=100
http://iss.moex.com/iss/history/engines/stock/markets/shares/securities/MOEX?start=200
and so on, as long as (INDEX + PAGESIZE) < TOTAL.
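A minimal sketch of such cursor-driven paging (my own illustration rather than one of this article’s listings; it relies only on the history.cursor attributes described above):

import xml.etree.ElementTree as ET

import requests

base = "http://iss.moex.com/iss/history/engines/stock/markets/shares/securities/MOEX"
start = 0

while True:
    response = requests.get(f"{base}.xml?start={start}")
    root = ET.fromstring(response.content)

    # ... here, parse the data blocks of the current page as in the listing above ...

    cursor = root.find("./data[@id='history.cursor']/rows/row")  # the cursor block
    if cursor is None:  # no cursor block - nothing to page through
        break

    index = int(cursor.attrib["INDEX"])
    total = int(cursor.attrib["TOTAL"])
    pagesize = int(cursor.attrib["PAGESIZE"])

    if index + pagesize >= total:  # the last page has been fetched
        break
    start += pagesize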

By the way, you can download all the news items from the MOEX website via the page https://iss.moex.com/iss/sitenews/ – about 43 thousand of them; only here each step contains PAGESIZE="50" records.

This mechanism is perfectly clear and logical, and it is easy to implement.

But not everything is so simple. As it turns out, the very first URL on the ISS methods page, which returns the list of all securities traded on the Moscow Exchange, has no “cursor” block, and the page https://iss.moex.com/iss/securities is obviously far from the complete list.

It, too, must be downloaded sequentially:

https://iss.moex.com/iss/securities.xml?start=0

https://iss.moex.com/iss/securities.xml?start=100

https://iss.moex.com/iss/securities.xml?start=200 and so on. Only now we don’t have the TOTAL value…, because the “cursor” section is missing.

Having written the second version of the program, I noticed that the /iss/securities method description does list start and limit arguments. In my opinion, the “cursor” section should either be present wherever cyclic parsing of large volumes is intended, exactly as described in the manual, or the manual should cover these nuances and stress checking whether a method has start and limit arguments.

There are probably other similar Moscow Exchange addresses with large amounts of data and no TOTAL figure in a “cursor” block.

So, to avoid overcomplicating the task for this case, I decided simply to modify the program a little. We assume that we are only interested in the first data block (for such large pages without a “cursor” block it will be the only one) and loop until a page comes back empty.

Here is the modified program code:

import os
import sqlite3
import xml.etree.ElementTree as ET

import pandas as pd
import requests

start = 0  # starting value
step = 100  # paging step; must match the page size (100 for securities, 50 for sitenews)

if not os.path.exists("moexmetadata"):
    os.mkdir("moexmetadata")

conn = sqlite3.connect("moexmetadata.db")  # open the connection once, outside the loop

while True:
    url = f"https://iss.moex.com/iss/securities.xml?start={start}"
    # url = f"https://iss.moex.com/iss/sitenews.xml?start={start}"
    response = requests.get(url)
    root = ET.fromstring(response.content)
    rows = root.findall("./data/rows/row")
    if not rows:  # no rows - stop the loop
        break

    start += step
    print(start)
    dfs = {}  # dictionary for storing the DataFrames

    for data in root.findall("./data"):  # find all data elements
        id = data.attrib["id"]  # get the id attribute
        columns = [
            c.attrib["name"] for c in data.findall("./metadata/columns/column")
        ]  # list of columns
        df = pd.DataFrame(columns=columns)  # empty DataFrame with the given columns
        rows = data.findall("./rows/row")  # find the rows

        for row in rows:
            row_data = dict(
                zip(columns, [row.attrib.get(c) for c in columns])
            )  # dictionary with the row's data
            df = pd.concat(
                [df, pd.DataFrame([row_data])], ignore_index=True
            )  # append the row

        dfs[id] = df  # store the DataFrame in the dictionary

    # save the data to the database and duplicate it to CSV
    for id, df in dfs.items():
        df.to_sql(id, conn, if_exists="append", index=False)

        # if the table has an id column, index it; IF NOT EXISTS and a
        # per-table index name avoid clashes between tables
        if "id" in df.columns:
            cursor = conn.cursor()
            cursor.execute(f'CREATE INDEX IF NOT EXISTS "idx_{id}_id" ON "{id}" (id)')
            conn.commit()

        # duplicate the data to a file, writing the header only on the first write
        filename = f"{id}.txt"
        path = os.path.join("moexmetadata", filename)
        df.to_csv(path, sep="\t", index=False, mode="a", header=not os.path.exists(path))

conn.close()

We launch it; the program runs for 3 hours. The result is a substantial amount of data.

The “List of securities traded on the Moscow Stock Exchange” database consists of 595,000 records.

That’s all. The first hands-on experience has been gained: a general understanding of how MOEX ISS is organized, plus practice with parsing, GET requests, and writing to files and databases.

And now it is time for me to study the Moscow Exchange’s AlgoPack.
