Oil through the eyes of a data analyst

Do you love working with data as much as I do? Do you also suffer from imperfect data analysis tools? Right here, in the very first lines, I realize I have to be careful not to hurt the feelings of, say, Excel or Notepad++ fans. Let me make amends: Excel is great, the best low-code system! Notepad++ is incomparable… but try working with these tools on a million rows of data… sorry, I couldn’t resist. And since the article already promises to be provocative, I will add fuel to the fire: I want to demonstrate how to process and analyze the data on all anonymized oil trades. Through an analyst’s eyes, I will show how big capital controls the market. And yes, I want this capability to be available not only to corporations with their resources, but also to an ordinary user with a laptop. The text of the article is not an investment recommendation; all coincidences are accidental.

A brief introduction.

The article was written, and the data collected, in the first half of January 2024. The subject of analysis is therefore the anonymized trades in the oil futures contract (BRG4) over the period from the start of the evening session on January 12, 2024 to the end of the daytime session on January 17, 2024.

Content

1) Method of obtaining and preparing data

2) Data storage and loading

3) Data analysis

Method of obtaining data.

I will describe how to get the data for free, but you must have an active Quik trading terminal. Yes, the Moscow Exchange has a data service, but it shares data very sparingly. Yes, I liked Tinkoff’s idea of exposing everything through an API (I love Swagger), but, again, that is difficult for an ordinary user. So we open the Quik terminal, preferably closer to the end of the daytime session (18:45 Moscow time). The time is not chosen at random: through the terminal you can only get data for a single day (more precisely, for the evening session of the previous day and for the current daytime session).

Select the menu -> Create window -> Table of anonymized trades
In the window that opens, set the parameters as shown in the picture

In the anonymized trades table, right-click and select “Copy all” as shown in the picture

Next comes the first test of your love for Notepad++ 🙂 Paste the contents of the clipboard into Notepad++. I’ll be honest: I didn’t hold out, I cheated and used VSCode. Either way, when pasted into a text file, the data looks something like this:

Next, a series of substitutions must be made in the text file to reduce its size and turn it into a proper csv file. I usually make the following replacements (a small script doing the same thing is sketched after the list):

  • all commas to semicolons
  • all tabs to commas
  • the word “Buy” to the symbol ‘B’
  • the word “Sale” to the symbol ‘S’
  • and so on
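
For those who prefer not to click through find-and-replace dialogs, here is a rough sketch of the same substitutions in Python. The input file name is my own placeholder for the saved clipboard dump, and the exact words to replace depend on the terminal’s language settings:

# Read the raw dump saved from Quik and perform the replacements listed above.
# File names are placeholders, adjust to your own.
with open("raw_trades.txt", encoding="utf-8") as src:
	text = src.read()

text = text.replace(",", ";")     # commas -> semicolons
text = text.replace("\t", ",")    # tabs -> commas
text = text.replace("Buy", "B")   # the word "Buy" -> symbol 'B'
text = text.replace("Sale", "S")  # the word "Sale" -> symbol 'S'

with open("br_in_16-17.csv", "w", encoding="utf-8") as dst:
	dst.write(text)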

Everything should look something like this:

It is worth noting here that both the list of instruments and the set of columns can be anything you like.

Data storage and loading

In fact, this is already a question of convenience and preference. The data can simply be saved as csv files. My way is more complicated, but I will describe it anyway: I personally prefer to use a DBMS, the beloved ClickHouse from Yandex. With this approach I get flexibility; for example, I can run the following SQL query:

SELECT ID_DTTM, OPERATION, PRICE, QUANTITY, OI
FROM moex.br_trades bt
WHERE bt.ID_DTTM > toDateTime('2024-01-16 08:50:00')
  AND bt.ID_DTTM < toDateTime('2024-01-17 00:00:00')
ORDER BY bt.ID_DTTM

… and immediately get the data set for the period from 08:50 on 16.01.24 to 00:00 on 17.01.24, sorted by date and time. With data stored as csv files, that kind of flexibility is out of reach.

How to install ClickHouse is better asked of Yandex 🙂 The official documentation describes a rather involved path; however, I have come across articles where everything is explained much more simply.

How to load the data from CSV into ClickHouse? I use DBeaver and a simple Python script like this:

A simple script

INSERT_SQL = "INSERT INTO moex.br_trades (ID_DTTM, TRADENO, SECID, PRICE, QUANTITY, OPERATION, OI) VALUES "

# Read the prepared csv file line by line
with open("./br_in_16-17.csv") as file:
	content = file.readlines()

for line in content:
	# Column order follows the csv prepared earlier
	trade = line.strip().split(",")
	tradeno = trade[1]
	id_dttm = trade[2] + " " + trade[3]  # date and time come in separate columns
	secid = trade[4]
	price = trade[5]
	quantity = trade[6]
	operation = trade[7]
	oi = trade[8]

	# Assemble a single-row INSERT statement
	sql = (
		INSERT_SQL
		+ "(toDateTime('"
		+ id_dttm
		+ "'),'"
		+ tradeno
		+ "','"
		+ secid
		+ "',"
		+ price
		+ ","
		+ quantity
		+ ",'"
		+ operation
		+ "',"
		+ oi
		+ ");"
	)

	# Print the statement; the output can then be executed, for example, in DBeaver
	print(sql)

At about this point the critics will appear who will point out that data should be loaded into an analytical DBMS in batches, not row by row. In practice, DBeaver dies somewhere around a batch of 100,000 rows or more, while the row-by-row variant is slow but gets the job done.
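
For completeness, here is a minimal sketch of the batched alternative using the clickhouse-driver package. The connection parameters and the date format are my assumptions, not something taken from my actual setup; the column parsing mirrors the script above:

# Batched insert via clickhouse-driver (pip install clickhouse-driver).
# Host and date format below are placeholders/assumptions.
from datetime import datetime
from clickhouse_driver import Client

client = Client(host="localhost")

rows = []
with open("./br_in_16-17.csv") as file:
	for line in file:
		trade = line.strip().split(",")
		rows.append((
			datetime.strptime(trade[2] + " " + trade[3], "%d.%m.%Y %H:%M:%S"),
			trade[1],         # TRADENO
			trade[4],         # SECID
			float(trade[5]),  # PRICE
			int(trade[6]),    # QUANTITY
			trade[7],         # OPERATION
			int(trade[8]),    # OI
		))

# One INSERT for the whole batch instead of one statement per row
client.execute(
	"INSERT INTO moex.br_trades (ID_DTTM, TRADENO, SECID, PRICE, QUANTITY, OPERATION, OI) VALUES",
	rows,
)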

Data analysis

Here, too, the path to the result ran through a swamp of disappointment. I rushed to draw graphs in Apache Superset, but it immediately “left the chat”, freezing for a long time while rendering a modest dataset of 300 thousand rows. Yes, I understand: it comes down to a single worker and this BI system’s bad habit of wrapping the original SQL query into a dataset of its own. Bottom line: hello, Jupyter Notebook.

I recommend installing it through Anaconda.

First of all, we import the required packages, load the data into the df dataframe (either from a csv file or from ClickHouse), and see what we got.
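
Since the notebook itself is in the repository, here is only a rough sketch of this loading step; the csv column names and connection parameters below are my assumptions, not the exact code from the notebook:

import pandas as pd

# Option 1: read the prepared csv (column names are assumed from the insert script)
df = pd.read_csv(
	"./br_in_16-17.csv",
	header=None,
	names=["N", "TRADENO", "DATE", "TIME", "SECID", "PRICE", "QUANTITY", "OPERATION", "OI"],
)

# Option 2: pull the same columns from ClickHouse (requires clickhouse-driver)
# from clickhouse_driver import Client
# client = Client(host="localhost")
# data = client.execute(
#     "SELECT ID_DTTM, OPERATION, PRICE, QUANTITY, OI FROM moex.br_trades ORDER BY ID_DTTM"
# )
# df = pd.DataFrame(data, columns=["ID_DTTM", "OPERATION", "PRICE", "QUANTITY", "OI"])

df.head()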

Here, for understanding, I will describe the values in the columns:

  • ID_DTTM – date and time of the trade
  • OPERATION – “B”: buy, “S”: sell
  • PRICE – the trade price
  • QUANTITY – quantity (lots)
  • OI – open interest

Some statistics

Note the minimum and maximum values of the Quantity column; we will need them later.
Now let’s move on to my favorite procedure – drawing graphs. Let’s start with something simple and look at the price dynamics over the selected period.
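
For reference, a minimal plotting sketch (matplotlib here, though the notebook may use a different library); the column names follow the table above:

import matplotlib.pyplot as plt

# Price dynamics over the whole loaded period
plt.figure(figsize=(12, 4))
plt.plot(df["ID_DTTM"], df["PRICE"])
plt.xlabel("ID_DTTM")
plt.ylabel("PRICE")
plt.title("BRG4 price")
plt.show()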

Excellent, but modest so far: any trading terminal can do this. Now let’s draw something that terminals can’t (or won’t). For example, I have long been interested in the relationship between buying/selling and open interest. Let’s add a new column to the dataframe and call it delta. The essence of delta is the difference between buys and sells accumulated from the beginning of the sample. For example, if since the beginning of the period we have sold 18 lots, then bought 1, and then sold 5 more, then delta = −18 + 1 − 5 = −22 lots. A sketch of this calculation follows below.
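
A minimal sketch of the delta calculation in pandas, assuming the df described above; the signed_qty column is my own helper name, not something from the notebook:

# Signed lot count: positive for buys, negative for sells
df["signed_qty"] = df["QUANTITY"].where(df["OPERATION"] == "B", -df["QUANTITY"])

# Cumulative difference between buys and sells from the start of the sample
df["delta"] = df["signed_qty"].cumsum()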
And now let’s look at an example of a graph

And here are some interesting conclusions:

  1. Although on January 16 the price of oil barely changed (if you compare the beginning of the day with the end), by the end of the day market participants had, on balance, bought about 7,500 lots. Makes you wonder why, doesn’t it?
  2. One could, of course, compute the exact correlation with open interest, but the inverse relationship between delta and open interest is clearly visible: when the market buys, open interest falls, and vice versa.

Now let’s split our delta into components. After all, our sample of anonymized deals includes deals from all market participants, regardless of size, so let’s divide them by deal size. Remember I asked you to note the minimum and maximum number of lots per deal? In this case it ranges from 1 lot to 500. Let’s take deals larger than 100 lots and calculate their delta separately (a sketch follows below). Does this threshold fit the definition of a large deal? I think so: the buyer or seller of 1 lot of the BRG4 futures must post collateral of about 14,000 rubles, so at 100 lots we are already talking about 1.4 million rubles of collateral. Once again: 1.4 million rubles is the collateral for a single deal. So yes, I think it qualifies.
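
A sketch of the split, reusing the signed_qty helper from the previous snippet; the 100-lot threshold is the one discussed above:

# Cumulative delta of large deals (more than 100 lots) and of all the rest
large = df["QUANTITY"] > 100
df["delta_large"] = df["signed_qty"].where(large, 0).cumsum()
df["delta_small"] = df["signed_qty"].where(~large, 0).cumsum()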

Here’s what happened.

The orange curve shows small deals, the red one large deals. And here we find ourselves on the thin ice of interpreting the results. Imaginary “believers in a market-maker conspiracy against speculators” may say: “well, everything is obvious, when the price started to fall they supported the market, and then drove the crowd upwards.” If we rely on the numbers instead, we can see that by the end of the day about 2,500 lots were bought through large deals and about 5,000 lots through small ones.

Personally, though, I would like to get a clear indication of local extrema. To put it simply: in a situation where the price has fallen, I would like to understand that “this is the bottom and we will grow from here”, and when the price is rising, that “this is the top and we will soon start to fall.” Honestly, I haven’t managed to achieve that yet. But the purpose of this article was different, wasn’t it?

Epilogue.

Link to the Git repository with the csv dump files and the Jupyter notebook file:

https://github.com/BazDen/BR_trades/

In this article, I tried to demonstrate that an ordinary user can analyze a large volume of data. And not only futures market instruments can be analyzed this way, but also, for example, trades in shares. Is it possible to build a successful strategy on top of this analysis? I’m sure it is, and even to test it. But I think that is material for the next article and the next sleepless night.
