We import and replace numpy, pandas, scipy and sklearn

Spoiler: this is a teaser. Of course, this is not about import substitution in any serious sense. We are simply riding the wave of DA/DS popularity, having fun, and trying to put the strengths of C++ to good use along the way. No cringe, no fear.

What are we talking about?

This post is about C++ analogues of the numpy, pandas, scipy and sklearn libraries (np, pd, scipy and sklearn, respectively). These projects were originally conceived as a nice addition to a portfolio, but the work gradually pulled me in deeper and deeper, the challenges grew more serious, and the effort turned into several separate projects totalling tens of thousands of lines of code.

Backstory

It has been six years since my last article on Habr. Over that time it became clear in which direction an experienced programmer should grow next: it would be interesting to tie my career to DA/DS, or more precisely, to try building tools for data analysts and data scientists.

I wanted not so much to try myself as a data scientist, but above all to feel the inner workings of multidimensional arrays: to implement slices, data frames, statistical methods and machine learning methods myself, experiencing firsthand the beauty of the endless challenges facing library developers: the struggle for usability and functionality. Then performance, usability and functionality again. And again. A loop that runs "until you get bored". And all of it in pure C++.

Objectives

You may ask: what is the point of all this? I answer:

  • have fun;

  • benefit users and the community;

  • and in general, a samurai has no goal, only a path.

Principles

  • Use the original APIs of the respective libraries: millions of users are already used to them (see the sketch after this list).

  • Do not look at the original implementations of numpy, pandas, etc., and do not borrow ideas from them. Everything from a clean slate; we are samurai.

  • Do not use third-party libraries unless absolutely necessary: we reinvent every wheel by hand (or by foot).

  • Performance, performance, performance. We C++ samurai have every opportunity to obfuscate the code in the name of speed, and we should take advantage of it.
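
A quick illustration of the first principle: Python keyword arguments have no direct C++ counterpart, but they can be mimicked with a parameter struct and C++20 designated initializers, which is how the sklearn example later in this post calls train_test_split. Here is a minimal sketch; all types and names are illustrative stand-ins, not the library's actual declarations.

#include <tuple>

// Sketch: mirroring a Python keyword-argument API in C++20 with a parameter
// struct and designated initializers. All names here are illustrative.
struct DataFrame {};  // stand-in for a real pd::DataFrame

struct TrainTestSplitParams {
    DataFrame X;
    DataFrame y;
    double test_size = 0.25;  // same default as the Python API
    long random_state = 0;
};

std::tuple<DataFrame, DataFrame, DataFrame, DataFrame>
train_test_split(const TrainTestSplitParams &params) {
    // ... split params.X / params.y according to params.test_size ...
    return {};
}

int main() {
    DataFrame X, y;
    // Reads almost like Python's
    // train_test_split(X, y, test_size=0.2, random_state=42):
    auto [X_train, X_test, y_train, y_test] =
            train_test_split({.X = X, .y = y, .test_size = 0.2, .random_state = 42});
}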

How and what to implement?

So, we already have the API. What about the implementation? Very simple:

  1. We open any data scientist's notebook. For example: https://github.com/adityakumar529/Coursera_Capstone/blob/master/KNN.ipynb

  2. We go through the cells one by one, top to bottom, and try to implement each corresponding feature in the top-level sklearn library.

  3. Implemented it? Well done. Now it turns out that something is missing in scipy.

  4. Now in pandas.

  5. Now in numpy.

  6. if (your_grade <= junior_developer) {

    goto 2;

    } else {

    // Think bigger: what if we implement the most popular features all at once?

    goto 7;

    }

  7. We implement a lot at once.

  8. goto 2 while not tired, otherwise break.

You guys are great!

What has already been done?

  • Numpy

  • Pandas

  • Scipy

  • Sklearn

What are the plans?

  • Finish sklearn and the other libraries for good (at a minimum, a full implementation of data frames; at a maximum, coverage of the most popular data science scenarios)

  • A library similar to pytorch and/or tensorflow. Or something completely different, but DL-related.

  • Performance: hand-written optimizations for SSE, AVX2 and AVX512 (see the sketch below).
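
To make that last bullet concrete, here is a minimal sketch of what a hand-vectorized AVX2 kernel for the dist = rx * rx + ry * ry step from the benchmark below might look like. This is not the library's actual code: the function name and memory layout are my own assumptions. Compile with -mavx2 -mfma (or -march=native).

#include <cstddef>
#include <immintrin.h>

// Hypothetical AVX2 kernel: dist[i] = rx[i] * rx[i] + ry[i] * ry[i].
// Processes 4 doubles per iteration; a scalar loop handles the tail.
void dist_avx2(const double *rx, const double *ry, double *dist, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d x = _mm256_loadu_pd(rx + i);
        __m256d y = _mm256_loadu_pd(ry + i);
        // x * x + y * y via one multiply and one fused multiply-add
        __m256d d = _mm256_fmadd_pd(y, y, _mm256_mul_pd(x, x));
        _mm256_storeu_pd(dist + i, d);
    }
    for (; i < n; ++i)
        dist[i] = rx[i] * rx[i] + ry[i] * ry[i];
}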

And what about performance?

It’s not all rosy, but we’re working on it.

Consider the following example in Python (the reader has of course long since mastered the Monte Carlo method for estimating π: sample points uniformly in the unit square, and the fraction with x² + y² < 1 approximates π/4):

import numpy as np

size = 100000000

rx = np.random.rand(size)
ry = np.random.rand(size)

dist = rx * rx + ry * ry

inside = (dist[dist<1]).size

print("PI=", 4 * inside / size)

We measure:

$ time python monte_carlo.py
PI= 3.14185536

real 0m2.108s
user 0m2.055s
sys 0m0.525s

The same example in our implementation:

#include <iostream>

#include <np/Array.hpp>

int main(int, char **) {
    using namespace np;

    static constexpr Size size = 100000000;

    auto rx = random::rand(size);
    auto ry = random::rand(size);

    auto dist = rx * rx + ry * ry;

    auto inside = (dist["dist<1"]).size();

    std::cout << "PI=" << 4 * static_cast<double>(inside) / size;

    return 0;
}

Let's run it:

$ time ./monte_carlo

PI=3.14152

real	0m2.871s
user	0m12.610s
sys	    0m1.184s

Cheating? Cheating. We use OpenMP (hence the user time) plus compiler auto-vectorization for AVX2, which cannot be said of "vanilla" numpy. In short, there is still work to be done.
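
Conceptually, the parallel part boils down to something like this (an illustrative sketch, not the library's actual implementation): the loop is split across all cores, which is why the user time is several times the real time.

#include <cstddef>
#include <vector>

// Illustrative sketch: count the points falling inside the unit quarter-circle
// in parallel. Compile with -fopenmp; rx and ry hold uniform samples in [0, 1).
std::size_t count_inside(const std::vector<double> &rx, const std::vector<double> &ry) {
    std::size_t inside = 0;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(rx.size());
#pragma omp parallel for reduction(+ : inside)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const double d = rx[i] * rx[i] + ry[i] * ry[i];
        if (d < 1.0)
            ++inside;
    }
    return inside;
}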

Want more examples? I have them!

Example based on https://github.com/adityakumar529/Coursera_Capstone/blob/master/KNN.ipynb

#include <iostream>

#include <sklearn/metrics/f1_score.hpp>
#include <sklearn/model_selection/train_test_split.hpp>
#include <sklearn/neighbors/KNeighborsClassifier.hpp>
#include <sklearn/preprocessing/StandardScaler.hpp>
// pd::read_csv, confusion_matrix and accuracy_score below need headers as
// well; these paths are an assumption based on the layout above.
#include <pd/read_csv.hpp>
#include <sklearn/metrics/accuracy_score.hpp>
#include <sklearn/metrics/confusion_matrix.hpp>

int main(int, char **) {
    using namespace pd;
    using namespace sklearn::model_selection;
    using namespace sklearn::neighbors;
    using namespace sklearn::preprocessing;
    using namespace sklearn::metrics;

    auto data = read_csv("https://raw.githubusercontent.com/adityakumar529/Coursera_Capstone/master/diabetes.csv");
    const char *non_zero[] = {"Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"};
    for (const auto &column: non_zero) {
        data[column] = data[column].replace(0L, np::NaN);
        auto mean = data[column].mean(true);
        data[column] = data[column].replace(np::NaN, mean);
    }

    auto X = data.iloc(":", "0:8");
    auto y = data.iloc(":", "8");
    auto [X_train, X_test, y_train, y_test] = train_test_split({.X = X, .y = y, .test_size = 0.2, .random_state = 42});

    auto sc_X = StandardScaler{};
    X_train = sc_X.fit_transform(X_train);
    X_test = sc_X.transform(X_test);

    auto classifier = KNeighborsClassifier<pd::DataFrame>{{.n_neighbors = 13,
                                                           .p = 2,
                                                           .metric = sklearn::metrics::DistanceMetricType::kEuclidean}};
    classifier.fit(X_train, y_train);
    auto y_pred = classifier.predict(X_test);
    std::cout << "Prediction: " << y_pred << std::endl;
    auto cm = confusion_matrix({.y_true = y_test, .y_pred = y_pred});
    std::cout << cm << std::endl;
    std::cout << f1_score({.y_true = y_test, .y_pred = y_pred}) << std::endl;
    std::cout << accuracy_score(y_test, y_pred) << std::endl;

    return 0;
}

Example output:

$ ./neighbors_diabetes
Prediction: 	0
0	0
1	0
2	0
3	0
4	1
...
149	1
150	0
151	0
152	0
153	1
154 rows x 1 columns

[[85 15]
 [19 35]]
0.673077
0.779221

We will modestly say nothing about the performance.

More examples here: https://github.com/mgorshkov/sklearn/tree/main/samples

Conclusion

We have taken a brief tour of the np, pd, scipy and sklearn libraries.

Join the development!

For novice developers willing to contribute, these projects can be a good (and, most importantly, absolutely free) addition to your portfolio; for data scientists who want ever more features and customization, they may become a working tool.

If you are interested in contributing, get in touch. If you are a data scientist who wants to use the libraries, please also get in touch: we will tailor the project to your needs.

And one more thing: if you need to train or fine-tune a model on a GPU, contact me. I'll let you use my GPU in exchange for the opportunity to review your project.

Links

C++ numpy-like template-based array implementation: https://github.com/mgorshkov/np

Methods from pandas library on top of NP library: https://github.com/mgorshkov/pd

Scientific methods on top of NP library: https://github.com/mgorshkov/scipy

ML Methods from scikit-learn library: https://github.com/mgorshkov/sklearn
