Python code generation – speeding up strftime / strptime


Greetings! In the first and second parts, I shared the history of the Python library convtools (in short: it lets you declaratively describe data transformations, from which Python functions implementing those transformations are generated). Now I will talk about speeding up particular cases of datetime.strptime and datetime.strftime, and about the interesting things found in this module along the way.

strftime: datetime/date -> str

To begin with, let’s measure the baseline variant of date/datetime formatting:

from datetime import datetime

dt = datetime(2023, 8, 1)
assert dt.strftime("%b %Y") == "Aug 2023"

# In [2]: %timeit dt.strftime("%b %Y")
# 1.21 µs ± 4.02 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Looking at the code above and at the source of strftime, you can spot the following problems:

  • on every call, strftime parses the date format string from scratch (without reusing any intermediate work from previous calls)

  • the date is first converted to a timetuple, which time.strftime can then work with. But because a timetuple contains every date and time component, the interpreter does extra work to build it, filling in hours, minutes, seconds, the weekday and the day of the year, none of which interest us in this particular case.
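The second point can be observed from plain Python. The snippet below is only an illustration of the intermediate step, not the actual CPython internals: it builds the timetuple explicitly and formats it with time.strftime, which should give the same string as dt.strftime.

```python
import time
from datetime import datetime

dt = datetime(2023, 8, 1)

# struct_time carries nine fields; "%b %Y" only needs two of them.
tt = dt.timetuple()
assert len(tt) == 9
assert (tt.tm_year, tt.tm_mon) == (2023, 8)

# Formatting through the intermediate tuple matches dt.strftime.
via_tuple = time.strftime("%b %Y", tt)
assert via_tuple == dt.strftime("%b %Y")
```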

Now let’s check how fast the narrowest possible function implementing almost the same formatting could run (almost, because locale-dependent month names are ignored here):

from datetime import datetime

MONTH_NAMES = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def ad_hoc_func(dt):
  return f"{MONTH_NAMES[dt.month - 1]} {dt.year:04}"

dt = datetime(2023, 8, 1)
assert ad_hoc_func(dt) == "Aug 2023"

# In [11]: %timeit ad_hoc_func(dt)
# 258 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

That is a speedup of ~4.7x, and it sets the task for convtools: be able to dynamically generate highly specialized converters for a given date format.
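To make the idea of generating a specialized function concrete, here is a minimal sketch of the technique. It is not how convtools is implemented; make_formatter and EXPRESSIONS are hypothetical names, only four format codes are supported, and locale is ignored. The format string is parsed once, Python source is assembled from f-string fragments, and exec compiles it into a function.

```python
from datetime import datetime

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical mapping: format code -> body of an f-string replacement field.
EXPRESSIONS = {
    "Y": "dt.year:04",
    "m": "dt.month:02",
    "d": "dt.day:02",
    "b": "MONTH_NAMES[dt.month - 1]",
}


def make_formatter(fmt):
    """Generate a function specialized for one strftime-like format."""
    pieces, literal = [], []
    i = 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code == "%":
                literal.append("%")
            elif code in EXPRESSIONS:
                if literal:
                    # Flush pending literal text as a plain string literal.
                    pieces.append(repr("".join(literal)))
                    literal.clear()
                pieces.append('f"{' + EXPRESSIONS[code] + '}"')
            else:
                raise ValueError(f"unsupported format code: %{code}")
            i += 2
        else:
            literal.append(fmt[i])
            i += 1
    if literal:
        pieces.append(repr("".join(literal)))
    # Adjacent string literals are merged by the compiler into one f-string.
    body = " ".join(pieces) or "''"
    source = f"def formatter(dt):\n    return {body}\n"
    namespace = {"MONTH_NAMES": MONTH_NAMES}
    exec(source, namespace)
    return namespace["formatter"]


month_year = make_formatter("%b %Y")
assert month_year(datetime(2023, 8, 1)) == "Aug 2023"
```

All the parsing work happens once inside make_formatter; the returned function only evaluates a single f-string per call.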

Before we review the result, I will make some remarks:

  • dt.strftime("%Y-%m-%d") is suboptimal. It is better to use dt.date().isoformat() for datetime and dt.isoformat() for date (5.5x and 6.4x faster, respectively); this is taken into account in the implementation

  • dt.strftime("%Y"): the documentation makes no reservations about the year format (at least for Python 3.4+), but the CPython bug tracker does (#57514): zero padding is not applied on Linux with glibc (it is applied on macOS and on Linux with musl). dt.strftime("%4Y") fixes it on glibc, but breaks it on the other platforms; this is taken into account in the implementation

  • many format codes, such as %a, %b, %c, %p, depend on the system locale (for example: Sunday, Monday, …, Saturday for en_US and Sonntag, Montag, …, Samstag for de_DE). Only some of these codes are implemented; when an unsupported one is encountered, the built-in strftime is used.
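The first two remarks can be checked with the standard library alone. This is a small illustration rather than the convtools implementation: isoformat produces the same text as "%Y-%m-%d" for a four-digit year, and formatting the year with an explicit width avoids relying on the platform's handling of %Y.

```python
from datetime import date, datetime

dt = datetime(2023, 8, 1, 12, 30)
d = date(2023, 8, 1)

# isoformat shortcuts for the "%Y-%m-%d" format.
assert dt.date().isoformat() == "2023-08-01"
assert d.isoformat() == "2023-08-01"

# Explicit zero padding of the year, independent of the C library.
early = date(99, 1, 1)
padded_year = f"{early.year:04}"
assert padded_year == "0099"
assert early.isoformat() == "0099-01-01"
```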

from convtools import conversion as c

ad_hoc_func = c.format_dt("%b %Y").gen_converter()
assert ad_hoc_func(dt) == "Aug 2023"

# In [32]: %timeit ad_hoc_func(dt)
# 274 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

We lost a little along the way, but it is still 4.4 times faster than the baseline. To see what code was generated under the hood, just pass debug=True (the try/except wrapper dumps the generated code to a temporary directory on error, for readable tracebacks and normal debugging):

In [34]: c.format_dt("%b %Y").gen_converter(debug=True)
def converter(data_, *, __v=__naive_values__["__v"], __datetime=__naive_values__["__datetime"]):
    try:
        return f"{__v[data_.month - 1]} {data_.year:04}"
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise

Out[34]: <function _convtools.converter(data_, *, __v=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], __datetime=<class 'datetime.datetime'>)>

strptime: str -> datetime

Let’s repeat the steps above for datetime.strptime, cutting a few corners. Looking ahead, I will note that we will not optimize format codes that depend on the locale.

from datetime import datetime

assert datetime.strptime("12/31/2020 12:05:54 PM", "%m/%d/%Y %I:%M:%S %p") == datetime(2020, 12, 31, 12, 5, 54)

# In [37]: %timeit datetime.strptime("Aug 2023", "%b %Y")
# 2.93 µs ± 3.51 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Examining the sources, we find a cache of compiled regular expressions and a lock guarding access to it. In general everything is fine, but let’s still compare it with highly specialized code.

from datetime import datetime
from convtools import conversion as c

ad_hoc_func = c.datetime_parse("%m/%d/%Y %I:%M:%S %p").gen_converter()
assert ad_hoc_func("12/31/2020 12:05:54 PM") == datetime(2020, 12, 31, 12, 5, 54)

# In [44]: %timeit ad_hoc_func("12/31/2020 12:05:54 PM")
# 1.29 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

That is a 2.3x speedup, which is noticeable. Let’s look at the code generated under the hood:

In [46]: c.datetime_parse("%m/%d/%Y %I:%M:%S %p").gen_converter(debug=True)
def converter(data_, *, __v=__naive_values__["__v"], __datetime=__naive_values__["__datetime"]):
    try:
        match = __v.match(data_)
        if not match:
            raise ValueError("time data %r does not match format %r" % (data_, """%m/%d/%Y %I:%M:%S %p"""))
        if len(data_) != match.end():
            raise ValueError("unconverted data remains: %s" % data_string[match.end() :])
        groups_ = match.groups()
        i_hour = int(groups_[3])
        ampm_h_delay = 12 if groups_[6].lower() == """pm""" else 0
        return __datetime(int(groups_[2]), int(groups_[0]), int(groups_[1]), i_hour % 12 + ampm_h_delay, int(groups_[4]), int(groups_[5]), 0)
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise

Out[46]: <function _convtools.converter(data_, *, __v=re.compile('(1[0-2]|0[1-9]|[1-9])/(3[0-1]|[1-2]\\d|0[1-9]|[1-9]| [1-9])/(\\d{4})\\ (1[0-2]|0[1-9]|[1-9]):([0-5]\\d|\\d):(6[0-1]|[0-5]\\d|\\d)\\ (am|pm)', re.IGNORECASE), __datetime=<class 'datetime.datetime'>)>
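The same approach can be written by hand for a single fixed format. The sketch below is a simplified, hypothetical equivalent (parse_iso_like is not part of convtools, and the format "%Y-%m-%d %H:%M:%S" is chosen only for brevity): the regular expression is compiled once, and each call does one match plus a few int conversions.

```python
import re
from datetime import datetime

# Compiled once at import time, so calls skip format parsing entirely.
_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})")


def parse_iso_like(value):
    """Parse strings shaped like '%Y-%m-%d %H:%M:%S' and nothing else."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(
            f"time data {value!r} does not match format '%Y-%m-%d %H:%M:%S'"
        )
    year, month, day, hour, minute, second = map(int, match.groups())
    # datetime itself still validates ranges (e.g. month 13 raises ValueError).
    return datetime(year, month, day, hour, minute, second)


assert parse_iso_like("2020-12-31 12:05:54") == datetime(2020, 12, 31, 12, 5, 54)
```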

Price

Everything has a price; in the case of convtools, it is the time spent on code generation and compilation of the converters:

In [47]: %timeit c.format_dt("%b %Y").gen_converter()
54.8 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [48]: %timeit c.datetime_parse("%m/%d/%Y %I:%M:%S %p").gen_converter()
99.7 µs ± 67.3 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Therefore, before using this functionality, consider whether your case is one of the following:

  1. the date format is static (known when the code is written), so you can call gen_converter once somewhere at module level and keep reusing the result

  2. the date format is dynamic, but the generated converter will be used to handle, say, at least 1K (a thousand) dates.
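Both cases boil down to paying the generation cost once per format. Taking the strftime timings above at face value, generation costs 54.8 µs and each call saves about 1.21 - 0.274 ≈ 0.94 µs, so the break-even point is somewhere around 60 calls on that machine; the numbers will differ elsewhere. The sketch below shows one possible way to arrange the reuse. build_formatter is a hypothetical stand-in for an expensive factory such as c.format_dt(fmt).gen_converter(), kept free of convtools so the snippet runs on its own.

```python
from datetime import datetime
from functools import lru_cache

# Break-even estimate from the timings quoted in this article (microseconds).
GEN_COST_US = 54.8
SAVING_PER_CALL_US = 1.21 - 0.274
break_even_calls = GEN_COST_US / SAVING_PER_CALL_US


def build_formatter(fmt):
    # Hypothetical stand-in for an expensive converter factory;
    # here it simply wraps the built-in strftime.
    def formatter(dt):
        return dt.strftime(fmt)
    return formatter


# Case 1: static format, built once at import time.
FORMAT_MONTH_YEAR = build_formatter("%b %Y")


# Case 2: dynamic formats, each distinct format is built once and reused.
@lru_cache(maxsize=128)
def get_formatter(fmt):
    return build_formatter(fmt)


first = get_formatter("%Y-%m")
second = get_formatter("%Y-%m")
assert first is second  # the second lookup is served from the cache
```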

Conclusion

The functionality described above is far from the only thing the convtools library offers. You can read more about it in the library’s documentation.

I would be grateful for feedback and ideas in the discussions on GitHub.
