Text cleanup with Python. Part 1

Let's take the simplest situation: you have saved some data from a website, with full names, phone numbers, emails and usernames. However, users are not particularly fond of following the rules when filling in fields. As a result, the full name field sometimes contains digits and assorted symbols, which complicates any later search over that data. Likewise, phone numbers can be written in many different ways, so they need to be brought to a common format. The logical conclusion is that the data must be cleaned. That is what we will deal with in this article.

For a long time I ignored the built-in methods for character filtering and used a simple replace(). With that approach, however, it is practically impossible to account for every symbol that needs replacing, since there may be more than a hundred of them. Python strings already have a built-in tool that lets us keep only the letters and drop everything else: isalpha(). It returns True if the string is non-empty and every character in it is alphabetic, and False otherwise, so it can be applied character by character as a filter. In the same way, isdecimal() lets you keep only the digits and remove all letters and symbols. And if both letters and digits must be kept while symbols are removed, you can use isalnum().
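As a minimal sketch of the three filters described above (the helper names keep_letters, keep_digits and keep_alnum are made up for this illustration and do not appear elsewhere in the article):

```python
def keep_letters(text: str) -> str:
    """Keep alphabetic characters only."""
    return "".join(ch for ch in text if ch.isalpha())


def keep_digits(text: str) -> str:
    """Keep decimal digits only."""
    return "".join(ch for ch in text if ch.isdecimal())


def keep_alnum(text: str) -> str:
    """Keep letters and digits, drop every other symbol."""
    return "".join(ch for ch in text if ch.isalnum())


sample = "Oleg#52-41!"
print(keep_letters(sample))  # Oleg
print(keep_digits(sample))   # 5241
print(keep_alnum(sample))    # Oleg5241
```

Each helper tests one character at a time, so no list of unwanted symbols is needed.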

Cleaning lines from symbols and numbers

Let's get down to business and write a small function that performs the necessary operations. Suppose we have a string with a full name that needs cleaning. Let's take an invented name and add digits and symbols to it.

For example: Dyachenko-Volobuev))#90= Oleg vladimirovych52415

As you can see, there is plenty of junk here. And this is not the limit; real data can be even worse. So, let's start by creating a function fio_normalize(fio: str, ascii_l: bool) -> str, which accepts the text as input and returns it in cleaned form. The ascii_l flag switches on the handling of Latin letters described further on.

Sometimes the full name field contains spam instead of a name, that is, a link. So first, let's check whether the string contains "http". If it does, there is no point in cleaning it any further, and we simply return an empty string.

    # "https" also contains "http", so one case-insensitive check is enough
    if "http" in fio.lower():
        return ""

The string can also contain hyphens. After all, a surname can be compound, something like Petrov-Vodkin. So we need to check for hyphens: if one is at the beginning or at the end of the string, strip it. Then check whether a hyphen occurs inside the string and, if it does, replace it with a placeholder word made of letters. This is necessary so that the hyphen is not removed by the isalpha() filter.

    if fio.startswith("-") or fio.endswith("-"):
        fio = fio.strip("-").strip()
    if "-" in fio:
        fio = fio.replace("-", "тирре")

Now the string is ready for removing symbols and digits. We perform that operation and then turn the placeholder word back into "-".

    fio = "".join(x for x in fio if x.isalpha() or x == " ").strip().replace("тирре", "-")
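To see the steps so far working together, here is a small self-contained sketch applied to the sample string from the beginning of this section. The function name strip_symbols and the PLACEHOLDER constant are introduced only for this illustration; the logic follows the snippets above.

```python
PLACEHOLDER = "тирре"  # alphabetic stand-in for "-", the same word as in the article


def strip_symbols(raw: str) -> str:
    """Remove digits and symbols but keep letters, spaces and inner hyphens."""
    text = raw.strip("-").strip()          # drop leading and trailing hyphens
    text = text.replace("-", PLACEHOLDER)  # protect inner hyphens from the filter
    text = "".join(ch for ch in text if ch.isalpha() or ch == " ")
    return text.strip().replace(PLACEHOLDER, "-")


print(strip_symbols("Dyachenko-Volobuev))#90= Oleg vladimirovych52415"))
# Dyachenko-Volobuev Oleg vladimirovych
```

The placeholder trick works because the stand-in word consists of letters only, so the isalpha() filter lets it through.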

The string may also contain transliteration, that is, Russian letters replaced with English ones. For example: Petrov. In this case the transliterate library can help. You should not expect too much from it, though, because different people spell the same endings differently, so a word may come out slightly distorted. That is not essential for a human reader, but it is already a problem for search. Nevertheless, it is worth trying to reverse the transliteration; it may well work. So install the transliterate module with this command in the terminal:

pip install transliterate

and import into our script:

from transliterate import translit

However, before transliterating, we need to find out whether the string consists entirely of English letters. For this we use a counter and the string module, or rather its ascii_letters constant. We count the ASCII letters (together with the separators between words) and compare the result with the number of characters in the string. If the two match, the string needs transliteration.
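The check can be sketched roughly as follows. The function name is_latin_only is hypothetical, and counting hyphens alongside spaces is an assumption of this sketch, made so that a hyphenated surname still qualifies as fully Latin.

```python
import string


def is_latin_only(text: str) -> bool:
    """True if every character is an ASCII letter, a space or a hyphen."""
    allowed = set(string.ascii_letters) | {" ", "-"}  # hyphen is this sketch's assumption
    ascii_count = sum(1 for ch in text if ch in allowed)
    return ascii_count == len(text)


print(is_latin_only("Petrov-Vodkin Oleg"))                # True
print(is_latin_only("\u041f\u0435\u0442\u0440\u043e\u0432"))  # False: fully Cyrillic
print(is_latin_only("\u041fe\u0442\u0440\u043e\u0432"))       # False: Cyrillic with one Latin "e"
```

Note that an empty string also passes this check, because zero counted characters equals a length of zero.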

However, that is not all. Sometimes a funny thing happens: at first glance a string is written in Russian, but when you look closely you realize that some of its characters have been replaced with similar-looking English letters. These need to be cleaned up as well. For example, the Cyrillic "Н" may be replaced with the Latin "H".

To do this, we need a substitution table and an additional function that applies it. More about it later. For now, let's take it for granted that the function exists and use it to replace letters in words.

    if ascii_l and ascii_count == len(fio):
        # the whole string is Latin: transliterate it into Cyrillic
        fio = translit(fio, "ru")
    elif ascii_l:
        # mixed string: swap look-alike Latin letters for Cyrillic ones
        fio = "".join(replacer(x) if x in string.ascii_letters else x for x in fio)

The next step is to capitalize each word of the full name, taking into account the hyphen in a compound surname. So let's write another small piece of code.

    fio = " ".join(x.strip().capitalize() for x in fio.split())
    lst = []
    for x in fio.split():
        if "-" in x:
            lst.append("-".join(z.capitalize() for z in x.split("-")))
        else:
            lst.append(x)
    fio = " ".join(lst)
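The same capitalization can be condensed into a single pass. This is only an illustrative variant of the code above; the function name capitalize_name is made up for the example.

```python
def capitalize_name(name: str) -> str:
    """Capitalize every word and every part of a hyphenated word."""
    words = []
    for word in name.split():
        # a word without a hyphen is treated as a single part
        words.append("-".join(part.capitalize() for part in word.split("-")))
    return " ".join(words)


print(capitalize_name("dyachenko-volobuev  oLEG vladimirovych"))
# Dyachenko-Volobuev Oleg Vladimirovych
```

Calling split() without arguments also collapses runs of spaces, so the result has exactly one space between words.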

Since this is a full name, it should contain only three words. We are not considering less traditional forms of names here. So we need to check the number of words in the string and, if there are more than three, cut it down to three.

We also need to check that the string is no longer than 50 characters. Of course, that is rare for a full name, but it does happen. So we keep such a string for the sake of completeness, but cut it to 50 characters. Why? If you are adding the data to an SQLite database, the length does not matter. But when adding it to MongoDB and then creating indexes, we can get an error about the number of characters in the indexed field.

    if len(fio.split()) > 3:
        fio = " ".join(fio.split()[0:3])
    if len(fio) > 50:
        # a slice up to index 50 keeps exactly 50 characters
        fio = fio[:50]
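As a quick illustration of the two limits, here is a hedged sketch with the thresholds pulled out into constants. The names MAX_WORDS, MAX_LENGTH and limit_name are hypothetical and used only here.

```python
MAX_WORDS = 3    # surname, first name, patronymic
MAX_LENGTH = 50  # character limit chosen in the article


def limit_name(name: str) -> str:
    """Keep at most MAX_WORDS words and at most MAX_LENGTH characters."""
    words = name.split()[:MAX_WORDS]
    # a slice [:50] keeps indices 0..49, that is, exactly 50 characters
    return " ".join(words)[:MAX_LENGTH]


print(limit_name("Ivanov Ivan Ivanovich Extra Words"))  # Ivanov Ivan Ivanovich
print(len(limit_name("x" * 80)))                        # 50
```

Slicing never raises an error on short strings, so the explicit length checks are optional in this form.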

Finally, we return the processed string from the function, or an empty string if nothing is left.

return fio if fio else ""

Complete code for the string cleanup function

import string

from transliterate import translit


def fio_normalize(fio: str, ascii_l: bool = True) -> str:
    # spam check: a string with a link is not a name
    if "http" in fio.lower():
        return ""
    # strip leading/trailing hyphens, protect inner ones with a placeholder
    if fio.startswith("-") or fio.endswith("-"):
        fio = fio.strip("-").strip()
    if "-" in fio:
        fio = fio.replace("-", "тирре")
    # keep letters and spaces only, then restore the hyphens
    fio = "".join(x for x in fio if x.isalpha() or x == " ").strip().replace("тирре", "-")
    # count ASCII letters plus separators; hyphens are counted too,
    # otherwise a hyphenated Latin surname would never match len(fio)
    ascii_count = sum(1 for x in fio if x in string.ascii_letters or x in " -")
    if ascii_l and ascii_count == len(fio):
        # the whole string is Latin: transliterate it into Cyrillic
        fio = translit(fio, "ru")
    elif ascii_l:
        # mixed string: swap look-alike Latin letters for Cyrillic ones
        fio = "".join(replacer(x) if x in string.ascii_letters else x for x in fio)
    # capitalize every word and every part of a hyphenated surname
    fio = " ".join(x.strip().capitalize() for x in fio.split())
    lst = []
    for x in fio.split():
        if "-" in x:
            lst.append("-".join(z.capitalize() for z in x.split("-")))
        else:
            lst.append(x)
    fio = " ".join(lst)
    # at most three words and at most 50 characters
    if len(fio.split()) > 3:
        fio = " ".join(fio.split()[0:3])
    if len(fio) > 50:
        fio = fio[:50]
    return fio if fio else ""

Now let's talk about the function that replaces those occurrences of English letters in Russian words. Let's create a function replacer(txt: str) -> str, which receives a character as input and returns its replacement if the character is in the substitution table; otherwise the character is returned unchanged.

def replacer(txt: str) -> str:
    # Latin letters and the Cyrillic letters they resemble
    symbols = ("ahkbtmxcepAHKBTMXCEP",
               "анквтмхсерАНКВТМХСЕР")
    tr = {ord(a): ord(b) for a, b in zip(*symbols)}
    return txt.translate(tr)
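Here is a small usage sketch of the same idea. It builds the table once with str.maketrans instead of a dictionary comprehension, which is a stylistic variation rather than the article's exact code; the names LATIN, CYRILLIC, TABLE and replace_lookalikes are hypothetical. The mixed test string is written with escape sequences so that it is clear which letters are Cyrillic.

```python
LATIN = "ahkbtmxcepAHKBTMXCEP"
CYRILLIC = "анквтмхсерАНКВТМХСЕР"
TABLE = str.maketrans(LATIN, CYRILLIC)  # built once, reused for every call


def replace_lookalikes(text: str) -> str:
    """Swap look-alike Latin letters for their Cyrillic counterparts."""
    return text.translate(TABLE)


# Latin "C", "e", "p", "e" mixed with Cyrillic "г" (U+0433) and "й" (U+0439)
mixed = "Cep\u0433e\u0439"
fixed = replace_lookalikes(mixed)
print(fixed)                                             # Сергей
print(fixed == "\u0421\u0435\u0440\u0433\u0435\u0439")   # True
```

Because translate() leaves characters outside the table untouched, the function can be applied to a whole string as well as to a single character.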

Well, I think this is where the first part of the article ends. In the next part we will talk about how to clean numbers of letters and symbols and how to normalize a phone number. We will also write code to test the functions we wrote for cleaning text and try it on an example.

And that's probably all.

Thank you for your attention. I hope you find this information useful.
