The use of Python for the collection and processing of digital footprint data
Digital footprints are usually discussed only in general terms, and the programming side of working with them is rarely described in any detail. This article reviews a set of Python libraries and techniques that can be used to collect and process digital footprint data.
The concept of “digital footprint”
Legally, the concept of a “digital footprint” is not defined. In the literature it is described as data about a specific person, and sometimes also as data about an organization or event; most often, however, information about people is meant.
In 2022, a professional standard for specialists in the modeling, collection and analysis of digital footprint data was adopted in Russia (Professional Standard 06.046). This document also deals primarily with personal data:
General Information: Conducting a comprehensive analysis of the digital footprint of a person (groups of people) and information and communication systems (hereinafter – ICS).
Thus, a digital footprint is data on the Internet related to a specific object, most often a person. When working with digital footprints, it is important to keep in mind the laws on personal data protection and intellectual property. This article discusses only the lawful collection of publicly available data from the Internet.
Digital footprint parsing & data collection
A logical way to collect a digital footprint with software tools is to reproduce the way a person works with publicly available data. A person looks for information about something specific through a search engine and then, studying the pages found, selects the fragments that mention the object of interest.
Processing depends on the purpose of further analysis and on the type of data collected. Text processing can include cleaning out unnecessary characters and tokenization – splitting the text into words/tokens/symbols. Numeric data may need missing-value handling and normalization. Image processing here means simple formatting.
Thus, the main stages of digital footprint data collection and processing are:
- Sending an HTTP request to the search engine’s web server mentioning the object of interest;
- Receiving a link to a page about the object of interest from the web server’s response, then sending an HTTP request for the code of that page;
- Selecting the information of interest from the received page code: either manually configured extraction from certain segments of the page, or checking the text for mentions of the object and collecting such sentences;
- Processing the digital footprint data:
  - Text: cleaning, tokenization;
  - Numbers: missing-value handling, normalization;
  - Images: simple formatting.
There are many Python libraries for parsing. “Requests” and “Beautiful Soup” are very beginner-friendly; there is a separate article on the subject. When working with a digital footprint, the same actions are performed – you just additionally need to search for and select information about the one object of interest.
- Sending an HTTP request to the search engine’s web server mentioning the object of interest can be done with the “Requests” library. The object must be specified in the request; in the example below it is passed in the URL.
import requests

# Google search URL mentioning the object of interest (the q= parameter)
url = "https://www.google.com/search?q=object&sxsrf=APwXEdcTFNrK6vqIhKkA8ofiMVABpdXz3Q%3A1685681947166&ei=G3d5ZPDmCaqsrgSvxpLoBg&ved=0ahUKEwiw4KjN5qP_AhUqlosKHS-jBG0Q4dUDCA8&uact=5&oq=object&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzINCAAQigUQsQMQgwEQQzINCAAQigUQsQMQgwEQQzINCAAQigUQsQMQgwEQQzIKCAAQigUQsQMQQzIICAAQgAQQsQMyCAgAEIAEELEDMg0IABCKBRCxAxCDARBDMg0IABCKBRCxAxCDARBDMgcIABCKBRBDMgsIABCKBRCxAxCDAToHCCMQigUQJzoECCMQJzoRCC4QgAQQsQMQgwEQxwEQ0QM6BQgAEIAEOgsIABCABBCxAxCDAToICC4QgAQQsQM6DgguEIAEELEDEMcBENEDOhIIABCKBRCxAxCDARBDEEYQ_wFKBAhBGABQAFjiC2DwDGgAcAF4AIAB2gOIAaALkgEJMC4xLjMuMC4xmAEAoAEBwAEB&sclient=gws-wiz-serp"

# Send the request and get the HTML of the search results page
r = requests.get(url)
r.text
- Obtaining a link to a page about the object of interest from the web server’s response can be done with “Beautiful Soup”: save the “href” attribute of an “a” tag that leads to a new page. Then an HTTP request for the code of that page can again be sent with “Requests”.
from bs4 import BeautifulSoup

# Parse all links found on the search results page
soup = BeautifulSoup(r.text, 'lxml')
found_urls = soup.find_all('a')

# Find a link to another site
target_url = None
for url in found_urls:
    href = url.get('href')
    # Links to other sites in Google results start like this
    if href and href.startswith("/url?q="):
        target_url = href[7:]  # skip the /url?q= prefix
        print(target_url)
        break

# Request the code of the found page
r2 = requests.get(target_url)
r2.text
- Selecting the information of interest from the received page code can be done either with a manually configured segment search in “Beautiful Soup”, or by checking the text for mentions of the object in a loop using the built-in “string” tools; a sketch of the segment-based option is shown after the example below.
def find_sentences_with_word(text, word):
    # Split the text into sentences, converting everything to lower case first
    sentences = text.lower().split(". ")
    # List for sentences that contain the given word
    sentences_with_word = []
    # Check every sentence
    for sentence in sentences:
        # Split the sentence into words
        words = sentence.split()
        # Check whether the sentence contains the given word
        if word in words:
            sentences_with_word.append(sentence)
    return sentences_with_word

# Sample text in which we will look for sentences
text = "There are a great many IT tools. " \
       "The object is one of them, and it requires special skills. " \
       "Many people think about the object and want to use it to simplify their work. " \
       "Python is a powerful programming language"

# The word we are looking for
word = "object"

# Find sentences containing the given word
sentences_with_word = find_sentences_with_word(text, word)

# Print the sentences found
for sentence in sentences_with_word:
    print(sentence)
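For the first, segment-based option, a minimal sketch might look like this. It assumes that r2 holds the page obtained in the previous step, that the relevant text lives in <p> tags, and that the object is called “object” – all of which depend on the particular site and task.

from bs4 import BeautifulSoup

# A minimal sketch, assuming r2 is the page fetched in the previous step
# and that the interesting text sits in <p> tags (a hypothetical assumption)
soup2 = BeautifulSoup(r2.text, 'lxml')
object_name = "object"  # hypothetical name of the object of interest

# Collect only those paragraphs that mention the object
for paragraph in soup2.find_all('p'):
    paragraph_text = paragraph.get_text(" ", strip=True)
    if object_name.lower() in paragraph_text.lower():
        print(paragraph_text)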
API & Digital Footprint Data Collection
A digital footprint can also be collected by interacting with other services programmatically through their APIs. In this way, data can be obtained from various online platforms.
- Collecting data from Google using the Google Custom Search API: the API allows you to perform searches and retrieve results in a structured format. To use it, you will need an API key, which can be obtained in the Google Cloud Console; Google’s documentation describes how to get one.
from googleapiclient.discovery import build

# API key for the Google Custom Search API
api_key = "YOUR_API_KEY"

# Create an object for interacting with the API
service = build("customsearch", "v1", developerKey=api_key)

# Perform the search query (cx is the ID of your custom search engine)
result = service.cse().list(q="python", cx="YOUR_CX").execute()

# Process the results
if "items" in result:
    for item in result["items"]:
        title = item.get("title", "")
        link = item.get("link", "")
        print(title)
        print(link)
        print("-----------")
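- Collecting data from YouTube using the YouTube Data API: the same google-api-python-client library is used – a search request returns video metadata in a structured format, and an API key is again required.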
from googleapiclient.discovery import build

# API key for the YouTube Data API
api_key = "YOUR_API_KEY"

# Create an object for interacting with the API
youtube = build("youtube", "v3", developerKey=api_key)

# Search for videos by keyword
request = youtube.search().list(q="python tutorial", part="snippet", maxResults=10)
response = request.execute()

# Process the results
if "items" in response:
    for item in response["items"]:
        video_title = item["snippet"]["title"]
        video_id = item["id"]["videoId"]
        print(video_title)
        print(f"https://www.youtube.com/watch?v={video_id}")
        print("-----------")
Processing of digital footprint data
The processing of digital footprint data has its own specifics compared with other data. This article covers only very simple examples; a more in-depth article is available on the subject. Depending on the type of data, different methods are applied.
- Text: for simple cleanup you can use the built-in “string” module; for tokenization (breaking the text into units) it is convenient to use “NLTK”.
import string
import nltk
from nltk.tokenize import word_tokenize

# The punkt tokenizer data may need to be downloaded on first run
nltk.download("punkt")

text = "This is a sample sentence for text preprocessing."
clean_text = text.lower()  # convert the text to lower case
clean_text = clean_text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
tokens = word_tokenize(clean_text)  # tokenize the text
print(tokens)
- Numbers: to handle missing values you can use “NumPy” and “Pandas” – remove them or replace them with the mean or median; for normalization you can use “Scikit-learn” or “Pandas” (a small Scikit-learn sketch is shown after the example below).
import numpy as np
import pandas as pd

# Example data with missing values
data = pd.Series([1, 2, np.nan, 4, 5])

# Remove missing values
clean_data = data.dropna()

# Replace missing values with the mean
mean_value = data.mean()
data_filled = data.fillna(mean_value)

# Normalize the data (nanmean/nanstd ignore the missing values)
normalized_data = (data - np.nanmean(data)) / np.nanstd(data)

print("Removing missing values:", clean_data)
print("Replacing missing values:", data_filled)
print("Normalized data:", normalized_data)
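Normalization can also be done with “Scikit-learn”. Below is a minimal sketch that assumes the same data_filled series as in the example above; MinMaxScaler rescales the values to the [0, 1] range.

from sklearn.preprocessing import MinMaxScaler

# A minimal sketch, assuming data_filled is the pandas Series from the example above
scaler = MinMaxScaler()
# MinMaxScaler expects a 2D array, so the Series is converted to a one-column DataFrame
scaled = scaler.fit_transform(data_filled.to_frame())
print("Min-max scaled data:", scaled.ravel())

- Images: simple formatting (resizing, converting to grayscale) can be done with the “Pillow” (PIL) library.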
from PIL import Image

image_path = "image.jpg"

# Open the image
image = Image.open(image_path)

# Resize the image
resized_image = image.resize((500, 500))

# Convert the image to grayscale
gray_image = image.convert("L")

# Save the preprocessed images
resized_image.save("resized_image.jpg")
gray_image.save("gray_image.jpg")