Speech recognition (transcription) from audio recordings of dialogues. Whisper. Personal experience

The task: recognize speech in audio recordings of dialogues between employees and customers.

We use Whisper and work in Colab.

In addition to speech recognition, Whisper provides segment timestamps out of the box. The source code and documentation are in the openai/whisper repository on GitHub.

According to the documentation:

"Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification."

Basic usage from the documentation

The model is installed in the standard way.
There are several model sizes (tiny, base, small, medium, large), and Colab can run any of them.

!pip install -U openai-whisper
import whisper

model_name="large"
model = whisper.load_model(model_name)

The basic application of the model is also simple and standard.

result = model.transcribe(file)
print(result["text"])

We will recognize the speech of two participants in a two-channel audio file and add timing.

Upload the audio file

We upload the file from the computer, save its name, and create a folder for the future archive.

import os
import wave
from google.colab import files

uploaded = files.upload()
file = next(iter(uploaded.keys()))
source_file_name = file.replace('.wav','')

path = "/content/" + source_file_name
os.mkdir(path)

We make sure that the name of the file has been saved correctly and that the file contains two channels.

print(source_file_name)

audio_file = wave.open(source_file_name + '.wav')
CHANNELS = audio_file.getnchannels()
print("Number of channels:", CHANNELS)

We detect the language

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio(source_file_name + '.wav')
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

Note that the language is detected from the first 30 seconds only. This can be misleading: the interlocutors may greet each other in one language, then decide they are more comfortable speaking another and switch to it.
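
If the language switch matters, a workaround is to run detection again on a later window of the recording. A minimal sketch (relying on Whisper's default 16 kHz sample rate):

# detect the language again on the 0:30-1:00 window of the recording
sr = whisper.audio.SAMPLE_RATE  # 16000
audio = whisper.load_audio(source_file_name + '.wav')
later = whisper.pad_or_trim(audio[30 * sr : 60 * sr])
mel = whisper.log_mel_spectrogram(later).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language after 0:30: {max(probs, key=probs.get)}")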

We create two files with one track each

This part is not directly about Whisper and speech recognition. Here we read all the samples from the file, zero out every even-indexed sample, and write the result to a new file. Then we do the same with the odd-indexed samples.

The code for splitting the file into 2 tracks
import wave, struct

# the file is split into two tracks and two files are created

audio_file = wave.open(source_file_name + '.wav')

SAMPLE_WIDTH = audio_file.getsampwidth()  # sample width (bytes per sample)
CHANNELS = audio_file.getnchannels()      # number of channels
FRAMERATE = audio_file.getframerate()     # sampling rate
N_FRAMES = audio_file.getnframes()        # number of frames per channel

def create_file_one_channel(name, values):

    # create an empty output file in "wb" (write binary) mode
    out_file = wave.open(name, "wb")

    # copy the source file's parameters into the output file
    out_file.setframerate(FRAMERATE)
    out_file.setsampwidth(SAMPLE_WIDTH)
    out_file.setnchannels(CHANNELS)

    # pack the list of numbers back into a byte string
    audio_data = struct.pack(f"<{N_FRAMES * CHANNELS}h", *values)

    # write the processed data into the output file and close it
    out_file.writeframes(audio_data)
    out_file.close()

##########

print('started')

# read all the samples from the file
samples = audio_file.readframes(N_FRAMES)

# ask struct to turn the byte string into a list of numbers:
# "<" sets the byte order (little-endian; fine to always use here),
# the number in the middle is the total count of values, i.e. frames
# per channel times the number of channels,
# "h" means each value is a signed two-byte integer
values = list(struct.unpack(f"<{N_FRAMES * CHANNELS}h", samples))
print(values[:20])

values_copy = values[:]

# zero out every even-indexed value
for index, i in enumerate(values_copy):
    if index % 2 == 0:
        values_copy[index] = 0

create_file_one_channel('1_channel.wav', values_copy)
print(values_copy[:20])

values_copy = values[:]

# zero out every odd-indexed value
for index, i in enumerate(values_copy):
    if index % 2 != 0:
        values_copy[index] = 0

create_file_one_channel('2_channel.wav', values_copy)
print(values_copy[:20])

We now have the two channel files: 1_channel.wav and 2_channel.wav.
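
Before transcribing, it can be worth playing both tracks right in the notebook to check that the split worked:

# optional check: listen to the split tracks in Colab
from IPython.display import Audio, display

display(Audio('1_channel.wav'))
display(Audio('2_channel.wav'))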

Transcription of channel 1

We transcribe the first-channel file

from datetime import timedelta

source_file_name_channel="1_channel"
print(source_file_name_channel)

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio(source_file_name_channel + '.wav')
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
print('started...')
print()

result = model.transcribe(source_file_name_channel + '.wav')

segments = result['segments']

text_massive = []

for segment in segments:
    # timedelta prints H:MM:SS, so we prepend a zero to get 0H:MM:SS
    startTime = str(0)+str(timedelta(seconds=int(segment['start'])))
    endTime = str(0)+str(timedelta(seconds=int(segment['end'])))
    text = segment['text']
    segmentId = segment['id']+1
    # drop the leading space Whisper puts before each segment's text
    segment = f"{segmentId}. {startTime} - {endTime}\n{text[1:] if text[0] == ' ' else text}"
    #print(segment)
    text_massive.append(segment)

print()
print('Finished')

and save the result to a file in docx format.

!pip install python-docx
from docx import Document

# save the text together with the timing

text = text_massive

# create a new document
doc = Document()

# add paragraphs with the text
doc.add_paragraph(source_file_name + '_' + source_file_name_channel + '_' + model_name)
for key in text:
  doc.add_paragraph(key)

# save the document
doc.save(path + '/' + source_file_name + '_' + source_file_name_channel + '_text_timing_' + model_name + '.docx')

Everything is the same for the second channel, except that we use "2_channel" instead of "1_channel".
You can also simply loop over these two values; a sketch follows below.
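
A minimal sketch of such a loop, assuming model, path, source_file_name and model_name from the earlier cells are still defined:

# the same transcribe-and-save steps for both channels in one loop
for source_file_name_channel in ['1_channel', '2_channel']:
    result = model.transcribe(source_file_name_channel + '.wav')

    doc = Document()
    doc.add_paragraph(source_file_name + '_' + source_file_name_channel + '_' + model_name)
    for segment in result['segments']:
        startTime = str(0) + str(timedelta(seconds=int(segment['start'])))
        endTime = str(0) + str(timedelta(seconds=int(segment['end'])))
        doc.add_paragraph(f"{segment['id'] + 1}. {startTime} - {endTime}\n{segment['text'].strip()}")
    doc.save(path + '/' + source_file_name + '_' + source_file_name_channel + '_text_timing_' + model_name + '.docx')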

At this stage we have two docx files with text and timing, located in the previously created folder with the matching name. It remains to create an archive and download it.

Creating and downloading an archive

import shutil

# Create a zip archive with the files from the folder
shutil.make_archive(path + '_' + model_name, 'zip', path)

# Download the archive
files.download(path + '_' + model_name + '.zip')

Deleting files and folders

If you need to process another file "manually", the following code deletes the previously uploaded and generated files and folders from Colab.

# list everything at once and delete it all at once

file_massive = [
    source_file_name + '.wav',
    source_file_name + '_' + model_name + '.zip',
    '1_channel.wav',
    '2_channel.wav',
    path + '/' + source_file_name + '_1_channel_text_timing_' + model_name + '.docx',
    path + '/' + source_file_name + '_2_channel_text_timing_' + model_name + '.docx'
  ]

# Delete the files
for key in file_massive:
  print(key)
  os.remove(key)
print()

# Delete the folder
print(path)
os.rmdir(path)
print()

print('Finished')

Colab is now clean, and you can work with the next file.

Addition

If there are several files and you want to automate the process, you can upload all the audio files at once and run all the steps in a single loop, as sketched below.
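
A sketch of such a batch run (files.upload() accepts several files at once; the model is assumed to be loaded in an earlier cell):

# upload several .wav files at once and transcribe each in turn
uploaded = files.upload()

for file in uploaded.keys():
    result = model.transcribe(file)
    print(file, '->', result['text'][:100])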

For deeper automation, you can pull the files from the client's card or from dedicated storage and, after transcription, save the dialogue back to the client's card or storage.

Transcription immediately in English

Whisper has an interesting option: translating directly into English, without first producing a transcript in the source language. This can be useful if the files are in different languages but need to be processed in a uniform way. In that case it makes sense to translate all dialogues into English and do the further processing in English.

For transcription directly into English, an additional parameter is used.

result = model.transcribe(source_file_name_channel + '.wav', task="translate")
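
Relatedly, if the language of a recording is known in advance, it can be passed explicitly so the model skips autodetection:

# force the transcription language instead of autodetecting it
result = model.transcribe(source_file_name_channel + '.wav', language='ru')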

Note

If you find an inaccuracy in the article, or think something useful could be added, please let us know in the comments.
