import whisper
import pyaudio
import numpy as np
from scipy.signal import resample
from datetime import datetime
import math

model = whisper.load_model("tiny")
audio_input = pyaudio.PyAudio()

quality = math.floor(16392 * 3)  # samples per read: roughly 3 s of audio at 16 kHz
sample_rate = 16000
chunk_size = quality

stream = audio_input.open(format=pyaudio.paInt16, channels=1, rate=sample_rate,
                          input=True, frames_per_buffer=chunk_size)

whisper_sample_rate = 16000  # Whisper models expect 16 kHz audio
energy_threshold = 400  # mean absolute int16 amplitude below which a chunk is treated as silence
print("up")

try:
    while True:
        # don't raise if the buffer overflowed while transcription was running
        audio_data = stream.read(chunk_size, exception_on_overflow=False)
        # int16 PCM -> float32 in [-1.0, 1.0], the range Whisper expects
        audio_array = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
        # a no-op while the capture rate already matches 16 kHz, but keeps other capture rates possible
        audio_array_resampled = resample(audio_array, int(len(audio_array) * whisper_sample_rate / sample_rate))
        # simple energy gate: skip transcription of near-silent chunks
        energy = np.sum(np.abs(np.frombuffer(audio_data, dtype=np.int16))) / quality
        if energy > energy_threshold:
            result = model.transcribe(audio_array_resampled)
            print(datetime.now(), " - ", result['text'])

except KeyboardInterrupt:
    print("Down")

stream.stop_stream()
stream.close()
audio_input.terminate()

Balancing the sample rate against the "quality" variable (the chunk size) is an art; at 16 kHz, 16392 * 3 = 49176 samples works out to roughly three seconds of audio per transcription call. It's good enough for review.

This is a pretty rough first play with the Whisper model, but I've got live audio going into it and the output is fairly good.

The SRT format looks simple enough; I think it may be possible to auto-generate subtitles.
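For reference, an SRT file is just numbered cues: an index, a `start --> end` timestamp line, the subtitle text, and a blank line between cues:

```srt
1
00:00:01,000 --> 00:00:03,200
First line of dialogue.

2
00:00:03,400 --> 00:00:05,000
Second line of dialogue.
```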

Technically the hardest part would likely be the formatting: cutting between characters' speeches. I've also seen the tool output sound descriptions, but those may need to be added manually on a second pass.
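Whisper's `transcribe` result also includes a `segments` list with `start`/`end` timestamps alongside the text, so a first pass at SRT generation could be sketched like this (a minimal sketch; the segment boundaries won't necessarily line up with speaker changes, which is the hard part mentioned above):

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments):
    """Convert Whisper-style segments ({'start', 'end', 'text'}) to SRT text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        timing = f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}"
        blocks.append(f"{i}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

# with a real transcription this would be: srt_text = segments_to_srt(result['segments'])
print(segments_to_srt([
    {'start': 0.0, 'end': 2.5, 'text': ' Hello there.'},
    {'start': 2.5, 'end': 4.0, 'text': ' General Kenobi.'},
]))
```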

Cool tech though.