Transcribing wavs
I needed to do a bit of research on transcribing audio files. I did some Ruby scripting to test several approaches, including Azure Speech and OpenAI speech-to-text (Whisper). Both services offer synchronous transcription, but with limits. At the time of writing, OpenAI speech-to-text is limited to audio files with a maximum size of 25 MB. Azure Speech takes a different approach and limits by duration (you can send a file of any size, but it will only transcribe the first 60 seconds).
I knew I was going to be dealing with wav files that would likely always be more than 60 seconds in duration and greater than 25 MB. I did explore Azure's batch transcription, which worked well, but had significantly more overhead. So, I looked at a couple of workarounds to the synchronous transcription limitations.
# Synchronous transcription
This is a simple matter of POSTing a file to an endpoint and waiting for the response. It gets interesting when dealing with the limits on how much audio we're allowed to transcribe in a single HTTP request.
# File size limits
This was the easier limit to deal with. It's trivial to determine the size of any file, but a bit more involved to determine the audio duration of a wav file.
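For example, here is a minimal sketch using the wavefile gem (which also appears later in this post): size is a single stat call, while duration has to be derived from the header fields.

```ruby
require 'wavefile'

file_path = ARGV[0] || 'sample.wav'

# File size is one stat call, regardless of format
puts "File size: #{File.size(file_path)} bytes"

# Duration requires parsing the wav header; Reader.info does that without
# reading the audio data
info = WaveFile::Reader.info(file_path)
duration = info.sample_frame_count / info.sample_rate.to_f
puts "Duration: #{duration.round(2)} seconds"
```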
# Convert to mp3
An mp3 is a compressed audio format, while a wav is a lossless, uncompressed format. So, I figured I could use ffmpeg to do the conversion to mp3 and have some levers to pull (e.g., bit rate) to further reduce file size if needed. The quality of the audio matters only insofar as it impacts the model's ability to accurately perform the transcription.
The script below is the simplest implementation and assumes the mp3 will always be under the 25 MB limit. The sample.wav is a 30 second clip from the Changelog Podcast. The script converts the wav file to mp3 as a Tempfile before uploading to the speech-to-text service. It's important to ensure the tempfile is closed and deleted.
```ruby
require 'faraday'
require 'faraday/multipart'
require 'json'
require 'tempfile'

ENDPOINT = 'https://api.openai.com'.freeze
PATH = '/v1/audio/transcriptions'.freeze

file_path = ARGV[0] || 'sample.wav'

begin
  # Convert wav to mp3
  t_file = Tempfile.create([File.basename(file_path), '.mp3'])
  `ffmpeg -i #{file_path} -codec:a libmp3lame -qscale:a 2 -y #{t_file.path}`

  puts "Wav file size: #{File.size(file_path)}"
  puts "Mp3 file size: #{t_file.size}"

  headers = {
    'Authorization' => "Bearer #{ENV['OPENAI_API_KEY']}",
    'Content-Type' => 'multipart/form-data',
    'Accept' => 'application/json'
  }

  conn = Faraday.new(url: ENDPOINT, headers:) do |f|
    f.request :multipart
  end

  response = conn.post(PATH) do |req|
    req.body = {
      file: Faraday::Multipart::FilePart.new(t_file, 'audio/mpeg'), # the upload is an mp3, not a wav
      model: 'whisper-1'
    }
  end

  puts response.status
  puts response.body
ensure
  t_file.close
  File.unlink(t_file.path) # Tempfile.create returns a File, which has no #unlink
end
```
Running this script will produce the following output (the ffmpeg output is omitted):
```
Wav file size: 2274232
Mp3 file size: 331404
200
{
  "text": "Then we are now now we just do a newsletter once a week and we we publish that out and best newsletter in tech By the way, hey, thank you very much. I appreciate that. It's genuinely really good. It's the only one that I subscribe to I love that about you. How do you know it's really good if it's the only one you subscribe to well You got me That's I have a special skill of ruining compliments. Um, no, that's awesome"
}
```
This created an mp3 around 15% the size of the original wav.
Here is a breakdown of the ffmpeg command used in the script above:
```
ffmpeg -i sample.wav -codec:a libmp3lame -qscale:a 2 -y sample.mp3
```
- `-i`: input file
- `-codec:a libmp3lame`: audio codec for mp3 encoding
- `-qscale:a 2`: sets the quality for the audio stream (`a`) to level 2. This is codec specific; for lame it represents the `-V2` option (VBR 170-210 kbps). It can be adjusted to produce smaller file sizes at lower audio quality (see the sketch after this list). See also https://trac.ffmpeg.org/wiki/Encode/MP3#VBREncoding.
- `-y`: overwrite the output file without requiring confirmation. I needed this since I was creating the tempfile first.
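If a single conversion pass isn't guaranteed to land under the limit, one option is to step down the quality until it does. A minimal sketch, assuming the 25 MB limit and an arbitrary quality ladder (neither comes from OpenAI guidance):

```ruby
require 'tempfile'

MAX_BYTES = 25_000_000 # assumed API upload limit

# Step down lame's VBR quality (bigger number = smaller file) until the
# mp3 fits under the limit. The caller must close and delete the file.
def convert_under_limit(wav_path)
  [2, 5, 7, 9].each do |qscale|
    t_file = Tempfile.create([File.basename(wav_path), '.mp3'])
    system('ffmpeg', '-i', wav_path, '-codec:a', 'libmp3lame',
           '-qscale:a', qscale.to_s, '-y', t_file.path)
    return t_file if File.size(t_file.path) < MAX_BYTES

    t_file.close
    File.unlink(t_file.path)
  end
  raise 'Could not get under the size limit; split the audio instead'
end
```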
# Convert to mp3 and split
The above assumes the conversion will always be under 25 MB, which might be true depending on the source wavs and the bit rate selected. If there is a need to split the mp3 into smaller files, ffmpeg can do that too by specifying the duration of each segment. For example:
```
ffmpeg -i sample.wav -codec:a libmp3lame -qscale:a 2 -f segment -segment_time 10 /var/tmp/sample%03d.mp3
```
This will convert the wav into mp3s each with a 10 second duration, named sample000.mp3, sample001.mp3, …etc. (the segment muxer numbers from 000 by default).
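Driving this from Ruby looks something like the sketch below. This is an assumption-laden sketch, not code from my scripts: it assumes ffmpeg is on the PATH, and Dir.mktmpdir keeps the segments out of the working directory and cleans them up when the block returns.

```ruby
require 'tmpdir'

# Split a wav into N-second mp3 segments and yield their paths
def with_mp3_segments(wav_path, seconds: 600)
  Dir.mktmpdir do |dir|
    pattern = File.join(dir, 'segment%03d.mp3')
    system('ffmpeg', '-i', wav_path, '-codec:a', 'libmp3lame', '-qscale:a', '2',
           '-f', 'segment', '-segment_time', seconds.to_s, pattern) or
      raise 'ffmpeg failed'
    yield Dir[File.join(dir, 'segment*.mp3')].sort
  end # mktmpdir deletes the directory and segments here
end

with_mp3_segments('sample.wav') do |paths|
  paths.each { |p| puts "#{p}: #{File.size(p)} bytes" }
end
```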
# Split the wav file
In the event ffmpeg (or an equivalent tool) is not available in the environment where the code will run, another option is to split the wav file without converting to mp3. Wav files begin with a 44 byte header that contains information like sample rate, bit depth, etc., so each smaller wav file will need to be written with the correct header information in order to be read and transcribed properly.
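Those header fields sit at fixed byte offsets, so for a canonical 44-byte PCM header they can be read directly with String#unpack. A minimal sketch (real-world files can carry extra chunks before the data chunk, which is why a proper parser is safer):

```ruby
header = File.binread('sample.wav', 44) # canonical PCM header is 44 bytes

channels        = header[22, 2].unpack1('v') # 16-bit unsigned little-endian
sample_rate     = header[24, 4].unpack1('V') # 32-bit unsigned little-endian
byte_rate       = header[28, 4].unpack1('V') # bytes of audio per second
bits_per_sample = header[34, 2].unpack1('v')

puts "#{channels} channel(s), #{sample_rate} Hz, #{bits_per_sample}-bit, #{byte_rate} B/s"
```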
Fortunately, this can be done easily with the wavefile gem. The class below takes a wav file and splits it into chunks of approximately a given size (5 MB by default), writing each chunk to a Tempfile and yielding the list of tempfiles to the caller to do with as needed (in this case, POSTing to the transcription service).
```ruby
require 'wavefile'
require 'tempfile'

class WaveChunker
  # Note: chunks are written as mono 16-bit PCM at 44.1 kHz; sources in a
  # different format will be converted by the writer.
  FORMAT = WaveFile::Format.new(:mono, :pcm_16, 44100)

  attr_reader :file, :chunk_size

  # @param file [IO] a wave file
  # @param chunk_size [Integer] max size for each wave file in bytes
  def initialize(file, chunk_size = 5_000_000)
    @file = file
    @chunk_size = chunk_size
  end

  # Pass a block to operate on the list of temp wav files
  def chunk
    reader = WaveFile::Reader.new(file)
    puts <<~INFO
      Chunking #{file.path}:
        Audio format: #{reader.native_format.audio_format}
        Channels: #{reader.native_format.channels}
        Bits per sample: #{reader.native_format.bits_per_sample}
        Sample rate: #{reader.native_format.sample_rate}
    INFO

    t_files = []
    t_file = Tempfile.new
    # This might also work with StringIO if it's OK to keep everything in memory
    writer = WaveFile::Writer.new(t_file, FORMAT)

    reader.each_buffer do |buffer|
      # Start a new tempfile once the current one reaches the size limit
      if t_file.size >= chunk_size
        writer.close
        t_files << t_file
        t_file = Tempfile.new
        writer = WaveFile::Writer.new(t_file, FORMAT)
      end

      writer.write(buffer)

      # Close out the final chunk once all sample frames have been written
      if reader.current_sample_frame == reader.total_sample_frames
        writer.close
        t_files << t_file
      end
    end

    t_files.each(&:rewind)
    yield t_files
  ensure
    # Close and delete tmp files
    t_files.each do |tf|
      tf.close
      tf.unlink
    end
  end
end
```
I used the WaveChunker like this:
```ruby
require 'faraday'
require 'faraday/multipart'

ENDPOINT = 'https://api.openai.com'.freeze
PATH = '/v1/audio/transcriptions'.freeze

file_path = ARGV[0] || 'sample.wav'
file = File.open(file_path)

WaveChunker.new(file, 24_000_000).chunk do |files|
  headers = {
    'Authorization' => "Bearer #{ENV['OPENAI_API_KEY']}",
    'Content-Type' => 'multipart/form-data',
    'Accept' => 'application/json'
  }

  conn = Faraday.new(url: ENDPOINT, headers:) do |f|
    f.request :multipart
    f.response :logger
  end

  files.each do |f|
    response = conn.post(PATH) do |req|
      req.body = {
        file: Faraday::Multipart::FilePart.new(f, 'audio/wav'),
        model: 'whisper-1'
        # We could include the optional `prompt` param with the previous chunk's transcription
        # to tell the model we are continuing from previously transcribed audio.
      }
    end

    puts response.status
    puts response.body
  end
end
```
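The prompt idea from the comment above could look like the sketch below, which replaces the `files.each` loop. The 500-character tail is an arbitrary choice of mine; the API only uses the end of a long prompt anyway.

```ruby
# Chain transcriptions together by passing each chunk's text to the next
# request via the optional `prompt` param.
previous_text = nil

files.each do |f|
  body = {
    file: Faraday::Multipart::FilePart.new(f, 'audio/wav'),
    model: 'whisper-1'
  }
  # Tail of the previous transcript as context (nil range falls back to all of it)
  body[:prompt] = previous_text[-500..] || previous_text if previous_text

  response = conn.post(PATH) { |req| req.body = body }
  previous_text = JSON.parse(response.body)['text'] # requires 'json'
end
```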
# Audio duration limits
Azure Speech lets you synchronously transcribe up to 60 seconds of audio; for anything longer, batch transcription is recommended. But why not split a wav file into 60 second parts, POST each one, and combine the results? There are reasons not to do this, but it seemed like it would be fun to try.
# Bytes to seconds
To do this using the WaveChunker above, I need to know the number of bytes in 60 seconds of audio for a given file. Fortunately, bytes 28-31 (0-based) of the wav header hold the byte rate, i.e. the bytes of audio per second. It represents the following calculation:

```
(sample_rate * bits_per_sample * num_channels) / 8
```

So in order to compute the number of bytes in 60 seconds of audio, I multiply that by 60 and pass the resulting byte count to the WaveChunker. For example, 16-bit mono audio sampled at 44100 Hz works out to (44100 * 16 * 1) / 8 = 88,200 bytes per second, or about 5.3 MB per minute.
```ruby
file_path = 'sample.wav'
file = File.open(file_path)

wav_header = file.read(44) # First 44 bytes of a wav file is the header
bytes_per_second = wav_header[28..31].unpack1('L<') # 32 bit unsigned int, little-endian
chunk_size = bytes_per_second * 60 # byte size of 60 seconds of audio
file.rewind

# Split wav file into 60 second chunks
WaveChunker.new(file, chunk_size).chunk do |files|
  # ...
end
```
# Split using ffmpeg
Alternatively, a wav file can be split using ffmpeg, as seen above in the mp3 examples:
```
ffmpeg -i sample.wav -f segment -segment_time 60 sample_%03d.wav
```
This will split the wav file into several wav files each with a 60 second duration, named sample_000.wav, sample_001.wav, …etc.
# Asynchronous Transcription
Azure Speech batch transcription is way more involved than synchronous transcription. The steps are:
- Upload the file to an Azure storage container configured with permissions that allow the Speech instance to read from it. Authenticating the request using signed headers makes up much of the work.
- Create the batch job for one or more files.
- Poll the job status to determine when it is ready (it is possible to instead use callbacks).
- Get the list of transcription files.
- For each file, get the transcript.
```ruby
require 'base64'
require 'faraday'
require 'faraday/multipart'
require 'json'
require 'openssl'
require 'time'
require 'uri'

# Upload file to Azure storage account container
ACCOUNT_NAME = ENV['STORAGE_ACCOUNT']
CONTAINER_NAME = ENV['STORAGE_CONTAINER']
MIME_TYPE = 'audio/wav'
UPLOAD_METHOD = 'PUT'
STORAGE_VERSION = '2023-11-03'
STORAGE_URL = "https://#{ACCOUNT_NAME}.blob.core.windows.net"
SIGNING_HEADERS = [
  'Content-Encoding', 'Content-Language', 'Content-Length', 'Content-MD5',
  'Content-Type', 'Date', 'If-Modified-Since', 'If-Match', 'If-None-Match',
  'If-Unmodified-Since', 'Range'
].freeze

# Generates the signature used to authenticate requests to the storage REST API
class StorageSignature < Struct.new(:headers, :filename)
  def sign
    OpenSSL::HMAC.base64digest('SHA256', Base64.decode64(ENV['STORAGE_KEY']), signing_string)
  end

  def to_s = sign

  private

  def signing_string
    [
      UPLOAD_METHOD,
      *SIGNING_HEADERS.map { |h| headers.fetch(h, '') },
      *canonicalized_headers,
      canonicalized_resource
    ].join("\n")
  end

  def canonicalized_headers
    headers.select { |k| k =~ /^x-ms-/i }
           .map { |k, v| "#{k.downcase}:#{v}" }
           .sort
  end

  def canonicalized_resource
    "/#{ACCOUNT_NAME}/#{CONTAINER_NAME}/#{filename}"
  end
end

file_path = ARGV[0] || 'sample.wav'

def verbose? = ARGV[1] == 'verbose'

filename = File.basename(file_path)
file = File.open(file_path)

def validate_response(response, operation, expected_status)
  if verbose?
    p response.status
    p response.body
  end

  if response.status == expected_status
    puts "#{operation} success."
  else
    raise 'Something went wrong. Tip: rerun the script with verbose enabled: ' \
          '`./batch_transcription.rb sample.wav verbose`'
  end
end

storage_resource_path = "/#{CONTAINER_NAME}/#{filename}"
storage_headers = {
  'Content-Type' => MIME_TYPE,
  'Content-Length' => file.size.to_s,
  'x-ms-date' => Time.now.httpdate,
  'x-ms-blob-type' => 'BlockBlob',
  'x-ms-version' => STORAGE_VERSION
}

puts 'Begin file upload. This may take some time depending on the size of the file...'
conn = Faraday.new(url: STORAGE_URL, headers: storage_headers) do |f|
  f.response :logger if verbose?
end

signature = StorageSignature.new(conn.headers, filename)
conn.headers['Authorization'] = "SharedKey #{ACCOUNT_NAME}:#{signature}"

response = conn.public_send(UPLOAD_METHOD.downcase, storage_resource_path) do |req|
  req.body = Faraday::Multipart::FilePart.new(file, MIME_TYPE)
end
validate_response(response, 'File upload', 201)

# Batch transcription
SPEECH_ENDPOINT = "https://#{ENV['SPEECH_REGION']}.api.cognitive.microsoft.com/"
SPEECH_PATH = '/speechtotext/v3.1/transcriptions'

speech_headers = {
  'Content-Type' => 'application/json',
  'Ocp-Apim-Subscription-Key' => ENV['SPEECH_KEY']
}

conn = Faraday.new(url: SPEECH_ENDPOINT, headers: speech_headers) do |f|
  f.response :logger if verbose?
end

# Create batch transcription job
puts 'Creating batch transcription job...'
response = conn.post(SPEECH_PATH) do |req|
  req.body = {
    'contentUrls' => [URI(STORAGE_URL + storage_resource_path)],
    'locale' => 'en-US',
    'displayName' => "#{filename}_transcription",
    'timeToLive' => 'P1W', # Delete the transcription after 1 week (ISO 8601 duration format)
    'properties' => {
      'diarizationEnabled' => true, # Identify speakers
      'diarization' => {
        # The docs say this isn't required for only 2 speakers, but it performed better with it.
        'speakers' => { 'minCount' => 2, 'maxCount' => 2 }
      }
    }
  }.to_json
end
validate_response(response, 'Batch transcription job', 201)

batch_info = JSON.parse(response.body)
transcription_status_url = batch_info['self']
# Get transcription status. Poll until the job is no longer in progress.
def get_transcription_status(conn, url)
  puts 'Getting status...'
  response = conn.get(url)
  validate_response(response, 'Get transcription status', 200)
  JSON.parse(response.body)
end

puts 'Begin polling job status...'
transcription_status = get_transcription_status(conn, transcription_status_url)

# Statuses are NotStarted, Running, Succeeded, or Failed
while %w[NotStarted Running].include?(transcription_status['status'])
  puts "Transcription status is #{transcription_status['status']}. Waiting 1 minute before trying again..."
  sleep 60
  transcription_status = get_transcription_status(conn, transcription_status_url)
end

transcription_files_url = transcription_status['links']['files']

# Get transcription files
response = conn.get(transcription_files_url)
validate_response(response, 'Get transcription files', 200)

transcription_files = JSON.parse(response.body)
transcription_files = transcription_files['values']
                      .select { |v| v['kind'] == 'Transcription' }
                      .map { |v| v['links']['contentUrl'] }

# Get transcripts
transcription_files.map do |f|
  response = conn.get(f)
  validate_response(response, 'Get transcription', 200)
  transcript = JSON.parse(response.body)
  puts transcript
  transcript
end
```
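The transcript document itself is JSON; with diarization enabled, each entry in recognizedPhrases carries a speaker number. Here is a small sketch of turning one transcript into a readable, speaker-labeled script. The field names are from the v3.1 transcript format as I understand it; treat them as an assumption to verify against your own output.

```ruby
# Turn a v3.1 transcript document into "Speaker N: text" lines
def format_transcript(transcript)
  transcript['recognizedPhrases']
    .select { |p| p['recognitionStatus'] == 'Success' }
    .sort_by { |p| p['offsetInTicks'] } # restore chronological order
    .map { |p| "Speaker #{p['speaker']}: #{p.dig('nBest', 0, 'display')}" }
end

# e.g., inside the map over transcription_files above:
#   puts format_transcript(transcript)
```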