Whisper audio transcription with diarization

How-to Guide

If you have a use case for transcribing audio calls and pulling the information out into a JSON format, this is the guide for you. The model I am using is Whisper, the open-source release from OpenAI. There are faster implementations built on its base with flash attention: gitlink


The architecture is as follows:

  1. Identify Speaker
  2. Break Audio into chunks
  3. Transcribe Audio
  4. Analysis of each speaker

For our example we will be using the OpenAI API.

Identify speaker

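from pyannote.audio import Pipeline
# The model is gated on Hugging Face, so from_pretrained may also need an access token
speaker_diarization = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0")
who_speaks_when = speaker_diarization(filepath)  # hints such as num_speakers=2 can also be passed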

Well, how do you break the audio apart? Sorry bud, I am trying to monetize this feature. Do it yourself 😑
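
If you do want to do it yourself, one possible starting point (a minimal sketch, not the approach alluded to above) is to collect the speaker turns as (start, end, speaker) tuples and merge back-to-back turns from the same speaker, so each chunk is one uninterrupted stretch of speech. The function name merge_turns and the max_gap threshold are assumptions for illustration. With pyannote the tuples can be collected through itertracks(yield_label=True); cutting the actual audio at those timestamps is left to whatever audio library you prefer.

```python
def merge_turns(segments, max_gap=0.5):
    """Merge consecutive turns by the same speaker into one chunk.

    segments: iterable of (start_sec, end_sec, speaker) tuples.
    max_gap: hypothetical threshold; same-speaker turns separated by
             at most this many seconds are joined.
    """
    chunks = []
    for start, end, speaker in sorted(segments):
        if chunks and chunks[-1][2] == speaker and start - chunks[-1][1] <= max_gap:
            # Extend the previous chunk instead of starting a new one
            prev_start, prev_end, _ = chunks[-1]
            chunks[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            chunks.append((start, end, speaker))
    return chunks


# With pyannote, the input could be built like this (not run here):
# segments = [(turn.start, turn.end, speaker)
#             for turn, _, speaker in who_speaks_when.itertracks(yield_label=True)]
example = [
    (0.0, 2.1, "SPEAKER_00"),
    (2.3, 4.0, "SPEAKER_00"),
    (4.2, 7.5, "SPEAKER_01"),
    (9.0, 10.0, "SPEAKER_01"),
]
chunks = merge_turns(example)
```

Here the first two turns collapse into a single SPEAKER_00 chunk, while the two SPEAKER_01 turns stay separate because the pause between them is longer than max_gap.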

Transcribe Audio

Once you have the chunks split apart, you can do a simple transcription with the following code:

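import openai

filepath = "audio_file.mp3"
# Open in binary mode; the with block closes the file after the request
with open(filepath, "rb") as audio_file:
    raw_transcript = openai.Audio.transcribe("whisper-1", audio_file)
# Legacy (openai<1.0) interface; the text is in raw_transcript["text"]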
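
Since the later steps need the text grouped by speaker, here is a minimal sketch of how the per-chunk transcripts might be collected into one dictionary. The helper name transcribe_chunks, the (speaker, path) input shape, and the chunk file names are assumptions for illustration; the transcribe argument is any function that maps a file path to text, for example a thin wrapper around the Whisper call above.

```python
from collections import defaultdict


def transcribe_chunks(chunks, transcribe):
    """Build {speaker: full text} from per-chunk transcripts.

    chunks: list of (speaker, audio_path) pairs in chronological order.
    transcribe: callable taking an audio path and returning its text.
    """
    parts = defaultdict(list)
    for speaker, path in chunks:
        text = transcribe(path).strip()
        if text:  # skip silent or empty chunks
            parts[speaker].append(text)
    # Join each speaker's pieces into a single string
    return {speaker: " ".join(texts) for speaker, texts in parts.items()}


# Stand-in transcriber so the sketch runs without an API key
fake_transcripts = {
    "chunk_0.mp3": "Thank you for calling, how can I help you today?",
    "chunk_1.mp3": "My machine is broken.",
    "chunk_2.mp3": " Can I get your asset number? ",
}
transcriptions = transcribe_chunks(
    [("SPEAKER_00", "chunk_0.mp3"), ("SPEAKER_01", "chunk_1.mp3"), ("SPEAKER_00", "chunk_2.mp3")],
    fake_transcripts.get,
)
```

Passing the transcriber in as a parameter keeps the grouping logic independent of the API, so it can be tried out with canned text before spending money on real calls.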

Analysis of each speaker

Once you have the transcription for each speaker, finding a specific speaker is a simple search. Customer reps tend to ask “How are you doing?”

We'll use trusty embeddings: embed a few typical customer rep lines and each speaker's transcription, then find which speaker's speech is closest to those lines.

import fasttext
import fasttext.util
# Download English model
fasttext.util.download_model('en', if_exists='ignore')
ft = fasttext.load_model('cc.en.300.bin')
# Transcriptions per speaker from the previous steps (placeholder text for illustration)
transcriptions = {
    'SPEAKER_00': 'Thank you for calling, how can I help you today?',
    'SPEAKER_01': 'Hi, my machine is broken and I need someone to look at it.',
}
# Common phrases that a customer service rep might say
customer_service_phrases = [
    'How can I assist you?',
    'How can I help you today?',
    'Thank you for calling Customer Support.',
    'Is there anything else I can help you with?',
]
# Compute the average embedding of customer service phrases
cs_embedding = sum(ft.get_sentence_vector(phrase) for phrase in customer_service_phrases) / len(customer_service_phrases)

# Compute the similarity between speakers' text and customer service phrases
similarities = {}
for speaker, text in transcriptions.items():
    speaker_embedding = ft.get_sentence_vector(text.replace('\n', ' '))  # fastText rejects newlines
    similarity = sum(cs_embedding * speaker_embedding)  # dot product; cosine similarity can also be used
    similarities[speaker] = similarity

# Identify the speaker with the highest similarity as the customer service rep
customer_service_rep = max(similarities, key=similarities.get)

print(f"The customer service rep is likely: {customer_service_rep}")
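
The code comment above mentions cosine similarity as an alternative to the raw dot product. A minimal sketch of that variant follows, in plain Python so it runs without the fastText model; the function names and the toy 3-dimensional vectors are made up for illustration. Cosine similarity divides by both vector lengths, so a speaker is not favoured just because their embedding has a larger norm.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as "no similarity"
    return dot / (norm_a * norm_b)


def most_similar_speaker(reference, speaker_vectors):
    """Return the speaker whose vector is closest to the reference."""
    return max(speaker_vectors, key=lambda s: cosine_similarity(reference, speaker_vectors[s]))


# Toy vectors standing in for fastText sentence embeddings
reference = [1.0, 0.0, 1.0]
speaker_vectors = {
    "SPEAKER_00": [0.9, 0.1, 0.8],  # points the same way as the reference
    "SPEAKER_01": [1.0, 5.0, 1.0],  # larger norm, different direction
}
best = most_similar_speaker(reference, speaker_vectors)
```

In this toy case the raw dot product would rank SPEAKER_01 higher (2.0 against 1.7) purely because of the vector's size, while cosine similarity picks SPEAKER_00.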

LLM Function Calling

Now that you have speaker identification and transcription, you can use OpenAI function calling to pull out critical information.

Create a class for the information you want to pull out.

from pydantic import BaseModel, Field

class rubric_judge(BaseModel):
    rubric_name: str = Field(..., description="Did you ask for their name/company?")
    rubric_contact: str = Field(..., description="Did you ask for their email/phone number?")
    rubric_addr: str = Field(..., description="Did you ask for their address?")

class CallInformation(BaseModel):
    pickup_name: str = Field(..., description="Give me the full name of the customer")
    pickup_street: str = Field(..., description="Give me the street address, zipcode and state")
    asset_num: int = Field(..., description="Give me the asset number provided for the machinery")
    problem_description: str = Field(..., description="Provide a brief description of the problem that the customer called for")

Use the class above to pull the call information out of the transcript:

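import json

def get_completion_from_messages(messages, model="gpt-3.5-turbo-0613"):  # default model is an assumption; use any model that supports function calling
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        functions=[{
            "name": "extract_call_transcription",
            "description": "Extract important information from call transcript",
            "parameters": CallInformation.schema(),
        }],
        function_call={"name": "extract_call_transcription"},  # force the model to call this function
    )
    function_call = response.choices[0].message["function_call"]
    arguments = json.loads(function_call["arguments"])
    return arguments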

By calling the function with your messages, you get the extracted fields back:

messages = []
messages.append({"role": "system", "content": "Don't make assumptions about what values to plug into functions. Return n/a if you don't know."})
messages.append({"role": "user", "content": f"I am giving you a call transcript: {raw_transcript['text']}"})
call_info = get_completion_from_messages(messages)
print(call_info)

Your output should be similar to this:

{'pickup_name': 'Chris',
 'pickup_street': '144 main street',
 'asset_num': 777958,
 'problem_description': 'The machine is broken, the pulley system is broken'}
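
Because the system prompt tells the model to return n/a for anything it does not know, it can be worth checking which fields actually came back filled before using the result. A minimal sketch; the helper name missing_fields and the set of placeholder strings are assumptions, not part of the OpenAI API.

```python
PLACEHOLDERS = {"", "n/a", "na", "none", "unknown"}  # assumed spellings of "don't know"


def missing_fields(call_info):
    """Return the keys whose value is absent or a placeholder like 'n/a'."""
    missing = []
    for key, value in call_info.items():
        if value is None or str(value).strip().lower() in PLACEHOLDERS:
            missing.append(key)
    return missing


call_info = {
    "pickup_name": "Chris",
    "pickup_street": "N/A",
    "asset_num": 777958,
    "problem_description": "The machine is broken, the pulley system is broken",
}
needs_follow_up = missing_fields(call_info)
```

Any field listed in needs_follow_up is something a rep would have to ask the customer for again.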

The use cases are plenty: once you know the problem description, you can classify the problem and route the call to the correct department.
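
As a rough illustration of that routing idea, here is a keyword-based sketch. The department names and keyword lists are entirely hypothetical; in practice the classification could just as well be another LLM call or an embedding comparison like the one used for speaker identification.

```python
# Hypothetical departments and trigger words, for illustration only
ROUTING_RULES = {
    "mechanical": ["pulley", "belt", "gear", "motor"],
    "electrical": ["power", "fuse", "wiring", "battery"],
    "billing": ["invoice", "charge", "refund"],
}


def route_call(problem_description, default="general"):
    """Pick the department whose keywords appear most often in the text."""
    text = problem_description.lower()
    scores = {
        dept: sum(text.count(word) for word in words)
        for dept, words in ROUTING_RULES.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default queue when nothing matched
    return best if scores[best] > 0 else default


department = route_call("The machine is broken, the pulley system is broken")
```

With the example problem description from above, the word "pulley" sends the call to the hypothetical mechanical queue; a description with no matching keywords falls back to the default.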