Example: Audio transcription and analysis

This example demonstrates a two-step process for analyzing audio content: (1) transcribing audio to text using OpenAI’s Whisper model, and (2) extracting structured information from the transcripts using qlm_code(). This workflow enables large-scale analysis of speeches, interviews, and other audio content.

Loading packages and data

library(quallmer)

library(dplyr)

library(purrr)

library(knitr)

First, we identify the audio files to analyze:

# Get all audio files from the data folder
audio_files <- list.files("data/audio/",
                          pattern = "\\.(wav|mp3)$",
                          full.names = TRUE)

cat("Found", length(audio_files), "audio files:\n")

## Found 6 audio files:

for (f in audio_files) {
  size_mb <- file.size(f) / 1024^2
  cat(sprintf("  %s (%.2f MB)\n", basename(f), size_mb))
}

##   F1523643-7930-4FCA-B0E1-1CB5AEBE6BF2.mp3 (0.50 MB)
##   harvard.wav (3.10 MB)
##   OSR_cn_000_0072_8k.wav (0.30 MB)
##   OSR_cn_000_0075_8k.wav (0.38 MB)
##   OSR_fr_000_0041_8k.wav (1.22 MB)
##   OSR_in_000_0064_8k.wav (0.57 MB)

These audio files contain speech samples in multiple languages from the Open Speech Repository and other sources.
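
Before sending files to a transcription service, it can help to check their sizes against the service's upload limit. The sketch below uses a hypothetical helper, check_upload_size(), that is not part of quallmer or openai. The 25 MB default reflects OpenAI's documented upload limit for audio files as far as we know; treat it as an assumption and check the current documentation.

```r
# Hypothetical helper (not part of any package): flag files above a size limit.
# max_mb = 25 is an assumed default based on OpenAI's documented upload limit.
check_upload_size <- function(paths, max_mb = 25) {
  size_mb <- file.size(paths) / 1024^2
  data.frame(
    file = basename(paths),
    size_mb = round(size_mb, 2),
    ok = !is.na(size_mb) & size_mb <= max_mb,
    stringsAsFactors = FALSE
  )
}

# Demonstration with two temporary files and an artificially small limit
tmp_dir <- tempfile("audio_demo")
dir.create(tmp_dir)
small <- file.path(tmp_dir, "small.wav")
large <- file.path(tmp_dir, "large.wav")
writeBin(raw(10), small)     # 10 bytes
writeBin(raw(2048), large)   # 2048 bytes

report <- check_upload_size(c(small, large), max_mb = 0.001)
print(report)
```

With the real data, check_upload_size(audio_files) would return one row per file; rows with ok equal to FALSE would need to be compressed or split before transcription.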

Step 1: Transcribing audio with Whisper

OpenAI’s Whisper model provides high-quality, multilingual transcription. We use the openai package to transcribe each audio file:

library(openai)

# Transcribe all audio files
transcriptions <- map_chr(audio_files, function(file_path) {
  cat("Transcribing:", basename(file_path), "\n")

  transcription <- create_transcription(
    file = file_path,
    model = "whisper-1"
  )

  transcription$text
})

# Name the transcriptions by filename
names(transcriptions) <- basename(audio_files)

# Save transcriptions
saveRDS(transcriptions, "data/transcriptions_whisper.rds")
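
In a larger batch, a single unreadable file or a network error would abort the whole map_chr() call. One possible way to guard against this is sketched below. safe_transcribe() and fake_transcribe() are hypothetical names introduced for illustration; the stand-in transcriber replaces the real API call so the example runs offline.

```r
# Hypothetical wrapper: return NA instead of stopping when one file fails.
# `transcribe_fn` is any function that takes a file path and returns text,
# for example:
#   function(p) openai::create_transcription(file = p, model = "whisper-1")$text
safe_transcribe <- function(file_path, transcribe_fn) {
  tryCatch(
    transcribe_fn(file_path),
    error = function(e) {
      warning("Transcription failed for ", basename(file_path), ": ",
              conditionMessage(e), call. = FALSE)
      NA_character_
    }
  )
}

# Stand-in transcriber for demonstration only: fails on one file
fake_transcribe <- function(path) {
  if (grepl("broken", path)) stop("unreadable audio")
  paste("transcript of", basename(path))
}

files <- c("data/audio/a.wav", "data/audio/broken.wav")
results <- suppressWarnings(
  vapply(files, safe_transcribe, character(1), transcribe_fn = fake_transcribe)
)
names(results) <- basename(files)
print(results)
```

Files that come back as NA can then be retried or inspected separately, without repeating the transcriptions that succeeded.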

Viewing the transcriptions

Let’s examine the transcribed content:

# Display each transcription (truncated for readability)
for (i in seq_along(transcriptions)) {
  cat("=== File:", names(transcriptions)[i], "===\n")

  # Show first 200 characters
  text_preview <- substr(transcriptions[i], 1, 200)
  if (nchar(transcriptions[i]) > 200) {
    text_preview <- paste0(text_preview, "...")
  }
  cat(text_preview, "\n\n")
}

## === File: F1523643-7930-4FCA-B0E1-1CB5AEBE6BF2.mp3 ===
## And this is such an important topic because I'm sure you've felt it. There's tension in the air, right? Everywhere you go. I don't know if it's the headlines or the fact that everybody is so stressed ... 
## 
## === File: harvard.wav ===
## The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is th... 
## 
## === File: OSR_cn_000_0072_8k.wav ===
## 院子门口不远处就是一个地铁站,这是一个美丽而神奇的景象,树上长满了又大又甜的桃子,海豚和鲸鱼的表演是很好看的节目。 邮局门前的人行道上有一个蓝色的邮箱。 
## 
## === File: OSR_cn_000_0075_8k.wav ===
## 天文望远镜可以用来观察天空 它到过很多地方观光旅游 山间的小道蜿蜒曲折 春天来了,山上开满了樱花 下雪以后,田野里白矮矮的一片 
## 
## === File: OSR_fr_000_0041_8k.wav ===
## Pourrais-je avoir un verre d'eau ? La SNCF assurera un train sur trois. Les coupoles de l'immense palais s'écroulèrent. On apercevait la voile blanche du petit bateau. Il ne sentit ni douleur ni secou... 
## 
## === File: OSR_in_000_0064_8k.wav ===
## शालिनी के पास सौ रुपए हैं। सीता और सुनील का लड़का बहुत होशयार है। तुम्हारी कविता लिठने का शौक कब से शुरू हुआ। सोते हुए शेर को जगाना उच्चित नहीं है। शोर मत करो नहीं तो सुहासिनी जाग जाएगी। काम शुरू होने...

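# Word counts (whitespace-delimited tokens)
# Note: Chinese is written without spaces between words, so its counts are far too low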
word_counts <- map_int(transcriptions, ~length(strsplit(.x, "\\s+")[[1]]))
cat("Word count statistics:\n")

## Word count statistics:

cat("  Range:", min(word_counts), "-", max(word_counts), "words\n")

##   Range: 2 - 119 words

cat("  Mean:", round(mean(word_counts)), "words\n")

##   Mean: 49 words
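
The minimum of 2 words comes from splitting on whitespace, which does not suit Chinese text. A rough alternative, sketched here under the assumption that counting Han characters is an acceptable length measure for Chinese, is to switch the unit depending on the script. The helper names count_words(), count_han(), and text_units() are hypothetical.

```r
# Count whitespace-delimited tokens (suitable for space-separated scripts)
count_words <- function(text) {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  length(tokens[nzchar(tokens)])
}

# Count Han characters using a Unicode script property (requires perl = TRUE)
count_han <- function(text) {
  m <- gregexpr("\\p{Han}", text, perl = TRUE)[[1]]
  if (m[1] == -1L) 0L else length(m)
}

# Report Han characters when any are present, otherwise words
text_units <- function(text) {
  n_han <- count_han(text)
  if (n_han > 0L) {
    list(unit = "han_characters", n = n_han)
  } else {
    list(unit = "words", n = count_words(text))
  }
}

english <- "The stale smell of old beer lingers."
chinese <- "\u5929\u6587\u671b\u8fdc\u955c"  # first five characters of one Chinese transcript

str(text_units(english))
str(text_units(chinese))
```

Character counts and word counts are not directly comparable, so summary statistics are best reported per unit rather than pooled across languages.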

Step 2: Defining the transcript analysis codebook

Now we create a codebook to extract structured information from the transcripts. This codebook analyzes language, topics, tone, and content:

# Define a comprehensive transcript analysis codebook
codebook_transcripts <- qlm_codebook(
  name = "Speech Transcript Analysis",
  instructions = paste(
    "You are a research assistant analyzing transcribed speech content.",
    "Provide structured analysis of the speech based on its content.",
    "Be objective and accurate in your assessments."
  ),
  schema = ellmer::type_object(
    language = ellmer::type_string(
      "Primary language of the speech, in English (e.g., 'English', 'Mandarin', 'French', 'Indonesian')"
    ),
    language_confidence = ellmer::type_enum(
      c("high", "medium", "low"),
      "Confidence in language identification"
    ),
    speech_type = ellmer::type_enum(
      c("conversational", "formal_speech", "reading", "spontaneous", "other"),
      "Type or style of speech"
    ),
    main_topics = ellmer::type_string(
      "Main topics or themes discussed in the speech (comma-separated, max 5 topics)"
    ),
    key_phrases = ellmer::type_string(
      "Important phrases or keywords mentioned (comma-separated, max 5 phrases)"
    ),
    tone = ellmer::type_enum(
      c("formal", "informal", "neutral", "technical", "conversational"),
      "Overall tone of the speech"
    ),
    sentiment = ellmer::type_enum(
      c("positive", "negative", "neutral", "mixed"),
      "Overall sentiment expressed in the speech"
    ),
    summary = ellmer::type_string(
      "Brief 2-3 sentence summary of the speech content"
    )
  ),
  role = "You are an expert linguist and discourse analyst.",
  input_type = "text"
)

# View the codebook structure
codebook_transcripts

## quallmer codebook: Speech Transcript Analysis 
##   Input type:   text
##   Role:         You are an expert linguist and discourse analyst.
##   Instructions: You are a research assistant analyzing transcribed speech co...
##   Output schema:ellmer::TypeObject
##   Levels:
##     language: nominal
##     language_confidence: nominal
##     speech_type: nominal
##     main_topics: nominal
##     key_phrases: nominal
##     tone: nominal
##     sentiment: nominal
##     summary: nominal

Coding transcripts using Gemini 2.5 Flash

We use Gemini 2.5 Flash to analyze the transcripts. This model is fast and cost-effective for text analysis:

# Apply transcript analysis using qlm_code()
coded_transcripts <- qlm_code(
  transcriptions,
  codebook = codebook_transcripts,
  model = "google_gemini/gemini-2.5-flash",
  name = "audio_transcripts_gemini",
  notes = "Analysis of multilingual speech transcripts",
  include_cost = TRUE
)

# Add filenames to results
coded_transcripts$.filename <- names(transcriptions)

# Save results
saveRDS(coded_transcripts, "data/coded_transcripts_gemini.rds")
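
Because both the transcription and the coding step call paid APIs, it can be useful to reuse the saved .rds files instead of recomputing on every run. The following is a minimal sketch of that pattern. with_cache() is a hypothetical helper, and expensive_step() stands in for a call such as qlm_code() so the example runs without network access.

```r
# Hypothetical helper: run `compute()` only if no cached result exists at `path`
with_cache <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Stand-in for an expensive API call; counts how often it actually runs
cache_path <- tempfile(fileext = ".rds")
n_calls <- 0
expensive_step <- function() {
  n_calls <<- n_calls + 1
  c(a = "first transcript", b = "second transcript")
}

first  <- with_cache(cache_path, expensive_step)  # computes and saves
second <- with_cache(cache_path, expensive_step)  # reads from disk
cat("Calls made:", n_calls, "\n")
```

The cached file has to be deleted by hand whenever the inputs, codebook, or model change, since this sketch does not detect stale results.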

Examining the results

Let’s view the extracted information:

# Display key results
coded_transcripts %>%
  select(.filename, language, language_confidence, speech_type,
         tone, sentiment) %>%
  kable(
    col.names = c("File", "Language", "Confidence", "Type", "Tone", "Sentiment"),
    caption = "Transcript Analysis Results"
  )

Transcript Analysis Results
File	Language	Confidence	Type	Tone	Sentiment
F1523643-7930-4FCA-B0E1-1CB5AEBE6BF2.mp3	English	high	conversational	conversational	negative
harvard.wav	English	high	other	neutral	neutral
OSR_cn_000_0072_8k.wav	Mandarin	high	spontaneous	neutral	positive
OSR_cn_000_0075_8k.wav	Mandarin	high	other	neutral	neutral
OSR_fr_000_0041_8k.wav	French	high	spontaneous	neutral	mixed
OSR_in_000_0064_8k.wav	Hindi	high	spontaneous	neutral	mixed

Total cost for analyzing 6 transcripts:

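# Rough estimate only: whisper-1 is billed per minute of audio, not per word.
# Dividing by 100 approximates duration by assuming about 100 words per minute.
cat("Transcription cost (Whisper): ~$",
    round(sum(word_counts) * 0.006 / 100, 4), " (estimated)\n", sep = "")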

## Transcription cost (Whisper): ~$0.0177 (estimated)

cat("Analysis cost (Gemini): $",
    round(sum(coded_transcripts$cost, na.rm = TRUE), 4), "\n", sep = "")

## Analysis cost (Gemini): $0.0128
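
The transcription estimate above appears to convert word counts to minutes at roughly 100 words per minute and then apply a price of $0.006 per minute. The sketch below makes that arithmetic explicit and shows the duration-based alternative. The function names are hypothetical, the durations and word counts are illustrative values rather than measurements of the example files, and the price should be checked against current OpenAI pricing.

```r
# Price per minute of audio, taken from the estimate above (assumed current)
whisper_price_per_min <- 0.006

# Cost from known audio durations in seconds
cost_from_duration <- function(duration_sec,
                               price_per_min = whisper_price_per_min) {
  sum(duration_sec) / 60 * price_per_min
}

# Cost from word counts, assuming a fixed speaking rate
cost_from_words <- function(n_words, words_per_min = 100,
                            price_per_min = whisper_price_per_min) {
  sum(n_words) / words_per_min * price_per_min
}

# Illustrative inputs (hypothetical)
durations_sec <- c(30, 45, 20)   # 95 seconds in total
words <- c(120, 80)              # 200 words in total

cat("From durations: $", cost_from_duration(durations_sec), "\n", sep = "")
cat("From words:     $", cost_from_words(words), "\n", sep = "")
```

When audio durations are available, the duration-based estimate is preferable, because word counts depend on speaking rate and on how each language is tokenized.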

Language distribution

cat("Languages detected:\n")

## Languages detected:

language_table <- table(coded_transcripts$language)
print(language_table)

## 
##  English   French    Hindi Mandarin 
##        2        1        1        2

cat("\nLanguage confidence levels:\n")

## 
## Language confidence levels:

print(table(coded_transcripts$language_confidence))

## 
##   high medium    low 
##      6      0      0

Topics and themes

# Display topics for each transcript
coded_transcripts %>%
  select(.filename, main_topics) %>%
  kable(
    col.names = c("File", "Main Topics"),
    caption = "Topics Identified in Each Transcript"
  )

Topics Identified in Each Transcript
File	Main Topics
F1523643-7930-4FCA-B0E1-1CB5AEBE6BF2.mp3	societal tension, stress, negative news, impatience
harvard.wav	Smells, Food flavors, Health benefits, Culinary preferences
OSR_cn_000_0072_8k.wav	local observations, scenery, urban features, entertainment, nature
OSR_cn_000_0075_8k.wav	astronomy, nature, travel, seasons, landscapes
OSR_fr_000_0041_8k.wav	daily observations, human emotions, nature, idiomatic expressions, everyday situations
OSR_in_000_0064_8k.wav	personal characteristics, daily observations, advice, interpersonal interactions, hobbies
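
Because main_topics is stored as a single comma-separated string, counting or comparing topics requires splitting it first. A minimal base R sketch follows, using two rows copied from the table above. split_topics() is a hypothetical helper, and lowercasing is an added normalization step.

```r
# Two rows copied from the results table above
topics <- data.frame(
  file = c("harvard.wav", "OSR_cn_000_0075_8k.wav"),
  main_topics = c(
    "Smells, Food flavors, Health benefits, Culinary preferences",
    "astronomy, nature, travel, seasons, landscapes"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per (file, topic) pair
split_topics <- function(df) {
  parts <- strsplit(df$main_topics, ",", fixed = TRUE)
  data.frame(
    file = rep(df$file, lengths(parts)),
    topic = tolower(trimws(unlist(parts))),
    stringsAsFactors = FALSE
  )
}

topics_long <- split_topics(topics)
print(topics_long)

# Topic frequencies across files
print(sort(table(topics_long$topic), decreasing = TRUE))
```

An alternative design would be to declare the field as an array in the codebook schema, if the schema library supports it, so that no post hoc splitting is needed.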

Detailed view of one transcript

Let’s examine the complete analysis for one transcript:

# Select the first transcript for detailed view
transcript_detail <- coded_transcripts[1, ]

cat("=== Detailed Analysis ===\n\n")

## === Detailed Analysis ===

cat("File:", transcript_detail$.filename, "\n\n")

## File: F1523643-7930-4FCA-B0E1-1CB5AEBE6BF2.mp3

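# Enum fields are factors; convert them to character so cat() prints
# the labels rather than the underlying integer codes
cat("Language:", transcript_detail$language,
    "(", as.character(transcript_detail$language_confidence), "confidence )\n")

## Language: English ( high confidence )

cat("Speech type:", as.character(transcript_detail$speech_type), "\n")

## Speech type: conversational

cat("Tone:", as.character(transcript_detail$tone), "\n")

## Tone: conversational

cat("Sentiment:", as.character(transcript_detail$sentiment), "\n\n")

## Sentiment: negative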

cat("Main topics:", transcript_detail$main_topics, "\n\n")

## Main topics: societal tension, stress, negative news, impatience

cat("Key phrases:", transcript_detail$key_phrases, "\n\n")

## Key phrases: tension in the air, stressed out, negative news, impatience of people

cat("Summary:", transcript_detail$summary, "\n\n")

## Summary: The speaker observes a pervasive tension in society, suggesting it stems from headlines, general stress, or negative news. This tension is evident in the impatience displayed by people.

cat("=== Original Transcription (first 300 chars) ===\n")

## === Original Transcription (first 300 chars) ===

cat(substr(transcriptions[1], 1, 300), "...\n")

## And this is such an important topic because I'm sure you've felt it. There's tension in the air, right? Everywhere you go. I don't know if it's the headlines or the fact that everybody is so stressed out or the news is so negative, but the impatience of people when they're waiting. ...
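
A note on printing enum fields: the enum columns appear to be returned as factors, and cat() prints a factor's underlying integer codes rather than its labels. Converting with as.character() before printing gives the labels. The short sketch below reproduces the behavior with the tone levels from the codebook.

```r
# Levels in the same order as the `tone` enum in the codebook
tone_levels <- c("formal", "informal", "neutral", "technical", "conversational")
tone <- factor("conversational", levels = tone_levels)

# cat() on a factor prints the integer code of the level
codes_printed <- capture.output(cat(tone, "\n"))

# Converting to character first prints the label
labels_printed <- capture.output(cat(as.character(tone), "\n"))

print(trimws(codes_printed))
print(trimws(labels_printed))
```

print() and format() use the labels by default, so this only matters for functions such as cat() and paste-free numeric coercion that work on the underlying codes.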

Summary statistics

# Speech type distribution
cat("Speech types:\n")

## Speech types:

print(table(coded_transcripts$speech_type))

## 
## conversational  formal_speech        reading    spontaneous          other 
##              1              0              0              3              2

# Tone distribution
cat("\nTone distribution:\n")

## 
## Tone distribution:

print(table(coded_transcripts$tone))

## 
##         formal       informal        neutral      technical conversational 
##              0              0              5              0              1

# Sentiment distribution
cat("\nSentiment distribution:\n")

## 
## Sentiment distribution:

print(table(coded_transcripts$sentiment))

## 
## positive negative  neutral    mixed 
##        1        1        2        2
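
The one-way tables above can be extended to cross-tabulations and proportions. The sketch below rebuilds the language and sentiment columns from the results table shown earlier, so it runs without the coded object. With the real data, coded_transcripts$language and coded_transcripts$sentiment would be used directly.

```r
# Values copied from the "Transcript Analysis Results" table
results <- data.frame(
  language = c("English", "English", "Mandarin", "Mandarin", "French", "Hindi"),
  sentiment = factor(
    c("negative", "neutral", "positive", "neutral", "mixed", "mixed"),
    levels = c("positive", "negative", "neutral", "mixed")
  ),
  stringsAsFactors = FALSE
)

# Language by sentiment cross-tabulation
xtab <- table(results$language, results$sentiment)
print(xtab)

# Share of each sentiment category across all transcripts
shares <- round(prop.table(table(results$sentiment)), 2)
print(shares)
```

Keeping sentiment as a factor with explicit levels ensures that categories with zero observations still appear in the tables. With only six transcripts, the proportions are descriptive and should not be read as estimates.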

Creating an audit trail

Document the complete analysis:

qlm_trail(coded_transcripts, path = "audio_analysis")

This creates two files:

audio_analysis.rds: Complete trail object containing the coding run, codebook, and metadata
audio_analysis.qmd: Quarto document with full audit trail documentation

Summary

This example demonstrates:

Two-step workflow: Transcription (Whisper) → Analysis (LLM)
Multilingual capability: Whisper handles multiple languages automatically
Structured extraction: Codebooks define what to extract from transcripts
Scalability: Process multiple audio files in batch
Cost efficiency: Whisper is significantly cheaper than human transcription
Reproducibility: All steps are documented and can be replicated

This workflow enables researchers to analyze audio content at scale, from political speeches to interviews to podcast episodes. The combination of Whisper for transcription and LLMs for analysis provides both accuracy and interpretability.