
Comparing and replicating coded results
In this tutorial, we will explore how to assess the reliability and
validity of LLM-coded results using the quallmer package.
We will cover three key functions:
- qlm_compare() for assessing inter-rater reliability between multiple coded results
- qlm_validate() for validating coded results against a gold standard
- qlm_replicate() for re-executing coding with different settings to test reliability
These tools help ensure that your qualitative coding is robust, reproducible, and accurate.
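In outline, a typical workflow chains these functions together. The sketch below uses placeholder names (my_corpus, my_codebook, my_gold) simply to show the shape of the calls; each step is run for real in the sections that follow.
# Sketch of the overall workflow (placeholder names; not run here)
# coded <- qlm_code(my_corpus, codebook = my_codebook, model = "openai/gpt-4o")
# rerun <- qlm_replicate(coded, model = "openai/gpt-4o-mini")
# qlm_compare(coded, rerun, by = "score", level = "ordinal")
# qlm_validate(coded, gold = my_gold, by = "score")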
Loading packages and data
# We will use the quanteda package
# for loading a sample corpus of inaugural speeches
# If you have not yet installed the quanteda package, you can do so by:
# install.packages("quanteda")
library(quanteda)
## Package version: 4.3.1
## Unicode version: 15.1
## ICU version: 74.2
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
library(quallmer)
## Loading required package: ellmer
# For educational purposes,
# we will use a subset of the inaugural speeches corpus
# The eleven most recent speeches in the corpus
data_corpus_inaugural <- quanteda::data_corpus_inaugural[50:60]
Using a codebook for this tutorial
For this tutorial, we’ll use the built-in
data_codebook_ideology as a quick example. This allows us to
focus on the comparison and validation functions rather than codebook
design.
# View the built-in ideology codebook
data_codebook_ideology
## quallmer codebook: Ideological scaling
## Input type: text
## Role: You are an expert political scientist specializing in ideolo...
## Instructions: Rate the ideological position of this text on a scale from 0...
## Output schema: ellmer::TypeObject
## Levels:
## score: ordinal
## explanation: nominal
Note: The built-in codebooks are provided as examples and starting points. For actual research projects, you should create custom codebooks specific to your research questions (see the “Creating codebooks” tutorial for details).
Initial coding run
Let’s code the speeches using our codebook with a specific model and settings:
# Code the speeches with GPT-4o using the built-in codebook on ideology
coded1 <- qlm_code(data_corpus_inaugural,
                   codebook = data_codebook_ideology,
                   model = "openai/gpt-4o",
                   params = params(temperature = 0),
                   name = "gpt4o_run")
## [working] (0 + 0) -> 10 -> 1 | ■■■■ 9%
## [working] (0 + 0) -> 3 -> 8 | ■■■■■■■■■■■■■■■■■■■■■■■ 73%
## [working] (0 + 0) -> 0 -> 11 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100%
# View the results
coded1
## # quallmer coded object
## # Run: gpt4o_run
## # Codebook: Ideological scaling
## # Model: openai/gpt-4o
## # Units: 11
##
## # A tibble: 11 × 3
## .id score explanation
## * <chr> <int> <chr>
## 1 1985-Reagan 8 The text emphasizes limited government, reduced taxes, an…
## 2 1989-Bush 7 The text emphasizes free markets, limited government inte…
## 3 1993-Clinton 4 The text emphasizes themes of renewal, change, and respon…
## 4 1997-Clinton 4 The text emphasizes themes of equality, community, and op…
## 5 2001-Bush 6 The text reflects a centrist to moderately right-leaning …
## 6 2005-Bush 7 The text emphasizes a strong commitment to spreading demo…
## 7 2009-Obama 3 The text emphasizes themes of unity, responsibility, and …
## 8 2013-Obama 3 The text emphasizes equality, collective action, and soci…
## 9 2017-Trump 8 The text emphasizes nationalism, protectionism, and a foc…
## 10 2021-Biden 3 The text emphasizes unity, democracy, and addressing soci…
## 11 2025-Trump 8 The text emphasizes nationalism, strong border control, m…
Replicating with different settings
The qlm_replicate() function allows you to re-execute
coding with different models, parameters, or codebooks while maintaining
a provenance chain. This is useful for testing the sensitivity of your
results to different settings.
Replicating with a different model
# Replicate the coding with openai/gpt-4o-mini
coded2 <- qlm_replicate(coded1,
                        model = "openai/gpt-4o-mini",
                        name = "mini_run")
## [working] (0 + 0) -> 8 -> 3 | ■■■■■■■■■ 27%
## [working] (0 + 0) -> 0 -> 11 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100%
Replicating with different temperature
# Replicate with higher temperature for more variability
coded3 <- qlm_replicate(coded1,
                        params = params(temperature = 0.7),
                        name = "gpt4o_temp07")
## [working] (0 + 0) -> 10 -> 1 | ■■■■ 9%
## [working] (0 + 0) -> 3 -> 8 | ■■■■■■■■■■■■■■■■■■■■■■■ 73%
## [working] (0 + 0) -> 0 -> 11 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100%
Comparing multiple coded results
Once you have multiple coded results, you can assess inter-rater
reliability using qlm_compare(). This is useful when you
want to check consistency across different models, coders, or coding
runs.
Computing Krippendorff’s alpha
# Compare the first three runs to assess reliability
comparison <- qlm_compare(coded1, coded2, coded3,
                          by = "score",
                          level = "ordinal")
# View the comparison results
comparison
##
## ── Inter-rater reliability ──
##
## Subjects: 11
## Raters: 3
##
## ── score (ordinal)
## Percent agreement: 0.3636
## Krippendorff's alpha: 0.9159
## Kendall's W: 0.9045
## Spearman's rho: 0.9421
##
The output shows:
- The number of subjects (11 speeches) and raters (3 LLM coding runs).
- The level of measurement (ordinal).
- The reliability measures and their values, appropriate to ordinal data.
Computing percent agreement with a tolerance
If we still treat the data as ordinal but relax the tolerance on agreement to within ±1 of the values being compared, agreement is defined more loosely and the result changes: percent agreement rises substantially, while the coefficient-based measures stay the same.
qlm_compare(coded1, coded2, coded3,
            by = "score",
            level = "ordinal",
            tolerance = 1)
##
## ── Inter-rater reliability ──
##
## Subjects: 11
## Raters: 3
##
## ── score (ordinal)
## Percent agreement: 0.9091
## Krippendorff's alpha: 0.9159
## Kendall's W: 0.9045
## Spearman's rho: 0.9421
##
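To see which speeches drive the remaining disagreement, you can line up the per-unit scores from the three runs yourself. Here is a minimal sketch in base R, assuming the coded objects allow $ extraction of their columns (as used with coded1$.id in the next section):
# Line up per-unit scores across the three runs (a sketch;
# assumes $ extraction of columns from coded objects)
scores <- data.frame(
  .id = coded1$.id,
  gpt4o = coded1$score,
  mini = coded2$score,
  temp07 = coded3$score
)
# Speeches where the runs span more than 1 scale point
scores[apply(scores[, -1], 1, function(x) diff(range(x)) > 1), ]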
Validating against a gold standard
When you have human-coded reference data (a gold standard), you can
assess the accuracy of LLM coding using qlm_validate().
The metrics depend on the level of measurement: classification metrics
such as accuracy, precision, recall, and F1-score for nominal data, and
correlation and error measures for ordinal or interval data.
Creating a gold standard
For this example, let’s simulate having human-coded ideology scores:
# In practice, this would be your human-coded reference data
gold_standard <- data.frame(
  .id = coded1$.id,
  score = c(8, 7, 4, 7, 6, 7, 5, 6, 8, 3, 8)
)
Computing validation metrics
# Validate the LLM coding against the gold standard
validation <- qlm_validate(coded1,
                           gold = gold_standard,
                           by = "score")
## ℹ Converting `gold` to <as_qlm_coded> object.
## ℹ Use `as_qlm_coded()` directly to provide coder names and metadata.
# View validation results
validation
##
## ── quallmer validation ──
##
## n: 11
##
## ── score (ordinal)
## Spearman's rho: 0.8884
## Kendall's tau: 0.8000
## MAE: 0.7273
The output shows:
- Spearman’s rho: the rank correlation between the LLM scores and the gold standard
- Kendall’s tau: an alternative rank-based measure of association
- MAE: the mean absolute error between the LLM scores and the gold standard
We can also set the level of measurement explicitly, validating the data as ordinal (as above) or even interval:
qlm_validate(coded1, gold = gold_standard, by = "score", level = "ordinal")## ℹ Converting `gold` to <as_qlm_coded> object.
## ℹ Use `as_qlm_coded()` directly to provide coder names and metadata.
##
##
## ── quallmer validation ──
##
##
##
## n: 11
##
##
##
##
##
## ── score (ordinal)
##
## Spearman's rho: 0.8884
##
## Kendall's tau: 0.8000
##
## MAE: 0.7273
qlm_validate(coded1, gold = gold_standard, by = "score", level = "interval")## ℹ Converting `gold` to <as_qlm_coded> object.
## ℹ Use `as_qlm_coded()` directly to provide coder names and metadata.
##
##
## ── quallmer validation ──
##
##
##
## n: 11
##
##
##
##
##
## ── score (interval)
##
## Pearson's r: 0.8092
##
## MAE: 0.7273
##
## RMSE: 1.4142
##
## ICC: 0.7460
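The messages above indicate that a plain data frame is converted on the fly. To record coder names and other metadata, you can wrap the gold standard yourself with as_qlm_coded() before validating. A minimal sketch (the available arguments for coder names and metadata are documented in ?as_qlm_coded):
# Wrap the gold standard explicitly before validating
# (see ?as_qlm_coded for coder-name and metadata arguments)
gold_coded <- as_qlm_coded(gold_standard)
qlm_validate(coded1, gold = gold_coded, by = "score")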
Best practices for reliability and validation
- Multiple replications: Run coding with at least 2–3 different models or settings to assess consistency
- Consistent temperature: Use temperature = 0 for more deterministic and reliable results
- Document settings: Use the name parameter to track different runs
- Gold standard size: Aim for at least 100 examples in your gold standard for reliable validation metrics
- Measure selection:
  - Use Krippendorff’s alpha for nominal/ordinal data
  - Use Cohen’s/Fleiss’ kappa for categorical agreement
  - Use correlation measures for continuous data
- Interpretation (when values fall below these thresholds, see the pairwise comparison sketch after this list):
  - α or κ > 0.80: Almost perfect agreement
  - α or κ 0.60–0.80: Substantial agreement
  - α or κ 0.40–0.60: Moderate agreement
  - α or κ < 0.40: Fair to poor agreement
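When agreement falls below these thresholds, comparing the runs pairwise with the same qlm_compare() call can help isolate which model or setting is responsible:
# Pairwise comparisons to isolate the source of disagreement
qlm_compare(coded1, coded2, by = "score", level = "ordinal")  # model change
qlm_compare(coded1, coded3, by = "score", level = "ordinal")  # temperature change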
Summary
In this tutorial, you learned how to:
- Use qlm_replicate() to systematically test coding across different models and settings
- Use qlm_compare() to assess inter-rater reliability between multiple coded results
- Use qlm_validate() to measure accuracy against a gold standard
- Interpret reliability and validation metrics
These tools help ensure that your qualitative coding is robust, reproducible, and scientifically sound.