
Compare coded results for inter-rater reliability
Compares two or more data frames or qlm_coded objects to assess inter-rater
reliability or agreement. This function extracts a specified variable from
each object and computes reliability statistics using the irr package.
Usage
qlm_compare(
...,
by,
level = NULL,
tolerance = 0,
ci = c("none", "analytic", "bootstrap"),
bootstrap_n = 1000
)

Arguments
- ...
Two or more data frames, qlm_coded, or as_qlm_coded objects to compare. These represent different "raters" (e.g., different LLM runs, different models, human coders, or human vs. LLM coding). Each object must have a .id column and the variable(s) specified in by. Objects should have the same units (matching .id values). Plain data frames are automatically converted to as_qlm_coded objects.
- by
Optional. Name of the variable(s) to compare across raters (quoted or unquoted). Can be a single variable (by = sentiment) or a character vector (by = c("sentiment", "rating")). If NULL (the default), all coded variables are compared.
- level
Optional. Measurement level(s) for the variable(s). Can be:
- NULL (default): auto-detect from the codebook
- Character scalar: use the same level for all variables
- Named list: specify a level for each variable
Valid levels are "nominal", "ordinal", "interval", or "ratio".
- tolerance
Numeric. Tolerance for agreement with numeric data. Default is 0 (exact agreement required). Used for the percent agreement calculation (illustrated in the sketch after this argument list).
- ci
Confidence interval method:
"none"No confidence intervals (default)
"analytic"Analytic CIs where available (ICC, Pearson's r)
"bootstrap"Bootstrap CIs for all metrics via resampling
- bootstrap_n
Number of bootstrap resamples when ci = "bootstrap". Default is 1000. Ignored when ci is "none" or "analytic".
Value
A qlm_comparison object (a tibble/data frame) with the following columns:
- variable: name of the compared variable
- level: measurement level used
- measure: name of the reliability metric
- value: computed value of the metric
- rater1, rater2, ...: names of the compared objects (one column per rater)
- ci_lower: lower bound of the confidence interval (only if ci != "none")
- ci_upper: upper bound of the confidence interval (only if ci != "none")
The object has class c("qlm_comparison", "tbl_df", "tbl", "data.frame") and
attributes containing metadata (raters, n, call).
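Because the result is an ordinary tibble, this metadata can be retrieved with base R's attr(); for example, with the comparison object created in the Examples below:

attr(comparison, "raters")   # names of the compared objects
attr(comparison, "n")        # the n metadata listed above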
Metrics computed by measurement level:
Nominal: alpha_nominal, kappa (Cohen's/Fleiss'), percent_agreement
Ordinal: alpha_ordinal, kappa_weighted (2 raters only), w (Kendall's W), rho (Spearman's), percent_agreement
Interval/Ratio: alpha_interval/alpha_ratio, icc, r (Pearson's), percent_agreement
Confidence intervals:
ci = "analytic": Provides analytic CIs for ICC and Pearson's r onlyci = "bootstrap": Provides bootstrap CIs for all metrics via resampling
Details
The function merges the coded objects by their .id column and only includes
units that are present in all objects. Missing values in any rater will
exclude that unit from analysis.
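A base R illustration of this behavior (not the package's internal code): only units present in every rater and free of missing values are analyzed.

rater_a <- data.frame(.id = 1:4, sentiment = c("pos", "neg", "neg", NA))
rater_b <- data.frame(.id = 2:5, sentiment = c("neg", "neg", "pos", "pos"))
merged <- merge(rater_a, rater_b, by = ".id", suffixes = c("_a", "_b"))
merged[complete.cases(merged), ]  # only units 2 and 3 enter the analysis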
Measurement levels and statistics:
Nominal: For unordered categories. Computes Krippendorff's alpha, Cohen's/Fleiss' kappa, and percent agreement.
Ordinal: For ordered categories. Computes Krippendorff's alpha (ordinal), weighted kappa (2 raters only), Kendall's W, Spearman's rho, and percent agreement.
Interval: For continuous data with meaningful intervals. Computes Krippendorff's alpha (interval), ICC, Pearson's r, and percent agreement.
Ratio: For continuous data with a true zero point. Computes the same measures as interval level, but Krippendorff's alpha uses the ratio-level formula which accounts for proportional differences.
Kendall's W, ICC, and percent agreement are computed using all raters simultaneously. For 3 or more raters, Spearman's rho and Pearson's r are computed as the mean of all pairwise correlations between raters.
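For intuition, that pairwise averaging can be reproduced in base R with made-up ratings (a sketch, not the package's internal code):

ratings <- data.frame(
  rater1 = c(1, 2, 3, 4, 5),
  rater2 = c(1, 3, 3, 4, 5),
  rater3 = c(2, 2, 3, 5, 5)
)
cors <- cor(ratings, method = "pearson")
mean(cors[lower.tri(cors)])  # mean of the three pairwise r values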
See also
qlm_validate() for validation of coding against gold standards,
qlm_code() for LLM coding, as_qlm_coded() for human coding.
Examples
# Load example coded objects
examples <- readRDS(system.file("extdata", "example_objects.rds", package = "quallmer"))
# Compare two coding runs
comparison <- qlm_compare(
examples$example_coded_sentiment,
examples$example_coded_mini,
by = "sentiment",
level = "nominal"
)
print(comparison)
#>
#> ── Inter-rater reliability ──
#>
#> Subjects: 5
#> Raters: 2
#>
#>
#> ── sentiment (nominal)
#> Percent agreement: 1.0000
#> Krippendorff's alpha: 1.0000
#> Kappa: 1.0000
#>
# Compare a single variable, auto-detecting its level from the codebook
qlm_compare(
examples$example_coded_sentiment,
examples$example_coded_mini,
by = "sentiment"
)
#>
#> ── Inter-rater reliability ──
#>
#> Subjects: 5
#> Raters: 2
#>
#>
#> ── sentiment (nominal)
#> Percent agreement: 1.0000
#> Krippendorff's alpha: 1.0000
#> Kappa: 1.0000
#>
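# Plain data frames are converted automatically, so human codes can be
# compared against an LLM run directly. A hedged sketch: the human data
# frame below is hypothetical, and its .id values are assumed to match
# the units of the coded object.
human <- data.frame(
  .id = 1:5,
  sentiment = c("positive", "negative", "neutral", "positive", "negative")
)
qlm_compare(
  examples$example_coded_sentiment,
  human,
  by = "sentiment",
  level = "nominal"
)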