
Compare coded results for inter-rater reliability
Compares two or more coded objects to assess inter-rater reliability or
agreement. For predefined-unit data (data frames or qlm_coded objects),
computes standard reliability statistics. For segmented corpora from
qlm_segment(), computes Krippendorff's alpha for unitizing (see Details).
Usage
qlm_compare(
  ...,
  by,
  level = NULL,
  tolerance = 0,
  ci = c("none", "analytic", "bootstrap"),
  bootstrap_n = 1000
)
Arguments
- ...
Two or more data frames, qlm_coded, or as_qlm_coded objects to compare. These represent different "raters" (e.g., different LLM runs, different models, human coders, or human vs. LLM coding). Each object must have a .id column and the variable specified in by. Objects should have the same units (matching .id values). Plain data frames are automatically converted to as_qlm_coded objects. Alternatively, all inputs may be segmented corpora from qlm_segment() or as_qlm_coded() with qlm_segment = TRUE (see Details).
- by
Optional. Name of the variable(s) to compare across raters, quoted or unquoted. Can be a single variable (by = sentiment) or a character vector (by = c("sentiment", "rating")). If NULL (default), all coded variables are compared.
- level
Optional. Measurement level(s) for the variable(s). Can be:
NULL (default): Auto-detect from codebook
Character scalar: Use same level for all variables
Named list: Specify level for each variable
Valid levels are "nominal", "ordinal", "interval", or "ratio".
- tolerance
Numeric. Tolerance for agreement with numeric data. Default is 0 (exact agreement required). Used for percent agreement calculation.
- ci
Confidence interval method:
"none": No confidence intervals (default)
"analytic": Analytic CIs where available (ICC, Pearson's r)
"bootstrap": Bootstrap CIs for all metrics via resampling
- bootstrap_n
Number of bootstrap resamples when ci = "bootstrap". Default is 1000. Ignored when ci is "none" or "analytic".
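To illustrate the tolerance argument, here is a minimal base R sketch of percent agreement with a tolerance. It assumes a unit counts as agreeing when the spread between the highest and lowest rating is at most tolerance; that rule and the helper name percent_agreement() are assumptions for illustration, not the package's confirmed implementation.

```r
# Hypothetical helper, not part of quallmer: share of units on which all
# raters agree within `tolerance` (assumed rule: max - min <= tolerance).
percent_agreement <- function(ratings, tolerance = 0) {
  ratings <- as.matrix(ratings)  # rows = units, columns = raters
  spread <- apply(ratings, 1, function(x) max(x) - min(x))
  mean(spread <= tolerance)
}

ratings <- cbind(
  rater1 = c(1, 2, 3, 4, 5),
  rater2 = c(1, 3, 3, 6, 5)
)

percent_agreement(ratings)                 # exact agreement: 3 of 5 units
percent_agreement(ratings, tolerance = 1)  # within 1 point: 4 of 5 units
```

With tolerance = 0 only identical ratings count; raising it to 1 also accepts the unit rated 2 and 3.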
Value
A qlm_comparison object (a tibble/data frame) with the following columns:
- variable
Name of the compared variable
- level
Measurement level used
- measure
Name of the reliability metric
- value
Computed value of the metric
- docid
Per-row context: source document identifier and overall indicator for unitizing comparisons; marginal (n=X) for nominal per-category alpha rows; NA otherwise.
- rater1, rater2, ...
Names of the compared objects (one column per rater)
- ci_lower
Lower bound of confidence interval (only if ci != "none")
- ci_upper
Upper bound of confidence interval (only if ci != "none")
The object has class c("qlm_comparison", "tbl_df", "tbl", "data.frame") and
attributes containing metadata (raters, n, call).
Metrics by measurement level (predefined-unit comparisons):
Nominal: alpha_nominal, kappa (Cohen's/Fleiss'), percent_agreement
Ordinal: alpha_ordinal, kappa_weighted (2 raters only), w (Kendall's W), rho (Spearman's), percent_agreement
Interval/Ratio: alpha_interval/alpha_ratio, icc, r (Pearson's), percent_agreement
For unitizing measures (segmented corpora), see Details.
Confidence intervals:
ci = "analytic": Provides analytic CIs for ICC and Pearson's r only
ci = "bootstrap": Provides bootstrap CIs for all metrics via resampling
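As a rough illustration of bootstrap CIs via resampling, the base R sketch below resamples units with replacement and takes percentile bounds for percent agreement. The percentile method, the 95% level, and the toy data are assumptions; the documentation does not state which bootstrap variant qlm_compare() uses.

```r
# Hypothetical sketch: percentile bootstrap CI for percent agreement.
set.seed(42)
rater1 <- c("pos", "neg", "pos", "neu", "neg", "pos", "neu", "neg", "pos", "pos")
rater2 <- c("pos", "neg", "neu", "neu", "neg", "pos", "pos", "neg", "pos", "neg")

agreement <- function(a, b) mean(a == b)

bootstrap_n <- 1000
boot_values <- replicate(bootstrap_n, {
  # Resample units (not individual ratings) so rater pairs stay aligned
  idx <- sample(seq_along(rater1), replace = TRUE)
  agreement(rater1[idx], rater2[idx])
})

estimate <- agreement(rater1, rater2)  # 7 of 10 units agree
ci <- quantile(boot_values, probs = c(0.025, 0.975), names = FALSE)
c(estimate = estimate, ci_lower = ci[1], ci_upper = ci[2])
```

Resampling whole units keeps each unit's ratings together, which is what makes the interval reflect uncertainty about the sample of coded units.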
Details
The function merges the coded objects by their .id column and keeps only
units that are present in all objects. A unit with a missing value from any
rater is excluded from the analysis.
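The merging rule can be pictured with a small base R sketch. It uses merge() and complete.cases() on toy data frames; these calls stand in for whatever the package does internally and are an assumption for illustration only.

```r
# Toy data: two raters with partly overlapping units and one missing value.
rater1 <- data.frame(.id = c("d1", "d2", "d3", "d4"),
                     sentiment = c("pos", "neg", NA, "pos"),
                     stringsAsFactors = FALSE)
rater2 <- data.frame(.id = c("d2", "d3", "d4", "d5"),
                     sentiment = c("neg", "pos", "pos", "neg"),
                     stringsAsFactors = FALSE)

# Inner join on .id keeps only units coded by both raters (d2, d3, d4)
merged <- merge(rater1, rater2, by = ".id", suffixes = c("_rater1", "_rater2"))

# Drop units with a missing value from any rater (d3)
usable <- merged[complete.cases(merged), ]
usable$.id  # "d2" "d4"
```

Units d1 and d5 drop out because only one rater coded them, and d3 drops out because rater1's value is missing.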
Measurement levels and statistics:
Nominal: For unordered categories. Computes Krippendorff's alpha, Cohen's/Fleiss' kappa, and percent agreement.
Ordinal: For ordered categories. Computes Krippendorff's alpha (ordinal), weighted kappa (2 raters only), Kendall's W, Spearman's rho, and percent agreement.
Interval: For continuous data with meaningful intervals. Computes Krippendorff's alpha (interval), ICC, Pearson's r, and percent agreement.
Ratio: For continuous data with a true zero point. Computes the same measures as interval level, but Krippendorff's alpha uses the ratio-level formula which accounts for proportional differences.
Kendall's W, ICC, and percent agreement are computed using all raters simultaneously. For 3 or more raters, Spearman's rho and Pearson's r are computed as the mean of all pairwise correlations between raters.
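The averaging of pairwise correlations for three or more raters can be sketched in base R as follows. The helper name mean_pairwise_cor() and the toy ratings are hypothetical; only combn() and cor() from base R and stats are used.

```r
# Toy ratings: rows = units, columns = raters.
ratings <- data.frame(
  rater1 = c(1, 2, 3, 4, 5, 6),
  rater2 = c(2, 1, 4, 3, 6, 5),
  rater3 = c(1, 3, 2, 5, 4, 6)
)

# Hypothetical helper: mean of the correlations over all rater pairs.
mean_pairwise_cor <- function(ratings, method = "pearson") {
  pairs <- combn(ncol(ratings), 2)  # each column holds one pair of rater indices
  cors <- apply(pairs, 2, function(p) {
    cor(ratings[[p[1]]], ratings[[p[2]]], method = method)
  })
  mean(cors)
}

mean_pairwise_cor(ratings, method = "pearson")   # mean Pearson's r
mean_pairwise_cor(ratings, method = "spearman")  # mean Spearman's rho
```

With three raters there are three pairs, so the reported value is the average of three correlations.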
Unitizing (segmentation) reliability
When all inputs are segmented corpora – created by qlm_segment() or
as_qlm_coded() with qlm_segment = TRUE – agreement is measured at
the character level using Krippendorff's alpha for unitizing continua
(Krippendorff, 2019, section 12.6). This accounts for segments of
unequal length and partial overlaps between coders' unitizations. The
observed and expected coincidence matrices are constructed from the
lengths of pairwise segment intersections across all observer pairs.
The output includes a docid column with per-document and overall
results. Segmented corpora must reference the same source text.
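To make the idea of pairwise segment intersections concrete, the base R sketch below computes the character length of every overlap between two coders' segments of the same text. The start/end representation and the helper intersection_lengths() are assumptions for illustration; the sketch covers only the intersection lengths, not gaps or the alpha computation itself.

```r
# Hypothetical representation: each coder's segments as inclusive start/end
# character positions within the same 80-character source text.
coder1 <- data.frame(start = c(1, 21, 51), end = c(20, 50, 80))
coder2 <- data.frame(start = c(1, 26, 61), end = c(25, 60, 80))

# Hypothetical helper: matrix of overlap lengths, rows = segments of `a`,
# columns = segments of `b`.
intersection_lengths <- function(a, b) {
  overlap <- outer(seq_len(nrow(a)), seq_len(nrow(b)), function(i, j) {
    pmin(a$end[i], b$end[j]) - pmax(a$start[i], b$start[j]) + 1
  })
  pmax(overlap, 0)  # non-overlapping pairs contribute zero characters
}

lengths_matrix <- intersection_lengths(coder1, coder2)
lengths_matrix
sum(lengths_matrix)  # 80: every character falls in exactly one intersection
```

Off-diagonal entries such as the 5 and 10 characters here are the partial overlaps that a character-level measure can credit or penalize, which a unit-level comparison would miss.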
Four members of the unitizing alpha family are supported:
- alpha_u_binary (\(_{|u}\alpha\))
Computed when by is omitted. Measures agreement on which character spans are identified as segments versus gaps (irrelevant matter). Collapses all segment values to a binary distinction. Use this for pure boundary agreement when segments carry no codes (section 12.6.4, eq. 35).
- alpha_u_nominal (\(_{u}\alpha_{nominal}\))
Computed when by names a docvar. Measures agreement on both boundary placement and the value (code) assigned to each segment. This is the most comprehensive measure: low values can reflect boundary disagreement, coding disagreement, or both (section 12.6.3, eq. 34).
- alpha_cu_nominal (\(_{cu}\alpha_{nominal}\))
Computed alongside alpha_u_nominal when by is specified. Measures coding agreement conditional on unitization, restricting the coincidence matrix to intersections of non-gap segments only. This isolates "do the coders agree on the codes?" from "do they agree on the boundaries?" (section 12.6.5, eqs. 36–37).
- alpha_u_per_value[k] (\(_{(k)u}\alpha_{nominal}\))
Computed alongside alpha_u_nominal when by is specified. Reports the reliability of each individual value k, showing which codes are applied reliably and which are not. Coverage (the percentage of all k-valued matter found in valued intersections) is reported in the docid column (section 12.6.6, eq. 38).
References
Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology (4th ed.). Sage. doi:10.4135/9781071878781
See also
Related workflow functions: qlm_validate() for validation of
coding against gold standards, qlm_code() for LLM coding,
as_qlm_coded() for human coding, qlm_segment() for LLM-powered
text segmentation.
Underlying reliability calculations (internal): reliability_alpha()
and reliability_alpha_u() for Krippendorff's alpha;
reliability_kappa() (Cohen) and reliability_kappa_fleiss();
reliability_kendall_w(); reliability_icc().
Examples
# Load example coded objects
examples <- readRDS(system.file("extdata", "example_objects.rds", package = "quallmer"))
# Compare two coding runs
comparison <- qlm_compare(
  examples$example_coded_sentiment,
  examples$example_coded_mini,
  by = "sentiment",
  level = "nominal"
)
print(comparison)
#>
#> ── Inter-rater reliability ──
#>
#> Subjects: 5
#> Raters: 2
#>
#>
#> ── sentiment (nominal)
#> Percent agreement 1.0000
#> Krippendorff's alpha 1.0000
#> alpha (value=1) [(n=4)] 1.0000
#> alpha (value=2) [(n=6)] 1.0000
#> Kappa 1.0000
#> kappa (value=1) [(n=4)] 1.0000
#> kappa (value=2) [(n=6)] 1.0000
#>
# Compare a specific variable, leaving the measurement level to be auto-detected
qlm_compare(
  examples$example_coded_sentiment,
  examples$example_coded_mini,
  by = "sentiment"
)
#>
#> ── Inter-rater reliability ──
#>
#> Subjects: 5
#> Raters: 2
#>
#>
#> ── sentiment (nominal)
#> Percent agreement 1.0000
#> Krippendorff's alpha 1.0000
#> alpha (value=1) [(n=4)] 1.0000
#> alpha (value=2) [(n=6)] 1.0000
#> Kappa 1.0000
#> kappa (value=1) [(n=4)] 1.0000
#> kappa (value=2) [(n=6)] 1.0000
#>