
Validate coding: inter-rater reliability or gold-standard comparison
Arguments
- data
A data frame containing the unit identifier and coder columns.
- id
Character scalar. Name of the column identifying units (e.g. document ID, paragraph ID).
- coder_cols
Character vector. Names of columns containing the coders' codes (each column = one coder).
- min_coders
Integer scalar. Minimum number of non-missing coders per unit for that unit to be included. Default is 2; see the sketch after this list.
- mode
Character scalar: either "icr" for inter-rater reliability statistics, or "gold" to compare coders against a gold-standard coder.
- gold
Character scalar: name of the gold-standard coder column (must be one of coder_cols) when mode = "gold".
- output
Character scalar: either "list" (default) to return a named list of metrics when mode = "icr", or "data.frame" to return a long data frame with columns metric and value. For mode = "gold", the result is always a data frame.
Value
If mode = "icr":
- If output = "list" (default): a named list of scalar metrics (e.g. res$fleiss_kappa).
- If output = "data.frame": a data frame with columns metric and value.
If mode = "gold": a data frame with one row per non-gold coder and the following columns:
- coder_id
Name of the coder column compared to the gold standard
- n
Number of units with non-missing gold and coder codes
- accuracy
Overall accuracy
- precision_macro
Macro-averaged precision across categories
- recall_macro
Macro-averaged recall across categories
- f1_macro
Macro-averaged F1 score across categories
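For concreteness, these four metrics can be derived from a confusion matrix of gold codes against a coder's codes. The base-R sketch below follows the standard definitions in Sokolova and Lapalme (2009), assuming macro-F1 is the unweighted mean of per-category F1 scores (one common convention); the vectors gold and pred are illustrative and not part of the function's interface.

# Toy gold-standard and coder codes (illustrative only)
gold <- c("a", "b", "a", "c", "a", "b")
pred <- c("a", "b", "b", "c", "a", "a")

lev <- union(gold, pred)
tab <- table(factor(gold, lev), factor(pred, lev))  # rows = gold, columns = coder

accuracy <- sum(diag(tab)) / sum(tab)

# Per-category precision and recall; macro-averaging takes the unweighted mean
precision <- diag(tab) / colSums(tab)  # TP / (TP + FP)
recall    <- diag(tab) / rowSums(tab)  # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

precision_macro <- mean(precision, na.rm = TRUE)
recall_macro    <- mean(recall, na.rm = TRUE)
f1_macro        <- mean(f1, na.rm = TRUE)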
Details
This function has been superseded by qlm_compare() for inter-rater
reliability and qlm_validate() for gold-standard validation.
This function validates nominal coding data with multiple coders in two ways: inter-rater reliability statistics, including Krippendorff's alpha (Krippendorff 2019) and Fleiss' kappa (Fleiss 1971), and gold-standard classification metrics following Sokolova and Lapalme (2009).
- mode = "icr": compute inter-rater reliability statistics (Krippendorff's alpha (nominal), Fleiss' kappa, mean pairwise Cohen's kappa, mean pairwise percent agreement, share of unanimous units, and basic counts); see the sketch after this list.
- mode = "gold": treat one coder column as a gold standard (typically a human coder) and, for each other coder, compute accuracy, macro-averaged precision, recall, and F1.
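The sketch below illustrates two of the "icr" statistics, mean pairwise percent agreement and mean pairwise Cohen's kappa (Cohen 1960), in base R. It is a simplified illustration that assumes complete (non-missing) codes and is not the package's implementation; the codes matrix is a toy example.

codes <- cbind(  # one row per unit, one column per coder (toy data)
  coder1 = c("a", "b", "a", "c", "a"),
  coder2 = c("a", "b", "b", "c", "a"),
  coder3 = c("a", "b", "a", "c", "b")
)

# Cohen's kappa for one pair of coders (nominal codes)
cohen_kappa <- function(x, y) {
  lev <- union(x, y)
  tab <- table(factor(x, lev), factor(y, lev))
  po  <- sum(diag(tab)) / sum(tab)                      # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # expected chance agreement
  (po - pe) / (1 - pe)
}

pairs <- combn(ncol(codes), 2)  # all coder pairs
mean(apply(pairs, 2, function(p) mean(codes[, p[1]] == codes[, p[2]])))       # mean pairwise percent agreement
mean(apply(pairs, 2, function(p) cohen_kappa(codes[, p[1]], codes[, p[2]])))  # mean pairwise Cohen's kappa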
References
Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology. 4th ed. Thousand Oaks, CA: SAGE. doi:10.4135/9781071878781
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382. doi:10.1037/h0031619
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. doi:10.1177/001316446002000104
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437. doi:10.1016/j.ipm.2009.03.002
Examples
if (FALSE) { # \dontrun{
# Inter-rater reliability (list output)
res_icr <- validate(
  data = my_df,
  id = "doc_id",
  coder_cols = c("coder1", "coder2", "coder3"),
  mode = "icr"
)
res_icr$fleiss_kappa

# Inter-rater reliability (data.frame output)
res_icr_df <- validate(
  data = my_df,
  id = "doc_id",
  coder_cols = c("coder1", "coder2", "coder3"),
  mode = "icr",
  output = "data.frame"
)

# Gold-standard validation, assuming coder1 is the human gold standard
res_gold <- validate(
  data = my_df,
  id = "doc_id",
  coder_cols = c("coder1", "coder2", "llm1", "llm2"),
  mode = "gold",
  gold = "coder1"
)
} # }
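The calls above assume an existing my_df. For a copy-paste-ready variant, the following sketch first builds a toy data frame; only arguments documented on this page are used, and the labels, seed, and column names are arbitrary.

set.seed(42)
lab <- c("pos", "neg", "neu")
toy <- data.frame(
  doc_id = paste0("doc", 1:12),
  coder1 = sample(lab, 12, replace = TRUE),
  coder2 = sample(lab, 12, replace = TRUE),
  llm1   = sample(lab, 12, replace = TRUE)
)
validate(
  data = toy,
  id = "doc_id",
  coder_cols = c("coder1", "coder2", "llm1"),
  mode = "icr",
  output = "data.frame"
)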