
Validate coding: inter-rater reliability or gold-standard comparison
Arguments
- data
A data frame containing the unit identifier and coder columns.
- id
Character scalar. Name of the column identifying units (e.g. document ID, paragraph ID).
- coder_cols
Character vector. Names of columns containing the coders' codes (each column = one coder).
- min_coders
Integer scalar. Minimum number of non-missing coders per unit for that unit to be included. Default is 2; see the sketch after this list.
- mode
Character scalar: either "icr" for inter-rater reliability statistics, or "gold" to compare coders against a gold-standard coder.
- gold
Character scalar: name of the gold-standard coder column (must be one of coder_cols) when mode = "gold".
- output
Character scalar: either "list" (default) to return a named list of metrics when mode = "icr", or "data.frame" to return a long data frame with columns metric and value. For mode = "gold", the result is always a data frame.
Value
If mode = "icr":
- If output = "list" (default): a named list of scalar metrics (e.g. res$fleiss_kappa).
- If output = "data.frame": a data frame with columns metric and value.
If mode = "gold": a data frame with one row per non-gold coder and the following columns:
- coder_id
Name of the coder column compared to the gold standard
- n
Number of units with non-missing gold and coder codes
- accuracy
Overall accuracy
- precision_macro
Macro-averaged precision across categories
- recall_macro
Macro-averaged recall across categories
- f1_macro
Macro-averaged F1 score across categories
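For concreteness, these four metrics can be derived from a confusion matrix of gold codes against a coder's codes. The base-R sketch below follows the standard definitions in Sokolova and Lapalme (2009), assuming macro-F1 is the unweighted mean of per-category F1 scores (one common convention); the vectors gold and pred are illustrative and not part of the function's interface.

# Toy gold-standard and coder codes (illustrative only)
gold <- c("a", "b", "a", "c", "a", "b")
pred <- c("a", "b", "b", "c", "a", "a")

lev <- union(gold, pred)
tab <- table(factor(gold, lev), factor(pred, lev))  # rows = gold, columns = coder

accuracy <- sum(diag(tab)) / sum(tab)

# Per-category precision and recall; macro-averaging takes the unweighted mean
precision <- diag(tab) / colSums(tab)  # TP / (TP + FP)
recall    <- diag(tab) / rowSums(tab)  # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

precision_macro <- mean(precision, na.rm = TRUE)
recall_macro    <- mean(recall, na.rm = TRUE)
f1_macro        <- mean(f1, na.rm = TRUE)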
Details
This function has been superseded by qlm_compare() for inter-rater
reliability and qlm_validate() for gold-standard validation.
This function validates nominal coding data with multiple coders in two ways: inter-rater reliability statistics, including Krippendorff's alpha (Krippendorff 2019) and Fleiss' kappa (Fleiss 1971), and gold-standard classification metrics following Sokolova and Lapalme (2009).
- mode = "icr": compute inter-rater reliability statistics (Krippendorff's alpha (nominal), Fleiss' kappa, mean pairwise Cohen's kappa, mean pairwise percent agreement, share of unanimous units, and basic counts); see the sketch after this list.
- mode = "gold": treat one coder column as a gold standard (typically a human coder) and, for each other coder, compute accuracy, macro-averaged precision, recall, and F1.
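The sketch below illustrates two of the "icr" statistics, mean pairwise percent agreement and mean pairwise Cohen's kappa (Cohen 1960), in base R. It is a simplified illustration that assumes complete (non-missing) codes and is not the package's implementation; the codes matrix is a toy example.

codes <- cbind(  # one row per unit, one column per coder (toy data)
  coder1 = c("a", "b", "a", "c", "a"),
  coder2 = c("a", "b", "b", "c", "a"),
  coder3 = c("a", "b", "a", "c", "b")
)

# Cohen's kappa for one pair of coders (nominal codes)
cohen_kappa <- function(x, y) {
  lev <- union(x, y)
  tab <- table(factor(x, lev), factor(y, lev))
  po  <- sum(diag(tab)) / sum(tab)                      # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # expected chance agreement
  (po - pe) / (1 - pe)
}

pairs <- combn(ncol(codes), 2)  # all coder pairs
mean(apply(pairs, 2, function(p) mean(codes[, p[1]] == codes[, p[2]])))       # mean pairwise percent agreement
mean(apply(pairs, 2, function(p) cohen_kappa(codes[, p[1]], codes[, p[2]])))  # mean pairwise Cohen's kappa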
References
Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology. 4th ed. Thousand Oaks, CA: SAGE. doi:10.4135/9781071878781
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382. doi:10.1037/h0031619
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. doi:10.1177/001316446002000104
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437. doi:10.1016/j.ipm.2009.03.002
Examples
if (FALSE) { # \dontrun{
# Inter-rater reliability (list output)
res_icr <- validate(
  data = my_df,
  id = "doc_id",
  coder_cols = c("coder1", "coder2", "coder3"),
  mode = "icr"
)
res_icr$fleiss_kappa

# Inter-rater reliability (data.frame output)
res_icr_df <- validate(
  data = my_df,
  id = "doc_id",
  coder_cols = c("coder1", "coder2", "coder3"),
  mode = "icr",
  output = "data.frame"
)

# Gold-standard validation, assuming coder1 is the human gold standard
res_gold <- validate(
  data = my_df,
  id = "doc_id",
  coder_cols = c("coder1", "coder2", "llm1", "llm2"),
  mode = "gold",
  gold = "coder1"
)
} # }
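The calls above assume an existing my_df. For a copy-paste-ready variant, the following sketch first builds a toy data frame; only arguments documented on this page are used, and the labels, seed, and column names are arbitrary.

set.seed(42)
lab <- c("pos", "neg", "neu")
toy <- data.frame(
  doc_id = paste0("doc", 1:12),
  coder1 = sample(lab, 12, replace = TRUE),
  coder2 = sample(lab, 12, replace = TRUE),
  llm1   = sample(lab, 12, replace = TRUE)
)
validate(
  data = toy,
  id = "doc_id",
  coder_cols = c("coder1", "coder2", "llm1"),
  mode = "icr",
  output = "data.frame"
)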