Convert coded data to qlm_coded format

Converts a data frame or quanteda corpus of coded data (human-coded or from external sources) into a qlm_coded object. The resulting object carries provenance information and can be passed to qlm_compare(), qlm_validate(), and qlm_trail() alongside LLM-coded results.

Usage

as_qlm_coded(
  x,
  id,
  name = NULL,
  is_gold = FALSE,
  codebook = NULL,
  texts = NULL,
  notes = NULL,
  metadata = list()
)

# S3 method for class 'data.frame'
as_qlm_coded(
  x,
  id,
  name = NULL,
  is_gold = FALSE,
  codebook = NULL,
  texts = NULL,
  notes = NULL,
  metadata = list()
)

# Default S3 method
as_qlm_coded(
  x,
  id,
  name = NULL,
  is_gold = FALSE,
  codebook = NULL,
  texts = NULL,
  notes = NULL,
  metadata = list()
)

Arguments

x

A data frame or quanteda corpus object containing coded data. For data frames: Must include a column with unit identifiers (default ".id"). For corpus objects: Document variables (docvars) are treated as coded variables, and document names are used as identifiers by default.

id

For data frames: Name of the column containing unit identifiers, given either unquoted (id = doc_id) or as a quoted string (id = "doc_id"). Default is NULL, which looks for a column named ".id". For corpus objects: NULL (default) uses the document names from names(x); alternatively, specify a docvar name (quoted or unquoted) to use as identifiers.

name

Character. A string identifying this coding run (e.g., "Coder_A", "expert_rater", "Gold_Standard"). Default is NULL.

is_gold

Logical. If TRUE, marks this object as a gold standard for automatic detection by qlm_validate(). When a gold standard object is passed to qlm_validate(), the gold = parameter becomes optional. Default is FALSE.

codebook

Optional list containing coding instructions. Can include:

name: Name of the coding scheme
instructions: Text describing coding instructions
schema: NULL (not used for human coding)

If NULL (default), a minimal placeholder codebook is created.

texts

Optional vector of the original texts or data that were coded. Should correspond to the .id values in x. If provided, enables more complete provenance tracking.

notes

Optional character string with descriptive notes about this coding. Useful for documenting details when viewing results in qlm_trail(). Default is NULL.
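As a minimal sketch of how texts and notes might be supplied together: the example below keeps the original texts in the same data frame as the codes and passes them separately. The package name quallmer is inferred from the printed output on this page, the data are made up for illustration, and the row-order alignment between texts and .id is an assumption rather than documented behaviour.

```r
library(quallmer)  # package name inferred from the printed output on this page

# Hypothetical source data: three short texts and one coder's labels
docs <- data.frame(
  .id = c("d1", "d2", "d3"),
  text = c("Great service.", "Never again.", "Loved it."),
  sentiment = c("pos", "neg", "pos")
)

coded <- as_qlm_coded(
  docs[, c(".id", "sentiment")],  # identifier column plus the coded variable
  name = "Coder_A",
  texts = docs$text,              # assumed to follow the same row order as .id
  notes = "Pilot round, coded before the codebook revision"
)
```

Keeping the texts out of x means the text column is not treated as a coded variable, while the texts remain available for provenance.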

metadata

Optional list of metadata about the coding process. Can include any relevant information such as:

coder_name: Name of the human coder
coder_id: Identifier for the coder
training: Description of coder training
date: Date of coding

The function automatically adds timestamp, n_units, notes, and source = "human".

Value

A qlm_coded object (tibble with additional class and attributes) for provenance tracking. When is_gold = TRUE, the object is marked as a gold standard in its attributes.
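Because the return value is described as a tibble with an extra class and attributes, it can be checked and handled like an ordinary data frame. The sketch below uses only base R inspection functions; the attribute names are not documented here, so it lists them rather than relying on any particular one.

```r
library(quallmer)  # package name inferred from the printed output on this page

human_data <- data.frame(
  .id = 1:3,
  sentiment = c("pos", "neg", "pos")
)
coded <- as_qlm_coded(human_data, name = "Coder_A")

# Class checks: the object should carry the qlm_coded class on top of
# the usual tibble/data.frame classes
inherits(coded, "qlm_coded")
inherits(coded, "data.frame")

# List attribute names for inspection; treat them as internal details
names(attributes(coded))

# Ordinary data frame operations apply to the coded columns
nrow(coded)
table(coded$sentiment)
```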

Details

When printed, objects created with as_qlm_coded() display "Source: Human coder" instead of model information, clearly distinguishing human from LLM coding.

Gold Standards

Objects marked with is_gold = TRUE are automatically detected by qlm_validate(), allowing simpler syntax:

# With is_gold = TRUE
gold <- as_qlm_coded(gold_data, name = "Expert", is_gold = TRUE)
qlm_validate(coded1, coded2, gold, by = "sentiment")  # gold = not needed!

# Without is_gold = TRUE, pass the gold standard explicitly via gold =
gold <- as_qlm_coded(gold_data, name = "Expert")
qlm_validate(coded1, coded2, gold = gold, by = "sentiment")

Examples

# Basic usage with data frame (default .id column)
human_data <- data.frame(
  .id = 1:10,
  sentiment = sample(c("pos", "neg"), 10, replace = TRUE)
)

coder_a <- as_qlm_coded(human_data, name = "Coder_A")
coder_a
#> # quallmer coded object
#> # Run:      Coder_A
#> # Source:   Human coder
#> # Units:    10
#> 
#> # A tibble: 10 × 2
#>      .id sentiment
#>  * <int> <chr>    
#>  1     1 pos      
#>  2     2 pos      
#>  3     3 pos      
#>  4     4 neg      
#>  5     5 neg      
#>  6     6 pos      
#>  7     7 pos      
#>  8     8 pos      
#>  9     9 pos      
#> 10    10 neg      

# Use custom id column with NSE (unquoted)
data_with_custom_id <- data.frame(
  doc_id = 1:10,
  sentiment = sample(c("pos", "neg"), 10, replace = TRUE)
)
coder_custom <- as_qlm_coded(data_with_custom_id, id = doc_id, name = "Coder_C")

# Or use quoted string
coder_custom2 <- as_qlm_coded(data_with_custom_id, id = "doc_id", name = "Coder_D")

# Create a gold standard from data frame
gold <- as_qlm_coded(
  human_data,
  name = "Expert",
  is_gold = TRUE
)

# Validate with automatic gold detection
coder_b_data <- data.frame(
  .id = 1:10,
  sentiment = sample(c("pos", "neg"), 10, replace = TRUE)
)
coder_b <- as_qlm_coded(coder_b_data, name = "Coder_B")

# gold is marked with is_gold = TRUE, so gold = is optional here; it is passed explicitly for clarity (NSE works for 'by' too)
qlm_validate(coder_a, coder_b, gold = gold, by = sentiment, level = "nominal")
#> 
#> ── quallmer validation ──
#> 
#> n: 10
#> 
#> 
#> ── sentiment (nominal) 
#> By class:
#> <macro>:
#> accuracy: 1.0000
#> precision: 1.0000
#> recall: 1.0000
#> F1: 1.0000
#> Cohen's kappa: 1.0000
#> accuracy: 0.3000
#> precision: 0.3333
#> recall: 0.3095
#> F1: 0.2929
#> Cohen's kappa: -0.2963
#> 

# Create from corpus object (simplified workflow)
data("data_corpus_manifsentsUK2010sample")
crowd <- as_qlm_coded(
  data_corpus_manifsentsUK2010sample,
  is_gold = TRUE
)
# Document names automatically become .id, all docvars included

# Use a docvar as identifier with NSE (unquoted)
crowd_party <- as_qlm_coded(
  data_corpus_manifsentsUK2010sample,
  id = party,
  is_gold = TRUE
)

# Or use quoted string
crowd_party2 <- as_qlm_coded(
  data_corpus_manifsentsUK2010sample,
  id = "party",
  is_gold = TRUE
)

# With complete metadata
expert <- as_qlm_coded(
  human_data,
  name = "expert_rater",
  is_gold = TRUE,
  codebook = list(
    name = "Sentiment Analysis",
    instructions = "Code overall sentiment as positive or negative"
  ),
  metadata = list(
    coder_name = "Dr. Smith",
    coder_id = "EXP001",
    training = "5 years experience",
    date = "2024-01-15"
  )
)