Segment texts using an LLM

Applies a codebook to input texts to segment them into thematic or conceptual units, returning a quanteda::corpus() where each segment is a document. This is the LLM-powered analogue of quanteda::corpus_segment().

Usage

qlm_segment(x, codebook, model, ..., name = NULL, notes = NULL)

Arguments

x: A character vector of texts or a quanteda::corpus() object. Named character vectors use names as document identifiers; unnamed vectors use sequential labels (text1, text2, ...).
codebook: A codebook object created with qlm_codebook(). The schema should be an ellmer::type_object() whose fields become docvars in the output corpus. Do not include a field named text; it is reserved for the verbatim segment text and is added automatically.
model: Provider (and optionally model) name in the form "provider/model" or "provider" (which will use the default model for that provider). Passed to the name argument of ellmer::chat(). Examples: "openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet-20241022", "ollama/llama3.2", "openai" (uses default OpenAI model).
...: Additional arguments passed to ellmer::chat() or ellmer::parallel_chat_structured(). Arguments recognized by ellmer::parallel_chat_structured() are routed there; all other arguments (including provider-specific arguments like base_url, credentials, or api_args for OpenAI-compatible endpoints) are passed to ellmer::chat().
name: Optional character string identifying this segmentation run. Default is NULL.
notes: Optional character string with descriptive notes about this segmentation run. Default is NULL.

Value

A quanteda::corpus() where each segment is a document. Document names follow the {source}.{i} convention of quanteda::corpus_segment(). Docvars include:

docid: Name of the source document.
segid: Integer segment index within the source document.
...: Any fields defined in the codebook schema.
...: Original docvars inherited from the input (if x is a corpus).
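
The {source}.{i} naming convention can be seen with quanteda::corpus_segment() itself, without calling an LLM. The sketch below is only an illustration of that convention: the "##" tags and the sample text are made up for the example, and qlm_segment() derives its segments from the codebook rather than from a pattern.

```r
library(quanteda)

# Illustrative input: one named document with hypothetical "##" section tags.
corp <- corpus(
  c(hotel_a = "##ROOM The room was clean. ##LOCATION The location was great.")
)

# Pattern-based segmentation: each tag starts a new segment.
segs <- corpus_segment(corp, pattern = "##*")

# Segment names combine the source document name and a segment index.
docnames(segs)
```

Per the description above, the corpus returned by qlm_segment() is expected to use the same naming scheme for its documents.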

Details

The codebook schema defines additional document-level variables (docvars) for each segment. A text field (the verbatim segment text) is always added automatically and must not appear in the schema. Measurement levels defined in the codebook are not applicable to segmentation and are silently ignored.
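
As a minimal sketch of a schema that respects the reserved text field, the fields can be collected in a named list and checked before the ellmer::type_object() is built. The names fields and seg_schema are illustrative only, and the check is an optional convenience, not something the package is known to require.

```r
library(ellmer)

# Illustrative schema fields; note that none of them is named "text".
fields <- list(
  aspect = type_string(description = "Short aspect label"),
  sentiment = type_enum(
    values = c("negative", "neutral", "positive"),
    description = "Sentiment toward this aspect"
  )
)

# Fail early if the reserved field name was used by mistake.
stopifnot(!"text" %in% names(fields))

# Build the schema object to pass as the schema argument of qlm_codebook().
seg_schema <- do.call(type_object, fields)
```

Each field in the list becomes one docvar per segment, alongside docid and segid.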

Examples

if (FALSE) { # \dontrun{
# Aspect-based segmentation of a hotel review (character vector input;
# the result is a corpus with one document per segment).
review <- paste(
  "The room was clean and tidy, despite being rather basic in its furnishings.",
  "The location of the hotel was really great, however.",
  "We loved the proximity to both public transport and to the city's main attractions."
)

cb_absa <- qlm_codebook(
  name = "Aspect-based segmentation",
  instructions = paste(
    "Segment the text according to the distinct aspects (topics or features).",
    "Each segment will continue as long as it is part of the same aspect.",
    "An aspect-based segment may be more than one sentence or may be just a",
    "part of a sentence.",
    "",
    "Aspects in hotel reviews include: cleanliness, features, location, service,",
    "and value. Return each aspect segment with its verbatim text and a short",
    "aspect label."
  ),
  schema = ellmer::type_object(
    aspect    = ellmer::type_string("Short aspect label"),
    sentiment = ellmer::type_enum(c("negative", "neutral", "positive"),
                                  "Sentiment toward this aspect")
  )
)

segs <- qlm_segment(review, cb_absa, model = "anthropic")
quanteda::docvars(segs)
#   docid segid      aspect sentiment
# 1 text1     1 cleanliness  positive
# 2 text1     2    features  negative
# 3 text1     3    location  positive

# Corpus input preserves existing docvars
reviews_corp <- quanteda::corpus(
  c(hotel_a = review),
  docvars = data.frame(city = "London", stars = 4L)
)
segs_corp <- qlm_segment(reviews_corp, cb_absa, model = "anthropic")
quanteda::docvars(segs_corp)
} # }
