
Segment texts using an LLM
qlm_segment.Rd
Applies a codebook to input texts to segment them into thematic or conceptual
units, returning a quanteda::corpus() where each segment is a document.
This is the LLM-powered analogue of quanteda::corpus_segment().
Arguments
- x
A character vector of texts or a quanteda::corpus() object. Named character vectors use names as document identifiers; unnamed vectors use sequential labels (text1, text2, ...).
- codebook
A codebook object created with qlm_codebook(). The schema should be an ellmer::type_object() whose fields become docvars in the output corpus. Do not include a field named text; it is reserved for the verbatim segment text and is added automatically.
- model
Provider (and optionally model) name in the form "provider/model" or "provider" (which will use the default model for that provider). Passed to the name argument of ellmer::chat(). Examples: "openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet-20241022", "ollama/llama3.2", "openai" (uses default OpenAI model).
- ...
Additional arguments passed to ellmer::chat() or ellmer::parallel_chat_structured(). Arguments recognized by ellmer::parallel_chat_structured() are routed there; all other arguments (including provider-specific arguments like base_url, credentials, or api_args for OpenAI-compatible endpoints) are passed to ellmer::chat().
- name
Character string identifying this coding run. Default is NULL.
- notes
Optional character string with descriptive notes about this segmentation run. Default is NULL.
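The routing of ... described above can be pictured with a minimal base R sketch. It is not the package's implementation: chat_stub(), parallel_stub(), and route_dots() are hypothetical names introduced here, the stub signatures are simplified stand-ins for ellmer::chat() and ellmer::parallel_chat_structured(), and the base_url value is a placeholder.

```r
# Hypothetical stand-ins for the two downstream functions. The real targets
# are ellmer::chat() and ellmer::parallel_chat_structured(); these stubs only
# mimic the idea that each function declares its own set of arguments.
chat_stub <- function(name, ...) list(name = name, extra = list(...))
parallel_stub <- function(chat, prompts, type, max_active = 10, rpm = 500) {
  list(max_active = max_active, rpm = rpm)
}

# Split a named argument list by whether `target` declares each argument.
route_dots <- function(dots, target) {
  known <- names(dots) %in% setdiff(names(formals(target)), "...")
  list(matched = dots[known], rest = dots[!known])
}

# Placeholder arguments a caller might supply through `...`.
dots <- list(max_active = 4, base_url = "http://localhost:11434/v1")
routed <- route_dots(dots, parallel_stub)

names(routed$matched)  # arguments the parallel stub declares
names(routed$rest)     # everything else, destined for the chat stub
```

Matching on declared argument names means a provider-specific argument such as base_url never has to be listed anywhere: anything the parallel function does not declare falls through to the chat constructor.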
Value
A quanteda::corpus() where each segment is a document. Document
names follow the {source}.{i} convention of quanteda::corpus_segment().
Docvars include:
- docid
Name of the source document.
- segid
Integer segment index within the source document.
- ...
Any fields defined in the codebook schema.
- ...
Original docvars inherited from the input (if x is a corpus).
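The naming scheme can be illustrated in base R. This is only a sketch of the {source}.{i} convention and the docid/segid columns as described on this page, not the package's code; source_names() and segment_index() are hypothetical helpers, and the segment counts are made up for illustration.

```r
# Resolve source document names: keep existing names, otherwise fall back to
# sequential labels text1, text2, ...
source_names <- function(x) {
  if (is.null(names(x))) paste0("text", seq_along(x)) else names(x)
}

# Build {source}.{i} segment names plus the docid and segid columns, given
# the number of segments found in each source document.
segment_index <- function(sources, n_segments) {
  docid <- rep(sources, times = n_segments)
  segid <- sequence(n_segments)  # restarts at 1 for every source document
  data.frame(
    docname = paste0(docid, ".", segid),
    docid = docid,
    segid = segid,
    stringsAsFactors = FALSE
  )
}

x <- c("first text", "second text")  # unnamed input
idx <- segment_index(source_names(x), n_segments = c(3L, 2L))
idx$docname
# "text1.1" "text1.2" "text1.3" "text2.1" "text2.2"
```

With a named input such as c(hotel_a = "..."), the same helpers would yield hotel_a.1, hotel_a.2, and so on.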
Details
The codebook schema defines additional document-level variables (docvars)
for each segment. A text field (the verbatim segment text) is always added
automatically and must not appear in the schema. Measurement levels defined
in the codebook are not applicable to segmentation and are silently ignored.
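Because text is reserved, it can be worth checking field names before building a schema. The sketch below is a hypothetical pre-flight check in base R: check_schema_fields() is not part of the package, and it takes field names as a plain character vector instead of inspecting a real ellmer type object. This page does not say how the package itself reacts to a schema that contains text.

```r
# Hypothetical helper: stop if any planned schema field uses a reserved name.
check_schema_fields <- function(fields, reserved = "text") {
  clash <- intersect(fields, reserved)
  if (length(clash) > 0) {
    stop("Reserved field name(s) in schema: ", paste(clash, collapse = ", "))
  }
  invisible(fields)
}

check_schema_fields(c("aspect", "sentiment"))           # passes silently
res <- try(check_schema_fields(c("text", "aspect")), silent = TRUE)
inherits(res, "try-error")                              # the clash is caught
```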
See also
qlm_code() for document-level coding, qlm_codebook() for
creating codebooks, quanteda::corpus_segment() for pattern-based
segmentation.
Examples
if (FALSE) { # \dontrun{
# Aspect-based segmentation of a hotel review (character vector input;
# the result is a quanteda corpus with one document per segment).
review <- paste(
  "The room was clean and tidy, despite being rather basic in its furnishings.",
  "The location of the hotel was really great, however.",
  "We loved the proximity to both public transport and to the city's main attractions."
)
# Schema constructors are called with an explicit ellmer:: prefix so the
# example does not depend on ellmer being attached.
cb_absa <- qlm_codebook(
  name = "Aspect-based segmentation",
  instructions = paste(
    "Segment the text according to the distinct aspects (topics or features).",
    "Each segment will continue as long as it is part of the same aspect.",
    "An aspect-based segment may be more than one sentence or may be just a",
    "part of a sentence.",
    "",
    "Aspects in hotel reviews include: cleanliness, features, location, service,",
    "and value. Return each aspect segment with its verbatim text and a short",
    "aspect label."
  ),
  schema = ellmer::type_object(
    aspect = ellmer::type_string("Short aspect label"),
    sentiment = ellmer::type_enum(c("negative", "neutral", "positive"),
                                  "Sentiment toward this aspect")
  )
)
segs <- qlm_segment(review, cb_absa, model = "anthropic")
quanteda::docvars(segs)
#   docid segid      aspect sentiment
# 1 text1     1 cleanliness  positive
# 2 text1     2    features  negative
# 3 text1     3    location  positive
# Corpus input preserves existing docvars
reviews_corp <- quanteda::corpus(
  c(hotel_a = review),
  docvars = data.frame(city = "London", stars = 4L)
)
segs_corp <- qlm_segment(reviews_corp, cb_absa, model = "anthropic")
quanteda::docvars(segs_corp)
} # }