
Sample from Large Movie Review Dataset (Maas et al. 2011)
data_corpus_LMRDsample.Rd
A sample of 100 positive and 100 negative reviews from the Maas et al. (2011) dataset for sentiment classification. The original dataset contains 50,000 highly polar movie reviews.
Format
The corpus docvars consist of:
- docnumber
serial (within set and polarity) document number
- rating
user-assigned movie rating on a 1-10 point integer scale
- polarity
either neg or pos, to indicate whether the movie review was negative or positive. See Maas et al. (2011) for the cut-off values that governed this assignment.
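The docvars listed above can be inspected directly. The following is a minimal sketch, assuming the corpus object is already available in the session (for example, after attaching the package that ships it) and that quanteda is installed. It uses only quanteda's `docvars()` accessor and base R's `table()`; the variable name `dv` is illustrative.

```r
# Assumption: data_corpus_LMRDsample is already available in the session.
if (requireNamespace("quanteda", quietly = TRUE) &&
    exists("data_corpus_LMRDsample")) {
  # Extract the document variables as a data frame
  dv <- quanteda::docvars(data_corpus_LMRDsample)

  # Count reviews per polarity label
  print(table(dv$polarity))

  # Cross-tabulate rating against polarity to see how the labels split the scale
  print(table(rating = dv$rating, polarity = dv$polarity))
}
```

The calls are wrapped in `print()` because values computed inside a braced block are not auto-printed unless they are the last expression.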
References
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). "Learning Word Vectors for Sentiment Analysis". The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
See also
data_codebook_sentiment for an example codebook and usage with this corpus
Examples
if (requireNamespace("quanteda", quietly = TRUE)) {
  # Inspect the corpus
  summary(data_corpus_LMRDsample)
  # Show the first few reviews
  head(data_corpus_LMRDsample, 3)
}
#> Corpus consisting of 3 documents and 3 docvars.
#> 1035_3.txt :
#> "A frustrating documentary. Louis Kahn's son, who saw his fat..."
#>
#> 3540_3.txt :
#> "I truly was disappointed by this film which I had high hopes..."
#>
#> 4526_4.txt :
#> "Rather foolish attempt at a Hitchcock-type mystery-thriller,..."
#>