IMDB Movie Reviews

Sentiment Analysis, Binary Classification

A dataset of 50,000 highly polarized movie reviews for sentiment analysis. Each review is labeled as positive or negative.

Total Samples: 50,000
Train/Test Split: 25,000 / 25,000
Classes: Positive, Negative
Sample Data:
Review Text | Sentiment
This movie was absolutely brilliant. The storyline, acting, and cinematography were all top-notch... | Positive
One of the worst films I've ever seen. The plot made no sense and the acting was terrible... | Negative
A masterpiece! Every scene was perfectly crafted and the emotional depth was incredible... | Positive

SST-2 (Stanford Sentiment Treebank)

Sentiment Analysis, Sentence-Level

Movie review sentences labeled with binary sentiment (the two-class version of the Stanford Sentiment Treebank). Part of the GLUE benchmark for evaluating language understanding.

Total Samples: 70,042
Train/Dev/Test: 67,349 / 872 / 1,821
Classes: Positive, Negative
Sample Data:
Sentence | Label
A stirring, funny and finally transporting re-imagining of beauty and the beast. | Positive
Unflinchingly bleak and desperate. | Negative
Offers that rare combination of entertainment and education. | Positive

SQuAD (Stanford Question Answering)

Question Answering, Reading Comprehension

A reading comprehension dataset with questions posed on Wikipedia articles. Answers are spans of text from the passage; SQuAD 2.0 also adds questions that cannot be answered from the passage.

Total Questions: 100,000+
Articles: 500+
Version: SQuAD 1.1 & 2.0
Sample Data:
Question | Context Snippet | Answer
When did Beyonce start becoming popular? | ...Beyoncé Giselle Knowles-Carter rose to fame in the late 1990s with Destiny's Child... | late 1990s
What is the capital of France? | ...Paris is the capital and most populous city of France... | Paris
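
Because each answer is a span of its passage, an answer can be stored as the answer text plus a character offset into the context. The following is a minimal Python sketch of that idea using the second sample row above. The function name find_answer_span is hypothetical and is not part of any SQuAD tooling; the official data files already record the start offset for each answer, so searching for it here is only for illustration.

```python
from typing import Optional, Tuple


def find_answer_span(context: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first occurrence of
    `answer` in `context`, or None if the answer text does not occur.

    Returning None is one way this sketch could represent a question
    that has no answer in the passage.
    """
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


context = "Paris is the capital and most populous city of France"
span = find_answer_span(context, "Paris")
if span is not None:
    start, end = span
    # Slicing the context with the offsets recovers the answer text.
    print(context[start:end])
```

Note that str.find returns only the first match, so a real pipeline would rely on the stored offset when the answer text appears more than once in the passage.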

CoNLL-2003 (Named Entity Recognition)

NER, Token Classification

Annotated corpus for Named Entity Recognition with entities like persons, organizations, locations, and miscellaneous.

Total Tokens: ~300,000
Entity Types: PER, ORG, LOC, MISC
Language: English
Sample Data:
Token | Entity Label
Barack | B-PER
Obama | I-PER
visited | O
Google | B-ORG
in | O
California | B-LOC
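
The labels in the sample follow a B-/I-/O pattern: a B- prefix starts an entity, an I- prefix continues it, and O marks a token outside any entity. A minimal Python sketch of grouping such token labels into whole entities is shown below. The function name bio_to_entities is hypothetical, and dropping an I- tag that has no matching open entity is a simplifying assumption of this sketch.

```python
from typing import List, Tuple


def bio_to_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I-/O tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = ""

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new entity; flush any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = label[2:]
        elif label.startswith("I-") and current_tokens and label[2:] == current_type:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = ""

    # Flush an entity that runs to the end of the sequence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Barack", "Obama", "visited", "Google", "in", "California"]
labels = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
print(bio_to_entities(tokens, labels))
# → [('Barack Obama', 'PER'), ('Google', 'ORG'), ('California', 'LOC')]
```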

AG News

Text Classification, News Categorization

News articles categorized into four classes: World, Sports, Business, and Sci/Tech.

Total Samples: 127,600
Train/Test: 120,000 / 7,600
Categories: 4 (World, Sports, Business, Sci/Tech)
Sample Data:
Title | Description | Category
Wall St. Bears Claw Back | After the worst week for stocks... | Business
AI Breakthrough | Scientists develop new machine learning algorithm... | Sci/Tech
Champions League Final | The most anticipated match of the season... | Sports
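
Classifiers usually work with integer class ids instead of category names. The Python sketch below shows one possible encoding of the four categories, applied to the sample rows above. The 0-based ids and their ordering are assumptions made for illustration; distributions of the dataset may number the classes differently, so check the copy you download.

```python
CATEGORIES = ["World", "Sports", "Business", "Sci/Tech"]

# Assumed 0-based ids, assigned in the order the categories are listed above.
CATEGORY_TO_ID = {name: idx for idx, name in enumerate(CATEGORIES)}
ID_TO_CATEGORY = {idx: name for name, idx in CATEGORY_TO_ID.items()}

samples = [
    ("Wall St. Bears Claw Back", "Business"),
    ("AI Breakthrough", "Sci/Tech"),
    ("Champions League Final", "Sports"),
]

# Replace each category name with its integer id.
encoded = [(title, CATEGORY_TO_ID[category]) for title, category in samples]
print(encoded)

# Decoding a predicted id back to a readable category name.
print(ID_TO_CATEGORY[2])
```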

Multi30k

Machine Translation, Multilingual

Image descriptions in English, German, French, and Czech. Used for machine translation and multimodal tasks.

Total Samples: 31,014
Languages: English, German, French, Czech
Task: Image Captioning & Translation
Sample Data:
English | German
A man in an orange hat starring at something. | Ein Mann mit einem orangen Hut starrt auf etwas.
A Boston Terrier is running on lush green grass. | Ein Boston Terrier läuft über saftig-grünes Gras.
