Open datasets

Open, citable, and reproducible.

All lab datasets are published with permanent DOIs on Harvard Dataverse, with companion models on HuggingFace. Every dataset ships with a data card, loading scripts, and documented annotation methodology.

Roman Urdu Sentiment Corpus (RUDaSA)

Harvard Dataverse · HuggingFace

Sentiment Analysis · Roman Urdu · Low-Resource NLP · Harvard Dataverse

A large-scale Roman Urdu sentiment dataset built with a privacy-preserving embedding pipeline. It includes competitive benchmarks of state-of-the-art Transformer models and addresses a critical gap in low-resource South Asian NLP.

Task: Sentiment Analysis
Language: Roman Urdu
Models tested: XLM-RoBERTa, mBERT, and others
Privacy: Privacy-preserving embedding pipeline
DOI: 10.21203/rs.3.rs-9827763/v1

Harvard Dataverse · HuggingFace · Preprint: 10.21203/rs.3.rs-9827763/v1

RUEmoCorp

Harvard Dataverse · HuggingFace

Emotion Classification · Roman Urdu · Multi-institute · Harvard Dataverse

The first large-scale Roman Urdu emotion corpus, annotated across multiple institutes with substantial inter-annotator agreement (Fleiss κ = 0.658). It contains 134K labeled samples across multiple emotion categories and is fully open-source.

Task: Emotion Classification
Language: Roman Urdu
Size: 134K labeled samples
Agreement: Fleiss κ = 0.658 (substantial)
Annotation: Multi-institute
DOI: 10.21203/rs.3.rs-9759243/v1

Harvard Dataverse · HuggingFace · Preprint: 10.21203/rs.3.rs-9759243/v1
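
For readers unfamiliar with the agreement statistic quoted for RUEmoCorp, the sketch below shows how Fleiss' κ is computed from a table of annotation counts. It follows the standard textbook definition, κ = (P̄ − Pₑ) / (1 − Pₑ), and is not taken from the lab's own annotation tooling, which this page does not describe. The function name `fleiss_kappa`, the input layout, and the toy counts are illustrative assumptions. The toy data is not drawn from RUEmoCorp.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Assumes every item was labeled by the same number of annotators.
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("counts must contain at least one item")

    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("each item needs at least two annotators")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of annotators")

    n_categories = len(counts[0])

    # Observed agreement: for each item, the share of annotator pairs that agree.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


# Toy example (hypothetical data): 4 items, 3 annotators, 3 emotion categories.
toy_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [1, 2, 0],
    [0, 1, 2],
]
print(f"Fleiss kappa on toy data: {fleiss_kappa(toy_counts):.3f}")
```

On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are read as substantial agreement, which is the band the reported 0.658 falls into.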