Data

This page lists the corpora used by the Webis research group. Their availability for external use is as follows: (1) corpora officially released by our group can be downloaded here; (2) corpora of the PAN workshop series; (3) internal Webis corpora, which will be officially released in the future, are supplied upon request; (4) affiliated corpora, made available by courtesy of our research partners, can be downloaded here; (5) other corpora must be obtained from the original publisher/creator. A note for corpus developers: if you would like your corpus listed here, send us an email.

Released Webis Corpora
Name Publisher/Creator Year Size (bytes) Size (units) Default Task
ArguAna TripAdvisor Webis Group & FG Engels 2014 - 2K reviews Sentiment Analysis
LFA-11 Webis Group & FG Engels 2011 5 MB - Genre and Sentiment Analysis
WDVC-15 FG Engels & Webis Group 2015 5 GB 24M revisions Vandalism Detection
Webis-Ambient-15 Webis Group 2015 114 MB 6K documents Clustering/Cluster Labeling
Webis-ArgRank-17 Webis Group 2017 13 MB 18K arguments Computational Argumentation
Webis-CBC-16 Webis Group 2016 255 MB 3K tweets Clickbait Detection
Webis-CBC-17 Webis Group 2017 - 20K tweets Clickbait Detection
Webis-CLS-10 Webis Group 2010 530 MB 800K documents Cross-Language Text Classification
Webis-CPC-11 Webis Group 2011 19 MB 8K paraphrases Plagiarism Detection
Webis-Editorials-16 Webis Group 2016 5 MB 300 documents Computational Argumentation
Webis-Query-Log-12 Webis Group 2012 - 150 search logs Exploratory Search
Webis-TRC-12 Webis Group 2012 120 MB 150 interaction logs Text Reuse Detection, Paraphrasing, and Exploratory Search
Genre-KI-04 Webis Group 2004 11 MB 1K documents Web Genre Analysis
Webis-KIQC-13 Webis Group 2013 1 MB 3K questions Known-Item Search
Webis-Mnemonics-17 Webis Group 2017 2 MB 1K mnemonics Password Analysis
Webis-ODP-10 Webis Group 2010 113 MB 5M documents Clustering/Cluster Labeling
Webis-PRA-12 Webis Group 2012 884 KB 14K company names Spelling Error Detection
Webis-PC-08 Webis Group 2008 298 MB - Plagiarism Detection
Webis-QSeC-10 Webis Group 2010 2 MB - Query Segmentation
Webis-QSpell-17 Webis Group 2017 1 MB - Query Spelling Correction
Webis-Sentences-17 Webis Group 2017 200 GB 3B sentences Text Statistics
Webis-SMC-12 Webis Group 2012 123 KB - Search Mission Detection
Webis-Revenue-10 FG Engels & Webis Group 2010 6 MB 1K documents Entity and Relation Extraction
Webis-SDMbridge-12 Webis Group 2012 58 MB 15K models Simulation Data Mining
Webis-TLDR-17 Webis Group 2017 - 4M content-summary pairs Abstractive Summarization
Webis-WVC-07 Webis Group 2007 12 KB 1K documents Vandalism Detection
Webis-Tripad-13-Sentiment Webis Group 2013 3 MB 2K reviews Sentiment Analysis
Webis-Tripad-14 Webis Group 2014 61 MB 266K reviews Sentiment Analysis and Author Profiling
Webis-Debate-16 Webis Group 2016 908 KB 27K text segments Computational Argumentation
Webis-Web-Archive-17 Webis Group 2017 94 GB 1M documents Web Analysis

PAN Corpora
Name Publisher/Creator Year Size (bytes) Size (units) Default Task
PAN-PC-09 Webis Group 2009 2 GB 41K documents Plagiarism Detection
PAN-PC-10 Webis Group 2010 2 GB 27K documents Plagiarism Detection
PAN-PC-11 Webis Group 2011 2 GB 27K documents Plagiarism Detection
PAN-WQF-12 Webis Group 2012 4 GB 2M documents Quality Flaw Prediction in Wikipedia
PAN-WVC-10 Webis Group 2010 439 MB 32K documents Vandalism Detection
PAN-WVC-11 Webis Group 2011 371 MB 24K documents Vandalism Detection
PAN-AIC-18 PAN 2018 4 MB 2K problems Author Identification
PAN-SCDC-18 PAN 2018 7 MB 3K problems Author Identification
PAN-APC-18 PAN 2018 7 GB 8K problems Author Profiling
PAN-AOC-16 PAN 2016 2 MB 205 problems Author Obfuscation

Internal Webis Corpora
Name Publisher/Creator Year Size (bytes) Size (units) Default Task
Arxiv Webis Group - 674 MB 550 documents -
Bauphysik Webis Group 2010 70 MB - Vertical Search
ODP Cluster Labeling Webis Group 2010 - 6K documents Cluster Labeling
Converter Testfiles Webis Group - 2 GB - -
Wikipedia Editwars Webis Group 2008 919 MB - Editwar Detection
Genre Corpus (2008) Webis Group 2008 26 MB 2K documents Web Genre Analysis
German Newsgroups Webis Group - 54 MB 27K documents Cluster Analysis
Google News Crawl Webis Group - 404 MB 35K documents -
Gutenberg Wordcount Webis Group - 4 MB - -
Netspeak Dictionary Webis Group - 3 GB - -
Slashdot Webis Group - 3 GB - -
TLDP Crawl Webis Group - 366 MB 15K documents -
Twitter Movie Sentiments Webis Group 2010 1 GB - Sentiment Analysis
Webdiversity Webis Group - 225 MB - -
Youtube Comments Webis Group - 2 GB 324K documents -

Affiliated Corpora
Name Publisher/Creator Year Size (bytes) Size (units) Default Task
Dagstuhl-15512 ArgQuality Corpus Dagstuhl-15512 Quality breakout group 2017 1 MB 304 arguments Computational Argumentation
Burrows Authorship Corpora Steven Burrows, RMIT University 2010 8 MB - Source Code Authorship Attribution
Paderborn Genre Analysis Corpus 2012 Baumann, Lettmann, Stein 2012 20 MB - Web Genre Analysis
Scientific Author's Writing Style Corpus 2017 Rexha, Kröll, Ziak, Kern 2017 - 66 cases Authorship Attribution

Other Corpora
Name Publisher/Creator Year Size (bytes) Size (units) Default Task
20 Newsgroups Carnegie Mellon University 1999 18 MB 20K documents Text Classification, Text Clustering
7Sectors-WebKB CMU World Wide Knowledge Base 2001 6 MB 5K documents Text Classification, Text Clustering
A Corpus of Plagiarised Short Answers University of Sheffield 2009 80 KB 100 documents Plagiarism Detection
Annotated Customer Reviews Simon Fraser University Burnaby 2004 870 KB - Sentiment Analysis
AOL Query Log AOL 2006 2 GB 112M queries Query Log Analysis
Argument Annotated Essays, v1 TU Darmstadt 2014 7 MB 90 essays Computational Argumentation
Argument Annotated Essays, v2 TU Darmstadt 2016 6 MB 402 essays Computational Argumentation
Araucaria Argumentation Corpus University of Dundee 2014 9 MB 664 examples Computational Argumentation
Arguing Subjectivity Corpus University of Pittsburgh 2012 732 KB 84 documents Computational Argumentation
Bergsma-Wang-Corpus 2007 S. Bergsma and Q. I. Wang 2007 2 MB 2K queries Web Search Analysis
BLOGS06 test collection University of Glasgow 2006 - 4M documents Link Analysis
BNC Writing Errors J. Wagner et al. 2007 274 MB - Writing Error Detection
British National Corpus (XML) BNC Consortium 2007 5 GB 4K texts Text Analysis (English)
Brown Corpus Brown University 2011 22 MB 500 documents Text Analysis (English)
Change My View Modes Columbia University 2017 - 78 discussion threads Computational Argumentation
CEEAUS 2010 Beta Edition Kobe University 2010 - 2K documents Cross-Language Analysis
CLEANEVAL 2007 University of Trento and University of Leeds 2007 15 MB 1K documents Main Content Extraction
CLEF-IP 2009 Information Retrieval Facility Society (IRF) 2009 14 GB 2M documents Patent Retrieval
CLEF-IP 2010 Information Retrieval Facility Society (IRF) 2010 9 GB 3M documents Patent Retrieval
ClueWeb09 Carnegie Mellon University 2009 4 TB 1B web pages Web Mining
ClueWeb12 Carnegie Mellon University 2012 5 TB 733M web pages Web Mining
CoNLL-2003 University of Antwerp 2003 12 MB - Named Entity Recognition
CoPhIR Consiglio Nazionale delle Ricerche (ISTI-CNR) 2003 54 GB 106M images Image Retrieval
DBLP University of Massachusetts Amherst 2006 910 MB - Network Analysis
DBpedia 3.5 DBpedia 2010 8 GB - Data Mining
DMOZ Open Directory Project 2010 11 GB - Clustering, Cluster Labeling, and Data Mining
ECML PKDD Discovery Challenge 2008 ECML 2008 304 MB 17M lines Collaborative Filtering and Spam Detection
ESL 123 Mass Noun Examples Microsoft Corporation 2006 204 KB 123 sentences Cross-Language Analysis
Essay Argument Strength UT Dallas 2015 30 KB 1K scores Essay Scoring
Essay Organization UT Dallas 2010 30 KB 1K scores Essay Scoring
Essay Prompt Adherence UT Dallas 2014 38 KB 830 scores Essay Scoring
Essay Thesis Clarity UT Dallas 2013 6 MB 830 scores Essay Scoring
Finegrained Sentiment Uppsala University 2011 4 MB 294 reviews Sentiment Analysis
European Corpus Initiative Multilingual Corpus I European Corpus Initiative 1994 824 MB 49M words Text Analysis (Multilingual)
Europarl (v1 & v3) University of Edinburgh 2007 3 GB - Machine Translation
Falko Essaykorpus L2 V2 Institut für deutsche Sprache und Linguistik 2005 5 MB 248 documents Interlanguage Analysis
German General Inquirer Dictionary Harvard University 1966 240 KB 182 categories Sentiment Analysis (German Wordlist)
Google Books N-Gram 20090715 Google 2009 898 GB - Data Mining
Google Web 1T 5-gram Version 1 Google 2006 55 GB 5B n-grams Text Analysis (English)
IBM Context-dependent Argumentation, ACL-14 IBM 2014 3 MB 3K argument units Computational Argumentation
IBM Context-dependent Argumentation, EMNLP-15 IBM 2015 8 MB 7K argument units Computational Argumentation
IBM Term-relatedness IBM 2015 132 KB 10K term pairs Text Analysis (English)
ICWSM 2009 Data Challenge ICWSM 2009 37 GB - Network Analysis
imat2009 dataset Yandex 2009 650 MB - Machine-learned Ranking
International Corpus of Learner English v2 Center for English Corpus Linguistics 2009 92 MB 6K documents Language Analysis
IP2Location LITE databases 2016-18 IP2Location 2018 2 GB 3 years IP Geolocation and Proxies
The JRC-Acquis Multilingual Parallel Corpus (3) European Commission's Office for Official Publications (OPOCE) 2009 2 GB - Cross-Language Research
Koppel Authorship Corpus M. Koppel and J. Schler 2004 4 MB - Authorship Verification
Learning To Rank 3 Microsoft 2008 8 GB - Machine-learned Ranking
Lee 50 Documents M. D. Lee et al. 2005 130 KB 50 documents Text Similarity Analysis
METER Corpus Department of Journalism and Department of Computer Science at Sheffield University 2002 10 MB - Text Reuse
MIR Flickr 2008 LIACS Medialab at Leiden University, Netherlands 2008 3 GB 25K documents Image Retrieval
Multi Domain Sentiment Dataset (Processed ACL) Johns Hopkins University 2007 29 MB - Sentiment Analysis
Montclair Electronic Language Database Montclair State University 2001 56 KB 33 documents Cross-Language Analysis
Movielens University of Minnesota 1998-2009 74 MB 11M ratings Collaborative Filtering
Movie Review Data Cornell University 2004-2005 219 MB 12K reviews Sentiment Analysis
Netflix Challenge (Partial) Netflix 2006 2 GB - Collaborative Filtering
New York Times Corpus New York Times 2008 3 GB 2M articles Text Mining
NBC 2016 Russian Troll Tweets NBC 2018 34 MB 267K tweets Propaganda Detection
ODP239 C. Carpineto and G. Romano 2009 5 MB - Subtopic Information Retrieval
OHSUMED Test Collection Oregon Health & Science University 1994 461 MB - Text Clustering
OPUS (Europarl3_0b and EMEA0) Jörg Tiedemann 2009 9 GB 22 languages (286 bitexts) Machine Translation
Reason Identification and Classification Dataset UT Dallas 2014 4 MB - Computational Argumentation
Reuters 21578 (22173) Reuters, David D. Lewis 1996 8 MB 22K articles Text Clustering
Reuters RCV1 Reuters, David D. Lewis 2000 1 GB 365 documents Text Clustering
Reuters RCV1 - CCAT split Reuters, David D. Lewis 2002 2 GB - Machine Learning
Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection National Research Council of Canada 2009 166 MB - Cross-Language Categorization
Request For Comments Collections (to 4501) RFC Editor 2008 55 MB 4K documents Data Mining
Rovereto Twitter N-Gram Corpus University of Trento, Italy 2011 5 GB 75M tweets Social Network Analysis
SILS Learner Corpus of English Waseda University 2007 16 MB - Cross-Language Analysis
SMS Spam Collection v T. A. Almeida and J. M. G. Hidalgo 2011 210 KB 6K messages Spam Identification
TIPSTER Complete Advanced Research Projects Agency 1993 1 MB - Information Retrieval
TREC vol4 National Institute of Standards and Technology (NIST) 1996 436 MB 295K documents Data Mining
TREC vol5 National Institute of Standards and Technology (NIST) 1997 389 MB 260K documents Data Mining
TREC web National Institute of Standards and Technology (NIST) 1999-2004 90 GB - Data Mining
Tswana Learner English Corpus Center for Text Technology 2006 2 MB - Cross-Language Analysis
Twitter tweets Yang and Leskovec 2011 26 GB 467M tweets Social Network Analysis
UKPConvArg1 TU Darmstadt 2016 21 MB 16K argument pairs Computational Argumentation
UKPConvArg2 TU Darmstadt 2016 23 MB 9K argument pairs Computational Argumentation
USPTO Patents from 2001 to 2010 U.S. Patent & Trademark Office 2010 10 TB - Patent Analysis
Uppsala Student English Uppsala University 2001 3 MB 2K documents Cross-Language Analysis
WaCKy: deWaC Web-As-Corpus Kool Yinitiative 2009 26 GB 2B words Text Analysis (German)
WaCKy: frWaC Web-As-Corpus Kool Yinitiative 2009 5 GB 2B words Text Analysis (French)
WaCKy: itWaC Web-As-Corpus Kool Yinitiative 2009 31 GB 2B words Text Analysis (Italian)
WaCKy: sdeWaC Web-As-Corpus Kool Yinitiative 2009 20 GB 1B words Text Analysis (German)
WaCKy: ukWaC Web-As-Corpus Kool Yinitiative 2009 15 GB 2B words Text Analysis (English)
WaCKy: WaCkypedia_EN Web-As-Corpus Kool Yinitiative 2009 6 GB 1B words Text Analysis (English)
Web People Search Corpus (WePS-1) NLP Group (UNED), Proteus Project (NYU) 2007 295 MB 2K web pages Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-2) NLP Group (UNED), Proteus Project (NYU) 2009 328 MB 3K web pages Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-3) NLP Group (UNED), Proteus Project (NYU) 2010 571 MB 50K web pages Person Disambiguation, Text Clustering
Wikipedia Revision Dump Wikimedia Foundation 2006 46 GB - Data Mining
Wikipedia Revision Dump Wikimedia Foundation 2008 133 GB - Data Mining
Wikipedia Full Dump Wikimedia Foundation 2011 5 TB - Data Mining
Wikipedia History Snapshots Wikimedia Foundation 2006-2012 32 GB - Data Mining
Wikipedia Snapshots Wikimedia Foundation 2006-2012 280 GB - Data Mining
Wikipedia Participation Challenge Wikimedia Foundation 2011 976 MB - User Behaviour Prediction
Wordsim353 L. Finkelstein et al. 2002 60 KB 353 word pairs Word Similarities
Wortschatz Leipzig Universität Leipzig 2006 8 GB 15 languages Text Analysis (Multilingual)
Yahoo N-Grams Yahoo 2006 13 GB - Text Analysis (English)
Yahoo Learning To Rank Challenge 2010 Yahoo 2010 421 MB - Document Ranking
TripAdvisor Data Set University of Illinois at Urbana-Champaign 2010 220 MB - Opinion Mining