Natural language processing (NLP) is a major branch of artificial intelligence. NLP datasets tend to be domain-specific, so if you do not yet have a dataset for your task, the following NLP datasets may be useful.
1. Full-Text Corpus Data
Description: Full-text corpus data from six large corpora of English (iWeb, NOW, Wikipedia, COCA, COHA, GloWbE), plus the Spanish-language Corpus del Español
Size:
COCA(440 million words | 190,000 texts)
iWeb(13.8 billion words | 22 million web pages)
NOW(6.04 billion words | 6,000,000 texts)
Wikipedia(1.8 billion words | 4.4 million texts)
Corpus del Español (Web/Dialects)(1.8 billion words | 1,800,000 texts)
GloWbE(1.8 billion words | 1,800,000 texts)
COHA(385 million words | 115,000 texts)
URL: https://www.corpusdata.org/
2. Apache Software Foundation Public Mail Archives
Description: A collection of all publicly available Apache Software Foundation mail archives as of July 11, 2011
URLs:
https://aws.amazon.com/de/datasets/apache-software-foundation-public-mail-archives/
http://mail-archives.apache.org/mod_mbox/
3. WordNet
Description: WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
Size: ~117,000 synsets
URL: https://wordnet.princeton.edu/
4. The Blog Authorship Corpus
Description: The collected posts of 19,320 bloggers, gathered from blogger.com in August 2004
Size: 19,320 bloggers, 681,288 posts, over 140 million words
URL: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
5. Amazon Product Data
Description: Product reviews and metadata from Amazon
Size: 142.8 million reviews
URL: http://jmcauley.ucsd.edu/data/amazon/
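The reviews are distributed as compressed files with one review object per line. The sketch below shows one way to stream such a file without loading it all into memory; the two sample records are fabricated for illustration, and field names like `reviewText` and `overall` are assumptions based on the dataset's documented schema.

```python
import gzip
import json
import os
import tempfile

# Fabricated two-record sample mimicking the one-JSON-object-per-line
# layout; the real files are much larger gzipped archives.
sample = (
    '{"reviewerID": "A1", "asin": "B0001", "reviewText": "Great product", "overall": 5.0}\n'
    '{"reviewerID": "A2", "asin": "B0001", "reviewText": "Broke quickly", "overall": 2.0}\n'
)
path = os.path.join(tempfile.mkdtemp(), "reviews_sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(sample)

def parse_reviews(path):
    """Yield one review dict per line from a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

reviews = list(parse_reviews(path))
avg_rating = sum(r["overall"] for r in reviews) / len(reviews)
print(len(reviews), avg_rating)  # 2 3.5
```

Streaming line by line like this keeps memory use constant, which matters at the 142.8-million-review scale.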
6. Web Data: Amazon Reviews
Description: A collection of Amazon product reviews hosted by the Stanford SNAP project
Size: ~35 million reviews
URL: https://snap.stanford.edu/data/web-Amazon.html
7. Common Crawl
Description: An open corpus of web crawl data composed of over 5 billion web pages
Size: over 5 billion web pages
URL: https://registry.opendata.aws/commoncrawl/
8. Yelp Open Dataset
Description: A subset of Yelp's businesses, reviews, and user data, released for personal, educational, and academic use
Size: 5,996,996 reviews, 188,593 businesses, 280,992 pictures, 10 metropolitan areas
URL: https://www.yelp.com/datasets
9. WMT18 Machine Translation Data
Description: Parallel corpora for translation between language pairs, from the WMT18 shared tasks
Size: up to ~30M sentence pairs, depending on the language pair
URL: http://statmt.org/wmt18/index.html
10. Enron Email Data
Description: Enron email data publicly released as part of FERC's Western Energy Markets investigation, converted to industry-standard formats by EDRM
Size: 1,227,255 emails with 493,384 attachments, covering 151 custodians
URL: http://aws.amazon.com/de/datasets/enron-email-data/
11. Federal Contracts
Description: A data dump of all federal contracts from the Federal Procurement Data Center, as found at USASpending.gov
URL: https://aws.amazon.com/de/datasets/federal-contracts-from-the-federal-procurement-data-center-usaspending-gov/
12. Sentiment140
Description: Sentiment140 lets you discover the sentiment of a brand, product, or topic on Twitter; its training tweets were labeled automatically using the emoticons they contain
Size: 1.6 million labeled tweets
URL: http://help.sentiment140.com/home
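The training data ships as a CSV whose six columns are polarity (0 = negative, 2 = neutral, 4 = positive), tweet id, date, query, user, and text. A minimal parsing sketch, with two fabricated rows standing in for the real file:

```python
import csv
import io

# Two fabricated rows in the six-column Sentiment140 CSV layout:
# polarity, tweet id, date, query, user, text.
raw = (
    '"0","1001","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","userA","this is awful"\n'
    '"4","1002","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","userB","love this so much"\n'
)

LABELS = {"0": "negative", "2": "neutral", "4": "positive"}

tweets = []
for polarity, _id, _date, _query, _user, text in csv.reader(io.StringIO(raw)):
    tweets.append((LABELS[polarity], text))

print(tweets)  # [('negative', 'this is awful'), ('positive', 'love this so much')]
```

For the real file, replace `io.StringIO(raw)` with an open file handle (the distribution uses latin-1 encoding in some copies, so pass an explicit `encoding`).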
13. Freebase Data Dump
Description: A data dump of all the current facts and assertions in the Freebase system
Size: millions of topics in hundreds of categories
URL: https://aws.amazon.com/de/datasets/freebase-data-dump/
14. IMDB Movie Review Dataset
Description: A dataset for binary sentiment classification containing substantially more data than previous benchmark datasets
Size: 25,000 reviews for training, 25,000 for testing, plus additional unlabeled data
URL: http://ai.stanford.edu/~amaas/data/sentiment/
Paper: http://www.aclweb.org/anthology/P11-1015
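The archive unpacks into `train/pos`, `train/neg`, `test/pos`, and `test/neg` directories with one review per `.txt` file. The sketch below builds a tiny mock tree (the review texts are fabricated) just to demonstrate the loading logic:

```python
import pathlib
import tempfile

# Mock the aclImdb directory layout: <split>/<pos|neg>/<id>_<rating>.txt
root = pathlib.Path(tempfile.mkdtemp()) / "aclImdb"
for split, label, text in [
    ("train", "pos", "A wonderful film."),
    ("train", "neg", "A tedious mess."),
]:
    d = root / split / label
    d.mkdir(parents=True, exist_ok=True)
    (d / "0_10.txt").write_text(text, encoding="utf-8")

def load_split(base, split):
    """Return (text, label) pairs, with label 1 for pos and 0 for neg."""
    pairs = []
    for label_name, label in (("pos", 1), ("neg", 0)):
        for path in sorted((base / split / label_name).glob("*.txt")):
            pairs.append((path.read_text(encoding="utf-8"), label))
    return pairs

train = load_split(root, "train")
print(train)  # [('A wonderful film.', 1), ('A tedious mess.', 0)]
```

Point `root` at the real unpacked `aclImdb/` directory to load the full 25,000-review splits the same way.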
15. Web 1T 5-gram Version 1
Description: Contributed by Google Inc., this release contains English word n-grams (unigrams through 5-grams) and their observed frequency counts
Size: counts computed from approximately 1 trillion word tokens of text from publicly accessible web pages
URL: https://catalog.ldc.upenn.edu/LDC2006T13
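Each line of the distribution holds an n-gram and its count separated by a tab (`w1 w2 ... wn<TAB>count`). A small parsing sketch; the sample lines echo the 4-gram examples from the catalog description, but treat them as illustrative rather than authoritative:

```python
# Parse "n-gram<TAB>count" lines into a frequency table.
sample = """serve as the incoming\t92
serve as the index\t223
serve as the indication\t72
serve as the indicator\t120"""

counts = {}
for line in sample.splitlines():
    ngram, count = line.rsplit("\t", 1)  # split only on the final tab
    counts[ngram] = int(count)

total = sum(counts.values())
# Relative frequency of one 4-gram among the sampled lines
p = counts["serve as the index"] / total
print(total, round(p, 3))
```

Splitting on the final tab (rather than whitespace) keeps multi-word n-grams intact. The real corpus is split across many such files, sorted so that shared prefixes are adjacent.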
16. Harvard Library APIs & Datasets
Description: Open bibliographic data and APIs from one of the world's largest academic libraries
Size: over 12.7 million bibliographic records, 4 million image records, 2 million finding-aid components
URL: https://library.harvard.edu/services-tools/harvard-library-apis-datasets#Harvard-Library-Bibliographic-Dataset
17. Reddit Comments, May 2015
Description: A portion of the publicly available Reddit comments, covering May 2015
URL: https://www.kaggle.com/reddit/reddit-comments-may-2015/home
18. Twenty Newsgroups Data Set
Description: 20,000 Usenet messages drawn from 20 different newsgroups
Size: 20,000 messages, 20 newsgroups
URL: https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
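Each message in the collection is a Usenet article: RFC 822-style headers followed by the body, so Python's standard `email` parser handles it. The sample article below is fabricated for illustration:

```python
from email.parser import Parser

# A fabricated Usenet-style article: headers, blank line, then body.
raw = """From: someone@example.com
Newsgroups: comp.graphics
Subject: Question about image formats

Which format is best for lossless images?
"""

msg = Parser().parsestr(raw)
print(msg["Newsgroups"], "|", msg["Subject"])
print(msg.get_payload().strip())
```

In practice, scikit-learn's `sklearn.datasets.fetch_20newsgroups` can download the collection and optionally strip headers, footers, and quoted replies for you.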
19. Crosswikis Data
Description: A large dictionary mapping strings to Wikipedia concepts, built from hyperlink anchor texts in a web crawl
URL: https://nlp.stanford.edu/data/crosswikis-data.tar.bz2
20. Other dataset collections
https://aws.amazon.com/de/datasets
https://www.kaggle.com/datasets
https://nlp.stanford.edu/links/statnlp.html
https://www.figure-eight.com/data-for-everyone
https://github.com/awesomedata/awesome-public-datasets
Read more articles from 深度學習社區 (Deep Learning Community).