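A Roundup of Deep Learning Datasets (NLP Datasets)

Natural language processing (NLP) is an important branch of artificial intelligence. NLP datasets tend to be domain-specific, so if you do not have a suitable one on hand, the datasets listed below may help.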

1. Full-text corpus data

Description: Full-text corpus data from six large corpora of English: iWeb, NOW, Wikipedia, COCA, COHA, and GloWbE. The size list below also includes the Spanish-language Corpus del Español.

Size:

COCA (440 million words | 190,000 texts)

iWeb (13.8 billion words | 22 million web pages)

NOW (6.04 billion words | 6,000,000 texts)

Wikipedia (1.8 billion words | 4.4 million texts)

Corpus del Español (Web/Dialects) (1.8 billion words | 1,800,000 texts)

GloWbE (1.8 billion words | 1,800,000 texts)

COHA (385 million words | 115,000 texts)

URL: https://www.corpusdata.org/
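
For a rough sense of how long the documents in each corpus are, the word and text counts above can simply be divided. The sketch below uses only the figures from the list for the six English corpora; the names `CORPUS_SIZES` and `average_words_per_text` are illustrative and not part of any corpus tooling.

```python
# (words, texts) for the six English corpora, copied from the size list above.
CORPUS_SIZES = {
    "COCA": (440_000_000, 190_000),
    "iWeb": (13_800_000_000, 22_000_000),
    "NOW": (6_040_000_000, 6_000_000),
    "Wikipedia": (1_800_000_000, 4_400_000),
    "GloWbE": (1_800_000_000, 1_800_000),
    "COHA": (385_000_000, 115_000),
}


def average_words_per_text(words: int, texts: int) -> float:
    """Return the mean number of words per text."""
    return words / texts


# Print the corpora from largest to smallest by total word count.
for name, (words, texts) in sorted(
    CORPUS_SIZES.items(), key=lambda item: item[1][0], reverse=True
):
    average = average_words_per_text(words, texts)
    print(f"{name:<10}{average:>10,.0f} words per text")
```

These are averages over the published totals only, so they say nothing about how text lengths are distributed within a corpus.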

2. Apache Software Foundation Public Mail Archives

Description: A collection of all publicly available Apache Software Foundation mail archives as of July 11, 2011.

Size:

URL:

https://aws.amazon.com/de/datasets/apache-software-foundation-public-mail-archives/

http://mail-archives.apache.org/mod_mbox/
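
Mail archives like these are commonly stored in the mbox format, which Python's standard `mailbox` module can read. The sketch below assumes you have already downloaded one archive as a local mbox file; the function name `count_senders` and the two sample messages are made up for illustration.

```python
import mailbox
import os
import tempfile
from collections import Counter


def count_senders(mbox_path: str) -> Counter:
    """Count messages per 'From' header in a local mbox file."""
    counts: Counter = Counter()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            counts[message.get("From", "(unknown)")] += 1
    finally:
        box.close()
    return counts


# Hypothetical two-message archive, written to a temporary file for the demo.
SAMPLE = (
    "From alice@example.org Mon Jul 11 09:00:00 2011\n"
    "From: alice@example.org\n"
    "Subject: build question\n"
    "\n"
    "How do I run the tests?\n"
    "\n"
    "From bob@example.org Mon Jul 11 09:30:00 2011\n"
    "From: bob@example.org\n"
    "Subject: Re: build question\n"
    "\n"
    "See the README.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(SAMPLE)
    demo_counts = count_senders(path)

print(demo_counts)
```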

3. WordNet

Description: WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

Size: 117K synsets

URL: https://wordnet.princeton.edu/

4. The Blog Authorship Corpus

Description: The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers, gathered from blogger.com in August 2004.

Size: 19,320 bloggers; 681,288 posts; over 140 million words

URL: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm

5. Amazon product data

Description: This dataset contains product reviews and metadata from Amazon.

Size: 142.8 million reviews

URL: http://jmcauley.ucsd.edu/data/amazon/
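
Review dumps of this kind are typically shipped as gzip-compressed files with one JSON object per line. The sketch below assumes that layout and assumes each review carries an `asin` (product id) and an `overall` (star rating) field; check both against the files you actually download. All function names are illustrative.

```python
import gzip
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator


def read_gzip_lines(path: str) -> Iterator[str]:
    """Stream text lines from a gzip-compressed file at a local path."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def parse_reviews(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one review dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def mean_rating_by_product(reviews: Iterable[dict]) -> Dict[str, float]:
    """Average the assumed 'overall' rating per assumed 'asin' product id."""
    totals = defaultdict(lambda: [0.0, 0])  # asin -> [rating sum, review count]
    for review in reviews:
        entry = totals[review["asin"]]
        entry[0] += float(review["overall"])
        entry[1] += 1
    return {asin: total / count for asin, (total, count) in totals.items()}


# Tiny in-memory sample standing in for lines read from a real file.
sample_lines = [
    '{"asin": "A1", "overall": 5.0, "reviewText": "great"}',
    '{"asin": "A1", "overall": 3.0, "reviewText": "ok"}',
    '{"asin": "B2", "overall": 4.0, "reviewText": "fine"}',
]
print(mean_rating_by_product(parse_reviews(sample_lines)))
```

With a real file, `parse_reviews(read_gzip_lines(path))` streams the reviews without loading the whole archive into memory, which matters at this dataset's size.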

6. Web data: Amazon reviews

Description: This dataset consists of reviews from Amazon.

Size: ~35 million reviews

URL: https://snap.stanford.edu/data/web-Amazon.html
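
As far as I know, the SNAP review files are plain text, with each review written as a block of `key: value` lines (for example `product/productId`, `review/score`, `review/text`) and blocks separated by blank lines. The parser below is a minimal sketch under that assumption; the sample records and the name `parse_snap_records` are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def parse_snap_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # Split at the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    if record:
        yield record


sample = """product/productId: B00006HAXW
review/score: 5.0
review/text: Great: loved it

product/productId: B000XYZ
review/score: 2.0
review/text:
""".splitlines()

for rec in parse_snap_records(sample):
    print(rec["product/productId"], float(rec["review/score"]))
```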

7. Common Crawl

Description: A corpus of web crawl data composed of over 5 billion web pages.

Size: over 5 billion web pages

URL: https://registry.opendata.aws/commoncrawl/

8. Yelp Open Dataset

Description: A subset of Yelp's businesses, reviews, and user data, released for personal, educational, and academic use.

Size: 5,996,996 reviews; 188,593 businesses; 280,992 pictures; 10 metropolitan areas

URL: https://www.yelp.com/datasets

9. Machine Translation

Description: Data for translation between language pairs, from the WMT18 machine translation tasks.

Size: 30M

URL: http://statmt.org/wmt18/index.html

10. Enron Email Data

Description: Enron email data publicly released as part of FERC's Western Energy Markets investigation, converted to industry-standard formats by EDRM.

Size: 1,227,255 emails with 493,384 attachments, covering 151 custodians

URL: http://aws.amazon.com/de/datasets/enron-email-data/

11. Federal Contracts

Description: A data dump of all federal contracts from the Federal Procurement Data Center, found at USASpending.gov.

Size:

URL: https://aws.amazon.com/de/datasets/federal-contracts-from-the-federal-procurement-data-center-usaspending-gov/

12. Sentiment140

Description: Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter.

Size: 1.6 million tweets

URL: http://help.sentiment140.com/home
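
The Sentiment140 training data is distributed as a headerless CSV file. The sketch below assumes six columns in the order polarity, id, date, query, user, text, with polarity coded as 0 (negative), 2 (neutral), and 4 (positive); verify this against the file you download. The sample rows and the name `count_polarities` are made up for illustration.

```python
import csv
import io
from collections import Counter
from typing import Iterable

# Assumed column order and label coding; check against the real file.
COLUMNS = ("polarity", "id", "date", "query", "user", "text")
POLARITY_NAMES = {"0": "negative", "2": "neutral", "4": "positive"}


def count_polarities(rows: Iterable[str]) -> Counter:
    """Count tweets per polarity label from an iterable of CSV lines."""
    counts: Counter = Counter()
    for fields in csv.reader(rows):
        if not fields:  # skip blank lines
            continue
        record = dict(zip(COLUMNS, fields))
        counts[POLARITY_NAMES.get(record["polarity"], "unknown")] += 1
    return counts


# Hypothetical rows standing in for the real file contents.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","what a great, sunny day"\n'
)
print(count_polarities(sample))
```

For the real file, pass an open file handle instead of the `StringIO` object. If decoding fails with UTF-8, trying `encoding="latin-1"` is a reasonable next step.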

13. Freebase Data Dump

Description: A data dump of all the current facts and assertions in the Freebase system.

Size: millions of topics in hundreds of categories

URL: https://aws.amazon.com/de/datasets/freebase-data-dump/

14. IMDB movie review dataset

Description: A dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.

Size: 25,000 reviews for training, 25,000 for testing, plus additional unlabeled data

URL: http://ai.stanford.edu/~amaas/data/sentiment/

Paper: http://www.aclweb.org/anthology/P11-1015
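
Once the archive is unpacked, the labeled reviews are, to my knowledge, stored as one text file per review under `train/pos`, `train/neg`, `test/pos`, and `test/neg`. The loader below is a sketch under that assumption; `iter_labeled_reviews` is an illustrative name, and the demo builds a tiny fake directory tree rather than touching the real data.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_labeled_reviews(root: str, split: str = "train") -> Iterator[Tuple[str, int]]:
    """Yield (text, label) pairs, with label 1 for 'pos' and 0 for 'neg'."""
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = Path(root) / split / label_name
        for path in sorted(folder.glob("*.txt")):
            yield path.read_text(encoding="utf-8"), label


# Demo on a hypothetical two-review tree with the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    fake_files = (
        ("pos", "0_9.txt", "A wonderful film."),
        ("neg", "0_2.txt", "Dull and far too long."),
    )
    for label_name, file_name, text in fake_files:
        folder = Path(tmp) / "train" / label_name
        folder.mkdir(parents=True)
        (folder / file_name).write_text(text, encoding="utf-8")
    demo_reviews = list(iter_labeled_reviews(tmp, "train"))

print(demo_reviews)
```

Yielding pairs lazily keeps memory use low; wrap the call in `list(...)` only when the whole split is needed at once.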

15. Web 1T 5-gram Version 1

Description: Web 1T 5-gram Version 1, contributed by Google Inc., contains English word n-grams and their observed frequency counts.

Size: 1 trillion word tokens of text from publicly accessible Web pages

URL: https://catalog.ldc.upenn.edu/LDC2006T13
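
To make "n-grams and their observed frequency counts" concrete, the sketch below builds the same kind of table from a toy sentence. It does not read the LDC distribution itself, whose file format is not described here; `count_ngrams` and the sample sentence are illustrative only.

```python
from collections import Counter
from typing import List


def count_ngrams(tokens: List[str], max_n: int = 5) -> Counter:
    """Count every n-gram of length 1..max_n in a token sequence."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the tokens.
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


tokens = "the cat sat on the mat because the cat was tired".split()
counts = count_ngrams(tokens, max_n=3)

# Show the three most frequent bigrams in the toy sentence.
bigrams = Counter({gram: c for gram, c in counts.items() if len(gram) == 2})
print(bigrams.most_common(3))
```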

16. Harvard Library APIs & Datasets

Description: APIs and datasets from Harvard Library, one of the world's largest academic libraries.

Size: over 12.7M bib records; 4M image records; 2M finding aid components

URL: https://library.harvard.edu/services-tools/harvard-library-apis-datasets#Harvard-Library-Bibliographic-Dataset

17. Reddit Comments May 2015

Description: A small portion of Reddit's comments, covering May 2015.

Size:

URL: https://www.kaggle.com/reddit/reddit-comments-may-2015/home

18. Twenty Newsgroups Data Set

Description: This data set consists of 20,000 messages taken from 20 newsgroups.

Size: 20,000 messages from 20 newsgroups

URL: https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
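
Newsgroup posts are plain-text messages with email-style headers, so Python's standard `email` package can separate headers from the body. The sketch below assumes each post keeps its `Newsgroups` and `Subject` headers, which depends on the version of the data you download; the sample post and the name `parse_post` are hypothetical.

```python
from email import message_from_string
from typing import Dict, List, Union


def parse_post(raw: str) -> Dict[str, Union[str, List[str]]]:
    """Split one raw newsgroup post into group list, subject, and body."""
    message = message_from_string(raw)
    groups_header = message.get("Newsgroups") or ""
    return {
        "newsgroups": [g.strip() for g in groups_header.split(",") if g.strip()],
        "subject": message.get("Subject", ""),
        "body": message.get_payload(),
    }


# Hypothetical post in the assumed header-plus-body layout.
SAMPLE_POST = (
    "From: someone@example.com\n"
    "Newsgroups: sci.space,sci.astro\n"
    "Subject: Re: launch window\n"
    "\n"
    "When is the next launch?\n"
)

post = parse_post(SAMPLE_POST)
print(post["newsgroups"], post["subject"])
```

For classification, a cross-posted message like the sample carries more than one group label, so decide up front whether to keep, duplicate, or drop such posts.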

19. Crosswikis data

Description: Crosswikis, a dictionary that maps text strings to English Wikipedia concepts.

Size:

URL: https://nlp.stanford.edu/data/crosswikis-data.tar.bz2/

20. Other datasets

https://aws.amazon.com/de/datasets

https://www.kaggle.com/datasets

https://nlp.stanford.edu/links/statnlp.html

https://www.figure-eight.com/data-for-everyone

https://github.com/awesomedata/awesome-public-datasets
