Directories and general data sets¶
- Awesome Public Datasets: various public datasets (Agriculture, Biology, Finance, Sports and a lot more)
- r/datasets: datasets for data mining, analytics, and knowledge discovery
- data.world
- Kaggle Datasets: discover and seamlessly analyze open data
- Google Dataset Search
- fivethirtyeight/data: data and code behind the stories and interactives at FiveThirtyEight
- BuzzFeedNews: open-source data, analysis, libraries, tools, and guides from BuzzFeed’s newsroom
- Socrata OpenData
- AWS Public Datasets: public datasets hosted on AWS
- Google BigQuery Public Datasets: public datasets hosted on Google BigQuery
- Wikipedia Datasets
- The World Bank: free and open access to global development data
- Common Crawl: an open repository of web crawl data
- Pew Research Center Datasets
- data.gov: the home of the U.S. Government's open data (data, tools, and resources)
You can also retrieve various data via web APIs.
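As a rough illustration of retrieving data from a web API, here is a minimal Python sketch that uses only the standard library. The endpoint `https://api.example.org/v1/search`, the query parameters, and the `results` key are hypothetical placeholders, not part of any service listed on this page; check each provider's documentation for the real URL, parameters, and response shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, params):
    """Append URL-encoded query parameters to a base endpoint."""
    return f"{base_url}?{urlencode(params)}"


def parse_records(payload, key="results"):
    """Decode a JSON string and return the list stored under `key` (empty if absent)."""
    data = json.loads(payload)
    return data.get(key, [])


def fetch_records(base_url, params, key="results", timeout=10):
    """Request the endpoint and return its parsed records (needs network access)."""
    with urlopen(build_url(base_url, params), timeout=timeout) as response:
        return parse_records(response.read().decode("utf-8"), key)
```

With a real endpoint, a call might look like `fetch_records("https://api.example.org/v1/search", {"q": "exoplanets", "limit": 5})`. Many APIs also require an API key or enforce rate limits, which this sketch does not handle.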
Language¶
- corpora: a collection of small corpuses of interesting data for the creation of bots and similar stuff
- ConceptNet 5: semantic network containing lots of things computers should know about the world, especially when understanding text written by people, also has a web API
- Topical-Chat: knowledge-grounded human-human conversation dataset
- Fake news: Text & metadata from fake & biased news sources around the web
- SARC: a large self-annotated corpus for sarcasm
- 200,000+ Jeopardy Questions: json/csv dataset crawled from j-archive
- shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words (useful for filtering language)
- 30,000 published crossword puzzles
- Structured data from ingredient phrases using conditional random fields
- Retired Comedy Phrases
- Natural Language Corpus Data: Beautiful Data
- bot-ai/bot-lang: a collection of common keywords or commands a user might use while interacting with a bot (forked from howdyai/bot-common-keywords)
- Wordbanks Word Lists (enchantedlearning.com)
- Third Eye Data: TV News Archive chyrons (archive.org)
- Full Hacker News dataset (available on BigQuery)
- Full Reddit submission corpus (2006 – August 2015)
- 20 Newsgroups: a collection of approximately 20,000 newsgroup documents
- Old Fulton NY Post Cards: search over 33,100,000 historical newspaper pages (US and Canada)
- Textstelle: a collection of corpora for the creation of bots and other things that generate text (textstelle.0x0a.li)
- some word lists for bot-making by Nora Reed (barrl.net)
Geographical and location data¶
- Climate datasets
- Publica Mundi: Geospatial data
- NYC Open Data: Learn about where you live, work, eat, shop and play using NYC Open Data.
- NYC Taxi & Limousine Commission – Trip Record Data: pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts
- Uber Movement: anonymized data from over 2 billion trips to help improve urban planning around the world (access must be requested) (movement.uber.com)
Space¶
- openNASA: 30k+ data sets from NASA
- Open Exoplanet Catalogue: a database of all discovered extra-solar planets
- Apollo11GuidanceComputerVertAndNounList.txt: a list of “nouns” and “verbs” to control the Apollo 11 guidance computer
Biology¶
- Catalogue of Life: online database of the world’s known species of animals, plants, fungi and micro-organisms, available as an API and downloadable data set
Health¶
- HealthData.gov: making high value health data more accessible to entrepreneurs, researchers, and policy makers in the hopes of better health outcomes for all (healthdata.gov)
- OpenFDA: Open-source APIs for FDA drug, device, and food data
- Medical Data for Machine Learning
- Open Payments Dataset: payments made by healthcare manufacturers (pharmaceutical companies, medical device manufacturers) to any doctor they work with
- Open Food Facts: a food products database
General science¶
- Yelp’s Academic Dataset
- Academic Torrents: datasets from scientific papers
- CERN Open Data Portal
Politics¶
- CREST (CIA Records Search Tool): Declassified CIA documents
- European Parliament Proceedings Parallel Corpus 1996-2011
- Public list of .gov domains (home.dotgov.gov)
- github.com/unitedstates: a shared commons of data and tools for the United States
Film, TV, literature¶
- Cornell Movie-Dialogs Corpus: a large metadata-rich collection of fictional conversations extracted from raw movie scripts
- Movielens Data by GroupLens: rating data sets from the MovieLens web site
- UC Irvine Machine Learning Lab’s Movie Data Set: a list of over 10000 films including many older, odd, and cult films
- Hadley Wickham’s Normalized IMDB Movie Data
- AM Stat Movie Data Set: weekend and daily per theater box office receipt data as well as total U.S. gross receipts
- LinkedMDB: open semantic web database for movies, including a large number of interlinks to several datasets on the open data cloud and references to related web pages
- Open Movie Database: a free web service to obtain movie information
- Cornell University Movie Review Data
- Jonathan Koren Movies Data Set: faceted metadata describing contemporary American films, along with relevance judgements by actual human users
- markriedl/WikiPlots: a dataset containing story plots from Wikipedia (books, movies, etc.) and the code for the extractor
- corpusmusic/liederCorpusAnalysis: a collection of IPA (International Phonetic Alphabet) transcriptions of German poems from prominent 19th-c. art songs
Art and images¶
- Open Images Dataset by Google: dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories (related blog post)
- The Museum of Modern Art (MoMA) Collection: basic metadata for more than 120,000 records (no images, but some include URLs)
- The Metropolitan Museum of Art Open Access: select datasets of information on more than 420,000 artworks in its Collection for unrestricted commercial and noncommercial use
- The Tate Collection: snapshot of the Tate collection as of October 2014 (no longer actively maintained)
- The collection data of the Carnegie Museum of Art: data on approximately 28,269 objects across all departments of the museum: fine arts, decorative arts, photography, contemporary art, and the Heinz Architectural Center
- National Museum Sweden Wikidata Collection: the collection of the Nationalmuseum in Stockholm
- Museum für Kunst und Gewerbe Hamburg: more than 20,000 artworks and artifacts
- Yale Center for British Art – Collections Department: high-resolution images of Yale University’s collection objects
- VisualGenome: a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language
Other¶
- Emoji Data for UTR #51 (unicode.org)
- DVS128 Gesture Dataset (research.ibm.com)
- Open Repair Alliance data downloads (openrepair.org)
- Tesco Grocery 1.0: a large-scale dataset of grocery purchases in London (nature.com)