Datasets for ML

Awesome Public Datasets

  A topic-centric list of HQ open datasets

This is a list of high-quality, topic-centric public data sources, collected and tidied from blogs, answers, and user responses. Most of the datasets listed below are free, but some are not. This project was incubated at OMNILab, Shanghai Jiao Tong University, during Xiaming Chen's Ph.D. studies. OMNILab is now part of the BaiYuLan Open AI community. Other amazingly awesome lists can be found in sindresorhus's awesome list.

Categories are Agriculture, Architecture, Biology, Chemistry, Complex Networks, Computer Networks, Cyber Security, Data Challenges, Earth Science, Economics, Education, Energy, Entertainment, Finance, GIS, Government, Healthcare, Image Processing, Machine Learning, Museums, Natural Language, Neuro Science, Physics, Prostate Cancer, Psychology + Cognition, Public Domains, Search Engines, Social Networks, Social Sciences, Software, Sports, Time Series, Transportation, eSports, Complementary Collections.

OpenML Datasets A worldwide machine learning lab

OpenML allows fine-grained search over thousands of machine learning datasets. Via the website, you can filter by many dataset properties, such as size, type, and format. Via the APIs, you have access to many more filters, and you can download a complete table with statistics of all datasets. Via the APIs, you can also load datasets directly into your preferred data structures. We are also working on better organization of all datasets by topic.

Datasets provide training data for machine learning models. OpenML datasets are uniformly formatted and come with rich meta-data to allow automated processing. You can sort or filter them by a range of different properties.
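As a rough illustration of the filter, sort, and load workflow described above, here is a minimal Python sketch. The `DatasetMeta` fields, the `CATALOG` rows, and the `search` helper are all hypothetical stand-ins for a downloaded statistics table, not OpenML's actual schema or client. The `load_frame` helper assumes scikit-learn and pandas are installed and uses scikit-learn's `fetch_openml`, which is one of several ways to pull an OpenML dataset into a DataFrame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMeta:
    """One row of a dataset statistics table (field names are illustrative)."""
    name: str
    n_instances: int
    n_features: int
    fmt: str


# Hypothetical rows standing in for a downloaded metadata table.
CATALOG = [
    DatasetMeta("toy-small", 150, 4, "ARFF"),
    DatasetMeta("toy-wide", 2_000, 500, "ARFF"),
    DatasetMeta("toy-large", 70_000, 784, "Parquet"),
    DatasetMeta("toy-medium", 10_000, 20, "Parquet"),
]


def search(catalog, *, min_instances=0, max_features=None, fmt=None):
    """Filter by size, feature count, and format, then sort largest first."""
    hits = [
        d for d in catalog
        if d.n_instances >= min_instances
        and (max_features is None or d.n_features <= max_features)
        and (fmt is None or d.fmt == fmt)
    ]
    return sorted(hits, key=lambda d: d.n_instances, reverse=True)


def load_frame(name, version=1):
    """Load a dataset by name as a pandas DataFrame.

    Assumes scikit-learn and pandas are installed and the network is
    reachable; the import is deferred so the rest of the sketch runs offline.
    """
    from sklearn.datasets import fetch_openml
    return fetch_openml(name=name, version=version, as_frame=True).frame


if __name__ == "__main__":
    for d in search(CATALOG, min_instances=1_000, max_features=600):
        print(d.name, d.n_instances, d.n_features, d.fmt)
```

Calling `load_frame("iris")` would then fetch a real dataset by name, provided the dependencies and a network connection are available.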

Papers With Code

  A free resource with all data licensed under CC-BY-SA

Our mission is to create a free and open resource with Machine Learning papers, code, datasets, methods and evaluation tables. We believe this is best done together with the community, supported by NLP and ML.

All content on this website is openly licensed under CC-BY-SA (same as Wikipedia), and everyone can contribute! We also operate specialized portals for papers with code in astronomy, physics, computer science, mathematics, and statistics.

  The AI community building the future

Registry of Open Data on AWS See all usage examples for datasets listed there

This registry exists to help people discover and share datasets that are available via AWS resources. It includes datasets from the following groups: Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Data for Good at Meta, NASA Space Act Agreement, NIH STRIDES, NOAA Open Data Dissemination Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

DagsHub Launch your ML development to new heights

Welcome to our Datasets database, where you will find hundreds of datasets from various categories such as computer vision, audio, NLP, and more.

All datasets are free and ready for use on our platform for all your projects. Browse through our categories and find the perfect dataset to fit your needs. Get started today and experience the power of data.

CivitAI models Find your favorite LoRAs

CivitAI is a platform designed to boost the creation of AI-generated media. It offers an environment where users can upload, share, and discover custom models, each trained on distinct datasets. These models can be leveraged as innovative tools for crafting your creations.

Kaggle Datasets Explore, analyze, and share quality data

Join over 18 million machine learners to share, stress test, and stay up-to-date on all the latest machine learning techniques and technologies.

Discover a huge repository of community-published models, data & code for your next project.

Looking for more datasets? Search for them on Wikipedia

These datasets are used in machine learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.

High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data.