Notebooks are a web-based Jupyter IDE with shared persistent storage for long-term development and inter-notebook collaboration, backed by accelerated compute.
A read-only collection of public datasets is available for free to use with Notebooks and Workflows. In Notebooks, the datasets are mounted under the /datasets directory, for example /datasets/mnist. In Workflows, a dataset is referenced by name, for example ref: gradient/mnist. You can also see these datasets in the Paperspace console by clicking the Public tab.
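A Notebook can read these files directly from the mount. The sketch below is illustrative only: it assumes the mnist dataset contains the standard IDX files from the original MNIST distribution, and the exact file names (and whether they are gzipped) are assumptions, so list the directory first to confirm.

```python
import gzip
import struct
from pathlib import Path

import numpy as np

# Assumption: the public MNIST dataset is mounted read-only at /datasets/mnist
# and ships the standard IDX files (e.g. train-images-idx3-ubyte[.gz]).
DATA_DIR = Path("/datasets/mnist")

def load_idx_images(path: Path) -> np.ndarray:
    """Parse an IDX image file (optionally gzipped) into an (n, rows, cols) uint8 array."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as f:
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError(f"not an IDX image file (magic={magic})")
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(n, rows, cols)

# Inspect the mount before hard-coding file names.
print(sorted(p.name for p in DATA_DIR.iterdir()))

# images = load_idx_images(DATA_DIR / "train-images-idx3-ubyte.gz")  # hypothetical file name
# print(images.shape)  # expected: (60000, 28, 28)
```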
The following table shows the available public datasets:
Name | Description | Source |
---|---|---|
chest-xray-nihcc-3 | NIH Chest X-ray dataset comprising frontal-view X-ray images of unique patients with fourteen text-mined disease image labels. | https://nihcc.app.box.com/v/ChestXray-NIHCC |
coco | COCO, a large-scale object detection, segmentation, and captioning dataset. | http://cocodataset.org |
conll2003 | CoNLL 2003, a named entity recognition dataset. | https://www.clips.uantwerpen.be/conll2003/ner/ |
dfki-sentinel-eurosat | EuroSAT, a land use and land cover classification dataset based on Sentinel-2 satellite images. | https://madm.dfki.de/downloads |
dolly-v2-12b | Dataset for Dolly, an instruction-following large language model trained on the Databricks machine learning platform. | https://github.com/databrickslabs/dolly |
fastai | Paperspace’s Fast.ai template, built for getting up and running with the Fast.ai online MOOC, Practical Deep Learning. | https://registry.opendata.aws |
gcl | Paperspace notebook logger key store. | |
glue | General Language Understanding Evaluation (GLUE) dataset, a multi-task benchmarking and analysis platform for natural language understanding. | https://gluebenchmark.com |
ieee-fraud-detection | Vesta’s real-world e-commerce transactions containing a range of features from device type to product features. | https://www.kaggle.com/c/ieee-fraud-detection |
librispeech_asr | Data from read audiobooks from the LibriVox project. | https://www.openslr.org/12 |
llama | Dataset consisting of question-answer pairs and source context to benchmark RAG pipelines for different use cases. | https://llamahub.ai/?tab=llama_datasets |
lsun | Large-scale Scene Understanding (LSUN) dataset containing around one million labeled images for each of 10 scene categories and 20 object categories. | https://github.com/fyu/lsun |
mnist | The Modified National Institute of Standards and Technology (MNIST) database of handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples. | http://yann.lecun.com/exdb/mnist/ |
ogb_lsc_pcqm4mv2 | Quantum chemistry dataset originally curated under the PubChemQC project. OGB-LSC is the Large-Scale Challenge by Open Graph Benchmark for graph-structured data. | https://ogb.stanford.edu/docs/lsc/pcqm4mv2 |
ogbl_wikikg2_custom | Knowledge graph extracted from the Wikidata knowledge base. | https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2 |
ogbn_arxiv | Directed graph representing the citation network between all Computer Science arXiv papers indexed by the Microsoft Academic Graph (MAG). Each node is an arXiv paper and each directed edge indicates that one paper cites another. | https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv |
openslr | Open Speech and Language Resources dataset number 12, the LibriSpeech ASR corpus. | https://www.openslr.org/resources.php |
pyg-cora | Scientific publications classified into one of seven classes. | https://people.cs.umass.edu/~mccallum/data.html |
pyg-fb15k-237 | FB15k-237 dataset containing 14,541 entities, 237 relations, and 310,116 fact triples. | https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.FB15k_237.html |
pyg-qm9 | QM9 dataset consisting of about 130,000 molecules with 19 regression targets. | https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.QM9.html |
pyg-reddit | Reddit dataset containing posts belonging to different communities. | https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Reddit.html |
pyg-tudataset | Graph kernel benchmark datasets collected from TU Dortmund University. | https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.TUDataset.html |
realistic-vision-v2-0 | Fine-tuned Stable Diffusion model. | https://huggingface.co/SG161222/Realistic_Vision_V2.0 |
squad | Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of questions posed on a set of Wikipedia articles. | https://rajpurkar.github.io/SQuAD-explorer/ |
stable-diffusion-classic | Dataset containing a Safetensors checkpoint for the Stable Diffusion v1-5 model from RunwayML and StabilityAI. | https://stability.ai/stable-image |
stable-diffusion-classic-v2 | Dataset containing a Safetensors checkpoint for the Stable Diffusion v2 model from StabilityAI. | https://stability.ai/stable-image |
stable-diffusion-diffusers | Dataset containing the Diffusers model files for the Stable Diffusion v1-5 model from StabilityAI. | https://github.com/Stability-AI/diffusers |
stable-diffusion-diffusers-v2 | Dataset containing the Diffusers model files for the Stable Diffusion v2 model from StabilityAI. | https://github.com/Stability-AI/stablediffusion |
stable-diffusion-v2-1 | Dataset containing the Diffusers and Safetensors checkpoint model files for the Stable Diffusion v2-1 model from StabilityAI (768p). | https://github.com/Stability-AI/stablediffusion |
stable-diffusion-v2-1-512 | Dataset containing the Diffusers and Safetensors checkpoint model files for the Stable Diffusion v2-1 model from StabilityAI (512p). | https://github.com/Stability-AI/stablediffusion |
stable-diffusion-v2-base-classic | Dataset containing a Safetensors checkpoint for the Stable Diffusion v2 model from StabilityAI. | https://github.com/Stability-AI/stablediffusion |
stable-diffusion-xl | Dataset containing the Diffusers and Safetensors checkpoint model files for the Stable Diffusion XL model from StabilityAI (1024p). | https://github.com/Stability-AI/stablediffusion |
superb | Speech processing Universal PERformance Benchmark (SUPERB) dataset for benchmarking shared model performance across a wide range of speech processing tasks. | https://superbbenchmark.org/ |
swag | Situations With Adversarial Generations (SWAG) dataset for grounded commonsense inference, unifying natural language inference and physically grounded reasoning. | https://rowanzellers.com/swag/ |
tiny-imagenet-200 | A subset of the ImageNet dataset created for the Stanford CS231n course, spanning 200 image classes with 500 training, 50 validation, and 50 test examples per class. | http://cs231n.stanford.edu/tiny-imagenet-200.zip |
wikitext | Language modeling dataset containing over 100 million tokens extracted from the verified Good and Featured articles on Wikipedia. | https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/ |
wmt16 | Datasets used in the shared tasks of the First Conference on Machine Translation (WMT16). | https://www.statmt.org/wmt16/ |
xsum | The Extreme Summarization (XSum) dataset for evaluating abstractive single-document summarization systems. | https://trends.openbayes.com/dataset/xsum |
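Each dataset uses its own file layout (raw archives, IDX files, Safetensors checkpoints, and so on), so it helps to inspect what is mounted before writing loading code. The snippet below is a generic sketch that assumes only the read-only /datasets mount described above.

```python
from pathlib import Path

DATASETS_ROOT = Path("/datasets")  # read-only public dataset mount

# List each public dataset and a sample of its top-level entries so you can
# see which layout (archives, checkpoints, splits, ...) a given dataset uses.
for dataset_dir in sorted(DATASETS_ROOT.iterdir()):
    if dataset_dir.is_dir():
        entries = sorted(p.name for p in dataset_dir.iterdir())
        preview = ", ".join(entries[:5]) + (", ..." if len(entries) > 5 else "")
        print(f"{dataset_dir.name}: {preview}")
```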