Public Datasets

A read-only collection of public datasets is available free of charge for use with Notebooks and Workflows.

  • Notebooks: Datasets are available in the /datasets directory, for example, /datasets/mnist. You can also see these datasets in the Paperspace console by clicking the Public tab.
  • Workflows: Datasets are in the Gradient namespace, for example, in YAML, ref: gradient/mnist.
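In a Notebook, the mount points follow the `/datasets/<name>` pattern described above, so locating a dataset is a simple path join. A minimal sketch (the helper function and the directory-listing loop are illustrative, not part of any Paperspace SDK):

```python
from pathlib import Path

def dataset_path(name: str, root: str = "/datasets") -> Path:
    """Return the mount point of a public dataset inside a Notebook.

    The /datasets tree is read-only; copy files elsewhere before writing.
    """
    return Path(root) / name

# Inside a running Notebook, list the files shipped with the MNIST dataset:
mnist = dataset_path("mnist")
if mnist.is_dir():
    for entry in sorted(mnist.iterdir()):
        print(entry.name)
```

Because the collection is read-only, training scripts should treat these paths as input only and write checkpoints or transformed data to a separate writable location.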

The following table shows the available public datasets:

| Name | Description | Source |
| --- | --- | --- |
| chest-xray-nihcc-3 | NIH Chest X-ray dataset comprising frontal-view X-ray images of unique patients with fourteen text-mined disease labels. | https://nihcc.app.box.com/v/ChestXray-NIHCC |
| coco | COCO, a large-scale object detection, segmentation, and captioning dataset. | http://cocodataset.org |
| conll2003 | CoNLL 2003, a named entity recognition dataset. | https://www.clips.uantwerpen.be/conll2003/ner/ |
| dfki-sentinel-eurosat | EuroSAT, a land use and land cover classification dataset based on Sentinel-2 satellite images. | https://madm.dfki.de/downloads |
| dolly-v2-12b | Dataset for Dolly, an instruction-following large language model trained on the Databricks machine learning platform. | https://github.com/databrickslabs/dolly |
| fastai | Paperspace's Fast.ai template, built for getting up and running with the Fast.ai online MOOC Practical Deep Learning. | https://registry.opendata.aws |
| gcl | Paperspace notebook logger key store. | |
| glue | General Language Understanding Evaluation (GLUE) dataset, a multi-task benchmarking and analysis platform for natural language understanding. | https://gluebenchmark.com |
| ieee-fraud-detection | Vesta's real-world e-commerce transactions, containing a range of features from device type to product features. | https://www.kaggle.com/c/ieee-fraud-detection |
| librispeech_asr | Data from read audiobooks from the LibriVox project. | https://www.openslr.org/12 |
| llama | Question-answer pairs with source context for benchmarking RAG pipelines across different use cases. | https://llamahub.ai/?tab=llama_datasets |
| lsun | Large-scale Scene Understanding (LSUN) dataset containing around one million labeled images for each of 10 scene categories and 20 object categories. | https://github.com/fyu/lsun |
| mnist | The Modified National Institute of Standards and Technology (MNIST) database of handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples. | http://yann.lecun.com/exdb/mnist/ |
| ogb_lsc_pcqm4mv2 | Quantum chemistry dataset originally curated under the PubChemQC project. OGB-LSC is the Large-Scale Challenge by Open Graph Benchmark for graph-structured data. | https://ogb.stanford.edu/docs/lsc/pcqm4mv2 |
| ogbl_wikikg2_custom | Knowledge graph extracted from the Wikidata knowledge base. | https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2 |
| ogbn_arxiv | Directed graph representing the citation network among all computer science arXiv papers indexed by the Microsoft Academic Graph (MAG). Each node is an arXiv paper, and each directed edge indicates that one paper cites another. | https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv |
| openslr | Open Speech and Language Resources dataset number 12, the LibriSpeech ASR corpus. | https://www.openslr.org/resources.php |
| pyg-cora | Scientific publications classified into one of seven classes. | https://people.cs.umass.edu/~mccallum/data.html |
| pyg-fb15k-237 | FB15k-237 dataset containing 14,541 entities, 237 relations, and 310,116 fact triples. | https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.FB15k_237.html |
| pyg-qm9 | QM9 dataset consisting of about 130,000 molecules with 19 regression targets. | https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.QM9.html |
| pyg-reddit | Reddit dataset containing posts belonging to different communities. | https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Reddit.html |
| pyg-tudataset | Graph kernel benchmark datasets collected from TU Dortmund University. | https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.TUDataset.html |
| realistic-vision-v2-0 | Fine-tuned Stable Diffusion model. | https://huggingface.co/SG161222/Realistic_Vision_V2.0 |
| squad | Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of questions on a set of Wikipedia articles. | https://rajpurkar.github.io/SQuAD-explorer/ |
| stable-diffusion-classic | Safetensors checkpoint for the Stable Diffusion v1-5 model from RunwayML and StabilityAI. | https://stability.ai/stable-image |
| stable-diffusion-classic-v2 | Safetensors checkpoint for the Stable Diffusion v2 model from StabilityAI. | https://stability.ai/stable-image |
| stable-diffusion-diffusers | Diffusers model files for the Stable Diffusion v1-5 model from StabilityAI. | https://github.com/Stability-AI/diffusers |
| stable-diffusion-diffusers-v2 | Diffusers model files for the Stable Diffusion v2 model from StabilityAI. | https://github.com/Stability-AI/stablediffusion |
| stable-diffusion-v2-1 | Diffusers and Safetensors checkpoint model files for the Stable Diffusion v2-1 model from StabilityAI (768p). | https://github.com/Stability-AI/stablediffusion |
| stable-diffusion-v2-1-512 | Diffusers and Safetensors checkpoint model files for the Stable Diffusion v2-1 model from StabilityAI (512p). | https://github.com/Stability-AI/stablediffusion |
| stable-diffusion-v2-base-classic | Safetensors checkpoint for the Stable Diffusion v2-base model from StabilityAI. | https://github.com/Stability-AI/stablediffusion |
| stable-diffusion-xl | Diffusers and Safetensors checkpoint model files for the Stable Diffusion XL model from StabilityAI (1024p). | https://github.com/Stability-AI/stablediffusion |
| superb | Speech processing Universal PERformance Benchmark (SUPERB) dataset for benchmarking shared model performance across a wide range of speech processing tasks. | https://superbbenchmark.org/ |
| swag | Situations With Adversarial Generations (SWAG) dataset for grounded commonsense inference, unifying natural language inference and physically grounded reasoning. | https://rowanzellers.com/swag/ |
| tiny-imagenet-200 | A subset of the ImageNet dataset created for the Stanford CS231n course, spanning 200 image classes with 500 training, 50 validation, and 50 test examples per class. | http://cs231n.stanford.edu/tiny-imagenet-200.zip |
| wikitext | Language modeling dataset containing over 100 million tokens extracted from the verified Good and Featured articles on Wikipedia. | https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/ |
| wmt16 | Datasets used in the shared tasks of the First Conference on Machine Translation (WMT16). | https://www.statmt.org/wmt16/ |
| xsum | The Extreme Summarization (XSum) dataset for evaluating abstractive single-document summarization systems. | https://trends.openbayes.com/dataset/xsum |
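In a Workflow, any dataset from the table above is referenced by its `gradient/<name>` ref as a job input. A minimal sketch of a Workflow spec (the job name, instance type, and base image are illustrative; only the `ref` pattern is taken from the description above):

```yaml
jobs:
  inspect-mnist:
    resources:
      instance-type: C4        # illustrative instance type
    inputs:
      mnist:
        type: dataset
        with:
          ref: gradient/mnist  # public dataset from the table above
    uses: script@v1
    with:
      script: ls /inputs/mnist # inputs are mounted under /inputs/<input-name>
      image: ubuntu:22.04      # illustrative base image
```

The input name (`mnist` here) determines the mount point under `/inputs`, so the same dataset can be mounted under any name a job prefers.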