# Public Datasets

Notebooks are a web-based Jupyter IDE with shared persistent storage for long-term development and inter-notebook collaboration, backed by accelerated compute. A read-only collection of public datasets is available for free to use with Notebooks and Workflows.

- **Notebooks**: Datasets are available in the `/datasets` directory, for example, `/datasets/mnist`. You can also see these datasets in the Paperspace console by clicking the **Public** tab.
- **Workflows**: Datasets are in the Gradient namespace, for example, in YAML, `ref: gradient/mnist`.

The following table shows the available public datasets:

| Name | Description | Source |
|---|---|---|
| `chest-xray-nihcc-3` | NIH Chest X-ray dataset comprising frontal-view X-ray images of unique patients, labeled with fourteen text-mined disease labels. | [https://nihcc.app.box.com/v/ChestXray-NIHCC](https://nihcc.app.box.com/v/ChestXray-NIHCC) |
| `coco` | COCO, a large-scale object detection, segmentation, and captioning dataset. | [http://cocodataset.org](http://cocodataset.org) |
| `conll2003` | CoNLL 2003, a named entity recognition dataset. | [https://www.clips.uantwerpen.be/conll2003/ner/](https://www.clips.uantwerpen.be/conll2003/ner/) |
| `dfki-sentinel-eurosat` | EuroSAT, a land use and land cover classification dataset based on Sentinel-2 satellite images. | [https://madm.dfki.de/downloads](https://madm.dfki.de/downloads) |
| `dolly-v2-12b` | Dataset for Dolly, an instruction-following large language model trained on the Databricks machine learning platform. | [https://github.com/databrickslabs/dolly](https://github.com/databrickslabs/dolly) |
| `fastai` | Paperspace’s Fast.ai template, built for getting up and running with the Fast.ai online MOOC, [Practical Deep Learning](https://course.fast.ai/). | [https://registry.opendata.aws](https://registry.opendata.aws) |
| `gcl` | Paperspace notebook logger key store. | |
| `glue` | General Language Understanding Evaluation (GLUE) dataset, a multi-task benchmark and analysis platform for natural language understanding. | [https://gluebenchmark.com](https://gluebenchmark.com) |
| `ieee-fraud-detection` | Vesta’s real-world e-commerce transactions, containing a range of features from device type to product features. | [https://www.kaggle.com/c/ieee-fraud-detection](https://www.kaggle.com/c/ieee-fraud-detection) |
| `librispeech_asr` | Data from read audiobooks from the LibriVox project. | [https://www.openslr.org/12](https://www.openslr.org/12) |
| `llama` | Dataset consisting of question-answer pairs and source context to benchmark RAG pipelines for different use cases. | [https://llamahub.ai/?tab=llama\_datasets](https://llamahub.ai/?tab=llama_datasets) |
| `lsun` | Large-scale Scene Understanding (LSUN) dataset containing around one million labeled images for each of 10 scene categories and 20 object categories. | [https://github.com/fyu/lsun](https://github.com/fyu/lsun) |
| `mnist` | The Modified National Institute of Standards and Technology database of handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples. | [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/) |
| `ogb_lsc_pcqm4mv2` | Quantum chemistry dataset originally curated under the PubChemQC project. OGB-LSC is the Large-Scale Competition by Open Graph Benchmark for graph-structured data. | [https://ogb.stanford.edu/docs/lsc/pcqm4mv2](https://ogb.stanford.edu/docs/lsc/pcqm4mv2) |
| `ogbl_wikikg2_custom` | Knowledge graph extracted from the [Wikidata](https://en.wikipedia.org/wiki/Wikidata) knowledge base. | [https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2](https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2) |
| `ogbn_arxiv` | Directed graph representing the citation network between all Computer Science arXiv papers indexed by [Microsoft Academic Graph (MAG)](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/). Each node is an arXiv paper, and each directed edge indicates that one paper cites another. | [https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv](https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv) |
| `openslr` | Open Speech and Language Resources dataset number 12, the LibriSpeech ASR corpus. | [https://www.openslr.org/resources.php](https://www.openslr.org/resources.php) |
| `pyg-cora` | Cora dataset of scientific publications classified into one of seven classes. | [https://people.cs.umass.edu/~mccallum/data.html](https://people.cs.umass.edu/~mccallum/data.html) |
| `pyg-fb15k-237` | FB15k-237 dataset containing 14,541 entities, 237 relations, and 310,116 fact triples. | [https://pytorch-geometric.readthedocs.io/en/latest/generated/torch\_geometric.datasets.FB15k\_237.html](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.FB15k_237.html) |
| `pyg-qm9` | QM9 dataset consisting of about 130,000 molecules with 19 regression targets. | [https://pytorch-geometric.readthedocs.io/en/latest/generated/torch\_geometric.datasets.QM9.html](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.QM9.html) |
| `pyg-reddit` | Reddit dataset containing posts belonging to different communities. | [https://pytorch-geometric.readthedocs.io/en/latest/generated/torch\_geometric.datasets.Reddit.html](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Reddit.html) |
| `pyg-tudataset` | Graph kernel benchmark datasets collected at TU Dortmund University. | [https://pytorch-geometric.readthedocs.io/en/latest/generated/torch\_geometric.datasets.TUDataset.html](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.TUDataset.html) |
| `realistic-vision-v2-0` | Fine-tuned Stable Diffusion model. | [https://huggingface.co/SG161222/Realistic\_Vision\_V2.0](https://huggingface.co/SG161222/Realistic_Vision_V2.0) |
| `squad` | Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of questions on a set of Wikipedia articles. | [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/) |
| `stable-diffusion-classic` | Dataset containing a Safetensors checkpoint for the Stable Diffusion v1-5 model from RunwayML and StabilityAI. | [https://stability.ai/stable-image](https://stability.ai/stable-image) |
| `stable-diffusion-classic-v2` | Dataset containing a Safetensors checkpoint for the Stable Diffusion v2 model from StabilityAI. | [https://stability.ai/stable-image](https://stability.ai/stable-image) |
| `stable-diffusion-diffusers` | Dataset containing the Diffusers model files for the Stable Diffusion v1.5 model from StabilityAI. | [https://github.com/Stability-AI/diffusers](https://github.com/Stability-AI/diffusers) |
| `stable-diffusion-diffusers-v2` | Dataset containing the Diffusers model files for the Stable Diffusion v2 model from StabilityAI. | [https://github.com/Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) |
| `stable-diffusion-v2-1` | Dataset containing the Diffusers and Safetensors checkpoint model files for the Stable Diffusion v2-1 model from StabilityAI (768p). | [https://github.com/Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) |
| `stable-diffusion-v2-1-512` | Dataset containing the Diffusers and Safetensors checkpoint model files for the Stable Diffusion v2-1 model from StabilityAI (512p). | [https://github.com/Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) |
| `stable-diffusion-v2-base-classic` | Dataset containing a Safetensors checkpoint for the Stable Diffusion v2-base model from StabilityAI. | [https://github.com/Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) |
| `stable-diffusion-xl` | Dataset containing the Diffusers and Safetensors checkpoint model files for the Stable Diffusion XL model from StabilityAI (1024p). | [https://github.com/Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) |
| `superb` | Speech processing Universal PERformance Benchmark (SUPERB) dataset for benchmarking shared model performance across a wide range of speech processing tasks. | [https://superbbenchmark.org/](https://superbbenchmark.org/) |
| `swag` | Situations With Adversarial Generations (SWAG) dataset for grounded commonsense inference, unifying natural language inference and physically grounded reasoning. | [https://rowanzellers.com/swag/](https://rowanzellers.com/swag/) |
| `tiny-imagenet-200` | A subset of the ImageNet dataset created for the Stanford CS231n course, spanning 200 image classes with 500 training, 50 validation, and 50 test examples per class. | [http://cs231n.stanford.edu/tiny-imagenet-200.zip](http://cs231n.stanford.edu/tiny-imagenet-200.zip) |
| `wikitext` | Language modeling dataset containing over 100 million tokens extracted from the verified [good](https://en.wikipedia.org/wiki/Wikipedia:Good_articles) and [featured](https://en.wikipedia.org/wiki/Wikipedia:Featured_articles) articles on Wikipedia. | [https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) |
| `wmt16` | Datasets used in the shared tasks of the First Conference on Machine Translation. | [https://www.statmt.org/wmt16/](https://www.statmt.org/wmt16/) |
| `xsum` | The Extreme Summarization (XSum) dataset for evaluating abstractive single-document summarization systems. | [https://trends.openbayes.com/dataset/xsum](https://trends.openbayes.com/dataset/xsum) |
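In a Workflow, a `gradient/...` dataset ref is consumed as a job input and mounted for the job to read. The following is a minimal sketch of a Workflow spec, assuming the standard `jobs`/`inputs`/`uses` shape; the job name, script, container image, and the `/inputs/<input-name>` mount path are illustrative assumptions, not guaranteed details:

```yaml
jobs:
  inspect-mnist:           # illustrative job name
    inputs:
      mnist:               # input name; assumed to mount at /inputs/mnist
        type: dataset
        with:
          ref: gradient/mnist
    uses: script@v1
    with:
      script: ls -R /inputs/mnist
      image: bash:5        # illustrative container image
```

The `ref: gradient/mnist` line is the piece documented above; swap in any dataset name from the table to use a different public dataset.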
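In a Notebook, a public dataset is simply a read-only directory of files, so you work with it using ordinary file I/O. As a minimal sketch, the helper below parses the IDX format that the MNIST files use; the exact filenames under `/datasets/mnist` are an assumption about the dataset's layout, so the real read is shown commented out.

```python
import struct

def read_idx(data: bytes):
    """Parse an unsigned-byte IDX-format buffer (the file format MNIST uses).

    Returns (dims, payload): dims is a tuple of dimension sizes from the
    header, payload is the flat byte payload that follows it.
    """
    # The 4-byte magic number: two zero bytes, a dtype code, a dimension count.
    zero, dtype_code, ndims = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype_code != 0x08:  # 0x08 = unsigned byte, as in MNIST
        raise ValueError("not an unsigned-byte IDX buffer")
    # Each dimension size is a big-endian 32-bit integer after the magic number.
    dims = struct.unpack(f">{ndims}I", data[4 : 4 + 4 * ndims])
    payload = data[4 + 4 * ndims :]
    return dims, payload

# Inside a Notebook the files live under /datasets/mnist; the filename below
# is an assumption and may differ in the actual dataset:
# with open("/datasets/mnist/train-images-idx3-ubyte", "rb") as f:
#     dims, pixels = read_idx(f.read())
#     # MNIST training images have dims (60000, 28, 28)
```

Because `/datasets` is read-only, copy any files you need to modify into your persistent storage first.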