# Workflow Spec Reference (private)

Workflows automate machine learning tasks, combining GPU instances with an expressive syntax to generate production-ready machine learning pipelines in a few lines of code. A Workflow Spec is a YAML list of jobs that is converted into an Argo template and run on the Gradient distributed runtime engine.

## Key Concepts

- `defaults`: At the top of the YAML Workflow file, you can specify default parameters to be used throughout the entire Workflow. This includes environment variables and the default machine instance configuration. Instances can also be specified per job.
- `inputs`: The `inputs` block allows you to specify named inputs (for example, a [versioned dataset](https://docs.digitalocean.com/products/paperspace/notebooks/details/storage-architecture/index.html.md)) to be referenced and consumed by your jobs. You can also collect inputs in a separate YAML file and reference that file as an `inputPath` when creating a Workflow run (see the sketch after this list). Workflow- and job-level inputs can be of type: dataset (a persistent, versioned collection of data), string (for example, a generated value or ID that may be output from another job), or volume (a temporary workspace mounted onto a job’s container). Datasets must be defined before they are referenced in a Workflow. See [Create Datasets for the Workflow](https://docs.digitalocean.com/products/paperspace/workflows/how-to/create-datasets/index.html.md) for more information.
- `jobs`: Jobs are also sometimes referred to as “steps” within a Gradient Workflow. A job is an individual task that executes code (such as training a machine learning model) and can consume inputs and produce outputs.
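As a minimal sketch of the separate inputs file mentioned above, the two Workflow-level inputs used by the example spec in the next section could be collected into a standalone YAML file and supplied as the run's `inputPath`. The filename `inputs.yaml` is hypothetical, and the structure is assumed here simply to mirror the JSON form shown in the comments of the example spec below.

```yaml
# Hypothetical inputs.yaml: collects the Workflow-level inputs for the
# example spec so they can be supplied via an inputPath at run time
# instead of being passed inline.
inputs:
  data:
    id: test-one          # dataset input, resolved by dataset name
  echo:
    value: "hello world"  # string input
```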
## Example Workflow Spec

To run this Workflow, define datasets named `test-one`, `test-two`, and `test-three` as described in the [Create Datasets for the Workflow](https://docs.digitalocean.com/products/paperspace/workflows/how-to/create-datasets/index.html.md) documentation. Also, to make use of the secret named `hello` referenced in the `defaults` block, define a [secret](https://docs.digitalocean.com/products/paperspace/accounts-and-teams/use-secrets/index.html.md).

```yaml
defaults:
  # clusterId defaults to the NY2 public cluster; setting this parameter is
  # equivalent to using the `--clusterId` flag on the command line.
  # This parameter is often used for GitHub-triggered Workflows running on
  # private clusters.
  clusterId: clusterId
  # Default environment variables for all jobs. Can use any supported
  # substitution syntax (named secrets, ephemeral secrets, etc.).
  env:
    # This environment variable uses a Gradient secret called "hello".
    HELLO: secret:hello
  # Default instance type for all jobs
  resources:
    instance-type: P4000
  container-registries: # optional
    - my-registry

# Workflow takes two inputs, neither of which has a default. This means that
# when the Workflow is run, the corresponding inputs for these values are
# required, for example:
#
# {"inputs": {"data": {"id": "test-one"}, "echo": {"value": "hello world"}}}
#
inputs:
  data:
    type: dataset
    with:
      ref: test-one
  echo:
    type: string
    with:
      value: "hello world"

jobs:
  job-1:
    # These are inputs for the "job-1" job; they are "aliases" to the
    # Workflow inputs.
    #
    # All inputs are placed in the "/inputs/" path of the run
    # containers. So for this job you would have the paths "/inputs/data"
    # and "/inputs/echo".
    inputs:
      # The "/inputs/data" directory would contain the contents of the dataset
      # version. ID here refers to the name of the dataset, not its dataset ID.
      data: workflow.inputs.data
      # The "/inputs/echo" file would contain the string of the Workflow input
      # "echo".
      echo: workflow.inputs.echo
    # These are outputs for the "job-1" job.
    #
    # All outputs are read from the "/outputs/" path.
    outputs:
      # A directory will automatically be created for output datasets, and any
      # content written to that directory will be committed to a newly created
      # dataset version when the job completes.
      data2:
        type: dataset
        with:
          id: test-two
      # The container is responsible for creating the file "/outputs/echo2",
      # with the content being a small-ish UTF-8 encoded string.
      echo2:
        type: string
    # Set job-specific environment variables
    env:
      TSTVAR: test
    # Set action
    uses: container@v1
    # Set action arguments
    with:
      args:
        - bash
        - -c
        - find /inputs/data > /outputs/data2/list.txt; echo ENV $HELLO $TSTVAR > /outputs/echo2; cat /inputs/echo; echo; cat /outputs/data2/list.txt /outputs/echo2
      image: bash:5
  job-2:
    inputs:
      # These inputs use job-1 outputs instead of Workflow inputs. You must
      # specify job-1 in the needs section to reference them here.
      data2: job-1.outputs.data2
      echo2: job-1.outputs.echo2
    outputs:
      data3:
        type: dataset
        with:
          ref: test-three
    # List of job IDs that must complete before this job runs
    needs:
      - job-1
    uses: container@v1
    with:
      args:
        - bash
        - -c
        - wc -l /inputs/data2/list.txt > /outputs/data3/summary.txt; cat /outputs/data3/summary.txt /inputs/echo2
      image: bash:5
```

Below is an example of a valid `workflow.yaml` spec. It clones the repository from `https://github.com/NVlabs/stylegan2`, generates images with the repo script `run_generator.py`, and outputs the results to the Gradient-managed dataset `demo-dataset`.

```yaml
jobs:
  CloneRepo:
    resources:
      instance-type: C5
    outputs:
      repo:
        type: volume
    uses: git-checkout@v1
    with:
      url: https://github.com/NVlabs/stylegan2.git
  StyleGan2:
    resources:
      instance-type: P4000
    needs:
      - CloneRepo
    inputs:
      repo: CloneRepo.outputs.repo
    outputs:
      generatedFaces:
        type: dataset
        with:
          ref: demo-dataset
    uses: script@v1
    with:
      script: |-
        pip install scipy==1.3.3
        pip install requests==2.22.0
        pip install Pillow==6.2.1
        cp -R /inputs/repo /stylegan2
        cd /stylegan2
        python run_generator.py generate-images \
          --network=gdrive:networks/stylegan2-ffhq-config-f.pkl \
          --seeds=6600-6605 \
          --truncation-psi=0.5 \
          --result-dir=/outputs/generatedFaces
      image: tensorflow/tensorflow:1.14.0-gpu-py3
```
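The same job-chaining pattern shown in the first example (job-2 consuming job-1 outputs) applies here as well. The sketch below appends a hypothetical third job that consumes the `generatedFaces` dataset produced by `StyleGan2`; the job name, instance type, and command are illustrative assumptions and are not part of the original example.

```yaml
  # Hypothetical follow-on job, appended under the same "jobs:" key of the
  # StyleGAN2 spec above. It waits for StyleGan2 via "needs", mounts that
  # job's dataset output at /inputs/generatedFaces, and lists the images.
  ListFaces:
    resources:
      instance-type: C5
    needs:
      - StyleGan2
    inputs:
      generatedFaces: StyleGan2.outputs.generatedFaces
    uses: container@v1
    with:
      args:
        - bash
        - -c
        - ls -la /inputs/generatedFaces
      image: bash:5
```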