# Autoscaling Configuration for Gradient Deployments

Paperspace Deployments are containers-as-a-service offerings that let you run container images and serve machine learning models behind a high-performance, low-latency service with a RESTful API. You can autoscale a Deployment so that it adapts to changes in its metrics.

Gradient autoscaling uses the [Kubernetes Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/), with defaults chosen to make it easy to scale a deployment up and down quickly.

Autoscaling scales the deployment up and down based on a chosen `metric`, a `summary` function, and a specified `value`. The number of running replicas for each deployment never falls below `replicas` or rises above `maxReplicas`. Scale-down is calculated over a 5-minute window: if your application is underutilized for 5 minutes, it scales down to the number of replicas required to handle the current load.

To change the autoscaling configuration, update the deployment spec through the Paperspace console or the CLI.

## Configure Autoscaling

Use the following parameters in the deployment spec to configure autoscaling:

- `enabled` (default: `true`): Turns autoscaling on or off.
- `maxReplicas`: The upper bound on the number of replicas the deployment can run. The deployment's active replica count always falls in the range between the value of `replicas` and `maxReplicas`.
- `metric`: The metric used to decide when to scale up or down.
- `summary`: The function used to summarize the metric when calculating scale events.
- `value`: The summarized metric value at which the deployment scales.

### Autoscaling Criteria

Multiple metrics can be used in the spec to determine when to scale. If you provide multiple metric blocks, the deployment calculates a proposed replica count for each metric and then scales to the highest of those counts.

The following metrics can be used:

| `metric` | `summary` | Description | Type |
|---|---|---|---|
| `cpu` | `average` | Average CPU utilization across all replicas (% of 100) | Integer |
| `memory` | `average` | Average memory utilization across all replicas (% of 100) | Integer |
| `requestDuration` | `average` | Average request duration over a 5-minute period across all IPs behind the proxy, in seconds (minimum of 10 milliseconds) | Float |

## Autoscaling Example

The following spec configures all of the metrics available for autoscaling:

```yaml
resources:
  replicas: 1
  ...
  autoscaling:
    enabled: true # toggle for enabling/disabling autoscaling
    maxReplicas: 3 # max replicas for autoscaling
    metrics:
      - metric: cpu
        summary: average
        value: 50 # 50% cpu utilization across all replicas
      - metric: memory
        summary: average
        value: 22 # 22% memory utilization across all replicas
      - metric: requestDuration
        summary: average
        value: 0.15 # 150 millisecond request duration for the endpoint
```

Suppose requests begin to come in and the average request duration over a 5-minute period exceeds 150 ms. The deployment scales up from 1 to 2 replicas. Over the next 5-minute interval the request duration is still above 150 ms, so the deployment scales to 3 replicas. Once request times fall back below 150 ms and stay there for the 5-minute stabilization period, the deployment begins to scale down.
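To make the scaling behavior above concrete, here is a minimal Python sketch of the replica calculation. It assumes Gradient follows the standard Kubernetes HPA proposal formula from the documentation linked above, `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`, takes the highest proposal across metrics, and clamps the result to the `replicas`/`maxReplicas` range. The function names and the observed metric values are illustrative, not part of the Gradient API.

```python
import math

def proposed_replicas(current_replicas: int, observed: float, target: float) -> int:
    """One metric's proposal, per the Kubernetes HPA formula:
    desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)."""
    return math.ceil(current_replicas * observed / target)

def desired_replicas(current_replicas: int,
                     observed: dict[str, float],
                     targets: dict[str, float],
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Take the highest proposal across all configured metrics, then clamp
    to the [replicas, maxReplicas] range from the deployment spec."""
    proposals = [
        proposed_replicas(current_replicas, observed[m], targets[m])
        for m in targets
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))

# Targets from the example spec; observed averages are hypothetical:
# 30% CPU, 20% memory, and a 225 ms average request duration.
targets = {"cpu": 50, "memory": 22, "requestDuration": 0.15}
observed = {"cpu": 30, "memory": 20, "requestDuration": 0.225}
print(desired_replicas(1, observed, targets, min_replicas=1, max_replicas=3))  # -> 2
```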
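Under the same assumptions, this also reproduces the second step of the walkthrough: with 2 replicas and a request duration still above target (say a hypothetical 240 ms), the `requestDuration` proposal is `ceil(2 * 0.240 / 0.15) = 4`, which the `maxReplicas: 3` bound clamps to 3. Once all observed values drop back below their targets for the 5-minute stabilization period, every proposal falls to the current load's requirement and the deployment scales back down, never below `replicas`.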