# Autoscaling Configuration for Gradient Deployments

Paperspace Deployments are containers-as-a-service offerings that let you run container images and serve machine learning models behind a high-performance, low-latency service with a RESTful API. You can autoscale a Deployment so that it adapts to changes in its metrics.

Gradient autoscaling uses the [Kubernetes Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/), with defaults chosen to make it easy to scale a deployment up and down quickly.

Autoscaling scales the deployment up and down based on a chosen `metric`, a `summary` function, and a specified `value`. The number of running replicas for each deployment never falls below `replicas` or rises above `maxReplicas`. Scale-down is calculated over a 5-minute window: if your application is underutilized for 5 minutes, it scales down to the number of replicas required to handle the current load.

To change the autoscaling configuration, update the deployment spec through the Paperspace console or the CLI.

## Configure Autoscaling

Use the following parameters in the deployment spec to configure autoscaling:

- `enabled` (default: `true`): Turns autoscaling on or off.
- `maxReplicas`: The upper bound on the number of replicas the deployment can run. The deployment's active replica count always falls in the range between the value of `replicas` and `maxReplicas`.
- `metric`: The metric used to decide when to scale up or down.
- `summary`: The function used to summarize the metric when calculating scale events.
- `value`: The summarized metric value at which the deployment scales.

### Autoscaling Criteria

Multiple metrics can be used in the spec to determine when to scale. If you provide multiple metric blocks, the deployment calculates a proposed replica count for each metric and then scales to the highest of those counts.

The following metrics can be used:

| `metric` | `summary` | Description | Type |
|---|---|---|---|
| `cpu` | `average` | Average CPU utilization across all replicas (% of 100) | Integer |
| `memory` | `average` | Average memory utilization across all replicas (% of 100) | Integer |
| `requestDuration` | `average` | Average request duration over a 5-minute period across all IPs behind the proxy, in seconds (minimum of 10 milliseconds) | Float |

## Autoscaling Example

The following spec configures all of the metrics available for autoscaling:

```yaml
resources:
  replicas: 1
  ...
  autoscaling:
    enabled: true # toggle for enabling/disabling autoscaling
    maxReplicas: 3 # max replicas for autoscaling
    metrics:
      - metric: cpu
        summary: average
        value: 50 # 50% cpu utilization across all replicas
      - metric: memory
        summary: average
        value: 22 # 22% memory utilization across all replicas
      - metric: requestDuration
        summary: average
        value: 0.15 # 150 millisecond request duration for the endpoint
```

Suppose requests begin to come in and the average request duration over a 5-minute period exceeds 150 ms. The deployment scales up from 1 to 2 replicas. Over the next 5-minute interval the request duration is still above 150 ms, so the deployment scales to 3 replicas. Once request times fall back below 150 ms and stay there for the 5-minute stabilization period, the deployment begins to scale down.
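To make the scaling behavior above concrete, here is a minimal Python sketch of the replica calculation. It assumes Gradient follows the standard Kubernetes HPA proposal formula from the documentation linked above, `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`, takes the highest proposal across metrics, and clamps the result to the `replicas`/`maxReplicas` range. The function names and the observed metric values are illustrative, not part of the Gradient API.

```python
import math

def proposed_replicas(current_replicas: int, observed: float, target: float) -> int:
    """One metric's proposal, per the Kubernetes HPA formula:
    desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)."""
    return math.ceil(current_replicas * observed / target)

def desired_replicas(current_replicas: int,
                     observed: dict[str, float],
                     targets: dict[str, float],
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Take the highest proposal across all configured metrics, then clamp
    to the [replicas, maxReplicas] range from the deployment spec."""
    proposals = [
        proposed_replicas(current_replicas, observed[m], targets[m])
        for m in targets
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))

# Targets from the example spec; observed averages are hypothetical:
# 30% CPU, 20% memory, and a 225 ms average request duration.
targets = {"cpu": 50, "memory": 22, "requestDuration": 0.15}
observed = {"cpu": 30, "memory": 20, "requestDuration": 0.225}
print(desired_replicas(1, observed, targets, min_replicas=1, max_replicas=3))  # -> 2
```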
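Under the same assumptions, this also reproduces the second step of the walkthrough: with 2 replicas and a request duration still above target (say a hypothetical 240 ms), the `requestDuration` proposal is `ceil(2 * 0.240 / 0.15) = 4`, which the `maxReplicas: 3` bound clamps to 3. Once all observed values drop back below their targets for the 5-minute stabilization period, every proposal falls to the current load's requirement and the deployment scales back down, never below `replicas`.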