Autoscaling Configuration for Gradient Deployments

Paperspace Deployments are containers-as-a-service that allow you to run container images and serve machine learning models using a high-performance, low-latency service with a RESTful API.


Autoscale your Deployment so that it adapts to changes in its metrics. Gradient autoscaling uses the Kubernetes Horizontal Pod Autoscaler, with defaults chosen so that the deployment scales up and down quickly.

Autoscaling scales the deployment up and down based on a chosen metric, a summary function, and a specified value. The number of running replicas for a deployment never falls below replicas or rises above maxReplicas.
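The bounds described above can be sketched with the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current * currentMetric / target). This is a minimal illustration, not Gradient's implementation: the function name desired_replicas and its parameters are hypothetical, and the real autoscaler also applies a tolerance and readiness checks that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas):
    """Hypothetical helper: HPA-style replica proposal, clamped to the spec bounds."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # Kubernetes HPA formula: ceil(currentReplicas * currentMetric / targetMetric)
    proposed = math.ceil(current_replicas * current_value / target_value)
    # Never go below `replicas` (min_replicas) or above `maxReplicas` (max_replicas).
    return max(min_replicas, min(max_replicas, proposed))


# CPU at 90% against a 50% target with 2 replicas proposes 4, capped at maxReplicas=3.
print(desired_replicas(2, 90, 50, min_replicas=1, max_replicas=3))  # 3
# CPU at 10% against a 50% target proposes 1, which is also the floor.
print(desired_replicas(2, 10, 50, min_replicas=1, max_replicas=3))  # 1
```

Here min_replicas stands in for the replicas field of the spec and max_replicas for maxReplicas.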

Scale-down is calculated over a 5-minute period. This means that if your application is underutilized for 5 minutes, it scales down to the number of replicas required to handle the current load.
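One way to picture the 5-minute scale-down period is the stabilization window used by the Kubernetes Horizontal Pod Autoscaler, which keeps the highest replica recommendation seen during the window before shrinking. The sketch below assumes that behavior; the class name ScaleDownStabilizer and the timestamps are hypothetical, and Gradient's internal logic may differ.

```python
from collections import deque


class ScaleDownStabilizer:
    """Hypothetical model of a scale-down stabilization window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def recommend(self, now, proposed_replicas):
        self._history.append((now, proposed_replicas))
        # Drop recommendations that are a full window old or older.
        while self._history and self._history[0][0] <= now - self.window_seconds:
            self._history.popleft()
        # Use the highest recent recommendation, so a brief dip does not scale down.
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer()
print(stabilizer.recommend(0, 3))    # 3: load is high
print(stabilizer.recommend(60, 1))   # 3: load dropped, but the window still holds 3
print(stabilizer.recommend(300, 1))  # 1: low for the whole window, so scale down
```

Scale-up is not delayed in this model: a higher proposal immediately becomes the maximum in the window.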

To change the autoscaling configuration, update the spec through the Paperspace console or CLI.

Configure Autoscaling

Use the following parameters in the deployment spec to configure autoscaling:

  • enabled (default: true): Turns autoscaling on or off.

  • maxReplicas: The upper bound on the number of replicas the deployment can run. The number of active replicas always stays between the value of replicas and maxReplicas.

  • metric: The metric used to decide when to scale up or down.

  • summary: The function used to summarize the metric when calculating scale events.

  • value: The target for the summarized metric. The deployment scales up when the summarized metric exceeds this value.

Autoscaling Criteria

Multiple metrics can be used in the spec to determine when to scale. If you provide multiple metric blocks, the deployment calculates a proposed replica count for each metric and then scales to the highest of those counts.
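The "highest proposal wins" rule can be sketched as follows. This is an illustration only: replicas_for_metrics is a hypothetical helper, the per-metric proposal uses the Kubernetes Horizontal Pod Autoscaler formula as an assumption, and the observed readings are made up. The clamp to replicas and maxReplicas is left out to keep the focus on combining metrics.

```python
import math


def replicas_for_metrics(current_replicas, observed, targets):
    """Hypothetical helper: propose a replica count per metric and keep the highest."""
    proposals = {
        name: math.ceil(current_replicas * observed[name] / target)
        for name, target in targets.items()
    }
    return max(proposals.values()), proposals


# Targets per metric (cpu and memory in percent, requestDuration in seconds).
targets = {"cpu": 50, "memory": 22, "requestDuration": 0.15}
# Made-up readings: only memory is above its target.
observed = {"cpu": 40, "memory": 30, "requestDuration": 0.12}

chosen, proposals = replicas_for_metrics(2, observed, targets)
print(proposals)  # {'cpu': 2, 'memory': 3, 'requestDuration': 2}
print(chosen)     # 3
```

A single metric over its target is enough to scale up, because the largest proposal is the one applied.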

The following metrics can be used:

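| metric          | summary | Description                                                                                                              | Type    |
|-----------------|---------|--------------------------------------------------------------------------------------------------------------------------|---------|
| cpu             | average | Average CPU utilization across each replica (% of 100)                                                                   | Integer |
| memory          | average | Average memory utilization across each replica (% of 100)                                                                | Integer |
| requestDuration | average | Average request duration over a 5-minute period across all IPs behind the proxy (seconds). Minimum of 10 milliseconds.   | Float   |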

Autoscaling Example

The following spec configures all metrics available for autoscaling:

resources:
  replicas: 1
  ...
  autoscaling:
    enabled: true # toggle for enabling/disabling autoscaling
    maxReplicas: 3 # max replicas for autoscaling
    metrics:
      - metric: cpu
        summary: average
        value: 50 # 50% cpu utilization across all replicas
      - metric: memory
        summary: average
        value: 22 # 22% memory utilization across all replicas
      - metric: requestDuration
        summary: average
        value: 0.15 # 150 millisecond request duration for the endpoint

Suppose that, as requests begin to come through, the average request duration over a 5-minute period is greater than 150 ms. As a result, the deployment scales up from 1 to 2 replicas. Over the next 5-minute interval, the request duration is still longer than 150 ms, so the deployment scales to 3 replicas. Once request times have fallen below 150 ms and the 5-minute stabilization period has passed, the deployment begins to scale down.
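The scenario above can be traced with a small simulation. It is a sketch under stated assumptions: the trace function is hypothetical, the replica proposal uses the Kubernetes Horizontal Pod Autoscaler formula, each reading stands for one 5-minute interval, and the duration values are invented to reproduce the 1 to 2 to 3 progression.

```python
import math

TARGET_SECONDS = 0.15              # value of the requestDuration metric in the spec
MIN_REPLICAS, MAX_REPLICAS = 1, 3  # replicas and maxReplicas in the spec


def trace(avg_durations, start_replicas=MIN_REPLICAS):
    """Return the replica count after each 5-minute interval (illustrative only)."""
    replicas, history = start_replicas, []
    for duration in avg_durations:
        # HPA-style proposal, then clamp to the configured bounds.
        proposed = math.ceil(replicas * duration / TARGET_SECONDS)
        replicas = max(MIN_REPLICAS, min(MAX_REPLICAS, proposed))
        history.append(replicas)
    return history


# Hypothetical average request durations (seconds), one per 5-minute interval.
print(trace([0.25, 0.20, 0.08, 0.04]))  # [2, 3, 2, 1]
```

The first two readings are above 150 ms and drive the deployment up to maxReplicas; the last two are below it, so the replica count steps back down toward replicas.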