Pachyderm

Pachyderm is for data science teams who want to operationalize the data tasks in their ML lifecycle to iterate on data more quickly and reliably.

Pachyderm is the leader in data versioning and pipelines for MLOps. We provide the data foundation that allows data science teams to automate and scale their machine learning lifecycle while guaranteeing reproducibility. Unlike other data versioning and pipeline products, Pachyderm provides data-driven automation, petabyte scalability, and end-to-end reproducibility.

Primary Benefits

  • Data-Driven Automation — Automate and unify your MLOps toolchain with data-driven pipelines and automated data versioning for increased productivity and reduced risk
  • Petabyte Scalability — Rapidly process the largest unstructured and structured data sets with automatic parallel and incremental processing that requires no code changes
  • End-to-End Reproducibility — Iterate quickly while still meeting audit and data governance requirements through end-to-end reproducibility and immutable data lineage

Key Features

  • Automated Data Versioning — Pachyderm’s Data Versioning gives teams an automated and performant way to keep track of all data changes

    • Utilizes a Git-like structure that enables effective team collaboration through commits, branches and rollbacks
    • Powerful content-based deduplication reduces the cost of storing and accessing large data sets
    • File-based versioning provides a complete audit trail for all data and artifacts across pipeline stages including intermediate results
    • Stored as native objects (not metadata pointers) so that versioning is automated and guaranteed
  • Data-Driven Pipelines — Pachyderm’s Containerized Pipelines speed data processing while lowering compute costs

    • Kubernetes-native approach supports any library or language
    • Autoscale with parallel processing of data without writing additional code
    • Automated pipelines execute whenever new data is committed
    • Incremental processing saves compute by only processing differences and automatically skipping duplicate data
    • Pipeline steps have JSON/YAML defined inputs and outputs that ease debugging
  • Immutable Data Lineage — Pachyderm’s Data Lineage provides an immutable record for all activities and assets in the ML lifecycle

    • Track every version of your code, models, and data
    • Maintain reproducibility of data and code for compliance
    • Manage relationships between historical data states
    • Pachyderm’s Global IDs make it easy for teams to track any result all the way back to its raw input, including all analysis, parameters, code, and intermediate results.
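
To make the versioning and pipeline features above more concrete, here is a minimal sketch of a JSON pipeline definition and the pachctl calls that would use it. The repo name images, the pipeline name edges, the container image, the script path /edges.py, and the file sample.png are illustrative assumptions and are not part of this listing. The field names follow the Pachyderm 2.x pipeline specification; check them against the documentation for your version.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a minimal pipeline spec. Every name in it is a hypothetical example.
cat > edges.json <<'EOF'
{
  "pipeline": { "name": "edges" },
  "description": "Example: process every top-level file in the images repo.",
  "input": {
    "pfs": { "repo": "images", "glob": "/*" }
  },
  "transform": {
    "image": "pachyderm/opencv",
    "cmd": ["python3", "/edges.py"]
  }
}
EOF

# Only talk to a cluster when pachctl is installed and connected.
if command -v pachctl >/dev/null 2>&1; then
  pachctl create repo images              # versioned input repository
  pachctl create pipeline -f edges.json   # runs whenever images receives a commit
  # Replace ./sample.png with a real local file; adding it creates a commit.
  pachctl put file images@master:/sample.png -f ./sample.png
  pachctl list commit images@master       # inspect the resulting commit history
fi
```

The glob pattern /* treats each top-level file as a separate unit of work, which is what allows files to be processed in parallel and unchanged files to be skipped on later commits.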

Note:

Pachyderm recommends using 4 nodes of the $20/month plan (4GB RAM / 2vCPU)

The minimum requirement is 2 nodes of the $10/month plan (2GB RAM / 1vCPU)

Software Included

Package                Version   License
Pachyderm              2.4.4     Community License
etcd                   v3.5.5    Apache License 2.0
pachyderm/postgresql   13.3.0    Apache License 2.0
pachyderm/pgbouncer    1.16.2    Apache License 2.0
envoyproxy/envoy       v1.22.0   Apache License 2.0

Creating an App using the Control Panel

Click the Deploy to DigitalOcean button to install a Kubernetes 1-Click Application. If you aren’t logged in, this link will prompt you to log in with your DigitalOcean account.

Deploy to DO

Creating an App using the API

In addition to creating Pachyderm using the control panel, you can also use the DigitalOcean API. As an example, to create a 3-node DigitalOcean Kubernetes cluster made up of Basic Droplets in the SFO2 region, you can use the following doctl command. You need to authenticate doctl with your API access token and replace the $CLUSTER_NAME variable in the command below with the name you have chosen for your cluster.

doctl kubernetes clusters create $CLUSTER_NAME --region sfo2 --count 3 --size s-4vcpu-8gb --1-clicks pachyderm
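
The full sequence (authenticate, pick a name, create the cluster) can be sketched as a small script. The cluster name pachyderm-demo is a placeholder, and the region and node count are assumptions chosen to match the 3-node SFO2 example described above. The script only prints the command by default, since creating a cluster starts billing for its nodes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder cluster name; export CLUSTER_NAME to override it.
CLUSTER_NAME="${CLUSTER_NAME:-pachyderm-demo}"
# Dry run by default: print the command instead of provisioning billable nodes.
# Run with DRY_RUN= (empty) to execute it for real.
DRY_RUN="${DRY_RUN-1}"

# $1 = cluster name, $2 = region slug, $3 = node size slug, $4 = node count
create_cluster() {
  ${DRY_RUN:+echo} doctl kubernetes cluster create "$1" \
    --region "$2" --size "$3" --count "$4" --1-clicks pachyderm
}

# Before a real run, authenticate once (prompts for your API access token):
#   doctl auth init
# and check which slugs are valid for your account:
#   doctl kubernetes options regions
#   doctl kubernetes options sizes

create_cluster "$CLUSTER_NAME" sfo2 s-4vcpu-8gb 3
```

Reviewing the printed command first makes it easy to confirm the region and node size before anything is provisioned.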

Getting Started After Deploying Pachyderm

Prerequisites

Please install the following:

  • pachctl, the command-line tool that lets you interact with Pachyderm. It is a client-side tool that must be installed on your local machine.
  • kubectl, the Kubernetes command-line tool, which lets you run commands against Kubernetes clusters.
  • doctl, DigitalOcean’s command-line client for its API, which supports most functionality found in the control panel.

If you wish to customize your Pachyderm configuration, you will need to install Helm. You can find the full list of Helm Chart values here. To upgrade Pachyderm with your new values, follow the instructions here.
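
A customization pass with Helm might look like the sketch below. The release name pachyderm, the chart reference pachyderm/pachyderm from the https://helm.pachyderm.com repository, and the chart version are assumptions about how this 1-Click App was installed; confirm the actual release name with helm list -n pachyderm before upgrading. The file values.yaml is a hypothetical file holding your overrides.

```shell
#!/usr/bin/env bash
set -euo pipefail

NAMESPACE="pachyderm"
RELEASE="${RELEASE:-pachyderm}"          # assumption: verify with `helm list`
CHART_VERSION="${CHART_VERSION:-2.4.4}"  # assumption: same as the listed Pachyderm version

upgrade_pachyderm() {
  helm repo add pachyderm https://helm.pachyderm.com
  helm repo update
  # Save the values currently applied so the change can be reviewed or reverted.
  helm get values "$RELEASE" -n "$NAMESPACE" -o yaml > current-values.yaml
  # Apply the overrides from values.yaml (a hypothetical file you create).
  helm upgrade "$RELEASE" pachyderm/pachyderm -n "$NAMESPACE" \
    --version "$CHART_VERSION" -f values.yaml
}

if command -v helm >/dev/null 2>&1 && [ -f values.yaml ]; then
  helm list -n "$NAMESPACE"   # shows the real release name and chart version
  upgrade_pachyderm
else
  echo "helm or values.yaml not found; nothing was changed" >&2
fi
```

Saving the current values first gives you a file to diff against your overrides and to fall back on if the upgrade misbehaves.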

This installation is NOT designed to be a production environment. It is meant to help you learn and experiment quickly with Pachyderm.


To connect kubectl to your Kubernetes cluster:

  • Follow the instructions found in step 2 (Connecting to Kubernetes) of the Getting Started section for your Kubernetes cluster.

OR

  • Find your Cluster ID and run the following command:

doctl kubernetes cluster kubeconfig save <Insert Cluster ID Here>

Pachyderm will already be installed in your cluster in the pachyderm namespace.

To confirm, run:

kubectl get pods -n pachyderm

Your output should look like this:

NAME                                         READY   STATUS    RESTARTS   AGE
console-d56d7b7f6-j5lvc                      1/1     Running   0          10m
etcd-0                                       1/1     Running   0          10m
pachd-76f7d5455-jk2lj                        1/1     Running   0          10m
pachyderm-kube-event-tail-54f6759474-tnf8q   1/1     Running   0          10m
pachyderm-loki-0                             1/1     Running   0          10m
pachyderm-promtail-7drcc                     1/1     Running   0          10m
pachyderm-promtail-k846r                     1/1     Running   0          10m
pachyderm-promtail-ltdx4                     1/1     Running   0          10m
pachyderm-proxy-fff6dc868-qcxk4              1/1     Running   0          10m
pg-bouncer-57869fc46c-pgqz5                  1/1     Running   0          10m
postgres-0                                   1/1     Running   0          10m
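
If the pods are still starting, you can wait for them instead of polling by hand. The following is a minimal sketch: the helper name all_pods_ready and the five-minute timeout are assumptions for illustration, and the helper simply checks the READY and STATUS columns of the table shown above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read `kubectl get pods` output on stdin. Succeed only if at least one pod is
# listed and every pod is Running with all of its containers ready (e.g. 1/1).
all_pods_ready() {
  awk 'NR > 1 {
         split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) bad++
         total++
       }
       END { if (total == 0 || bad > 0) exit 1 }'
}

if command -v kubectl >/dev/null 2>&1; then
  # Block until every pod reports Ready, or give up after five minutes.
  kubectl wait --for=condition=Ready pods --all -n pachyderm --timeout=300s
  kubectl get pods -n pachyderm | all_pods_ready && echo "Pachyderm pods are ready"
fi
```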

The last step is to connect pachctl to your cluster, which takes three commands. First, look up the external IP address of the pachyderm-proxy service:

kubectl get service pachyderm-proxy -n pachyderm

Once an IP address is listed under EXTERNAL-IP, run the following command using that IP:

echo '{"pachd_address": "grpc://<IP ADDRESS HERE>:80"}' | pachctl config set context digitalocean
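
The lookup and the context setup can also be scripted so the IP address does not have to be copied by hand. This is a sketch under a few assumptions: the helper name pachd_context_json is made up for illustration, the load balancer is assumed to report an IP rather than a hostname, and the final active-context call is an extra step that is only needed if digitalocean is not already your active context.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build the context JSON that `pachctl config set context` reads from stdin.
pachd_context_json() {
  printf '{"pachd_address": "grpc://%s:80"}\n' "$1"
}

if command -v kubectl >/dev/null 2>&1 && command -v pachctl >/dev/null 2>&1; then
  # Read the load balancer IP directly instead of copying it from the table.
  ip="$(kubectl get service pachyderm-proxy -n pachyderm \
        -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
  if [ -z "$ip" ]; then
    echo "EXTERNAL-IP is still pending; retry in a minute" >&2
    exit 1
  fi
  pachd_context_json "$ip" | pachctl config set context digitalocean
  # Assumption: switch to the new context if it is not already active.
  pachctl config set active-context digitalocean
fi
```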

To verify your connection, run:

pachctl version

You should see the following output:

COMPONENT           VERSION             
pachctl             2.4.4               
pachd               2.4.4
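
If you want to check this output in a script, the sketch below compares the two rows. The helper name versions_match is hypothetical; it only parses the two-column table shown above and fails when the pachd row is missing, which is typically what you see when pachctl cannot reach the cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read `pachctl version` output on stdin. Succeed only if both rows are present
# and report the same version string.
versions_match() {
  awk '$1 == "pachctl" { client = $2 }
       $1 == "pachd"   { server = $2 }
       END { if (client == "" || (client "") != (server "")) exit 1 }'
}

if command -v pachctl >/dev/null 2>&1; then
  if pachctl version | versions_match; then
    echo "pachctl and pachd versions match"
  else
    echo "pachd is unreachable or its version differs from pachctl" >&2
  fi
fi
```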

You can also access the web interface at http://<IP ADDRESS HERE>


You are now ready to start the Beginner Tutorial!