Pachyderm is the leader in data versioning and pipelines for MLOps. We provide the data foundation that allows data science teams to automate and scale their machine learning lifecycle while guaranteeing reproducibility. Unlike other data versioning and pipeline products, Pachyderm provides data-driven automation, petabyte scalability, and end-to-end reproducibility.
Automated Data Versioning: Pachyderm’s Data Versioning gives teams an automated and performant way to keep track of all data changes.
- Uses a Git-like structure that enables effective team collaboration through commits, branches, and rollbacks.
Data-Driven Pipelines: Pachyderm’s Containerized Pipelines speed data processing while lowering compute costs.
- Kubernetes-native approach supports any library or language.
Immutable Data Lineage: Pachyderm’s Data Lineage provides an immutable record for all activities and assets in the ML lifecycle.
- Track every version of your code, models, and data.
Note:
Pachyderm recommends using 4 nodes of the $20/month plan (4 GB RAM / 2 vCPU).
The minimum requirement is 2 nodes of the $10/month plan (2 GB RAM / 1 vCPU).
Package | Version | License |
---|---|---|
Pachyderm | 2.4.4 | Community License |
etcd | v3.5.5 | Apache License 2.0 |
pachyderm/postgresql | 13.3.0 | Apache License 2.0 |
pachyderm/pgbouncer | 1.16.2 | Apache License 2.0 |
envoyproxy/envoy | v1.22.0 | Apache License 2.0 |
Click the Deploy to DigitalOcean button to install a Kubernetes 1-Click Application. If you aren’t logged in, this link will prompt you to log in with your DigitalOcean account.
In addition to creating Pachyderm using the control panel, you can also use the DigitalOcean API. As an example, to create a 3-node DigitalOcean Kubernetes cluster made up of Basic Droplets in the SFO2 region, you can use the following doctl command. You need to authenticate doctl with your API access token and replace the $CLUSTER_NAME variable with the chosen name for your cluster in the command below.
doctl kubernetes clusters create --size s-4vcpu-8gb $CLUSTER_NAME --1-clicks pachyderm
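The command above passes only a Droplet size, so the region and node count are left to doctl's defaults. As a rough sketch, the helper below prints a variant that spells out both values using the `--region` and `--count` flags of `doctl kubernetes cluster create`. The function name `build_create_cmd` and the fallback cluster name are hypothetical, and the default values simply mirror the example in the text; adjust them to your own plan.

```shell
#!/bin/sh
# Sketch only: prints (does not run) a doctl command for a cluster with the
# Pachyderm 1-Click App. build_create_cmd is a hypothetical helper; the
# default region, node count, and size are illustrative assumptions.
build_create_cmd() {
  name="$1"
  region="${2:-sfo2}"
  count="${3:-3}"
  size="${4:-s-4vcpu-8gb}"
  printf 'doctl kubernetes cluster create %s --region %s --count %s --size %s --1-clicks pachyderm\n' \
    "$name" "$region" "$count" "$size"
}

# Review the generated command before running it yourself.
build_create_cmd "${CLUSTER_NAME:-my-pachyderm-cluster}"
```

Printing the command instead of executing it keeps the sketch safe to try, since creating a cluster incurs cost.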
Please install the following:
- pachctl, the command-line tool that lets you interact with Pachyderm. It is a client-side tool and needs to be installed on your local machine.
- kubectl, the Kubernetes command-line tool, which allows you to run commands against Kubernetes clusters.
- doctl, DigitalOcean’s client API tool, which supports most functionality found in the control panel.
If you wish to customize your Pachyderm configuration, you will need to install Helm. The full list of Helm chart values and the instructions for upgrading Pachyderm with your new values are available in the Pachyderm documentation.
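Before continuing, it can help to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch using the standard `command -v` shell builtin; `missing_tools` is a hypothetical helper name, and helm is included only because it is needed for customization.

```shell
#!/bin/sh
# Sketch: report which of the given command-line tools are not on PATH.
# missing_tools is a hypothetical helper name, not part of any of these CLIs.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# helm is optional unless you plan to customize the deployment.
missing="$(missing_tools pachctl kubectl doctl helm)"
if [ -n "$missing" ]; then
  echo "Not found on PATH:"
  echo "$missing"
else
  echo "All tools found."
fi
```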
To connect kubectl to your Kubernetes cluster, save the cluster's kubeconfig with doctl:
doctl kubernetes cluster kubeconfig save <Insert Cluster ID Here>
Pachyderm will already be installed in your cluster in the pachyderm namespace.
To confirm, run:
kubectl get pods -n pachyderm
Your output should look like this:
NAME                                         READY   STATUS    RESTARTS   AGE
console-d56d7b7f6-j5lvc                      1/1     Running   0          10m
etcd-0                                       1/1     Running   0          10m
pachd-76f7d5455-jk2lj                        1/1     Running   0          10m
pachyderm-kube-event-tail-54f6759474-tnf8q   1/1     Running   0          10m
pachyderm-loki-0                             1/1     Running   0          10m
pachyderm-promtail-7drcc                     1/1     Running   0          10m
pachyderm-promtail-k846r                     1/1     Running   0          10m
pachyderm-promtail-ltdx4                     1/1     Running   0          10m
pachyderm-proxy-fff6dc868-qcxk4              1/1     Running   0          10m
pg-bouncer-57869fc46c-pgqz5                  1/1     Running   0          10m
postgres-0                                   1/1     Running   0          10m
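Scanning the list by eye works for a dozen pods, but the check can also be scripted. The sketch below filters `kubectl get pods` style output for any pod whose STATUS is not Running. `pods_not_running` is a hypothetical helper, it assumes the default column layout shown above, and the Pending row in the canned input is invented for illustration.

```shell
#!/bin/sh
# Sketch: list pods whose STATUS column is not "Running".
# pods_not_running is a hypothetical helper; it reads `kubectl get pods`
# style text on stdin and assumes NAME is column 1 and STATUS is column 3.
pods_not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Example with canned input (the Pending row is invented for illustration).
pods_not_running <<'EOF'
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 0 10m
pachd-76f7d5455-jk2lj 0/1 Pending 0 1m
EOF
# prints: pachd-76f7d5455-jk2lj
```

Against a live cluster the same helper would be fed with `kubectl get pods -n pachyderm | pods_not_running`; empty output means every pod reports Running.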
The last step is to connect pachctl to your cluster. Start by looking up the pachyderm-proxy service:
kubectl get service pachyderm-proxy -n pachyderm
Once an IP address is listed under EXTERNAL-IP, run the following command using that IP:
echo '{"pachd_address": "grpc://<IP ADDRESS HERE>:80"}' | pachctl config set context digitalocean
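The JSON passed to pachctl is easy to mistype, and the EXTERNAL-IP column may still read `<pending>` while the load balancer is being provisioned. As a minimal sketch, the hypothetical helper `pachd_context_json` below builds the same JSON from an IP address and refuses an empty or pending value. The commented lines show one way it might be wired to kubectl and pachctl; they assume the load balancer exposes an IP rather than a hostname, and that the new context still has to be made active with `pachctl config set active-context`.

```shell
#!/bin/sh
# Sketch: build the context JSON that `pachctl config set context` reads on stdin.
# pachd_context_json is a hypothetical helper; port 80 and the grpc:// scheme
# are taken from the command above.
pachd_context_json() {
  case "$1" in
    ""|"<pending>")
      echo "no external IP assigned yet" >&2
      return 1
      ;;
  esac
  printf '{"pachd_address": "grpc://%s:80"}\n' "$1"
}

# 203.0.113.10 is a documentation-only placeholder address.
pachd_context_json "203.0.113.10"
# prints: {"pachd_address": "grpc://203.0.113.10:80"}

# Against a live cluster (assumes the load balancer exposes an IP, not a hostname):
#   ip="$(kubectl get service pachyderm-proxy -n pachyderm \
#        -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
#   pachd_context_json "$ip" | pachctl config set context digitalocean
#   pachctl config set active-context digitalocean
```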
To verify your connection, run: pachctl version
You should see the following output:
COMPONENT           VERSION
pachctl             2.4.4
pachd               2.4.4
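If you want to script this check, the sketch below compares the two rows of the version output and succeeds only when the client and server versions are identical. `versions_match` is a hypothetical helper, and it assumes the two-column layout shown above.

```shell
#!/bin/sh
# Sketch: succeed only if the pachctl and pachd rows report the same version.
# versions_match is a hypothetical helper that reads `pachctl version`
# style text on stdin (component name in column 1, version in column 2).
versions_match() {
  awk '$1 == "pachctl" { client = $2 }
       $1 == "pachd"   { server = $2 }
       END { exit !(client != "" && client == server) }'
}

# Canned input copied from the expected output above.
if printf 'COMPONENT VERSION\npachctl 2.4.4\npachd 2.4.4\n' | versions_match; then
  echo "client and server versions match"
fi
```

With a working connection, the live equivalent would be `pachctl version | versions_match`. A missing pachd row, which is what you would expect when the server cannot be reached, makes the check fail.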
You can also access the web interface at http://<IP ADDRESS HERE>