Many organizations have HPC clusters with large pools of compute and GPU resources. Tapping those resources for AI/ML workloads can be cumbersome: researchers, students, and individual employees assemble their own hand-built plug-ins, open source libraries, and tools, duplicating cost and effort while reducing collaboration.
Moreover, AI/ML models often need the traceability, lineage, and governance demanded by regulatory or safety bodies in a given industry or country. Commercial MLOps platforms provide those capabilities, but they were not built to take advantage of HPC compute and GPU resources.
With DKube you can offload your data pre-processing or AI training jobs to a vSphere-based Slurm cluster, either as individual jobs/runs or as part of pipelines. Full traceability, lineage, and logging of the work being performed is maintained in a SQL database. Multiple HPC clusters can be attached while the control plane of the DKube MLOps platform runs on a Kubernetes cluster such as VMware Tanzu, giving you all the core innovations of Kubeflow and MLflow.
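For readers less familiar with how a job lands on a Slurm cluster, here is a minimal sketch of what a submission looks like, using the standard sbatch command. This is purely illustrative of the kind of work DKube dispatches on your behalf; the script name, resource requests, and training command are hypothetical, and this is not DKube's own API.

```python
import subprocess

# Hypothetical example: submit a GPU training job to a Slurm cluster.
# This mimics the kind of job DKube would dispatch; it is not DKube's API.
batch_script = """#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --gres=gpu:1            # request one GPU
#SBATCH --time=02:00:00         # two-hour wall-clock limit
#SBATCH --output=train_%j.log   # per-job log file (%j = job ID)

python train.py --epochs 10     # train.py is a placeholder training script
"""

with open("train_job.sbatch", "w") as f:
    f.write(batch_script)

# On success, sbatch prints "Submitted batch job <id>" to stdout.
result = subprocess.run(
    ["sbatch", "train_job.sbatch"], capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```

With DKube, steps like this are generated and tracked for you, so each run's inputs, outputs, and logs feed back into the platform's lineage records instead of living in ad hoc scripts.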
Please click here to receive a link to the recording in your email inbox.
There's a faster way to go from research to application. Find out how an MLOps workflow can benefit your teams.