MLOps on HPC/Slurm with Kubeflow
Over the last decade, enterprises have invested heavily in High Performance Computing (HPC) to solve complex scientific problems, using Slurm to schedule massively parallel jobs on large clusters of compute nodes with accelerated hardware. Deep learning model training relies on similar hardware, so enterprises are looking for ways to develop AI/ML models on top of their existing HPC infrastructure. A recent trend in AI/ML is to use agile MLOps methodologies to productionize models quickly. Marrying the two, MLOps-driven AI/ML development and HPC/Slurm clusters, lets enterprises apply these practices on infrastructure they already run, which should speed adoption of both. This article describes how to combine two popular open-source frameworks, Slurm and Kubeflow, to run AI/ML workloads at scale on HPC clusters.