Getting Started

Overview

This guide describes the process of installing and managing DKube. The tool:

  • Ensures that the target system is accessible and has the right software prerequisites

  • Installs DKube on the cluster

  • Manages DKube on the cluster after installation

After installation, the management capabilities include:

  • Backing up and restoring DKube (see Backup and Restore)

  • Stopping and starting DKube (see Stop and Start DKube)

  • Upgrading DKube (see Upgrade DKube Version)

DKube Configuration

The cluster can include one or more master nodes and optional worker nodes.

  • The Master node coordinates the cluster, and can optionally contain GPUs

  • Each Worker node provides additional resources, expanding the capacity of the cluster

The Master node must always be running for the cluster to be active. Worker nodes can be added and removed, and the cluster will continue to operate. This is described in the section Restarting DKube After Cluster Restart.

Installation Configuration

The installation scripts can be run:

  • From the master node in the cluster, or

  • From a remote node that is not part of the cluster

The overall flow of installation is as follows:

  • Copy the installation scripts and associated files to the installation node (master node or remote node)

  • Ensure that the installation node has passwordless access to all of the DKube cluster nodes

  • Edit the installation ini files with the appropriate options

  • Install DKube and its required software components

  • Access DKube through a browser

The figures below show the 2 possible configurations for installation. The only requirement is that the installation node (either the master node on the cluster or a remote node) must have passwordless access to all of the nodes in the cluster. This is discussed in more detail in the sections on installation.

Master Node Installation

[Figure: Installation block diagram, local (master node) installation]

In a local installation, the scripts are run from the master node of the DKube cluster.

Important

Even if the installation is executed from the master node on the cluster, the ssh public key still needs to be added to the appropriate file on all of the nodes on the cluster, including the master node. This is explained in the section on passwordless ssh key security.
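For illustration only, one common way to set up passwordless access is to copy the public key to each cluster node with ssh-copy-id; the user name, key path, and node address below are placeholders:

# Append the public key to authorized_keys on each node, including the master
ssh-copy-id -i <path-to-public-key> <user>@<node-ip>
# Verify that the node can now be reached without a password prompt
ssh <user>@<node-ip> hostname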

Remote Node Installation

[Figure: Installation block diagram, remote node installation]

DKube and its associated applications can be installed and managed from a remote node that is not part of the DKube cluster. The installation node needs to have passwordless access to all of the nodes on the DKube cluster.

DKube and Kubernetes

DKube requires Kubernetes to operate. This guide assumes that a supported version of Kubernetes has been installed on the cluster.

Prerequisites

Supported Platforms

The following platforms are supported for DKube:

  • Installation platform can be any node running:

      • Ubuntu 18.04

      • CentOS 7.9

  • Kubernetes 1.18

  • Cluster nodes can include one of the following:

      • On-prem (bare metal or VM)

      • Google GCP

      • Amazon AWS

      • Amazon EKS

      • Rancher 2.4

Note

Not all combinations of provider and OS are supported. Additional platforms are being released continually and are described in application notes.

The DKube installation scripts handle most of the work in getting DKube installed, including the installation of the software packages on the cluster. There are some prerequisites for each node, described below.

Node Requirements

Installation Node Requirements

The installation node has the following requirements:

  • A supported operating system

  • Docker CE

Docker Installation on Ubuntu

The following commands can be used to install Docker on Ubuntu:

sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common -y
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce -y

Docker Installation on CentOS

The following commands can be used to install Docker on CentOS:

sudo yum install -y yum-utils device-mapper-persistent-data lvm2
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce-18.09.2-3.el7 docker-ce-cli-18.09.2-3.el7 containerd.io
sudo systemctl start docker
sudo systemctl enable docker
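To confirm that Docker is installed and running on the installation node, a quick check such as the following can be used:

sudo docker --version
sudo docker run --rm hello-world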

DKube Cluster Node Requirements

The DKube Cluster nodes have the following requirements:

  • A supported operating system

  • All nodes should have static IP addresses, even if the node is a VM on a cloud

  • Node names must be lower case

  • All nodes must be on the same subnet

  • All nodes must have the same user name and ssh key

Each node on the cluster should have the following minimum resources:

  • 16 CPU cores

  • 64GB RAM

  • Storage: at least 200GB; the required size depends on the programs and datasets used and should be large enough to hold the necessary data
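As a quick sanity check, the CPU, memory, and available storage on a node can be inspected with standard Linux commands, for example:

# Number of CPU cores
nproc
# Total memory
free -h
# Available disk space
df -h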

Important

Only GPUs of the exact same type can be installed on a node. For example, you cannot mix an NVIDIA V100 and a P100 on the same node. Even GPUs of the same class must have the same configuration (e.g. memory).
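If the NVIDIA driver and the nvidia-smi utility are already present on a node, the installed GPUs can be listed to confirm that they are all of the same type and configuration:

nvidia-smi -L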

Important

The Nouveau driver should not be installed on any of the nodes in the cluster. If the driver is installed, you can follow the instructions in the section Removing Nouveau Driver.
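A quick way to check whether the Nouveau driver is currently loaded on a node:

# No output means the Nouveau driver is not loaded
lsmod | grep nouveau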

Access to the Cluster

In order to run DKube both during and after installation, a minimum level of network access must be provided from any system that needs to use the cluster. This includes access to the URL used to open DKube from a browser.

Protocol   Port Range   Source
--------   ----------   --------------
TCP        30002        Access IP
TCP        32222        Access IP
TCP        32223        Access IP
TCP        32323        Access IP
TCP        6443         Access IP
TCP        443          Access IP
TCP        22           Access IP
All        0-65535      Private Subnet
ICMP       0-65535      Access IP

The source IP access range is in CIDR format. It consists of an IP address and mask combination. For example:

  • 192.168.100.14/24 would allow IP addresses in the range 192.168.100.x

  • 192.168.100.14/16 would allow IP addresses in the range 192.168.x.x

Note

The source IP address 0.0.0.0/0 can be used to allow access from any browser client. If this is used, then a firewall can be used to enable appropriate access.
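For illustration, on a node that uses the ufw firewall (ufw is not a DKube requirement), rules similar to the following would restrict two of the ports above to a specific subnet:

# Allow TCP port 30002 only from the 192.168.100.0/24 subnet
sudo ufw allow from 192.168.100.0/24 to any port 30002 proto tcp
# Allow the Kubernetes API server port 6443 from the same subnet
sudo ufw allow from 192.168.100.0/24 to any port 6443 proto tcp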

For specific platform types, additional or different steps must be taken:

  • GCP: Google GCP System Installation

  • AWS: Amazon AWS System Installation

Cluster and DKube Resiliency

For highly available operation, DKube supports multi-node resiliency (HA). An HA system prevents any single point of failure through redundant operation. Resilient operation requires at least 3 nodes. There are 2 independent types of resiliency: cluster and DKube. The details of how to configure DKube for resilient operation are provided in the pertinent sections that explain how to complete the ini files. Cluster resiliency is specific to the Kubernetes installation.

Cluster Resiliency

Cluster resiliency provides the ability of Kubernetes to offer a highly available control plane. Since the master node in a k8s system manages the cluster, cluster resiliency is enabled by having 3 master nodes. There can be any number of worker nodes. In a resilient cluster, a load balancer monitors the health of pods running on all of the master nodes. If a pod goes down, requests are automatically sent to pods running on other master nodes. In such a system, any number of worker nodes can go down and the cluster remains usable, but only a single master node can go down and still have the system continue.

Note

Since the master node manages the cluster, for the best resiliency it is advisable to not install any GPUs on the master nodes, and to prevent any DKube-related pods from being scheduled on them. It is up to the user to ensure that the cluster is resilient. Depending upon the type of k8s, the details will vary.

DKube Resiliency

DKube resiliency is independent of - and can be enabled with or without - cluster resiliency. If the storage is installed by DKube, resiliency ensures that the storage and databases for the application have redundancy built in. This prevents an issue with a single node from corrupting the DKube operation. Externally configured storage is not part of DKube resiliency. For DKube resiliency to function, there must be at least 3 schedulable nodes. That is, 3 nodes that allow DKube pods to be scheduled on them. The nodes can be master nodes or worker nodes in any combination.

In order to enable DKube resiliency, the HA option must be set to “true” in the dkube.ini file.
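As a sketch only (the section where the option appears in dkube.ini depends on the DKube version and is described later in this guide), the setting looks like:

# Enable DKube resiliency; requires at least 3 schedulable nodes
HA: true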

Resiliency Examples

There are various ways that resiliency can be enabled at different levels. This section lists some examples:

Nodes   Master Nodes   Worker Nodes   Master Schedulable   Resiliency
-----   ------------   ------------   ------------------   ---------------
3       1              2              Yes                  DKube Only
3       1              2              No                   No Resiliency
3       3              0              Yes                  Cluster & DKube
4       1              3              Yes/No               DKube Only
4       3              1              Yes                  Cluster & DKube
4       3              1              No                   Cluster Only
6       3              3              Yes/No               Cluster & DKube

Installation with Existing DKube Data

When installing and uninstalling DKube, the existing DKube storage database can be preserved for re-use or wiped clean. When preserved, an installation of the same version of DKube will start with the contents of the previous installation. This is controlled through the --wipe-data switch when executing the following dkubeadm commands:

Operation   Command                                      Behavior
---------   ------------------------------------------   -----------------------------------------------
Install     sudo ./dkubeadm dkube install                Use the previous data if available
Install     sudo ./dkubeadm dkube install --wipe-data    Do not use the previous data, even if available
Cleanup     sudo ./dkubeadm node cleanup                 Do not remove the existing DKube data storage
Cleanup     sudo ./dkubeadm node cleanup --wipe-data     Remove the existing DKube data storage

Important

Previous data can only be used by an installation of the same DKube version

Node Affinity

DKube allows you to optionally determine what kinds of jobs and workload types get scheduled on each node in the cluster. For example, you might want certain nodes to be used exclusively for GPU-based jobs, or you might want some nodes to be used only for production serving. This control is based on directives that you provide to DKube during installation, which then match up with the node affinity capability built into Kubernetes.

Note

The node affinity capability is optional. If no directives are given to DKube, any job or workload can be run on any node in the cluster.

Node Affinity Usage

This section provides the details on how to use the node affinity capability, with an example.

The node rules are provided in the [NODE-AFFINITY] section of the dkube.ini file, described later in the guide. An example of this section is provided here.

[NODE-AFFINITY]
# Nodes identified by labels on which the dkube pods must be scheduled
# Example: DKUBE_NODES_LABEL: key1=value1
DKUBE_NODES_LABEL: management=true
# Nodes to be tolerated by dkube control plane pods so that only they can be scheduled on the nodes
# Example: DKUBE_NODES_TAINTS: key1=value1:NoSchedule,key2=value2:NoSchedule
DKUBE_NODES_TAINTS: management=true:NoSchedule
# Taints of the nodes where gpu workloads must be scheduled.
# Example: GPU_WORKLOADS_TAINTS: key1=value1:NoSchedule,key2=value2:NoSchedule
GPU_WORKLOADS_TAINTS: gpu=true:NoSchedule
# Taints of the nodes where production workloads must be scheduled.
# Example: PRODUCTION_WORKLOADS_TAINTS: key1=value1:NoSchedule,key2=value2:NoSchedule
PRODUCTION_WORKLOADS_TAINTS: production=true:NoSchedule

Within the dkube.ini file, there are 2 types of field designations:

LABEL

Identified job types can only be scheduled on nodes with this label, but a label does not prevent other job types from also being scheduled on the node

TAINT

Identified job types are the only job types scheduled on nodes with this taint

The definitions in the dkube.ini example file above create 3 types of nodes:

management

Management node

gpu

Node that will run a GPU job

production

Node that will handle production jobs

So, in this example:

  • Since DKUBE_NODES_LABEL has “management=true”:

      • Control jobs can only be executed on nodes with the “management” label, but

      • Worker jobs can be scheduled on any node, including the nodes with the “management” label

  • Since DKUBE_NODES_TAINTS has “management=true:NoSchedule”, control jobs are the only jobs that can be scheduled on nodes with that taint

Assigning a Label

Node labels restrict certain job types to run only on that node, but do not prevent other jobs from also running on that node. In order to assign several nodes the “management” label, the command would be:

kubectl label node <node-1> <node-2> management=true

Assigning a Taint

Node taints restrict certain job types to run only on that node, and prevent any other job type from running on that node. In order to assign several nodes the “management-only” taint, the command would be:

kubectl taint node <node-1> <node-2> management=true:NoSchedule
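The labels and taints that have been applied can be verified with standard kubectl commands, for example:

# Show the labels on all nodes
kubectl get nodes --show-labels
# Show the taints on a specific node
kubectl describe node <node-1> | grep -i taints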

CI/CD

DKube provides the ability to automatically build and register Docker images based on a set of criteria. The settings are provided in the [CICD] section of dkube.ini file, described later in the guide. The Docker images will be pushed to the registry provided in the ini file.

[Figure: CICD section of the dkube.ini file]

The following fields should be changed to enable CICD. The other fields should be left in their default settings.

Field               Value
-----------------   ------------------------------------------
ENABLED             True
DOCKER_REGISTRY     Name of the Docker registry to save images
REGISTRY_USERNAME   Username for the Docker registry
REGISTRY_PASSWORD   Password for the Docker registry
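As a sketch of what the [CICD] section might contain, using the fields from the table above (the exact keys and defaults are defined in the dkube.ini file shipped with your DKube version):

[CICD]
# Enable automatic building and registration of Docker images
ENABLED: True
# Registry where the built images will be pushed, with credentials
DOCKER_REGISTRY: <registry>/<organization>
REGISTRY_USERNAME: <registry username>
REGISTRY_PASSWORD: <registry password>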

Getting the DKube Files

The files necessary for installation, including the scripts, the .ini files, and any other associated files are pulled from Docker, using the following commands:

sudo docker login -u <Docker username>
Password: <Docker password>
sudo docker pull ocdr/dkubeadm:<DKube version>
sudo docker run --rm -it -v $HOME/.dkube:/root/.dkube ocdr/dkubeadm:<DKube version> init

Note

The docker credentials and DKube version number (x.y.z) are provided separately.

This will copy the necessary files to the folder $HOME/.dkube:

dkubeadm

Tool used to install & manage Kubernetes & DKube on the cluster

k8s.ini

Configuration file for cluster-related installation activities such as node setup

dkube.ini

Configuration file for DKube installation

ssh-rsa Key Pair

ssh key pair for passwordless access to the remote machines

Platform-Specific Installation Procedure

The installation procedure depends upon the type of platform and the type of Kubernetes (Community or managed).

Kubernetes   GCP   AWS   On-Prem   Instructions
----------   ---   ---   -------   -----------------------------------------
EKS                 x              Installing DKube on an Amazon EKS Cluster
Rancher       x     x      x       Installing DKube on a Rancher Cluster