Model Monitoring

_images/Monitor_Block_Diagram_Detailed.png

Over time, models can degrade, providing inference results that no longer achieve your business goals. DKube integrates model monitoring into the overall workflow. This allows the data science or production teams to monitor the serving results, and take action if the results are no longer within acceptable tolerances.

  • Local and remote deployments can be monitored

  • Monitors can be created using data files rather than running deployments

  • Alerts and status can be set up based on goals and tolerances

  • A Dashboard provides a snapshot of all monitored models

  • Problems can be viewed in a set of hierarchical graphs

  • The problem, and its root cause, can be determined

  • Retraining and redeployment can be performed

Monitor Workflow

The general workflow to make use of the model monitoring system is described in this section.

Workflow | Description
Create or import a deployment | Deployments, Import Deployment
Add a monitor for the deployment | Add a Monitor
Update the schema | Edit Schema
Create alerts for the monitor | Alerts
Optionally upload a file that sets status thresholds for the monitor | Thresholds
The monitor can be modified after it has been created | Edit an Existing Monitor
The status of the monitors and alerts can be viewed in real-time from the monitor dashboard screen | Monitor Dashboard
Based on the alerts, a specific monitor can be hierarchically investigated to determine what is causing the alert | Monitor Details

Note

It is possible to set up a monitor without a running model, based solely on a set of files. These files can be created manually, generated automatically from a running model, or produced by a program.
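
For example, a serving script could append each inference to a CSV file that a file-based monitor later reads. The sketch below is a minimal illustration of that idea; the file name and column names are placeholders for this example and must match the schema configured for the monitor.

    import csv
    from datetime import datetime, timezone

    # Minimal sketch: log each inference to a CSV file that a file-based
    # monitor can consume. The path and column names are placeholders and
    # must match the schema configured in the monitor.
    PREDICT_LOG = "predict_data.csv"
    FIELDS = ["timestamp", "feature_1", "feature_2", "prediction"]

    def log_prediction(features, prediction):
        """Append one inference record to the prediction log."""
        row = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "feature_1": features[0],
            "feature_2": features[1],
            "prediction": prediction,
        }
        with open(PREDICT_LOG, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if f.tell() == 0:          # write the header only for a new file
                writer.writeheader()
            writer.writerow(row)

    # Example usage
    log_prediction([0.42, 17.0], prediction=1)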

Monitor Menu

The Monitor screens provide a UI-based mechanism to navigate through the workflow.

_images/Monitor_Menu_R33.png

Menu Item | Description | Guide
Code | Create, view, and manage the Code repos | Code
Datasets | Create, view, and manage the Dataset repos | Datasets
Models | Create, view, and manage the Model repos | Models
Images | Catalog of images for use in IDEs & Runs | Images
Deployments | Create, view, and manage deployments and monitors; monitors are described in this section | Deployments
Storage | View the storage utilization for the user
Utilization | View the CPU, GPU, memory, and pod utilization for the user

Deployments Dashboard

_images/Monitor_Deployments_R33.png

In order to monitor a model (or set of files), a Deployment must be created or imported. The following deployment approaches are possible:

  • Create a deployment from a trained model, as described at Deployments

  • Import a deployment from a remote cluster, described in this guide

  • Create a dummy deployment in order to create a monitor for a set of files

The Deployments Dashboard provides a summary of the currently active deployments.

  • The Status column identifies whether the deployment has been created from within DKube or imported

  • If the deployment has been created within DKube, an endpoint URL is provided

  • If the deployment includes a monitor, the status of the monitor is provided

The following actions are possible for each deployment:

Action | Deployment Type | Description
Edit | Running | Change the model being deployed for that endpoint (see Change Model Deployment)
Edit | Imported | Change the remote deployment
Add Monitor | Both | Add a monitor to the deployment

Import Deployment

In order to monitor a remote model, or to monitor using a set of files, the deployment must first be imported to the local DKube cluster. Select the import button, which brings up the import popup.

_images/Monitor_Import_Popup.png

Field | Description
Name | Mandatory name of the deployment
Description/Tags | Optional fields providing more context for reviewing or filtering
Cluster | Optional cluster name if the model has been deployed on a remote cluster. Clusters are added as described at Multicluster Management

The Name field rules are as follows:

Import Type | Description
Remote Model | Must match the deployment name on the remote cluster
Dummy Deployment | User-chosen name that does not need to match any deployment

Fields other than the Name field can be modified through the Edit icon after the deployment has been imported.

Add a Monitor

_images/Monitor_Dashboard_Add_Monitor.png

A monitor can be added by selecting the “Add Monitor” action icon. This will bring up a screen where the basic monitor fields can be filled in. Once the monitor has been added to the deployment, it can be further configured from the monitor dashboard screen.

Note

After the required inputs have been entered and the new monitor submitted, the Schema can be directly accessed. The Schema can also be modified later from the Monitor Dashboard screen.

Basic

_images/Monitor_Add_Basic_R33.png

Field | Description
Model Type | Type of model being monitored, such as regression or classification
Input Data Type | Type of data being monitored, such as tabular or image
Time Zone | Select the time zone to use for the monitor

Health

The health of the deployment can be monitored.

_images/Monitor_Add_Health_R33.png

Field | Description
Enable | Enable the monitoring of the health of the model instance on the cluster
Frequency | Select how often the monitor should run

Data Drift

The Drift screen sets up the monitor for data drift.

  • For locally running deployments, or deployments that have been imported from a remote cluster, most of the fields will be filled in based on the deployment metadata

  • For a dummy deployment, where the monitor is based on files and not running deployments, the fields must be filled in to identify what needs to be monitored

_images/Monitor_Add_Drift_R33.png

Field | Description
Enable | Enable the data drift monitor
Frequency | Select how often the monitor should run
Algorithm | Choose the algorithm to use for evaluating data drift
Train Data | Training dataset name and version that should be used for the monitor
Train Data Upload Transformer Script | Optional script to preprocess or postprocess the data during inference, if necessary (see the sketch after this table)
Dataset Content | Format of the dataset
Predict Data | Prediction dataset name and version that should be used for the monitor
Files Organized As | Folder organization for the predict dataset
Predict Data Upload Transformer Script | Optional script to preprocess or postprocess the data during inference, if necessary
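
A transformer script is user-written code uploaded with the monitor. The sketch below only shows the kind of preprocessing such a script might perform; the entry-point name and the exact interface DKube expects are assumptions for this example, so consult the DKube examples for the real contract.

    import pandas as pd

    # Illustrative preprocessing only. The entry-point name ("transform") and
    # the DataFrame-in / DataFrame-out interface are assumptions of this
    # sketch, not the documented DKube contract. "age" is a hypothetical feature.
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        df.columns = [c.strip().lower() for c in df.columns]      # normalize headers
        df["age"] = pd.to_numeric(df["age"], errors="coerce")     # coerce types
        return df.dropna(subset=["age"])                          # drop unusable rows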

Based on the selected frequency, the monitor compares the training and prediction data, and uses the thresholds or alerts to trigger an event or update the status of the monitor.
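
As an illustration of the kind of comparison a drift algorithm performs, the sketch below applies a two-sample Kolmogorov-Smirnov test to one feature from the training and prediction datasets. This is not DKube's built-in algorithm; the file names, feature name, and threshold are placeholders for this example.

    import pandas as pd
    from scipy.stats import ks_2samp

    # Illustrative drift check for one continuous feature. The file names,
    # feature name, and threshold are placeholders, not DKube defaults.
    train = pd.read_csv("train_data.csv")
    predict = pd.read_csv("predict_data.csv")

    result = ks_2samp(train["feature_1"], predict["feature_1"])
    WARNING_LEVEL = 0.1   # example threshold for this sketch only

    if result.statistic > WARNING_LEVEL:
        print(f"feature_1 drifting: KS statistic {result.statistic:.3f} (p={result.pvalue:.3g})")
    else:
        print(f"feature_1 within tolerance: KS statistic {result.statistic:.3f}")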

Performance Decay

The Performance screen sets up the monitor for metric performance.

_images/Monitor_Add_Performance_R33.png

Field | Description
Enable | Enable metric performance monitor
Frequency | Select how often the monitor should run
Compute Metrics | Select the format of the files to compute the performance metrics

Labelled Data

The Labelled Data selection expects a dataset file with columns that provide both the groundtruth (the correct output) and the predicted output. From these, DKube calculates the performance metrics for the monitor.

Field | Description
Dataset | Dataset name and version for the calculation
Dataset Content | The format of the dataset file
Files Organized As | Folder organization for the dataset
Upload Transformer Script | Optional script to preprocess or postprocess the data during the calculation, if necessary
Groundtruth Column Name | Column header name for the groundtruth
Prediction Column Name | Column header name for the model prediction
Timestamp Column Name | Column header name for the timestamp
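
For example, a minimal labelled dataset could be assembled as below; the column names are placeholders and should match whatever is entered in the Groundtruth, Prediction, and Timestamp column name fields above.

    import pandas as pd

    # Minimal sketch of a labelled dataset: one row per inference, with
    # groundtruth, prediction, and timestamp columns. The column names are
    # placeholders; use the names configured for the monitor.
    labelled = pd.DataFrame(
        {
            "timestamp": ["2024-01-15T10:00:00Z", "2024-01-15T10:05:00Z"],
            "groundtruth": [1, 0],
            "prediction": [1, 1],
        }
    )
    labelled.to_csv("labelled_data.csv", index=False)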

Pre-Computed Source

With a Pre-Computed Source, the metrics have already been computed externally. DKube does not do the computation, but rather uses the metric values in the file.

An example of a pre-computed file is available at Pre-Computed Source Example
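
As a rough sketch of the idea, the metrics could be computed outside DKube and written to a file for the monitor to read. The column layout below is a placeholder; the format DKube actually expects is shown in the Pre-Computed Source Example linked above.

    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score

    # Compute metrics externally from the labelled data and write them out.
    # The output columns are placeholders; follow the Pre-Computed Source
    # Example for the layout DKube actually expects.
    df = pd.read_csv("labelled_data.csv")
    metrics = pd.DataFrame(
        [
            {
                "timestamp": df["timestamp"].max(),
                "accuracy": accuracy_score(df["groundtruth"], df["prediction"]),
                "f1": f1_score(df["groundtruth"], df["prediction"]),
            }
        ]
    )
    metrics.to_csv("precomputed_metrics.csv", index=False)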

Custom

An example of a custom file is available at Custom Metrics Example

Monitor Dashboard

_images/Monitor_Dashboard_R33.png

There are 2 different types of notification available for a monitor:

Type | Description
Status | Provides a status indication based on warning and critical thresholds (see Thresholds)
Alerts | Alerts based on a single threshold set up when the alert is added (see Alerts)

Monitor Status

_images/Monitor_Dashboard_State_R33.png

The status of the monitor is defined as follows:

State | Meaning
init | A field is missing
baselining | Calculating results after adding datasets
ready | Available for monitoring, but not active
active | Running analysis
error | Problem with the monitor

Threshold Status

_images/Monitor_Dashboard_Status_R33.png

The threshold status of the monitors is provided as a summary at the top of the screen, and for each monitor in the columns below.

Important

The threshold status colors are based on the last run, and are not a cumulative indication of status

The Data Drift and Performance Decay threshold status colors are defined as follows:

Status | Meaning
Green Dot | The last run was within all of the thresholds
Orange Dot | The last run was between the warning and critical thresholds
Red Dot | The last run was higher than the critical threshold
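
The color logic above can be summarized with a small sketch, assuming a metric where larger values are worse (such as a drift score):

    def threshold_status(value, warning, critical):
        """Classify a single run against the warning and critical thresholds.
        Assumes larger values are worse, as with a drift score."""
        if value >= critical:
            return "red"        # critical threshold exceeded
        if value >= warning:
            return "orange"     # between warning and critical
        return "green"          # within all thresholds

    # Example: warning at 0.1, critical at 0.2
    print(threshold_status(0.15, warning=0.1, critical=0.2))   # -> "orange"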

Actions

_images/Monitor_Actions_R33.png

The actions for each monitor are as follows:

Action | Description
Start | Start or restart the monitor after being stopped
Stop | Stop the monitor
Delete | Delete the monitor
Edit Monitor | Modify the basic monitor options
Update Schema | Modify the monitor schema
Add Alerts | Add alerts for the monitor
Add or Edit Dashboards | Add or modify the dashboard
Upload Thresholds | Upload the thresholds for warning (orange) and critical (red) status

Important

The monitor must be stopped before editing the basic fields or schema

Note

Customizing dashboards is described in the DKube examples repo, under the Monitoring branch, at Custom Dashboard

Edit an Existing Monitor

_images/Monitor_Dashboard_Edit_Monitor_R33.png

An existing monitor can be modified by selecting the “Edit Monitor” icon on the right of the monitor summary.

Important

The monitor must be stopped before it can be edited

Edit Schema

_images/Monitor_Edit_Schema_R33.png

After the basic information has been completed, the schema needs to be modified to reflect the features. The Monitor Schema can be updated by selecting the “Update Schema” icon on the right of the monitor summary.

Important

The Monitor must be stopped to update the schema

The Schema screen lists the features that are part of the training data. From this screen, you can choose which features to monitor, what type each feature is (input, prediction, etc.), and whether the feature is continuous (a number) or categorical (a distinct category, such as true or false).
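
As a rough illustration of the continuous/categorical distinction, the sketch below guesses a feature's kind from its pandas dtype. This is only a heuristic for this example (numerically coded categories would be misclassified); it is not how the DKube schema editor classifies features.

    import pandas as pd

    # Heuristic only: guess continuous vs categorical from the column dtype.
    # Numerically coded categories (e.g., 0/1 flags) would be misclassified.
    def feature_kinds(df: pd.DataFrame) -> dict:
        return {
            col: "continuous" if pd.api.types.is_numeric_dtype(df[col]) else "categorical"
            for col in df.columns
        }

    # Example usage with a small hypothetical training frame
    train = pd.DataFrame({"age": [34, 51], "smoker": ["yes", "no"]})
    print(feature_kinds(train))   # {'age': 'continuous', 'smoker': 'categorical'}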

Alerts

_images/Monitor_Dashboard_Alerts_R33.png

Alerts provide notifications when an input or output of the Model drifts out of tolerance. Alerts can be added by selecting the alert icon on the dashboard.

The Alerts screen shows the alerts that have been added for that monitor, and allows the user to create a new alert. The alert is configured by selecting the type of comparison to monitor (feature drift or performance decay). In each case, an email address can be configured to be notified when the alert is triggered.

_images/Monitor_Alerts_R31.png _images/Monitor_Add_Alert_Popup_R33.png

Field | Description
Enable | Enable the alert - it can be disabled later by editing the alert
Alert Name | User-chosen name for the alert
Alert Type | Type of comparison, such as data drift or performance decay
Configure Based On | Create the alert based on status or threshold
Features | Which feature to compare for this alert, and the threshold to use for the alert
Breach Incidents | Optionally set the number of times the feature must cross the threshold before the alert is triggered (see the sketch after this table)
Email Address | Optionally provide an email address to use when an alert is triggered
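
To make the Breach Incidents option concrete, the sketch below counts consecutive runs that cross the alert threshold and only fires after the configured number of breaches. Whether DKube counts consecutive or cumulative breaches is not specified here, so treat this purely as an illustration of the idea.

    class BreachCounter:
        """Illustrative only: trigger an alert after N consecutive breaches."""

        def __init__(self, threshold, breach_incidents):
            self.threshold = threshold
            self.breach_incidents = breach_incidents
            self.count = 0

        def observe(self, value):
            """Return True when the alert should fire for this run."""
            if value > self.threshold:
                self.count += 1
            else:
                self.count = 0                  # reset on a healthy run
            return self.count >= self.breach_incidents

    # Example: alert only after 3 consecutive runs above 0.2
    counter = BreachCounter(threshold=0.2, breach_incidents=3)
    for score in [0.25, 0.3, 0.1, 0.22, 0.27, 0.31]:
        if counter.observe(score):
            print(f"alert triggered at drift score {score}")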

The alert will show up on the list of Alerts once successfully created.

Alerts can be edited from the Alert List screen by selecting the Edit icon on the far right.

Thresholds

Thresholds can be set for each feature in a monitor. Rather than triggering a single alert when a threshold is exceeded, the threshold capability allows 2 different thresholds that provide more granularity. The thresholds are:

  • Warning

  • Critical

If neither of those thresholds is exceeded, the monitor is considered to be “Healthy”. A summary of the threshold status is provided in the Monitor Dashboard and described at Threshold Status

An example of a Thresholds file is available at Thresholds File Example

Tickets Dashboard

_images/Monitor_Deployments_Tickets.png

A monitor Ticket can be created and managed from the Tickets tab. There are 2 types of tickets:

  • Incidents

  • Change Requests

Selecting one of the ticket types will bring up a ServiceNow window.

Alerts Dashboard

_images/Monitor_Alerts_Dashboard_R33.png

The Alerts Dashboard shows all of the alerts within DKube, across all of the monitors. It provides information on the monitor name, alert name, type of alert, and the timestamps. The user can go directly to the monitor by selecting the monitor name.

Monitor Details

_images/Monitor_Dashboard_Select_Monitor_R33.png

The process of identifying the root cause of a monitor deviation involves successively reviewing more information on an Alert. From the Monitor Dashboard, select one of the monitors to find out more details on that monitor.

From the Monitor Summary dashboard, the details of a specific monitor can be viewed by selecting the monitor name.

_images/Monitor_Details_Dashboard_R33.png

This brings up a dashboard for that particular monitor, with the associated details. It includes:

  • A summary of the features and alerts status

  • A list of Alerts for that monitor only, for the selected timeframe

A summary of the Alert can be obtained by selecting the Alert name.

_images/Monitor_Alert_Summary_R33.png

More details on the Alert can be obtained by selecting the “Details” button at the top right.

Data Drift

_images/Monitor_Details_Data_Drift_R33.png

Selecting the Data Drift tab provides graphs and tables that help to identify what has drifted, with more information to determine why it has drifted.

The top graph overlays the number of production serving requests with the Alerts. This allows the Production Engineer to determine the amount of live inference traffic activity, and how it compares to the threshold alerts for the features.

The table below the summary graph provides visual and quantified information on how the selected features are changing, and how important each feature is to the resulting Model output. This allows the user to see whether the original training data still matches the live inference data, and how the drift varies over time. This might be a place to start for a retraining activity.

The tables reflect the selected timestamp from the graph above. Selecting different timestamps will bring up different tables.

Performance Decay

_images/Monitor_Details_Performance_R33.png

If Performance is selected, the graphs show how well the Model is performing based on the chosen Model metrics. The top graph combines the number of production requests and the number of alerts.

The bottom graph shows how the metrics are performing.

Configuration, Schema, & Alerts

_images/Monitor_Details_Configuration_R33.png

The Configuration, Schema, and Alerts tabs allow the user to view the options used for the monitor.