I. Introduction

Nowadays, IT activities generate significant amounts of high-dimensional sensor data. Although big data analytics and deep learning have made it possible to handle massive volumes of data, identifying irregularities in such data remains challenging because of its sheer volume, noise, and uneven distribution. This phenomenon is known as the 'curse of dimensionality' (). Moreover, anomalies can arise from interactions between multiple causes, which further complicates detection. This problem domain is particularly crucial in data-driven industries that generate many unstable, dispersed, and multimodal time series datasets, such as resource management, autonomous driving, and the Internet of Things (IoT).

Anomalies reveal unusual characteristics within the systems and entities that supply data. These atypical traits offer valuable insights for real-world applications. Detecting data anomalies can uncover outliers, identify environmental conditions that require human attention, or optimize computing resources by preemptively filtering undesired data segments. For cloud systems, promptly identifying anomalies following an incident is crucial in preventing more significant failures that may impact customers (). Prior research has also shown that intrusion detection plays a vital role in computer network systems by distinguishing legitimate from malicious behavior. Another application is electrocardiography (ECG) signals for assessing heart conditions in medicine, where practitioners typically evaluate the resulting time series manually to detect arrhythmia. Finally, multivariate industrial time series monitor processes such as the gas-oil plant heating loop (GHL), incorporating data from sensors and control systems; an LSTM-based technique has been used to detect defects in this context.

Anomaly detection involves identifying data points, patterns, or traffic that deviate significantly from a system's expected behavior. Outliers that deviate substantially from the rest of the distribution are labeled as anomalies (). Anomaly detection is essential for building trustworthy computer systems () in commercial, industrial, healthcare, and military applications, where crucial processes and decisions must remain safe (). Unsupervised anomaly detection based on statistical, rule-based, machine learning, and neural network methods is becoming increasingly important, as these methods provide fast inference, improve quality of service, and efficiently handle high-dimensional time series data ().

Various statistical, rule-based, and machine learning methods have previously been developed to find abnormalities in time series data. Rule-based methods compare data against anomaly rules, which can be flawed and require frequent, time-consuming updates. Statistical methods estimate parameters of an assumed distribution but may fail to capture underlying nonlinearities and dynamical linkages. Machine learning approaches come in three types: supervised, unsupervised, and weakly supervised learning. Unsupervised techniques such as One-Class Support Vector Machine (OC-SVM) (), k-Nearest Neighbor (KNN) (), Support Vector Data Description (SVDD) (), Expectation Maximization (EM) (), Histogram-Based Outlier Score (HBOS) (), Local Outlier Factor (LOF) (), and Local Density Cluster-based Outlier Factor (LDCOF) () have been employed to identify anomalies in time series data, but they may struggle to capture temporal correlation and can suffer in performance. Statistical methods such as wavelet theory, the Hilbert transform (), principal component analysis (PCA) (), and Markov chain models () have also been used for time series analysis. More recently, machine learning methods such as SVMs (), regression models (), and clustering () have been developed to forecast the distribution of time series data; however, memory constraints can limit their ability to detect temporal patterns.

Anomaly detection methods based on deep learning have attracted interest and become popular because of their ability to handle challenging detection problems in various real-world applications. Recurrent neural networks (RNNs) are a natural choice for sequence modeling. However, traditional RNNs struggle to capture long-range relationships because of vanishing gradients in long-sequence modeling. Popular RNN () variants, including the gated recurrent unit (GRU) () and long short-term memory (LSTM) (), were developed to overcome this restriction. RNNs can also benefit from attention mechanisms when modeling temporal patterns. However, the computational intensity and slow speed of recursive models such as LSTM hinder their ability to replicate long-term trends accurately. At the same time, some time-series anomaly detection tasks, such as detecting anomalies in sensor data or financial transactions, require detecting subtle deviations from normal behavior over long periods. The dual-path network has been proposed as an effective way to address this problem ().

Recently, the Transformer model's ability to encode long sequences with accuracy and inference time that are almost independent of sequence length has made it an attractive choice for anomaly detection models that mine long-term dependencies and deal with nonlinear dynamics. Nonetheless, the Transformer can only handle sequences of a few hundred steps (): for longer sequences its computational complexity is significant and training becomes slow. To address these issues, recent research has proposed combining temporal convolutional networks (TCN) with transformers to capture temporal dependencies while avoiding the pitfalls of recursive models ().

While there have been notable improvements in anomaly detection for time series data, conventional statistical approaches and machine learning algorithms have limitations in effectively handling nonlinear, high-dimensional, and noisy data. Although LSTM and GRU neural networks can capture contextual information, they face challenges due to their slow inference speed and inefficiency. On the other hand, transformers demonstrate strengths in parallelization and capturing long-range dependencies in input sequences. However, slow training and high computational complexity hinder their performance on longer sequences.

Based on the aforementioned considerations, we introduce a novel model called KBJNet, which integrates the TCN and transformer architectures in a dual-path network for detecting abnormalities in multivariate time series data. The KBJNet model incorporates an adaptive multi-head attention mechanism that comprehensively captures the characteristics of each dimension in the data, enabling effective anomaly detection. Our key contributions include:

  • Our study proposes a new model architecture for capturing anomalies involving a combination of dilated TCN and transformers. The TCN utilizes dilation convolution to establish a perceptual field. To ensure a global perceptual field that covers the whole input sequence, the minimum number of convolutional layers is determined based on factors such as the input sequence length, convolution kernel size, and dilation coefficient. In other words, the range of the dilation convolution is adjusted to encompass the entire input sequence.
  • We embed this combined TCN and transformers into a dual-path network, which enhances its efficiency and effectiveness for modeling extremely long sequences and high dimensions.
  • We introduce a dual path network that utilizes a shared TCN Attention mechanism for assigning weights to time steps. This approach facilitates recognizing and prioritizing crucial information within a multivariate time series.
  • Our method has undergone comprehensive testing on standard datasets and has demonstrated superior performance compared to the current leading techniques in benchmark tests.

II. Literature Review

This section presents a comprehensive literature review on anomaly detection, emphasizing three crucial areas: statistical and machine learning approaches, neural network and deep learning techniques, and the current state-of-the-art. Table I summarizes terminologies used in this study.

Table I

Summary of terminology used.


TERMINOLOGY    DEFINITION
ARIMA          Autoregressive Integrated Moving Average
AUC            Area Under the ROC Curve
CAV            Connected and Autonomous Vehicle
COPOD          Copula-Based Outlier Detection
CPOD           Core Point-based Outlier Detection
DAGMM          Deep Autoencoding Gaussian Mixture Model
DTAAD          Dual TCN-Attention Networks for Anomaly Detection in Multivariate Time Series Data
ECG            Electrocardiography
EVT            Extreme Value Theory
FFN            Feedforward Neural Network
GAN            Generative Adversarial Network
GDN            Graph Deviation Networks
GHL            Gas-oil Plant Heating Loop
GPD            Generalized Pareto Distribution
GRU            Gated Recurrent Unit
GTA            Graph Learning with Transformer for Anomaly Detection
HBOS           Histogram-Based Outlier Score
IoT            Internet of Things
KBJNet         Kinematic Bi-Joint Temporal Convolutional Network Attention for Anomaly Detection
KDD            Knowledge Discovery and Data Mining
KNN            k-Nearest Neighbor
LDCOF          Local Density Cluster-based Outlier Factor
LOF            Local Outlier Factor
LSTM           Long Short-Term Memory Networks
LSTM-VAE       Long Short-Term Memory Networks and Variational Autoencoder
MAD-GAN        Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks
MAML           Model-Agnostic Meta-Learning
MBA            MIT-BIH Supraventricular Arrhythmia Database
MSCRED         Multi-Scale Convolutional Recurrent Encoder-Decoder
MSDS           Multi-Source Distributed System
MSE            Mean Squared Error
MSL            Mars Science Laboratory
MTAD-GAT       Multivariate Time-Series Anomaly Detection via Graph Attention Networks
MTS            Multivariate Time Series
NAB            Numenta Anomaly Benchmark
NSIBF          Neural System Identification and Bayesian Filtering
PCA            Principal Component Analysis
POT            Peaks Over Threshold
ReLU           Rectified Linear Unit
RNN            Recurrent Neural Network
SMAP           Soil Moisture Active Passive
SMD            Server Machine Dataset
SoTA           State of the Art
SVD            Support Vector Data
SVDD           Support Vector Data Description
SVM            Support Vector Machine
SWaT           Secure Water Treatment
TCN            Temporal Convolutional Network
TranAD         Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data
TWSVM          Twin Support Vector Machine
USAD           Unsupervised Anomaly Detection
UTRAD          Anomaly Detection and Localization with U-Transformer
WADI           Water Distribution

A. Statistical and machine learning

Several commonly used time series anomaly detection techniques include 3sigma, PCA, KNN, copula-based outlier detection (COPOD), LOF, and OC-SVM. The 3sigma method measures deviations from historical averages, while PCA calculates eigenvector distance differences according to Shyu et al. (). KNN determines anomalies based on the mean distance of nearest neighbors, as discussed in Kiss et al. (). COPOD utilizes statistical probability functions, OC-SVM seeks to learn decision boundaries for typical observations, and LOF is an unsupervised method based on density, as described by Li et al. ().

Patcha & Park () presented an overview of several methods for anomaly detection, including hidden Markov chains, PCA, process regression, and isolation forest, while also highlighting their limitations. Yaacob et al. () introduced the autoregressive integrated moving average (ARIMA) method as a representative statistical approach for modeling and detecting anomalous behaviors. Bandaragoda et al. () employed the widely used isolation forest, which recursively partitions the feature space using multiple isolation trees for anomaly detection.

In the healthcare sector, Salem et al. () utilized linear regression combined with SVM to perform anomaly detection in wireless sensor networks. Shang et al. () introduced SVM combined with mean clustering to increase the effectiveness of model training and enhance anomaly detection precision. Boniol et al. () presented GraphAn, a graph-based approach that converts time series data using interval graph distance. Tran et al. () employed clustering and database manipulation history in their outlier detection method, called CPOD. Kingsbury & Alvaro () proposed Elle, another outlier detection method that leverages clustering and database manipulation history.

Dhiman et al. () employed adaptive threshold techniques and Twin Support Vector Machines (TWSVM) as effective approaches for detecting anomalies in two univariate time series. Wang et al. (), on the other hand, focused on enhancing the overall security of connected and autonomous vehicle (CAV) systems by combining an adaptive extended Kalman filter with a pre-trained single-class SVM.

B. Neural network and deep learning

Several deep learning-based methods have already been proposed to address these issues. For robust anomaly detection, neural system identification and Bayesian filtering (NSIBF) uses an LSTM-based neural network architecture for Bayesian filtering and system identification. EncDec-AD () used LSTM as the base cell for both the encoder and decoder. The deep autoencoding Gaussian mixture model (DAGMM) () uses a deep autoencoder to produce a low-dimensional representation and a reconstruction error for each input; a drawback of this method is that it does not exploit temporal information. Meanwhile, MSCRED () uses a convolutional encoder-decoder and an attention-based Conv-LSTM to reconstruct a multi-scale signature matrix and detects anomalies from the residual signature matrices, but it may require a longer training time and its performance is limited when data are insufficient.

Ergen and Kozat () introduced an algorithm that uses LSTM to transform variable-length sequences into fixed-length representations, followed by a one-class support vector machine decision function or a support vector data description technique as the anomaly detector. OmniAnomaly () proposed a recurrent neural network incorporating stochasticity to identify irregularities in multivariate time series data. LSTM-VAE () combined LSTM and a variational autoencoder but overlooked the interconnection between stochastic variables. Multivariate anomaly detection for time series data with generative adversarial networks (MAD-GAN) () adopts generator and discriminator base models in the GAN framework and utilizes LSTM-RNNs to capture the temporal relations of time series distributions. TCN-AE () combines TCN and autoencoders but ignores the correlation between time series. Multivariate time-series anomaly detection via graph attention networks (MTAD-GAT) () employs graph attention networks (GATs) () in both the feature and time dimensions to capture temporal and feature correlations. Anomaly Transformer () proposed a minimax training strategy and used self-attention weights to identify anomalies. Graph learning with transformer for anomaly detection (GTA) () employed a transformer-based architecture to learn temporal dependencies and to acquire a graph structure that accurately represents the relationships between different elements in the data. Deep transformer networks for anomaly detection (TranAD) () incorporated adversarial training and self-conditioning techniques into a transformer-based model to improve performance.

Huang et al. introduced HitAnomaly, an anomaly detection model based on log analysis. HitAnomaly utilizes a hierarchical transformer structure to capture and represent both the sequences of log templates and their corresponding parameter values. The classification model was constructed with an attention mechanism, and separate encoders were devised for log sequences and parameter values to obtain their respective representations. The study provides evidence that the transformer model outperforms LSTM and illustrates the successful modeling of log sequences using a hierarchical framework. On three log datasets, the results demonstrated that HitAnomaly outperforms other currently used log-based anomaly detection methods ().

Yu et al. () combine autoregressive (AR) and adaptive ensemble (AE) components with a transformer to capture information from long sequences. They design convolution and dilated convolution as a local TCN, and introduce a feedback mechanism and a loss ratio to improve detection accuracy and enlarge association differences.

C. State of the art

Deep learning methodologies have shown promising performance in multivariate time series (MTS) anomaly detection. Various approaches, including transformer-based models, autoencoder-based models, and others, have been proposed, each with unique architectures and techniques. These models represent substantial progress in MTS anomaly detection and offer enticing possibilities for future research. However, a notable challenge of deep learning methodologies is the slow training process and considerable computational complexity, which can hinder their efficacy, particularly when dealing with longer sequences. We summarize the features of the state-of-the-art methods in Table II, highlighting the capabilities of our proposed method.

Table II

Summary of literature review multivariate time series.


METHOD          APPROACH        MAIN ARCHITECTURE    SUPERVISED/UNSUPERVISED    ABLE TO HANDLE LIMITED DATA    INTERPRETABILITY
DAGMM ()        Forecasting     AE                   Unsupervised               ×                              ×
HitAnomaly ()   Forecasting     Transformer          Supervised                 ×                              ×
TCN-AE ()       Reconstruction  AE                   Unsupervised               ×                              ×
OmniAnomaly ()  Reconstruction  VAE                  Unsupervised               ×                              ×
LSTM-VAE ()     Reconstruction  VAE                  Semi                       ×                              ×
GTA ()          Reconstruction  GNN                  Semi                       ×                              ×
MSCRED ()       Reconstruction  AE                   Unsupervised               ✓                              ×
MAD-GAN ()      Reconstruction  GAN                  Unsupervised               ×                              ×
USAD ()         Reconstruction  AE                   Unsupervised               ×                              ×
MTAD-GAT ()     Hybrid          GNN                  Supervised                 ×                              ✓
CAE-M ()        Hybrid          AE                   Unsupervised               ×                              ×
GDN ()          Forecasting     GNN                  Unsupervised               ×                              ✓
TranAD ()       Reconstruction  Transformer          Unsupervised               ✓                              ✓
DTAAD ()        Reconstruction  Transformer          Unsupervised               ✓                              ✓
KBJNet          Reconstruction  Transformer          Unsupervised               ✓                              ✓

III. Methodology

In this section, we present a comprehensive methodology for addressing the problem formulation of anomaly detection using a combination of advanced machine learning techniques. Our methodology encompasses various stages, including data preprocessing, the implementation of dilated temporal convolutional networks (TCN), transformers, and a novel kinematic bi-joint TCN and transformer model. We also describe the training, meta-learning techniques, and inference procedures for efficient anomaly detection and diagnosis. Furthermore, we provide a summary of the performance measures employed to assess the efficiency of our approach in detecting anomalies. By integrating these components, our methodology offers a resilient and precise solution for identifying anomalies in real-world applications.

A. Preprocess

We examine a set of data points or observations organized as a time-stamped sequence with numerous variables. Each data point in the set $T$ is gathered at a unique timestamp $t$, forming the data points $x_t$ of the set $T$. Each $x_t$ belongs to $\mathbb{R}^{m}$ for all values of $t$; in the univariate setting, $m = 1$. We assume that the joint probability of the entire time series $x$ can be factorized into a product of conditional probabilities, where each observation at time $t$ is conditionally dependent on the past observations $x_1^{(i)}, x_2^{(i)}, \ldots, x_{t-1}^{(i)}$ of the same time series component $i$.

Given a multivariate time series input as the set of values $z_{i,1:t_0}^{l}$ for each time series $i$ and dimension $l$, each $z_{i,1:t_0}^{l}$ represents a sequence of values $z_{i,1}^{l}, z_{i,2}^{l}, \ldots, z_{i,t_0}^{l}$ in the $l$-th dimension of the time series data, and each data point $x_t^{(i)}$ is a vector in $\mathbb{R}^{m}$. To increase training stability and strengthen the resilience of KBJNet, we take steps to standardize datasets obtained from different sources.

In the data preprocessing stage, we filter out nonessential information from the datasets to concentrate only on the crucial data for anomaly detection. We exclude irrelevant details such as the source and description of the dataset and other unnecessary information. Instead, we emphasize essential elements like the dataset size, anomaly labels, and the time steps. Additionally, we standardize the data formats and specifications to ensure consistency throughout the dataset.

The data is normalized and transformed into time-series windows for training and testing. The normalization of the time-series data is conducted by applying the following equation:

(1)
$x_t \leftarrow \dfrac{x_t - \min(T)}{\max(T) - \min(T) + \epsilon},$

where $\epsilon$ is a small constant that prevents division by zero.

B. Sliding window

To represent the relationship of a value $x_t$ at a specific timestamp $t$, we investigate a relevant window of length $K$:

(2)
$W_t = \{x_{t-K+1}, \ldots, x_t\}$

For timestamps $t < K$, we incorporate replication padding by extending the window $W_t$ with a constant vector of length $K - t$. The input time series $T$ is then converted into a sequence of sliding windows $W = \{W_1, \ldots, W_T\}$. The use of sliding windows with replication padding helps preserve the local context of the data points, as shown in Figure 2.
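A minimal NumPy sketch of this preprocessing is shown below. The replication padding here repeats the earliest observation, and all function and variable names are illustrative assumptions rather than the KBJNet implementation.

```python
# A minimal sketch of Eqs. (1)-(2): min-max normalization with a small epsilon and
# conversion into length-K sliding windows with replication padding.
import numpy as np

def normalize(series: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    lo, hi = series.min(axis=0), series.max(axis=0)
    return (series - lo) / (hi - lo + eps)              # Eq. (1), applied per dimension

def sliding_windows(series: np.ndarray, K: int) -> np.ndarray:
    windows = []
    for t in range(len(series)):
        if t + 1 < K:                                    # early timestamps: pad by replication
            pad = np.repeat(series[:1], K - t - 1, axis=0)
            windows.append(np.concatenate([pad, series[: t + 1]], axis=0))
        else:
            windows.append(series[t - K + 1 : t + 1])    # Eq. (2): {x_{t-K+1}, ..., x_t}
    return np.stack(windows)                             # shape (T, K, m)

data = normalize(np.random.rand(100, 3))                 # 100 timestamps, m = 3 dimensions
print(sliding_windows(data, K=5).shape)                  # (100, 5, 3)
```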

Figure 1 

Kinematic bi-joint network architecture for anomaly detection.

Figure 2 

An illustration or depiction of data that involves multiple variables and occurs over a period of time.

Given the input window $W_t$ and the corresponding reconstruction $O_t$, the anomaly score $s_t$ is computed.

The input window is labeled anomalous if its anomaly score is greater than the threshold value, which is calculated using the anomaly scores of the previous input windows.

C. Dilated TCN

We have developed a novel architecture to enhance feature-sharing efficiency while retaining the network’s ability to learn new features. Our approach involves implementing a bi-joint TCN design in which all blocks share a common dilated TCN. This approach significantly reduces redundancy in the feature extraction process while enabling the network to learn new features through its densely connected path.

The dilated convolution operation, illustrated in Figure 3, is used in convolutional neural networks as a 'jump filter' that expands the receptive field exponentially with each layer. For a 1-D sequence input $x \in \mathbb{R}^{n}$ and a convolutional filter $f : \{0, \ldots, k-1\} \rightarrow \mathbb{R}$, the dilated convolution operation $F$ on an element $s$ of the sequence is defined as

(3)
$F(s) = (x \ast_{d} f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}$

where $d$ denotes the dilation factor, $k$ is the convolutional filter size, and $s - d \cdot i$ indexes into the past according to $d$. In general, the receptive field $r$ of a 1-D convolutional network with $n$ layers and a kernel size of $k$ is given by $r = 1 + n(k-1)$. To completely cover an input of length $l$, we would have to set the number of layers to $n = \lceil (l-1)/(k-1) \rceil$, where $\lceil \cdot \rceil$ denotes rounding up. However, this causes the network to become too deep, resulting in a model with many parameters, so we instead use the minimum number of layers required by the global TCN ().
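The sketch below illustrates this idea under the common assumption that the dilation factor doubles at each layer, so the number of layers needed for full coverage grows only logarithmically with the window length. Channel sizes, padding scheme, and layer count are illustrative, not the exact KBJNet configuration.

```python
# A sketch of a dilated TCN whose dilation doubles at every layer so that the receptive
# field covers the whole input window with a logarithmic number of layers.
import math
import torch
import torch.nn as nn

def min_layers_for_full_coverage(seq_len: int, kernel_size: int) -> int:
    # Smallest n with receptive field 1 + (k - 1)(2^n - 1) >= seq_len for dilations 1, 2, 4, ...
    return max(1, math.ceil(math.log2((seq_len - 1) / (kernel_size - 1) + 1)))

class DilatedTCN(nn.Module):
    def __init__(self, channels: int, seq_len: int, kernel_size: int = 3):
        super().__init__()
        self.convs = nn.ModuleList()
        self.crops = []
        for i in range(min_layers_for_full_coverage(seq_len, kernel_size)):
            d = 2 ** i                                   # dilation factor doubles per layer
            pad = (kernel_size - 1) * d
            self.convs.append(nn.Conv1d(channels, channels, kernel_size,
                                        padding=pad, dilation=d))
            self.crops.append(pad)

    def forward(self, x):                                # x: (batch, channels, seq_len)
        for conv, crop in zip(self.convs, self.crops):
            x = torch.relu(conv(x))[..., :-crop]         # crop the right side (causal alignment)
        return x

tcn = DilatedTCN(channels=8, seq_len=10)
print(tcn(torch.randn(4, 8, 10)).shape)                  # torch.Size([4, 8, 10])
```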

Figure 3 

The convolution has specific dilation factors of 1, 2, and 4 and a kernel size of 3. The input is represented as x, and the output is represented as y.

Our proposed approach involves feeding the decoder output back into the same TCN for additional processing, which helps the model improve its representation of the input data over time and potentially capture more complex patterns. This feedback loop between the decoder and the TCN facilitates the model's learning and adjustment to the input data.

D. Transformer

The Transformer model, widely used in natural language processing and machine vision, is based on attention. Attention scoring computes the dot product of the $d_k$-dimensional queries and keys, applies a softmax activation function to the result to obtain weights, and multiplies these weights by the $d_v$-dimensional values. This scoring function is efficient and compact. In the transformer, the inputs undergo a transformation that creates the query, key, and value matrices $Q$, $K$, and $V$. To simplify the subsequent inference operations of the neural network model, the matrix $V$ is compressed into a smaller representative embedding space using the softmax distribution to generate convex combination weights. The square root of $d_k$ is used to stabilize the model's gradient, reduce weight fluctuations, and promote more stable training.

(4)
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V,$

where $Q$, $K$, and $V$ are matrices in $\mathbb{R}^{n \times d_{model}}$ and $d_{model}$ is the model dimension. Multi-headed attention enables the model to focus on diverse information simultaneously, and the results are concatenated and transformed using a linear projection to obtain $d_{model}$-dimensional features. The model consists of two encoders and one decoder, with position encoding added to the output of the model's first half to obtain the encoders' input.

Position encoding is performed using sine and cosine functions, where $pos$ is the token's position in the sequence, $i$ is the index of the dimension in the encoding, and $d_{model}$ is the model dimension. The FFN layers apply two linear layers with leaky ReLU activation functions to the input data: the output of the first linear layer is routed through the second linear layer to generate the FFN's final output. In the decoder, the output of the last FFN is passed through a sigmoid activation function.
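The snippet below sketches the two building blocks described here, sinusoidal position encoding and the scaled dot-product attention of Eq. (4). Dimension sizes are placeholders; the multi-head projection and the leaky-ReLU FFN are omitted for brevity.

```python
# A minimal sketch of sinusoidal position encoding and scaled dot-product attention.
import torch
import torch.nn.functional as F

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1).float()           # token positions
    i = torch.arange(0, d_model, 2).float()                    # even dimension indices
    angle = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                             # sine on even dimensions
    pe[:, 1::2] = torch.cos(angle)                             # cosine on odd dimensions
    return pe

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5              # QK^T / sqrt(d_k)
    return F.softmax(scores, dim=-1) @ v                       # convex combination of V

x = torch.randn(2, 5, 16) + positional_encoding(5, 16)         # add position information
print(scaled_dot_product_attention(x, x, x).shape)             # torch.Size([2, 5, 16])
```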

E. Kinematic bi-joint TCN and transformer

The kinematic bi-joint TCN and transformer model, illustrated in Figure 1, processes input from a dilated TCN with dimensions (B, L, C), where B is the batch size, L is the sequence length, and C is the number of features. The input is normalized using LayerNorm, which calculates the mean ($\mu$) and variance ($\sigma^{2}$) along the feature dimension as follows:

(5)
$\mu = \dfrac{1}{L} \sum_{l=1}^{L} X_{blc}$
(6)
$\sigma^{2} = \dfrac{1}{L} \sum_{l=1}^{L} (X_{blc} - \mu)^{2}$

The normalized input $\hat{X}_{blc}$ at position (b, l, c) is obtained by subtracting $\mu$ from $X_{blc}$ and dividing by the square root of $\sigma^{2} + \epsilon$, where $\epsilon$ is a small constant added for numerical stability:

(7)
$\hat{X}_{blc} = \dfrac{X_{blc} - \mu}{\sqrt{\sigma^{2} + \epsilon}}$

The normalized tensor is then adjusted by scaling and shifting with the learnable parameters $\gamma_{c}$ and $\beta_{c}$ to obtain the output $Y_{blc}$ of the LayerNorm operation at position (b, l, c):

(8)
$Y_{blc} = \gamma_{c} \hat{X}_{blc} + \beta_{c}$

Both $\gamma_{c}$ and $\beta_{c}$ are learnable parameters updated during training. The sliding window output $T$ is then transferred to a stack of $B$ bi-joint TCN transformer blocks.

Each bi-joint block of our model comprises one transformer encoder and one decoder. We combine the output of the first part of the model with position encoding to obtain the input $I_{i}$, which is then passed through two separate encoders:

(9)
$I_{i}^{1} = \mathrm{LayerNorm}(I_{i} + \mathrm{MultiHead}(I_{i}, I_{i}, I_{i}))$
(10)
$I_{i}^{2} = \mathrm{LayerNorm}(I_{i} + \mathrm{MultiHead}(T_{b}, T_{b}, T_{b}))$

where $i \in \{1, 2\}$ denotes the first and second encoder. The encoder's output is then connected to the feedforward layer using residual connections and sent separately to the two decoders to obtain the final predicted outputs:

(11)
$I_{i}^{3} = I_{i}^{2} + \mathrm{FFN}_{1}(\mathrm{LeakyReLU}(\mathrm{FFN}_{2}(I_{i}^{2})))$
(12)
$O_{i} = \mathrm{Sigmoid}(\mathrm{FFN}(I_{i}^{3}))$

The sigmoid activation function constrains the output range of $O_{i}$ to lie between 0 and 1, which is suitable for the subsequent reconstruction-error computation against the normalized sliding window input.
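A compact sketch of one bi-joint encoder-decoder block following Eqs. (9)-(12) is given below. The number of heads, hidden sizes, and the way the two encoder outputs are fused downstream are assumptions made only for illustration.

```python
# A sketch of one bi-joint block: self-attention on the position-encoded input I,
# attention over the shared TCN output T_b, a leaky-ReLU feedforward decoder, and a
# sigmoid output head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiJointBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4, d_ff: int = 64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.tcn_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn2 = nn.Linear(d_model, d_ff)     # inner linear layer of Eq. (11)
        self.ffn1 = nn.Linear(d_ff, d_model)     # outer linear layer of Eq. (11)
        self.out = nn.Linear(d_model, d_model)   # decoder FFN of Eq. (12)

    def forward(self, I, T_b):
        I1 = self.norm1(I + self.self_attn(I, I, I, need_weights=False)[0])       # Eq. (9)
        I2 = self.norm2(I + self.tcn_attn(T_b, T_b, T_b, need_weights=False)[0])  # Eq. (10)
        I3 = I2 + self.ffn1(F.leaky_relu(self.ffn2(I2)))                           # Eq. (11)
        return torch.sigmoid(self.out(I3)), I1                                     # Eq. (12)

block = BiJointBlock(d_model=16)
window = torch.randn(8, 5, 16)                   # (batch, window length K, features)
recon, _ = block(window, window)                 # here the same tensor stands in for T_b
print(recon.shape)                               # torch.Size([8, 5, 16])
```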

F. Procedure for training

We use the mean squared error (MSE) as the loss criterion to measure the error between the output prediction of each decoder and the original input window $x_t$. We calculate the losses of the two decoders as $L_{1}$ and $L_{2}$, respectively, using the following equations:

(13)
$L_{1} = \dfrac{1}{n} \sum_{i=1}^{n} (O_{1} - x_{i})^{2}, \quad L_{2} = \dfrac{1}{n} \sum_{i=1}^{n} (O_{2} - x_{i})^{2}$

To obtain the total loss $\mathcal{L}$, we combine the losses of the two decoders from the first TCN and the second TCN by taking a weighted sum with a hyperparameter $\lambda$. The goal is to minimize the total loss with respect to the hyperparameters $W$ and model parameters $\Theta$:

(14)
$\{\Theta^{*}, W^{*}\} = \arg\min_{\Theta, W} \sum_{x \in X} \mathcal{L}(\psi(\phi(x; \Theta); W))$

where $\phi$ represents the overall network with total model parameters $\Theta$, $W$ denotes the collection of hyperparameters, and $\psi$ represents the overall learning mapping for the anomaly detection task.
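A minimal training-loop sketch of this objective is shown below. The specific form of the weighted sum, $\lambda L_1 + (1 - \lambda) L_2$, the model interface that returns two decoder outputs, and the data loader are assumptions for illustration only, since the paper only states that the two losses are combined with a hyperparameter $\lambda$.

```python
# A minimal training-loop sketch for the weighted two-decoder MSE objective.
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, lam: float = 0.5):
    mse = nn.MSELoss()
    for window in loader:                         # window: (batch, K, m), already normalized
        o1, o2 = model(window)                    # outputs of the two decoders
        loss = lam * mse(o1, window) + (1 - lam) * mse(o2, window)   # weighted total loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```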

G. Meta learning

To improve the training of our KBJNet model with limited data (Algorithm 1, line 12), in every training epoch we update the neural network weights $\theta$ with a gradient descent step using the loss function $L$ and the learning rate $\alpha$.

Algorithm 1 

The KBJNet Training Algorithm.

This gives us the updated weights $\theta'$. Model-agnostic meta-learning (MAML) () is performed at the end of each epoch, using the updated weights to update the model parameters $\theta$ with a meta step-size $\beta$. As a result, the model can be trained quickly with limited data. The update can be written as:

(15)
$\theta' \leftarrow \theta - \alpha \nabla_{\theta} L(f(\theta)), \qquad \theta \leftarrow \theta - \beta \nabla_{\theta} L(f(\theta'))$
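The sketch below shows one way such an epoch-level update could look, using a first-order approximation of the MAML gradient. The first-order shortcut and all function names are assumptions and not Algorithm 1 itself.

```python
# A simplified, first-order sketch of the per-epoch update in Eq. (15): an inner step
# produces theta', and the meta step moves theta using the loss evaluated at theta'.
import copy
import torch

def maml_epoch(model, loss_fn, data, alpha: float = 0.01, beta: float = 0.5):
    inner = copy.deepcopy(model)                                  # working copy for theta'
    loss = loss_fn(inner(data), data)
    grads = torch.autograd.grad(loss, inner.parameters())
    with torch.no_grad():                                         # theta' = theta - alpha * grad
        for p, g in zip(inner.parameters(), grads):
            p -= alpha * g
    meta_loss = loss_fn(inner(data), data)                        # loss under theta'
    meta_grads = torch.autograd.grad(meta_loss, inner.parameters())
    with torch.no_grad():                                         # theta <- theta - beta * grad
        for p, g in zip(model.parameters(), meta_grads):
            p -= beta * g
```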

H. Inference procedure, anomaly detection, and diagnosis

Our approach, summarized in Algorithm 2, involves performing online inference sequentially on a sliding window of input data, generating anomaly scores for each timestamp in each dimension. The Peaks Over Threshold (POT) () approach is used to dynamically select thresholds for each dimension by applying Extreme Value Theory (EVT) to the univariate time series of anomaly scores obtained during offline training. Instead of manually setting thresholds and making assumptions about the distribution, we fit the Generalized Pareto Distribution (GPD) () to the data, following EVT, and determine the appropriate value-at-risk for dynamically setting the threshold, consistent with OmniAnomaly (), TranAD (), and DTAAD () (Figure 4).
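The following sketch illustrates POT-style dynamic thresholding with a GPD tail fit, assuming SciPy's genpareto distribution; the initial quantile and risk level are illustrative values rather than the tuned settings used in our experiments.

```python
# A minimal sketch of POT thresholding: fit a Generalized Pareto Distribution to the
# scores exceeding a high initial quantile and choose the threshold at a small risk level.
import numpy as np
from scipy.stats import genpareto

def pot_threshold(scores: np.ndarray, init_quantile: float = 0.98, risk: float = 1e-3) -> float:
    u = np.quantile(scores, init_quantile)            # initial high threshold
    excesses = scores[scores > u] - u                 # peaks over that threshold
    if excesses.size < 10:                            # too few peaks: fall back to the quantile
        return float(u)
    c, _, scale = genpareto.fit(excesses, floc=0.0)   # fit the GPD tail
    p_u = excesses.size / scores.size                 # empirical exceedance probability of u
    return float(u + genpareto.ppf(1.0 - risk / p_u, c, loc=0.0, scale=scale))

anomaly_scores = np.abs(np.random.randn(10_000))      # stand-in for one dimension's scores
print(pot_threshold(anomaly_scores))
```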

Algorithm 2 

The KBJNet Testing Algorithm.

Figure 4 

Visualization of anomaly prediction.

IV. Experiments

We conducted experiments to assess the effectiveness of our model, KBJNet. This section describes the datasets used in our experiments as well as the performance metrics. As part of our baseline tests, we compared KBJNet with the most widely used models and advanced methods currently available. We used the following hyperparameter values (a minimal configuration sketch follows the list):

  • Optimizer = Adam
  • Learning rate = 0.009 and 0.5 step size step-scheduler
  • Window size = 5
  • Convolutional kernel size TCN = 3
  • Transformer encoders = 2
  • Layers of the encoder’s hidden units = 1
  • Encoders dropout = 0.2
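The sketch below expresses this configuration as a PyTorch optimizer and scheduler setup. Interpreting the 0.5 step-size scheduler as a StepLR decay factor, and the scheduler interval, are assumptions; the Linear module is only a placeholder for the KBJNet model.

```python
# A sketch of the experimental configuration listed above.
import torch

config = {
    "window_size": 5,
    "tcn_kernel_size": 3,
    "num_transformer_encoders": 2,
    "encoder_hidden_layers": 1,
    "encoder_dropout": 0.2,
}

model = torch.nn.Linear(8, 8)                                     # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=0.009)        # Adam, lr = 0.009
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
```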

A. Dataset sources

We use nine datasets in our experiments (eight public data sets). Table III shows the details of datasets. As an illustration, the SMAP dataset contains 55 distinct entities, each with 25 dimensions.

Table III

Dataset characteristics.


TYPE    DIMENSIONS    TRAIN      VALIDATION    ANOMALIES RATE (%)
MSDS    10 (1)        146430     146430        5.37
SMD     38 (4)        708420     708420        4.16
SWaT    51 (1)        496800     449919        11.98
MSL     55 (3)        58317      73729         10.72
SMAP    25 (55)       135183     427617        13.13
MBA     2 (8)         100000     100000        0.14
UCR     1 (4)         1600       5900          1.88
NAB     1 (6)         4033       4033          0.92
WADI    123 (1)       1048571    172801        5.99

  1. Numenta Anomaly Benchmark (NAB) is an actual data stream containing marked exceptions from various sources, ranging from social media to temperature sensors to server network utilization (). We removed incorrectly tagged sequences of anomalies from this dataset for our performed tests.
  2. HexagonML (UCR) is a multivariate time series dataset used in the KDD 2021 cup (). We only used the portion of the dataset obtained from the real world.
  3. MIT-BIH Supraventricular Arrhythmia Database (MBA) contains standard test materials for arrhythmia detectors (). This dataset has been used in around 500 studies of cardiac dynamics.
  4. Soil Moisture Active Passive (SMAP) is a 25-dimensional dataset collected by NASA that contains telemetry anomaly data extracted from Incident Surprise Anomaly (ISA) reports from spacecraft monitoring systems ().
  5. Mars Science Laboratory (MSL) is a SMAP-like dataset that includes actuator and sensor data from the Mars rover itself. We used only the three non-trivial sequences (A4, C2, and T1) from the dataset in Hundman et al. ().
  6. Secure Water Treatment (SWaT) consists of data obtained from 51 sensors in a continuously operating water treatment system (). The data includes water level, flow rate, and other sensor readings.
  7. Server Machine Dataset (SMD) was gathered over five weeks from a major internet company (). SMD was split into two sets of the same size, one used for training and the other for testing. Only the four non-trivial sequences from this dataset were utilized.
  8. Multi-Source Distributed System (MSDS) consists of application logs, metrics, and distributed traces from a multi-source distributed system ().
  9. Water Distribution (WADI) is an expansion of the SWaT system that includes more than twice the sensors and actuators of the original SWaT model. Additionally, the dataset was obtained over a longer period of time, covering 14 days of normal scenarios and two days of attack scenarios ().

B. Result and analysis

We comprehensively compared our newly proposed algorithm, KBJNet, with several state-of-the-art algorithms in the field, such as MSCRED, MAD-GAN, USAD, MTAD-GAT, CAE-M, GDN, and DTAAD. To evaluate the performance of these algorithms, we employed a set of relevant metrics, including Precision (P), Recall (R), Area Under the Curve (AUC), and F1 scores. We train the models with either 80% or 20% of the training data. This division allows us to examine how the models perform when provided with limited training examples and when trained on a larger volume of data. By assessing the models' behavior in these contrasting scenarios, we can gain valuable insights into their scalability and generalization capabilities and identify potential challenges that may arise in real-world applications with varying data availability. This evaluation provides a comprehensive understanding of how our models perform with substantial data and with a limited dataset, allowing us to make informed decisions regarding their suitability for different operational environments.
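For reference, the sketch below shows how the reported point-wise metrics can be computed with scikit-learn from ground-truth labels and anomaly scores; the fixed 0.5 threshold merely stands in for the POT threshold described in Section III, and the labels and scores are synthetic placeholders.

```python
# A sketch of the reported metrics (P, R, AUC, F1) computed with scikit-learn.
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score, f1_score

labels = np.random.randint(0, 2, size=1000)        # ground-truth anomaly labels
scores = np.random.rand(1000)                       # per-timestamp anomaly scores
preds = (scores > 0.5).astype(int)                  # thresholded predictions

print("P  :", precision_score(labels, preds))
print("R  :", recall_score(labels, preds))
print("AUC:", roc_auc_score(labels, scores))
print("F1 :", f1_score(labels, preds))
```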

1) Performance with 20% of the training dataset: Recently developed models, including unsupervised anomaly detection (USAD), multivariate time-series anomaly detection via graph attention networks (MTAD-GAT), and graph deviation networks (GDN), utilize attention mechanisms to concentrate on particular features of the data and capture long-term trends by adjusting neural network weights. However, KBJNet, which utilizes self-attention, outperforms USAD, MTAD-GAT, and GDN across all datasets, as shown in Table V. USAD and MTAD-GAT have constraints when classifying anomalies that occur over an extended period because they only consider a local contextual window. To surpass this restriction, KBJNet utilizes self-conditioning on an embedding of the entire trace along with position encoding, which enhances temporal attention. The utilization of a meta-learning strategy with MAML enables KBJNet to swiftly acquire anomaly features within sequential data, even with a limited dataset volume (Figure 5). When only 20% of the available data is used, the performance of TranAD and DTAAD closely approaches that of KBJNet, primarily due to their use of generative adversarial training for the encoder-decoder structure. In general, KBJNet demonstrates better performance than all other methods, with the exception of DTAAD on the MBA dataset.

Table V

Comparison of the KBJNet model with baseline methods using 20% of the training dataset.


METHOD       NAB (AUC*/F1*)     UCR (AUC*/F1*)     MBA (AUC*/F1*)     SMAP (AUC*/F1*)    MSL (AUC*/F1*)     SWaT (AUC*/F1*)    SMD (AUC*/F1*)     MSDS (AUC*/F1*)    WADI (AUC*/F1*)
MSCRED       0.8298/0.7012      0.9636/0.4928      0.9498/0.9107      0.9810/0.8049      0.9796/0.8231      0.8384/0.7921      0.9767/0.8003      0.7715/0.8282      0.6028/0.0412
MAD-GAN      0.8193/0.7108      0.9958/0.8215      0.9549/0.9191      0.9876/0.8467      0.9648/0.8189      0.8455/0.8011      0.8634/0.9317      0.5001/0.7389      0.5382/0.0936
USAD         0.7268/0.6782      0.9968/0.8539      0.9698/0.9426      0.9884/0.8380      0.9650/0.8191      0.8439/0.8088      0.9855/0.9214      0.7614/0.8390      0.7012/0.0734
MTAD-GAT     0.6957/0.7012      0.9975/0.8672      0.9689/0.9426      0.9815/0.8226      0.9783/0.8025      0.8460/0.8080      0.9799/0.6662      0.6123/0.8249      0.6268/0.0521
CAE-M        0.7313/0.7127      0.9927/0.7526      0.9617/0.9003      0.9893/0.8313      0.9837/0.7304      0.8459/0.7842      0.9570/0.9319      0.6002/0.8390      0.6110/0.0782
GDN          0.8300/0.7014      0.9938/0.8030      0.9672/0.9317      0.9888/0.8412      0.9415/0.8960      0.8391/0.8073      0.9812/0.7108      0.6820/0.8390      0.6122/0.0413
TranAD       0.9216/0.8420      0.9983/0.9211      0.9946/0.9897      0.9884/0.8936      0.9856/0.9171      0.8461/0.8093      0.9847/0.8794      0.8112/0.8389      0.6852/0.0698
DTAAD        0.9330/0.9057      0.9984/0.9220      0.9955/0.9912      0.9894/0.8996      0.9864/0.9212      0.8460/0.8087      0.9866/0.8941      0.8115/0.8390      0.7818/0.0977
KBJNet       0.9999/0.9231      0.9999/0.9328      0.9932/0.9869      0.9894/0.9007      0.9907/0.9451      0.8460/0.8087      0.9986/0.9983      0.9829/0.9107      0.8453/0.1511

Figure 5 

Results in UCR.

2) Performance with 80% of the training dataset: Table IV compares the KBJNet approach with the baseline methods in terms of anomaly detection performance metrics.

Table IV

Comparison of KBJNet model with baseline methods with 80% of the training dataset.


METHOD       NAB (P/R/AUC/F1)               UCR (P/R/AUC/F1)               MBA (P/R/AUC/F1)               SMAP (P/R/AUC/F1)              SWaT (P/R/AUC/F1)
MSCRED       0.8521/0.6700/0.8400/0.7501    0.5440/0.9717/0.9919/0.6975    0.9271/1.0000/0.9798/0.9622    0.8174/0.9215/0.9820/0.8663    0.9991/0.6769/0.8432/0.8071
MAD-GAN      0.8665/0.7011/0.8477/0.7751    0.8537/0.9890/0.9983/0.9164    0.9395/1.0000/0.9835/0.9688    0.8156/0.9215/0.9890/0.8653    0.9592/0.6956/0.8462/0.8064
USAD         0.8421/0.6667/0.8332/0.7443    0.8953/1.0000/0.9990/0.8953    0.8954/0.9990/0.9702/0.9444    0.7481/0.9628/0.9890/0.8419    0.9977/0.6879/0.8460/0.8143
MTAD-GAT     0.8422/0.7273/0.8222/0.7803    0.7813/0.9973/0.9979/0.8762    0.9019/1.0000/0.9720/0.9483    0.7992/0.9992/0.9846/0.8882    0.9719/0.6958/0.8465/0.8110
CAE-M        0.7919/0.8020/0.8020/0.7969    0.6982/1.0000/0.9958/0.8223    0.8443/0.9998/0.9662/0.9155    0.8194/0.9568/0.9902/0.8828    0.9698/0.6958/0.8465/0.8102
GDN          0.8130/0.7873/0.8543/0.7999    0.6895/0.9989/0.9960/0.8159    0.8833/0.9893/0.9529/0.9333    0.7481/0.9892/0.9865/0.8519    0.9698/0.6958/0.8463/0.8102
TranAD       0.8889/0.9892/0.9541/0.9364    0.9407/1.0000/0.9994/0.9694    0.9576/1.0000/0.9886/0.9783    0.8104/0.9998/0.9887/0.8953    0.9977/0.6879/0.8438/0.8143
DTAAD        0.8889/0.9999/0.9996/0.9412    0.8880/1.0000/0.9988/0.9407    0.9608/1.0000/0.9896/0.9800    0.8220/0.9999/0.9911/0.9023    0.9697/0.6957/0.8462/0.8101
KBJNet       0.8889/0.9999/0.9996/0.9412    0.9999/1.0000/0.9999/0.9999    0.9805/1.0000/0.9898/0.9805    0.8302/0.9999/0.9901/0.9072    0.9718/0.6957/0.8463/0.8109

METHOD       SMD (P/R/AUC/F1)               MSL (P/R/AUC/F1)               MSDS (P/R/AUC/F1)              WADI (P/R/AUC/F1)
MSCRED       0.7275/0.9973/0.9920/0.8413    0.8911/0.9861/0.9806/0.9362    0.9998/0.7982/0.8942/0.8878    0.2512/0.7318/0.8411/0.3740
MAD-GAN      0.9990/0.8439/0.9932/0.9149    0.8515/0.9929/0.9861/0.9168    0.9981/0.6106/0.8053/0.7578    0.2232/0.9123/0.8025/0.3587
USAD         0.9061/0.9975/0.9934/0.9496    0.7949/0.9912/0.9795/0.8822    0.9913/0.7960/0.8980/0.8829    0.1874/0.8297/0.8724/0.3057
MTAD-GAT     0.8211/0.9216/0.9922/0.8684    0.7918/0.9825/0.9890/0.8769    0.9920/0.7965/0.8983/0.8835    0.2819/0.8013/0.8822/0.4170
CAE-M        0.9081/0.9670/0.9782/0.9368    0.7752/1.0000/0.9904/0.8734    0.9909/0.8440/0.9014/0.9115    0.2783/0.7917/0.8727/0.4118
GDN          0.7171/0.9975/0.9925/0.8343    0.9309/0.9893/0.9815/0.9592    0.9990/0.8027/0.9106/0.8900    0.2913/0.7932/0.8778/0.4261
TranAD       0.9051/0.9973/0.9933/0.9490    0.9037/0.9999/0.9915/0.9493    0.9998/0.8625/0.9012/0.8904    0.3959/0.8295/0.8998/0.5360
DTAAD        0.8463/0.9974/0.9892/0.9147    0.9038/0.9999/0.9918/0.9495    0.9999/0.8026/0.9013/0.8905    0.9017/0.3910/0.6950/0.5455
KBJNet       0.9985/0.9974/0.9987/0.9985    0.9038/0.9999/0.9916/0.9496    0.9592/0.9554/0.9248/0.9573    0.8465/0.8296/0.9130/0.8379

The POT method is used in models such as TranAD, DTAAD, and KBJNet to determine more precise threshold values by considering localized peak values in data sequences. Models like MSCRED use sequential observations as input and retain temporal information, but they may not detect anomalies close to normal trends. KBJNet addresses this issue by amplifying errors using a bi-joint network, enabling it to detect even mild anomalies in datasets such as SMD, where abnormal data is relatively close to regular data, as shown in Figure 10.

Figure 10 

Ground truth and predicted for the SMD using the KBJNet.

MSCRED is effective at storing temporal information because of its continuous observations and performs well on partial datasets, but it struggles to identify anomalies close to normal behavior and operates at a lower speed. The KBJNet architecture can effectively capture information from various dimensions simultaneously, and it can efficiently track the input and capture long-range dependencies thanks to position encoding and residual connections. As seen in Figure 8, TranAD, DTAAD, and KBJNet have an advantage over other models because they utilize meta-learning to accelerate model training. Among the other models, MSCRED and the GRU in MTAD-GAT are quite inefficient in operation speed because they are not executed in parallel; on large-volume datasets, their training time is slower than KBJNet's. Apart from KBJNet, USAD considers time-performance optimization, although with limited effect. Both USAD and MAD-GAN adopt generative adversarial training, but USAD is less computationally intensive than MAD-GAN. Figure 6 and Figure 7 illustrate the training time and inference time on all datasets.

Figure 6 

Training time in all datasets.

Figure 7 

Inference time in all datasets.

Figure 8 

Sensitivity to window size.

3) Sensitivity to the number of training epochs: The correlation between the performance of the anomaly detection model and the number of training epochs is illustrated in Table VI. It reveals that the model's recall remains consistently high at 0.9974 across all training epochs, indicating that the model reliably identifies true positive cases and has a low rate of false negatives, which is important for effectively detecting anomalies in non-normal datasets. The AUC score increases from 0.9200 in the first epoch to 0.9985 in the tenth, indicating that the model's ability to differentiate between anomalies and normal data points improves with additional training epochs. The F1-score shows an increasing trend from 0.9393 in the second epoch to 0.9972 in the tenth, suggesting that the model achieves a better balance between precision and recall as the number of training epochs increases, which is important for an effective anomaly detection model.

Table VI

The connection between the number of training epochs and performance on the SMD dataset.


EPOCH    PRECISION    RECALL    AUC       F1-SCORE
1        0.9567       0.8440    0.9200    0.8968
2        0.8876       0.9974    0.9922    0.9393
3        0.8831       0.9974    0.9919    0.9368
4        0.8996       0.9974    0.9929    0.9460
5        0.9662       0.9974    0.9969    0.9815
6        0.9985       0.9974    0.9986    0.9979
7        0.9996       0.9974    0.9987    0.9985
8        0.9992       0.9974    0.9986    0.9983
9        0.9985       0.9974    0.9986    0.9979
10       0.9970       0.9974    0.9985    0.9972

4) Sensitivity to window size: In this study, we present findings on three multivariate datasets: SMD, MSDS, and WADI. This choice is based on the consistently better performance demonstrated by KBJNet across diverse datasets. Increasing the window size affects the temporal dependency captured in the data: a larger window increases dependency on other data points, which also affects the speed of anomaly detection. Figure 8 illustrates the detection results for four window sizes across the three datasets. Better performance is observed with window sizes of 5 and 20 for SMD and 20 for WADI. The results suggest that smaller windows are more suitable for datasets with weak dependencies. In the case of the SMD dataset, a decrease in performance is evident when the window size grows to the point of reducing the model's generalization ability. Moreover, larger windows increase memory and computational requirements, slowing down the training process.

5) Sensitivity to MAML: The utilization of MAML enables KBJNet to swiftly discern unusual patterns in sequential data, even when dealing with a limited dataset (Table VII). The response of KBJNet to different K values varies with the dataset under consideration, and the effectiveness of MAML depends on the degree of similarity between the meta-tasks and the target task. The findings suggest that selecting smaller K values in MAML is more suitable. In the case of the MSL dataset, we observe a deterioration in performance as K increases, impacting both computational efficiency and overall performance. Furthermore, larger K values impose greater computational demands and slow down the training process.

Table VII

Sensitivity of KBJNet to MAML with 20% of the training datasets, according to the meta step-size.


DATASET    5         10        15        20
NAB        0.9231    0.9057    0.9057    0.9231
UCR        0.9328    0.9328    0.9328    0.9328
MBA        0.9869    0.9871    0.9867    0.9871
SMAP       0.9007    0.8926    0.8926    0.9338
MSL        0.9451    0.8998    0.8998    0.8998
SWaT       0.8087    0.8087    0.8094    0.8087
SMD        0.9983    0.9970    0.9820    0.9983
MSDS       0.9107    0.9107    0.9107    0.9107
WADI       0.1511    0.1104    0.1208    0.1071

6) Sensitivity to kernel size: For these findings, we kept the global TCN layers and altered the receptive field by adjusting the filter size. Once again, we experimented with the SMD, MSDS, and WADI datasets. The results are presented in Figure 9. Optimal performance was achieved on the SMD and MSDS datasets, with a slight decrease observed for WADI; kernel size is therefore a consideration. However, because of the consistent expansion factor, changes in kernel size do not significantly affect the final results.

Figure 9 

Sensitivity to kernel size.

7) Ablation analysis: Table VIII summarizes the F1 scores and AUC values for KBJNet and its ablated versions, each trained with 80% of the training dataset. First, our proposed KBJNet model has proven effective, as it achieves the highest performance in terms of both AUC and F1 scores on most datasets.

Table VIII

F1 scores and AUC for KBJNet with 80% of the training datasets.


COMPONENT          NAB (AUC/F1)      UCR (AUC/F1)      MBA (AUC/F1)      SMAP (AUC/F1)     MSL (AUC/F1)      SWaT (AUC/F1)     SMD (AUC/F1)      MSDS (AUC/F1)
KBJNet             0.9996/0.9412     0.9999/0.9999     0.9898/0.9805     0.9901/0.9072     0.9916/0.9496     0.8463/0.8109     0.9987/0.9985     0.9248/0.9573
(-) Bi-Joint TCN   0.9996/0.9411     0.9986/0.9327     0.9898/0.9787     0.9903/0.9083     0.9565/0.7848     0.8462/0.8101     0.9911/0.8732     0.9809/0.8991
(-) MAML           0.9996/0.9412     0.9990/0.9527     0.9889/0.9787     0.9890/0.8974     0.9573/0.7878     0.8462/0.8101     0.9923/0.8790     0.9784/0.8872
(-) Transformer    0.9325/0.9050     0.9980/0.9188     0.9926/0.9858     0.9853/0.8682     0.9700/0.8412     0.8459/0.8086     0.9852/0.8582     0.9789/0.8937

We conducted ablation experiments on the KBJNet model to evaluate the impact of each component by removing the bi-joint TCN, MAML, and transformer modules in turn. The results in Table VIII show that eliminating the bi-joint TCN module slightly reduces the F1 scores for most datasets; however, its effect on the AUC scores of the UCR, MBA, SMAP, and MSL datasets is more pronounced. This indicates that the bi-joint TCN module contributes significantly to capturing temporal dependencies and enhancing the overall effectiveness of the KBJNet model.

Next, we observe that removing the MAML module has a greater impact on the F1 scores than on the AUC values of most datasets, indicating that the MAML module contributes to improving the model’s ability to adapt to new tasks and data distributions. Finally, removing the transformer module exerts the greatest influence on the AUC values of the NAB and MSL datasets. This suggests that the transformer module is essential for capturing global contextual information and enhancing the model’s discriminative power. Figure 6 reveals that KBJNet requires significantly less time than the baseline methods. These findings indicate the lightweight nature of our model and highlight the benefits of incorporating positional encoding.

In summary, Table VIII confirms that each component of the KBJNet model contributes to its overall anomaly detection performance, with the bi-joint TCN module playing the most critical role in capturing temporal dependencies, followed by the MAML module for better adaptation to new tasks and the transformer module for capturing global contextual information.

V. Conclusion

This research developed the KBJNet, a novel anomaly detection model based on bi-joint TCN, which accurately identifies anomalies within multivariate time series data. Leveraging the power of the transformer architecture, our model adeptly handles lengthy data sequences.

Through rigorous experimentation across nine benchmark datasets, KBJNet outperforms established state-of-the-art methods, yielding substantial enhancements in F1 and F1* scores, ranging from 2% to 9%, for complete and compact datasets, respectively. We noticed that our algorithm did not surpass all aspects of the other algorithms. However, it is worth highlighting that KBJNet exhibited superior performance to most algorithms under consideration. Furthermore, KBJNet is versatile and can adapt for deployment across diverse devices, making it particularly well-suited for contemporary industrial and embedded systems demanding accurate and efficient anomaly detection.

To ensure a more comprehensive assessment of its efficacy, further experimentation with datasets from diverse fields will be beneficial. This broader testing approach will enable us to determine the model’s applicability and performance in various contexts beyond the industrial domain. Optimizing our model’s efficiency remains open to further research, potentially enhancing processing speed and resource utilization.