1 Introduction

To build trust with customers who prioritize security, organizations must be transparent in three key areas: obtaining permission to store private data, following privacy regulations, and organizing the collected data. Transparency in these areas demonstrates a commitment to protecting customer data and complying with legal requirements; failure to adhere to these regulations can result in significant fines and reputational damage (). Beyond legal compliance, organizations must also address the risks posed by external threats such as malware or hackers, which can lead to financial losses and erode customer confidence. To mitigate such risks and maintain customer trust, organizations should adopt transparent practices in data collection, data handling, and data security ().

In the context of machine learning, differential privacy plays a crucial role in data privacy. It ensures that individual identities within a dataset remain anonymous, preventing observers from linking specific individuals to the published results. By injecting random noise drawn from a suitable distribution, the technique obscures individuals' genuine answers and thereby protects their privacy (; ). To ensure accurate and reliable data analysis, organizations can employ data cleaning techniques and leverage machine learning frameworks that offer methods and APIs for data imputation; missing values can be filled using statistical measures such as medians, means, or standard deviations, or with techniques like k-nearest neighbors (k-NN) (; ).

Machine learning algorithms such as support vector machines (SVMs), clustering, and neural networks are employed to analyze and find patterns in large datasets. Clustering groups similar data points, enabling the discovery and examination of patterns across multiple datasets (; ). Neural networks, inspired by cognitive processes, excel at identifying complex patterns (; ). Convolution is a technique used in neural networks in which each layer focuses on specific features, gradually capturing higher-level properties (). To perform secure computations on encrypted data, organizations can utilize homomorphic encryption, which enables calculations to be carried out directly on encrypted data and produces results equivalent to those obtained from plaintext data (; ).
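The kind of imputation described above can be sketched with scikit-learn's imputers; the feature values below are purely illustrative assumptions, not data from this work.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Illustrative feature matrix with missing entries (NaN); the values are made up.
X = np.array([
    [1.0, 200.0, np.nan],
    [2.0, np.nan, 0.5],
    [np.nan, 180.0, 0.7],
    [4.0, 210.0, 0.9],
])

# Statistical imputation: replace each missing value with the column median.
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)

# k-NN imputation: fill each missing value from the k most similar rows.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)

print(X_median)
print(X_knn)
```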

Federated learning addresses the challenge of handling heterogeneous data. Raw data is neither centrally stored nor transmitted; each client's data remains private on its own device. Analysts work with aggregated updates from clients rather than accessing specific communications, ensuring privacy while enabling rapid analysis ().

While differential privacy protects individual privacy, it introduces a tradeoff with accuracy: the noise or perturbations added to protect individuals can degrade the quality of the analysis and make it difficult to draw conclusions from individual samples. Gradient-based learning systems can achieve differential privacy by introducing random perturbations, such as Gaussian noise, into intermediate outputs (). In federated learning, FedAvg is a communication-efficient distributed averaging approach in which multiple clients train local models that are then aggregated into a global model (). Federated learning distinguishes between local and global privacy. Global privacy ensures that the modifications made to the model in each round remain hidden from all external parties, while local privacy additionally keeps those changes hidden from the server. Minimizing the data held on the server also reduces memory and computation requirements during training iterations, a practice often referred to as data minimization ().
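The following is a minimal NumPy sketch of the two ideas just described: FedAvg-style weighted averaging of client model updates, with Gaussian noise added to each clipped update for differential privacy. The clipping bound, noise scale, and weight shapes are illustrative assumptions, not values used in this work.

```python
import numpy as np

def dp_fedavg(client_weights, client_sizes, clip_norm=1.0, noise_std=0.1, rng=None):
    """Aggregate client weight vectors with FedAvg, adding Gaussian noise
    to each clipped update; clip_norm and noise_std are illustrative."""
    rng = rng or np.random.default_rng(0)
    noisy = []
    for w in client_weights:
        # Clip the update to bound each client's influence.
        norm = np.linalg.norm(w)
        w_clipped = w * min(1.0, clip_norm / (norm + 1e-12))
        # Add Gaussian noise to the clipped update.
        noisy.append(w_clipped + rng.normal(0.0, noise_std, size=w.shape))
    # FedAvg: weight each client's contribution by its local dataset size.
    total = float(sum(client_sizes))
    return sum(n * (s / total) for n, s in zip(noisy, client_sizes))

# Toy usage with three hypothetical clients.
clients = [np.array([0.2, -0.1]), np.array([0.25, -0.05]), np.array([0.15, -0.2])]
sizes = [100, 50, 150]
print(dp_fedavg(clients, sizes))
```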

Blockchain technology provides a solution to reduce privacy erosion while enabling controlled data sharing. Users can selectively disclose parts of their personal data on a blockchain to access specific services. The transparency and decentralization of blockchain, exemplified by cryptocurrencies like Bitcoin, have demonstrated their reliability in managing information (; ).

Consensus algorithms play a crucial role in ensuring agreement among participants in a blockchain network. These algorithms enhance network stability and foster decentralized trust among anonymous peers (). Proof of work is a cryptographic mechanism that demonstrates a participant’s computational effort, while proof of stake selects stakeholders based on their holdings of the cryptocurrency involved ().
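As an illustration of the proof-of-work idea, the sketch below searches for a nonce whose SHA-256 hash of the block contents has a given number of leading zero bits; the difficulty level and block payload are hypothetical.

```python
import hashlib

def proof_of_work(block_data: str, difficulty_bits: int = 16) -> tuple[int, str]:
    """Find a nonce such that SHA-256(block_data + nonce) has
    `difficulty_bits` leading zero bits. The difficulty is illustrative."""
    target = 2 ** (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if int(digest, 16) < target:
            return nonce, digest
        nonce += 1

nonce, digest = proof_of_work("prev_hash|tx_merkle_root|timestamp", difficulty_bits=16)
print(nonce, digest)
```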

The main objective of this work is to enhance privacy-preserving techniques in intrusion detection systems deployed in blockchain-based networks. Leveraging federated deep learning, the proposed model ensures efficient and accurate intrusion detection while maintaining data privacy. By utilizing federated learning techniques, individual client data remains confidential, allowing collaborative model training and analysis.

2 Literature Survey

Qiang, Liu & Jin () indicate that convolutional neural networks (CNNs) and binary neural networks (BNNs) can be used to collect data, encrypt it before storing it in the cloud, and train and test without additional decryption. Their system may be risky, however, because all data is gathered in one place before it is encrypted. Rui Hu et al. () used an iterative search technique to propose a personalized federated learning strategy that offers strong privacy for user data regardless of user heterogeneity; because of device heterogeneity, however, the training process is difficult and complex. Data science and machine learning methods such as homomorphic encryption and dimensionality reduction can be utilized to guarantee the confidentiality of data. Rahman et al. () presented such a system, in which dimensionality reduction and homomorphic encryption preserve data confidentiality. The method is designed to give users confidence that machine learning will preserve their data privacy and prevent their personal information from being used for commercial gain, but it is applied only to certain illnesses and medical conditions. S. Shaham et al. () described a machine learning strategy for releasing location data while maintaining anonymity. The strategy incorporates K-means clustering together with alignment and generalization methods; with this machine-learning-based anonymization framework (MLA), geographical itinerary datasets can be released while users' privacy is protected. Tanwar et al. () described a methodology for creating intelligent blockchain-based applications that relies on machine learning methods and uses the Secure Hash Algorithm (SHA) within its consensus mechanism. Smart cities, healthcare systems, smart grids, and unmanned aerial vehicles (UAVs) could all benefit from this combination of machine learning and blockchain technology, but the high demand for internet bandwidth and the growth of the chain appear to hinder performance. According to Wang et al. (), the privacy of datasets can be adjusted based on the distribution of the data; their approach improves transmission, storage, and training efficiency as well as the security of client data. Sparse differential gradients increase transmission efficiency, but accuracy declines by 0.03%. Shayan et al. propose decentralized federated learning through mutual knowledge transfer, using heterogeneous rather than homogeneous data; after a certain number of cycles this technique is more accurate than baseline techniques, although additional theoretical work is needed for it to perform better. Across a variety of datasets, experimental setups, and privacy budgets, Simonyan () found that logistic regression outperformed differentially private regression, and that differential privacy causes significantly worse performance degradation in federated learning. In a framework called PPSF (), Srivastava et al. () propose IoT-driven smart cities employing blockchain and machine learning; the analysis is based on LightGBM, e-PoW, and Principal Component Analysis (PCA). With PPSF, IoT-powered smart cities can maintain privacy and security using blockchain technology and machine learning. Without the need for a centralized model coordinator, Szegedy et al. () developed a decentralized, trustworthy, and secure technique for federated learning. It improves the privacy of model updates and successfully thwarts data-poisoning concerns, but its convergence rate is slower than that of the SAE model. Toyoda, Zhao & Zhang () put forward a strategy for implementing the design on present blockchain technology and compensating workers with bitcoin for following the protocols; the solution does not use blockchains in its implementation, but it does use an open network. Z. Sun et al. () proposed a technique for differentially private publication of a medical data model. Their proposed system uses algorithms including Mini-Batch Gradient Descent (MBGD), Differentially Private Mini-Batch (DPMB), Gradient Descent (GD), and Back Propagation (BP), and it offers adequate privacy guarantees when releasing and training data. The approach relies on a small number of datasets, and using several datasets at once reduces accuracy. Zhao et al. () developed NetMax, a decentralized and communication-efficient method for accelerating distributed machine learning over heterogeneous networks; it uses a stochastic gradient descent algorithm and a method for generating communication policies to increase communication effectiveness across decentralized networks. Their system malfunctions when a primary server is present, which raises concerns about data privacy.

Deep learning based approaches are widely employed in the literature to secure such environments (; ), with CNNs being the most common deep learning approach (). The existing literature primarily focuses on individual privacy-preserving techniques, neglecting the potential synergies that can be achieved by combining multiple approaches. It is crucial to explore the benefits and challenges of integrating various privacy-preserving methods to provide stronger assurances of data confidentiality and privacy, and there is a research gap in understanding how these techniques can work together cohesively to enhance overall privacy protection. Further investigation is therefore needed into the advantages and complexities of a holistic approach that combines multiple privacy-preserving techniques. By addressing this gap, this study aims to provide more robust and comprehensive solutions for ensuring data confidentiality and privacy.

3 Proposed Model

The system model depicted in Figure 1 demonstrates the process of data sharing using a combination of federated learning and blockchain technology. The data requester initiates the process by publishing a task on the blockchain, indicating the need for data sharing. Relevant data nodes receive this request and respond accordingly. Through consensus mechanisms, the participating nodes reach an agreement, and rewards are allocated based on their contributions. Additionally, data contributors register their data sharing request task on the alliance chain, further promoting the consensus process of the blockchain.

Figure 1 

Proposed Model.

Blockchain (BC) serves as a distributed, open, and decentralized ledger that enables secure storage and transmission of data to cloud servers (). Each block in the blockchain contains transaction information together with a timestamp and the hash value of the preceding block, which links the blocks into a chain. The cryptographic properties of the blockchain ensure the immutability of data, making it tamper-proof. This decentralized and trustworthy nature of blockchain technology facilitates the distribution of information in a secure and shared manner.
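A toy sketch of this block-linking idea is shown below; the block fields and transaction payloads are hypothetical and only illustrate how each block's hash binds it to its predecessor.

```python
import hashlib
import json
import time

def make_block(transactions, prev_hash):
    """Toy block: transactions, a timestamp, and the previous block's hash,
    all bound together by this block's own SHA-256 hash."""
    block = {
        "timestamp": time.time(),
        "transactions": transactions,
        "prev_hash": prev_hash,
    }
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

genesis = make_block(["genesis"], prev_hash="0" * 64)
block_1 = make_block(["node A publishes a data-sharing task"], prev_hash=genesis["hash"])
print(block_1["prev_hash"] == genesis["hash"])  # chain link intact
```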

Federated learning, as part of this system model, addresses privacy preservation. It ensures that data remains on local nodes or devices instead of being transferred to a central server. By performing model training locally on each node, federated learning significantly reduces the risk of exposing sensitive information during data transfer. Instead of sharing raw data, only model updates are exchanged, thereby protecting individual data privacy. To further enhance privacy, differential privacy techniques can be applied. These techniques introduce noise or perturbations to the shared model updates, making it challenging to infer or reconstruct individual data.

To ensure secure data transmission during the exchange of model updates in federated learning, encryption and secure communication protocols are used. These measures protect the transmission against unauthorized access or tampering, minimizing the risks of data leaks and malicious attacks. By leveraging encryption and secure protocols, the confidentiality and integrity of the shared data and model updates are maintained throughout the data sharing process. Furthermore, models that have been trained globally through federated learning can be packaged and recorded on the blockchain. This allows for transparent verification and auditing of the training process. By recording the models on the blockchain, their integrity and authenticity can be ensured, adding an extra layer of trust and accountability to the system.
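As one possible illustration of encrypting model updates before transmission, the sketch below serializes a NumPy weight update and encrypts it with a symmetric key using the cryptography library's Fernet recipe; the choice of library and the key-exchange details are assumptions, not part of the framework described here.

```python
import io
import numpy as np
from cryptography.fernet import Fernet

# Symmetric key shared between client and aggregator (key distribution not shown).
key = Fernet.generate_key()
cipher = Fernet(key)

# Serialize a model update (here, a toy weight array) to bytes.
update = np.array([0.12, -0.03, 0.44])
buf = io.BytesIO()
np.save(buf, update)

# Encrypt before transmission; only holders of the key can recover the update.
token = cipher.encrypt(buf.getvalue())

# Receiver side: decrypt and deserialize.
restored = np.load(io.BytesIO(cipher.decrypt(token)))
print(np.allclose(update, restored))  # True
```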

3.1 LSTM-GRU architecture

In this study, long short-term memory (LSTM) and gated recurrent unit (GRU) layers are combined for predictive analytics. The LSTM-GRU architecture itself does not directly contribute to privacy preservation or secure data transfer; rather, it is a model architecture commonly used in federated learning for its ability to extract useful patterns. The proposed architecture contains six hidden layers, each with 256 hidden units: three of the hidden layers are LSTM layers and the other three are GRU layers. Leaky ReLU, a popular non-linear activation function, is employed in each of these hidden layers. The output stage consists of a single dense layer with one unit and a linear activation function; the dense layer reduces the output dimension of the preceding layers before passing it to the output. Because of the linear activation in the output layer, the output values are not constrained to a particular range, which is useful in some applications.
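A minimal PyTorch sketch of this architecture is given below. It assumes the layers are stacked as three LSTM layers followed by three GRU layers, applies Leaky ReLU between the stacked blocks rather than after every individual layer, and uses 41 input features as a placeholder; the stacking order and input dimension are assumptions, not specified in the text.

```python
import torch
import torch.nn as nn

class LSTMGRUNet(nn.Module):
    """Sketch of the described LSTM-GRU model: six hidden layers of 256 units
    (three LSTM, three GRU), Leaky ReLU activations, and one linear output unit."""
    def __init__(self, input_size: int, hidden_size: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=3, batch_first=True)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=3, batch_first=True)
        self.act = nn.LeakyReLU()
        self.dense = nn.Linear(hidden_size, 1)  # one unit, linear activation

    def forward(self, x):  # x: (batch, seq_len, input_size)
        out, _ = self.lstm(x)
        out = self.act(out)
        out, _ = self.gru(out)
        out = self.act(out[:, -1, :])   # use the last time step
        return self.dense(out)          # linear output, unbounded range

# Toy usage with 41 input features (as in NSL-KDD) and a sequence length of 1.
model = LSTMGRUNet(input_size=41)
y = model(torch.randn(8, 1, 41))
print(y.shape)  # torch.Size([8, 1])
```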

3.1.1 Gated Recurrent Unit (GRU)

The GRU was designed to address the problem of exploding or vanishing gradients. It is a simplified variant of the LSTM model that also uses gate structures to regulate information flow. Notably, the GRU lacks an output gate, so the entire hidden state is exposed at each time step. Whereas the LSTM has separate input and forget gates, the GRU combines them, leaving only two gates: the reset gate and the update gate. GRUs often perform well because they have fewer parameters and a more straightforward structure. The following equations describe the GRU reset and update gates (a small NumPy sketch after the variable definitions below illustrates these updates):

(1)
$r_t = \sigma(W_r[h_{t-1}, x_t] + U_r h_{t-1} + b_r)$
(2)
$z_t = \sigma(W_z[h_{t-1}, x_t] + U_z h_{t-1} + b_z)$
(3)
$\hat{h}_t = \tanh(W_h[r_t \ast h_{t-1}, x_t] + b_h)$
(4)
$h_t = (1 - z_t) \ast h_{t-1} + z_t \ast \hat{h}_t$

where:

$r_t$: reset gate at time step $t$.

$z_t$: update gate at time step $t$.

$h_{t-1}$: hidden state at time step $t-1$.

$x_t$: input at time step $t$.

$W_r, W_z, W_h$: weight matrices for the reset gate, update gate, and candidate hidden state calculations.

$U_r, U_z, U_h$: weight matrices applied to the hidden state for the reset gate, update gate, and candidate hidden state calculations.

$b_r, b_z, b_h$: biases for the reset gate, update gate, and candidate hidden state calculations.

$\sigma$: sigmoid activation function.

$\hat{h}_t$: candidate hidden state at time step $t$.

$W$: weight matrix for the interpolation calculation.

$b$: bias term for the interpolation calculation.

$h_t$: hidden state at time step $t$.
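As a concrete illustration of Eqs. (1)-(4), here is a minimal NumPy sketch of a single GRU step; the dimensions and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update following Eqs. (1)-(4). params holds W_r, W_z, W_h
    (acting on [h_prev, x_t]), U_r, U_z (acting on h_prev), and biases."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(params["W_r"] @ concat + params["U_r"] @ h_prev + params["b_r"])        # Eq. (1)
    z_t = sigmoid(params["W_z"] @ concat + params["U_z"] @ h_prev + params["b_z"])        # Eq. (2)
    h_hat = np.tanh(params["W_h"] @ np.concatenate([r_t * h_prev, x_t]) + params["b_h"])  # Eq. (3)
    return (1.0 - z_t) * h_prev + z_t * h_hat                                             # Eq. (4)

# Toy dimensions: 4 hidden units, 3 input features.
rng = np.random.default_rng(0)
H, D = 4, 3
params = {
    "W_r": rng.standard_normal((H, H + D)), "U_r": rng.standard_normal((H, H)), "b_r": np.zeros(H),
    "W_z": rng.standard_normal((H, H + D)), "U_z": rng.standard_normal((H, H)), "b_z": np.zeros(H),
    "W_h": rng.standard_normal((H, H + D)), "b_h": np.zeros(H),
}
print(gru_step(rng.standard_normal(D), np.zeros(H), params))
```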

3.1.2 Long Short-Term Memory (LSTM)

An LSTM, a kind of recurrent neural network, has three gates: the forget, input, and output gates. Its gating mechanism mitigates the vanishing gradient problem that causes gradients in conventional RNNs to disappear (). The forget gate is a crucial factor in deciding whether to preserve or delete previously learned information: it assesses the significance of the data carried over from the previous time step and decides whether to keep or discard it. The forget gate is formulated as follows (a NumPy sketch of the full update appears after the variable definitions below):

(5)
$p_t = \sigma(W_p[h_{t-1}, x_t] + b_p)$

The input gate and the candidate values are computed using the following formulas:

(6)
$q_t = \sigma(W_q[h_{t-1}, x_t] + b_q)$
(7)
$v_t = \tanh(W_v[h_{t-1}, x_t] + b_v)$

The output gate and the resulting hidden state are computed as:

(8)
$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$
(9)
$h_t = f_t \ast v_t + (1 - f_t) \ast h_{t-1}$

where:

$p_t$: forget gate activation vector at time step $t$.

$x_t$: input at time step $t$.

$b_p$: bias vector for the forget gate calculation.

$W_p$: weight matrix for the forget gate calculation.

$h_{t-1}$: previous hidden state at time step $t-1$.

$q_t$: input gate activation vector at time step $t$.

$v_t$: vector of new candidate cell state values at time step $t$.

$b_q, b_v$: bias vectors for the input gate and candidate cell state calculations.

$W_q, W_v$: weight matrices for the input gate and candidate cell state calculations.

$f_t$: output gate activation vector at time step $t$.

$h_t$: output (hidden state) at time step $t$.
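The following NumPy sketch transcribes the simplified gate formulation of Eqs. (5)-(9) exactly as given above (note that this formulation omits a separate cell state, and the input gate $q_t$ does not appear in the final update); the dimensions and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, params):
    """One update following Eqs. (5)-(9) as written in the text."""
    concat = np.concatenate([h_prev, x_t])
    p_t = sigmoid(params["W_p"] @ concat + params["b_p"])   # Eq. (5): forget gate
    q_t = sigmoid(params["W_q"] @ concat + params["b_q"])   # Eq. (6): input gate
    v_t = np.tanh(params["W_v"] @ concat + params["b_v"])   # Eq. (7): candidate values
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])   # Eq. (8): output gate
    h_t = f_t * v_t + (1.0 - f_t) * h_prev                  # Eq. (9): hidden state
    return h_t

# Toy dimensions: 4 hidden units, 3 input features.
rng = np.random.default_rng(1)
H, D = 4, 3
params = {name: rng.standard_normal((H, H + D)) for name in ("W_p", "W_q", "W_v", "W_f")}
params.update({name: np.zeros(H) for name in ("b_p", "b_q", "b_v", "b_f")})
print(lstm_step(rng.standard_normal(D), np.zeros(H), params))
```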

Together, federated learning and the surrounding privacy mechanisms contribute to privacy preservation and secure data transfer. Differential privacy protection () and mutual supervision mechanisms are implemented to mitigate the risks of data leaks and malicious attacks during data sharing. Federated learning ensures that data remains on the local nodes and that only model updates are exchanged, reducing the risk of exposing sensitive information, while the LSTM-GRU architecture provides the local modelling capacity and encryption with secure transmission protocols further enhances the privacy and security of the learning process.

4 Implementation

An experiment was conducted on the NSL-KDD dataset to test the effectiveness of the proposed network model. Experiments were conducted to determine the optimal sizes of the training and test sets based on the randomization of the dataset. A performance assessment was then conducted on the trained model using the test set.

4.1 Dataset

Intrusion detection models are commonly tested using the NSL-KDD 2015 dataset. The dataset contains 125,973 samples, divided into normal and anomalous samples, and 41 different characteristics are used to describe each sample. Of the total samples, 67,343 are classified as normal and 58,630 as anomalous.

4.2 Configuration setup

Local training in the client utilized PyTorch () for implementing the deep learning (DL) algorithm. To enable the development of the federated learning (FL) algorithm, PySyft (), a Python extension library compatible with major DL frameworks like PyTorch and TensorFlow (), was employed. PySyft provides the necessary requirements for FL algorithms and facilitates the development of secure and private DL algorithms. The implementation was conducted on the Google Colab platform () with GPU acceleration for efficient processing.

Regarding data preprocessing, the data was initially cleaned and then normalized using StandardScaler. Following a widely used split, the training set was assigned 80% of the data and the test set received 20%. Optimization of the adaptive FL algorithm was performed using the SGD () optimizer. In the proposed LSTM-GRU model, hyperparameters of 200 epochs, a learning rate of 0.1, and a batch size of 128 were selected, with checkpointing used to identify the most effective values.
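A short sketch of this preprocessing and configuration is given below. X and y are assumed to be the cleaned, numerically encoded NSL-KDD features and labels; how they are loaded and encoded is not shown, and the model class refers to the sketch from Section 3.1.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def make_loaders(X, y, batch_size=128):
    """80/20 split and StandardScaler normalization, as described in the text."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler().fit(X_train)  # normalization with StandardScaler
    def to_ds(Xs, ys):
        return TensorDataset(torch.tensor(scaler.transform(Xs), dtype=torch.float32),
                             torch.tensor(ys, dtype=torch.float32))
    train = DataLoader(to_ds(X_train, y_train), batch_size=batch_size, shuffle=True)
    test = DataLoader(to_ds(X_test, y_test), batch_size=batch_size)
    return train, test

# Hyperparameters reported in the text: 200 epochs, learning rate 0.1, batch size 128.
# model = LSTMGRUNet(input_size=41)                        # sketch from Section 3.1
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# criterion = torch.nn.BCEWithLogitsLoss()
```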

4.3 Evaluation metrics

To evaluate prediction performance, six classification evaluation metrics were utilized: Sensitivity (TPR), Specificity (SPC), Precision (PPV), Accuracy (ACC), F1 Score (F1), and the Matthews Correlation Coefficient (MCC). A short code sketch after Eqs. (10)-(15) shows how they are computed from confusion-matrix counts.

(10)
$TPR = TP / (TP + FN)$
(11)
$SPC = TN / (FP + TN)$
(12)
$PPV = TP / (TP + FP)$
(13)
$ACC = (TP + TN) / (TP + TN + FP + FN)$
(14)
$F1 = 2TP / (2TP + FP + FN)$
(15)
$MCC = (TP \times TN - FP \times FN) / \sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}$
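These metrics can be computed directly from the confusion-matrix counts; the sketch below is a straightforward transcription of Eqs. (10)-(15), and the example counts are purely illustrative, not values from this work.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the six metrics of Eqs. (10)-(15) from confusion-matrix counts."""
    tpr = tp / (tp + fn)                                   # Eq. (10): sensitivity
    spc = tn / (fp + tn)                                   # Eq. (11): specificity
    ppv = tp / (tp + fp)                                   # Eq. (12): precision
    acc = (tp + tn) / (tp + tn + fp + fn)                  # Eq. (13): accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)                       # Eq. (14): F1 score
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))     # Eq. (15): MCC
    return {"TPR": tpr, "SPC": spc, "PPV": ppv, "ACC": acc, "F1": f1, "MCC": mcc}

# Illustrative counts only (not the values reported in Section 5).
print(classification_metrics(tp=950, tn=970, fp=30, fn=50))
```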

5 Results and Discussions

This work used the NSL-KDD 2015 dataset to validate the proposed model's performance in blockchain-based privacy prediction. With a 20% testing split, the test set contained 11,726 anomalous samples and 13,468 normal samples; the results produced by the proposed model are presented in Table 1.

Table 1

Model Assessment.


| SENSITIVITY (TPR) | SPECIFICITY (SPC) | PRECISION (PPV) | ACCURACY (ACC) | F1 SCORE (F1) | MATTHEWS CORRELATION COEFFICIENT (MCC) |
| 0.9872 | 0.9927 | 0.9916 | 0.9901 | 0.9894 | 0.9802 |

Table 1 shows the evaluation metrics of the proposed blockchain-based model, which uses the NSL-KDD 2015 dataset and the LSTM-GRU architecture. The metrics were computed on the testing data, following the 80% training / 20% testing split, and comprise Sensitivity (TPR), Specificity (SPC), Precision (PPV), Accuracy (ACC), F1 Score (F1), and the Matthews Correlation Coefficient (MCC). Sensitivity (TPR) measures the true positive rate of the model, the ratio of true positives to the total number of actual positive samples; the sensitivity of the proposed model is 0.9872, meaning that 98.72% of the anomalous samples in the testing data were accurately identified. Specificity (SPC) measures the true negative rate and is calculated as the ratio of true negatives to the total number of actual negative samples; the specificity of the proposed model is 0.9927, indicating that it accurately identified 99.27% of the normal samples in the testing data. Precision (PPV), the positive predictive value, measures the ratio of true positives to predicted positives; the precision of the proposed model is 0.9916, which indicates that when it predicted an anomalous sample it was correct 99.16% of the time. Accuracy (ACC), the ratio of correctly classified samples to the total number of samples, measures the overall correctness of the model; the proposed model's accuracy is 0.9901, so 99.01% of samples were correctly classified. The F1 Score (F1) provides a balanced measure by combining precision and recall; the model achieves an F1 score of 0.9894, indicating a good balance between the two. Finally, the Matthews Correlation Coefficient (MCC) measures the correlation between predicted and actual classes while accounting for imbalance in the class distribution; here the MCC is 0.9802, indicating a strong correlation between the actual and predicted classes.

Table 2 displays the confusion matrix generated from the classification of the proposed technique on the NSL-KDD intrusion dataset. Out of the total test samples, the model correctly classified 11,628 attack samples and 13,317 normal samples. The table also provides the true positive, false positive, true negative, and false negative counts of the classification. Specifically, the model correctly classified 11,628 attack samples as attacks (TP) but incorrectly classified 98 attack samples as normal (FN); it correctly classified 13,317 normal samples as normal (TN) but incorrectly classified 151 normal samples as attacks (FP).

Table 2

Result of Confusion matrix.


|               | PREDICTED Attack | PREDICTED Normal |
| ACTUAL Attack | 11,628           | 98               |
| ACTUAL Normal | 151              | 13,317           |

Apart from the classification accuracy results, the performance of the proposed technique was evaluated using the ROC curve. Figure 2 presents the ROC curve of the proposed technique, depicting the tradeoff between the true positive rate and the false positive rate of the classifier.

Figure 2 

ROC Curve.

Table 3 highlights and compares the performance of the proposed model with other recent models in terms of accuracy. It demonstrates that the Privacy-Preserving Secure Framework using LSTM-GRU achieved a higher accuracy rate of 99.01% compared to the other models.

Table 3

Comparative analysis.


| REFERENCE | MODEL | ACCURACY (%) |
| K. Pradeep Mohan Kumar et al. () | PPSF-BODL | 97.46 |
| Alatawi, Mohammed Naif, et al. () | PSO-GA followed by ELM-BA | 96.04 |
| Proposed | Privacy-Preserving Secure Framework using LSTM-GRU | 99.01 |

6 Conclusion

In this study, a novel framework is presented that utilizes blockchain technology in collaboration with multiple contributors, incorporating federated learning for secure and privacy-preserving model training without centralized data storage. Through this approach, models can be trained simultaneously by multiple parties while each party's data remains locally private. The framework uses federated learning to improve the accuracy of the results and boost overall model performance. Transactions are stored in a decentralized, distributed digital ledger, and data privacy and security are ensured through a combination of methods. The LSTM-GRU model included in the framework facilitates primary data collection using sensing tools. Experimental results demonstrate the superiority of this approach over existing methods, with an accuracy of 99.01%. The research focused on the NSL-KDD dataset, a widely accepted benchmark for evaluating intrusion detection models, due to its suitable size and characteristics for initial experimentation and proof-of-concept studies. It is important to acknowledge, however, that latency in network communication can affect the efficiency of the training process in federated learning, and that blockchain technology may face scalability challenges that still need to be addressed. Future work will explore additional datasets with larger sample sizes and a wider range of intrusion scenarios, employing advanced deep learning algorithms to further improve detection results.

Data Accessibility statement

Available upon reasonable request.