A Study on the Application of Data Mining Techniques in the Management of Sustainable Education for Employment

Fang Fang

I. Introduction

Career education, which focuses on career development, job selection and workplace planning, is an important part of the educational work carried out in universities. With the development of modernisation and information technology, universities are gradually introducing information management into education, i.e., building information systems for students and teachers, and employment education is no exception. In today’s employment education management systems, the associated information is large in volume and complex, which directly affects the cultivation of students’ employability and the sustainable advancement of employment education. As a data processing tool, data mining technology can be applied to education management by selecting appropriate analysis tools to process the information in the database and extract useful, valuable information, and it has broad development prospects. Commonly used data mining methods include clustering, association rules and regression analysis; in practice, the appropriate method must be selected according to the characteristics of the database. The K-means algorithm typically used in clustering and the classical Apriori algorithm used in association rule mining both suffer from problems such as low efficiency and need improvement. Therefore, this research improves the K-means algorithm and the Apriori algorithm and applies them together in the employment education management system, with a view to improving its data management capabilities. The first part of the article is a literature review on data mining technology in educational data management, including the improvement and application of the K-means and Apriori algorithms. The second part describes the improved K-means and Apriori algorithms in detail, and the third part verifies the effect of applying data mining technology in employment education management, covering both the separate verification of each improved algorithm and the verification of their combined practical application. By verifying this effect, we hope to obtain more effective methods for further optimising management.

The key to sustainable education management in employment is the efficient processing of all relevant data and the mining of valuable information. In recent years, improvements to the K-means algorithm have received much attention from researchers, and the results have been fruitful. To cope with the local optima caused by the random selection of initial centroids in the K-means algorithm, Lakshmi K’s research team proposed combining K-means with a population-based metaheuristic optimisation algorithm that is modelled on the intelligent behaviour of crows and is able to find the global optimum. Experiments on benchmark datasets showed that the improved K-means algorithm has high accuracy. Alguliyev et al. developed a parallel clustering technique based on the K-means algorithm to meet the heavy computational demands of big data; it speeds up clustering while preserving the characteristics of the initial dataset as far as possible, and assigns each point to the nearest centroid according to the centroid positions obtained. The effectiveness of this algorithm was verified in a comparison with the algorithm before improvement. Hossain et al. addressed the problem that the K-means algorithm has a high probability of grouping different items into the same cluster, and designed a dynamic clustering method in which the K-means centroids are obtained by a threshold calculation and the number of clusters is formed from this value, so that the data are classified by comparing the threshold with the Euclidean distance. The results showed that this method outperforms the method before improvement. Shrifan et al. optimised the K-means algorithm using Tukey’s rule in combination with a novel distance metric, to address the large differences in clustering accuracy caused by different distance metrics in the classical K-means algorithm; the method minimises this impact by eliminating outliers, and the results showed that it significantly improves centroid convergence and increases the overall clustering accuracy by 80.57%. Laxmi Lydia et al. designed a new K-means non-negative matrix factorisation method for retrieving valid information from big data, which incorporates a keyword extraction algorithm, and the results showed that the method reduces the error rate by 5%.

To address the low efficiency of existing data mining techniques, Neysiani’s team proposed applying the butterfly optimisation algorithm to association rule mining, running the mining with a parallel strategy of one CPU and three GPUs and using the CPU as a synchroniser. Wang and Zheng designed an improved Apriori algorithm for frequent itemsets in time series to address the large candidate sets generated by the Apriori algorithm in data mining. Based on an analysis of the time series association rule mining process, their results showed that it outperformed the traditional Apriori algorithm in terms of storage space. To solve the problem that traditional data mining algorithms struggle to mine large-scale data in a timely manner, Liu et al. combined them with the frequent pattern growth algorithm from distributed parallel computing to achieve parallel mining of frequent itemsets and association rules, and the results verified the efficiency of the method. Subha developed a distributed association rule mining algorithm based on P-trees, which stores transactional data in a special data structure, the P-tree. The experimental results showed that the method simplifies message exchange and database scanning and preserves the stored data losslessly. Sun applied data mining techniques to a university academic affairs management system and made dynamic improvements based on the characteristics of the technique for mining latent information. The improved method increases the accuracy and convergence speed of clustering and confirms the feasibility of the clustering method in computer network education management. With a view to the sustainable development of college education, Wang and Soo-Jin used association rules to mine hidden data in student achievement information and analysed the influencing factors with a classification decision tree algorithm. The results showed that the method can effectively optimise the teaching management system.

In summary, most researchers have improved the selection of initial centroids for the K-means algorithm and optimised the Apriori algorithm accordingly. However, the clustering accuracy achieved is still low, and it remains difficult to meet the needs of educational data management. This study therefore combines the K-means and Apriori algorithms, using data mining technology to process employment education data in support of the sustainable development of employment education management. Specifically, the improved Apriori algorithm is combined with the SA-K-means method described in Section III: the SA-K-means algorithm first clusters the data as a pre-processing step, and the improved Apriori algorithm then mines the associations in the data, so as to better complete the classification and relationship mining of employment education data.

III. Data Mining-Based Sustainable Education Management for Employment

A. Clustering of employment education management data based on the K-means algorithm

Employment education in colleges usually involves employment education courses, the quality of employment education and students’ professional performance, and the relationships among them are intricate. At the same time, universities are currently strengthening employment and entrepreneurship education across the board, focusing on cultivating students’ comprehensive practical ability, taking the improvement of overall quality as the fundamental goal, and guiding students towards self-employment and high-quality employment. The factors to be considered are becoming increasingly complicated, and the huge amount of data and information generated is difficult for the employment education management system to handle efficiently. In terms of content alone, employment education mainly includes establishing career awareness, developing employability, job search guidance, self-employment guidance and so on. Therefore, in line with current educational development requirements, and drawing on expert consultation, theoretical analysis and an investigation of 32 well-known universities in China, the study first established the employment education management system and summarised the aspects of it that require data processing, as shown in Figure 1.

Figure 1

Main data processing of employment education management.

Before proceeding, the data are first clustered so that the useful information in the employment education data can be mined more effectively. Cluster analysis is an important branch of data mining, and K-means, one of its classical algorithms, is scalable, simple in principle, easy to implement, and has clear performance advantages in data integration. After k initial cluster centres are determined, the algorithm assigns every sample to the cluster whose centre is nearest, and approaches the optimum of the overall objective function through repeated iteration. However, the K-means algorithm usually requires the value of k to be specified manually, is highly dependent on the initial cluster centres, and frequently converges to local optima. Therefore, the study incorporates splitting and aggregation operations into K-means to form the SA-K-means algorithm, which further improves the K-means method and makes the clustering results more representative. The basic principle of K-means is shown in Figure 2.

Figure 2

Basic principle of K-means algorithm.

Cluster analysis with K-means is an iterative process in which k sample data points are first selected at random from the data set to form the initial cluster centres. The distance between the initial cluster centres and the sample data points is the basis for class assignment, i.e., each sample data point is assigned to the nearest cluster centre according to the proximity principle. New cluster centres are then obtained by taking the mean of the attribute values of all data points in each class, and the number of cluster centres is still k at this point. Finally, a judgement is made on whether the clustering criterion function has reached its optimum: if it has, the current partition is taken as the result; if not, the procedure iterates again. The objective criterion function is calculated as shown in equation (1).

(1)

\[E = \sum_{j=1}^{c} \sum_{k=1}^{n_j} \left\| x_k - m_j \right\|^2\]

In equation (1), E is the sum of the squared errors between the data points and their cluster centres, c is the number of clusters, n_j is the number of data points in the j-th cluster, m_j is the centre of the j-th cluster, and x_k is an individual data point in that cluster. Data samples are assigned according to similarity, and the Euclidean distance is used to measure similarity, as shown in equation (2).

(2)

\[d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}\]

In equation (2), x_i and x_j are samples in the dataset, x_ik and x_jk are the values of their k-th attribute, and n is the number of attributes.
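The iteration described above, together with equations (1) and (2), can be sketched in a few lines of Python. This is a minimal illustration of classical K-means rather than the study's implementation: the function names, the `seed` and `max_iter` parameters and the six sample points are assumptions introduced here for demonstration.

```python
import math
import random


def euclidean(a, b):
    """Equation (2): Euclidean distance between two samples."""
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))


def kmeans(samples, k, max_iter=100, seed=0):
    """Classical K-means; returns (centres, labels, sse)."""
    rng = random.Random(seed)
    # k samples chosen at random act as the initial cluster centres.
    centres = [list(p) for p in rng.sample(samples, k)]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centre.
        labels = [min(range(k), key=lambda j: euclidean(x, centres[j]))
                  for x in samples]
        # Update step: each centre becomes the mean of its members.
        new_centres = []
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(centres[j])  # an empty cluster keeps its centre
        if new_centres == centres:  # centres stopped moving: converged
            break
        centres = new_centres
    # Equation (1): sum of squared distances to the assigned centre.
    sse = sum(euclidean(x, centres[lab]) ** 2 for x, lab in zip(samples, labels))
    return centres, labels, sse


# Hypothetical two-attribute samples forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centres, labels, sse = kmeans(points, k=2)
```

On data that are less cleanly separated, different values of `seed` can lead to different partitions, which is the dependence on initial centres that motivates the SA-K-means improvement.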

As clustering proceeds, an optimal partition is obtained, which not only maximises the separation between clusters but also keeps each individual cluster highly compact. The clustering is evaluated by equation (3).

(3)

\[\begin{cases} J = \sum\limits_{i=1}^{k} \sum\limits_{x_j \in C_i} \mathrm{dis}(x_j, c_i)^2 \\ c_i = \dfrac{1}{N_i} \sum\limits_{x_j \in C_i} x_j \end{cases}\]

In equation (3), c_i is the mean of the data samples in class C_i, x_j is a data sample contained in C_i, dis(x_j, c_i) is the Euclidean distance between x_j and c_i, and N_i is the number of samples contained in the i-th cluster. The vector of cluster centres needs to be corrected after the first clustering is completed, as shown in equation (4).

(4)

\[z_j = \frac{\sum_{x \in S_j} x}{N_j}, \quad j = 1, 2, \ldots, k\]

In equation (4), z_j is the centre of the j-th cluster, S_j is the set of samples in that cluster, and N_j is the number of samples it contains. The mean distance between each cluster centre and its samples is calculated by equation (5).

(5)

\[D_j = \frac{\sum_{x \in S_j} \left\| x - z_j \right\|}{N_j}, \quad j = 1, 2, \ldots, k\]

In equation (5), D_j is the average distance between the samples of the j-th cluster and its centre. The total average distance then follows from equation (6).

(6)

\[\overline{D} = \frac{1}{N} \sum_{j=1}^{k} \sum_{x \in S_j} \left\| x - z_j \right\|\]

In equation (6), $\overline{D}$ is the total mean distance and N is the total number of samples. At this point a split or merge operation is added, which acts on the results of the previous clustering. The purpose of the splitting process is to increase the number of cluster centres as far as possible while keeping the original cluster centres intact. A merge step is also added to handle cases where data samples of different categories lie too close together. Computing the standard deviation between the centroids and the data samples is a necessary step for the added splitting operation, as shown in equation (7).

(7)

\[\sigma_j = \sqrt{\frac{\sum_{x \in S_j} (x - z_j)^2}{N_j}}\]

In equation (7), σ_j is the standard deviation of the j-th cluster. Each cluster has its own standard deviation, and the largest of these, σ_max, is selected. When σ_max exceeds the upper limit set for the standard deviation of the samples within a cluster, the number of samples in the cluster exceeds the specified limit and the cluster’s average distance is greater than the total average distance, or the number of clusters is less than half of the required number, the splitting operation is performed and two new cluster centres are obtained, see equation (8).

(8)

\[\begin{cases} z_j^{+} = z_j - \rho \sigma_{\max}, \quad 0 < \rho < 1 \\ z_j^{-} = z_j + \rho \sigma_{\max} \end{cases}\]

In equation (8), $z_j^{+}$ and $z_j^{-}$ are the cluster centres obtained after splitting, and $\sigma_{\max}$ is the maximum standard deviation. When performing the merge operation, the cluster centres of each pair of clusters are compared using equation (9).

(9)

\[D_{ij} = \left\| z_i - z_j \right\|, \quad i = 1, 2, 3, \ldots, k - 1, \quad j = i + 1, \ldots, k\]

When the smallest of the distances D_ij is below the minimum distance permitted between cluster centres, a merge operation is performed to obtain a new cluster centre, as shown in equation (10).

(10)

\[z_i^{*} = \frac{N_j z_j + N_i z_i}{N_j + N_i}\]

In equation (10), $z_i^{*}$ is the new cluster centre. When the merge operation is completed, the number of clusters is reduced by one, and the corresponding centroid vectors and groupings are output once the stopping condition of the iteration is satisfied.
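The quantities in equations (4), (5), (7), (8) and (10) can be illustrated with a short Python sketch. It rests on several assumptions that the text does not state: the standard deviation is taken attribute by attribute, the split is applied along the attribute with the largest deviation (as in ISODATA-style methods), and ρ defaults to 0.5. The thresholds that decide when to split or merge are left out, and all function names are hypothetical.

```python
import math


def centre(members):
    """Equation (4): mean of the samples in one cluster."""
    return [sum(col) / len(members) for col in zip(*members)]


def mean_distance(members, z):
    """Equation (5): average distance from the members to centre z."""
    return sum(math.dist(x, z) for x in members) / len(members)


def std_per_attribute(members, z):
    """Equation (7), evaluated attribute by attribute (an assumption)."""
    n = len(members)
    return [math.sqrt(sum((x[d] - z[d]) ** 2 for x in members) / n)
            for d in range(len(z))]


def split(z, sigma, rho=0.5):
    """Equation (8): two centres offset by rho * sigma_max along one attribute."""
    d = max(range(len(sigma)), key=lambda i: sigma[i])  # widest attribute
    z_plus, z_minus = list(z), list(z)
    z_plus[d] = z[d] - rho * sigma[d]   # z_j^+ with the sign used in equation (8)
    z_minus[d] = z[d] + rho * sigma[d]  # z_j^-
    return z_plus, z_minus


def merge(z_i, n_i, z_j, n_j):
    """Equation (10): size-weighted mean of two cluster centres."""
    return [(n_i * a + n_j * b) / (n_i + n_j) for a, b in zip(z_i, z_j)]


# Hypothetical cluster that is much wider along its first attribute.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0)]
z = centre(cluster)
sigma = std_per_attribute(cluster, z)
z_plus, z_minus = split(z, sigma)
merged = merge(z_plus, 2, z_minus, 2)
```

In this example the split moves the two new centres apart along the first attribute only, and merging them again with equal sample counts recovers the original centre, which is a quick consistency check between equations (8) and (10).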

B. Analysis of employment education management information based on association rules

After the employment education data have been clustered with the improved K-means method, the correlations within the information are further identified through association rules, to facilitate the analysis of educational information, provide more scientific and effective guidance for employment education, and promote sustainable education development. The purpose of association rule mining is to extract all the strong association rules that exist in the target transaction database, i.e., the support of a mined rule must be greater than or equal to the minimum support, and its confidence must be greater than or equal to the minimum confidence. Finding frequent itemsets and deriving strong association rules from them are the two main steps of association rule mining, and the efficiency of rule mining is closely tied to the efficiency of mining frequent itemsets. The Apriori algorithm, a classical method in association rule mining, is simple and easy to use, and is widely applied to transactional databases. The Apriori algorithm first sets the minimum support threshold and minimum confidence threshold according to the required strength of the association rules, and then reads all the transaction data, in which every item is a candidate 1-itemset in C1. The support of each item in C1 is then obtained from the total number of transactions and compared with the minimum support threshold one by one. Items whose support is below the threshold are removed, and those whose support is greater than or equal to it are kept as the frequent 1-itemsets L1; the candidate 2-itemsets C2 are obtained by joining L1 with itself. A second scan of the database is then performed, the support of the items in C2 is calculated, and L2 is obtained and C3 generated by following the same steps as after the first scan. The flow of frequent itemset mining in the Apriori algorithm is shown in Figure 3.

Figure 3

Mining process of frequent itemsets in Apriori algorithm.
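The level-by-level flow of the classical Apriori algorithm (scan, filter by minimum support, join, prune) can be sketched as follows. This is a minimal illustration rather than the study's code; the function name `apriori` and the student records, with items such as `career_course`, are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]

    def support(candidate):
        # Fraction of transactions that contain every item of the candidate.
        return sum(candidate <= t for t in sets) / n

    # C1: every single item is a candidate 1-itemset.
    candidates = {frozenset([item]) for t in sets for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Scan the database and keep candidates that reach the threshold (L_k).
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        # Join L_k with itself to form C_(k+1), then prune every candidate
        # that has an infrequent k-item subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in joined
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))}
    return frequent


# Hypothetical records: which activities a student took part in, and the outcome.
records = [
    {"career_course", "internship", "employed"},
    {"career_course", "internship", "employed"},
    {"career_course", "employed"},
    {"internship"},
    {"career_course", "internship"},
]
frequent = apriori(records, min_support=0.6)
```

Each pass of the `while` loop corresponds to one database scan in Figure 3, which is why the number of candidates generated per level dominates the running time.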

Calculating the support of individual itemsets is one of the key steps in the Apriori algorithm. The support represents the proportion of transactions in the database that contain the itemset, and is calculated as shown in equation (11).

(11)

\[\mathrm{support}(x) = \frac{\mathrm{number}(x \subseteq T)}{\left| D \right|}\]

In equation (11), D is the transaction database, T is an individual transaction, number(x ⊆ T) is the number of transactions that contain the itemset x, and support(x) is the support of x. After the itemsets that fail the support check have been eliminated, the frequent itemsets are merged to obtain the new candidate set, as shown in equation (12).

(12)

\[C_k = L_{(k-1)n} \cup L_{(k-1)m}\]

In equation (12), C_k is the set of candidate k-itemsets, L is the set of frequent itemsets, and n and m index two frequent (k−1)-itemsets within it. The new set formed by combining the two itemsets is then traversed, as shown in equation (13).

(13)

\[C_j = C_k(j)\]

In equation (13), C_j is the j-th subset of the new candidate set C_k. The number of occurrences of each subset is obtained by the counting procedure shown in equation (14).

(14)

\[\mathrm{count} = \begin{cases} \mathrm{count}, & C_j \not\subset I_i \\ \mathrm{count} + 1, & C_j \subset I_i \end{cases} \quad (j = 0, 1, 2, \ldots;\ i = 1, 2, 3, \ldots, N)\]

In equation (14), C_j is a subset of C_k, I_i is the i-th itemset examined, N is the number of such itemsets, count starts from 0, and k is the maximum value of j. The confidence is calculated as shown in equation (15).

(15)

\[\mathrm{con}(x \Rightarrow y) = \frac{\mathrm{support}(x \Rightarrow y)}{\mathrm{support}(x)} = \frac{P(xy)}{P(x)} = P(y \mid x)\]

In equation (15), x and y are distinct itemsets, $\mathrm{con}(x \Rightarrow y)$ is the probability that itemset y appears in transactions containing itemset x, and $\mathrm{support}(x \Rightarrow y)$ is the probability that a transaction contains both x and y, i.e., P(xy). The probability of itemset x appearing in the database is P(x), and the probability of itemset y appearing given itemset x is P(y|x). The Apriori algorithm still generates many unnecessary candidate sets as the size of the candidate set grows. Therefore, a parallel algorithm based on Matrix and Weight (MW-Apriori) is proposed to improve it. The algorithm introduces parallel computing and partitions the Boolean matrix into blocks under the MapReduce framework, thus improving the efficiency of the algorithm. The MW-Apriori algorithm also reduces the number of candidate sets by eliminating items that do not satisfy the conditions before performing join operations on the data, thus compressing the database.
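As a worked illustration of equations (11) and (15), consider hypothetical counts that are not taken from the study: a database of |D| = 5 transactions in which 4 contain itemset x and 3 contain both x and y. Then

\[\mathrm{support}(x) = \frac{4}{5} = 0.8, \qquad \mathrm{support}(x \Rightarrow y) = \frac{3}{5} = 0.6, \qquad \mathrm{con}(x \Rightarrow y) = \frac{0.6}{0.8} = 0.75\]

With an assumed minimum support of 0.6 and an assumed minimum confidence of 0.7, the rule x ⇒ y would be kept as a strong rule, since both thresholds are met.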

As shown in Figure 4, the MW-Apriori algorithm stores the database transactions in a Boolean matrix and uses frequency statistics to remove unqualified itemsets in advance, thereby constructing a new matrix. The matrix is then further compressed by finding different transactions that contain the same itemset and merging them with a weight. The remaining matrix elements are then ‘summed’ pairwise to obtain a matrix that meets the requirements, and finally the intersection of all the results is used to obtain the association rules between the different data sets.

Figure 4

The specific mining process of the MW-Apriori algorithm.
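The matrix-and-weight idea can be illustrated with a single-machine Python sketch. It is not the MW-Apriori implementation: the MapReduce partitioning is omitted, the pairwise combination of columns is interpreted here as a weighted logical AND (an assumption), only 2-itemsets are counted, and the function names and records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def weighted_boolean_matrix(transactions, min_support):
    """Return (items, rows, weights) for a compressed Boolean matrix."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    # Remove items that cannot be frequent before any join is attempted.
    items = sorted(i for i, c in counts.items() if c / n >= min_support)
    # One Boolean row per transaction, restricted to the surviving items.
    rows = [tuple(i in t for i in items) for t in transactions]
    # Identical rows are merged; the weight records how many were merged.
    merged = Counter(rows)
    return items, list(merged.keys()), list(merged.values())


def frequent_pairs(items, rows, weights, n, min_support):
    """Support of each item pair from a weighted AND of two columns."""
    result = {}
    for a, b in combinations(range(len(items)), 2):
        count = sum(w for row, w in zip(rows, weights) if row[a] and row[b])
        if count / n >= min_support:
            result[(items[a], items[b])] = count / n
    return result


# Hypothetical records; "gap_year" is too rare to survive the first filter.
records = [
    {"career_course", "internship", "employed"},
    {"career_course", "internship", "employed"},
    {"career_course", "employed"},
    {"internship", "gap_year"},
    {"career_course", "internship"},
]
items, rows, weights = weighted_boolean_matrix(records, min_support=0.6)
pairs = frequent_pairs(items, rows, weights, len(records), min_support=0.6)
```

Because duplicate rows are stored once with a weight, each support count touches fewer rows than the original database has transactions, which is the compression effect the text describes.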

IV. Analysis of the Effectiveness of Data Mining in the Management of Sustainable Education for Employment

The research mainly applied clustering and association rules from data mining to the processing of data related to employment education management. The SA-K-means algorithm was obtained by improving K-means for data clustering, and the improved Apriori method was used to uncover hidden relationships in the data to provide guidance on employment education. The performance of the SA-K-means algorithm was therefore analysed first: the standard Iris dataset was selected to test it and compare it with the K-means method before improvement. The Iris dataset, also known as the iris flower dataset, is a multivariate dataset commonly used in clustering experiments. It contains 150 data samples divided into three categories, with 50 samples in each category and four attributes per sample. Fifty runs were conducted for the K-means method before and after improvement, and the number of misclassified samples and the clustering results were recorded. The clustering results of the two algorithms are shown in Figure 5. The classical K-means algorithm produced two types of clustering result, with 37 runs giving the result in Figure 5(a) and 13 runs giving the result in Figure 5(b), whereas all 50 runs of the SA-K-means algorithm gave a single type of result, the one in Figure 5(a). The K-means method is therefore less stable, with two clustering outcomes, while the SA-K-means algorithm is relatively stable.

Figure 5

Clustering outcomes of K-means algorithm before and after improvement.

The misclassification rates of K-means before and after improvement are shown in Figure 6, where the horizontal axis is the experimental batch and the vertical axis is the error rate. The error rate of K-means ranges from 10% to 60% across the experimental batches, exceeding 50% in 13 instances and fluctuating widely. In contrast, the error rate of the optimised SA-K-means method was stable at around 10% and did not change with the experimental batch, making it more stable and more accurate than K-means. The effectiveness of the improved MW-Apriori algorithm was then verified.

Figure 6

Outcomes of misclassification rate of K-means algorithm before and after improvement.
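The text does not state how the misclassification rate was computed. One common approach, sketched here as an assumption, is to try every assignment of cluster ids to class labels and report the smallest resulting error; the function name and the sample labels are hypothetical.

```python
from itertools import permutations


def misclassification_rate(true_labels, cluster_labels):
    """Smallest error rate over all mappings of cluster ids to class labels.

    Assumes there are no more clusters than classes; otherwise no mapping
    is tried and the function returns 1.0.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    best = len(true_labels)
    # Exhaustive search over namings; feasible for a handful of classes.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        errors = sum(mapping[c] != t
                     for t, c in zip(true_labels, cluster_labels))
        best = min(best, errors)
    return best / len(true_labels)


# Hypothetical run: one of six samples lands in the wrong cluster.
truth = ["a", "a", "a", "b", "b", "b"]
found = [1, 1, 0, 0, 0, 0]
rate = misclassification_rate(truth, found)
```

For the three Iris classes this search covers only six mappings, so the exhaustive approach is cheap in that setting.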

To make the comparison more credible, the improved method was compared with the classical Apriori method and the CM-Apriori method, which is based on a clustering matrix. All three methods were tested and compared in the same experimental environment, as shown in Table 1. The Mushroom dataset was selected to test the performance of the three methods. The dataset contains 8,124 transactions and 119 items, and the maximum transaction length is 23. The experiment consists of two parts. The first part compares the running times of the three methods on the Mushroom dataset for different minimum support levels. To avoid memory overflow caused by support thresholds that are too small, the minimum support levels set in the study were 25%, 30%, 35%, 40%, 45%, 50% and 55%. The second part compares the running times of the three methods for different numbers of transactions from the Mushroom dataset, with a support threshold of 30% and the number of transactions increasing from 2,500 to 5,500.

Table 1

Experimental software and hardware configuration of three methods.


Configuration item	Details
Operating system	Windows 10
Development platform	IntelliJ IDEA
Memory	4 GB
Graphical tools	Matlab 2017b
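The first part of the timing protocol described above can be sketched as follows. This is a minimal illustration of the classical Apriori baseline only, not the MW-Apriori or CM-Apriori variants, and it runs on a toy transaction list rather than the Mushroom dataset. The names `apriori` and `time_support_sweep` are hypothetical and do not come from the study.

```python
import time
from itertools import combinations


def apriori(transactions, min_support):
    """Classical Apriori: return {itemset: count} for every itemset whose
    relative support is at least min_support (a fraction in (0, 1])."""
    tx = [frozenset(t) for t in transactions]
    min_count = min_support * len(tx)

    # Level 1: count individual items.
    counts = {}
    for t in tx:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Self-join: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count each surviving candidate with one scan per candidate.
        counts = {c: sum(1 for t in tx if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


def time_support_sweep(transactions, supports):
    """Run apriori once per threshold; return (support, n_itemsets, seconds)."""
    results = []
    for s in supports:
        start = time.perf_counter()
        found = apriori(transactions, s)
        results.append((s, len(found), time.perf_counter() - start))
    return results


if __name__ == "__main__":
    # Toy transactions, not the Mushroom dataset.
    toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    for support, n_itemsets, seconds in time_support_sweep(toy, [0.5, 0.75]):
        print(support, n_itemsets, f"{seconds:.6f}s")
```

The second part of the experiment would follow the same pattern with the threshold fixed at 0.30 and the input sliced to the first 2,500 to 5,500 transactions. Lower thresholds admit more candidates at every level, which is why running time is expected to grow as the minimum support falls.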

The running times of the three methods on the Mushroom dataset are shown in Figure 7. Figure 7(a) shows the running times of the three methods under different minimum support thresholds, and Figure 7(b) shows their running times as the number of transactions changes. In Figure 7(a), the running time of the classical Apriori method stays between 5s and 30s as the support level changes: it is longest, at about 28s, when the support level is 25%, and shortest, at roughly 8s, when the support level is 55%. The longest and shortest running times of the CM-Apriori algorithm are 20s and 5s respectively, while the MW-Apriori algorithm has a maximum running time of 5s and a minimum within 1s. In Figure 7(b), as the number of transactions changes, the maximum running times of the Apriori and CM-Apriori methods are 30s and 22s respectively, while that of the MW-Apriori algorithm is 10s. Compared with the other two methods, the upgraded MW-Apriori method therefore runs faster, with a greater improvement in both time and space efficiency.

Figure 7

Running time outcomes of three methods in Mushroom dataset.

Next, the SA-K-means and MW-Apriori methods were combined into a single method, named SAK-MWA, and applied to employment education management. The data mining performance of the combined SAK-MWA algorithm on employment education data was tested first. The selected employment education management system was developed in-house by a well-known university in China and holds a range of the university’s employment education data, including employment education content and employment courses. The mining results of the three methods on this database are shown in Figure 8, which compares the data mining accuracy of the K-means method, the Apriori method and the proposed SAK-MWA method in the employment education management system. The horizontal axis in Figure 8 gives the numbers of the databases, each made up of randomly selected data from the system. In Figure 8, the accuracy of the K-means method is concentrated around 92% but fluctuates widely and is relatively unstable, while the accuracy of the Apriori method is lower, mostly at 88% or below. The accuracy of the combined SAK-MWA algorithm stays at 96% or above, with less fluctuation and better stability.

Figure 8

Comparison of data mining outcomes of three algorithms in employment education management system.

Finally, the SAK-MWA algorithm was applied to the employment education management system to evaluate four aspects of employment education: employment knowledge mastery, employment practice ability, employment awareness and employment courses. The results were compared with those obtained before its use, as shown in Figure 9. The horizontal axis in Figure 9 represents the four aspects and the vertical axis gives the ratings. Before the SAK-MWA algorithm was applied, the ratings for students’ employment knowledge mastery, practical ability, employment awareness and employment courses were all low, with only the employment course rating coming close to 85. After the algorithm was applied, students’ employability and knowledge mastery improved significantly: the ratings exceeded 90 in all four categories, and employment practice ability scored 95 or more. This indicates that the method supports more efficient management of employment education and provides effective assistance to students, which can promote the sustainable development of employment education.

Figure 9

Comparison of the effectiveness of employment education management before and after using SAK-MWA algorithm.

V. Conclusion

Employment education is indispensable for universities seeking quality development and matters to the sustainable development of education. The study uses data mining techniques to process information related to employment education, based on an analysis of the characteristics of the employment education management system. First, the K-means algorithm is upgraded for cluster analysis of employment education data. Second, the Apriori algorithm is upgraded, and the two improved algorithms are combined and applied together in the employment education management system. The results show that the improved K-means algorithm is highly stable, producing only one clustering result across 50 tests, and that its misclassification rate stays near 10%. The optimised Apriori algorithm runs efficiently, with a shortest running time of no more than 1s under different minimum support thresholds and a longest running time of 10s for different numbers of transactions. When the two improved algorithms were applied to employment education management, students’ employment practice ability reached 95 points or more, and employment knowledge mastery, employment awareness and employment course ratings were all above 90 points, indicating that the method improves the efficiency of employment education management. However, the study did not optimise the self-join and pruning steps of the Apriori algorithm when improving it, so further work in this area is needed.

Data Science Journal
