Abnormal pattern prediction has received a great deal of attention from both academia and industry, with various applications (e.g., fraud, terrorism, intrusion detection, etc.). In practice, many abnormal pattern prediction problems are characterized by the simultaneous presence of skewed data, large amounts of unlabeled data and a dynamic, changing pattern. In this paper, we propose a methodology based on semi-supervised techniques and introduce a new metric – the Cluster-Score – for fraud detection which can deal with these practical challenges. Specifically, the methodology involves transmuting unsupervised models into supervised models using the Cluster-Score metric, which defines an objective boundary between clusters and evaluates the homogeneity of the abnormalities in the cluster construction. The objectives are to increase the number of fraudulent claims detected and to reduce the proportion of claims investigated that are, in fact, non-fraudulent. Applying our methodology considerably improved both objectives. The experiments were performed on a real-world data-set as part of building a fraud detection system.

Predicting abnormalities in environments with highly unbalanced samples and a huge mass of unlabeled data is receiving more attention as new technologies are developed (e.g., time-series monitoring, medical conditions, intrusion detection, detecting patterns in images, etc.). A typical example of such a situation is provided by fraud detection (

To represent this typical case we apply an innovative semi-supervised methodology to a real fraud case. Specifically, we draw on information provided by a leading insurance company as we seek to predict fraudulent insurance claims.

In the sector, the main services contracted are automobile and property insurance, representing 76% of total claim costs. However, while many studies have examined automobile fraud detection (see, for example, Artís et al., 1999 and 2002;

In addition, private companies rarely share real fraud datasets, keeping this information private so as not to reveal competitive details. Very few studies have therefore been implemented as fraud systems in insurance companies (few examples are

Our main objective is therefore to present a variety of semi-supervised machine learning models applied to an insurance fraud detection problem. In so doing, we aim to develop a methodology capable of improving results in classification anomaly problems of this type. The key is to avoid making assumptions about the unknown fraud cases when resolving recurring practical problems (skewed data, unlabeled data, dynamic and changing patterns), since such assumptions can bias results.

Our reasoning for using semi-supervised models is best explained as follows. First, as pointed out by Phua, et al. (

Second, supervised models are inappropriate because, in general, we face a major problem of claim misclassifications when dealing with fraud detection (Artís et al., 2002) which could generate a substantial mass of

Finally, when fraud investigators analyze claims, they base their analysis on a small suspicious subset from previous experience and tend to compare cases to what they consider to be “normal” transactions. As data volume and the velocity of operative processes increase exponentially, human analysis becomes poorly adapted to

Clearly, the information provided in relation to cases considered suspicious is more likely to be specified correctly once we have passed the first stage in the fraud detection process. This information will be useful for a part of the distribution (i.e., it will reveal if a fraudulent claim has been submitted), which is why it is very important this information be taken into account. For this reason, fraud detection in insurance claims can be considered a semi-supervised problem because the ground truth labeling of the data is partially known. Not many studies have used hybrids of supervised/unsupervised models. Williams and Huang (

Other semi-supervised models use normal observable data to define abnormal behavioral patterns: Aleskerov et al. (

We therefore seek to make three contributions to the literature: First, we apply semi-supervised techniques to an anomaly detection problem while trying to solve three combined problems: skewed data, unlabeled data and changing patterns,

We use an insurance fraud data-set provided by a leading insurance company in Spain, initially for the period 2015–2016. After sanitization, our main sample consists of 303,166 property claims, some of which have been analyzed as possible cases of fraud by the Investigation Office (IO).

Of the cases analyzed by the IO, 48% proved to be fraudulent. A total of 2,641 cases were resolved as true positives (0.8% of total claims) during the period under study. This means we do not know which class the remaining 99.2% of cases belong to. However, the fraud cases detected provide very powerful information, as they reveal the way in which fraudulent claims behave. Essentially, they serve as the pivotal cluster for separating normal from abnormal data.

A data lake was constructed during the process to generate sanitized data. A data lake is a repository of stored raw data, which includes structured and unstructured data in addition to transformed data used to perform tasks such as visualization and analysis. From the data lake, we obtain 20 bottles containing different types of information related to claims. A bottle is a subset of transformed data that comes from an extract-transform-load (ETL) process preparing data for analysis. These bottles contain variables derived from the company’s daily operations, transformed in several ways. In total we have almost 1,300 variables. We briefly present them in Table

The 20 Data Bottles and their descriptions extracted from a Data Lake created for this particular case study.

Bottles | Descriptions |
---|---|
ID | IDs for claims, policies, persons, etc. |
CUSTOMER | Policyholder’s attributes embodied in insurance policies: name, sex, age, address, etc. |
CUSTOMER_PROPERTY | Customer data related to the property. |
DATES | Dates of claims, policies, visits, etc. |
GUARANTEES | Coverage and guarantees of the subscribed policy. |
ASSISTANCE | Call-center claim assistance. |
PROPERTY | Data related to the insured object. |
PAYMENTS | Policy payments made by the insured. |
POLICY | Policy contract data, including changes, duration, etc. |
LOSS ADJUSTER | Information about the investigation process and about the loss adjuster. |
CLAIM | Brief, partial information about the claim, including date and location. |
INTERMEDIARY | Information about the policies’ intermediaries. |
CUSTOMER_OBJECT_RESERVE | The coverage and guarantees involved in the claim. |
HISTORICAL_CLAIM | Historical movements associated with the reference claim. |
HISTORICAL_POLICY | Historical movements associated with the reference policy (the policy involved in the claim). |
HISTORICAL_OTHER_POLICIES | Historical movements of any other policy (property or otherwise) related to the reference policy. |
HISTORICAL_OTHER_CLAIM | Historical claims associated with the reference policy (excluding the claim analyzed). |
HISTORICAL_OTHER_POL_CLAIM | Other claims associated with policies other than the reference policy (but related to the customer). |
BLACK_LIST | Every participant involved in a fraudulent claim (insured, loss adjuster, intermediary, other professionals, etc.). |
CROSS VARIABLES | Several variables constructed from interactions between the bottles. |

If we have labeled data, the easiest way to proceed is to separate regular from outlier observations by employing a supervised algorithm. However, in the case of fraud, this implies that we know everything about the two classes of observation, i.e., we would know exactly who did and did not commit fraud, a situation that is extremely rare. In contrast, if we know nothing about the labeling, that is, we do not know who did and did not commit fraud, several unsupervised methods of outlier detection can be employed, e.g., isolation forest (

If, however, we have some label data about each class, we can implement a semi-supervised algorithm, such as label propagation (

In the light of these issues, we propose a semi-supervised technique that can handle not only a skewed data-set but also one for which we have no information about certain classes. In this regard, fraud detection represents an outlier problem for which we can usually identify some, but not all, of the cases. We might, for example, have information about false positives, i.e., investigated cases that proved not to be fraudulent. However, the very fact that they raised suspicions means they cannot be considered representative of non-fraudulent cases. In short, what we usually have are some cases of fraud and a large volume of unknown cases (among which it is highly likely that cases of fraud are lurking).

Bearing this in mind, we propose the application of unsupervised models so as to relabel the target variable. To do this, we use a new metric that measures how well we approximate the minority class. We can then transform the model to a semi-supervised algorithm. On completion of the relabeling process, our problem can be simplified to a supervised model. This allows us not only to set an objective boundary but to obtain a gain in accuracy when using partial information, as Trivedi et al. (

We start with a data-set of 303,166 cases. The original data was collected for business purposes, so considerable time was devoted to sanitizing the data-set. It is important to note that we set aside a 10% random subset for final evaluation. Hence, our working data-set consists of 270,479 non-identified cases and 2,370 cases of fraud.

The main problem we face in this unsupervised setting is having to define a subjective boundary. We have partial information about fraud cases, but we have to determine an acceptable threshold at which an unknown case can be considered fraudulent. When fitting unsupervised classification models, we group the observations into clusters. Almost every algorithm will return several clusters containing mixed-type data (fraud and unknown). Intuitively, we would want the revealed fraud points to be highly concentrated in just a few clusters. Likewise, we would expect some non-revealed cases to be included with them, as in Figure

Possible clusters.

A boundary line might easily be drawn so that we accept only cases of detected fraud or we accept every possible case as fraudulent. Yet, we know this to be unrealistic. If we seek to operate between these two extremes, intuition tells us that we need to stay closer to the lower threshold, accepting only cases of fraud and very few more, as Figure

Schematic representation of the desired threshold which is expected to split high fraud probability cases from low fraud probability cases.

But once more, we do not know exactly what the correct limit is. To address this, we have created an experimental metric that can help us assign a score and, subsequently, define the threshold. This metric, which we shall refer to as the cluster score (CS), calculates the weighted homogeneity of clusters based on the minority and majority classes.

Essentially, it assigns a score to both the minority-class (C1) and the majority-class (C2) clusters based on the weighted conditional probability of each point. The CS expression clearly resembles the well-known F-Score,

Moreover the

Suppose an unsupervised model generates J clusters, {C^1, C^2, …, C^J}, where C^j denotes the j-th cluster.

The C1 score calculates the probability that a revealed (i.e., confirmed) fraud case belongs to cluster C^j, weighting each cluster by its number of revealed fraud cases, n^j_fraud.

Basically, we calculate the fraction of fraud cases in each cluster

Our objective is to maximize C1. This means ensuring all revealed fraud cases are in the same clusters. The limit C1 = 1 implies that all J clusters only contain revealed fraud cases. Therefore, we have to balance this function with another function.

C2 is the counterpart of C1. The C2 score calculates the probability that an “unknown” case belongs to cluster C^j, weighting each cluster by its number of unknown cases, n^j_unknown.

Notice that

Individually maximizing C1 and C2 leaves us in an unwanted situation: the two objectives pull the clustering in opposite directions, so that maximizing one minimizes the other. Maximizing both together results in a trade-off between C1 and C2, a trade-off in which we can choose the balance. Moreover, as pointed out above, we actually want to maximize C1 subject to C2. Consequently, the fraud score is constructed as follows:

If

In conclusion, with this CS we have an objective parameter to tune the unsupervised model because it permits us to homogeneously evaluate not only different algorithms but also their parameters. While it is true that there exists a variety of internal validation indices, this metric differs in that it can also exploit information about the revealed fraud cases. That is, we take advantage of the sample that is labeled fraud to choose the best algorithm, something that internal validation indices are not able to accomplish. The only decision that remains for us is to determine the relevance of

We should stress that each time we retrieve more information about the one-class cases that have been revealed, this threshold improves. This is precisely where the entropy process of machine learning appears. As fraud is a dynamic process that changes patterns over time, with this approach the algorithm is capable of adapting to those changes. In the one-class fraud problem discussed above, we start with an unknown distribution for which some data points are known (i.e., the fraud sample). Our algorithms, using the proposed CS metric, will gradually get closer to the best model that can fit these cases of fraud, while maintaining a margin for undiscovered cases. If we then obtain new information about fraud cases, our algorithms will readjust to provide the maximum CS again. As the algorithms work with notions based on density and distances, they change their shapes to regularize this new information.
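For concreteness, the computation of C1, C2 and CS described above can be sketched in Python. The exact weighting scheme is not fully recoverable from the text, so this is one plausible reading (per-cluster homogeneities weighted by class counts, combined in an F-score-like fashion); the function name and signature are illustrative only.

```python
import numpy as np

def cluster_score(labels, is_fraud, beta=1.0):
    """Hypothetical sketch of C1, C2 and the Cluster-Score (CS).

    C1: weighted homogeneity of clusters w.r.t. revealed fraud cases.
    C2: counterpart for unknown cases.
    CS: F-score-like combination of the two (assumed form)."""
    labels = np.asarray(labels)
    is_fraud = np.asarray(is_fraud, dtype=bool)
    n_fraud = is_fraud.sum()
    n_unknown = (~is_fraud).sum()
    c1 = c2 = 0.0
    for j in np.unique(labels):
        in_j = labels == j
        f_j = (is_fraud & in_j).sum()   # revealed fraud cases in cluster j
        u_j = in_j.sum() - f_j          # unknown cases in cluster j
        c1 += (f_j / n_fraud) * (f_j / in_j.sum())
        c2 += (u_j / n_unknown) * (u_j / in_j.sum())
    cs = (1 + beta**2) * c1 * c2 / (beta**2 * c1 + c2)
    return c1, c2, cs
```

With perfectly separated clusters (one all fraud, one all unknown), C1, C2 and CS all reach their maximum of 1.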

Once the best unsupervised model is attained (i.e., the model that reaches the maximum CS), we need to decide what to do with the clusters generated. Basically, we need to determine which clusters comprise fraudulent and which comprise non-fraudulent cases. The difficulty is that several clusters will be of mixed type, containing both minority-class points (fraud cases) and unidentified cases, as in Figure

Cluster Example Output.

In defining a threshold for a fraud case, we make our strongest assumption: if a cluster is made up of more than 50% revealed fraud cases, this cluster is a fraud cluster.

As Figure
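The majority-rule relabeling just described can be sketched as follows; `relabel_clusters` is a hypothetical helper, not the production implementation.

```python
import numpy as np

def relabel_clusters(labels, is_fraud, threshold=0.5):
    """Relabeling rule from the text: a cluster made up of more than 50%
    revealed fraud cases is treated as a fraud cluster; every case in it
    (known fraud or unknown) becomes 1, and all remaining cases 0."""
    labels = np.asarray(labels)
    is_fraud = np.asarray(is_fraud, dtype=bool)
    new_target = np.zeros(len(labels), dtype=int)
    for j in np.unique(labels):
        in_j = labels == j
        if is_fraud[in_j].mean() > threshold:  # share of revealed fraud
            new_target[in_j] = 1
    new_target[is_fraud] = 1                   # known fraud stays labeled 1
    return new_target
```

For example, a cluster with two revealed fraud cases out of three members crosses the 50% threshold, so its unknown member is relabeled 1 as well.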

As mentioned, before applying the unsupervised algorithm we had to make a huge effort to sanitize the original data, since it was collected for business purposes. This included handling categorical data, transforming variables, correcting bad imputations, filtering, etc., at the level of each bottle. Finally, we transformed the 20 bottles to the claim level and put them together in a single table, which formed our model’s input.

After that, before using this data as input, we made some important transformations. First, we filled the missing values given that many models are unable to work with them. There are simple ways to solve this, like using the mean or the median value of the distribution. Since we did not want to modify the original distribution, we implemented a multi-output Random Forest regressor (

We iterated this process over every column that had missing values (0.058% of the total values were missing). We also measured the performance of this technique using the R-squared, which is based on the residual sum of squares; our R-squared was 89%.

Second, we normalized the data so that we could later apply a Principal Component Analysis (PCA), and also because many machine-learning algorithms are sensitive to scale effects. Those using Euclidean distance are particularly sensitive to high variation in the magnitudes of the features. In this case, we used a robust scale approach
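A minimal sketch of this robust scaling, centering by the median and dividing by the spread between the 90th and 10th percentiles (the formulation given in the footnotes); the helper name is ours.

```python
import numpy as np

def robust_scale(x):
    """Robust scaling: (x - median(x)) / (q90(x) - q10(x)).
    Less sensitive to outliers than standardizing by mean and std."""
    x = np.asarray(x, dtype=float)
    q10, q90 = np.percentile(x, [10, 90])
    return (x - np.median(x)) / (q90 - q10)
```

Applied to the integers 0..10, the median value maps to 0 and the 90th percentile maps to 0.5.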

Third, we applied Principal Component Analysis to resolve the high-dimensionality problem (we had almost 1,300 variables). This method reduces confusion in the algorithms and solves any possible collinearity problems. PCA decomposes the data-set into a set of successive orthogonal components that explain a maximum amount of the data-set’s variance. When setting the variance threshold, a trade-off is made between over-fitting and capturing the variation in the data-set. We chose a threshold of 95% (the recommended threshold is between 95% and 99%), which resulted in 324 components. After these transformations, the unsupervised algorithm can be summarized as seen in Algorithm 1.

Unsupervised algorithm

1 | Define the set of candidate unsupervised models {M_1, M_2, …} |
2 | For each model M_k and each parameter configuration i: |
3 | We fit the model |
4 | We get the clusters {C^1, C^2, …, C^J} |
5 | For model M_k with parameters i, compute the scores C1_{k,i} and C2_{k,i} |
6 | Save the cluster score result CS_{k,i}; keep track of the best CS_{K,I} found so far |
7 | End for |
8 | |
9 | Choose the optimal CS* where CS* = max CS_{K,I} |
10 | Relabel the fraud variable using the optimal clustering model derived from CS*. Each unknown case in a fraud cluster is now equal to 1, known fraud cases are equal to 1 and remaining cases are equal to 0. |

The main reason for using PCA is that it has low noise sensitivity, as it ignores small variations in the background (on a maximum-variation basis). While it is true that there are several non-linear formulations for dimensionality reduction that may get better results, some studies have actually found that non-linear techniques are often not capable of outperforming PCA. For instance, Van Der Maaten et al. (
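The dimensionality-reduction step can be sketched with a plain-NumPy PCA that keeps the fewest components reaching the 95% variance threshold; `pca_95` is an illustrative helper, not the code used in the study.

```python
import numpy as np

def pca_95(X, threshold=0.95):
    """Minimal PCA sketch: center the data, take the SVD, and keep the
    fewest components whose cumulative explained variance reaches the
    chosen threshold (95% here, as in the text)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / (S**2).sum()
    k = int(np.searchsorted(np.cumsum(explained), threshold)) + 1
    return Xc @ Vt[:k].T, k
```

On data whose variance lies entirely along one direction, a single component already exceeds the threshold.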

We now have a redefined target variable that we can continue working with by applying an easy-to-handle supervised model. The first step involves re-sampling the fraud class to avoid unbalanced-sample problems. Omitting this step means that our model could be affected by the distribution of classes, the reason being that classifiers are in general more prone to detecting the majority class than the minority class. We therefore oversample the data-set to obtain a 50/50 balanced sample. We use two oversampling methods, Adaptive Synthetic Sampling Approach (ADASYN) by He et al. (
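For illustration, a much-simplified SMOTE-style interpolation sketch (not the ADASYN implementation actually used, which comes from established libraries): new minority samples are interpolations between a minority point and one of its nearest minority neighbours.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=np.random.default_rng(0)):
    """Toy SMOTE-style oversampling sketch (illustrative only).
    Each synthetic point lies on the segment between a random minority
    point and one of its k nearest minority neighbours."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]      # nearest minority neighbours
        j = rng.choice(nbrs)
        lam = rng.random()                 # interpolation weight in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

ADASYN refines this idea by generating more synthetic points near minority examples that are harder to learn.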

The second step involves conducting a grid search and a Stratified 5-fold cross-validation (CV) based on the F-Score

We have to be careful not to over-fit the model during the cross-validation process, particularly when using oversampling methods. Steps one and two therefore have to be executed simultaneously. Oversampling before cross-validating would generate synthetic samples that are based on the total data-set. Consequently, for each
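The leakage-safe ordering made here — split first, then oversample only the training fold — can be sketched as follows; both helper names are ours, and the oversampler and model are passed in as plain callables.

```python
import numpy as np

def stratified_folds(y, k=5, seed=0):
    """Tiny stratified k-fold sketch: the indices of each class are
    shuffled and dealt round-robin into k folds, preserving class ratios."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return [np.array(f) for f in folds]

def cv_with_oversampling(X, y, fit_fn, oversample_fn, k=5):
    """Leakage-safe loop: split FIRST, then oversample ONLY the training
    part of each fold, so synthetic points never reach the test fold."""
    folds = stratified_folds(y, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_tr, y_tr = oversample_fn(X[train_idx], y[train_idx])
        model = fit_fn(X_tr, y_tr)              # returns a scoring callable
        scores.append(model(X[test_idx], y[test_idx]))
    return scores
```

Each fold keeps the original 5:1 class ratio of the toy data below, while the oversampler only ever sees training rows.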

Additionally, we combine the supervised models by stacking. Stacking combines different classifiers applied to the same data-set, producing different predictions that can be “stacked” up into one final prediction model. The idea is very similar to k-fold cross-validation, dividing the training set into several subsets or folds. For all
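The stacking idea — out-of-fold predictions of the base classifiers becoming the features of a final meta-model — can be sketched as below; `stack_oof` is an illustrative helper and the base learners are passed in as callables.

```python
import numpy as np

def stack_oof(X, y, base_fits, k=5, seed=0):
    """Stacking sketch: each base learner's out-of-fold predictions become
    one column of the meta-model's training matrix Z (k-fold, as described).
    `base_fits` are callables fit(X_train, y_train) -> predict(X_test)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    Z = np.zeros((len(y), len(base_fits)))
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        for m, fit in enumerate(base_fits):
            predict = fit(X[train], y[train])
            Z[test, m] = predict(X[test])
    return Z   # the meta-model is then trained on (Z, y)
```

Because every prediction in Z is made on data the base learner did not train on, the meta-model is not fed leaked in-sample predictions.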

Once we have the optimal parameters for each model, we calculate the optimal threshold that defines the probability of a case being fraudulent or non-fraudulent, respectively.
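A sketch of this threshold search; the original selection criterion is not stated in detail, so here we assume the threshold that maximizes the F-score on held-out predicted probabilities.

```python
import numpy as np

def best_threshold(p, y, beta=1.0):
    """Hypothetical sketch: scan the candidate cut-offs (the observed
    probabilities) and return the one maximizing the F-score."""
    best_t, best_f = 0.5, -1.0
    for t in np.unique(p):
        pred = p >= t
        tp = (pred & (y == 1)).sum()
        prec = tp / max(pred.sum(), 1)
        rec = tp / max((y == 1).sum(), 1)
        f = ((1 + beta**2) * prec * rec / (beta**2 * prec + rec)
             if prec + rec else 0.0)
        if f > best_f:
            best_t, best_f = t, f
    return best_t
```

On a toy set where probabilities separate the classes perfectly, the search lands on the lowest cut-off that yields both precision and recall of 1.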

Finally, we identify the two models that perform best on the data-set – the best acting as our main model implementation, the other checking that the predicted claims are generally consistent. The algorithm can be summarized as seen in Algorithm 2.

Supervised algorithm.

1 | Define the candidate supervised models M_i |
2 | For each stratified fold k, split the data into train_k and test_k |
3 | We apply PCA to fold train_k and save the weights/parameters |
4 | Oversample train_k to obtain a balanced train_k |
5 | Grid-search the parameters of each model M_i on train_k |
6 | Fit the model M_i on train_k with its optimal parameters |
7 | Transform test_k with the weights/parameters saved from train_k and predict with M_i |
8 | Save the probabilities predicted by M_i for fold k and the scores of M_i |
9 | End for |
10 | For each M_i, calculate the optimal threshold t_i to consider a case as fraudulent |
11 | Evaluate each M_i at its threshold t_i and save its scores |
12 | Select the best-performing models M_i |
13 | Using the out-of-fold predictions of the models M_i, fit the stacking models |
14 | Save the stacked predictions p_{i,t} and the scores of each stacked model M_i |
15 | |
16 | We get the two best models M_i (base and control) |

Table

Unsupervised model results.

Model | n Clusters | C1 | C2 | CS |
---|---|---|---|---|
Mini-Batch K-Means | 4 | 96.6% | 96.6% | 96.6% |
Isolation Forest | 2 | 51.5% | 51.1% | 51.4% |
DBSCAN | 2 | 50.2% | 49.8% | 50.1% |
Gaussian Mixture | 5 | 95.0% | 95.0% | 96.3% |
Bayesian Mixture | 6 | 96.5% | 96.4% | 96.5% |

C1 indicates that the minority-class (fraud) clusters comprise approximately 96.59% minority data points on a weighted average; C2 indicates that the majority-class clusters are likewise made up of 96.59% unknown cases. As can be seen in Table

Oversampled Unsupervised Mini-Batch K-Means.

Clusters | Fraud | Percentage |
---|---|---|
0 | 0 | 2% |
0 | 1 | 98% |
1 | 0 | 99% |
1 | 1 | 1% |
2 | 0 | 100% |
2 | 1 | 0% |
3 | 0 | 1% |
3 | 1 | 99% |

After relabeling the target variable (with the Mini-Batch K-Means output), we calculate the supervised models’ performance using Stratified 5-Fold CV on the data-set. The results of each of the supervised models and of the stacking models are shown in Table

Supervised model results.

Model | Cluster Recall | Original Recall | Precision | F-Score |
---|---|---|---|---|
ERT-ss | 0.9734 | 0.9840 | 0.6718 | 0.8932 |
ERT-os | 0.9647 | 0.9819 | 0.6937 | 0.8948 |
GB | 0.9092 | 0.9376 | 0.6350 | 0.8369 |
LXGB | 0.8901 | 0.9249 | 0.7484 | 0.8576 |
Stacked-ERT | 0.8901 | 0.9283 | 0.7524 | 0.8587 |
Stacked-GB | 0.8947 | 0.9287 | 0.7630 | 0.8649 |
Stacked-LXGB | 0.9180 | 0.9464 | 0.6825 | 0.8588 |

As can be appreciated, we have two recall values. The cluster recall is the metric derived when using the relabeled target variable. The original recall emerges when we recover the prior labeling (1 if it was fraud, 0 otherwise). As can be seen, the results are strikingly consistent. We are able to predict the fraud clusters with a recall of 89–97% in every case. More impressively still, we can capture the original fraud cases with a recall close to 98%. The precision is slightly lower, but in almost all cases it is higher than 67%. These are particularly good results for a problem that began as an unsupervised, high-dimensional problem with an extremely unbalanced data-set.

The two best models are both extreme randomized trees: the first uses balanced subsampling (ERT-ss), i.e., for every random sample used during the iteration of the trees, the sample is balanced by using weights inversely proportional to class frequencies, and serves here as our base model; the second uses an ADASYN oversampling method (ERT-os) and serves as our control model.

At the outset, we randomly set aside 10% of the data (30,317 claims). In this final step, we want to go further and examine these initial claims as test data. Our results are shown in Table

Model Robustness Check.

Original Value | Prediction | Cases |
---|---|---|
Non-Investigated | Non-Fraud | 29,631 |
Fraud | Non-Fraud | 0 |
Non-Investigated | Fraud | 415 |
Fraud | Fraud | 271 |
(a) ERT-ss Robustness Check | | |
Non-Investigated | Non-Fraud | 29,656 |
Fraud | Non-Fraud | 8 |
Non-Investigated | Fraud | 390 |
Fraud | Fraud | 263 |
(b) ERT-os Robustness Check | | |

As can be appreciated, the control model (Table

The IO investigated 367 cases (at the intersection of the ERT-ss and ERT-os predictions). Two fraud investigators analyzed each of these cases, none of which they had seen previously, as the rule-based model had not detected them.

Of these 367 cases, 333 were found to present a very high probability of being fraudulent; only 34 could be ruled out as not being fraudulent. Adding these 333 cases to the 271 already confirmed, 604 of the 686 claims predicted as fraudulent presented indications of fraud, giving a precision of 88%. In short, we managed to increase the efficiency of fraud detection by 122.8%. These final outcomes are summarized in Table

Base Model Final Results.

Original Value | Prediction | Cases |
---|---|---|
Non-Investigated | Non-Fraud | 29,631 |
Fraud | Non-Fraud | 0 |
Non-Fraud | Fraud | (415 – 333) = 82 |
Fraud | Fraud | (271 + 333) = 604 |

One of the challenges in fraud detection is that it is a dynamic process whose patterns can change over time. A year later, we retest the model with new data: we now have 519,921 claims to evaluate. We again start out with a similar proportion of fraud cases (0.88%), but we are now able to train with 4,623 fraud cases to further improve results.

First, we recalculate the unsupervised algorithm, obtaining a Cluster-Score of 96.89%. As can be seen in Table

Oversampled Unsupervised Mini-Batch K-Means.

Clusters | Fraud | Percentage |
---|---|---|
0 | 0 | 99.4% |
0 | 1 | 0.6% |
1 | 0 | 0.7% |
1 | 1 | 99.3% |
2 | 0 | 2.6% |
2 | 1 | 97.4% |

Using the Extreme Randomized Trees subsampled approach (ERT-ss), the Extreme Randomized Trees oversampled with ADASYN (ERT-os), and the Stratified 5-fold cross-validation approach, we retrain the model. Table

Base Model with the machine-learning process applied.

Period | Jan 15–Jan 17 | Jan 15–Jan 18 |
---|---|---|
Claims | 303,166 | 519,921 |
Observed Fraud | 2,641 | 4,623 |
Cluster Score | 96.59% | 96.89% |
Recall Score ERT-ss | 97.34% | 96.31% |
Precision Score ERT-ss | 67.18% | 89.35% |
F-Score ERT-ss | 89.32% | 94.84% |
Recall Score ERT-os | 96.47% | 96.44% |
Precision Score ERT-os | 69.37% | 92.18% |
F-Score ERT-os | 89.48% | 95.56% |

The retrained base model greatly improves the homogeneity of the fraud and non-fraud clusters. In particular, it provides a relative gain of 33% in the precision score and of 6.2–6.8% in the F-Score.

This paper has sought to offer a solution to the problems that arise when working with highly unbalanced data-sets for which the labeling of the majority of cases is unknown. In such cases, we may have only a few small samples that contain highly valuable information. Here, we have presented a fraud detection case, drawing on data provided by a leading insurance company, and have tested a new methodology based on semi-supervised fundamentals to predict fraudulent property claims.

At the outset, the Investigation Office (IO) investigated only around 7,000 cases (from a total of 303,166). Of these, only 2,641 were actually true positives (0.8% of total claims), with a success rate of 48%. Thanks to the methodology devised herein, which continuously readapts to dynamic and changing patterns, we can now screen the whole spectrum of cases automatically, obtaining a total recall of 96% and a precision of 89–92%. In spite of the complexity of the initial problem, where the challenge was to detect fraud dynamically without knowing anything about 99.2% of the sample, the methodology described has been shown to be capable of solving the problem with great success.

The additional file for this article can be found as follows:

Practical Example. DOI:

The study is part of the development of a fraud detection system that was implemented in 2018.

The system applied previously to detect fraud corresponds to a rule-based methodology.

F-Score is defined as

We use the formulation (x − x_median) / (x_90 − x_10), where x_90 and x_10 denote the 90th and 10th percentiles.

The F-Score was constructed using

The author would like to thank Cristina Rata and Joan-Ramon Borrell for constructive criticism of the manuscript.

The author has no competing interests to declare.