1 Introduction

Reducing student dropout rates is one of the major challenges facing the education sector globally. The problem has become a major concern for the education and policy-making communities (). A growing body of literature indicates that dropout rates are especially high in the developing world, with higher rates for girls than for boys in most parts of the world (). In Tanzania, for example, student dropout is higher in lower secondary education than at higher levels, and girls are much less likely than boys to finish secondary education: 30% of girls drop out before reaching Form 4, compared with 15% of boys (). The situation is different in primary education, where boys tend to drop out more often than girls. Moreover, to the best of the researchers’ knowledge, developing countries lack sufficient research on addressing this problem in higher-level education. Finding and implementing solutions to this problem has implications well beyond the benefits to individual students: enabling students to complete their education means investing in future progress and better standards of living, with multiplier effects. To address the problem effectively, it is crucial to ensure that all students finish school on time through early intervention for students who might be at risk of dropping out. This requires data-driven predictive techniques that can help identify at-risk students and support the timely planning of interventions ().

Machine learning approaches are among the most sought-after solutions to the school dropout challenge. Various studies on student prediction algorithms have been conducted in developed countries (; ; ). Moreover, there exists a significant body of literature on machine learning based approaches to fighting dropout (; ; ). The knowledge embodied in this literature has the potential to transform the fight against dropout from reactive to proactive. This is more feasible now than ever because Information and Communication Technologies (ICTs) have already transformed the way data are collected and managed, which is a key ingredient in any intelligent harnessing of useful patterns in recorded events. Despite the efforts of previous researchers, several challenges still need to be addressed. Most of the widely used datasets are generated in developed countries, whereas developing countries face several challenges in generating public datasets for addressing this problem. The cost and time involved make the data collection process very difficult. The study conducted by Mgala () used primary education data collected in Kenya, although the dataset is not publicly available. In contrast, the Uwezo data on learning is a publicly available dataset that was collected countrywide for primary schools in Tanzania. The dataset focuses on individual household data, including education.

In developing countries, the prospects of a dropout-free education system are still slim considering the scale of the socio-economic challenges that are deemed central to the retention of students in schools. Increasingly, communities of practitioners and researchers are looking at machine learning approaches as a likely solution for achieving dropout-free schools. Student dropout has been a serious problem that adversely affects the development of the education sector, owing to a complex interplay of socio-cultural, economic and structural factors (). Schooling, according to human capital theory, is an investment that generates higher future income for individuals (). Many developing countries are experiencing high dropout rates among secondary school students, a major challenge that is regarded as a problem for both the individual and society (). However, less attention is paid to improving the quality of education for people of all classes. In this regard, a UNESCO () report points out that about 130 million children in the developing world are denied their right to education through dropping out ().

In response to dropout and other challenges facing secondary schools, Tanzania, as one of the developing countries, introduced an Education Training Policy (ETP) and an Education Sector Development Plan (ESDP) (). These were established to focus on access, quality improvement, capacity development and direct funding to secondary schools. The combined effort was expected to improve the overall status of secondary education, but the problem is still far from over.

Therefore, this article presents a survey of how machine learning techniques have been used in the fight against dropout. The purpose of the survey is to provide a stepping stone for students, researchers and developers who aspire to apply these techniques. Key intervention points identified during our preliminary survey guided the survey presented here. These intervention points included issues related to algorithms for predicting dropout.

2 Method of study

This paper surveys the literature in academic journals, books, and case studies. The objective is to collect, organize, and synthesize existing knowledge on machine learning approaches to student dropout prediction. The surveyed papers cover work on machine learning in education, such as student dropout prediction, student academic performance prediction, and student final result prediction. The findings of these studies are very useful for understanding the problem and improving measures to address it. We searched several databases, including ResearchGate, Elsevier, Association for Computing Machinery (ACM), Science Direct, Springer Link, IEEE Xplore, and other computer science journals. As search phrases and keywords we used “predicting student dropout”, “predicting student dropout using machine learning techniques”, “application of machine learning in education” and “student dropout prediction using machine learning techniques”. We examined each article’s reference list to identify any potentially relevant research or journal title. The publication period taken into consideration is 2013 to 2017. The types of text searched were PDFs, documents and full-length papers with abstracts and keywords. Furthermore, the items searched included journal articles, conference papers, workshop papers, topic-related blogs, expert lectures or talks, and other topic-related communities such as the educational machine learning community. A substantial subset of the retrieved articles was judged to warrant inclusion in this study.

3 Machine learning in education

Over the past two decades, there have been significant advances in the field of machine learning. The field has emerged as the method of choice for developing practical software for computer vision, speech recognition, natural language processing, robot control, and other applications (). There are several areas where machine learning can positively impact education. A study conducted by the Center for Digital Technology and Management () reported growth in the use of machine learning in education, driven by the rise in the amount of education data available through digitization. Various schools have started to create personalized learning experiences through the use of technology in classrooms. Furthermore, massive open online courses (MOOCs) have attracted millions of learners and present an opportunity to apply and develop machine learning methods for improving student learning outcomes and leveraging the data collected ().

Owing to the growth in the amount of data collected, machine learning techniques have been applied to improve educational quality in areas including learning and content analytics (; ), knowledge tracing (), learning material enhancement () and early warning systems (; ; ). The use of these techniques for educational purposes is a promising field aimed at developing methods for exploring data from computational educational settings and discovering meaningful patterns ().

One of the first applications of machine learning in education was helping quizzes and tests move from multiple-choice to fill-in-the-blank answers. The evaluation of students’ free-form answers was based on Natural Language Processing (NLP) and machine learning. Various studies on the efficacy of automated scoring show better results than human graders in some cases. Furthermore, automated scoring provides more immediate results than a human grader, which makes it useful in formative assessment.

In recent years, prediction has emerged as an application of machine learning in education. Research conducted by Kotsiantis () presented a novel case study describing the emerging field of educational machine learning. In this study, students’ key demographic characteristics and grading data were explored as the dataset for a machine learning regression method that was used to predict a student’s future performance. In a similar vein, several projects have been conducted, including one that aims to develop a prediction model that educators, schools, and policy makers can use to predict a student’s risk of dropping out of school. Springboarding from these examples, IBM’s Chalapathy Neti shared IBM’s vision of Smart Classrooms using cloud-based learning systems that can help teachers identify the students who are most at risk of dropping out, observe why they are struggling, and gain insight into the interventions needed to overcome their learning challenges.

Nevertheless, the application of machine learning in education still faces several challenges that need to be addressed. There is a lack of open-access datasets, especially in developing countries; more datasets need to be developed, although doing so comes at a cost. Apart from that, several researchers ignore the fact that evaluation procedures and metrics should be relevant to school administrators. According to Lakkaraju et al. (), the evaluation process should be designed to cater to the needs of educators rather than focusing only on commonly used machine learning metrics. The same study also reveals that many studies focus only on providing early prediction, whereas a more robust and comprehensive early warning system should be capable of identifying students at risk in future cohorts, ranking students according to their probability of dropping out, and identifying students who are at risk even before they drop out. Therefore, developing countries need to focus on facilitating more robust and comprehensive early warning systems for student dropout. There is also a need to focus on school-level datasets rather than only on student-level datasets, because school districts often have limited resources for assisting students and the availability of these resources varies with time. Identifying at-risk schools will therefore help the authorities plan resource allocation before the risk materializes.

Furthermore, in the context of education, data imbalance is a very common classification problem in the field of student retention, mainly because the number of registered students is large compared with the number of dropout students (). According to Gao (), the imbalance ratio is at least about 1:10. Moreover, although the minority class usually represents the most important concept to be learned, it is difficult to identify because it consists of exceptional but significant cases (). Since accuracy, a widely used metric, is influenced less by the minority class than by the majority class (; ), several researchers have applied other metrics to the problem of student dropout, such as F-measure (; ; ), Mean Absolute Error (MAE) (; ; ; ), Area Under the Curve (AUC) (; ; ; ; ; ), Mean Squared Error (MSE) (; ), Root-Mean-Square Error (RMSE) (), error residuals (), and misclassification rates ().
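
To make the point about accuracy concrete, the following minimal sketch compares accuracy and F-measure on a hypothetical cohort with a 1:10 dropout-to-retained ratio. The cohort, the two sets of predictions, and all function names are illustrative assumptions and are not taken from any surveyed study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 (dropout) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the dropout class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical cohort: 100 dropouts (label 1) and 1000 retained students (label 0).
y_true = [1] * 100 + [0] * 1000

# A trivial model that predicts "retained" for everyone.
always_retained = [0] * 1100

# A hypothetical model that finds 60 of the 100 dropouts and raises 50 false alarms.
partial_model = [1] * 60 + [0] * 40 + [1] * 50 + [0] * 950

for name, y_pred in [("always retained", always_retained),
                     ("partial model", partial_model)]:
    print(name, round(accuracy(y_true, y_pred), 3), round(f_measure(y_true, y_pred), 3))
```

Under these assumed numbers the trivial model reaches about 91% accuracy while finding no dropouts at all, and its F-measure is zero. The second model is only marginally more accurate, yet its F-measure shows that it recovers a meaningful share of the minority class.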

Machine learning can help build better data from which authorities can draw crucial insights that change outcomes. When students drop out of school instead of continuing their education, both students and communities lose out on skills, talent and innovation. To address the student dropout problem, several predictive models have been developed in developed countries to process complex datasets that include details about enrollment, student performance, gender and socio-economic demographics, school infrastructure and teacher skills in order to find predictive patterns. When developing predictive models, however, developing countries need to consider other factors, such as distance to school, which have been ignored by several researchers but matter in the developing countries’ scenario. Although the evaluation of the developed predictive models tends to differ, the focus remains on supporting administrators and educators to intervene and target the most at-risk students, so as to invest in preventing dropout and keep young people learning.

4 Machine learning techniques on addressing student dropout

In the context of student dropout prediction, learning techniques can be supervised or unsupervised.

Supervised learning is based on learning from a set of labeled examples in the training set so that unlabeled examples in the test set can be identified with the highest possible accuracy (). This learning paradigm is efficient and finds solutions to many linear and non-linear problems, such as classification, plant control, forecasting, prediction, and robotics ().
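
As a minimal illustration of this paradigm, the sketch below fits a logistic regression classifier by batch gradient descent on a handful of labeled examples and then labels two unseen students. The two features (attendance rate and mean grade, both scaled to the range 0 to 1), the data values, and the function names are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return 1 (predicted dropout) or 0 (predicted retained)."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical labeled training set: [attendance_rate, mean_grade], label 1 = dropped out.
X_train = [[0.95, 0.80], [0.90, 0.70], [0.85, 0.90], [0.80, 0.60],
           [0.40, 0.35], [0.30, 0.50], [0.50, 0.20], [0.35, 0.30]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(X_train, y_train)
print(predict(w, b, [0.20, 0.20]))  # unseen student with low attendance and grades
print(predict(w, b, [0.95, 0.90]))  # unseen student with high attendance and grades
```

In practice, a held-out test set and one of the imbalance-aware metrics would be needed to judge such a model; the toy data here are cleanly separable, which real student records rarely are.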

Several existing works have focused on supervised learning algorithms, such as the Naive Bayes algorithm, association rule mining, ANN-based algorithms, logistic regression, CART, C4.5, J48, BayesNet, SimpleLogistic, JRip, RandomForest, and ICRM2, for classifying student dropout (). However, among classification techniques, neural networks and decision trees are the two methods most widely used by researchers for predicting students’ performance (; ). The advantage of a neural network is its ability to detect all possible interactions between predictor variables () and to do so even when the relationship between dependent and independent variables is complex and nonlinear (), while decision trees have been used because of their simplicity and comprehensibility in uncovering small or large data structures and predicting values ().
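
The comprehensibility of decision trees can be seen in their basic building block. The sketch below searches for the single best threshold split by weighted Gini impurity, which is the criterion used by CART-style trees; a full tree repeats this search recursively on each side of the split. The features (days absent and mean grade) and the data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y):
    """Return (weighted_gini, feature_index, threshold) of the best single split."""
    best = None
    n = len(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best


# Hypothetical records: [days_absent, mean_grade], label 1 = dropped out.
X = [[2, 70], [3, 45], [5, 80], [4, 60], [12, 50], [15, 65], [9, 40], [20, 55]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

score, feature, threshold = best_split(X, y)
print(f"split on feature {feature} at {threshold} (weighted Gini {score})")
```

The resulting rule ("more than 7 days absent") can be read directly by an educator, which is the kind of simplicity the surveyed works attribute to decision trees.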

Unlike supervised learning, unsupervised learning is used to identify hidden patterns in unlabeled input data. It refers to the ability to learn and organize information without an error signal with which to evaluate a potential solution. The lack of direction for the learning algorithm can sometimes be advantageous, since it lets the algorithm look for patterns that have not previously been considered ().
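
A minimal sketch of this idea is k-means clustering, shown here as one representative unsupervised method rather than one singled out by the surveyed studies. No labels are supplied; the algorithm alternates between assigning each student to the nearest centroid and moving each centroid to the mean of its members. The data, the features, and the choice of initial centroids are illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied initial centroids."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[k])  # keep an empty cluster's centroid
        centroids = updated
    return assign(points, centroids), centroids


# Hypothetical unlabeled students: [attendance_rate, mean_grade].
points = [[0.90, 0.80], [0.85, 0.90], [0.95, 0.70],
          [0.30, 0.40], [0.40, 0.20], [0.35, 0.30]]

labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The two groups that emerge were never named in advance; interpreting them (for example, as engaged and disengaged students) is left to the analyst.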

Several techniques have been proposed for addressing student dropout, including Survival Analysis (; ), Matrix Factorization (; ; ; ; ), and Deep Neural Networks (; ). Other approaches, such as time series clustering (; ), were presented to perform clustering; such techniques are extensively used in recommender systems ().

Survival analysis is used to analyze data in which the time until an event is of interest (). It provides various mechanisms for handling the censored data problems that arise when modeling longitudinal data (also referred to as time-to-event data when modeling a particular event of interest is the main objective), which occur ubiquitously in real-world application domains ().

In the context of education, survival analysis has been used to study student retention. Ameri et al. () developed a survival analysis framework for identifying at-risk students using the Cox proportional hazards model (Cox) and the time-dependent Cox model (TD-Cox). This approach captures time-varying factors and leverages that information to provide more accurate prediction of student dropout, using a dataset of students enrolled at Wayne State University (WSU) from 2002 to 2009. Subjects in survival analysis are usually followed over a specified period of time, and the focus is on the time at which the event of interest occurs (). Thus, the benefit of survival analysis over other methods is the ability to add a time component to the model and to handle censored data effectively. In spite of the success of survival analysis methods in other domains such as health care and engineering, there have been only limited attempts to use these methods for the student retention problem ().
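
To illustrate how censored records enter a survival model, the sketch below computes a Kaplan-Meier retention curve. This is a deliberately simpler estimator than the Cox and TD-Cox models used by Ameri et al., substituted here only to show the handling of time and censoring; the semester counts and event flags are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with at least one dropout.

    durations[i] is the number of semesters student i was observed;
    events[i] is 1 if the student dropped out at that time and 0 if the
    record is censored (still enrolled, or graduated, when observation ended).
    """
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Students still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        dropped = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        if dropped:
            survival *= 1.0 - dropped / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical cohort of eight students.
durations = [1, 2, 2, 3, 4, 4, 5, 6]
events = [1, 1, 0, 1, 0, 1, 0, 0]

for semester, probability in kaplan_meier(durations, events):
    print(semester, round(probability, 3))
```

Censored students are never counted as dropouts, yet they still contribute to the at-risk count for every semester in which they were observed. A classifier trained on a simple dropped/not-dropped label has no comparable way to use such partial records.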

Matrix factorization is a clustering machine learning method that can accommodate a framework with some variations (). The studies by Hu and Rangwala () and Elbadrawy et al. () describe matrix factorization. Elbadrawy et al. () presented two classes of methods for building prediction models, with the aim of facilitating degree planning and determining who might be at risk of failing or dropping a class. The first class builds models using linear regression approaches and the second uses matrix factorization approaches. The regression-based methods comprise course-specific regression (CSpR) and personalized linear multi-regression (PLMR), while the matrix factorization based methods use the standard Matrix Factorization (MF) approach. These approaches were applied to datasets generated from George Mason University (GMU) transcript data, University of Minnesota (UMN) transcript data, UMN LMS data, and Stanford University MOOC data. One limitation of the standard MF method is that it ignores the sequence in which students have taken the various courses. In addition, the latent representation of a course can potentially be influenced by the performance of students in courses that were taken afterward.
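
The standard MF idea can be sketched as follows: each student and each course receives a short latent vector, and a grade is predicted as the global mean plus the inner product of the two vectors. The code below trains such a model by stochastic gradient descent on a small, hypothetical set of observed grades. It is a generic sketch, not the exact formulation of Elbadrawy et al.; the hyperparameters, the grade scale, and the data are assumptions.

```python
import math
import random


def train_mf(grades, n_students, n_courses, rank=2, lr=0.01, reg=0.02,
             epochs=2000, seed=0):
    """Fit latent factors P (students) and Q (courses) by SGD on squared error."""
    rng = random.Random(seed)
    mu = sum(g for _, _, g in grades) / len(grades)  # global mean grade
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_students)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_courses)]
    for _ in range(epochs):
        for s, c, g in grades:
            err = g - predict_grade(mu, P, Q, s, c)
            for k in range(rank):
                p_sk, q_ck = P[s][k], Q[c][k]
                P[s][k] += lr * (err * q_ck - reg * p_sk)
                Q[c][k] += lr * (err * p_sk - reg * q_ck)
    return mu, P, Q


def predict_grade(mu, P, Q, s, c):
    return mu + sum(p * q for p, q in zip(P[s], Q[c]))


def rmse(grades, mu, P, Q):
    squared = [(g - predict_grade(mu, P, Q, s, c)) ** 2 for s, c, g in grades]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed (student, course, grade) triples on a 0-4 scale;
# four of the sixteen student-course pairs are unobserved.
grades = [(0, 0, 4.0), (0, 1, 3.7), (0, 2, 3.3),
          (1, 0, 3.7), (1, 1, 3.3), (1, 3, 3.0),
          (2, 0, 2.0), (2, 2, 1.7), (2, 3, 1.3),
          (3, 1, 1.3), (3, 2, 1.0), (3, 3, 0.7)]

mu, P, Q = train_mf(grades, n_students=4, n_courses=4)
print("training RMSE:", round(rmse(grades, mu, P, Q), 3))
print("predicted grade, student 0 in course 3:", round(predict_grade(mu, P, Q, 0, 3), 2))
```

Note that the triples carry no timestamps, so the model cannot tell which course was taken first. This is the sequence-blindness identified above as a limitation of standard MF.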

Furthermore, Iam-On and Boongoen () proposed a new data transformation model built upon the summarized data matrix of link-based cluster ensembles (LCE). The aim of the study was to establish the clustering approach as a practical guideline for exploring student categories and characteristics. This was accomplished using an educational dataset obtained from the operational database system at Mae Fah Luang University, Chiang Rai, Thailand. Like existing dimension reduction techniques such as Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA), this method aims to achieve high classification accuracy by transforming the original data into a new form. However, a common limitation of these new techniques is their demanding time complexity, such that they may not scale well to very large datasets. Whilst worst-Case Traversal Time (WCT-T) is not well suited to highly time-critical applications, it can be an attractive candidate for quality-led work, such as identifying students at risk of underachievement.

A deep neural network (DNN) is an Artificial Neural Network (ANN) with multiple hidden layers between the input and output layers (). Probabilistic Graphical Models (PGMs), in turn, combine probability theory and graph theory to offer a compact graph-based representation of joint probability distributions that exploits conditional independences among the random variables (). Like shallow ANNs, DNNs can model complex non-linear relationships (; ). Deep learning architectures such as the Recurrent Neural Network (RNN) and probabilistic graphical models such as the Hidden Markov Model (HMM) have been employed on the problem of student dropout ().

The study presented by Fei and Yeung () considered two types of temporal models: state space models and recurrent neural networks. These approaches were applied to two MOOC datasets, one from a course offered on the Coursera platform, called “The Science of Gastronomy”, and the other from a course on the edX platform, called “Introduction to Java Programming”. The aim of the study was to identify students at risk of dropping out. The state space models comprise two variants of the Input Output Hidden Markov Model (IOHMM) with continuous state space, while the recurrent neural networks comprise the vanilla RNN and an RNN with Long Short-Term Memory (LSTM) cells as hidden units. IOHMM was proposed for learning problems involving sequentially structured data. As it originated from the HMM, it learns to map input sequences to output sequences. Moreover, unlike the standard discrete-state HMM, the state space in the described IOHMM formulation is continuous, so it has more representational power than an enumeration of discrete states. Furthermore, the vanilla RNN, unlike feed-forward neural networks such as the Multi-Layer Perceptron (MLP), allows the network connections to form cycles.

A limitation of the study was the vanishing gradient problem. An important property of RNNs is their ability to use contextual information when learning the mapping between input and output sequences. A subtlety is that, for basic RNN models, the range of temporal context that can be accessed in practice is usually quite limited, so the dynamic states of RNNs are regarded as short-term memory. This is because the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the recurrent connections. To make the short-term memory of RNNs last longer, and so tackle the vanishing gradient problem, the Long Short-Term Memory RNN (LSTM network) was introduced.
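
The exponential decay or growth described here can be reproduced with a one-unit recurrent network. The sketch below tracks how strongly the final hidden state depends on the initial one for the scalar recurrence h_t = f(w * h_{t-1} + x_t), by multiplying the local derivatives along the sequence. The weights, the input sequence, and the option of a linear activation (used to isolate the exploding case) are illustrative assumptions.

```python
import math


def gradient_through_time(w, inputs, h0=0.0, linear=False):
    """Return d h_T / d h_0 for h_t = f(w * h_{t-1} + x_t).

    f is tanh by default, or the identity when linear=True.
    """
    h, grad = h0, 1.0
    for x in inputs:
        a = w * h + x
        if linear:
            h = a
            local = w                      # derivative of w * h + x w.r.t. h
        else:
            h = math.tanh(a)
            local = w * (1.0 - h * h)      # chain rule through tanh
        grad *= local
    return grad


inputs = [0.1] * 30  # a hypothetical sequence of 30 time steps

vanishing = gradient_through_time(0.5, inputs)
exploding = gradient_through_time(1.5, inputs, linear=True)
print("recurrent weight 0.5:", vanishing)
print("recurrent weight 1.5 (linear):", exploding)
```

With a recurrent weight below one, every step multiplies the gradient by at most that weight, so the contribution of early inputs shrinks geometrically. With a weight above one and no saturating activation, the same product grows geometrically. LSTM cells address the first case by adding a gated memory path along which the gradient is not repeatedly squashed.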

To address student dropout, machine learning techniques have been applied on various platforms, such as Massive Open Online Courses (MOOCs) (; ; ; ) and Learning Management Systems (LMS) such as Moodle (; ; ). These platforms generate datasets containing information that can be categorized into academic performance, socio-economic and personal information (). MOOC platforms such as Coursera and edX are among the most popular platforms for generating datasets used in student dropout prediction (). Moodle, a popular Learning Management System (), provides public datasets such as UMN LMS (). Furthermore, to identify at-risk students for early intervention, other researchers collected data from an online graduate program in the United States, with validation conducted using the Fall 2014 dataset ().

5 Open Challenges for Future Research

In the previous sections we presented an overview of machine learning techniques for addressing the student dropout problem and highlighted their gaps and limitations. Despite the efforts of previous researchers, some challenges still need to be addressed.

First, it has been observed that most algorithms have been developed and tested in developed countries using datasets generated there. Furthermore, MOOC and Moodle are among the most used platforms offering public datasets for addressing the student dropout problem. The scarcity of public datasets from developing countries () creates the need to develop more datasets from different geographical locations. This may include transferring students’ registration information and ongoing academic progress from paper-based records into electronic storage, a process that requires both money and time. Furthermore, to the best of the researchers’ knowledge, only a few studies have been conducted in developing countries. Thus, further research is needed to explore the value of machine learning algorithms in curbing dropout in the context of developing countries, with the inclusion of factors that apply in that scenario.

Second, most of the presented works have focused only on providing early prediction (). Research in developing countries should therefore focus on facilitating more robust and comprehensive early warning systems for student dropout, which can identify students at risk in future cohorts (early warning mechanism), rank students according to their probability of dropping out (ranking mechanism) and identify students who are at risk even before they drop out (forecasting mechanism).
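
A ranking mechanism of this kind can be sketched in a few lines: students are ordered by predicted dropout probability, and the ranking is judged by how many true dropouts appear among the top k students, where k stands for the number of students a school can afford to support. The risk scores, student identifiers, and outcomes below are hypothetical, and precision at k is used here as one plausible educator-oriented metric, not as the metric prescribed by the surveyed studies.

```python
def rank_by_risk(risk_scores):
    """Return student ids ordered from highest to lowest predicted risk."""
    return sorted(risk_scores, key=risk_scores.get, reverse=True)


def precision_at_k(ranked_ids, dropped_out, k):
    """Fraction of the k highest-ranked students who actually dropped out."""
    top_k = ranked_ids[:k]
    return sum(1 for student in top_k if student in dropped_out) / k


# Hypothetical predicted dropout probabilities from some upstream model.
risk_scores = {"s1": 0.91, "s2": 0.15, "s3": 0.64,
               "s4": 0.08, "s5": 0.77, "s6": 0.32}

# Hypothetical ground truth observed at the end of the school year.
dropped_out = {"s1", "s3", "s6"}

ranking = rank_by_risk(risk_scores)
print(ranking)
print(precision_at_k(ranking, dropped_out, k=3))
```

Framing evaluation around k ties the metric directly to the limited intervention capacity of a school, which a single accuracy figure does not do.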

Third, most existing studies ignore the fact that the dropout rate is often low in existing datasets. This is a serious problem, especially in the context of student retention (), where dropout students are significantly fewer than those who stay. Future research should therefore consider developing student dropout algorithms that take the data imbalance problem into account.
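
One simple way to take imbalance into account, offered here as an example and not as the remedy recommended by the surveyed studies, is random oversampling: dropout records are duplicated at random until both classes are equally represented in the training data. The sketch assumes that at least one minority record exists and that resampling is applied to the training split only, so that evaluation still reflects the true class ratio.

```python
import random


def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority records until the classes are balanced."""
    rng = random.Random(seed)
    minority = [(xi, yi) for xi, yi in zip(X, y) if yi == minority_label]
    majority = [(xi, yi) for xi, yi in zip(X, y) if yi != minority_label]
    # Number of extra minority copies needed (none if already balanced).
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    X_balanced = [xi for xi, _ in combined]
    y_balanced = [yi for _, yi in combined]
    return X_balanced, y_balanced


# Hypothetical training split with a 1:10 dropout-to-retained ratio.
X_train = [[i] for i in range(11)]
y_train = [1] + [0] * 10

X_balanced, y_balanced = random_oversample(X_train, y_train)
print(y_balanced.count(1), y_balanced.count(0))
```

Alternatives with the same goal include undersampling the majority class and weighting the loss by class; which one suits a given dataset is an empirical question.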

Fourth, many studies address student dropout using student-level datasets. However, developing countries need to include school-level datasets when addressing this problem because of the limited resources facing many school districts (). This will involve using new sources of school-level data that capture features related to school needs, and applying additional machine learning approaches to improve the predictive power of the proposed algorithm. Such an algorithm would enable the relevant authorities to plan effectively and accurately, formulate policies, and decide on measures to address the problem, with attention to school-level factors such as the Pupil-Teacher Ratio (PTR), which can be monitored by the authorities.

6 Conclusions

This paper has presented a survey of machine learning techniques for addressing the student dropout problem. The survey draws several conclusions. First, while several techniques have been proposed for addressing student dropout in developed countries, there is a lack of research on the use of machine learning for addressing this problem in developing countries. Second, despite the major efforts to use machine learning in education, the data imbalance problem has been ignored by many researchers, which leads to the use of improper evaluation metrics when analyzing the performance of algorithms. Third, many studies focus on providing early prediction rather than including ranking and forecasting mechanisms for addressing student dropout. Lastly, school-level datasets must be considered when addressing this problem, so that the proposed solutions can help the authorities identify at-risk schools for early intervention.