CRIME PREDICTION AND FORECASTING: FEATURE SELECTION AND VULNERABLE REGION DETECTION MODELS

Bekmaganbet, Galym

URI: http://nur.nu.edu.kz/handle/123456789/5603

Date: 2021-07

Abstract:

Crime is among the most destructive factors affecting society. Law enforcement efforts are mostly directed at identifying criminals after the fact. Reducing crime growth, however, requires proactive measures. An effective model for predicting crime-sensitive regions, combined with identification of the relevant features (factors), would let governmental bodies concentrate their efforts on the most vulnerable areas. The objective of this research is to apply a suitable machine learning algorithm to crime, economic, and social data in order to predict the likelihood that a particular region has a low or high crime level, and then to identify the main social and economic factors that correlate with crime growth. The aim is to assist not only law enforcement bodies but governmental programs as a whole in addressing related issues and improving crime prevention measures. In this work, the most accurate prediction models were compared and investigated. Models were first tested on available open-source data and then applied to data available from Kazakhstani officials. Two main issues arose during evaluation: inconsistency and inadequacy of the data. Consequently, data collection, exploration, preprocessing, and normalization were significant steps. Furthermore, a number of popular models with efficient methodology were compared and combined, and the one most appropriate for the Kazakhstani context was identified. The selected prediction models are based on classification, regression, and clustering techniques: Decision Tree, Random Forest, Naïve Bayes, K-means, and Support Vector Machine. They were tested on both data from open-source materials and data collected from Kazakhstani state bodies.
After parameter tuning and testing of various feature selection techniques, the Random Forest model proved the most accurate among the listed models on the open data (UCI Repository materials; Accuracy: 0.837, Precision: 0.884, Recall: 0.872, F1 score: 0.868), whereas Decision Tree achieved the best result on the Kazakhstani data (govstat.kz materials; Accuracy: 0.781, Precision: 0.801, Recall: 0.767, F1 score: 0.784). Furthermore, statistical analysis was performed to define an appropriate threshold for classifying regions into high and low crime rate groups. At the final stage, a hypothesis about the importance of a certain feature was tested: the model showed that this feature correlates with the target (crime rate) and that including it improved the accuracy of the result. It can therefore be claimed that the more expertise is acquired about important features, the better the selected model will perform.
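
The abstract describes two steps: splitting regions into high and low crime groups by a threshold, and scoring a classifier by accuracy, precision, recall, and F1. The following minimal Python sketch illustrates both. The function names, the median cut-off, and all numbers are assumptions made for illustration; the thesis derives its own threshold from statistical analysis, and none of the values below come from its data.

```python
from statistics import median


def label_by_threshold(rates, threshold=None):
    """Label each region 1 (high crime) or 0 (low crime).

    Assumption: the median is the default cut-off. This is only a
    stand-in for the threshold chosen in the thesis.
    """
    if threshold is None:
        threshold = median(rates)
    return [1 if rate > threshold else 0 for rate in rates]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical per-region crime rates (made-up numbers).
rates = [12.0, 35.5, 8.2, 41.0, 22.3, 30.1]
y_true = label_by_threshold(rates)

# Hypothetical model predictions for the same regions.
y_pred = [0, 1, 0, 1, 1, 1]

metrics = binary_metrics(y_true, y_pred)
```

In practice the predictions would come from a trained classifier such as a Decision Tree or Random Forest, and the metrics would be computed on held-out data.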
