Abstract:
Crime is one of the most destructive factors affecting society. The efforts of
law enforcement bodies are mostly directed at identifying criminals post factum.
However, proactive measures are essential to curb the growth of crime.
Therefore, constructing an effective model for predicting crime-sensitive regions, along
with identifying the proper features (factors), would help governmental bodies
concentrate their efforts on the most vulnerable areas.
The objective of this research is to apply a suitable machine learning algorithm to
crime, economic and social data to predict the likelihood of particular regions having low
or high crime levels, and then to identify the main social and economic factors that
correlate with crime growth. The aim is to assist not only law enforcement bodies but also
wider governmental programs in solving related issues and improving crime prevention
measures.
In the current work, the most accurate prediction models were compared and
investigated. Tests were run on available open-source data, and the resulting models were
applied to available data from Kazakhstani officials.
During evaluation, two main issues were encountered: inconsistency and inadequacy of
the data. Consequently, data collection, exploration, preprocessing and normalization were
significant steps.
Furthermore, a number of popular models with efficient methodology were compared and
combined, and the one that proved appropriate for the Kazakhstani situation was
identified. The main prediction models selected, based on classification, regression and
clustering techniques, were Decision Tree, Random Forest, Naïve Bayes, K-means and
Support Vector Machine.
They were tested on both data available from open-source materials and data
collected from Kazakhstani state bodies. After parameter tuning and testing of various
feature selection techniques, the Random Forest model proved to be the most accurate
among the listed models on the open-source data (UCI Repository materials, Accuracy:
0.837, Precision: 0.884, Recall: 0.872, F1 score: 0.868), whereas Decision Tree achieved
the best result on Kazakhstani data (govstat.kz materials, Accuracy: 0.781, Precision:
0.801, Recall: 0.767, F1 score: 0.784).
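
For readers unfamiliar with the four reported metrics, the following is a minimal sketch of
how they are computed for a binary high/low crime classifier. The function name and the
toy labels are illustrative assumptions, not the study's data or code, and the abstract
does not state which averaging scheme was used for the reported figures.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels.

    `positive` marks the class of interest (assumed here: 1 = high crime).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (hypothetical): 1 = high-crime region, 0 = low-crime region.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters here because high-crime and
low-crime regions need not be equally frequent in the data.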
Furthermore, statistical analysis was performed to define an appropriate
threshold for classifying regions into high and low crime rate groups.
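
As an illustration of this labelling step, the sketch below splits regions at the median
crime rate. The median is an assumption chosen for the example; the abstract does not
say which statistic was used as the threshold, and the region names and rates are
hypothetical.

```python
from statistics import median


def label_by_threshold(rates, threshold=None):
    """Label each region 'high' or 'low' relative to a crime-rate threshold.

    If no threshold is given, the median rate is used (an assumption for
    this sketch, not necessarily the threshold chosen in the study).
    """
    if threshold is None:
        threshold = median(rates.values())
    labels = {region: ("high" if rate > threshold else "low")
              for region, rate in rates.items()}
    return threshold, labels


# Hypothetical crime rates per 10,000 inhabitants.
rates = {
    "region_a": 120.0,
    "region_b": 310.5,
    "region_c": 95.2,
    "region_d": 240.0,
    "region_e": 180.3,
}
threshold, labels = label_by_threshold(rates)
print(threshold, labels)
```

A median split keeps the two classes roughly balanced, which is one common reason to
prefer it over a fixed cut-off.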
At the final stage, a hypothesis about the importance of a certain feature was tested.
The model confirmed that this feature correlates with the target (crime rate) and that
its inclusion improved the accuracy of the results. Therefore, it can be claimed that the
more expertise we acquire about important features, the better the selected model will
perform.
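
One simple way to check whether a candidate feature correlates with the target is the
Pearson correlation coefficient, sketched below. This is an illustrative assumption: the
abstract does not name the test that was used, and the feature values and crime rates
here are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant sequence has no variance, so correlation is undefined;
    # this sketch returns 0.0 in that case.
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


# Hypothetical per-region values of a candidate feature and the crime rate.
feature = [5.0, 7.5, 6.1, 9.8, 12.3]
crime_rate = [110.0, 150.0, 135.0, 210.0, 260.0]
r = pearson(feature, crime_rate)
print(round(r, 3))
```

A coefficient close to 1 or -1 suggests a strong linear relationship, although it says
nothing about causation and would still need to be confirmed by comparing model
accuracy with and without the feature.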