Capturing Hidden Geochemical Anomalies in Scarce Data by Fractal Analysis and Stochastic Modeling

Fractal/multifractal modeling is a widely used geomathematical approach to capturing different populations in geochemical mapping. The rationale of this methodology is based on empirical frequency density functions attained from global or local distributions. This approach is quite popular because of its simplicity and versatility; it accounts for the frequency and spatial distribution of geochemical data considering self-similarity across a range of scales. Using this technique for detection of geochemical anomalies in scarce data, however, is problematic and can lead to systematic bias in the characterization of the underlying populations. In this paper, an innovative technique is presented that provides good results without a priori assumptions. A simulation approach is adopted for fractal analysis by generating different possible distribution scenarios for the variable under study to reveal the underlying populations that are frequently hidden due to lack of data. The proposed technique is called the global simulated size–number method, and it is validated in a case study with two synthetic datasets and another case study with real dataset from the Ushtagan gold deposit in northeast Kazakhstan.


INTRODUCTION
Various mathematical and statistical methods have been applied to generate accurate geochemical or geophysical anomaly maps using the data provided and based on the frequency and spatial distribution of geochemical data. For example, traditional statistical methods of exploratory data analysis (EDA) (Tennant and White 1959;Hawkes and Webb 1962;Tukey 1977) and modern techniques such as fractal analysis (Mandelbrot 1983) have been used. Among them, fractal/multifractal modeling can be more robust in identifying significant geochemical anomalies because traditional statistical methods such as EDA are based mainly on geochemical value frequency distributions, but neglect the spatial variation of geochemical data.
Fractal/multifractal modeling accounts for both the frequency distribution and spatial variation of the geochemical values (Mandelbrot 1983;Feder 1988;Cheng 2012;Zuo and Wang 2016). In a nutshell, fractal/multifractal modeling considers both statistical and spatial distributions of geochemical data (e.g., Sadeghi et al. 2012Sadeghi et al. , 2015He et al. 2013;Luz et al. 2014). These properties are based mainly on self-similarity or self-affinity (i.e., statistical selfsimilarity at different scales (Mandelbrot 1983)). The most significant fractal models that have been developed to delineate geochemical anomalies can be classified into two families. The first family is based on posterior analysis or conditional distribution, such as concentration-area (C-A) and concentration-perimeter (C-P) fractal models (Cheng et al. 1994(Cheng et al. , 1996, spectrum-area (S-A) fractal models (Cheng et al. 1999(Cheng et al. , 2000, concentrationvolume (C-V) fractal models (Afzal et al. 2010), spectrum-volume (S-V) fractal models (Afzal et al. 2011), simulated size-number (SS-N) fractal models (Sadeghi et al. 2015) and wavelet-number (W-N) fractal model (Chen and Cheng 2018). In these methodologies, the spatial geometry of the geochemical landscape is considered for either deterministic or stochastic mapping to show the local distribution of the attribute under study and whether it is applicable for detecting different anomalies based on further fractal analysis. The second family involves the analysis of a priori or global distributions, such as the number-size (N-S) fractal model (Mandelbrot 1983), regarding a density function with statistical parameters informed by a sample histogram. In this case, sample locations are not important; one needs to analyze only the global distribution of the samples and variables.
In general, fractal methods tend to be erratic when data are limited and result in some sawtoothlike spikes in the distribution that make it non-representative. These artifacts lead to further suspicion of the N-S fractal model (Mandelbrot 1983) because they may potentially mask the actual thresholds and populations of interest. To circumvent this problem, one idea is to smooth out the histogram, particularly to reform the shape of the fluctuations and increase the resolution of the distribution (Pyrcz and Deutsch 2014;Rossi and Deutsch 2014). Certain techniques are available for this type of smoothing. The most straightforward approach is to fit a predefined parametric density function such as a power, normal or lognormal distribution to the sample data distribution (Johnson and Kotz 1970;Scott 1992;Borradaile 2003). Although these parametric models resolve spikes in a sample histogram, earth-related data in fact rarely display parametric behavior. Other solutions include nonparametric approaches, such as simulated annealing (Journel and Xu 1994;Deutsch 1996) or quadratic programming (Xu and Journel 1995) and optimization algorithms, which are based on closeness to input quartiles, target mean and variance values (Deutsch and Journel 1998). Kernel density estimation (Fix and Hodges 1951) is an alternative nonparametric approach that fits smoothed probability and cumulative distribu-tion functions (pdf and cdf) to empirical distribution functions (Silverman 1986;Scott 1992;Altman and Leger 1995;Sheikhpour et al. 2017). This technique requires a kernel density function, an optimal measure of bandwidth and a dataset (Wolfgang et al. 2004;Alexandre 2009;Samawi et al. 2016).
The algorithm proposed in this paper is based on the estimation of the density function model for scarce data using a kernel estimator. Using the estimated function, the possible scenarios of the underlying distribution can then be generated by applying Monte Carlo simulation. Once the simulated values are statistically qualified, the fractal analysis paradigm (SS-N), following Sadeghi et al. (2015), can be implemented to differentiate the geochemical anomalies. The resulting technique is called the global simulated size-number (GSS-N) method. The main difference between the SS-N method (Sadeghi et al. 2015) and the algorithm (GSS-N) is that the GSS-N method is based on global distribution, which is independent in terms of spatial continuity, whereas the SS-N method is based on local distribution, in which it is necessary to account for the spatial geometry of the geochemical landscape.
This paper will first show how the proposed innovative algorithm works by using two synthetic case studies, and then apply it to an actual case study from Kazakhstan to capture from just a few trench samples the alternative populations associated with a gold deposit. These case studies show that the GSS-N method can be applied to the pre-feasibility or feasibility study stages of a project that is faced with a problem of scarce data. The results of this study are intended to be useful guiding further exploration.

Number-Size (N-S) Fractal Method
The N-S method, proposed by Benoit Mandelbrot (1983), is a model based on the relationship between the cumulative number of samples and their related size, the latter representing metal concentration in this research (Li et al. 1994;Turcotte 1996;Shi and Wang 1998;Sanderson et al. 1994;Zuo et al. 2009;Sadeghi et al. 2012Sadeghi et al. , 2015. Considering Eq. 1, the number of samples with higher concentrations is less than the number of the samples with lower concentrations. Monecke et al. (2005) used the N-S method to describe enrichment of minerals based on replacement by metasomatic processes that cause the formation of hydrothermal deposits in the Waterloo massive-sulfide deposit in Australia. Sadeghi et al. (2012) applied this model in 3D for the first time to separate mineralized zones and wall rocks and then in 2015 improved the model by combining the N-S model with simulation methods, resulting in the proposal of the SS-N model.
In the equation developed by Mandelbrot (1983), the minus sign denotes the inverse relationship between the number of samples and their associated concentrations: where q is element concentration, N( ‡ q) is cumulative number of samples with concentration values greater than or equal to q; F is a constant, and D is the scaling exponent or fractal dimension of the spatial distribution of element concentrations. Loglog plots of N vs. q are used because if logarithms are applied to Eq. 1, the outcome is a straight-line equation. After generating log-log plots, they are fitted with straight lines to find the intersection points, which represent thresholds. The final interpolated map is classified using the obtained thresholds.

Kernel Density Estimator
Kernel densities are nonparametric techniques for smoothing histograms (Scott 1992). A kernel density estimator is applicable whenever sawtoothlike spikes appear in the distribution due to data scarcity. The concept of this estimator is to ''fill in'' gaps in raw data distributions (Leuangthong and Deutsch 2003). Given this background, a kernel density estimator can predict the pdf and cdf of a set of random data. The estimator formulas for any real values of u are given in the literature (Hill 1985;Silverman 1986;Jones 1993;Bowman and Azzalini 1997). In where u 1 , u 2 ,…,u n are random samples from an unknown distribution, n is number of data points, and Dz is bandwidth obtained by splitting the range of the data between the maximum and minimum observed values (Izenman 1991), which controls the smoothness of the density curve. Attention must be paid to defining the bandwidth because if it is large it generates a very smooth kernel function. K(.) is the kernel function, which is nonnegative and linked to a subset of a particular density function such as a normal, box, triangle or Epanechnikov function (Epanechnikov 1969).f u ð Þ is an estimate of the density function obtained by placing the underlying kernel function K(.) over each random sample u i in a dataset. Therefore, the kernel function K(.) is a density function with location parameter u and the bandwidth of interest. Because the resulting estimatef u ð Þ is obtained from summation of sub-kernel functions K(.), the selection of K(.) is not critical and quite often a normal function can be used for the sake of simplicity (Wand and Jones 1995). In other words, by defining each of those mentioned underlying functions, the final kernel density estimate will be closely related to the desired histograms and similarly will represent the probability distribution of the sample data with small differences. A kernel distribution results in a continuous and smooth probability curve, in contrast to a histogram, which builds discrete bins, and places each data value in the relevant bin. Through a small example, Figure 1 shows the visual comparison between a kernel fitted distribution and the related histogram from three arbitrary sample data: u 1 = 13, u 2 = 29, and u 3 = 29. For construction of a typical global distribution, the abscissa of the histogram is divided into subintervals or bins, which span the range of the data (Fig. 1a). As seen in Figure 1a, with only a few data points, the histogram generates a discrete probability function unsuitable for specific applications such as differentiating the alternative populations by fractal analysis. Figure 1b shows the overall fitted probability distribution function (solid line) by a normal kernel function that creates an individual probability density curve (dashed lines) for each data value and then sums the tiny smooth curves to form a unique, smooth, continuous probability density function for the entire data set. The tiny kernel functions in this case are normal, but their summations according to Eq. 2 result in a bimodal distribution, which is not necessarily normal but follows the desired distribution.

Monte Carlo Simulation
Monte Carlo simulation is a broad term for computational algorithms that generates random numbers (realizations) given a specific density function (Pyrcz and Deutsch 2014). The Monte Carlo simulation of random numbers from an arbitrary probability or cumulative distribution function can be obtained by Deutsch and Journel (1998) (Fig. 2): 1) Generation of a random number q, which is uniformly distributed between zero and one, q 0; 1 ½ 2) Retrieving the inverse of the cumulative probability distribution (cdf) function: where y is the inverse cumulative probability distribution f À1 (.). According to the ergodic theory (Chilè s and Delfiner 2012), in order to converge the statistical parameters (such as the mean and variance) of simulated values to the original dataset, it is typically necessary to simulate a large number of values.

Global Simulated Size-Number (GSS-N) Method
As mentioned previously, scarce dataset will result in a poorly formed global distribution function, leading to general problems when using further analysis. However, fractal and multifractal techniques need a trustworthy distribution to precisely delineate the domain of geochemical populations. The innovative GSS-N method in this paper is based on applying a smoothing technique (kernel density) to circumvent the problem of discrete probability functions, which stems from scarce data. The proposed algorithm is described as follows: a) Generating the histogram of the data In this step, the histogram of the available dataset is checked to determine whether the distribution is representative. To implement this, some criteria such as the coefficient of variation and kurtosis of the data are used. The former is the standard deviation normalized by a mean; it is a unitless measure of the variability (Davis 1986;Rossi and Deutsch 2014). The kurtosis is a good measure to ascertain whether the data distribution is flat or peaked. A negative kurtosis indicates that most samples are concentrated in the center of the distribution, making the shape of the histogram in the form of a sawtooth. b) Empirical cumulative distribution function This step includes calculation of a primary nonparametric estimate of the true cdf of a variable. In this graph, the abscissa shows data values ordered from smallest to largest and the ordinate represents the cumulative probability assigned to each data point (Davis 1986). When the sample size of the data is small, the curve gives a sawtooth-like shape and has a very spiky stair-step appearance. c) Kernel density function modeling This step applies one of the kernel functions to fit a cdf model to the obtained experimental cdf from the previous step. In this context, three items are significant: type of function, bandwidth and range of the data. The commonly used functions are the normal, box, triangle and Epanechnikov functions. For instance, the box kernel produces a density curve that is less smooth than the others. For the selection of the optimum function and the bandwidth, one can apply the cross-validation technique (Liu et al. 2014), plug-in methods (Tenreiro 2017) or even visual inspection, which provides a qualitative benchmark and can be accomplished through trial-and-error. The tentative range of the data in a model can be either identical to the original dataset or be chosen arbitrarily according to geological constraints. All these parameters should be considered cautiously to prevent over-and under-smoothing. that the SS-N fractal model (a combination of geostatistical simulation and fractal/multifractal analysis) is capable of delineating mineralized zones better than deterministic paradigms. This approach employs different realizations of the spatial variability beyond geostatistical simulation of the attribute under study. The main difference of the algorithm proposed in this paper and the SS-N algorithm is that the GSS-N is implemented on the realizations obtained from the global distribution of the underlying variable but not on the local distribution as in the SS-N method. In the GSS-N method, each realization results in one fractal curve and once the average of those curves is known, one is able to differentiate the possible thresholds on the averaged curve.

CASE STUDY
To demonstrate the capability of the proposed method, it is tested in two case studies. The first is based on two synthetic datasets, and the second employs an actual case study. In both case studies, relative discussions are provided.

Case Study with Synthetic Datasets
Two different distributions were considered to generate synthetic datasets suitable for applying the proposed algorithm. The first distribution (case I) consists of 200 random numbers that follow an exponential distribution, composed of two major populations with different mean (i.e., 0.5 and 5000). The second distribution (case II) consists of 900 normal random values composed of two major populations with different mean (i.e., 4 and 9). Aside from having different distribution types, the two major populations were deemed to be hidden in cases I and II but are distinguishable through visual inspection of the histogram (Figs. 3a and 4a). These two distributions were selected as references, and then 20 samples were drawn randomly from each distribution to produce two series of scarce data called ''experimental data''. Using this method, one can ensure that the histogram of either set of experimental data is not continuous; besides, scarce data dramatically impact the shapes of the distribu-tion (Figs. 3b and 4b). As a consequence, the empirical cdf is expected to display a sawtooth shape. Figures 3c and 4c (red solid lines) show that the resulting probability distribution functions computed from those scarce data are not trustworthy and justify the need to use some smoothing technique for fitting a theoretical cdf model (Figs. 3c and 4c: blue solid lines). The Epanechnikov kernel function (a nonparametric approach) was then considered to fit a model to an empirical cdf. As already discussed, this function does not have a distinct difference compared with other functions. Following the algorithm steps described earlier, 1000 numbers between 0 and 1 uniformly generated, and then equivalently, 1000 simulated values were inversely drawn from the kernel fitted model by Monte Carlo simulation. For fast, reliable convergence of the simulated values to the model, 100 total realizations were considered a sufficient number for such an examination. Figures 3d and 4d illustrate the empirical cumulative distribution functions for all the realizations (dashed green lines) and their respective average (blue solid line) that fairly converges to the fitted model (Figs. 3c and 4c: blue solid line). This convergence ensures that the realizations are statistically sound and they can be taken into account for further processing. Fractal analysis was implemented in each realization separately to obtain their relevant curve. Those curves were then averaged to obtain one unique curve, so as to account for defining the thresholds in each population (Figs. 3e and 4e). As shown in this figure, the shape of the fractal curve in the simulated results mimics the reference models and, consequently, the thresholds retrieved from the average realization curve are similar to those from the reference models. However, fractal analysis of scarce experimental data does not provide adequate knowledge for differentiating the populations (note the red points in the fractal sheet). Two synthetic case studies verify that the proposed algorithm comprises a safe paradigm to tackle the problem of data scarcity for fractal/multifractal analysis considering two different common global distributions (exponential and normal). A real dataset from Ushtagan gold deposit in Kazakhstan is used here. This case study is presented as an interesting example of scarce data because this phenomenon (lack of data) is a very common characteristic of the majority of gold deposits, particularly in the feasibility study phase of a project.
The Ushtagan gold deposit is located in northeast Central Kazakhstan and administratively is situated in the Bayanaul district of the Pavlodar region (Fig. 5). The Pavlodar region is one of the main industrialized regions of Kazakhstan with developed mining, fuel and energy sectors, and a multisector industrial complex. From a geological setting standpoint, the Ushtagan deposit is located in the middle-Devonian volcanic system. The central part of the volcanic system is represented by a stock of plagiogranite-porphyrites (Fig. 6). Different fragmental tuffogenic rocks with rare interlayers of effusive formations of acid and intermediate composition are common in the volcanogenic strata (Fig. 6), which are sometimes strongly silicified, tourmalinitized and pyritized. In present-day terrain, this structure is represented by small bald peaks formed by quartz-sericite-tourmaline and quartztourmaline rocks. Practically, the rock-type boundaries are not obvious macroscopically but are determined only by microscopic study. The fault tectonics are rather distinct. The following types of faults were recognized: (a) curved (semi-circular); (b) faults and diagonal slip faults; (c) accompanying echelon fractures. A curved fault restricts the neck of the plagiogranite-porphyrite from the northeast. This fault is rather steep, but with a weak incline to the periphery of the volcanic tectonic structure. The fault is accompanied by intensive milling of rocks and quartz-gold-sulfide mineralization. Faults and diagonal slip faults most commonly have submeridian and northeast directions; more rarely they have sub-latitudinal direction. One fault (in the b Figure 4. Synthetic case II, normal distribution; (a) Reference distribution, (b) experimental data, (c) cumulative distribution function, (d) cumulative distribution function of the realizations and their checking, (e) fractal analysis on the simulated values (blue points), reference model (black points) and the scarce data (red points). northeast) cuts the volcanic structure and restricts ore-bearing metasomatites from northwest in the region (Fig. 6). Mineralized rocks are characterized by intensive pyritization and vein-veinlet silicification with tourmaline. They are accompanied by intensive jarositization and limonitization from the surface. Gold-bearing bodies are represented by tourmaline-quartz metasomatites with breccia texture developed after igneous-sedimentary rocks.
There is less developed mineralization in metasomatites of breccia plagiogranite-porphyrite. The mineralized zone represents a linear stockwork with uneven, complex internal structure. The highest gold and silver grades are limited to quartz-sulfide and quartz-chalcedonic veinlet zones which differ in their intense rust color compared to the rocks on the surface. Gold-hosting tourmalinized rocks of the Ushtagan area were discovered in 1953. In 1955, during prospecting, two lenses of secondary quartzite were mapped. The database in this study was obtained from eight trenches made by Goldbelt Resources LTD in 1966 on the deposit with a total volume of 3746 m 3 . The level of trench deepening is 0.3-0.4 m (Fig. 6). The majority of trenches (due to the complexity of the visual identification of ore intervals) were tested along their entire length by channel sampling and using gold spectral or atomic absorption analyses followed by fire assay test. According to the observations in the trenches, the tuffs, breccias and mineralized bodies with quartz and quartztourmaline veins dip sub-vertically (70-80°) in the northeastern part. The initial statistical analysis on 26 gold accumulation samples meter Â gram ton abreviated as mg t taken from the eight trenches revealed that the lack of data potentially results in an erratic global distribution, a characteristic which makes it a suitable target for implementing the proposed algorithm (Fig. 7). The histogram in Figure 7 displays some discontinuity from approximately 40 to 90 mg=t. A moderately high coefficient of variation and negative kurtosis can also be another set of clues for this scarce dataset (Table 1).
To capture the populations in this deposit, taking the proposed algorithm into account, we recall the previously mentioned instructions from Sect. 3.3: a) Generating the histogram of the data As shown in Figure 7 and Table 1, the scarcity of the data is clearly apparent and motivates one to use the proposed algorithm. b) Empirical cumulative distribution function Once the probability distribution function is known, the empirical cdf can then be computed. This function, as explained before, is a step function that for these data increases by 1/26 for each of the 26 samples. c) Kernel density function modeling the tentative theoretical function model for fitting to the empirical cdf has been considered ''normal'' with bandwidth ''0.4'' (Fig. 8a). To check the accuracy of the fitted model, crossvalidation was performed by omitting one actual data point and re-estimating it using the kernel fitted model obtained from the remaining data (Fig. 8b). Two values for the underlying datum were scanned, which all are uniformly distributed between 0 and 1. The first value gives the original empirical cdf (abscissa in Fig. 8b), the second value gives the empirical cdf (ordinate, Fig. 8b), and they were measured from the raw dataset and the kernel fitted model, respectively. The cross-validation results illustrate that the scattered points, composed of first and second values, are well distributed along the diagonal and are in agreement with a small error. d) Monte Carlo simulation In this step, 1000 samples within 100 realizations were drawn from the ''normal'' kernel fitted model. To check whether the simulated results were statistically valid, the empirical cdf curve over each realization was computed (green solid line) and then their related average (solid blue line) ( Fig. 9) was generated (Fig. 8a). This comparison shows that the simulated results on average converge to the model and can be employed for use in further analysis. e) Fractal analysis The posterior simulated values were obtained by the process of N-S fractal modeling to capture the concealed populations for both sets of fractal analysis obtained from the raw dataset (Fig. 10a) and the average of the results from the realizations (Fig. 10b). Detecting the thresholds through fractal analysis of the raw data is somewhat complex, tricky and needs particular interpretation. Nevertheless, through the GSS-N method, besides the previous steps and all the statistical checking during the simulation, it was verified that the simulated values on average follow the model. Therefore, one is able to define the thresh-   olds on this graph (Fig. 10b) without any loss of information. For each population, Table 2 shows the ranges for gold accumulation variability and further, the statistical parameters obtained from fitting the linear functions (solid lines) by least squares technique to each population in both cases (original data and GSS-N method) over the fractal points. According to Table 2, the R 2 , adjusted R 2 and root-mean-squared error (RMSE) for the GSS-N method indicate valid results. High values of R 2 , adjusted R 2 and the closeness of RMSE to zero imply that more points incorporated to construct such an acceptable fit lead to producing a more reliable range of geochemical populations. However, care must be taken to examine the populations derived from the original data (26 samples) according to the statistical validation parameters, as the population ranges seem to be suspicious.

DISCUSSION AND CONCLUSION
We presented an innovative algorithm for geochemical anomaly identification for variables of interest in scarce data. The difference between this technique and the methodology applied by Sadeghi et al. (2015) for anomaly identification is explained in this section. Sadeghi et al. (2015) presented the simulated size-number (SS-N) approach considering a Monte Carlo simulation over the local distribution. For instance, turning band simulation (Emery and Lantuejoul 2006) or sequential Gaussian simulation (Almeida and Journel 1996) has been recommended to use for construction of such a stochastic local distribution that can be used subsequently for differentiating deposit anomalies. However, this approach is robust when sufficient data are available and can be considered subject to the proper spatial continuity analysis (variogram) and further proba-  bilistic modeling (i.e., generating the different realization maps). Reliable maps produced from a large dataset are predictable; the analysis showed that the anomalies acquired from these maps by fractal analysis are more trustworthy than those obtained by applying deterministic solutions (such as kriging). However, when data are scarce, fractal analysis does not give correct local distributions because non-robust variogram models often lead to biased thresholds. The GSS-N method as offshoot of the SS-N method is proposed as an innovative algorithm for dealing with geochemical mapping with scarce data, and it does not require consideration of local distributions. The case studies using synthetic and real dataset confirm the capability of the proposed algorithm. Fractal/multifractal methods of analysis have been widely used to delineate anomalies in geochemical and ore deposit modeling because of their practical benefit and versatility. These methods are prone to potential biases due to the non-representative distribution that results from scarce data. A global Monte Carlo simulation algorithm is proposed to address this problem of non-representativeness of the distribution due to scarce data. Fitting straight lines on N-S multifractal log-log plots would be problematic and uncertain in case of scarce data; however, using the GSS-N method overcomes the problem of scarce data and thus facilitates straight-line fitting on log-log plots. Therefore, in case of scarce data, thresholds can be easily recognized with higher certainty through the proposed GSS-N method compared to the N-S fractal model. The proposed algorithm employs a simulation approach to calculate possible distribution scenarios of the variable under study based on fitting a kernel density function to the empirical cdf. The theoretical results from using two synthetic datasets and real  dataset from a gold deposit demonstrate that new algorithm removes possible biases, insures reproduction of the actual hidden distribution of scarce data and improves the capture of the underlying populations even when the data are scarce.

ACKNOWLEDGMENT
The first author acknowledges Nazarbayev University for funding this work via Social Policy Grant. The authors also appreciate Prof. Priscilla P. Nelson, head of the Department of Mining Engineering in Colorado School of Mines, for providing the data for real case study for this paper. The authors also appreciate Dr. Masoumeh Khalajmasoumi for her kind supports. The authors are appreciated the constructive comments from two anonymous reviewers, and also we are grateful to Dr. John Carranza for the valuable comments which substantially helped improving the final version of the manuscript.