Methods to Assess Groundwater Potential by Spring Locations

Abstract

In view of the ever-increasing problem of water scarcity in many countries, the current study applies support vector machine (SVM), random forest (RF), and genetic algorithm optimized random forest (RFGA) methods to assess groundwater potential from spring locations. To this end, 14 effective variables, including DEM-derived, river-based, fault-based, land use, and lithology factors, were prepared. Of the 842 spring locations identified, 70% (589) were used for model training, and the remainder were used to evaluate the models. The models were run and groundwater potential maps (GPMs) were produced. Finally, the receiver operating characteristics (ROC) curve was plotted to evaluate the efficiency of the methods. The results showed that the RFGA and RF methods performed better than the different kernels of the SVM model. The area under the ROC curve (AUC) for RF and RFGA was estimated as 84.6 and 85.6%, respectively. The AUC was computed as 78.6% for SVM-linear, 76.8% for SVM-polynomial, 77.1% for SVM-sigmoid, and 77% for SVM-radial basis function. Furthermore, the results indicated that altitude, TWI, and slope angle had the highest importance for groundwater potential. The methodology developed in the current study could be transferred to other places with water scarcity issues for groundwater potential assessment and management.

Key words: Geographic information system, Ardebil, Iran, Support vector machine, Random forest, Genetic algorithm

Introduction

Water scarcity is regarded as one of the most substantial socio-environmental challenges in many countries. The demand for groundwater is increasing, and the overexploitation of this valuable resource is threatening future generations (Todd and Mays 2005; Rekha and Thomas 2007); thus, its management is vital. A better water resources management plan becomes possible when there is enough knowledge about the resources (i.e. high potential and susceptible zones).

In recent years, researchers have used a variety of models to map groundwater potential, such as frequency ratio (FR), weight of evidence (WofE), logistic regression (LR), index of entropy, and evidential belief function (Oh et al. 2011; Ozdemir 2011a, b; Pourtaghi and Pourghasemi 2014; Davoodi Moghaddam et al. 2015; Naghibi and Pourghasemi 2015; Naghibi et al. 2015). Some researchers have also used machine learning methods in this field, including boosted regression tree (BRT), classification and regression tree (CART), general linear model (GLM), and RF algorithms (Naghibi and Pourghasemi 2015; Rahmati et al. 2016). Lee et al. (2012) employed an artificial neural network (ANN) to assess groundwater productivity, and their results showed satisfactory performance of the ANN. Recently, Naghibi et al. (2017) used four data mining models, namely AdaBoost, Bagging, generalized additive model, and naïve Bayes, for groundwater potential mapping. They also introduced a novel ensemble method that combines these models with FR. In addition, Magaji et al. (2016) used a geographic information system and the evidential belief function model to produce a map of groundwater recharge potential zones. Theodossiou (2004) investigated how climate change influences the sustainability of groundwater at the watershed scale in Greece. Furthermore, Thivya et al. (2016) used stable isotopes to identify the recharge mechanisms of groundwater in hard rock aquifers.

The support vector machine (SVM) algorithm has been employed with suitable efficacy in different fields of study, such as flood susceptibility assessment (Tehrany et al. 2014; Tehrany et al. 2015) and landslide susceptibility investigation (Brenning 2005; Kavzoglu et al. 2014; Tien Bui et al. 2012; Yao et al. 2008; Yilmaz 2010; Tien Bui et al. 2015; Chen et al. 2017). The genetic algorithm is one of the most advanced and widely used heuristic search techniques in artificial intelligence, and it has been applied in many fields of study, including urban planning, ecological and climatic modelling, and remote sensing (Hasegawa et al. 2013; Termansen et al. 2006; Chang et al. 2006; Chen et al. 2009).

In the current study, we investigate the performance of a novel method for the optimization of random forest and compare its results with those of the RF and SVM models in groundwater potential mapping. Based on the literature review, the application of different SVM kernels and of RFGA to groundwater potential mapping are the two main novelties of this study. The importance of the different effective factors in groundwater potential is also discussed. The results of the current study could identify high potential and susceptible groundwater zones and be used by water resource managers.

Material and Methods

Figure 1 shows the methods and the flowchart implemented in the current study.

Study Area:

The study area lies from 48° 18′ 26″ to 48° 53′ 16″ eastern longitudes and from 37° 41′ 23″ to 37° 09′ 26″ northern latitudes in Ardebil Province, Iran (Fig. 2). It covers an area of 1,524 km². The elevation in the study area ranges from 840 to 3,320 m above sea level, with an average of 1,930 m. The mean annual precipitation of the Khalkhal region is 345 mm, and the mean annual temperature is 12 degrees Centigrade. In terms of land use, 89.69% of the Khalkhal region is covered by rangeland; the other land use classes are forest, agriculture, orchard, and residential areas. In terms of lithology, the Khalkhal region comprises 14 lithological categories, of which the Eav class (andesitic volcanic) covers most of the study area. The Khalkhal region is located in Ardebil Province, Iran, and includes 14 hydrological watersheds. These watersheds are located in three main parts: the central part, Khoresh Rostam, and Shahrood. In this area, people exploit water resources through wells (42%), springs (47%), and qanats (11%); a high percentage of the water requirement is therefore met by springs.

Data preparation

Spring characteristics

The spring location map was prepared for the study area at 1:50,000 scale using national reports (Iranian Department of Water Resources Management) and extensive field surveys. Of the 842 springs identified in the study area, 70% (589 springs) were used for training, and 30% (253 springs) were used as the validation dataset (Fig. 2). Approximately ninety percent of the springs are permanent and ten percent are seasonal. The discharge of the springs in the Khalkhal region varies between 0.1 and 100 liters per second, with an average of 1 liter per second. There are different kinds of spring in the study area such as contrast, drainage, and fracture springs with 5.34%, 29.81%, 58.08%, and 6.77% of the springs, respectively. The average pH of the springs is 6.68. The average electric conductivity (EC) of the springs is measured as 470 .
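The 70/30 partition described above can be illustrated with a short sketch. It assumes a simple random split without stratification; the seed, the function name split_springs, and the use of integer identifiers in place of real spring records are hypothetical and are not taken from the study.

```python
import random


def split_springs(spring_ids, train_fraction=0.7, seed=42):
    """Randomly partition spring identifiers into training and validation sets.

    The seed and the plain random (non-stratified) split are assumptions
    made for illustration only.
    """
    shuffled = list(spring_ids)
    random.Random(seed).shuffle(shuffled)
    # Truncate to a whole number of springs: 0.7 * 842 = 589.4 gives 589.
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# 842 placeholder identifiers stand in for the mapped springs.
train, validation = split_springs(range(842))
print(len(train), len(validation))
```

With 842 springs, truncating 70% to a whole number reproduces the 589 training and 253 validation springs reported in the text.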

Groundwater effective factors

In this study, based on the literature review (Ozdemir 2011a, b; Oh et al. 2011; Naghibi et al. 2017), fourteen groundwater effective factors were prepared and mapped: altitude, slope angle, slope aspect, plan curvature, profile curvature, slope length (LS), stream power index (SPI), topographic wetness index (TWI), distance from rivers, river density, distance from faults, fault density, land use, and lithology.

The digital elevation model (DEM) of the Khalkhal region was created from 1:50,000-scale topographic maps at 20 m resolution. Groundwater effective factors such as altitude, slope angle, and slope aspect were derived from the DEM in ArcGIS 9.3 and are shown in Fig. 3a-c.

Plan curvature describes the divergence and convergence of flow and discriminates among basins (Fig. 3d). Profile curvature shows the rate at which the slope gradient changes in the direction of maximum slope (Catani et al. 2013) (Fig. 3e). The slope length factor (LS) combines slope length and slope steepness and indicates the soil loss potential of the combined slope features (Fig. 3f). SPI is a measure of the erosive power of flowing water, based on the assumption that discharge is proportional to the specific catchment area (Moore et al. 1991) (Fig. 3g). TWI reflects the accumulation and movement of surface runoff over the land surface (Elmahdy and Mostafa Mohamed 2014) (Fig. 3h).
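The two hydrological indices can be written out per DEM cell as a minimal sketch. It uses the commonly cited forms TWI = ln(As / tan β) and SPI = As · tan β, where As is the specific catchment area and β the slope angle; the text does not state which exact formulation was applied in ArcGIS, so the formulas, the function names, and the small slope floor used to avoid division by zero on flat cells are assumptions.

```python
import math

# Assumed floor on the slope angle so that flat cells do not divide by zero.
MIN_SLOPE_DEG = 0.1


def _tan_slope(slope_deg):
    """Tangent of the slope angle, clamped to the assumed minimum slope."""
    return math.tan(math.radians(max(slope_deg, MIN_SLOPE_DEG)))


def twi(specific_catchment_area, slope_deg):
    """Topographic wetness index: ln(As / tan(beta))."""
    return math.log(specific_catchment_area / _tan_slope(slope_deg))


def spi(specific_catchment_area, slope_deg):
    """Stream power index: As * tan(beta)."""
    return specific_catchment_area * _tan_slope(slope_deg)


# A cell draining 100 m^2 per unit contour width on a 45-degree slope.
print(round(twi(100.0, 45.0), 3), round(spi(100.0, 45.0), 3))
```

For the same catchment area, a gentler slope raises TWI and lowers SPI, which is why the two indices carry complementary information.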

Distance from rivers and river density were derived from the topographical map of the Khalkhal region (Fig. 3i, j). The distance from faults and fault density layers were produced from the geological map (Fig. 3k, l).

The land use map was created using Landsat images (Fig. 3m). There are five land use classes in the study area: agriculture, forest, orchard, rangeland, and residential area. Most of the study area is covered by the rangeland class. The lithology map was derived from a 1:100,000-scale geological map, and the lithological units were grouped into fourteen classes (GSI 1997, Fig. 3n, Table 1).

Support vector machines (SVM)

SVM is a supervised machine learning technique based on the structural risk minimization (SRM) principle and statistical learning theory (Tien Bui et al. 2012). SVM transforms the original input space into a higher-dimensional feature space to find an optimum separating hyperplane. Marjanović et al. (2011) stated that the separating hyperplane is built in the original space of n coordinates between the points of two distinct classes. If a point lies above the hyperplane it is classified as +1; otherwise, it is classified as -1.

The penalty (C) controls the trade-off between margin and training errors, which helps to prevent over-fitting of the model (Marjanović et al. 2011). The kernel width (γ) controls the degree of nonlinearity of the model (Tien Bui et al. 2012). The parameter (d) is the polynomial degree in the polynomial (PL) kernel function, and (r) is the bias term in the kernel function of two SVM kernels, namely the PL and sigmoid (SIG) kernels (Tehrany et al. 2014). In the current study, 10-fold cross-validation was used to select the optimal kernel parameters of the SVM (Pradhan 2013; Zhuang and Dai 2006).
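To make the roles of γ, d, and r concrete, the four kernels compared in this study (linear, polynomial, sigmoid, and radial basis function) can be sketched in their commonly used forms. The text does not print the kernel equations, so the exact parameterization below is an assumption, as are the function names and the two example feature vectors.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma, r, d):
    # gamma scales the inner product, r is the bias term, d the degree.
    return (gamma * dot(x, y) + r) ** d


def sigmoid_kernel(x, y, gamma, r):
    return math.tanh(gamma * dot(x, y) + r)


def rbf_kernel(x, y, gamma):
    # gamma acts as the kernel width: larger values make the kernel more local.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical, already scaled feature vectors.
x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))
print(polynomial_kernel(x, y, gamma=0.5, r=0.0, d=2))
print(rbf_kernel(x, y, gamma=0.5))
```

Only the polynomial and sigmoid kernels take the bias term r, and only the polynomial kernel takes the degree d, which matches the parameter description above.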

Random forest (RF) model

Random forests (RFs) are very flexible and powerful ensemble classifiers based on decision trees, first developed by Breiman (2001). RF constructs multiple trees based on random bootstrapped samples of the training dataset (Breiman 2001). Each binary tree is grown on a bootstrap sample, that is, a random selection of the training data drawn from the initial dataset; the data not included in that sample are called out of bag (OOB) (Catani et al. 2013). RF estimates the importance of a variable by looking at how much the prediction error increases when the OOB data for that variable are permuted while all others are left unchanged (Liaw and Wiener 2002; Catani et al. 2013). Two RF parameters need to be tuned: the number of trees (ntree) and the number of variables tried at each split (mtry).
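The bootstrap/OOB mechanism and the permutation-based importance measure can be illustrated with a simplified sketch. It is not the randomForest implementation: a fixed toy rule stands in for a fitted tree, the importance is computed on a single data table rather than per tree on its own OOB sample, and all names and data are hypothetical.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; unused rows are out of bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def error_rate(predict, rows, labels):
    wrong = sum(predict(row) != label for row, label in zip(rows, labels))
    return wrong / len(rows)


def permutation_importance(predict, rows, labels, column, rng):
    """Increase in error after shuffling one column, others left unchanged."""
    baseline = error_rate(predict, rows, labels)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [row[:column] + [value] + row[column + 1:]
                for row, value in zip(rows, shuffled)]
    return error_rate(predict, permuted, labels) - baseline


rng = random.Random(0)
in_bag, oob = bootstrap_indices(589, rng)
print(round(len(oob) / 589, 2))  # roughly one third of rows end up out of bag

# Toy data: column 0 determines the label, column 1 is pure noise.
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]


def toy_predict(row):
    """Hypothetical stand-in for a fitted tree."""
    return int(row[0] > 0.5)


informative = permutation_importance(toy_predict, rows, labels, 0, rng)
noise = permutation_importance(toy_predict, rows, labels, 1, rng)
print(informative > noise)
```

Permuting the informative column degrades the predictions, whereas permuting the noise column leaves the error unchanged; this contrast is what the mean decrease in accuracy summarizes.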

Genetic algorithm (GA) model

A genetic algorithm (GA) is a search heuristic from the field of artificial intelligence that mimics the process of natural selection. A GA begins with a population of random candidate solutions encoded as strings (chromosomes). A number of operators are then applied repeatedly until convergence is reached. The optimization strategy of a GA can be described as a global optimization procedure, with the benefit that convergence does not depend on the initial values. Crossover and mutation are applied to produce newer and better populations of chromosomes (Yetilmezsoy and Demirel 2008).

Random forest optimization methods

In this study, we used two different methods for RF parameter optimization: the caret package and a genetic algorithm. Both were applied in the R software.

First, we present a hybrid RFGA model to predict groundwater potential; this model was first introduced by Hasegawa et al. (2013) in the field of commute mode choice analysis. A simple way to tune the RF parameters is trial and error, but there are many combinations of parameters, and many iterations are needed to evaluate the options. Another way to optimize these parameters is to use the caret package. We therefore propose a practical method for optimizing the parameters of RF by meta-heuristic optimization using GAs. The rgenoud package (Mebane and Sekhon 2011) of the R program (R Core Team 2012) was used to optimize the RF parameters ntree and mtry. The input parameters of the RFGA model are subject to the GA-based parameter optimization process, and only the pair of parameters that minimizes the OOB error rate in this step is used as input to the RFGA model. For running RFGA, the maximum number of generations was set to 100, the population size was 300, and the domain of allowable values for each parameter being optimized was 1 to 14 for mtry and 1 to 2000 for ntree. The run time of this process was approximately 2 h 20 min.
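The GA-based search over (mtry, ntree) can be sketched as follows. This is a plain genetic algorithm written for illustration and is not the rgenoud procedure used in the study; the objective toy_oob_error is a hypothetical stand-in for fitting a random forest and reading its OOB error, and the reduced population size and number of generations are chosen only to keep the example fast.

```python
import random

MTRY_RANGE = (1, 14)     # allowable mtry values reported in the text
NTREE_RANGE = (1, 2000)  # allowable ntree values reported in the text


def toy_oob_error(mtry, ntree):
    """Hypothetical objective; a real run would fit an RF and return its OOB error."""
    return 0.30 + 0.002 * (mtry - 2) ** 2 + 1e-7 * (ntree - 1700) ** 2


def clamp(value, bounds):
    return min(max(value, bounds[0]), bounds[1])


def random_individual(rng):
    return (rng.randint(*MTRY_RANGE), rng.randint(*NTREE_RANGE))


def crossover(parent_a, parent_b, rng):
    # Uniform crossover: each gene is copied from one parent at random.
    return tuple(rng.choice(genes) for genes in zip(parent_a, parent_b))


def mutate(individual, rng, rate=0.2):
    mtry, ntree = individual
    if rng.random() < rate:
        mtry = clamp(mtry + rng.choice((-1, 1)), MTRY_RANGE)
    if rng.random() < rate:
        ntree = clamp(ntree + rng.randint(-100, 100), NTREE_RANGE)
    return (mtry, ntree)


def tournament(population, fitness, rng, size=3):
    # Pick the fittest of a few randomly drawn individuals.
    return min(rng.sample(population, size), key=fitness)


def optimise(fitness, pop_size=30, generations=40, seed=1):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: the best individual always survives unchanged.
        children = [min(population, key=fitness)]
        while len(children) < pop_size:
            child = crossover(tournament(population, fitness, rng),
                              tournament(population, fitness, rng), rng)
            children.append(mutate(child, rng))
        population = children
    return min(population, key=fitness)


best = optimise(lambda individual: toy_oob_error(*individual))
print(best, round(toy_oob_error(*best), 4))
```

Because every fitness evaluation in the real setting requires training a forest, the population size of 300 and the 100 generations reported above account for the long run time.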

Validation of groundwater potential maps (GPM)

In the current study, the receiver operating characteristics (ROC) curve was used to determine the performance of the GPMs produced by the implemented models. The area under the ROC curve (AUC) shows the quality of a forecast system by representing its ability to correctly predict the occurrence or non-occurrence of specific "events" (Negnevitsky 2002). The AUC ranges from 0 to 1. The qualitative relationship between AUC and prediction accuracy can be classified as excellent (0.9-1), very good (0.8-0.9), good (0.7-0.8), average (0.6-0.7), and poor (0.5-0.6). To take the discharge of the springs into account in the evaluation process, two weights were assigned to the springs. First, the median of the spring discharge values was calculated. Then, a weight of 2 was assigned to springs with a discharge greater than the median, while the other springs were assigned a weight of 1. Finally, when calculating the ROC values, springs with a weight of 2 were counted twice in the analysis, while the other springs were counted once. This procedure increases the influence of springs with higher discharges in the evaluation process.
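The discharge weighting and the qualitative AUC classes can be sketched as follows. The AUC is computed here as the probability that a spring location receives a higher predicted potential than a non-spring location; the use of non-spring locations as the negative class, the example scores and discharges, and the function names are assumptions made for illustration.

```python
from statistics import median


def auc(positive_scores, negative_scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def discharge_weights(discharges):
    """Weight 2 for springs above the median discharge, weight 1 otherwise."""
    threshold = median(discharges)
    return [2 if q > threshold else 1 for q in discharges]


def weighted_auc(spring_scores, discharges, non_spring_scores):
    # A spring with weight 2 is simply entered twice in the analysis.
    weights = discharge_weights(discharges)
    repeated = [score for score, w in zip(spring_scores, weights) for _ in range(w)]
    return auc(repeated, non_spring_scores)


def auc_class(value):
    """Qualitative classes from the text; values below 0.6 are all labeled poor here."""
    if value >= 0.9:
        return "excellent"
    if value >= 0.8:
        return "very good"
    if value >= 0.7:
        return "good"
    if value >= 0.6:
        return "average"
    return "poor"


# Hypothetical predicted potentials and discharges (liters per second).
spring_scores = [0.9, 0.4, 0.7, 0.2]
discharges = [5.0, 0.5, 2.0, 0.1]
non_spring_scores = [0.3, 0.5, 0.1]

plain = auc(spring_scores, non_spring_scores)
weighted = weighted_auc(spring_scores, discharges, non_spring_scores)
print(round(plain, 3), round(weighted, 3), auc_class(weighted))
```

In this toy case the two high-discharge springs are also the best predicted ones, so counting them twice raises the AUC; a model that misses high-discharge springs would be penalized in the same way.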

Results

Support vector machine

In the current study, four kernels of the SVM model, namely linear (LN), polynomial (PL), sigmoid (SIG), and radial basis function (RBF), were optimized by cross-validation, and the GPMs were plotted in ArcGIS 9.3. Based on the results, the best SVM with the LN kernel had a cost value of 0.001. For the PL kernel, gamma = 0.5, cost = 0.1, and degree = 2 gave the best performance. For SVM-SIG, the best performance was obtained with gamma = 1 and cost = 0.01. For SVM-RBF, gamma = 0.5 and cost = 10 gave the best performance.

The resultant GPMs produced using different kernels of the SVM are represented in Fig. 5 and Table 2. According to the results, low, moderate, high, and very high classes in GPM produced by SVM-LN occupy 15.88, 36.05, 33.75, and 14.32% of the study area, respectively. Low, moderate, high, and very high classes in SVM-PL cover 3.38, 22.12, 47.52, and 26.98% of the study area, respectively. In the case of SVM-SIG, 22.87, 32.98, 30.50, and 13.64% of the study area were designated to the low, moderate, high, and very high classes, respectively. The results of SVM-RBF showed that low, moderate, high, and very high classes cover 22.01, 45.85, 22.39, and 9.74% of the study area, respectively.

Random forest (RF), and genetic algorithm optimized random forest (RFGA)

As mentioned in the methods section, two methods were used to optimize the RF model: caret and a genetic algorithm. The final model by RF-caret had ntree = 1600 and mtry = 2, while the final model by RFGA had ntree = 1744 and mtry = 2. The out-of-bag error for RFGA (0.316) was lower than that for RF-caret (0.35). The ROC analysis also showed better performance of RFGA than RF-caret, with AUC values of 86.5 and 85.6, respectively. Given the better performance of the RFGA model, its results on the importance of the effective factors and the final GPM are presented, and the results of RF-caret are not reported further.

Figure 4 shows the mean decrease in accuracy and the mean decrease in Gini obtained by RFGA. According to the mean decrease in accuracy, altitude had the highest importance, followed by TWI, slope angle, and aspect, while profile curvature and plan curvature had the lowest importance. The mean decrease in Gini, on the other hand, indicated that land use and lithology were the least important factors in groundwater potential mapping. The GPM produced using RFGA is shown in Fig. 5. According to the results, the low, moderate, high, and very high classes in the GPM produced by RFGA occupy 27.2, 32.4, 25.5, and 14.8% of the study area, respectively.

Validation of the GPMs

The ROC was calculated for all GPMs using the spring validation dataset. The AUC-ROC results are shown in Fig. 6. The AUC-ROC of the GPMs produced by the methods implemented in the current study ranges from 76.9 to 85.5%. The AUC-ROC values for RF and RFGA were estimated as 84.6 and 85.5%, respectively. The AUC-ROC values for SVM-LN, SVM-PL, SVM-SIG, and SVM-RBF were estimated as 79.3, 77, 77.7, and 76.9%, respectively.

Discussion

In this section, the results are discussed in three parts: (i) the performance of the models, (ii) the importance of the effective factors, and (iii) the precision of the GPMs.

The performance of the models:

The results showed that RFGA performed better than RF-caret. One advantage of GA is its capability to solve any optimization problem that can be expressed with the chromosome approach; another important characteristic is its capability to handle search spaces with multiple solutions and to solve the problem in such an environment (Tabassum and Mathew 2014). These advantages may explain the better performance of RFGA in the current study.

Both RFs (i.e. RF-caret and RFGA) also performed better than the different kernels of the SVM model. Among the SVM kernels, SVM-LN had the best performance, followed by SVM-SIG, SVM-RBF, and SVM-PL; however, their performance was similar. Based on the results, SVM can be used as an efficient machine learning model in groundwater potential mapping. One drawback of the SVM is the time needed for the analysis. In addition, several criteria should be tested in order to find the optimum values for the modeling process (Tehrany et al. 2015). However, the efficiency of the SVM could be increased by building ensemble models. In one study, Tehrany et al. (2015) used an ensemble of weights of evidence and SVM in flood mapping, and their results demonstrated the efficiency and strength of the ensemble method over the individual methods. There are several potential sources of error in the datasets used for groundwater modeling, including measurement errors, limitations in field data collection, and sampling bias. These errors could affect the overall accuracy of the SVM models (Moisen et al. 2006).

The importance of effective factors in groundwater potential mapping

The importance of the effective factors was determined using RFGA, the best model in the current study. Overall, altitude, TWI, slope angle, and slope aspect were the most effective factors for groundwater potential, whereas plan curvature, profile curvature, land use, and lithology were the least effective. A growing body of literature investigates the importance of different effective factors in groundwater potential mapping (Naghibi and Pourghasemi 2015; Rahmati et al. 2016). Naghibi and Pourghasemi (2015) found that altitude, distance from faults, SPI, and fault density had the highest importance in groundwater potential mapping. In another study, Rahmati et al. (2016) reported that altitude, drainage density, lithology, and land use were the most influential factors for groundwater potential. Comparing the results of the current study with those of these two studies shows that the importance of effective factors in groundwater potential mapping depends on the indicator, the methods, and the hydrological, geological, and climatic conditions of the target area.

The precision of the GPMs:

Assuming that a better model is one that delineates the high and very high classes more precisely, a model that assigns a smaller percentage of the area to the high and very high classes could be more helpful in water resources planning and management. A more precise GPM could help water resources managers make better and more accurate decisions about areas for exploitation and even about water conservation techniques. According to the results, the SVM-RBF and RFGA models had the lowest percentages of the high and very high classes, with 32.1 and 40.3% of the study area, respectively.
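The combined share of the high and very high classes follows directly from the class areas reported in the Results section. The short sketch below recomputes it for each model; the dictionary layout and variable names are illustrative only.

```python
# Area (% of study area) of the low, moderate, high, and very high classes,
# as reported in the Results section.
class_area_percent = {
    "SVM-LN": (15.88, 36.05, 33.75, 14.32),
    "SVM-PL": (3.38, 22.12, 47.52, 26.98),
    "SVM-SIG": (22.87, 32.98, 30.50, 13.64),
    "SVM-RBF": (22.01, 45.85, 22.39, 9.74),
    "RFGA": (27.2, 32.4, 25.5, 14.8),
}

# Sum the last two classes (high + very high) and round to one decimal.
high_plus_very_high = {
    model: round(areas[2] + areas[3], 1)
    for model, areas in class_area_percent.items()
}

# Rank models from the smallest to the largest high + very high share.
ranked = sorted(high_plus_very_high, key=high_plus_very_high.get)
for model in ranked:
    print(model, high_plus_very_high[model])
```

The ranking places SVM-RBF and RFGA first, consistent with the 32.1 and 40.3% quoted above, while SVM-PL assigns about three quarters of the area to the two highest classes.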

Conclusion

In general, the water crisis in the 21st century is much more related to management and planning than to a real crisis of scarcity and drought stress. Lack of knowledge of water resources and inappropriate water resources management plans and strategies have made the water crisis worse in arid and semi-arid regions. Therefore, the first step in the appropriate planning of water resources is to gain knowledge of these vital resources. Groundwater is one of the most important water supplies, especially in arid and semi-arid countries with an extreme lack of water, growing populations, and successive droughts. Considering these problems, the current study evaluated the performance of different kernels of the SVM model and two strategies for the optimization of RF (i.e. caret and GA). The results showed that RFGA had the best performance, followed by SVM-LN, SVM-SIG, SVM-RBF, and SVM-PL. The RFGA was successfully implemented in the current study. Different kernels of the SVM were also used to produce GPMs with acceptable performance; however, their results were not as good as those of the RF models. Furthermore, altitude, TWI, slope angle, and slope aspect were the most effective factors in groundwater potential assessment. The methodology developed in the current study could be transferred to and tested in other areas for producing GPMs. As a final conclusion, GPMs could significantly help water resources managers and planners to better understand water resources conditions and to develop exploitation and conservation plans.
