Comparative study of extraction features and regression algorithms for predicting drought rates

ABSTRACT


INTRODUCTION
In Indonesia, drought in rice fields can occur every year because of El Nino and can significantly affect the agricultural sector in several regions [1]. Measuring the resilience of farmers and communities facing drought, and identifying the factors that influence it, yields indicators from which policy implications can be drawn; a livelihoods approach is also applied to identify the determining factors. Farmers' resilience to drought can be strengthened by easier access to credit, easy equipment rental, and technical efficiency in rice production [2]. Drought can significantly reduce crop yields, and the reduced production leads to higher prices for consumers [3]. It also raises production costs, which can affect the economic sector [4]. Drought on agricultural land can significantly affect the economy, politics, and technology, especially at high severity, when it causes enormous losses [5]. An adequate irrigation system is required throughout the rice-growing season, but drought can occur at any time. Climate change is currently altering rainfall patterns from year to year and from region to region [6].
In this research, the moisture content in the vegetative and generative phases is compared and used for prediction in the ripening phase, because the water content differs in each phase, greatly affects rice growth, and ultimately affects grain production. In the vegetative phase, the plant tillers actively, plant height increases gradually, and leaves emerge at regular intervals. The reproductive phase is characterized by stem elongation, a decrease in the number of tillers, booting, the appearance of the flag leaf, heading, and flowering. The reproductive phase lasts about 30 days on average in most cultivars. Its initial stage, internode elongation or jointing, varies slightly with cultivar and weather conditions. During ripening, grain size and weight increase as starch and sugars are supplied from the leaf sheaths and stems. The grain turns golden, and the rice leaves begin to age [7]. Drought can lower the quality of grain yield per clump, with effects seen especially in chlorophyll content and the chlorophyll a/b ratio, together with increased accumulation of proline and total sugars [8].

TELKOMNIKA Telecommun Comput El Control. Comparative study of extraction features and regression algorithms for … (Irza Hartiantio Rahmana)
Remote sensing is the observation of an object from a distance using a sensing device [9]. Sentinel-2 carries 13 spectral bands and has an orbital swath width of 290 km. Each satellite in the Sentinel-2 constellation has a repeat cycle of 10 days, and with both satellites fully operational a 5-day revisit is achieved at the equator [10]. In the literature, Sentinel-2A vegetation index values have been used to distinguish five classes across the three main periods: land preparation, early vegetative, late vegetative, generative, and harvest/ripening [11]. Sentinel-2 imagery can therefore be used to classify rice field cover [12]. Image processing is supported by Google Earth Engine, which makes it possible to work with cloud-free imagery. The imagery comes from the Sentinel-2A satellite because it is easily accessible through Google Earth Engine, costs nothing, and revisits the same location every ten days [13], so monitoring can be done faster.
Drought has threatened some areas of Kebumen Regency, and farmers, in collaboration with the district government, have started looking for alternative water sources that can be delivered with a pumping system. A systematic and efficient effort is therefore needed to reduce the risk of loss from this threat, here by using the normalized difference water index (NDWI) and the normalized difference vegetation index (NDVI). NDVI can be computed from specific cloud-free images, which are not always available for a given period when relying on sensors with a high spatial resolution. NDWI is a remote sensing index that is sensitive to changes in water content [13]. Combining near-infrared (NIR) and short wave infrared (SWIR) removes variations caused by the internal structure and dry matter content of the leaves, which improves the accuracy of vegetation moisture retrieval [14].
In 1995, Tin Kam Ho proposed the random forest (RF) in a study entitled random decision forest [15]; in 2001 it was further developed by Leo Breiman and subsequently patented [16]. The random forest regression algorithm is an ensemble learning method that combines many regression trees. A regression tree can be represented as a set of conditions arranged hierarchically and applied in sequence from the root to the leaves of the tree [17]. The logistic regression (LR) algorithm is a mathematical modeling approach that describes the relationship between several variables. So far, logistic regression has been the most widely used modeling procedure for epidemiological data analysis [18]. A random forest consists of trees grown with user-specified values, and for numerical predictors the result is obtained by averaging the tree results. The random forest predictor is formed by taking the average over k trees [19], whereas logistic regression describes the relationship of multiple Xs to a dichotomous dependent variable [18]. The contribution of this research is to compare the NDVI and NDWI extraction features and to compare random forest regression with logistic regression for predicting drought in the ripening phase using the Sentinel-2 satellite.

PREVIOUS RESEARCH
Previous results can be summarized as follows. NDVI applied to remote sensing data is widely used in research to determine vegetation density. One study describes rice phenology using Sentinel-2A imagery and NDVI to determine the beginning and end of the rice planting period, which makes it easier to monitor rice field conditions and improves predictions of plant size within a short time [12]. Another study uses NDVI to estimate rice productivity based on NDVI wave characteristics and a regression between NDVI and rice productivity [20]. A subsequent study aims to: i) develop a phenology-based Landsat scheme to identify paddy fields during two phenological phases (flooding/transplanting and ripening) at a regional scale; and ii) systematically evaluate the accuracy and resulting uncertainty of the Landsat-based rice field map [21].
With Landsat 8, NDVI has been used to map various irrigated crops in highly fragmented, small-sized, and heterogeneous agricultural landscapes [22]. NDVI has also been used with variograms of Landsat 8 operational land imager (OLI) time series of NDVI, NIR, and red images to model the spatial heterogeneity of agricultural land at various stages of growth [23]. Among the related studies, five use the NDWI extraction feature. The first used NDWI to monitor drought [24]. Another used NDWI to map vegetation moisture content [25]. NDWI has also been used to detect changes in surface water [26], to evaluate vegetation cover types [27], and to monitor drought in vegetation [28]. In addition, this research uses the random forest regression algorithm to predict drought on land. NDVI and NDWI need to be compared to this

RESEARCH METHOD

Research stages
This section describes the approach used to answer the research objectives. Figure 1 outlines the stages, starting with the identification of rice fields in Kebumen, Central Java. Sentinel-2A data were then collected and pre-processed, beginning with atmospheric correction to obtain Sentinel-2 reflectance, followed by a sampling strategy on the rice fields with a zoning size of 160×154 for sample acquisition. The next stage is feature extraction using NDVI and NDWI to obtain water indications and compare drought conditions. The extracted features are then modeled using random forest regression and logistic regression. Finally, the models are evaluated using root mean square error (RMSE) and out of bag (OOB) to assess accuracy, which yields a comparison of the predictions and, in the end, a comparative analysis.

Research area and data collection
The research location is a rice field area in Kebumen Regency, Central Java Province, the largest rice-producing area with 2174 hectares. Geographically, the area of interest (AOI) lies at coordinates 109.699456004, 109.745133512, -7.772345033, -7.728641145 [EPSG: 4326]. The data were taken from Sentinel-2 imagery acquired from 1 March 2020 to 15 September 2020. From that period, only imagery without cloud cover over the area was used so that the land is visible. The imagery obtained for each period goes through the pre-processing stage, in which it is clipped to the land used as the research site. Data from June 2020 and August 2020 were not used because 80% of the research object was covered by clouds, whereas this research only uses images with a cloud cover of at most 10%.
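The scene selection described above (study period plus a cloud tolerance of 10%) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Earth Engine workflow: the `Scene` record, the `select_scenes` helper, and the cloud percentages are hypothetical, and the reading of the four coordinates as west, east, south, north is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Scene:
    """One acquisition; a hypothetical record, not a real catalogue API."""
    acquired: date
    cloud_percent: float  # cloud cover over the AOI, 0-100


# Bounding box quoted in the text (EPSG:4326), assumed to be
# ordered as west, east, south, north.
AOI = (109.699456004, 109.745133512, -7.772345033, -7.728641145)


def select_scenes(scenes, start, end, max_cloud=10.0):
    """Keep scenes inside [start, end] with cloud cover <= max_cloud."""
    return [
        s for s in scenes
        if start <= s.acquired <= end and s.cloud_percent <= max_cloud
    ]


# Made-up cloud percentages, only to exercise the filter.
scenes = [
    Scene(date(2020, 3, 12), 4.0),
    Scene(date(2020, 6, 10), 80.0),  # rejected: too cloudy
    Scene(date(2020, 7, 5), 9.5),
    Scene(date(2020, 10, 1), 1.0),   # rejected: outside the study period
]
kept = select_scenes(scenes, date(2020, 3, 1), date(2020, 9, 15))
print([s.acquired.isoformat() for s in kept])
```

Keeping the threshold as a parameter makes it easy to check how many acquisitions survive under a stricter or looser cloud tolerance.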
Figure 2 shows the research location in Kebumen, Central Java. The soils of Kebumen Regency comprise alluvial soil, latosol soil, podsolic soil, regosol soil, gray glei humus, alluvial associations, and litosol and brown Mediterranean associations. This land potential indicates that some areas are fertile enough to be used as agricultural land. However, several sub-districts such as Sempor, Karanganyam, Sadang and Alian have soil characteristics that are less suitable for agricultural use [30].

Pre-processing
Pre-processing is the stage in which the data are prepared before they are processed. Raster data usually cover an area much larger than the area to be studied, so it is essential to cut the data down to the research area, a step commonly known as clipping [31]. The area of interest defines the clipped region and can be specified by points or shapes based on coordinates; the clipping procedure follows the shape of the defined area. These steps are carried out in Google Earth Engine: atmospheric correction, followed by Sentinel-2 reflectance, after which a sampling strategy is applied to the fields. After clipping, the array became smaller, namely 160×54 for sample acquisition. The image was taken from medium-resolution satellite imagery because the land at the research site was covered by thin clouds; this gives better results and minimizes noise arising from an imbalanced dataset, which increases accuracy by reducing errors, especially for predictive models. One challenge is developing a general auto-exposure solution that covers a wide range of imaging sensors [32] with a fast and powerful auto-exposure algorithm for the camera [33].
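The clipping step can be illustrated with a small, self-contained sketch. It assumes a north-up grid with square pixels whose size is given in degrees; the `clip_to_bbox` helper and the toy grid are hypothetical and stand in for the clipping that Google Earth Engine performs in the actual workflow.

```python
def clip_to_bbox(grid, origin_lon, origin_lat, pixel_deg, bbox):
    """Return the sub-grid whose pixel centres fall inside bbox.

    grid       : list of rows, row 0 is the northernmost
    origin_lon : longitude of the west edge of the grid
    origin_lat : latitude of the north edge of the grid
    pixel_deg  : pixel size in degrees (square pixels assumed)
    bbox       : (west, east, south, north)
    """
    west, east, south, north = bbox
    clipped = []
    for r, row in enumerate(grid):
        lat = origin_lat - (r + 0.5) * pixel_deg  # centre latitude of row r
        if not (south <= lat <= north):
            continue
        kept = [
            value for c, value in enumerate(row)
            if west <= origin_lon + (c + 0.5) * pixel_deg <= east
        ]
        if kept:
            clipped.append(kept)
    return clipped


# Toy 4x4 grid with 1-degree pixels; each value encodes row*10 + column.
grid = [[r * 10 + c for c in range(4)] for r in range(4)]
sub = clip_to_bbox(grid, origin_lon=0.0, origin_lat=4.0, pixel_deg=1.0,
                   bbox=(1.0, 3.0, 1.0, 3.0))
print(sub)  # the central 2x2 block
```

The same idea explains why the array shrinks after clipping: only the rows and columns whose pixel centres lie inside the area of interest are retained.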

Extraction feature

NDVI
NDVI is a vegetation measure that helps determine vegetation density and the level of plant health. It is also used to measure the greenness of vegetation. Because NDVI is sensitive to photosynthetic activity by chlorophyll, its value can be used to classify vegetation. NDVI is obtained from the red (RED) and NIR bands [34]:

\[ \mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{RED}}{\mathrm{NIR} + \mathrm{RED}} \tag{1} \]

In (1), NDVI is calculated from band 4 (RED) and band 8 (NIR, 10 m resolution) or band 8A (NIR, 20 m resolution) of Sentinel-2A [35]. NDVI is also commonly used in drought monitoring, agricultural production forecasting, forecasting of fire-prone zones, and mapping of desert encroachment all over the world. The amount of historical data available can affect the forecasting results [36]. Because it compensates for changes in lighting conditions, surface slope, exposure, and other external factors, NDVI is becoming more commonly used in global vegetation monitoring.
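As a minimal sketch of this feature, NDVI can be computed per pixel as (NIR - RED) / (NIR + RED). The helper names `ndvi` and `ndvi_grid` and the reflectance values below are illustrative assumptions; returning `None` where both bands are zero is a design choice of this sketch, not something the paper specifies.

```python
def ndvi(nir, red):
    """NDVI = (NIR - RED) / (NIR + RED) for one pixel; None if undefined."""
    total = nir + red
    if total == 0:
        return None  # no signal in either band, so the ratio is undefined
    return (nir - red) / total


def ndvi_grid(nir_band, red_band):
    """Apply ndvi() pixel by pixel to two equally sized 2-D grids."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Invented reflectances standing in for band 8 (NIR) and band 4 (RED).
nir_band = [[0.50, 0.25], [0.30, 0.00]]
red_band = [[0.10, 0.25], [0.10, 0.00]]
result = ndvi_grid(nir_band, red_band)
print(result)
```

Dense green vegetation reflects strongly in NIR and weakly in RED, so its NDVI moves toward 1, while equal reflectance in both bands gives 0.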

NDWI
The NDWI method, which combines NIR and SWIR, is used to determine water status. These two bands are combined because both lie in a region of high reflectance and sense deep into the vegetation canopy [37]. NDWI can effectively enhance water information in most cases:

\[ \mathrm{NDWI} = \frac{\mathrm{NIR} - \mathrm{SWIR}}{\mathrm{NIR} + \mathrm{SWIR}} \tag{2} \]

In (2), the NIR band is band 8A and the SWIR band is band 11 of Sentinel-2A; the RED band is band 4 [37].
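The NIR and SWIR combination can be sketched in the same way, as (NIR - SWIR) / (NIR + SWIR) with band 8A as NIR and band 11 as SWIR. The dictionary layout keyed by band name, the helper names, and the reflectance values are assumptions made for illustration only.

```python
def normalized_difference(a, b):
    """Generic (a - b) / (a + b); None when the denominator is zero."""
    total = a + b
    return None if total == 0 else (a - b) / total


def ndwi(pixel):
    """NDWI for one pixel given as a dict of band reflectances.

    Band 8A is taken as NIR and band 11 as SWIR, as stated in the text.
    """
    return normalized_difference(pixel["B8A"], pixel["B11"])


# Invented reflectances: a moist canopy absorbs more SWIR than a dry one.
wet = {"B8A": 0.40, "B11": 0.10}
dry = {"B8A": 0.30, "B11": 0.30}
print(ndwi(wet), ndwi(dry))
```

Because liquid water absorbs SWIR, a wetter canopy lowers the band 11 reflectance and pushes the index upward, which is why the index responds to drought.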

Modeling and evaluation prediction
The random forest regression algorithm is an ensemble learning algorithm that combines many regression trees. A regression tree is a set of conditions arranged hierarchically and applied sequentially from the root of the tree to its leaves [38]-[40]. Random forest is used here as the solution to this regression problem. The random forest method belongs to the family of decision tree methods. A decision tree is a tree-shaped flowchart with a root node that receives the data, inner nodes below the root node that contain questions about the data, and leaf nodes that resolve the problem and give the decision. A random forest consists of a collection of such decision trees, as in (3) [41]:

\[ \{\, h(x, \Theta_k),\ k = 1, 2, \ldots, K \,\} \tag{3} \]
In (3), \(\Theta_k\) is an independently distributed random variable, \(x\) is the input variable, and \(K\) is the total number of regression decision trees. Each tree is generated from a sample extracted at random during the training process. The estimate obtained from the samples that were not selected is referred to as the out-of-bag (OOB) result [41]. For regression, random forest constructs \(K\) regression trees and averages the results. After \(K\) such trees are grown, the random forest regression predictor is given by (4) [40]:

\[ \hat{f}(x) = \frac{1}{K} \sum_{k=1}^{K} h(x, \Theta_k) \tag{4} \]

In (4), \(x\) is the input variable, \(k\) is the tree index (1, 2, 3, …), and \(K\) is the total number of trees in the random forest (the size of the random forest) [40]. The prediction results from the previous stage are then evaluated using RMSE to calculate the prediction error [42]. RMSE has been used as a primary statistical metric to measure model performance in meteorology, air quality, and climate science. Although RMSE and alternative metrics have been used to evaluate model performance for many years, there is no agreement on the most suitable metric for model errors.
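The averaging predictor and the OOB idea can be illustrated with a deliberately small ensemble. The sketch below bags one-split regression trees (stumps) on a single feature; it is not a full random forest (there is no random feature selection and the trees have depth one), and the function names and synthetic data are assumptions.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature."""
    best = None
    # Candidate thresholds: every distinct x except the smallest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all x identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Grow n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    trees, in_bag = [], []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
        in_bag.append(set(idx))
    return trees, in_bag


def predict(trees, x):
    """Ensemble prediction: the average of the individual tree outputs."""
    return sum(h(x) for h in trees) / len(trees)


def oob_predictions(trees, in_bag, xs):
    """For each sample, average only the trees that did not train on it."""
    preds = []
    for i, x in enumerate(xs):
        votes = [h(x) for h, bag in zip(trees, in_bag) if i not in bag]
        preds.append(sum(votes) / len(votes) if votes else None)
    return preds


# Synthetic step-shaped target, used only to exercise the code.
xs = [i / 10 for i in range(20)]
ys = [0.2 if x < 1.0 else 0.8 for x in xs]
trees, in_bag = fit_forest(xs, ys)
oob = oob_predictions(trees, in_bag, xs)
print(predict(trees, 0.3), predict(trees, 1.7))
```

Because every sample is left out of roughly a third of the bootstrap draws, the OOB predictions give an error estimate without setting aside a separate validation set.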
For simplicity, assume that we already have \(n\) samples of model errors, calculated as \((e_i,\ i = 1, 2, \ldots, n)\). Uncertainties arising from observation errors or from the methods used to compare models and observations are not considered in this research [43]. OOB data are the data not used to grow a tree, and they serve as out-of-sample data for cross-validation purposes. It is convenient to define an indicator that shows whether a case is in the bag or OOB [44].
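Given n model errors, the usual definition of RMSE is the square root of the mean of the squared errors. The short sketch below applies that definition; the function name and the sample values are assumptions chosen so the result is easy to verify by hand.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - o for o, p in zip(observed, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Every prediction is off by exactly 1 in magnitude, so the RMSE is 1.
observed = [0.0, 0.0, 0.0, 0.0]
predicted = [1.0, -1.0, 1.0, -1.0]
print(rmse(observed, predicted))
```

A lower RMSE means the predictions lie closer to the observations, and a value of 0 means they match exactly.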
This research also uses the logistic regression (LR) algorithm, which models the natural logarithm of the odds as a regression function of the predictors, for comparison with random forest. Logistic regression is an approach to building predictive models, like linear regression, which is commonly referred to as ordinary least squares (OLS) regression. The difference is that in logistic regression the researcher predicts a dependent variable on a dichotomous scale. With one predictor, \(X\), the model takes the form of (5) [45]:

\[ \ln\!\left( \frac{P(Y = 1)}{1 - P(Y = 1)} \right) = \beta_0 + \beta_1 X \tag{5} \]
In (5), ln stands for the natural logarithm, \(Y\) is the outcome, with \(Y = 1\) when the event occurs (versus \(Y = 0\) when it does not), \(\beta_0\) is the intercept term, and \(\beta_1\) is the regression coefficient, that is, the change in the log odds of the event for a 1-unit change in the predictor \(X\) [45]. OLS requires the assumption that residual errors are normally distributed. Logistic regression does not need this assumption because it follows the logistic distribution. If the dependent variable has more than two categories, the appropriate model is multinomial logistic regression.
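The role of the intercept and the regression coefficient can be made concrete with a one-predictor sketch: the log odds are linear in the predictor, and the probability is recovered with the logistic function. The coefficient values below are hypothetical and are not fitted to the paper's data.

```python
import math


def log_odds(x, b0, b1):
    """Linear predictor: the natural log of the odds that the event occurs."""
    return b0 + b1 * x


def probability(x, b0, b1):
    """Invert the log odds with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(x, b0, b1)))


# Hypothetical intercept and slope, for illustration only.
b0, b1 = -2.0, 4.0

p_mid = probability(0.5, b0, b1)    # log odds are 0 here
p_high = probability(1.0, b0, b1)   # log odds are 2 here
step = log_odds(1.5, b0, b1) - log_odds(0.5, b0, b1)
print(p_mid, p_high, step)
```

A one-unit increase in the predictor always adds the slope to the log odds, whatever the starting value, whereas the change in probability depends on where on the curve the starting point lies.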

RESULT AND DISCUSSION
Based on the NDVI visualization shown in Figure 3, drought occurred in March 2020 in the area of interest (AOI) in Kebumen, Central Java. The results of the pre-processing carried out with the NDVI and NDWI indexes are shown in Figure 3.
Figure 3 shows the pre-processing results for March, April, May, July, and September 2020, clipped to the research location. It is divided into two parts: NDVI in Figure 3(a) and NDWI in Figure 3(b). The figure uses band 4, band 8A, and band 11 for feature extraction and excludes cloud-covered data to obtain better values.
Figure 3(a) shows the NDVI results. NDVI is divided into six classes: non-vegetation, lowest dense, lower dense, dense, higher dense, and highest dense. Vegetation has the potential to store biomass and carbon, so its presence indicates the size of the carbon and biomass stocks [46]. The NDVI color scale tends to be less sensitive for detecting water content.
Figure 3(b) shows the NDWI results. NDWI uses the same categories as Figure 3(a) so that the pre-processing results can be compared side by side to monitor the tested land. The comparison between NDVI and NDWI derived from the visualization is shown in Figure 4.
The vegetation index results are based on the NDVI index with a range of 0 to 1. This index describes the greenness of a plant. The vegetation index is a mathematical combination of the red band and the NIR band that indicates the presence and condition of vegetation. In this case, the index range is used to determine the moisture content at the tested location, which is then plotted to show the actual values obtained from data processing, as seen in Figure 4.
Figure 4 compares the NDVI and NDWI index values. The higher the water content, the closer the extracted feature value is to 1; the lower the water content, the closer it is to 0. The results indicate that NDVI is better than NDWI at predicting the level of dryness in rice fields. According to Table 1, NDVI is divided into six classes: non-vegetation, lowest dense, lower dense, dense, higher dense, and highest dense [46].
After the index values are obtained, they are evaluated statistically to monitor drought, as shown in Table 2. The table gives the evaluation results using RMSE to compare errors and detect noise in the dataset, and then RF (OOB) and LR to see the percentage of predictions. The scaling factor cannot change adaptively after training, but it can learn model patterns and averages from the training set [47].


The data were split into training and testing sets of 80% and 20% to guide the modeling toward a better local optimum [48]. For NDVI, the average RF (OOB) value is 0.988 with an RMSE of 0.018, while the average LR value is 0.952 with an RMSE of 0.346. For NDWI, the average RF (OOB) value is 0.99 with an RMSE of 0.012, while the average LR value is 0.946 with an RMSE of 0.336. Based on these results, the prediction evaluation for NDVI is judged better than for NDWI. From the vegetation index results and the algorithms applied, NDVI performs better at high vegetation levels, shown in blue. Furthermore, the results indicate that the average RF and LR values are higher when the index is high. With RF, the RMSE is 0.018 for NDVI against 0.012 for NDWI, so the two indices are close under that model; the advantage of NDVI appears in the average LR value (0.952 against 0.946). In Figure 4, NDVI decreased from March 2020 to September 2020, while NDWI also decreased over the same period, with only a slight increase in July 2020. The vegetation levels are explained further in Table 1.
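The 80% and 20% division of the data can be sketched as a seeded shuffle followed by a cut. The helper name, the seed, and the stand-in sample list are assumptions; the paper does not state how the split was randomized.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of samples and split it into (train, test)."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the extracted index values: 100 sample identifiers.
samples = list(range(100))
train, test = train_test_split(samples)
print(len(train), len(test))
```

Fixing the seed means that both models, and both indices, are evaluated on exactly the same held-out samples, which keeps the comparison fair.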

CONCLUSION
In this research, it is concluded that the NDVI extraction feature is better than the NDWI extraction feature for predicting drought. Drought prediction is carried out by computing the feature extraction values from Sentinel-2 satellite image data. The extracted features are then processed using the random forest regression and logistic regression algorithms to predict drought in rice fields, and the results are evaluated using RMSE, RF (OOB), and LR accuracy. NDVI has an average RF (OOB) value of 0.988 with an RMSE of 0.018 and an average LR value of 0.952 with an RMSE of 0.346, while NDWI has an average RF (OOB) value of 0.99 with an RMSE of 0.012 and an average LR value of 0.946 with an RMSE of 0.336. Based on these results, the evaluation of NDVI is better than that of NDWI. Further research should compare other extraction features, such as the enhanced vegetation index (EVI), NDMI, the soil adjusted vegetation index (SAVI), and other features related to the greenness of vegetation. To strengthen the prediction results, further evaluation is also needed using the explained variance score (EVS), R squared (\(R^2\)), mean squared error (MSE), and mean absolute error (MAE).

Table 2. Evaluation prediction of NDVI and NDWI