TELKOMNIKA Telecommunication Computing Electronics and Control

Received Dec 06, 2021; Revised Dec 12, 2021; Accepted Dec 20, 2021

Unsupervised machine learning is a widely accepted approach to broad data analytics challenges that involve identifying hidden patterns, unknown associations, and other significant structure in large, dispersed datasets. Accurate yield estimation for the various crops involved in planning is a critical problem in agricultural planning, and data mining techniques are an essential approach to achieving realistic and effective solutions to it. In this paper, a distplot combined with a kernel density estimate (KDE) is applied to visualize the probability density of a large, distributed crop dataset for crop planning. This paper focuses on analyzing and segmenting agricultural data and determining the optimal parameters to maximize crop yield using data mining techniques such as K-means clustering and principal component analysis (PCA).


INTRODUCTION
India's agricultural history goes back to the Indus Valley Civilization. Agriculture and other related operations in India contribute 17-18% of the Gross Domestic Product, which has a significant effect on the Indian economy. Agriculture plays an important part in India's social and economic system and is the largest economic segment in terms of demographics [1]. Crop output prediction can help the government build crop insurance policies and supply chain operation policies using big data analysis [2]. It can also help farmers by supplying them with predictions based on past crop yield records, which reduces risk [3].
The amount of data is rising exponentially, while the speed at which it can be processed is not keeping pace. Instances of such large data include crop production, the area under cultivation, and crop yield. Since the government systematically and continuously gathers data on crop production and yield, the dataset is of big-data scale: real-world data that is very difficult to interpret [4]. Statistical methods and data mining can be extended to distributed and parallel computing platforms to analyze big data, but they often consume huge processing time and storage volume to accommodate vast data sets [5]. Data mining techniques play a crucial role in data analysis. Data mining is an interdisciplinary subfield of computer science and analytics with the overall target of identifying trends, patterns, and associations within broad data sets, drawing on strategies at the intersection of machine learning, database systems, and statistics [6]. Data mining utilizes specialized statistical algorithms with the ultimate purpose of segmenting the data and converting the information into an understandable framework to determine the possibility of future events [7]. There are two kinds of learning approaches in data mining: unsupervised (clustering) and supervised (classification) [8]. Clustering is the practice of evaluating a list of "data points" and sorting them according to a distance calculation into separate "clusters" [9]. When grouping these data points, the goal is for data points in the same cluster to be a small distance from each other, whereas data points in separate clusters should be a large distance from each other [10]. Data is grouped into well-formed classes through cluster analysis. The natural structure of the data can be captured by well-formed clusters [11].
This paper aims to lessen the manual work of applying data mining algorithms by using different python modules. It uses python-based libraries and tools (numpy, pandas, seaborn, and scikit-learn's K-means and principal component analysis (PCA) implementations), along with their functions and methods, to quickly analyze, mine, and visualize the agriculture dataset. The dataset is visualized using a distplot combined with a kernel density estimate (KDE) plot. The K-means clustering technique is used in the current work to form clusters from the agricultural dataset. Compared to other clustering algorithms, the K-means algorithm is extremely simple to implement and is also very efficient computationally, which may explain its popularity. The clusters obtained are visualized by reducing their dimensions using principal component analysis. The remainder of this paper is organized as follows: section 2 explains the methodology for visualizing and clustering the dataset. Section 3 presents the results and finally, section 4 concludes with some directions for future work.

RESEARCH METHODOLOGY
This paper aims to propose a method to analyze agricultural data using data mining techniques. The agriculture data has been obtained from credible sources. The input dataset consists of data with the following parameters: crop name, production (2006-2011), area (2006-2011), and yield (2006-2011) [12]. In the proposed work, the K-means clustering method is used to cluster crops with similar production, area, and yield amounts [13]. A distplot combined with a kernel density estimate (KDE) is used for visualizing the probability density at different values of a continuous variable of the dataset, which can improve prediction accuracy. Principal component analysis is used for dimensionality reduction of the dataset while keeping the original information largely intact [14]. The optimum parameters for maximum output can be obtained based on this analysis.
Clustering is the process of dividing a dataset into groups such that entities in each cluster are more similar to other entities of that cluster than to those of the other clusters. Clustering can reveal undetected connections in a dataset. In the proposed work, we have used the K-means algorithm to cluster our agricultural data. The K-means algorithm belongs to the prototype-based clustering group. Prototype-based methods seek to describe the data set to be categorized or clustered by a (usually small) set of prototypes, particularly point prototypes, which are simply points in the data space [15]. Each prototype is intended to capture the distribution of a group of data points based on a definition of similarity to the prototype, or closeness to its position, that may be affected by (prototype-specific) size and shape parameters [16]. Our goal is to group the dataset based on similarity in characteristics, which can be accomplished using the K-means algorithm, summarised in the six steps [17] shown in Figure 1.

1. Choose the number of clusters "K".
2. Select K random points to serve as the initial centroids of the clusters.
3. Assign each data point to the nearest centroid; doing so creates "K" clusters.
4. Calculate a new centroid for each cluster.
5. Reassign each data point to the new closest centroid.
6. Go to step 4 and repeat.

Figure 1. Steps for applying K-means clustering

Measuring similarity between objects: similarity is defined as the opposite of distance, and the squared Euclidean distance between two points p and q in m-dimensional space is a commonly used distance for clustering samples with continuous features [18]:

$$d(p, q)^2 = \sum_{i=1}^{m} (p_i - q_i)^2$$
Note that the index i in the preceding equation refers to the i-th dimension (feature column) of the sample points p and q. Based on this Euclidean distance metric, the K-means algorithm can be described as a simple optimization problem: an iterative approach to minimizing the within-cluster sum of squares (WCSS) [19], which is often also called the cluster inertia:

$$WCSS = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \, \lVert x^{(i)} - \mu^{(j)} \rVert^2$$

where $\mu^{(j)}$ is the centroid for cluster j, and $w^{(i,j)}$ equals 1 if the sample $x^{(i)}$ is in cluster j and 0 otherwise. One disadvantage of this clustering algorithm is that the number of clusters k must be specified a priori; an inappropriate choice of k may result in poor clustering performance. For any unsupervised algorithm, determining the optimal number of clusters into which the data may be grouped is a fundamental step. One of the most common methods for evaluating this optimum k value is the elbow method [20]. We now demonstrate the method using the K-means implementation in the sklearn python library.
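To make the six steps and the WCSS definition concrete, the following is a minimal from-scratch sketch (not the paper's own code) of K-means on a small synthetic dataset; the data, function names, and seeds are invented for illustration only:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means following the six steps of Figure 1."""
    rng = np.random.default_rng(seed)
    # Step 2: select K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Steps 3/5: assign every point to its nearest centroid
        # (squared Euclidean distance), which defines the K clusters
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def wcss(X, labels, centroids):
    # Sum over samples of the squared Euclidean distance to the
    # centroid of the assigned cluster (the cluster inertia)
    return sum(((X[i] - centroids[labels[i]]) ** 2).sum() for i in range(len(X)))

# Two well-separated synthetic blobs that K-means should recover
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])

# Several random restarts, keeping the lowest-WCSS solution
# (mirroring the n_init behaviour of sklearn's KMeans)
results = [kmeans(X, k=2, seed=s) for s in range(5)]
labels, centroids = min(results, key=lambda r: wcss(X, r[0], r[1]))
```

With the two blobs five units apart, the restart with the lowest WCSS assigns each blob its own label.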

Creating and visualizing the data
Data visualization is the representation of data values in a pictorial format. Visualizing data helps in attaining a better understanding of it and helps draw sound conclusions from it. Data visualization plays a crucial role in any data analysis [21]. It helps to recognize which variables are important and which variables can influence our prediction model. While preparing any machine learning (ML) model, we must initially discover which characteristics are significant and how they affect the result; this can be done by exploring the data through visualization.
− Python seaborn module: the data visualization modules in Python depend on the Python Matplotlib library. Seaborn is one such data visualization module, providing plotting functions with better efficiency and richer features. With seaborn, data can be presented with different visualizations, and different features can be added to enhance the pictorial representation [22].
− Distplot: a distplot, or distribution plot, demonstrates the variance in the distribution of data. The seaborn distplot can also be combined with a kernel density estimate (KDE) plot to estimate the probability distribution of continuous variables across various data values. A distplot combines two plots: the matplotlib hist function with seaborn's kdeplot().
− KDE plot: a plot that depicts the probability density function of continuous or non-parametric data variables; it can be plotted for a single variable or for multiple variables together [23].
− Heatmaps: one of the important built-in functions for data exploration and visualization in seaborn is the heatmap. Seaborn heatmaps visualize the data and represent it in the form of a summary through colored maps [24]. We have used a heatmap to find correlations in the dataset.
Figure 2 describes the code for creating and visualizing the dataset, which includes 4 blocks representing the code for importing the libraries, loading the dataset, plotting the distplot, and plotting the heatmap respectively.

Finding number of clusters K by elbow method
This is perhaps the best-known method of estimating the optimum number of clusters [25], though it is also a bit naive: measure the within-cluster sum of squares (WCSS) for various values of k, and pick the k beyond which the WCSS stops decreasing rapidly. This appears as an "elbow" in the plot of WCSS versus k. Within-cluster sum of squares sounds somewhat complicated, so let us break it down in Figure 3.
We need to scale the continuous features to give all characteristics equal significance; scikit-learn's standard scaler is used for this. We then initialize K-means for each value of k and use the inertia attribute, which gives the sum of squared distances of the samples to their nearest cluster center. This sum of squared distances tends to zero as k increases: if k is set to its maximum value n (where n is the number of samples), each sample forms its own cluster, and the sum of squared distances equals zero. The code used to plot the sum of squared distances against k is shown in Figure 4. This figure depicts four blocks representing the code for importing the libraries, scaling the dataset, initializing K-means for each value of k, and applying the elbow method, respectively. If the plot looks like an arm, the ideal k is at the elbow of the arm. Let us implement this in Python using the sklearn library and our features, calculating the WCSS for several values of k.

WCSS
− The WCSS score is the sum of the squared errors over all points.
− The squared error for each point is the square of the distance of the point from its representative, i.e., its assigned cluster center.
− Any distance metric, such as the Euclidean distance or the Manhattan distance, can be used.
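The elbow procedure of Figures 3 and 4 can be sketched as follows; since the paper's actual data and code are not reproduced, the features here are a random stand-in and the k range (1 to 10) is an assumption:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render plots off-screen
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for the crop features (real data not reproduced here)
rng = np.random.default_rng(1)
X = rng.normal(size=(54, 3)) * np.array([1000.0, 500.0, 2.0])

# Scale the continuous features so all carry equal weight
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means for k = 1..10 and record inertia_ (the WCSS)
wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wcss.append(km.inertia_)

# The "elbow" of this curve suggests the number of clusters
plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.savefig("elbow.png")
```

At k = 1 the inertia of standardized data equals n_samples × n_features, and it shrinks toward zero as k grows, which is why the curve always slopes downward.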

Applying K-means and principal component analysis (PCA)
In the code for applying the K-means algorithm, a K-means object is created and passed the number of clusters "K" obtained from the elbow method. Next, the fit method is called on the K-means object with the "crop_df_scaled" dataset, and the labels_ attribute is used to see the cluster label of each data point. The clusters identified by the K-means clustering approach can be visualized via dimensionality reduction. PCA, an unsupervised machine learning algorithm, is an effective tool for visualizing high-dimensional data in combination with K-means: it projects the data into a lower-dimensional space, restricting it to only a few significant principal components [26]. Figure 5 describes the code for implementing PCA on the dataset; each block in this figure represents the code for obtaining the principal components, creating a dataframe with two components, concatenating the labels to the dataframe, and visualizing and interpreting the clusters.
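As the code of Figure 5 is described rather than shown, the following sketch reconstructs those blocks; the data is a random stand-in for "crop_df_scaled", and K = 4 is taken from the elbow result reported later in the paper:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render plots off-screen
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Stand-in for the paper's scaled dataset "crop_df_scaled" (real data not shown)
rng = np.random.default_rng(2)
crop_df_scaled = StandardScaler().fit_transform(rng.normal(size=(54, 18)))

# K-means with K = 4, the value the elbow method suggests for the paper's data
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(crop_df_scaled)
labels = kmeans.labels_  # cluster label of each data point

# Reduce to two principal components so the clusters can be plotted
pca = PCA(n_components=2)
components = pca.fit_transform(crop_df_scaled)
pca_df = pd.DataFrame(components, columns=["pca1", "pca2"])
pca_df = pd.concat([pca_df, pd.Series(labels, name="cluster")], axis=1)

# Scatter the two components, colored by cluster label
plt.scatter(pca_df["pca1"], pca_df["pca2"], c=pca_df["cluster"], cmap="viridis")
plt.savefig("clusters_pca.png")
```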

Visualizing the dataset
The dataset must be visualized before applying the K-means algorithm to it. The results of data visualization are shown in Figures 6 (see Appendix) and 7. Figure 6 (see Appendix) depicts the KDE plot combined with the distplot, used to analyze the data visually. Figure 7 depicts the heatmap, which represents the dataset in a 2-dimensional format to find correlations among the data.

Clustering
To calculate the K value (the number of clusters), the elbow method is applied to the dataset. The outcome is shown in Figure 8, which depicts the elbow plot of the within-cluster sum of squares over a range of values of K. The optimum number of clusters (K value) is determined by choosing the "elbow" value of K, i.e., the point at which the WCSS starts to decrease linearly. We therefore take the number of clusters to be 4 for the given dataset. Table 1 shows the result of the K-means clustering algorithm, and Figure 9 depicts the clusters obtained, represented by reducing their dimensions using principal component analysis. Crops are commonly picked for their economic significance; the agricultural planning process, however, involves estimating the yield of many crops. In this context, using data availability as the main metric, 54 crops have been selected for this work. Crops were chosen only when appropriate data samples were available within the 6-year range.
As a result of the K-means clustering algorithm, 4 clusters are formed. Cluster 0 represents crops having medium production, high area, and medium-low yield. Cluster 1 represents crops having low production, low area, and medium yield. Cluster 2 represents crops having high production, medium area, and high yield. Cluster 3 represents crops having medium-low production, medium-low area, and low yield. Principal component analysis is used to represent the clusters by reducing their dimensions. The present work uses the distplot combined with the kernel density estimate plot and a heatmap for visualization, the elbow method for finding the optimal number of clusters "K", the K-means clustering algorithm for forming clusters from the dataset, and principal component analysis for representing the clusters formed by reducing their dimensions. With these methods, the crop data collection can be analysed and the optimum parameters for crop production can be determined.
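Cluster characterizations such as "high production, medium area, high yield" can be read off the per-cluster feature means. A minimal sketch of that step, using hypothetical data in place of the paper's dataset (column names and values are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical crop features standing in for the paper's dataset
rng = np.random.default_rng(4)
crop_df = pd.DataFrame({
    "production": rng.gamma(2.0, 500.0, 54),
    "area": rng.gamma(2.0, 300.0, 54),
    "yield": rng.gamma(3.0, 0.8, 54),
})

# Cluster on the scaled features, then attach labels to the raw dataframe
X = StandardScaler().fit_transform(crop_df)
crop_df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# The mean of each feature per cluster reveals whether a cluster sits
# low, medium, or high on production, area, and yield
profile = crop_df.groupby("cluster")[["production", "area", "yield"]].mean()
```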

CONCLUSIONS AND FUTURE WORK
In developing countries such as India, agriculture is the most significant application field. In agriculture, the use of information technologies can improve the decision-making scenario, and farmers can perform better. In several matters relating to the agriculture sector, data mining plays a key role in decision-making. This paper discusses the role of data mining from the perspective of the agriculture field. Different data mining techniques are applied to the input data and can be used to determine the best output-yielding process. To obtain the optimum parameters for achieving a higher crop yield, the present study used data mining techniques such as K-means clustering and principal component analysis. Through this paper, an attempt is made to lessen the manual work of applying data mining algorithms by using different python modules. Expanding the present work to evaluate soil, climate conditions, demand data, and other variables for the crop in order to improve crop yield is the scope for future work.