Large scale data analysis using MLlib

ABSTRACT


INTRODUCTION
Over the previous decade, information technology has advanced significantly in several areas, including computing, networking, and storage capacity [1]-[6]. This advancement has produced a new scientific paradigm: large-scale data collection and exploration, which has become an approach to scientific discovery on an equal footing with conventional theoretical analysis, experimental design, and computer simulation. The amount of data generated and stored has expanded dramatically over the last two decades as a result of the development of the internet of things (IoT), artificial intelligence, cloud computing, and other cutting-edge computer technologies [7]-[19]. Social media and the Internet have contributed substantially to this rate of data generation [20]-[24]: more than 6000 tweets are sent per second on Twitter, and similar trends can be found on Facebook, WhatsApp, and other social media platforms. At the organizational level, application and web servers generate large amounts of data from their logs, and other systems also contribute to the increased rate of data generation. As a result, data has become an essential component of business. Because of the growing volume and complexity of digital data, it has become impossible to process it with conventional data processing methods; efforts are therefore directed toward developing advanced computing infrastructures, based on parallel and distributed processing, that can handle data of this volume and complexity [25]-[42].
Managing vast amounts of data is a difficult task that demands more sophisticated systems in order to achieve accurate and timely analysis [43]-[45]. To process big data analytics problems in a timely and reliable manner, big data infrastructure has been developed that offers high performance, resource availability, self-service, and on-demand convenience. Numerous machine learning frameworks for big data analysis are presently available; they are relevant to different scientific domains and have been shown to be useful in healthcare informatics, genetic data analysis, text exploration, and random picture modeling, among other applications. The Apache Spark machine learning library (MLlib) is an open-source, in-demand, and independent library for big data analysis using machine learning techniques [46]. Its automatic data balancing and distributed design make it a good choice for big data analysis. Apache Spark provides a collection of widely used algorithms for numerous machine learning tasks, such as classification, regression, feature extraction, and dimensionality reduction [47]-[51]. Although numerous studies have been conducted on machine learning and its usefulness, ML libraries for big data analysis, such as Apache Spark MLlib, have received little attention. This is perhaps the first study to examine libraries for big data analysis that are based on machine learning techniques. Big data analytics is primarily concerned with advancing computing infrastructure so that data mining and analysis can be completed quickly and efficiently [45], [52]-[56], and it is a primary driving force behind today's businesses. Because big data analytics is computationally intensive, the user experience is influenced by the configuration of the software and hardware involved [57]-[66].
Several big data processing techniques have been proposed over the last decade because conventional processing methods cannot handle the large volume of data generated daily by business and industrial processes [67]. Researchers have therefore concentrated on developing more effective methods of extracting value-added information from large amounts of data. Studies on big data processing models fall into several categories. The first category comprises data flow models such as MapReduce, which support data processing with a variety of operators while sharing data through stable storage systems [68]-[74]. Resilient distributed datasets (RDDs) are a more efficient data sharing abstraction than stable storage because they do not require data copying, which reduces cost. Most high-level application programming interfaces (APIs) for data flow systems are language-integrated APIs [75]-[81] that allow the user to manipulate "parallel collections" through operators such as map and join. In these systems, however, the parallel collections represent either files on disk or temporary datasets used to express a query plan. Although these systems can pass data between operators within the same query, sharing data across queries has proven to be inefficient. Spark's API is built on the parallel collection model because it is convenient to use. Spark does not claim to be the first to offer a language-integrated interface, but by providing RDDs as the storage abstraction behind this interface, it can support a larger range of applications.
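As a minimal sketch of the data flow model described above, the following pure-Python code mimics the map, shuffle, and reduce phases on two in-memory partitions. It does not use Spark or Hadoop; the function names (map_phase, shuffle, reduce_phase, word_mapper) and the sample records are assumptions made only for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(partition, mapper):
    """Apply the mapper to every record of one partition, collecting (key, value) pairs."""
    return [pair for record in partition for pair in mapper(record)]


def shuffle(mapped_partitions):
    """Group values by key across all partitions (the shuffle step)."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine the values of each key with a binary reducer."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


def word_mapper(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


# Two partitions of illustrative input records (not data from the study).
partitions = [["spark mllib spark"], ["mllib moa", "spark"]]

# The mapped pairs are kept in memory so that later queries can reuse them.
mapped = [map_phase(p, word_mapper) for p in partitions]

# Query 1: word counts.
counts = reduce_phase(shuffle(mapped), lambda a, b: a + b)

# Query 2: distinct words, computed from the same in-memory intermediate data.
distinct_words = sorted(shuffle(mapped))
```

Because mapped stays in memory, the second query reuses it without re-reading the input, which is the kind of cross-query sharing that RDDs aim to make efficient.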
The second category comprises systems that provide high-level interfaces for specialized applications that require data sharing. Pregel [82] supports iterative graph applications, whilst Twister [83] and HaLoop [84] are iterative MapReduce runtimes. These frameworks provide data sharing only for the computation patterns they support; they do not offer a general abstraction that lets the user share data of their choice among operations of their choice. For example, a user cannot load a dataset into memory using Pregel or Twister and then decide which query to run on it. Because RDDs explicitly provide a distributed storage abstraction, they can support applications that these specialized systems do not, such as interactive data mining.
Lunga et al. [85] proposed a methodology that improves on standard big data analytics methods, which use either Hadoop/Spark or deep learning as separate components. Their framework combines Spark's distributed computing capabilities with a deep learning architecture in which multilayer perceptrons (MLPs) are trained using cascade learning. Frameworks for training deep learning models with Apache Spark have been developed in [47], [48], [50], [51], [57], [69], [86]-[92]. Such a framework shortens training time by exploiting both data parallelism and model parallelism at the same time. Data parallelism is achieved by distributing the training data across the machines of the Spark cluster and replicating the model on each machine [86]; each model replica is trained in parallel on its own part of the data. Model parallelism is implemented by distributing each replica of the deep neural network over the Spark cluster in a layer-by-layer fashion.
TELKOMNIKA Telecommun Comput El Control, Large scale data analysis using MLlib (Ahmed Hussein Ali)
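The data-parallel scheme described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the method of [86]: the model is a one-parameter linear model, the replicas are synchronized by simple weight averaging, and the names gradient and data_parallel_step, the learning rate, and the toy data are all hypothetical.

```python
def gradient(w, shard):
    """Mean squared-error gradient of the model y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    """One synchronous step: every replica updates on its own shard, then weights are averaged."""
    # Each replica starts from the same weight and sees only its own shard.
    local_weights = [w - lr * gradient(w, shard) for shard in shards]
    # Assumed synchronization rule: average the replica weights.
    return sum(local_weights) / len(local_weights)


# Toy data generated from y = 3x, split across two simulated workers.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[0::2], data[1::2]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
```

In a real cluster the list comprehension over shards would run on separate machines; here it runs sequentially, so the sketch shows only the arithmetic of replication and averaging.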
This research explores the impact of various software and hardware configurations on big data processing. It focuses on the capabilities and advantages of Apache Spark MLlib 2.0 as a big data analytics tool, particularly in relation to Hadoop. The study aims to provide insight into the use of machine learning libraries in big data analysis from an industry standpoint, and it opens the door to other aspects of big data analysis using machine learning methods, which is regarded as a rapidly expanding research topic. Several real-world tests are carried out to investigate the qualitative and quantitative aspects of Apache Spark MLlib 2.0. Moreover, a comparative study is carried out against the massive online analytics (MOA) library, a well-known Java-based machine learning library that is widely used in industry. Furthermore, the performance of several commonly used machine learning models for big data analysis is examined and compared across a variety of software and hardware settings. The remainder of this article is arranged as follows: section 2 introduces Apache Spark MLlib. Section 3 presents the method and components of the investigated Apache Spark MLlib 2.0, section 4 presents the results and discussion of the features and benchmarking, and section 5 presents the conclusion.

APACHE SPARK MLLIB 2.0
Apache Spark is a scalable and fast big data processing engine that was first developed by the AMPLab at the University of California, Berkeley [93]-[95]. It can be used to build distributed applications in a variety of programming languages, including Java and Python [96]-[105]. It ships with four major libraries: Apache Spark structured query language (SQL), Apache Spark Streaming, Apache Spark MLlib, and Apache Spark GraphX. These libraries are described below [106]-[108]. Apache Spark Streaming is a fault-tolerant library for high-level streaming analytics built on the basic Spark scheduling modules, whereas Apache Spark SQL performs relational queries for mining a variety of databases through a data abstraction known as DataFrames [109]. Apache Spark GraphX [110] is a high-level Apache Spark processing library that handles two commonly used data structures using distributed computation models. Apache Spark MLlib provides more than 55 scalable machine learning algorithms for big data analytics that benefit from both data and process parallelism. It supports numerous machine learning tasks, such as clustering, regression, classification, rule extraction, and dimensionality reduction, and it enables the rapid and simple development of machine learning approaches for large-scale applications [67], [111]-[116].
Apache Spark MLlib [117] also offers a suite of multi-language APIs for evaluating and deploying a wide range of machine learning techniques. In recent years, many areas of data science solutions have changed [118], [119], and a number of researchers have devoted attention to developing the components of Apache Spark MLlib for big data analytics. Figure 1 depicts the development track of Apache Spark MLlib up to version 2.0, with a unique number of anchors assigned to each release of the library [120]-[122]. This section discusses some recent applications of Apache Spark MLlib, including some of the newly introduced features. To aid the development of smart transportation applications, a scalable, open-source platform known as connected vehicles and smart transportation (CVST) has been proposed by a number of researchers. CVST consists of four essential components: data distribution, resource management, business intelligence, and application. The business intelligence component is in charge of data analytics and uses MLlib to process data and pass it to the front end. The studies in [107], [123], [124] proposed an architectural design for academic information system services that analyzes student enrollment patterns. This system uses MLlib to predict the recommended courses for the forthcoming semester.
SparkText is a text mining framework developed by Ye et al. [125] that combines Apache Spark machine learning and streaming algorithms with the Cassandra NoSQL database [90], [126]-[129]. The database was built from a large collection of medical publications for the purpose of cancer type classification. Aurora [130] demonstrated how to analyze web-sourced mobile data using the K-means algorithm of Apache Spark MLlib. The study gave an effective technique for determining the number of grid users by clustering latitude and longitude information. For learning human behaviors, the study in [122] presented ALMD, which performs feature description by randomly monitoring appearance and movement using the Apache Spark ML library. Assefi et al. [131] described the construction of a framework for demographics analysis, using next-generation data sequencing as a case study. Optimizing the system requires updating the resource estimator and tuning the components. The system was developed entirely on Apache Spark and therefore benefits from MLlib and the other Spark components. BigNN, also developed by Assefi et al. [131], is another notable application of big data analytics on Apache Spark. It can handle biomedical strings on a very large scale, which is useful in the healthcare industry.
MLlib can be used from programs written in R, Scala, Python, and Java. Vector, LabeledPoint, and Rating are the core data abstractions used by MLlib; the machine learning algorithms and statistical utilities of MLlib operate on data represented by these abstractions. The Vector type represents an indexed collection of values of type Double with a zero-based index of type Int, and it is used to capture the features of an observation.
Conceptually, a vector of length n represents an observation with n features; in other words, it represents an element in an n-dimensional space. The Vector type offered by MLlib differs from the Vector type in the Scala collection library: the MLlib Vector implements the concept of a numeric vector from linear algebra, whereas the Scala collection Vector does not. MLlib supports both dense and sparse vectors. Because the MLlib Vector type is defined as a trait, an application cannot instantiate it directly; instead, it must use the factory methods provided by MLlib to create an instance of either the DenseVector class or the SparseVector class. These factory methods are defined in the Vectors object.
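To make the dense and sparse representations concrete, the following is a toy Python stand-in for the factory pattern described above. It is not MLlib code: the classes below only borrow the names DenseVector, SparseVector, and Vectors for readability, and the to_list helper is an assumption added for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenseVector:
    """Stores every entry explicitly."""
    values: tuple

    def to_list(self):
        return list(self.values)


@dataclass(frozen=True)
class SparseVector:
    """Stores only the non-zero entries as parallel index and value tuples."""
    size: int
    indices: tuple
    values: tuple

    def to_list(self):
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense


class Vectors:
    """Toy factory object: callers never construct the vector classes directly."""

    @staticmethod
    def dense(*values):
        return DenseVector(tuple(float(v) for v in values))

    @staticmethod
    def sparse(size, indices, values):
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        if any(i < 0 or i >= size for i in indices):
            raise ValueError("index out of range for the given size")
        return SparseVector(size, tuple(indices), tuple(float(v) for v in values))


# The same 5-dimensional observation in both representations.
dense = Vectors.dense(0.0, 2.0, 0.0, 4.0, 0.0)
sparse = Vectors.sparse(5, [1, 3], [2.0, 4.0])
```

The sparse form stores two index-value pairs instead of five numbers, which is where the memory saving comes from when most features are zero.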

METHOD
Spark's machine learning library, MLlib, has been under heavy development since its inception and, unlike the Spark core, is still not fully stable with regard to its overall API and design. As of Spark version 1.2.0, a new, experimental API for MLlib has been released under the ml package (whereas the current library resides under the mllib package). Figure 1 shows the Spark ecosystem with MLlib. This new API aims to enhance the APIs and interfaces for models as well as for feature extraction and transformation, making it easier to build pipelines that chain together steps such as feature extraction, normalization, dataset transformation, model training, and cross-validation. Since the new API is still experimental, it may be subject to major changes in the next few Spark releases. Over time, the various feature-processing techniques and models will be ported to the new API; however, the core concepts and most of the underlying code will remain largely unchanged.
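The idea of chaining steps into a pipeline can be illustrated with a small, library-free Python sketch. It is only an analogy for the pattern described above: the classes MinMaxScaler, Squarer, and Pipeline are hypothetical, and the sketch collapses into a single fit/transform interface what a real pipeline API may split into separate estimator and transformer roles.

```python
class MinMaxScaler:
    """Normalization stage: learns the range on fit, rescales to [0, 1] on transform."""

    def fit(self, rows):
        self.lo, self.hi = min(rows), max(rows)
        return self

    def transform(self, rows):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero for constant data
        return [(r - self.lo) / span for r in rows]


class Squarer:
    """Stateless feature transformation stage."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [r * r for r in rows]


class Pipeline:
    """Runs the stages in order, feeding each stage the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Fit on illustrative training values, then apply the same chain to new values.
pipeline = Pipeline([MinMaxScaler(), Squarer()]).fit([0.0, 5.0, 10.0])
result = pipeline.transform([5.0, 10.0])
```

Fitting once and reusing the fitted chain ensures that new data is normalized with the statistics learned from the training data rather than its own.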
This section summarizes the tests carried out on the six datasets listed in Table 1. The findings are reported in terms of the processing time of MLlib and MOA when both were run on the same hardware. The performance of Apache Spark MLlib 2.0 was evaluated on six distinct large datasets obtained from the University of California, Irvine (UCI) machine learning repository. The experimental setup consisted of a standalone Spark cluster that used HDFS for storage and Apache Zeppelin 0.7.1 as the editor, both of which were set up by the authors. The Spark cluster is made up of a master node that runs the driver program, three worker nodes, and a data node (including one worker node that executes on the master node). The three nodes had similar configurations, as listed in Table 2. Each of the three worker nodes had a memory capacity of 48 GB and was configured with four executors (each with 4 GB of memory) and two CPUs. The worker on the master node was configured with three executors (each with 5 GB of memory) and two cores, as shown in the diagram. A total of 16 GB of RAM was allocated to the driver process.


MLlib was run with Scala 2.11.8 on a Spark 2.2.1 cluster, with Hadoop 2.7.3 serving as the distributed storage layer, and the results were recorded. To obtain the fastest possible execution time, the amount of RAM available to the executors in each worker node was adjusted and the optimal number of data partitions was used. Table 1 describes the datasets used in this investigation in terms of their numbers of attributes, records, and classes.

RESULTS AND DISCUSSION
The implementation began by defining the Spark context for the selected program. The Spark context is the primary entry point to Spark functionality, and it must be created before any RDDs. Its three parameters, namely the application name, the number of cores, and the URL of the cluster, were supplied in the configuration. The application name should be meaningful so that it clearly identifies the program's objective. To run on a local cluster, the keyword "local" is used in place of the cluster URL. Worker nodes are responsible for processing work in Spark, and in local mode the number of workers is determined by the number of cores specified. The next step is to train the model on the training data and to set the parameters available for the selected supervised machine learning methods (support vector machines (SVM), decision tree, and logistic regression). The parameters for the decision tree, SVM, and logistic regression methods are shown in Tables 3, 4, and 5, respectively.
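As a rough sketch of the three context parameters described above, the helper below assembles them into a plain dictionary of Spark property names. It does not start Spark; build_spark_conf, the application name, and the host name master-host are hypothetical, while spark.app.name, spark.master, and the local[N] master string follow Spark's documented configuration conventions.

```python
def build_spark_conf(app_name, cores=None, cluster_url=None):
    """Return the settings a Spark context would be created with (illustrative only)."""
    if cluster_url is None:
        # Local mode: "local[N]" runs N worker threads, "local[*]" uses all available cores.
        master = "local[*]" if cores is None else f"local[{cores}]"
    else:
        # Cluster mode: the master URL identifies the cluster manager.
        master = cluster_url
    return {"spark.app.name": app_name, "spark.master": master}


# Local run with four cores; the application name is a made-up example.
local_conf = build_spark_conf("mllib-vs-moa-benchmark", cores=4)

# Standalone cluster run; the host name below is a placeholder, not the study's cluster.
cluster_conf = build_spark_conf(
    "mllib-vs-moa-benchmark", cluster_url="spark://master-host:7077"
)
```

A descriptive application name makes the job easy to identify in the cluster's monitoring interface, which is the reason given above for choosing it carefully.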
The next step is testing the trained model on the test set; this was accomplished with the "predict" method, which was applied to each row of the test set using Spark's "map" transformation. Figure 2 compares the computational time of Apache Spark MLlib and MOA under different experimental conditions. The area under the ROC curve was similar for Apache Spark MLlib and MOA, and the difference between them was not statistically significant. The small difference could be due to the detailed parameter settings of each classifier and to the random selection of the test and training datasets. Apache Spark MLlib was faster than MOA based on the observed computational times of the classifiers; the clustering method, however, showed statistically significant differences between Apache Spark MLlib and MOA.
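The evaluation step described above, mapping a predict function over the test rows and then computing the area under the ROC curve, can be sketched in plain Python. The linear scorer, its weights, and the four test rows are hypothetical values chosen for illustration; the AUC is computed with the standard pairwise-ranking definition rather than any particular library routine.

```python
def predict(weights, bias, features):
    """Linear score for one row; a higher score means 'more likely positive'."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def area_under_roc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical test set of (features, label) rows and a hypothetical trained model.
test_set = [
    ([0.2, 1.0], 0),
    ([1.5, 0.3], 1),
    ([0.9, 0.9], 1),
    ([0.8, 0.2], 0),
]
weights, bias = [1.0, -0.5], 0.0

# Apply predict to every row with map, mirroring a row-wise map transformation.
scores = list(map(lambda row: predict(weights, bias, row[0]), test_set))
labels = [label for _, label in test_set]
auc = area_under_roc(scores, labels)
```

With these toy values one of the four positive-negative pairs is ranked in the wrong order, so the sketch yields an AUC of 0.75.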

CONCLUSION
Data generation has increased at an alarming rate in recent years, necessitating advances in data analytics and processing tools to enable the extraction of relevant information from vast amounts of structured and unstructured data. Big data machine learning techniques, which are believed to be efficient at pattern finding, can be used to handle this challenge more efficiently. Apache Spark MLlib is a widely used machine learning library and a powerful tool for big data analytics. As shown in this study, it provides excellent performance in terms of computational time. Massive online analytics (MOA), on the other hand, is slightly slower than Apache Spark MLlib during big data analysis; however, the comparison may not be entirely appropriate because the classifiers use different configurations and file systems: MLlib was implemented on the Spark distributed file system, whereas the MOA classifier was implemented on the Hadoop distributed file system. Although MOA was used here as a benchmark to demonstrate how well Spark performs on large datasets, MOA has many features that Spark cannot compete with, such as the large pool of resources and documentation available to MOA users, the ease with which non-experts can use MOA, and its good graphical user interface, among others. These characteristics help explain why MOA is used for a variety of machine learning techniques.

Table 2. System description

Table 3. Parameters of the decision tree classification technique

Table 5. Parameters used in the logistic regression algorithm