Solid waste classification using pyramid scene parsing network segmentation and combined features

Solid waste has become a serious problem for countries around the world because the amount of waste generated increases every year. To support the reduction and reuse of solid waste, classification of solid waste images is needed to enable automatic waste sorting. In image classification, image segmentation and feature extraction play important roles. This research applies a recent deep learning-based segmentation method, the pyramid scene parsing network (PSPNet). We also evaluate various combinations of image features (color, texture, and shape) to find the best combination. For comparison, we also run experiments without segmentation to observe the effect of PSPNet. A support vector machine (SVM) is then applied as the classification algorithm. The experimental results show that applying segmentation generally provides a better source for feature extraction, especially for color and shape features, and hence increases classifier accuracy. The most important feature in this problem is the color feature, although classifier accuracy increases when additional features are introduced. The highest accuracy of 76.49% is achieved when PSPNet segmentation is applied and color, texture, and shape features are combined.


INTRODUCTION
Solid waste has become a serious problem for countries around the world because the amount of waste generated increases every year. In 2016, the world's cities generated up to 2.01 billion tonnes of solid waste, equal to 0.74 kilograms per person per day. This amount is estimated to increase by 70%, to 3.40 billion tonnes, by 2050. Population growth and urbanization are the most significant factors driving the increase in the amount of waste. Poor waste management may create serious health, safety, and environmental problems. Therefore, a proper waste management strategy is needed to minimize such negative impacts [1].
One effort to reduce the amount of solid waste is to improve waste reusability. Waste sorting plays a significant role in supporting waste reusability [2]. Because the amount of waste is large and public awareness of waste sorting is still low, automatic waste sorting is needed. The starting point for automatic waste sorting is to build a classification model that can recognize the type of waste in an image.
Several previous studies have investigated the use of machine learning algorithms to classify or recognize waste images. Mustaffa et al. [3] and Torres-García et al. [4] classified waste images into three classes using conventional machine learning algorithms and achieved good accuracies, but their experiments used only 20 samples per class. Therefore, the generalization of the resulting models to the variety of real waste images could not be assured. Adedeji and Wang [5] and Costa et al. [6] classified waste images using deep learning models and a larger number of samples, but they used the captured waste images directly without any segmentation. Segmentation, however, is one of the most important parts of image preprocessing. A poor segmentation result may degrade the performance of subsequent processes, such as feature extraction and classification [7].
Segmentation is the process of partitioning an image into several disjoint subsets [8], for example into background and foreground. Segmentation can also be used to extract the region of interest of an image [7]. State-of-the-art segmentation methods are deep learning algorithms with special architectures, such as encoder-decoder networks, which perform better than conventional methods such as thresholding [9]. The pyramid scene parsing network (PSPNet) is a deep learning network that can be used for semantic image segmentation. On several large benchmark datasets, PSPNet outperforms other deep learning-based segmentation methods, such as the fully convolutional network (FCN), DeepLab, deep parsing network (DPN), and Laplacian pyramid reconstruction and refinement (LRR). PSPNet achieves better segmentation results because it considers the global context of the image and uses a pyramid pooling module to obtain context from different regions of an image [10].
In addition, the features extracted from the image must be chosen properly to achieve good classification results [11], [12]. Feature extraction aims to extract a relevant subset of features from an image and to reduce the high dimensionality of the image to a lower-dimensional set of image features [13]. Color, texture, and shape are the most commonly used visual features extracted from an image. Color moments are among the simplest color features compared with others, such as the color histogram, color coherence vector, and color correlogram [14]. Color moments have also proven to be effective and efficient for extracting the color features of an image [15]. Texture features are also important because they capture the relationship between neighboring pixels. The gray level co-occurrence matrix (GLCM) is a popular texture-based feature extraction method that has been successfully applied to many computer vision problems [16]-[18]. Another important image feature for describing the object in an image is shape. Morphological features, such as area, perimeter, major and minor axis, centroid-x, centroid-y, roundness, rectangularity, eccentricity, and elongation, can be used as shape descriptors [19]. The Hu moments are also important for extracting shape features. The Hu moment method is a region-based method that uses second and third order central moments to construct seven invariant moments whose values are not affected when the image is translated, rotated, or scaled [20].
In this research, we propose applying PSPNet as the segmentation method to provide a good source for feature extraction. We also evaluate various combinations of image features (color, texture, and shape) to find the best combination for solid waste image classification. For comparison, we also run experiments without segmentation to observe the effect of PSPNet. A support vector machine (SVM) is then applied as the classifier. SVM is a binary classification algorithm proposed by Cortes and Vapnik [21] that works by finding the optimal hyperplane that maximizes the separation between the data of two classes. SVM has been successfully applied to various classification problems and has been shown to perform better than other popular classification algorithms, such as artificial neural networks (ANN) [22], [23], the Naïve Bayes classifier, and random forest [24]. Figure 1 shows the stages of the process in this research. First, the image dataset is segmented using PSPNet, and the process continues with feature extraction, classification, and evaluation. In the comparison experiments that examine the effect of PSPNet, the segmentation stage in Figure 1 is skipped.

Dataset
The public trash image dataset Trashnet is used in this research as the source data for the experiments. The Trashnet dataset was collected by Yang and Thung [25]. It contains 2,527 trash images of 224×224 pixels, grouped into six classes: glass (501), paper (594), cardboard (403), plastic (482), metal (410), and trash (137). A sample image from each class can be seen in Figure 2 [25].
Figure 2 panels: glass, paper, cardboard, plastic, metal, trash

PSPNet for image segmentation
PSPNet is used to generate a segmented binary image. The bounding box of the segmented region is then calculated and the image is cropped so that only the main object remains. PSPNet is a deep learning network for semantic image segmentation. It outperforms FCN-based segmentation because it considers the global context of the image and uses a pyramid pooling module to obtain context from different sub-regions of an image. The architecture of PSPNet is shown in Figure 3 and consists of the following stages [10], [26]:

• Sub-region average pooling
Each feature map is pooled over different sub-regions to obtain a different context representation for each sub-region. At the first level (red), global average pooling is performed on each feature map, which yields a single-bin output for each feature map. At the second level (orange), third level (blue), and fourth level (green), each feature map is divided into 2×2, 3×3, and 6×6 sub-regions, respectively, and each sub-region is pooled by average pooling.

• Convolution
A 1×1 convolution is performed at each level to reduce the dimension of the feature map at that level to 1/N of the original one (black), where N is the number of pyramid levels.

• Upsampling
Upsampling is performed using bilinear interpolation so that each feature map has the same size as the original one (black).

• Concatenation
The original feature map (black) and all upsampled feature maps from the first to the fourth level are concatenated, and the result is forwarded to a convolutional layer for prediction.

The segmentation process using PSPNet consists of training and testing. The dataset is divided into 70% training data, 15% validation data, and 15% testing data. Training is performed using combinations of two hyperparameters, learning rate (0.001, 0.0001, and 0.00001) and batch size (5 and 10), for 50 epochs. Training with these combinations of hyperparameters yields six segmentation models, and the testing data are then used to evaluate them and select the best one. The Dice coefficient ($D$) is used to evaluate the segmentation results, as shown in (1), where $A$ and $B$ are the image regions being compared [27].

$$D = \frac{2\,|A \cap B|}{|A| + |B|} \quad (1)$$
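The Dice coefficient used to compare a predicted segmentation with its reference can be illustrated with a minimal sketch. It assumes binary masks stored as nested Python lists of 0/1 values; the function name `dice_coefficient` and the convention for two empty masks are assumptions for illustration, not part of the original study.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice coefficient 2|A n B| / (|A| + |B|) of two equally sized binary masks."""
    intersection = 0
    size_a = 0
    size_b = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            size_a += 1 if a else 0
            size_b += 1 if b else 0
            intersection += 1 if (a and b) else 0
    if size_a + size_b == 0:
        # Assumed convention: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * intersection / (size_a + size_b)


# Toy 2x3 masks: three foreground pixels each, two of them shared.
predicted = [[0, 1, 1],
             [0, 1, 0]]
ground_truth = [[0, 1, 0],
                [0, 1, 1]]
score = dice_coefficient(predicted, ground_truth)
```

A value of 1 means the two regions coincide exactly and a value of 0 means they do not overlap, so the segmentation model with the highest score on the testing data would be preferred.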

Feature extraction
Feature extraction aims to extract a relevant subset of features from an image [13]. This research uses three kinds of image features: color features extracted using color moments, texture features extracted using the gray level co-occurrence matrix (GLCM), and shape features. Experiments are run using one feature or a combination of these features to obtain the best classification result. Table 1 compares the source image used for each feature extraction method. When PSPNet segmentation is used, the original red, green, and blue (RGB) image is segmented, which produces the segmented binary image. Then, for the extraction of color and texture features, the image is cropped around the bounding box using the findContours function of the OpenCV library. When segmentation is skipped, each image is converted into a binary image using inverse binary thresholding (threshold value = 128) before the shape features are extracted.
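The two preprocessing paths described above can be sketched as follows. This is a dependency-free illustration on nested lists, not the OpenCV-based code of the study: the helper names `bounding_box`, `crop`, and `inverse_binary_threshold` are hypothetical, and the bounding box is taken over all foreground pixels rather than over a detected contour.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the nonzero pixels of a mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # no foreground pixel found
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def crop(image, box):
    """Crop an image (nested lists) to an inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def inverse_binary_threshold(gray, threshold=128):
    """Map pixels above the threshold to 0 and all other pixels to 1."""
    return [[0 if value > threshold else 1 for value in row] for row in gray]


# With segmentation: crop the source image around the segmented object.
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0]]
image = [[10 * r + c for c in range(4)] for r in range(4)]
box = bounding_box(mask)
cropped = crop(image, box)

# Without segmentation: binarize a grayscale image with threshold 128.
binary = inverse_binary_threshold([[200, 100], [128, 129]])
```

In the inverse threshold, dark objects on a bright background become foreground, which is why the quality of the resulting binary image depends strongly on the background of the photograph.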

Color moments
Color is a visual feature that can be used to discriminate or recognize visual information. If the color distribution of an image is interpreted as a probability distribution, color moments can be used to characterize it [28]. Three color moments (mean, standard deviation, and skewness) are extracted for every image channel, so 9 numerical values are extracted for an image in RGB color space. The mean is the average of the pixel values, as shown in (2); the standard deviation measures the variation of the pixel values, as shown in (3); and the skewness measures the degree of asymmetry of the color distribution in an image channel, as shown in (4). Here $N$ is the total number of pixels in each channel and $p_{ci}$ is the $i$-th pixel value in channel $c$ [15].

$$\mu_c = \frac{1}{N} \sum_{i=1}^{N} p_{ci} \quad (2)$$

$$\sigma_c = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( p_{ci} - \mu_c \right)^2} \quad (3)$$

$$s_c = \sqrt[3]{\frac{1}{N} \sum_{i=1}^{N} \left( p_{ci} - \mu_c \right)^3} \quad (4)$$
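As a rough illustration of the nine color features, the sketch below computes the three moments per channel with the standard library only. It assumes the commonly used definition in which the skewness is the cube root of the third central moment; the function names and the pixel representation as (r, g, b) tuples are assumptions for this example.

```python
import math


def color_moments(channel):
    """Mean, standard deviation, and skewness of one channel given as a flat list."""
    n = len(channel)
    mean = sum(channel) / n
    second = sum((p - mean) ** 2 for p in channel) / n
    third = sum((p - mean) ** 3 for p in channel) / n
    std = math.sqrt(second)
    # Signed cube root, so that a negative third moment gives a negative skewness.
    skewness = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return mean, std, skewness


def rgb_color_moments(pixels):
    """Concatenate the three moments of the R, G, and B channels into 9 values."""
    features = []
    for c in range(3):
        features.extend(color_moments([pixel[c] for pixel in pixels]))
    return features


# Toy channel with one bright pixel, which skews the distribution to the right.
mean, std, skewness = color_moments([0, 0, 0, 4])
feature_vector = rgb_color_moments([(0, 0, 0), (255, 128, 0), (255, 255, 255)])
```

For the toy channel the mean is 1, the variance is 3, and the third central moment is 6, so the skewness is positive, reflecting the long tail toward bright values.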

Gray level co-occurrence matrix (GLCM)
GLCM is a method for extracting texture features of an image. First, a co-occurrence matrix P is created. P is a square matrix whose size equals the number of gray intensity values in the image. Each element P(i, j) is the number of occurrences (frequency) of two neighboring pixels in a specific orientation where the gray intensity of the first pixel is i and the gray intensity of the second pixel is j [29]. Neighboring pixels are selected according to the specified spatial orientation. For example, when the orientation is 0°, the neighbor of a pixel is the pixel on its right side. The resulting GLCM is obtained by making P symmetric (adding P to its transpose) and then normalizing the elements so that each value lies in [0, 1]. Several metrics can be calculated from the resulting GLCM: contrast, angular second moment (ASM), energy, homogeneity, correlation, and dissimilarity. The detailed formula for each metric can be found in [30]. In this research, we construct the GLCM in four spatial orientations (0°, 45°, 90°, and 135°).
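The construction of the symmetric, normalized GLCM and its metrics can be sketched in plain Python as follows. The pixel offsets chosen for each orientation, a neighbor distance of 1, normalization by the sum of all entries, and the widely used metric definitions are assumptions; the study refers to [30] for the exact formulas.

```python
import math

# (row step, column step) from a pixel to its neighbor for each orientation.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, levels, angle=0):
    """Symmetric, normalized co-occurrence matrix for one orientation."""
    dr, dc = OFFSETS[angle]
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    # Add the transpose to make the matrix symmetric, then normalize.
    symmetric = [[counts[i][j] + counts[j][i] for j in range(levels)]
                 for i in range(levels)]
    total = sum(sum(row) for row in symmetric)  # assumes at least one pixel pair
    return [[value / total for value in row] for row in symmetric]


def glcm_features(p):
    """Six texture metrics computed from a normalized GLCM."""
    cells = [(i, j, p[i][j]) for i in range(len(p)) for j in range(len(p))]
    asm = sum(v * v for _, _, v in cells)
    mean_i = sum(i * v for i, _, v in cells)
    mean_j = sum(j * v for _, j, v in cells)
    var_i = sum(v * (i - mean_i) ** 2 for i, _, v in cells)
    var_j = sum(v * (j - mean_j) ** 2 for _, j, v in cells)
    covariance = sum(v * (i - mean_i) * (j - mean_j) for i, j, v in cells)
    spread = math.sqrt(var_i * var_j)
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "dissimilarity": sum(v * abs(i - j) for i, j, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "asm": asm,
        "energy": math.sqrt(asm),
        # Assumed convention: a constant image is given correlation 1.
        "correlation": covariance / spread if spread > 0 else 1.0,
    }


# Toy 4x4 image with 4 gray levels.
example = [[0, 0, 1, 1],
           [0, 0, 1, 1],
           [0, 2, 2, 2],
           [2, 2, 3, 3]]
p0 = glcm(example, levels=4, angle=0)
features = glcm_features(p0)
```

Calling `glcm` once per orientation and concatenating the six metrics gives one texture feature vector per orientation, which corresponds to the GLCM 0°, 45°, 90°, and 135° variants compared in the experiments.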

Shape
Shape is also a prominent feature for discriminating one image from another. This research extracts shape descriptors of an image from morphological features and the Hu invariant moments. The morphological features extracted are area, perimeter, major and minor axis, centroid-x, centroid-y, roundness, rectangularity, eccentricity, elongation, dispersion I, dispersion IR, convexity, and solidity [19]. An illustration of these morphological shape features can be seen in Figure 4.
In addition, the Hu moments are used to extract shape descriptors of an image. The Hu moment method is a region-based method that uses second and third order central moments to construct 7 invariant moments. The values of these invariant moments are not affected when the image is translated, rotated, or scaled. The details of the Hu moments can be found in [20].
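The seven invariant moments can be sketched from a binary mask with the standard library alone. The code below follows the usual definition of the Hu moments from normalized central moments; the function names and the toy masks are illustrative assumptions, and a non-empty mask is assumed.

```python
def hu_moments(mask):
    """Seven Hu invariant moments of a non-empty binary mask (nested lists)."""
    points = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    m00 = float(len(points))
    cx = sum(x for x, _ in points) / m00
    cy = sum(y for _, y in points) / m00

    def eta(p, q):
        # Normalized central moment of order (p, q).
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in points)
        return mu / m00 ** (1 + (p + q) / 2.0)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03          # recurring sums
    c, d = n30 - 3 * n12, 3 * n21 - n03  # recurring differences
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]


def rotate_90(mask):
    """Rotate a mask by 90 degrees clockwise."""
    return [list(row) for row in zip(*mask[::-1])]


# An L-shaped object, a translated copy, and a rotated copy.
shape = [[1, 1, 1, 0],
         [1, 0, 0, 0],
         [1, 0, 0, 0]]
shifted = [[0] * 6] + [[0, 0] + row for row in shape]
rotated = rotate_90(shape)

original_hu = hu_moments(shape)
shifted_hu = hu_moments(shifted)
rotated_hu = hu_moments(rotated)
```

The three masks give the same seven values up to floating-point rounding, which illustrates the translation and rotation invariance described above. Scale invariance holds only approximately on a pixel grid, so it is not demonstrated here.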

Classification
Classification consists of training and testing. Training is used to build the classifier model, while testing is used to evaluate the performance of the model, as illustrated in Figure 5. First, the dataset is split using 10-fold cross-validation. This research applies SVM as the classification algorithm.

Figure 5. Training and testing in classification

The SVM training algorithm works by finding the optimal hyperplane that maximizes the separation between the data of two classes. The training data closest to the optimal hyperplane, which define the optimal margin, are called support vectors [21]. When the data are not linearly separable, a non-linear mapping $\phi(x)$ is applied to transform the original data into a higher-dimensional space [31]. Given the training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the input training data, $y_i$ is the target, and $n$ is the number of training data, SVM finds the solution by solving the optimization problem in (5), where $w$ is the weight vector, $b$ is the bias, $\xi_i$ are slack variables, and $C$ is the error penalty.

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \; \xi_i \ge 0 \quad (5)$$

This optimization problem can be solved using the Lagrangian formulation. The training data are normalized into [0, 1] before they are input to the SVM, and $y_i$ is set to -1 or 1 [32]. To reduce the computational cost when working with non-linear data, the kernel trick can be used to substitute the dot product between transformed data tuples, as in (6). Popular kernel functions include the polynomial and radial basis function (RBF) kernels, shown in (7) and (8), respectively, where $p$ is the polynomial degree and $\gamma$ is the RBF width parameter [33].

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \quad (6)$$

$$K(x_i, x_j) = \left( x_i \cdot x_j + 1 \right)^p \quad (7)$$

$$K(x_i, x_j) = \exp\left( -\gamma \, \| x_i - x_j \|^2 \right) \quad (8)$$

Once the optimization problem is solved, the optimal hyperplane and the support vectors are obtained. The output $f(x)$ of a new test sample $x$ can then be determined using (9), where $x_i$ are the support vectors, $y_i$ is the class label of the $i$-th support vector, $m$ is the number of support vectors, $\alpha_i$ are the Lagrange multipliers, and $b$ is the bias [32].

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \right) \quad (9)$$

This research applies the one-versus-rest strategy to handle the multiclass classification problem, because the Trashnet dataset consists of 6 classes.
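How a trained SVM scores a new sample, and how one-versus-rest turns binary decisions into a six-class prediction, can be sketched as follows. This is a minimal illustration of the prediction step only, not a training algorithm: the kernel parameterizations (`degree`, `coef0`, `gamma`), the toy support vectors, and the rule of picking the class with the largest decision value are assumptions for the example.

```python
import math


def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """Polynomial kernel (x . z + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support_vectors, labels, alphas, bias, kernel):
    """Kernel expansion over the support vectors, before taking the sign."""
    return sum(alpha * y * kernel(sv, x)
               for sv, y, alpha in zip(support_vectors, labels, alphas)) + bias


def predict_binary(x, support_vectors, labels, alphas, bias, kernel):
    """Binary prediction: the sign of the decision value, as -1 or 1."""
    value = decision_value(x, support_vectors, labels, alphas, bias, kernel)
    return 1 if value >= 0 else -1


def predict_one_vs_rest(x, models, kernel):
    """Pick the class whose 'this class versus the rest' model scores highest."""
    scores = {name: decision_value(x, *model, kernel)
              for name, model in models.items()}
    return max(scores, key=scores.get)


# Toy binary model with two support vectors (hypothetical values).
toy = ([[0.0, 0.0], [2.0, 0.0]], [-1, 1], [1.0, 1.0], 0.0)
near_positive = predict_binary([2.0, 0.0], *toy, rbf_kernel)
near_negative = predict_binary([0.0, 0.0], *toy, rbf_kernel)

# Toy one-versus-rest setting with two of the six classes.
models = {
    "glass": ([[0.0, 0.0]], [1], [1.0], 0.0),
    "paper": ([[1.0, 1.0]], [1], [1.0], 0.0),
}
predicted_class = predict_one_vs_rest([0.9, 0.9], models, rbf_kernel)
```

With six classes, one-versus-rest requires six binary models, each trained to separate one class from the other five.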

Evaluation
Evaluation is performed to assess the resulting classification model. In this research, the classification model is evaluated in terms of accuracy. Accuracy is the ratio between the number of correctly predicted data and the total number of data [34].
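The accuracy measure, together with a simple 10-fold partition of the dataset, can be sketched as follows. The fold assignment shown here (every k-th sample, without shuffling or stratification) is an assumption for illustration; the study does not describe how its folds were drawn.

```python
def accuracy(true_labels, predicted_labels):
    """Ratio of correctly predicted samples to the total number of samples."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


def k_fold_indices(n_samples, k=10):
    """Split sample indices into k folds whose sizes differ by at most one."""
    return [list(range(start, n_samples, k)) for start in range(k)]


# Three of four toy predictions are correct.
example_accuracy = accuracy(["glass", "paper", "metal", "trash"],
                            ["glass", "paper", "trash", "trash"])

# Ten folds over the 2,527 Trashnet images; each fold serves once as test data.
folds = k_fold_indices(2527, k=10)
```

In cross-validation the reported accuracy is typically the mean of the per-fold accuracies, with each fold used once for testing and the remaining nine for training.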

RESULTS AND ANALYSIS
This research is carried out in two main scenarios. The first scenario uses PSPNet segmentation, while the second scenario skips the segmentation process. In each scenario, single features and combinations of color (color moments), texture (GLCM), and shape (morphological features and Hu invariant moments) features are tested to find the image features that best describe the trash images and give the best classification results. GLCM feature extraction is performed in four spatial orientations: 0°, 45°, 90°, and 135°. In the classification stage, the SVM training algorithm is run with combinations of two parameters, the kernel function (RBF and polynomial) and the error penalty C (1 or 100). Therefore, for each feature set in a scenario, classification with SVM is performed four times using different combinations of kernel function and error penalty C. In the last section, the results of the first and second scenarios are compared.
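The size of the experiment grid can be made concrete with a short enumeration. The sketch assumes that the four GLCM orientations are alternatives (at most one orientation per feature set); this yields 19 feature sets, which matches the number of experiments compared between the two scenarios, but the exact list of combinations used in the study is an assumption.

```python
from itertools import product

GLCM_ANGLES = (0, 45, 90, 135)
KERNELS = ("rbf", "polynomial")
PENALTIES = (1, 100)


def feature_sets():
    """All non-empty combinations of color, one GLCM orientation, and shape."""
    glcm_options = [None] + ["glcm_%d" % angle for angle in GLCM_ANGLES]
    combinations = []
    for use_color, glcm_name, use_shape in product((False, True),
                                                   glcm_options,
                                                   (False, True)):
        names = []
        if use_color:
            names.append("color")
        if glcm_name:
            names.append(glcm_name)
        if use_shape:
            names.append("shape")
        if names:  # skip the empty feature set
            combinations.append(tuple(names))
    return combinations


experiments = feature_sets()
# Each feature set is classified four times: 2 kernels x 2 error penalties.
runs = [(features, kernel, penalty)
        for features in experiments
        for kernel, penalty in product(KERNELS, PENALTIES)]
```

Under this assumption each scenario comprises 19 feature sets and 76 SVM runs, and the best accuracy per kernel is what is reported for each feature set.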

The first scenario
In this scenario, segmentation is performed in the first stage using PSPNet. To obtain the best PSPNet model, this research tries combinations of two hyperparameters: learning rate (0.001, 0.0001, and 0.00001) and batch size (5 and 10). The experiment for each combination of hyperparameters is run for 50 epochs.
Figure 6(a) shows that a learning rate of 0.0001 gives better results than the other values. When the learning rate is too small, the network learns very slowly, so the result reached within the 50 epochs is lower. Conversely, when the learning rate is too high, training may diverge and the network fails to achieve the best result. Figure 6(b) shows that a batch size of 5 reaches better performance than a batch size of 10. A possible explanation is that, in this case, the stochastic nature of smaller mini-batches helps the network find a better solution. Therefore, segmentation in the rest of the experiments is performed using the best segmentation model, trained with this combination of hyperparameters. The segmentation results of PSPNet for the sample images in Figure 2 can be seen in Figure 7. The results of this scenario for various features, with the best accuracy for each SVM kernel (RBF and polynomial), can be seen in Figure 8. The RBF kernel is better than the polynomial kernel in most experiments, but when more features are combined, the polynomial kernel is better than the RBF kernel. The highest accuracy of 76.49% in this scenario is achieved by the polynomial kernel with C = 1 when using the combination of color, GLCM 135°, and shape features. The highest accuracy of the RBF kernel in this scenario is 74.55%, obtained with C = 100 and the same combination of color, GLCM 135°, and shape features. Therefore, it can be concluded that when segmentation is used, classification performance increases as more features are combined. Combining more features gives a more representative feature set for an image, which increases classification accuracy. However, the most important feature is the color feature: when it is removed, the accuracy of the classifier decreases.

The second scenario
The second scenario is performed without segmentation in the preprocessing. The results of this scenario for various features, with the best accuracy for each SVM kernel (RBF and polynomial), can be seen in Figure 9. The RBF kernel is better than the polynomial kernel in most experiments. However, the highest accuracy of 74.83% in this scenario is achieved by the polynomial kernel with C = 100 when using the combination of color and GLCM 90° features. The best result of the RBF kernel in this scenario is 74.55%, obtained with C = 100 and the combination of color and GLCM 135° features. Therefore, it can be concluded that when segmentation is not used, the combination of features that best describes the trash images is color and GLCM. When the shape features are added, classification performance decreases. To extract shape features in this scenario, a conventional thresholding operation is applied to transform the RGB image into a binary image, and the resulting binary image is not good enough as a source for extracting shape features. Figure 10 compares the first and second scenarios. It shows that the first scenario (using PSPNet segmentation) is better than the second scenario (without segmentation) in most experiments. The second scenario outperforms the first in only 5 of 19 experiments. Therefore, it can be concluded that applying PSPNet segmentation generally provides a better source for feature extraction, especially for color and shape features, and hence increases classification performance. It is also observed that the most important feature in this problem is the color feature. When a single feature is used, the color feature gives the highest result compared with the GLCM (texture) and shape features in both scenarios. However, the accuracy increases when additional features are introduced.
In the first scenario, better results are achieved when all features are combined, while in the second scenario better results are achieved when only color and texture features are used. Therefore, it can be concluded that when segmentation is applied using PSPNet, the segmented binary image provides a better source for shape feature extraction. Conversely, when the binary image is obtained only by inverse binary thresholding, the result is not good enough for shape feature extraction. Hence, classification accuracy decreases when the shape feature is added in the second scenario. Of all the parameter combinations tested in this research, the highest accuracy of 76.49% is achieved when using PSPNet segmentation and the combination of all features (color, texture, and shape).

TELKOMNIKA Telecommun Comput El Control
The results of this research show that combining features improves the performance of the resulting model compared with using individual features, but the combined features are still not enough to uniquely characterize each class of solid waste image. More representative additional features are still required to improve classifier performance. The tuning of the parameters of the classification algorithm also needs to be explored to obtain better classification results.

CONCLUSION
In this research, we apply PSPNet segmentation and combinations of image features (color, texture, and shape) to classify solid waste images. For comparison, to observe the effect of PSPNet segmentation, we also run experiments without segmentation. The experimental results show that applying segmentation generally provides a better source for feature extraction, especially for color and shape features, and hence increases classification accuracy. The most important feature in this problem is the color feature, whether or not segmentation is applied. However, classifier accuracy increases when additional features are introduced. When segmentation is not used, better results are achieved with only color and texture features, while when segmentation is applied, the highest accuracy of 76.49% is achieved with the combination of all features.