Customer segmentation with RFM models and demographic variable using DBSCAN algorithm

ABSTRACT


INTRODUCTION
Customer segmentation is the process of classifying customers into separate categories based on shared characteristics [1], [2]. It gives a company the understanding of its customers that it needs, helps identify prospective customers [3], and groups customers with similar characteristics. One approach is behavior-based segmentation, for which the most commonly used model is recency, frequency, and monetary [4]. Customer behavior has been researched extensively in the fields of retail marketing and retail decision-making [5]. For the past two decades, the recency, frequency, and monetary model has been used to categorize customer databases according to purchasing behavior [6].
Hughes introduced the recency, frequency, and monetary (RFM) model in 1994 as a means of analyzing customer behavior [7]. The model considers a customer's recency (the interval since their most recent purchase), frequency (the number of purchases they make in a particular time period), and monetary value (the amount they spend on each transaction) to determine their value [2]. It is an effective method of predicting customers' future purchasing behavior, so businesses can use it to determine which customers are worth engaging and retaining [8].
212 mart is a retail business that has focused only on the monetary aspect of customer segmentation. It gave special treatment only to customers with high monetary value, that is, customers who made large transactions. It did nothing with recency and frequency, which made it ineffective at identifying prospective customers. A recency variable therefore needed to be added to capture the interval between the latest transaction and the time of analysis, and a frequency variable was needed because customers with a high purchase frequency show greater loyalty. With all three RFM variables, prospective customers can be identified effectively, and the results can be used to develop an effective marketing strategy [8]. In the RFM model, customers are grouped into four categories based on the average RFM values: prospective customers, new customers, lost customers, and loyal customers [9].

TELKOMNIKA Telecommun Comput El Control. Customer segmentation with RFM models and demographic variable using … (Siti Monalisa)
Besides RFM-based segmentation, customers can also be segmented on demographics, which is the most common form of market segmentation and the easiest to understand. Demographic information is easy to collect, interpret, and transfer from one study to another [10]. Commonly used demographic variables are age, gender, family size and type, income, occupation, education level, race, and nationality [11]. Demographics are statistics that describe a customer population. They are also used in marketing and in public opinion polls on trends [12].
Clustering is a data mining method that groups data into segments according to their characteristics. Data with similar characteristics are placed in the same segment, while data with dissimilar characteristics are separated into different segments [13]. One commonly used clustering algorithm is density-based spatial clustering of applications with noise (DBSCAN). The algorithm can find clusters of any shape at a given density [14], handle large-scale data, detect outliers, and cluster larger data sets with different shapes and sizes [15].

METHOD
This research consisted of five steps. The first step was data preprocessing. The data were selected based on the RFM criteria and then transformed into RFM values. After that, the data were normalized so that the variable scales would not differ too much, because the M value is a currency amount in rupiah, unlike the R and F values. The values were normalized with the Min-Max method to a range of 0 to 1 [16]. In this study, RN is the normalized recency, RF is the normalized frequency, and MN is the normalized monetary value. The normalized value is calculated as

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where $x$ is the original value and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that variable.

The second step was clustering the data with the DBSCAN algorithm, which requires determining the optimal values of epsilon and MinPts. A k-dist graph was used for this purpose, by observing how the epsilon values shift for different k values. The points where the k-dist graph changes drastically were used as the epsilon values, while the k values were used as MinPts [15]. The third step was measuring cluster quality using the silhouette index (SI). After the best clustering was obtained, the fourth step was determining the rank symbol of each cluster from the average value of each RFM attribute in the cluster. Finally, in the fifth step, an analysis based on the demographic variables was performed.
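As an illustration of the Min-Max normalization used in the preprocessing step, the following minimal Python sketch scales one variable to the range 0 to 1. The function name `min_max_normalize` and the sample rupiah amounts are hypothetical and are not taken from the study's data.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the range 0-1 with Min-Max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical monetary values in rupiah (illustrative only).
monetary = [50_000, 125_000, 200_000, 350_000]
normalized = min_max_normalize(monetary)
print(normalized)
```

The same function would be applied separately to the recency and frequency columns, so that all three variables share one scale before clustering.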

RFM model
Hughes introduced the RFM model in 1994 as a behavior-based customer segmentation technique that evaluates a customer's past behavior. It segments customers based on recency (the time between their most recent purchase and the present), frequency (the number of transactions during a specific time period), and monetary value (how much was spent on transactions) [8]. The RFM model enables companies to easily assess customer loyalty towards their products and services, allowing them to optimize their profits [17].
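The three RFM values can be derived from a transaction log as in the sketch below. It is a minimal illustration, assuming that each transaction is a (customer id, date, amount) tuple, that recency is measured in days before a chosen analysis date, and that monetary value is the total amount spent in the period. The function name `compute_rfm` and the sample records are hypothetical.

```python
from datetime import date


def compute_rfm(transactions, analysis_date):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    summary = {}
    for customer_id, purchase_date, amount in transactions:
        last, freq, total = summary.get(customer_id, (purchase_date, 0, 0))
        # Keep the latest purchase date, count the transaction, add the amount.
        summary[customer_id] = (max(last, purchase_date), freq + 1, total + amount)
    return {
        cid: ((analysis_date - last).days, freq, total)
        for cid, (last, freq, total) in summary.items()
    }


# Hypothetical transactions (illustrative only).
transactions = [
    ("C001", date(2020, 1, 15), 100_000),
    ("C001", date(2020, 12, 1), 250_000),
    ("C002", date(2020, 6, 30), 75_000),
]
rfm = compute_rfm(transactions, analysis_date=date(2020, 12, 31))
print(rfm)
```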
To determine customer value, clusters whose average RFM value is higher than the overall average are denoted with '↑', while clusters with a lower value are denoted with '↓'. Clusters denoted by R↑F↑M↑ are referred to as loyal customers, while those denoted by R↓F↓M↓ are considered lost customers. Customers denoted by R↑F↓M↓ are new customers, and those denoted by R↑F↑M↓ are prospective customers. Table 1 explains the customer characteristics based on their RFM values [9].
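The mapping from rank symbols to the four customer categories can be written as a simple lookup, sketched below. The names `CATEGORIES` and `categorize` are hypothetical, and the "uncategorized" fallback for symbol combinations that the text does not name is an assumption of this sketch.

```python
# (R, F, M) rank symbols mapped to the four categories named in the text.
CATEGORIES = {
    ("↑", "↑", "↑"): "loyal customers",
    ("↓", "↓", "↓"): "lost customers",
    ("↑", "↓", "↓"): "new customers",
    ("↑", "↑", "↓"): "prospective customers",
}


def categorize(r, f, m):
    """Return the customer category for one combination of rank symbols."""
    # Combinations not described in the text fall back to "uncategorized".
    return CATEGORIES.get((r, f, m), "uncategorized")


print(categorize("↑", "↑", "↓"))
```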

DBSCAN clustering
DBSCAN is a density-based clustering algorithm that groups data points in high-density regions into a cluster [18]. The algorithm is guided by two essential parameters, epsilon and MinPts. Epsilon is the maximum distance at which two data points are considered neighbors, while MinPts is the minimum number of data points required within the epsilon radius to form a cluster [19]. The steps of the DBSCAN algorithm are as follows:
a) Randomly select a data point from the data set as the starting point, that is, the core point candidate.
b) Establish the values of epsilon and MinPts.
c) If the selected starting point qualifies as a core point under the user-defined epsilon and MinPts values, form a cluster with its neighboring objects. The distance between the core point and a neighboring object can be measured with the Euclidean distance:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (2)$$

where $p$ and $q$ are two objects and $n$ is the number of attributes.
d) If the starting point is a border point and no object is density-reachable from it, DBSCAN visits the next object in the data set as the next core point candidate.
e) Repeat steps c) and d) until all points have been visited.
f) If the selected object qualifies neither as a core point nor as a border point of a formed cluster, it is treated as an outlier (noise): its distance to every core point is greater than epsilon, and it has fewer neighbors than the specified MinPts.

Customers in the prospective segment have a high level of response: they have made transactions recently and fairly often, which makes them prospective customers for the company.
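The DBSCAN steps listed in this section can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the study: it visits points in index order instead of randomly, counts a point as its own neighbor when checking MinPts, and uses hypothetical names (`region_query`, `dbscan`, `NOISE`) and made-up sample points.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Hypothetical normalized points (illustrative only).
points = [
    (0.10, 0.10), (0.12, 0.11), (0.11, 0.13),
    (0.80, 0.80), (0.82, 0.81), (0.81, 0.83),
    (0.50, 0.05),
]
labels = dbscan(points, eps=0.06, min_pts=3)
print(labels)
```

In this example the two tight groups form clusters 0 and 1, and the isolated last point is labeled as noise.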

Silhouette index validation
The silhouette index was first introduced by Rousseeuw in 1987. It combines intra-cluster cohesion with the separation between clusters to evaluate cluster quality [20], [21], so that it represents the separability of clusters and serves as a cluster validity measure. The silhouette index is useful when the data are on a ratio scale (Euclidean distance) and when looking for clearly separated clusters [22]. It describes how well an object fits the cluster it occupies. The optimal clustering has a high silhouette index value, close to 1. If the s(i) value is close to 1, the cluster is very dense. However, if the s(i) value is near -1, the cluster that contains object i is not dense [23]. The silhouette index value is calculated as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (3)$$

where $a(i)$ is the average distance from object $i$ to the other objects in its own cluster and $b(i)$ is the smallest average distance from object $i$ to the objects of any other cluster.
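A minimal Python sketch of the silhouette index is given below. It averages s(i) over all clustered points and makes two assumptions that the text does not state: points labeled -1 (noise) are ignored, and a point in a single-member cluster receives s(i) = 0. The function name `silhouette_index` and the sample points are hypothetical.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette value over all clustered points (label -1 is skipped)."""
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    scores = []
    for lab, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)  # assumed convention for singleton clusters
                continue
            # a(i): mean distance to the other members of the same cluster
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            # b(i): smallest mean distance to the members of another cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items()
                if other_lab != lab
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two well-separated hypothetical clusters (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
si = silhouette_index(points, [0, 0, 1, 1])
print(round(si, 3))
```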

RESULTS AND DISCUSSION
The data used were the transaction data of customers who hold 212 mart member cards and made transactions from January to December 2020. There were 1,205 customers who met these criteria. The data were selected based on the RFM criteria and normalized using (1) to a range of 0 to 1. The results are shown in Table 2 and Table 3.
The next step was clustering the data using the DBSCAN algorithm. To determine the optimal values of epsilon and MinPts, a k-dist graph was used. The k-dist values were computed for k = 3, k = 4, and k = 5 using RStudio. The resulting k-dist graphs are shown in Figure 1, Figure 2, and Figure 3.
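The values behind a k-dist graph can be computed as in the following Python sketch: for every point, take the distance to its k-th nearest neighbor, then sort these distances. Plotting the sorted values and reading off the point of sharpest change gives a candidate epsilon. The function name `k_dist` and the sample points are hypothetical, and this is not the RStudio routine used in the study.

```python
import math


def k_dist(points, k):
    """Sorted distances from each point to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Hypothetical points on a line (illustrative only); needs more than k points.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
curve = k_dist(points, k=2)
print(curve)
```

In this toy example the sorted curve jumps sharply at its last value, which is the kind of drastic change that marks the boundary between dense points and outliers.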
Based on Figure 1, Figure 2, and Figure 3, the points that change drastically lie between 0.06 and 0.08. Therefore, the candidate epsilon values are in the range 0.06 to 0.08, with MinPts values of 3, 4, and 5. The DBSCAN results can be seen in Table 4. After obtaining the clusters, the next step was validating them to find the optimal number of clusters, as well as the quality and strength of the clusters for each combination of epsilon and MinPts. The results of the cluster validation are in Table 5.

Based on Table 5, the highest SI value, 0.4222, is obtained at epsilon 0.06 and MinPts 3. It is the value closest to 1 among the tested combinations, so this clustering can be considered the most optimal. Epsilon 0.06 and MinPts 3 produce 5 clusters and 31 noise points. The clusters consist of 1,118 customers in cluster 1, 7 customers in cluster 2, 14 customers in cluster 3, 9 customers in cluster 4, and 26 customers in cluster 5.
The next step was determining the rank symbol of each cluster, using the average value of each RFM attribute in the cluster. A cluster whose average frequency or monetary value was higher than the corresponding average before clustering was given the symbol ↑ for that attribute, while a cluster whose average was lower was given the symbol ↓ [9]. Recency is treated in the opposite way: if the average recency of a cluster was higher than the average before clustering, it was given the symbol ↓, and if it was lower, it was given the symbol ↑. This is because a shorter interval between the last purchase and the analysis period means a better recency [24]. Tables 6, 7, and 8 show the RFM averages before clustering, the RFM averages of each cluster, and the rank symbol of each cluster, respectively.

Based on Table 8, the customers in cluster 1 are categorized as prospective customers (R↑F↑M↓). Customers in this group rank above the overall average on R and F, which indicates that they have shopped recently and frequently. 212 mart can actively contact these customers to offer new products, accompanied by promotional activities and various new gifts, with the aim of increasing their interest in buying and the amount they spend.
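The rank-symbol rule, including the inverted treatment of recency, can be sketched in Python as follows. The function name `rank_symbols` and the sample averages are hypothetical and are not the values in Tables 6 to 8. Assigning ↓ when a cluster average equals the overall average is an assumption, since the text does not cover ties.

```python
def rank_symbols(cluster_avg, overall_avg):
    """Return the (R, F, M) rank symbols of one cluster.

    Both arguments are (recency, frequency, monetary) averages.
    """
    r_c, f_c, m_c = cluster_avg
    r_o, f_o, m_o = overall_avg
    r = "↑" if r_c < r_o else "↓"  # recency is inverted: a shorter interval ranks higher
    f = "↑" if f_c > f_o else "↓"
    m = "↑" if m_c > m_o else "↓"
    return r, f, m


# Hypothetical averages (illustrative only): recency in days, monetary in rupiah.
overall = (120.0, 10.0, 500_000.0)
symbols = rank_symbols((60.0, 15.0, 400_000.0), overall)
print("R{}F{}M{}".format(*symbols))
```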
Clusters 2, 3, 4, and 5 have the same rank symbol (R↑F↑M↑), which shows that the customers in those clusters are loyal customers. Customers in this segment are highly retainable. 212 mart must maintain their loyalty by regularly informing them about the latest products, by using their transactions to better understand their purchasing behavior and needs, and by providing benefits that customers receive every time they make a transaction of a certain value.


The company can also expand its reward program according to customers' spending. After obtaining the RFM rank of each cluster, the next task was analyzing the customer segments based on demographic variables: gender, age, occupation, address, and marital status. Table 9 and Table 10 show the customer demographics.
Based on Table 9 and Table 10, the majority of customers belong to the prospective customers category (cluster 1), which consists of 1,118 customers. They are predominantly middle-aged (35−44 years; 34%) and female (744; 67%), the most common occupation is lecturer (312; 28%), 95% live in Pekanbaru, and most are married (1,029; 92%). Meanwhile, the loyal customers category (clusters 2, 3, 4, and 5) has 56 customers in total. The largest age group among them is 35−44 years (30%), most are female (42; 75%), the most common occupation is lecturer (41%), 95% live in Pekanbaru, and almost all are married (92%).
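The reported shares can be checked against the counts with a short Python sketch. It assumes that the counts for cluster 1 read as 744 female customers, 312 lecturers, and 1,029 married customers out of 1,118, and that percentages are rounded to the nearest whole number. The helper name `share` is hypothetical.

```python
def share(count, total):
    """Percentage of count in total, rounded to the nearest whole percent."""
    return round(100 * count / total)


cluster_1_size = 1118
loyal_size = 7 + 14 + 9 + 26  # clusters 2, 3, 4, and 5

print(share(744, cluster_1_size))   # female customers in cluster 1
print(share(312, cluster_1_size))   # lecturers in cluster 1
print(share(1029, cluster_1_size))  # married customers in cluster 1
print(share(42, loyal_size))        # female customers among loyal customers
```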

Discussion
This study segmented the customers of 212 mart based on RFM and demographics using the DBSCAN algorithm. Customer data were clustered into segments based on the RFM variables. Each cluster was then assigned an RFM rank from its average values, and customers were categorized by their characteristics in accordance with the theory in [9]. Two customer segments were obtained: loyal and prospective customers. Both segments were then analyzed based on demographics, using age, gender, occupation, address, and marital status. This analysis produces an understanding of the customers and proposed strategies to apply to each customer segment based on its characteristics.
Research on RFM and demographic-based customer segmentation was previously carried out in [25] using customer data from five-star hotels in Antalya, Turkey. That study differs from this one in the algorithms used for clustering, which were the self-organizing map (SOM) and K-means. The demographic variables used there were gender, age, nationality, and travel companion. The findings showed 8 clusters, with the majority of customers belonging to the "lost customers" segment, who stay for a shorter time and are predominantly male. The results showed that RFM clusters customers effectively, which might encourage senior managers to develop original ideas for improving their customer relationship management (CRM) capabilities.

CONCLUSION
This study shows that the 5 clusters obtained yield 2 categories of customers based on customer characteristics: loyal customers and prospective customers. Customers in the loyal category made repeated transactions and often spent large amounts of money. Customers in this category are well worth retaining by providing the best service, so that they stay and are not lured away by other companies.
Based on the demographic analysis, the majority of 212 mart's customers were middle-aged (35−44 years; 34%), female (786; 67%), and married (1,080; 92%), which suggests that the majority of 212 mart's customers are housewives. This is also reflected in the products sold at 212 mart, which were aimed at women or housewives. The most purchased items were groceries such as rice and oil, as well as household items such as washing supplies and toiletries. In addition, the largest occupational group among 212 mart's customers was lecturers (335; 25%), and most customers lived in Pekanbaru. This is because 212 mart targets the upper middle class, such as lecturers, and is located close to the campus.

Table 1. The customer characteristics based on the RFM value
Prospective customers (R↑F↑M↓): This segment shows customers who have high average recency and frequency values but a low average monetary value.

Table 2. The customer data in RFM model

Table 3. The normalization data

Table 4. The result of DBSCAN clustering

Table 5. The results of silhouette index

Table 6. The RFM before clustering

Table 7. The RFM average of each cluster

Table 8. The RFM rank symbol of each cluster

Table 9. The demographic of cluster 1 (prospective customers)

Table 10. The demographic of loyal customers clusters