A comparison of different support vector machine kernels for artificial speech detection

ABSTRACT


INTRODUCTION
Speaker recognition is the process of identification or verification of a speaker from the speech signal. Speaker identification is the process of determining the speech owner from the speech, whereas speaker verification is the process of accepting or rejecting the claimed identity of a speaker. Recently, automatic speaker verification (ASV) systems were introduced to provide better security, replacing the traditional authentication methods that were less efficient and secure. Applications of ASV systems include but are not limited to access control and banking transactions [1].
Despite the security and convenience that ASV systems bring, spoofing attacks by adversaries are unavoidable. To bypass an ASV system, malicious entities launch spoofing attacks to gain illegal access to the system. Various countermeasures, known as voice presentation attack detection (PAD), were introduced to secure ASV systems. Generally, voice PAD can be categorized into replayed speech detection and artificial speech detection. Artificial speech refers to the speech signal generated by speech synthesis and voice conversion techniques, whereas replayed speech refers to the speech signal generated by replaying recorded speech.
Many classifiers have been used in recent voice PAD works to detect artificial speech. One of the most extensively used classifiers is the support vector machine (SVM), as it was found to excel in various classification tasks [2], [3]. The recent work [4] showed that SVM with the radial basis function (RBF) kernel outperformed classifiers such as k-nearest neighbour (KNN), decision tree, and Naive Bayes, with 1% equal error rate (EER) on the ASVspoof 2019 replay evaluation set. The performances of different SVM kernels were also compared, and the RBF kernel was found to perform the best in replay detection.
Nonetheless, to the best of our knowledge, no study has investigated the performance of different SVM kernels for artificial speech detection. It is necessary to investigate which kernel is appropriate, as the performance of an SVM model varies with the classification task. The choice of kernel depends on the input features, as some features are linearly separable by the SVM hyperplane and some are not. Hence, in this work, the performance of different SVM kernels on artificial speech detection is presented. Handcrafted features, namely hexadecimal-based features, image-based features, and the conventional mel-frequency cepstral coefficient (MFCC) features, are used for artificial speech detection.
The key contribution of this paper is an empirical comparison of the performance of SVM kernels in detecting artificial speech when applied to the presented handcrafted features. The remaining sections of the paper are arranged as follows: section 2 describes the proposed features and classifiers for artificial speech detection; section 3 presents the experimental setup, results, and discussion; and lastly, section 4 concludes the paper.

METHOD

2.1. MFCC
MFCCs, which are widely used in speech analysis, were also used as features in this work. The process of extracting MFCCs is presented in Figure 1. First, the input signal was windowed into short frames. Then, the discrete Fourier transform (DFT) was applied to the windowed signal to obtain the power spectrum. A logarithm was applied to the amplitude to obtain the log-amplitude spectrum. Then, mel-scaling was conducted, in which a mel filterbank was applied to the log-amplitude spectrum to produce the mel spectrum. Lastly, the discrete cosine transform (DCT) was applied to the mel spectrum to produce a number of coefficients known as MFCCs. The first 13 MFCC coefficients have been shown to be the most informative about formants and the spectral envelope [5]. Hence, 13 MFCC coefficients were used in this paper as conventional speech features.

Figure 2 shows the hexadecimal representation of voice data. Similar to image data, voice data can be represented in text and numeric formats such as binary and hexadecimal. Hexadecimal provides a more human-friendly numeric representation than binary. To the best of our knowledge, no related work has applied features engineered from a hexadecimal representation of the speech signal to spoof detection. In this paper, text-based features were extracted from the hexadecimal representation of audio data to form a feature space.
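As a rough illustration of the hexadecimal representation mentioned above, the following sketch converts raw bytes into two-character hexadecimal tokens. The function name `to_hex_tokens` and the sample bytes are hypothetical; the paper does not state which tool was used to produce the representation.

```python
def to_hex_tokens(raw):
    """Split raw bytes into two-character uppercase hexadecimal tokens."""
    # Iterating over a bytes object yields integers in the range 0..255.
    return [f"{b:02X}" for b in raw]


# Hypothetical stand-in for a few bytes of an audio file; a real file
# would be read with open(path, "rb").read().
sample = b"RIFF\x00\xff"
print(" ".join(to_hex_tokens(sample)))  # 52 49 46 46 00 FF
```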

Hexadecimal frequencies
Works that use features extracted from hexadecimal-represented data for classification problems can be found in [6], [7]. In [6], [7], the occurrences of each opcode in an executable file were counted and used as features to classify malicious software (malware). This approach produced high accuracy in malware classification because different classes of malware usually have a higher frequency of certain opcodes.
As the approach in [6], [7] produced good results in malware classification, it was adapted to the domain of artificial speech detection in this paper. A hexadecimal representation of speech was used to extract features to classify between genuine and spoof speech. Artificial speech data may contain an abnormal number of certain hexadecimal values, ranging from 00 to FF, which may be used as an indicator to distinguish between genuine and spoof voices. To the best of our knowledge, this is the first work that used hexadecimal-based features in detecting artificial speech. For each text-represented speech sample, the occurrences of each of the 256 hexadecimal values, from 00 to FF, were counted. Then, a histogram of hexadecimal frequencies consisting of 256 features was computed. In addition, min-max normalized hexadecimal frequencies were derived by applying min-max normalization [8] to the hexadecimal frequencies. In total, 512 features were extracted from the hexadecimal representation. Equation (1) shows the formula for the min-max normalization used in this paper.
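The counting step described above can be sketched as follows. This is a minimal illustration, assuming the hexadecimal values correspond to the individual bytes of the audio file; the function name `hex_frequencies` is hypothetical.

```python
from collections import Counter


def hex_frequencies(raw):
    """Return a 256-bin histogram: index v holds the count of byte value v (00..FF)."""
    counts = Counter(raw)  # iterating bytes yields integers 0..255
    return [counts.get(value, 0) for value in range(256)]


hist = hex_frequencies(b"\x00\x00\xff")
print(hist[0x00], hist[0xFF], len(hist))  # 2 1 256
```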

TELKOMNIKA Telecommun Comput El Control
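$$x'_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$
Where $x_i$ is the occurrence of hexadecimal value $i$, and $x_{\max}$ and $x_{\min}$ are the maximum and the minimum number of occurrences of hexadecimal values in a speech, respectively.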
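The min-max normalization of the hexadecimal frequencies can be sketched as below. The function name is hypothetical, and returning zeros when all counts are equal is an assumption made here to avoid division by zero; the paper does not describe that case.

```python
def min_max_normalize(freqs):
    """Scale a list of counts to the range [0, 1] using its own minimum and maximum."""
    lo, hi = min(freqs), max(freqs)
    if hi == lo:
        # Assumption: a constant histogram maps to all zeros.
        return [0.0] * len(freqs)
    return [(f - lo) / (hi - lo) for f in freqs]


print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
```

Concatenating the 256 raw counts with their 256 normalized counterparts would give the 512 hexadecimal-based features mentioned in the text.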

Image-based features
The images used in this work for artificial speech detection were the spectrogram and the MFCC. Although both spectrogram and MFCC are commonly used to represent speech signals, little attention has been paid to applying both as images [9], [10]. In this paper, the spectrogram and MFCC images were generated from the audio using the pyplot and librosa libraries in Python, respectively, and saved as 640×480 pixel PNG images. Examples of the generated spectrogram and MFCC images are shown in Figure 3 and Figure 4, respectively.
Two types of image-based features were extracted from both the spectrogram and MFCC images in this paper, namely color layout filter (CLF) and local binary patterns (LBP) features. Weka's implementation of the CLF features was used in this paper, resulting in 33 CLF features [11]. This paper applied the setting used in the original LBP [12], with a neighborhood radius of 1, resulting in 8 neighboring pixels in a 3×3 pixel window. Then, the frequency histogram generated by the LBP operation was used to produce the 256 LBP features.
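The LBP histogram described above can be sketched in plain Python as follows. This is an illustrative version only: the grayscale input, the clockwise bit ordering, and skipping border pixels are assumptions, and the paper's own extraction tool may differ in those details.

```python
def lbp_histogram(img):
    """Return a 256-bin histogram of 8-neighbour, radius-1 LBP codes.

    img is a 2-D list of grayscale intensities; border pixels are skipped.
    """
    # Neighbour offsets (row, col), clockwise from the top-left corner.
    # The starting point and direction are assumptions.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            center = img[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the center sets its bit.
                if img[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
print(lbp_histogram(flat)[255])  # 1
```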

Support vector machine (SVM)
SVM is a supervised machine learning model that is mostly used in binary classification tasks [13], [14]. Several recent works have also used SVM, for example [15], [16]. In this paper, various SVM settings were tested to identify the appropriate settings for artificial speech detection. The Weka implementation of SVM, based on the LIBSVM library, was used in this paper. Four SVM kernels were tested, namely radial basis function (RBF), linear, polynomial, and sigmoid. The RBF kernel is usually the default kernel in most machine learning tools and libraries, such as Weka and sklearn. The RBF is a real-valued function often used to build function estimates. Equation (2) shows the formula of the RBF kernel.
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) \qquad (2)$$
Where the parameter $\gamma$ defines the influence of a training sample selected as a support vector, while $\lVert x_i - x_j \rVert$ is the Euclidean distance between two points $x_i$ and $x_j$. The linear kernel is used for linearly separable data. Linearly separable means that the data can be separated by a straight line when plotted in two dimensions. Equation (3) shows the formula of the linear kernel.
$$K(x_i, x_j) = x_i^T x_j \qquad (3)$$
Where $x_i^T x_j$ is the dot product of two points $x_i$ and $x_j$. The polynomial kernel represents the similarity of feature vectors in a feature space over polynomials of the original variables, which allows the model to be non-linear. The polynomial kernel is often used in image processing tasks. Equation (4) shows the formula of the polynomial kernel.
$$K(x_i, x_j) = \left(x_i^T x_j + 1\right)^d \qquad (4)$$
Where $x_i^T x_j$ is the dot product of two points $x_i$ and $x_j$, while $d$ is the degree of the polynomial. The sigmoid kernel is equivalent to a two-layer perceptron model, and the sigmoid function is often used as an activation function in neural networks. Equation (5) shows the formula of the sigmoid kernel.
$$K(x_i, x_j) = \tanh\left(\alpha\, x_i^T x_j + c\right) \qquad (5)$$
Where $\alpha$ is the scaling parameter of the sample, while $c$ is the shifting parameter that controls the threshold of the mapping of the two points $x_i$ and $x_j$. More information on the kernels can be found in [17]-[19].
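For reference, the four kernels in (2) to (5) can be written as small Python functions. This is a sketch only: the parameter names `gamma`, `degree`, `alpha`, and `c` and their default values are assumptions chosen for illustration, and the RBF form with a squared distance follows the common LIBSVM convention rather than anything stated explicitly in the text.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def linear_kernel(x, y):
    """Linear kernel: the plain dot product."""
    return dot(x, y)


def polynomial_kernel(x, y, degree=3):
    """Polynomial kernel: (dot product + 1) raised to the given degree."""
    return (dot(x, y) + 1) ** degree


def sigmoid_kernel(x, y, alpha=1.0, c=0.0):
    """Sigmoid kernel: tanh of a scaled and shifted dot product."""
    return math.tanh(alpha * dot(x, y) + c)


x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))                # 11.0
print(polynomial_kernel(x, y, degree=2))  # 144.0
print(rbf_kernel(x, x))                   # 1.0
```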

Experimental setup
An experiment was conducted to identify the appropriate SVM settings for artificial speech detection. The ASVspoof 2019 logical access (LA) dataset, which is made up of speech synthesis and voice conversion attacks, was used for the experiment [20], [21]. The ASVspoof 2019 LA dataset consists of three partitions, namely training, development, and evaluation sets. The corpus was built from speech samples of 107 speakers, of which 46 were male and 61 were female. There were six spoofing algorithms, namely A01-A06, in the training and development partitions.
As the training and development partitions together consist of 5,128 bona fide utterances and 45,096 spoof utterances, resampling was conducted as a preventive measure to reduce the chance of model overfitting during training. Both under-sampling and over-sampling were used to ensure that the numbers of bona fide and spoof samples were equal. Both bona fide and spoof samples were resampled to 7,200 samples each, giving a total of 14,400 samples used in the experiments.
The resampled training and development partitions were used to train the SVM models. The evaluation partition was used for testing. The experiment was conducted using Weka, whereby the default SVM settings, other than the kernel, were used. To further improve the detection performance, feature fusion was conducted. The fused features include MFCC, image-based features, and hexadecimal-based features. Feature fusion may cause the model generated by a classifier to overfit, as features with large values will often be assigned a disproportionately large weight. To mitigate this issue, feature normalization should be applied [22]. The min-max normalization described in section 2.2 was used in this experiment. Concerning the classification output, we also investigated the classification performance when probability estimates [23] are used. In general, a probability estimate is the number of times an event happened divided by the number of trials. The most commonly used approach to produce probability estimates in SVM is Platt scaling. Platt scaling is often applied to a binary classification problem, using a logistic regression model to output probability estimates in the range 0 to 1. The machine used in the experiment had the following specification: Intel i5-3210M processor, 2.50 GHz, 8 GB of RAM, Windows 10 (64-bit) OS.
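The class balancing described above can be sketched as follows. The paper does not state which resampling procedure was used, so plain random under-sampling and random over-sampling with duplication are assumed here purely for illustration; the function name `resample` and the integer stand-ins for utterances are hypothetical.

```python
import random


def resample(samples, target, rng):
    """Return exactly `target` items drawn from `samples`.

    Larger classes are under-sampled without replacement; smaller classes
    keep every original item and are padded with random duplicates.
    """
    if len(samples) >= target:
        return rng.sample(samples, target)
    extra = [rng.choice(samples) for _ in range(target - len(samples))]
    return list(samples) + extra


rng = random.Random(0)             # fixed seed for repeatability
bonafide = list(range(5128))       # stand-ins for bona fide utterances
spoof = list(range(45096))         # stand-ins for spoof utterances
balanced_bonafide = resample(bonafide, 7200, rng)
balanced_spoof = resample(spoof, 7200, rng)
print(len(balanced_bonafide) + len(balanced_spoof))  # 14400
```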
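The idea behind Platt scaling mentioned above can be sketched as fitting a sigmoid to SVM decision scores. The version below uses plain gradient descent on the negative log-likelihood, which is a simplification chosen for brevity and not the procedure implemented in LIBSVM or Weka; the function names, learning rate, epoch count, and toy scores are all assumptions.

```python
import math


def platt_probability(score, a, b):
    """Map a decision score to a probability with 1 / (1 + exp(a * score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit the sigmoid parameters a and b by gradient descent (labels are 0 or 1)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for f, y in zip(scores, labels):
            p = platt_probability(f, a, b)
            # Gradient of the negative log-likelihood w.r.t. a and b.
            grad_a += (y - p) * f
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy decision scores: positive scores belong to class 1.
scores = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]
a, b = fit_platt(scores, labels)
print(platt_probability(2.0, a, b) > 0.5)  # True
```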

Analysis of results
EER is the primary metric to assess the performance of a biometric system, especially in speaker verification. The EER is the operating point of a biometric system at which the false acceptance rate (FAR) and the false rejection rate (FRR) are equal. Hence, in this work, the EER metric was used to measure the performance of the SVM models, where a lower EER indicates better performance. Table 1 shows the performance of the SVM models with the different settings described in the foregoing section. From Table 1, two SVM models with different settings achieved less than 5% EER on the ASVspoof 2019 LA evaluation set. The two best settings, as shown in Table 1, are the polynomial SVM and the linear SVM, both with feature normalization, which produced 1.42% and 3.55% EER, respectively. This observation suggests that the fused features are effective in artificial speech detection when appropriate SVM settings are used.
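The EER definition above can be illustrated with a simple threshold sweep. This is a sketch under stated assumptions: higher scores are taken to mean "more likely genuine", the candidate thresholds are the observed scores, and the EER is approximated at the threshold where FAR and FRR are closest. The function name and toy scores are hypothetical, and evaluation toolkits typically interpolate more carefully.

```python
def equal_error_rate(genuine, spoof):
    """Approximate the EER from genuine and spoof score lists."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(spoof)):
        # FAR: spoof samples accepted at threshold t.
        far = sum(s >= t for s in spoof) / len(spoof)
        # FRR: genuine samples rejected at threshold t.
        frr = sum(g < t for g in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


genuine_scores = [0.9, 0.8, 0.7, 0.3]
spoof_scores = [0.1, 0.2, 0.4, 0.6]
print(equal_error_rate(genuine_scores, spoof_scores))  # 0.25
```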

An interesting observation is that all SVM kernels performed best when feature normalization was applied. This indicates that feature normalization can produce a better result. When no normalization is applied, the features with larger values are likely to dominate the prediction result. This is because the SVM models may tend to give more weight to the features with larger values, so that overfitting occurs. Therefore, normalized features produced better results by bringing all features to the same range, which reduces the probability of overfitting, although this may not always be the case.
Another observation is that the RBF kernel, which is commonly the best-performing kernel, did not perform well, as shown in Table 1, although it was shown to perform well in most cases [24]. Nonetheless, it can be observed that normalization improved the performance of the SVM with the RBF kernel to 10.84% EER, from 50% EER when no normalization was applied. The polynomial kernel SVM performed the best among the compared kernels, which may be because most of the included features were image-based. Note that the polynomial kernel is often used for image processing and has been shown to produce decent performance [25]. Some recent works that applied SVM for artificial speech detection were used in this paper for comparison. Table 2 compares the performance of the proposed approach with that of recent works. For better representation, the linear SVM with feature normalization and the polynomial SVM with feature normalization shown in Table 1 are labeled as model 1 and model 2 in Table 2. The performance comparison was conducted between model 1 and model 2 and the recent works, namely model 3 to model 7, which used SVM as the classifier on the evaluation set.
From Table 2, model 2 performed the best among the compared recent works that used SVM as the classifier. It can be observed in Table 2 that the most commonly used kernel in the recent works was the polynomial kernel. In addition, the models that use polynomial SVM, namely model 2, model 3, model 5, and model 6, produced below 3% EER when detecting artificial speech. The RBF kernel often outperformed the polynomial kernel, especially in the case of replay attack detection [4]. However, in the case of artificial speech detection, the best kernel for SVM is the polynomial kernel, as shown in Table 2.

Table 2. Performance comparison with recent works that used SVM as the classifier
Model | EER (%)
… [26] | 9.08
Model 5: LFCC + polynomial SVM [27] | 2.92
Model 6: LFCC-GMM + GAT-S + GAT-T + RawNet2 + polynomial SVM [28] | 1.68
Model 7: X-vectors + linear SVM [29] | 7.12

CONCLUSION
In this paper, various SVM kernels were compared to identify the best kernel for artificial speech detection when applied to the presented handcrafted features. Resampling was conducted to reduce the effect of an unbalanced dataset on overfitting. Three categories of features were used in the experiment, namely MFCC, hexadecimal-based, and image-based features. Feature fusion was applied to improve the performance of artificial speech detection using SVM. The ASVspoof 2019 logical access dataset was used in the experiment. Results showed that the polynomial SVM with feature normalization performed the best. In addition, it was found that feature normalization improved the result of artificial speech detection. Future work is directed at the extraction of deep learning-based features and ensemble classification, as well as the integration of voice PAD and ASV systems.