PCA-based dimensionality reduction for face recognition

In this paper, we conduct a comprehensive study of dimensionality reduction (DR) techniques and discuss the most widely used statistical DR technique, principal component analysis (PCA), in detail with a view to addressing the classical face recognition problem. We then propose a solution to both typical-face and individual-face recognition based on the principal components constructed from the face images using PCA. We simulate the proposed solution with several training and test sets of manually captured face images, as well as with the popular Olivetti Research Laboratory (ORL) and Yale face databases. The performance measures of the proposed face recognizer signify its superiority. This is an open access article under the CC BY-SA license.

INTRODUCTION

Data mining is a way of extracting or mining knowledge from large amounts of data [1]-[4]. In developing a data mining application, the amount of data taken from various repositories, such as databases, data warehouses, and the World Wide Web (WWW), is typically too huge to be either stored or processed. Analyzing complex data and mining huge amounts of data may require a long time, which sometimes makes such analysis impractical or infeasible. Data reduction techniques are traditionally applied to find a reduced representation of the dataset that is much smaller in size while closely maintaining the integrity of the original data. Mining on the reduced dataset should then be more efficient while producing the same, or almost the same, analytical results. The common strategies for data reduction include data cube aggregation, attribute subset selection, dimensionality reduction (DR), and numerosity reduction [1].
Recently, dataset sizes, in terms of both the number of records and the number of attributes, have been exploding very rapidly, which has prompted the development of a number of big-data platforms and parallel data analytics algorithms, as well as the efficient use of DR procedures. In order to handle real-world data effectively, its dimensionality needs to be reduced to an effective (more economic) amount. DR is the study of transformation methods for reducing the number of dimensions describing a high-dimensional object into a meaningful representation of reduced dimensionality. Ideally, the reduced representation of a dataset should have a dimensionality that corresponds to the intrinsic dimensionality of the dataset, i.e., the minimum number of parameters needed to account for the observed properties of the data. The general objectives of DR are to remove irrelevant and redundant data, in order to reduce the manipulation cost and avoid over-fitting, and to increase the quality of the data for data-intensive processing tasks, such as pattern recognition, data mining, visualization, database navigation, and compression of high-dimensional data. As such, DR offers an effective solution to the well-known problem of the "curse of dimensionality" and mitigates other undesired properties of high-dimensional spaces [5]. Mathematically, a DR technique converts a given dataset, represented as an n × D matrix X consisting of n data vectors x_i, i = 1, 2, ..., n, each of dimensionality D, into another dataset Y with an intrinsic dimensionality d, where d < D, and often d << D. The intrinsic dimensionality of the data signifies that the points of dataset X lie on or near a manifold of dimensionality d that is embedded in the D-dimensional space.
In other words, DR methods encode the given dataset X of dimensionality D into a new dataset Y of dimensionality d while retaining the geometry of the data as much as possible. In general, neither the intrinsic dimensionality d of the dataset X nor the geometry of the data manifold is completely known. Therefore, DR is an ill-posed problem that can only be solved by assuming certain properties of the data, such as its intrinsic dimensionality [5].
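The n × D to n × d mapping described above can be sketched concretely. The following is a minimal illustrative example (not the paper's implementation) of a PCA-style linear reduction using the singular value decomposition; the function name and dimensions are our own choices:

```python
import numpy as np

def reduce_dimensionality(X, d):
    """Project an n x D data matrix X onto its top-d principal directions,
    yielding an n x d representation Y (a PCA-style linear reduction)."""
    Xc = X - X.mean(axis=0)              # center each attribute
    # Right singular vectors of the centered data are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                 # n x d reduced dataset

# Example: 100 points in D = 50 dimensions reduced to d = 3
X = np.random.rand(100, 50)
Y = reduce_dimensionality(X, 3)
print(Y.shape)   # (100, 3)
```

Note that d here plays the role of the (assumed) intrinsic dimensionality; in practice it must be chosen, since it is not completely known.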
Some DR techniques aim at producing smaller images and at compression, while others serve machine learning purposes (e.g., better data analysis, classification, statistics, and visualization) [6]. In machine learning, dimensionality reduction is usually concerned with feature vectors. In this case, DR techniques can be divided into two categories: feature extraction and feature selection methods. Feature extraction can further be divided into linear and nonlinear methods. The main goal of some methods is to preserve fidelity with respect to the original data under a certain metric, such as mean squared error, while the goal of others is to improve the performance of a typical task, such as classification, prediction, or visualization [7]. Linear feature extraction methods include principal component analysis (PCA), factor analysis, independent component analysis (ICA), and linear discriminant analysis (LDA). Nonlinear feature extraction methods include front-ranked techniques such as multidimensional scaling (MDS), Isomap, maximum variance unfolding, and kernel PCA [5]. Feature selection is divided into feature ranking and feature subset selection. Feature ranking commonly uses scoring functions such as Euclidean distance, correlation, and information gain ratio. On the other hand, feature subset selection methods are divided into filter, wrapper, and embedded methods; the filter methods do not use any learning algorithm [8].
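As a small illustration of filter-style feature ranking with one of the scoring functions mentioned above (correlation), the following sketch ranks features by the absolute Pearson correlation of each column with a target; the function and data are hypothetical, not from the paper:

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Score each feature by the absolute Pearson correlation of its column
    with the target y; return feature indices sorted best-first and the scores.
    A simple filter method: no learning algorithm is involved."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)   # feature 3 drives the target
order, scores = rank_features_by_correlation(X, y)
print(order[0])   # 3 -- the most informative feature is ranked first
```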
In this paper, after conducting a comprehensive study of DR techniques, we present a face recognition approach using the PCA transformation. We perform experiments using the Olivetti Research Laboratory (ORL) and Yale face databases. The experimental results manifest the superiority of the proposed method. The main contributions of this paper are: i) a comprehensive study of DR techniques; ii) the technical and mathematical intuitions behind the PCA approach; iii) two face recognition proposals using PCA data; and iv) a performance evaluation on the ORL and Yale face databases.
The remainder of this paper is organized as follows. We provide the technical details of the PCA method in section 2. Then, we discuss the works related to ours in section 3. After that, we explain the proposed face recognition approach in section 4. The experiments and results are provided in section 5. Finally, we summarize and conclude the findings and observations in section 6.

PRINCIPAL COMPONENT ANALYSIS
The constituent attributes of a real-world dataset reveal relationships among them. These relationships are often linear or approximately linear, which makes the attributes amenable to common analysis techniques. One such technique is PCA, which rotates the original data to new coordinates with a view to making the data as flat as possible. PCA is a statistical transformation that identifies patterns in data by detecting the correlation between attributes [9]. The attempt to reduce dimensionality only makes sense if a strong correlation between attributes exists. PCA finds the directions of maximum variance in high-dimensional data and then projects it onto a reduced-dimensional subspace while retaining most of the information of the original dataset [10]. Mathematically, given a matrix of two or more attributes, PCA produces a new matrix with the same number of attributes, called the principal components. Each principal component is a linear transformation of the entire original dataset. The principal components are calculated in such a way that the first principal component holds the maximum variance, the second the next largest variance, and so on.

The required statistics are defined as follows. The mean of an attribute A with n data points A_1, ..., A_n is the sum of the data points divided by the number of data points, i.e., Ā = (Σ A_i)/n. The mean is the value most commonly referred to as the average, and the mean vector of a dataset is often referred to as the centroid. The variance is roughly the arithmetic average of the squared distance from the mean, defined as Var(A) = Σ (A_i − Ā)² / (n − 1), where Ā is the mean of the data. Note that the standard deviation (σ) is the square root of the variance. The covariance of two attributes S and T is Cov(S, T) = Σ (S_i − S̄)(T_i − T̄) / (n − 1), where S̄ and T̄ denote the means of S and T, respectively. The covariance matrix C collects the covariances of all pairs of attributes, with entry C_jl = Cov(A_j, A_l). PCA computes the eigenvectors of C, selects the k eigenvectors with the largest eigenvalues to form a D × k projection matrix W, and obtains Y = XW, where Y is the transformed n × k-dimensional samples in the new subspace.
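The statistics above (centroid, covariance with the n − 1 denominator, top-k eigenvectors, and the projection Y = XW) can be sketched end to end; this is an illustrative implementation, not the paper's MATLAB code:

```python
import numpy as np

def pca(X, k):
    """PCA as described above: center by the mean vector (centroid), form the
    covariance matrix with the n-1 denominator, take the top-k eigenvectors,
    and project to get the transformed n x k samples Y = Xc @ W."""
    n = X.shape[0]
    mean = X.sum(axis=0) / n              # mean = (sum of data points) / n
    Xc = X - mean                         # center the data
    cov = (Xc.T @ Xc) / (n - 1)           # covariance matrix, n-1 denominator
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: cov is symmetric
    order = np.argsort(eigvals)[::-1]     # sort by variance, largest first
    W = eigvecs[:, order[:k]]             # D x k projection matrix
    return Xc @ W                         # Y: n x k samples in the new subspace

X = np.random.rand(50, 10)
Y = pca(X, 2)
print(Y.shape)   # (50, 2)
```

By construction, the first column of Y (the first principal component) carries at least as much variance as the second, matching the ordering described above.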

RELATED WORK
Dash et al. [12] presented a PCA-based entropy measure for ranking features and compared it with a similar feature ranking method (Relief). Maaten, Postma, and Herik investigated the performance of nonlinear techniques on artificial and natural tasks, and also conducted a review and systematic comparison of DR techniques [5]. Spectral DR methods are explained in a short tutorial in [13]. In the review work [14], the authors categorized the plethora of available DR methods and illustrated the mathematical insight behind them. Looga, Ginneken, and Duin proposed a DR technique for image features using the canonical contextual correlation projection in [15]. In [16], the authors provide a comprehensive review and comparison of the performance of the principal dimension reduction methods proposed in the approximate Bayesian computation literature. Silipo, Adae, and Berthold discussed seven techniques for DR, namely missing values, low variance filter, high correlation filter, PCA, random forests, backward feature elimination, and forward feature construction, in [17]. Joshi and Machchhar [18] conducted a comprehensive survey of DR methods and proposed a DR method that depends upon the given set of parameters and varying conditions. The authors of [19] found that recursive feature elimination, and genetic and evolutionary feature weighting and selection, give better classification results than PCA. Several works have also addressed recognition problems based on PCA in various ways. Huang and Yin [20] compared and investigated linear PCA and various nonlinear techniques for face recognition. Alkandari and Aljaber [21] presented the importance of PCA for identifying facial images without human intervention. Dandpat and Meher proposed a face recognition method with improved performance using PCA and two-dimensional PCA in [22].
PCA in the linear discriminant analysis space for face recognition was proposed by Su and Wang [23]. The work in [24] investigates the performance achieved when two DR methods, self-organizing map (SOM) and PCA, are combined.

PROPOSED APPROACH TO FACE RECOGNITION
In this paper, after discussing the working principle of PCA in detail, we propose a solution to the face recognition problem based on the principal components of the training grayscale face image matrices. The proposal is a customization of various existing principal components-based classifiers. The main customization lies in deriving the training and test sets, where the images are placed as matrices rather than as vectors (as in traditional approaches), and in introducing the transposes of the main sets, as discussed later. To implement the proposal, the face recognition problem is divided into two categories.

Problem statement-1: Recognition of a typical face
Given a new image, classify it as "face" or "non-face" using a set of N original people's face images, each R pixels high by C pixels wide, i.e., of pixel resolution R × C. To solve this, we merge the N training image matrices into a single big matrix by placing them one after another. Then, we also place the input image matrix N times, one after another, to form another big matrix. After that, we take the transposes of both big matrices. Subsequently, we apply PCA on the four big matrices and select k eigenvectors for each. We then determine the similarity of the normal input big matrix with the normal training big matrix, and of the transposed input big matrix with the transposed training big matrix, using the selected k features (eigenvectors). Finally, the decision is taken based on the similarity result. The solution is illustrated with the following steps:
- Step 1: Input the N original images of size R × C.
- Step 2: For each of the N images, convert the image to a matrix of dimension R × C.
  - Step 2.1: Put all the matrices together in one big image-matrix, Train1, by placing them one after another.
  - Step 2.2: Take the transpose of Train1 and assign it to another matrix, Train2: Train2 = Transpose(Train1).
- Step 3: For the new image to be classified,
  - Step 3.1: Convert the image to a matrix of dimension R × C and put it N times together in another big image-matrix, Test1.
  - Step 3.2: Take the transpose of Test1 and assign it to another matrix, Test2: Test2 = Transpose(Test1).
- Step 4: Apply PCA on Train1, Train2, Test1, and Test2, and select the k top eigenvectors of each.
- Step 5: Determine the similarity of Test1 with Train1 and of Test2 with Train2 using the k extracted features.
- Step 6: Classify the new input image as "face" if the similarity is highest, or as "non-face" otherwise.
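The steps above can be sketched as follows. This is a minimal illustrative sketch under stated assumptions: the helper names, the value of k, the threshold alpha, and the use of the mean absolute eigenvector difference as the similarity measure are our own choices, not the paper's exact implementation:

```python
import numpy as np

def top_k_eigvecs(M, k):
    """Top-k eigenvectors of the covariance matrix of M (columns as attributes)."""
    Mc = M - M.mean(axis=0)
    cov = (Mc.T @ Mc) / (M.shape[0] - 1)
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, np.argsort(vals)[::-1][:k]]

def classify_face(train_images, test_image, k=5, alpha=0.1):
    """Sketch of Steps 1-6 for problem statement 1."""
    N = len(train_images)
    train1 = np.vstack(train_images)        # Step 2.1: N stacked R x C matrices
    train2 = train1.T                       # Step 2.2: its transpose
    test1 = np.vstack([test_image] * N)     # Step 3.1: input tiled N times
    test2 = test1.T                         # Step 3.2: its transpose
    # Steps 4-5: k eigenvectors of each big matrix; similarity via the
    # average of the pairwise eigenvector differences (ideally near zero).
    d1 = np.mean(np.abs(top_k_eigvecs(train1, k) - top_k_eigvecs(test1, k)))
    d2 = np.mean(np.abs(top_k_eigvecs(train2, k) - top_k_eigvecs(test2, k)))
    # Step 6: "face" if the averaged differences fall within the threshold.
    return "face" if (d1 + d2) / 2 < alpha else "non-face"

img = np.random.rand(8, 8)                  # dummy 8 x 8 "face" image
print(classify_face([img, img, img], img))  # prints "face"
```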

Problem statement-2: Recognition of individual face
Given a new image, classify it to the most similar image(s) from a set of N original face images for each of m people, each image being R pixels high by C pixels wide, i.e., of size R × C. To solve this, we merge the N training image matrices for each of the m people into a separate single big matrix by placing them one after another. Then, we also place the input image matrix N times, one after another, to form another big matrix. After that, we take the transposes of all big matrices. Subsequently, we apply PCA on all big matrices and select k eigenvectors for each. We then determine the similarity of the normal input big matrix with all normal training big matrices, and of the transposed input big matrix with all transposed training big matrices, using the selected k features. Finally, the decision is taken based on the similarity result. To determine the similarity for both problem statements, first, each eigenvector in a training set is subtracted from its corresponding eigenvector in the testing set. Then the result for each eigenvector is averaged. The new instance is classified as "yes" if the average values are near a threshold value, say α, which would ideally be around zero (0).
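The similarity test shared by both problem statements can be sketched as follows; the function name and the particular value of α are illustrative assumptions, not values given in the paper:

```python
import numpy as np

def is_match(train_eigvecs, test_eigvecs, alpha=0.05):
    """Subtract each training eigenvector from its counterpart in the test
    set, average the result per eigenvector, and answer "yes" (True) when
    every average is near zero, i.e., within the threshold alpha."""
    diffs = train_eigvecs - test_eigvecs          # pairwise differences
    per_vector_avg = np.abs(diffs.mean(axis=0))   # one average per eigenvector
    return bool(np.all(per_vector_avg < alpha))

E = np.random.rand(64, 5)        # 5 eigenvectors of length 64
print(is_match(E, E))            # True: identical sets differ by exactly zero
print(is_match(E, E + 1.0))      # False: averages far from the threshold
```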

RESULTS
The proposed method for face recognition based on principal components has been implemented on the MATLAB simulation platform. The implemented code has been tested on some common face images captured manually. In addition, it has been tested on two popular face image databases: ORL and Yale. The ORL database contains 10 different grayscale images of each of 40 distinct subjects. For some of the subjects, the images were taken at different times and with variations in lighting and facial expressions. All images were captured against a dark homogeneous background with the subjects in an upright, frontal position. Sample images are shown in Figures 1(a-c), respectively [25], while Table 1 shows the results on different data distributions. For each database, the training and testing sets are created in the manner mentioned above. For the first problem statement, a random subset of images from every subject was taken to form the training set Train1, and thus Train2. The remaining images were considered the testing set Test1, and thus Test2. For the second problem statement, a random subset of images per subject was taken to form the training set Train3, and thus Train4. Any of the remaining image(s) of the respective subject, from which the training sets were formed, was considered the testing set Test3, and thus Test4. The recognition results of the proposed method were quite acceptable, especially because of the training sets Train2 and Train4, which are the transposes of the original training sets Train1 and Train3, respectively. The recognition accuracy can, however, decrease significantly with inconsistent images in the training sets.
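The per-subject random split described above can be sketched as follows; the function name, split ratio, and dummy image data are illustrative assumptions (only the 40-subject, 10-image structure comes from the ORL description):

```python
import numpy as np

def split_per_subject(images_by_subject, n_train, rng):
    """For every subject, send a random subset of n_train images to the
    training set and the remaining images to the testing set."""
    train, test = [], []
    for imgs in images_by_subject:
        idx = rng.permutation(len(imgs))
        train += [imgs[i] for i in idx[:n_train]]
        test += [imgs[i] for i in idx[n_train:]]
    return train, test

rng = np.random.default_rng(42)
# 40 subjects x 10 images each, as in ORL (dummy 4 x 4 arrays here)
db = [[np.zeros((4, 4)) for _ in range(10)] for _ in range(40)]
train, test = split_per_subject(db, n_train=5, rng=rng)
print(len(train), len(test))   # 200 200
```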

CONCLUSION AND FUTURE WORK
The comprehensive overview of DR techniques and the working principle of PCA discussed here can serve as ingredients for developing a typical image-data mining application. The proposed method for face recognition based on principal components can mostly be used in applications where a few images are enough for training. The proposed approach can be used not only for face recognition but also for recognizing other kinds of objects in the same manner. In the future, the proposed technique will be applied to the ORL and Yale databases completely, along with other face databases, and its performance will be compared with existing classifiers based on either machine learning algorithms or other statistical approaches. In addition, an adaptive range of the threshold α for recognizing an instance will be determined.