Comparison of Data Partitioning Schema of Parallel Pairwise Alignment on Shared Memory System

The pairwise alignment (PA) algorithm is widely used in bioinformatics to analyze biological sequence. With the advance of sequencer technology, a massive amount of DNA fragments are sequenced much quicker and cheaper. Thus, the alignment algorithm needs to be parallelized to be able to align them in a shorter time. Many previous researches have parallelized PA algorithm using various data partitioning schema, but it is unknown which one is the best. The data partitioning schema is important for parallel PA performance, because this algorithm uses dynamic programming technique that needs intense inter-thread communication. In this paper, we compared four partitioning schemas to find the best performing one on shared memory system. Those schemas are: blocked columnwise, rowwise, antidiagonal, and blocked columnwise with manual scheduling and loop unrolling. The testing is done on quad-core processor using DNA sequence of 1000 to 16000 bp as the input. The blocked columnwise with manual scheduling and loop unrolling schema gave the best performance of 89% efficiency. The synchronization time is minimized to get the best performance possible.This result provided high performance parallel PA with fine-grain parallelism that can be used further to develop parallels multiple sequence alignment (MSA).


Introduction
The pairwise alignment (PA) algorithm is used in bioinformatics to align a pair of DNA or protein sequences of certain organism, in order to determine the similarity between them [1].It uses dynamic programming technique to get the best alignment result with complexity of O (n 2 ), where n is the sequences' length [2].It is the foundation of the multiple sequence alignment (MSA) algorithm to align more than two sequences altogether.Other than that, it is also used for database sequence searching to find the most similar sequence to the one that is given [3].
The next-generation DNA sequencer technology nowadays can produce a lot of sequence data, up to hundreds of billion base pair (bp) in one run [5].This big data needs faster processing, so the algorithm needs to be parallelized to speed up the alignment process.Many researches have parallelized MSA, such as Praline [6], ClustalW-MPI [7], MT-ClustalW [8], MAFFT [9], and star algorithm [10].But only a few that have parallelized PA, such as ParAlign [11] and CudaSW [12].
The data dependency of PA is high due to its dynamic programming nature.Because of this, the data partitioning schema on parallel PA algorithm affects the performance greatly.Several data partitioning schemas that have been applied for PA parallelization were: columnwise [13], diagonal [11], rowwise [14], blocked columnwise [15], and blocked antidiagonal [16].Unfortunately, the best partitioning schema on shared memory system is not yet known.
In this paper, we parallelized PA algorithm on shared memory system using four different data partitioning schemas: blocked columnwise, rowwise, antidiagonal, and revised blocked columnwise.We tested and revised each schema to obtain the highest performance possible of parallel PA algorithm.

Research Method 2.1. Implementation of Sequential Pairwise Alignment Algorithm
The first step is to implement PA algorithm in sequential manner using global alignment approach.The longest common sequence (LCS) algorithm [17] is used as a basis to develop the sequential PA algorithm.The LCS itself is a global alignment algorithm that is more known as Needleman-Wunsch algorithm.The alignment score is set as follow: the score is incremented by one if both DNA residues are match; else the score is substracted by one.The gap penalty is applied by initializing the score of zeroth row and zeroth column multiplied by its distance from starting point (upper-left corner).
An example of alignment table calculation using gap penalty of -3 can be seen on Figure 1.This table aligns AGTCA and ATGA sequence resulting in alignment score of 1 (the value of right-bottom corner) and alignment result as follows.

AGTCA A-TGA
This algorithm correctness is verified by aligning two sequences from BAliBASE DNA sequence alignment benchmark [18].These sequences are Saccharomyces cereviseae GlyRS1 (SYG_YEAST) and Schizosaccharomyces pombe GlyRS (SYG_SCHPO).Our alignment result is compared with the result of ClustalW 2.1 program using default option.

Implementation of Parallel Pairwise Alignment Algorithm
The second step is to develop the parallel version of this algorithm and verify its correctness.Parallelization is implemented using OpenMP, because it is much easier to implement rather than by using processor instruction directly [11] or by using Pthreads [8], [9].OpenMP is a library to parallellize sequential program into multithreaded on a shared memory system [19].Four partitioning schemas were tested: blocked columnwise, rowwise, antidiagonal, and blocked columnwise with manual scheduling and loop unrolling.The correctness of parallel PA is verified by comparing its output with the sequential one to satisfy the sequential consistency [20].

Blocked columnwise
The blocked columnwise partitioning schema divides the alignment table per block of columns [15].The -th thread ( ) gets its part of a block from column to , where is the number of columns and is the number of threads.For example, if the number of columns was 9 and the number of threads was 3, then would get its part of a block of column 1-3.This partitioning schema is illustrated on Figure 2.

Rowwise
The rowwise partitoning schema divides the alignment table per row [14].The -th thread ( ) gets its part of -th row, where , is the number of rows, and is the number of threads.For example, if the number of rows was 6 and the number of threads was 3, then would get its part of 1st and 4th row.This partitioning schema is illustrated on Figure 3.

Antidiagonal
The antidiagonal partitioning schema divides the alignment table per antidiagonal, crossing from bottom-left to top-right [16].The -th thread ( ) gets its part of -th antidiagonal, where , is the number of rows, is the number of columns, and is the number of threads.For example, if the number of rows was 6, the number of columns was 9, and the number of threads was 3, then would get its part of 1st, 4th, 7th, 10th, and 13th antidiagonal.This partitioning schema is illustrated on Figure 4.

Blocked columnwise with manual scheduling and loop unrolling
The first blocked columnwise schema is not optimal because the usage of OpenMP parallel for construct, that is simple but less flexible.To overcome this shortcoming, the data  partitioning mechanism needs to be done manually.The next revision is by applying loop unrolling technique, i.e. merging several loop iterations into one.This technique will reduce processor instruction and increase cache locality [21], [22].

Performance Comparison
The final step is to test the performance of each data partitioning schemas.Several DNA sequence data is generated randomly with varying length of 1000, 2000, 4000, 8000, and 16000 bp as the input.The alignment table calculation is timed using omp_get_wtime function.This process is repeated ten times for each schemas and sequence lengths, and then take the average as the final execution time.The final execution time is compared to get the best performing data partitioning schema.
The test is conducted using Intel Core i5-3470 processor that has 4 cores, Debian GNU/Linux 64-bit operating system, and GCC 4.8.1 compiler that supports OpenMP 3.1 specification.

Results and Analysis 3.1. Implementation of Sequential Pairwise Alignment Algorithm
The pseudocode for PA algorithm is presented as a procedure called PAIRWISE-ALIGN, which takes two sequences of X and Y as the parameters.The score of each cell is calculated by choosing the maximum score between three possible alignment directions: diagonal, up, and left.The diagonal score is calculated by the help of SIMILARITY procedure: if the DNA residue is equal, the score is added by MATCH score, else substracted by MISMATCH score.The up and left score are calculated by substracting it by linear GAP penalty score.
In general, our PA algorithm has been able to align two sequences correctly, whereas ClustalW produces alignment with more contiguous gap.This is caused by the lack of gap extension feature in our PA algorithm.The comparison of alignment result for SYG_YEAST and SYG_SCHPO between ClustalW and our PA is as follows.Based on this result, our simple PA algorithm has been implemented correctly and will become our basis to develop the parallel version of PA algorithm.

Blocked columnwise
The synchronization time that is caused by inter-thread data dependency for blocked columnwise schema is low.Only the first cell of each row (asterisk sign on Figure 2) that must be checked whether its left cell has been filled by another thread or not yet?.If a thread runs slower, then the next thread must wait until the former thread finished one row of its part.
The synchronization could be done only for m × t times, but because of OpenMP limitation the synchronization still be done for m × n times.The usage of OpenMP parallel for directive although simpler but less flexible, so the checking is still done at each cell.Thus, the parallel algorithm performance becomes less efficient.Below is the pseudocode to parallelize PA using this schema.
The result yields up to 3.00 times faster execution time using blocked columnwise schema on 4 threads.Thus, the efficiency-speedup per number of threads-of this schema is 75%.

Rowwise
The synchronization time for rowwise schema is high, it is done for m × n times.Every single cell must check whether its upper cell has been filled by another thread or not yet.Despite of that, the parallel algorithm performance of this schema does not become worse.This is caused by the nature of shared memory system, i.e. no data movement involved between threads.Below is the pseudocode to parallelize PA using this schema.
parallel for i = 1 to m // schedule(static,1) Unexpectedly, the result of rowwise schema is better than blocked columnwise schema.It yields up to 3.18 times faster execution time using this schema on 4 threads.Thus, the efficiency of this schema is 80%.

Antidiagonal
The synchronization time for antidiagonal schema is very high, two times of the rowwise schema, that is 2 × m × n times.Every single cell must check whether its left and upper cells have been filled by another thread or not yet.Moreover, the non-linear memory access pattern of this schema makes the parallel algorithm performance much worse.Below is the pseudocode to parallelize PA using this schema.
As expected, the result of antidiagonal schema is the worst among the others.It yields up to 1.44 times faster execution time using this schema on 4 threads.Thus, the efficiency of this schema is only 36%.

Blocked columnwise with manual scheduling and loop unrolling 3.2.4.1. Manual scheduling
The loop scheduling is done manually using block decomposition method [23] as a substitute for OpenMP parallel for construct.Therefore, the synchronization can be done for only m × t times, that is only once at first cell of each row (asterisk sign on Figure 2).This manual scheduling makes blocked columnwise schema more efficient.Below is the pseudocode of blocked columnwise schema that has been revised using manual scheduling.

Loop unrolling
The loop for each row will be unrolled by factor of two, i.e. iteration for two rows will be merged as one.This is done by making two loop copy for C i,j and C i+1,j and also using increment of two for each iteration.A checking mechanism is added in the middle of the loop to check whether the last row has been reached or not.This is necessary to anticipate when the number of rows (m) is not divisible by the unroll factor.This loop unrolling technique by factor of two reduces the synchronization time to half, that is ½ × m × t.Below is the continued pseudocode of blocked columnwise schema that has been revised using loop unrolling by factor of two.
The result of the blocked columnwise schema with manual scheduling and loop unrolling by factor of two yields up to 3.54 times faster execution time on 4 threads.Thus, the efficiency of this schema is 89%.

Verification of Parallel PA Algorithm Correctness
All version of the parallel PA above have been tested for sequential consistency by comparing their outputs with the sequential PA algorithm.All of them produced the same output, therefore the parallel algorithm implementation are 100% consistent with sequential algorithm.This verification is important because incorrect implementation of multithreaded program leads to race condition that result in inconsistent output.

Performance Comparison
The antidiagonal schema yields the worst performance with efficiency of 36%.It is understandable because of its non-linear memory access pattern that causing a lot of cache misses.
The rowwise schema yields better performance than blocked columnwise schema with efficiency of 80% and 75% respectively.This result is beyond our expectation, because rowwise schema has more inter-thread data dependency.The main reason of this is because the implemention is on shared memory system, where data movement between threads is nonexistent, i.e. each thread is accessing the same shared memory space.The result would be different if it is implemented on distributed memory system.Another reason is that the first blocked columnwise schema above is not optimal yet.
The second blocked columnwise schema that has been revised yields the best performance with efficiency of 89% on 4 threads.The loop unrolling technique (merging two row iterations into one) on this schema has been proven to reduce synchronization time, hence increasing the computation portion of the parallel algorithm.The comparison of those data partitioning schemas that have been tested is shown on Figure 5.

Conclusion
The revised blocked columnwise schema using manual scheduling and loop unrolling yields the highest performance of 89% efficiency.The manual scheduling makes the blocked columnwise schema more efficient and the loop unrolling technique halves the synchronization time.In shared memory system, the performance is defined by the portion of synchronization time.The synchronization time must be minimized to maximize the performance.
We have presented parallel PA algorithm with high performance and fine-grain parallelism that could be used further as a component to develop parallel MSA algorithm on hybrid shared-distributed memory system using OpenMP and message-passing interface (MPI).