# Finding the optimum partition size while parallelizing large matrix operation in R

Problem Statement:

Finding the optimum partition size while parallelizing large matrix operation in R

Description:

How does R’s parallel system work?

Before moving to the analysis, we need to know how R’s parallel packages work. R is essentially a single-threaded system, but by loading some of the parallel packages available for R we can achieve parallelism. There are three important packages for parallelism in R: ‘parallel’, ‘doParallel’ and ‘foreach’. Here the ‘doParallel’ package acts as an interface between ‘parallel’ and ‘foreach’. By default ‘doParallel’ uses ‘multicore’ functionality on Unix-like systems and ‘snow’ functionality on Windows.

NOTE:

1. ‘Multicore’ functionality runs tasks only on a single computer, not on a cluster of computers.
2. It is pointless to use the ‘parallel’ and ‘doParallel’ packages on a machine that has only one processor with a single core.

With the makeCluster() function we can create a cluster of N worker processes for our task, and with the registerDoParallel() function we can register that cluster, with M cores, as the backend for parallel execution. For this study we will consider N = M.
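The setup described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the worker count of 2 and the toy squaring task are assumptions chosen only to keep the example small.

```r
library(doParallel)  # also attaches 'foreach' and 'parallel'

n_workers <- 2                 # N = M, as in this study; 2 is an arbitrary choice
cl <- makeCluster(n_workers)   # start N worker processes
registerDoParallel(cl)         # register them as the backend for %dopar%

# Toy task (hypothetical): square each number on the workers.
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2

stopCluster(cl)                # always release the workers when done
print(squares)
```

Calling stopCluster() at the end matters because the worker processes otherwise keep running after the computation has finished.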

CASE STUDIES

In the following cases we build the upper triangular correlation matrix of a given matrix, using different numbers of cores and different partition sizes. The results were obtained on an m3.xlarge machine with 4 cores and 15 GB of memory.
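The post does not show the code behind these runs, so the following is only a sketch of one way such a partitioned computation could look. It assumes that "partition size" means the number of columns (attributes) per block, and it uses parLapply() from the base ‘parallel’ package to keep dependencies minimal. The names x, partition_size, blocks and pairs are hypothetical, and the matrix is a small random stand-in for the real data.

```r
library(parallel)

set.seed(1)
x <- matrix(rnorm(2000 * 20), nrow = 2000, ncol = 20)  # small stand-in matrix
partition_size <- 5                                     # P: columns per block (assumed)

# Split the column indices into consecutive blocks of at most P columns.
cols <- seq_len(ncol(x))
blocks <- split(cols, ceiling(cols / partition_size))

# Only block pairs (i, j) with i <= j are needed for the upper triangle.
grid <- expand.grid(i = seq_along(blocks), j = seq_along(blocks))
pairs <- grid[grid$i <= grid$j, ]

cl <- makeCluster(2)  # 2 workers, an arbitrary choice
pieces <- parLapply(cl, seq_len(nrow(pairs)), function(k, x, blocks, pairs) {
  # Correlate the columns of block i against the columns of block j.
  cor(x[, blocks[[pairs$i[k]]], drop = FALSE],
      x[, blocks[[pairs$j[k]]], drop = FALSE])
}, x = x, blocks = blocks, pairs = pairs)
stopCluster(cl)

# Assemble the blocks, then blank out everything below the diagonal.
upper <- matrix(NA_real_, ncol(x), ncol(x))
for (k in seq_len(nrow(pairs))) {
  upper[blocks[[pairs$i[k]]], blocks[[pairs$j[k]]]] <- pieces[[k]]
}
upper[lower.tri(upper)] <- NA
```

A smaller partition size produces more, smaller tasks, which is the trade-off the case studies below measure.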

CASE I: In the first case we have a matrix of 20000 observations of 200 attributes (size 20000 X 200). The resulting correlation matrix will be of size 200 X 200.

The following observations were obtained in R for different partition sizes and numbers of cores under Linux and Windows environments. Figure 1: Case I Observation.

In the above observation, the optimum (not minimum) time taken per partition is 2.09 for 2 cores, approximately 2.92 for 3 cores and 1.76 for 4 cores. As per our problem statement, we want to partition the matrix attributes into blocks that are as small as possible while keeping the processing time optimal.

Similar plots were obtained with different numbers of observations in the matrix, such as 50000 X 200, 100000 X 200 and 1000000 X 200.

CASE II: In the second case we have a matrix of 20000 observations of 400 attributes (size 20000 X 400). The resulting correlation matrix will be of size 400 X 400.

The following observations were obtained in R for different partition sizes and numbers of cores under Linux and Windows environments. Figure 2: Case II Observation.

In the above observation, the optimum (not minimum) time taken per partition is 5.84 for 2 cores, approximately 6.44 for 3 cores and 6.94 for 4 cores.

Similar plots were obtained with different numbers of observations in the matrix, such as 50000 X 400 and 100000 X 400.

Conclusion: On the basis of the above analysis, we conclude that the partition size depends on the number of attributes in the matrix and the number of cores available on the machine. The relation between the partition size P, the number of attributes in the matrix A and the number of available cores C is given by:

P ≥ (0.2 × A) / C
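As a quick worked example, the relation can be wrapped in a small helper. The function name min_partition_size is hypothetical, and rounding up to a whole number is an assumption made here because a partition holds a whole number of attributes.

```r
# Smallest partition size P satisfying P >= 0.2 * A / C, rounded up.
# 0.2 * A / C is written as A / (5 * C) to avoid floating-point rounding noise.
min_partition_size <- function(attributes, cores) {
  ceiling(attributes / (5 * cores))
}

min_partition_size(200, 4)  # Case I on 4 cores:  0.2 * 200 / 4 = 10
min_partition_size(400, 2)  # Case II on 2 cores: 0.2 * 400 / 2 = 40
```

For the same matrix, adding cores lowers the smallest recommended partition size.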

NOTE:

1. The aim of the above study is to find the breakpoint for the partition size, after which processing time increases rapidly.
2. The above observations (both cases) may vary on dedicated machines.
3. The case study was carried out for up to 1,000,000 observations with different numbers of attributes.
4. The above relationship was tested and verified for Integer, Double and Boolean data types.

In upcoming blog posts I will go beyond this number of observations and present those results.