Loading Large Data Files in R

To start working with large data sets in R, the first question is how to load the data for further analysis. Our test data consisted of large CSV files containing matrices of bit values, which we needed for subsequent correlation calculations. The options for loading such data in R are:

  • read.csv (base R)
  • the fread function of the data.table package
  • the read.big.matrix function of the bigmemory package
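For reference, here is a minimal sketch of each approach. The file name and the bigmemory backing-file names below are hypothetical; fread and read.big.matrix require the data.table and bigmemory packages, respectively:

    # Hypothetical input: a large CSV containing a numeric (bit) matrix
    file <- "matrix_500.csv"

    # Option 1: base R
    m1 <- as.matrix(read.csv(file, header = TRUE))

    # Option 2: data.table::fread, a multi-threaded CSV parser
    library(data.table)
    m2 <- as.matrix(fread(file, header = TRUE))

    # Option 3: bigmemory::read.big.matrix, which keeps the data in a
    # file-backed big.matrix rather than entirely on R's heap
    library(bigmemory)
    m3 <- read.big.matrix(file, header = TRUE, type = "double",
                          backingfile = "matrix_500.bin",
                          descriptorfile = "matrix_500.desc")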


To test load performance for each option, we used the following machine with 64-bit R installed, loading matrices of bit values:

vCPU: 4 (High Frequency Intel Xeon E5-2670)

Memory: 15 GB

Storage: SSD

Here is a summary of what we found:
[Charts: load-time comparisons for files with 50, 100, 200, and 500 variables]

Our observations:

  • The fread function of the data.table package clearly performs the best, by far.
  • With 15 GB of RAM, the options other than fread could only load files of around 3.5 GB, whereas fread could load files of around 7 GB. This helps us select the right hardware for continuing the correlation calculations on even larger matrices (see the timing sketch below).
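As an illustrative sketch (not our exact benchmark harness), load times can be compared with system.time, assuming the same hypothetical file as above:

    # Rough timing comparison; actual numbers depend heavily on hardware
    system.time(read.csv(file))              # base R
    system.time(data.table::fread(file))     # data.table
    system.time(bigmemory::read.big.matrix(file, header = TRUE,
                                           type = "double"))

Before loading, the CSV's size on disk can also be checked against available RAM, e.g. file.size(file) / 1024^3 gives the size in GB.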

NOTE: In upcoming posts, we will cover our findings on using R with distributed file systems such as HDFS, including loading CSVs from them.
