Sometime ago I wrote a post comparing the speed of importing datasets based on their format. In that post I compared
fread, from the data.table package, for importing CSV files and also an R SQL query to read the same data from an SQLite database.
The short conclusion from that earlier post was that
fread is always much quicker than
read.csv and also quicker than returning data from the SQLite database. The SQL query route was almost as quick as
fread if the query was structured in such a way as to return specific fields, rather than just
SELECT * from db. The most interesting point was that
fread was quicker importing the entire 52MB test CSV file, than it was to subset certain columns and then import.
Enough of the past. Recently, I've once again had cause to investigate data loading times across different file types. In this case, the files were a CSV, SQLite database and also an Rds file.
An Rds file is an R specific file format. A single object saved to a binary Rds file is serialised, retaining all its structures and properties. This avoids the
stringsAsFactors = FALSE dance with CSVs, and various other tricky SQLite data type issues (dates are especially tricky). The other great thing about Rds files is that they are compressed by default, which drastically reduces file sizes when compared to the other two formats. The downside is that the Rds file then needs to be re-inflated when reading back into R.
While creating the pinhooker R package, I was working with all three file types, which the package supports creating. Once the full 152,000 rows of data, across 19 columns, had been imported, the resulting approximate file sizes were as follows:
- CSV: 34.9MB
- SQLite: 35MB
- Rds: 4MB
The Rds file is smaller by greater than a factor of ten. That's quite a size saving.
This is the main reason I chose to use Rds as the default file type produced by the pinhooker package, rather than a more common CSV. The pinhooker dataset will continue to grow, and therefore so will the file sizes. I'd much rather an installable R package of around 4MB, than a requirement to download in excess of 35MB.
Nonetheless, it's still worth looking at comparative file loading times. Let's start with the CSV and the base R
# Start the clock! ptm <- proc.time() # Load CSV file bloodstockSaleData <- read.csv("bloodstockSalesData.csv", sep=",", as.is = TRUE) # Stop the clock proc.time() - ptm user system elapsed 4.596 0.032 4.869
Not terribly slow, but still a noticeable lag while waiting for the code to complete.
fread from the data.table package:
# Start the clock! ptm <- proc.time() # Load necessary libraries library(data.table) # Load CSV file bloodstockSaleData <- fread("bloodstockSalesData.csv", sep=",") # Stop the clock proc.time() - ptm user system elapsed 0.840 0.028 0.863
Blisteringly quick, and almost half of that time is accounted for by the library loading. A second load of the csv, with the data.table library already loaded results in:
user system elapsed 0.520 0.008 0.525
Now an SQL query to return data from the SQLite database:
# Start the clock! ptm <- proc.time() # Load necessary libraries library(RSQLite) # Load SQL data con <- dbConnect(SQLite(), "bloodstockSalesData.sqlite") sql1 <- paste("SELECT * FROM bloodstockSalesData", sep="") bloodstockSalesData <- dbGetQuery(con, sql1) dbDisconnect(con) # Stop the clock proc.time() - ptm user system elapsed 2.284 0.320 8.865
Substantially slower than even
read.csv. Is there a significant difference if the RSQLite library is already loaded?
user system elapsed 0.932 0.040 0.968
Most certainly there is. Still not as fast as
fread, but if multiple SQL queries were required, after the RSQLite library is first loaded, import is quite quick.
Let's look at loading an Rds file.
# Start the clock! ptm <- proc.time() # Load Rds data bloodstockSalesData <- readRDS("bloodstockSalesData.rds") # Stop the clock proc.time() - ptm user system elapsed 1.324 0.020 1.409
Not quite as fast as
fread or an SQL query with library already loaded, but much quicker than
readRDS doesn't match the outright speed of
fread, the vastly reduced file size is certainly a very fair trade-off.
Hadley Wickham's readr package is also worth mentioning, as it supports fast reading of both CSVs and Rds files. For the sake of completeness, these functions were also tested.
# Start the clock! ptm <- proc.time() # Load necessay libraries library(readr) # Load CSV data bloodstockSalesData <- read_csv("bloodstockSalesData.csv") # Stop the clock proc.time() - ptm user system elapsed 1.456 0.020 1.474
And again, with library already loaded:
user system elapsed 0.944 0.012 0.948
# Start the clock! ptm <- proc.time() # Load necessay libraries library(readr) # Load Rds data bloodstockSalesData <- read_rds("bloodstockSalesData.Rds") # Stop the clock proc.time() - ptm user system elapsed 1.828 0.024 1.848
And lastly, with readr already loaded:
user system elapsed 1.396 0.004 1.398
Unfortunately, the readr package doesn't seem to offer an advantage, in terms of raw loading speed, for this sample set of files. Although there are other nice aspects of readr, such as the CSV parsing failure warnings returned.
It may also be worth averaging times across multiple data loads, as there will be small variations, but not large enough to change the overall outcome in this scenario.
fread still wins the outright speed sprint when it comes to loading CSV data in R. However, with the radically reduced file size, object immutability and still very quick loading times, the Rds file is an excellent overall choice.
sessionInfo() R version 3.2.3 (2015-12-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux stretch/sid locale:  LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8  LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C  LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages:  stats graphics grDevices utils datasets methods base other attached packages:  RSQLite_1.0.0 DBI_0.3.1 readr_0.2.2 data.table_1.9.6 loaded via a namespace (and not attached):  tools_3.2.3 Rcpp_0.12.3 chron_2.3-47