Faster CSV Import with R
This article is inspired by a Twitter conversation with Jason Hathorn (@JJHeathview) and Tom (@UTVIlla), responding to an earlier post, Create an SQLite database with R.
One of the key advantages of converting large CSV files to an SQLite database is the speed with which subsequent data queries can be returned. Other benefits include the re-usability of data structures and, for those who already know SQL, the re-use of existing knowledge.
After Jason noted how fast the data.table package imports CSVs, Tom suggested fread, from the same package, as a very fast way to import large datasets. This page on the R-Bloggers website also compares the different methods of initially importing CSV files, and fread is clearly a long way ahead of the competition. However, the R-Bloggers page does not include a specific SQL comparison.
The dataset used in these comparisons is a 52MB CSV file, consisting of 44246 rows and 267 columns. This file contains historical rating and racing data from the Formbet service over a seven month period beginning in January 2014. Unfortunately, this specific data cannot be shared publicly.
The R session is restarted and all global variables are cleared between each test. Minimal additional applications are running in parallel. The R scripts are executed within RStudio, which does create some processor overhead. A more rigorous comparison would average multiple executions of each code snippet, although that is most likely overkill for this simple comparison.
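As a sketch of that more rigorous approach, base R can average the elapsed time of an import over several runs. The helper name and run count below are illustrative, not part of the original comparison:

```r
# Sketch: average the elapsed time of a call over n runs.
# Passing a function (rather than a bare expression) avoids R's
# lazy-evaluation caching, so the work is genuinely repeated each time.
mean_elapsed <- function(f, n = 5) {
  mean(replicate(n, system.time(f())["elapsed"]))
}

# e.g. mean_elapsed(function() read.csv("FBDatabase.csv"), n = 5)
```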
Import Entire File
Three methods are used to load the entire data structure into R: read.csv, fread and an R SQL query. The SQL query assumes the original CSV has already been saved as an SQLite database, as discussed in a previous article.
system.time(read.csv(file="FBDatabase.csv", sep=",", header=TRUE, stringsAsFactors=FALSE))
   user  system elapsed
 16.156   0.076  16.226
require("data.table")
system.time(fread("FBDatabase.csv", sep = ',', header = TRUE, stringsAsFactors = FALSE))
   user  system elapsed
  2.468   0.020   2.486
R SQL Query
require("RSQLite")
con <- dbConnect(SQLite(), "formbet.sqlite")
sql1 <- "SELECT * FROM formbet_historic"
system.time(results <- dbGetQuery(con, sql1))
dbDisconnect(con)
   user  system elapsed
  7.704   0.120   7.813
Obviously fread is significantly quicker than both read.csv and the R SQL query. In turn, the R SQL query is faster than read.csv.
Import Specific Fields
These tests involve just importing eight specific fields from the dataset:
Date, Horse, Trainer, Jockey, Course, Time, Win, BSP
These fields represent what may be a valid set of columns if one wished to calculate, for example, a trainer or jockey strike rate at a specific track.
It is possible to read only specific columns using the colClasses argument of read.csv. Unfortunately, the fields cannot simply be named; their numerical positions within the CSV file must be known. This makes read.csv impractical for importing only specific columns from a large CSV file.
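For completeness, a sketch of the colClasses approach. The column positions below are hypothetical (the real positions within the 267-column file are not known here), which is precisely the drawback:

```r
# Sketch: drop unwanted columns in read.csv by marking them "NULL".
# The eight positions below are hypothetical examples; the real ones
# would have to be looked up in the 267-column file first.
classes <- rep("NULL", 267)
classes[c(1, 2, 5, 12, 20, 200, 240, 267)] <- NA  # NA = infer the type as usual
wanted <- read.csv("FBDatabase.csv", colClasses = classes)
```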
require("data.table")
system.time(originalcsv <- fread("FBDatabase.csv", sep = ',', header = TRUE,
    stringsAsFactors = FALSE,
    select = c("Date", "Horse", "Trainer", "Jockey", "Course", "Time", "Win", "BSP")))
   user  system elapsed
  0.324   0.008   0.330
R SQL Query
require("RSQLite")
con <- dbConnect(SQLite(), "formbet.sqlite")
sql1 <- paste("SELECT Date, Horse, Trainer, Jockey, Course, Time, Win, BSP",
              "FROM formbet_historic")
system.time(results <- dbGetQuery(con, sql1))
dbDisconnect(con)
   user  system elapsed
  0.376   0.032   0.407
Again, fread returned results faster than the R SQL query, although the difference is now much smaller. Perhaps more interesting is how little time either method needs to return the eight requested columns compared with importing the entire file.
When to use fread
fread is a very quick method of importing large CSV datasets, and far superior to the more usual read.csv. One-off tasks, such as importing sectional times from a single race, or repetitive tasks using fresh data on each occasion, such as returning selections or shortlisting horses from a daily ratings file, are very well suited to fread. Verbose by default, fread also prints some interesting additional information each time it is used.
When to use an R SQL Query
If the requirements of returning data are slightly more complex, the R SQL query may be more practical. For example, returning the same eight fields, but this time between two specific dates:
require("RSQLite")
con <- dbConnect(SQLite(), "formbet.sqlite")
sql1 <- paste("SELECT Date, Horse, Trainer, Jockey, Course, Time, Win, BSP",
              "FROM formbet_historic",
              "WHERE Date BETWEEN '2014-02-01' AND '2014-03-31'")
system.time(results <- dbGetQuery(con, sql1))
dbDisconnect(con)
   user  system elapsed
  0.112   0.036   0.146
The elapsed time in this case is much shorter, as only a small subset of the data is returned. It does not appear possible to restrict fread to rows between two specific dates at read time.
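A common workaround, sketched below, is to subset the data.table immediately after import. This assumes the Date column holds ISO-formatted text ("YYYY-MM-DD"), so that plain string comparison sorts chronologically, just as the SQL BETWEEN above assumes:

```r
library(data.table)

# Sketch: import the eight columns with fread, then filter by date range.
dt <- fread("FBDatabase.csv",
            select = c("Date", "Horse", "Trainer", "Jockey",
                       "Course", "Time", "Win", "BSP"))

# Date is assumed to be ISO text, so string comparison is chronological.
feb_mar <- dt[Date >= "2014-02-01" & Date <= "2014-03-31"]
```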
If the data is to be re-used over a longer period of time, with multiple analysis tasks taking place, the logic and structure of an SQL database may also have advantages.
It is worth noting that very complex SQL queries can make for slow results. As an example, it is not the fastest route, although certainly possible, to calculate strike rates or financial returns using complex SQL queries from within R. A better approach is to simply return the relevant data fields via an SQL SELECT and perform the calculations with native R code.
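As an illustration of that split, a strike-rate calculation becomes a short helper in native R. The function name is illustrative, and Win is assumed to be coded 1 for a winner and 0 otherwise:

```r
# Sketch: SQL returns the raw fields; native R does the arithmetic.
# `df` is a data.frame such as the one returned by dbGetQuery above.
# Win is assumed to be 1 for a winner, 0 otherwise.
strike_rate <- function(df, trainer) {
  runs <- df[df$Trainer == trainer, ]
  sum(runs$Win) / nrow(runs)   # winners / total runs
}

# e.g. strike_rate(results, "Trainer Name")
```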
Both fread and an R SQL query are fast routes to importing large datasets into R, and both are far superior to read.csv. Which to use should be decided on a case-by-case basis.