Importing Data Files into R

Preparing and managing data is one of the most time-consuming task for data scientists. Importing data into a data science software is the first step for the downstream analysis, which seems trivial but at times frustrating and time consuming to figure out the right way. Data can be directly entered into a system (often feasible only for small data), or imported into analytical and visualization software from different data file formats. In this post, I put together a list of R packages/functions (see Table below) with illustrations that, I hope, will ease the burden of locating the right package/function. Note that the table neither list all data file formats nor all available packages. For a comprehensive R import/export discussion you can read R Import/Export documentation.

R packages / functions to read/import data files

Data File Format File extension R Package R Import Function
Text file .txt Base package 'utils' read.table()
read.delim()*
Comma Separated Value (CSV) .csv Base package 'utils' read.csv()**
Excel .xlsx xlsx read.xlsx()
OCTAVE .mat foreign
R.matlab
read.octave()
readMat()
SPSS .sav foreign
Hmisc
read.spss()
spss.get()
SYSTAT .sys foreign read.systat()
STATA .dta foreign read.dta()
SAS .sas7bdat sas7bdat read.sas7bdat()
SAS Transport file .xpt foreign
Hmisc
read.xport()
sasxport.get()
EpiInfo .rec foreign read.epiinfo()
Minitab .mtp foreign read.mtp()
RDA or RDATA .rda base load()

* If comma is used instead of decimal point, use read.delim2()
** If comma is used instead of decimal point, use read.csv2()

Before running the data import function, make sure that the required packages are installed. If the package is not base package or not installed previously, the first task is installing and loading the package. I wrote the following R function to install R packages along with any package that the given package depends on.

Install R package

In [ ]:
#  R function (r.pkg.install) to install multiple packages at once.  

r_pkg_install <- function(r_pkgs)
  for (i in 1:length(r_pkgs)){
    # "dependencies" option is TRUE to install all packages that a given package depends on. 
    install.packages(paste(r_pkgs[i]), lib="directory you want to install package", 
                     repos = "https://cran.cnr.berkeley.edu/", dependencies = TRUE)
  }

# List all R packages you need to install into your computer ;

r_pkg_install(r_pkgs=c("xlsx", "XML", "sas7bdat", "foreign", "R.matlab"))
In [4]:
# Locating the data file directory. 
# You can specify the full directory in the R function or, or change the working directory so that you don't need to specify every time you run your R function. 

setwd("C:/Users/TM/MWWW/dataset1/") # Replace by your working directory

# Check working directory 
getwd()
"C:/Users/TM/MWWW/dataset1"
In [ ]:
# Note that each function has several arguments and can be checked using help() or ?. For example,
help(read.table)
?read.table

Import tab delimeted text file

In [5]:
# using read.table ;

tab_data <- read.table("iris.txt", header = TRUE)
head(tab_data)

# Or using read.delim ;

delim_data <- read.delim("iris.txt", sep="\t", header = TRUE)
head(delim_data)
Sepal_LSepal_WPetal_LPetal_WSpecies
15.1 3.5 1.4 0.2 setosa
24.9 3 1.4 0.2 setosa
34.7 3.2 1.3 0.2 setosa
44.6 3.1 1.5 0.2 setosa
55 3.6 1.4 0.2 setosa
65.4 3.9 1.7 0.4 setosa

Caution should be taken in reading text files with an empty (" ") missing data value. For example, for missing data values within a tab delimited text file, you can specify sep="\t" option to read in the data properly.

In [ ]:
tab_data <- read.table("iris.txt", sep="\t", header = TRUE)
head(tab_data)

Some data files coded missing values differently, for example "?". We can read such data using the na.string option. This will automatically replace the special missing values by NA. For example, if you are reading a data with missing value recorded as "?", then use the following code

In [ ]:
tab_data_missing <- read.table("iris_missing.txt", na.strings=c("?"), header = TRUE)
head(tab_data_missing)

On the other hand, if you want to replace NA by a different value, for example, replacing all NA by 0, you can use the following code

In [ ]:
tab_data[is.na(tab_data)] <- 0

Import comma separated value (CSV) file

In [ ]:
csv_data <- read.csv("iris.csv" )
head(csv_data)

Import excel file

In [ ]:
library(rJava)
library(xlsxjars)
library(xlsx)

xlsx_data <- read.xlsx("iris.xlsx", sheetName = "iris" )
head(xlsx_data)

Import SPSS file

In [ ]:
library(foreign)

spss_data <- read.spss("iris.sav", to.data.frame = TRUE )
head(spss_data)

Import STATA file

In [ ]:
library(sas7bdat)

stata_data <- read.dta("iris.dta" )
ead(stata_data)

Import SAS file

In [ ]:
library(sas7bdat)

sas_data <- read.sas7bdat("iris.sas7bdat" )
head(sas_data)

Import SAS export file

In [ ]:
library(foreign)

sasxpt_data <- read.xport("iris.xpt" )
head(sasxpt_data)

Summary

The above list helps to locate the right R package/function. When the file size is larger than the computer memory allocated to R, the functions may give you an error message. In such cases, you may 1) look for a computer with bigger memory that accommodate the file size 2) split the data into several small datasets, read and analyze each of them separately, and later combine the results. 3) Use R packages or techniques that are tailored to big data. See for example the discussion in High-Performance and Parallel Computing with R.