Importing Data Files into R

Preparing and managing data is one of the most time-consuming task for data scientists. Importing data into a data science software is the first step for the downstream analysis, which seems trivial but at times frustrating and time consuming to figure out the right way. Data can be directly entered into a system (often feasible only for small data), or imported into analytical and visualization software from different data file formats. In this post, I put together a list of R packages/functions (see Table below) with illustrations that, I hope, will ease the burden of locating the right package/function. Note that the table neither list all data file formats nor all available packages. For a comprehensive R import/export discussion you can read R Import/Export documentation.

R packages / functions to read/import data files

Data File Format	File extension	R Package	R Import Function
Text file	.txt	Base package 'utils'	read.table() read.delim()*
Comma Separated Value (CSV)	.csv	Base package 'utils'	read.csv()**
Excel	.xlsx	xlsx	read.xlsx()
OCTAVE	.mat	foreign R.matlab	read.octave() readMat()
SPSS	.sav	foreign Hmisc	read.spss() spss.get()
SYSTAT	.sys	foreign	read.systat()
STATA	.dta	foreign	read.dta()
SAS	.sas7bdat	sas7bdat	read.sas7bdat()
SAS Transport file	.xpt	foreign Hmisc	read.xport() sasxport.get()
EpiInfo	.rec	foreign	read.epiinfo()
Minitab	.mtp	foreign	read.mtp()
RDA or RDATA	.rda	base	load()

* If comma is used instead of decimal point, use read.delim2()
** If comma is used instead of decimal point, use read.csv2()

Before running the data import function, make sure that the required packages are installed. If the package is not base package or not installed previously, the first task is installing and loading the package. I wrote the following R function to install R packages along with any package that the given package depends on.

Install R package¶

#  R function (r.pkg.install) to install multiple packages at once.  

r_pkg_install <- function(r_pkgs)
  for (i in 1:length(r_pkgs)){
    # "dependencies" option is TRUE to install all packages that a given package depends on. 
    install.packages(paste(r_pkgs[i]), lib="directory you want to install package", 
                     repos = "https://cran.cnr.berkeley.edu/", dependencies = TRUE)
  }

# List all R packages you need to install into your computer ;

r_pkg_install(r_pkgs=c("xlsx", "XML", "sas7bdat", "foreign", "R.matlab"))

# Locating the data file directory. 
# You can specify the full directory in the R function or, or change the working directory so that you don't need to specify every time you run your R function. 

setwd("C:/Users/TM/MWWW/dataset1/") # Replace by your working directory

# Check working directory 
getwd()

# Note that each function has several arguments and can be checked using help() or ?. For example,
help(read.table)
?read.table

Import tab delimeted text file¶

# using read.table ;

tab_data <- read.table("iris.txt", header = TRUE)
head(tab_data)

# Or using read.delim ;

delim_data <- read.delim("iris.txt", sep="\t", header = TRUE)
head(delim_data)

Caution should be taken in reading text files with an empty (" ") missing data value. For example, for missing data values within a tab delimited text file, you can specify sep="\t" option to read in the data properly.

tab_data <- read.table("iris.txt", sep="\t", header = TRUE)
head(tab_data)

Some data files coded missing values differently, for example "?". We can read such data using the na.string option. This will automatically replace the special missing values by NA. For example, if you are reading a data with missing value recorded as "?", then use the following code

tab_data_missing <- read.table("iris_missing.txt", na.strings=c("?"), header = TRUE)
head(tab_data_missing)

On the other hand, if you want to replace NA by a different value, for example, replacing all NA by 0, you can use the following code

tab_data[is.na(tab_data)] <- 0

Import comma separated value (CSV) file¶

csv_data <- read.csv("iris.csv" )
head(csv_data)

Import excel file¶

library(rJava)
library(xlsxjars)
library(xlsx)

xlsx_data <- read.xlsx("iris.xlsx", sheetName = "iris" )
head(xlsx_data)

Import SPSS file¶

library(foreign)

spss_data <- read.spss("iris.sav", to.data.frame = TRUE )
head(spss_data)

Import STATA file¶

library(sas7bdat)

stata_data <- read.dta("iris.dta" )
ead(stata_data)

Import SAS file¶

library(sas7bdat)

sas_data <- read.sas7bdat("iris.sas7bdat" )
head(sas_data)

Import SAS export file¶

library(foreign)

sasxpt_data <- read.xport("iris.xpt" )
head(sasxpt_data)

Summary¶

The above list helps to locate the right R package/function. When the file size is larger than the computer memory allocated to R, the functions may give you an error message. In such cases, you may 1) look for a computer with bigger memory that accommodate the file size 2) split the data into several small datasets, read and analyze each of them separately, and later combine the results. 3) Use R packages or techniques that are tailored to big data. See for example the discussion in High-Performance and Parallel Computing with R.

	Sepal_L	Sepal_W	Petal_L	Petal_W	Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa