Category Archives: R

Correctly Reporting P-Values in Summary Tables Reported with xtable

Reading Time: 3 minutes

Often when writing a manuscript using knitr and xtable I am flustered by my p-values. In simple summary tables, R conveniently rounds my p-values to 0: a mathematically inappropriate result, since a p-value is never exactly zero. A colleague recently commented on the poor reporting in my table (shown below using print.xtable with the type="html" argument), inspiring a much-needed change.

Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.001704 0.000005 100409.770956 0.000000
sizemedium 0.000046 0.000005 90.534705 0.000000
sizesmall 0.000003 0.000005 0.294331 0.587458
time -0.000004 0.000001 11.614917 0.000654

 

The fix is actually fairly straightforward and can be wrapped up in a simple function, "fixp", with the code shown below:

fixp <- function(x, dig=3){
  x <- as.data.frame(x)
  
  # Warn if the last column doesn't look like a p-value column
  if(substr(names(x)[ncol(x)],1,2) != "Pr")
    warning("The name of the last column didn't start with Pr. This may indicate that p-values weren't in the last column, and thus, that this function is inappropriate.")
  
  # Round the p-values, then replace anything that rounded to 0 with "< .0...1"
  x[,ncol(x)] <- round(x[,ncol(x)], dig)
  for(i in 1:nrow(x)){
    if(x[i,ncol(x)] == 0)
      x[i,ncol(x)] <- paste0("< .", paste0(rep(0,dig-1), collapse=""), "1")
  }
  
  x
}

All that's going on: the function pulls in the summary table (usually via a $coef), converts it to a data frame (some already are, while others, e.g. from lm, are numeric matrices), throws a warning if the last column heading doesn't begin with "Pr" (since it may not be the column that contains the p-values), and replaces any value that rounded to 0 (at the user-specified number of digits) with "<" the smallest value that could have been rounded to (e.g. < .001 with the default dig = 3). The function then returns the edited table, ready for reporting. To mimic the table above, we set dig = 6 (so the p-values go out six decimal places) and re-run:

Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.001704 0.000005 100409.770956 < .000001
sizemedium 0.000046 0.000005 90.534705 < .000001
sizesmall 0.000003 0.000005 0.294331 0.587458
time -0.000004 0.000001 11.614917 0.000654

 

Much better! Also, to report a two-digit p-value (as some writing styles require), we simply set dig = 2:

Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.001704 0.000005 100409.770956 < .01
sizemedium 0.000046 0.000005 90.534705 < .01
sizesmall 0.000003 0.000005 0.294331 0.59
time -0.000004 0.000001 11.614917 < .01

 

By design, the p-values can be manipulated independently of the estimates. This allows the estimated coefficients to be reported in meaningful units (in the example above, very small ones) while the p-values are reported on the scale that many writing styles request.

Want to try this yourself? Here's an example that uses just a built-in dataset in R:

#this gives a summary table with a small p-value. Trying to report this with xtable would result in an R rounding issue!
(mod <- coef(summary(lm(uptake ~ conc + Treatment + Type + Plant, data=CO2))))

#this fixes the p-value to 2 digits, correctly reporting p-values that would have been rounded to 0
fixp(mod,dig=2)
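Passing the result through xtable is then the same as before; the call I use looks roughly like the following sketch (type="html" to match the tables shown here, with the digits passed to both functions):

library(xtable)
print(xtable(fixp(mod, dig = 2), digits = 2), type = "html")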

Here's the final output via print.xtable (dig=2 for fixp and xtable):

Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.42 4.67 8.00 < .01
conc 0.02 0.00 7.96 < .01
Treatmentchilled -12.50 5.10 -2.45 0.02
TypeMississippi -23.33 6.01 -3.88 < .01
Plant.L 21.58 11.14 1.94 0.06
Plant.Q -4.62 2.27 -2.03 0.05
Plant.C 1.46 5.10 0.29 0.78
Plant^4 2.34 2.27 1.03 0.31
Plant^5 -0.48 5.77 -0.08 0.93
Plant^6 -0.04 2.27 -0.02 0.99
Plant^7 -1.91 3.64 -0.53 0.6
Plant^8 -3.28 2.27 -1.44 0.15
Plant^10 0.55 2.27 0.24 0.81

 

Limitations (ish):

  1. Again, this assumes that the last column is the one to be transformed. This is by design, though it may be inconvenient in some situations. If needed, the change is easily made in the function definition (see the sketch after this list).
  2. When the last column is manipulated, it becomes a character column in the data frame; when it is rounded but no entry rounds to 0, it stays numeric.
  3. This assumes a dataframe-style format of your table. Thus, this method will NOT be effective at correcting reported p-values for an individual test: say a t-test, where only the statistic is reported (and not a table). Personally this is not a concern, as I deal with these situations in other ways, but for some users seeking an overall "p-value fixing" method this may not be the answer.
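For the first limitation, a hypothetical variant (not in the package) that takes the p-value column as an argument instead of assuming it is last might look something like this sketch:

fixp2 <- function(x, dig = 3, pcol = ncol(x)){
  x <- as.data.frame(x)
  # Round the chosen column, then flag anything that rounded to exactly 0
  p <- round(x[, pcol], dig)
  p <- ifelse(p == 0,
              paste0("< .", paste0(rep(0, dig - 1), collapse = ""), "1"),
              p)
  x[, pcol] <- p
  x
}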

As with other functions I write posts on, this function is available in my package (creatively named "myStuff") via GitHub. If you'd like to play with the most current version of the function, I'd encourage you to check it out here. Alternatively, to have access to other fun functions, install the package directly from GitHub with the code below (requires devtools):

devtools::install_github("flor3652/myStuff")


Filed under R

The Keep Function

Reading Time: 2 minutes

Occasionally when I am jotting down some code I find myself creating several temporary variables with the intention of getting rid of them later. These variables have quick names that make sense in a local scope but get quite confusing out of context. If the project expands, however, I find myself with three options for playing with the next level of the project: remember what I've already used and try to avoid it, rename all of the temporary variables (i.e. rewrite the code at the base level of the project), or wipe the variables for later use. The decision usually depends on how the project is going. If it is going well, I'll go back and dutifully rewrite the initial code to track variables in a more unique way. If the project is still a bit shaky, I'll clear the variable names that I tend to use as temporaries and keep exploring. Remembering variable names never turns out well for me; I inevitably forget that a variable was defined in a previous section, use it thinking I had redefined it (when I hadn't), and wonder at the strange results I get.

Clearing all of the variable names can be a task, though. The problem is that, in order to move on to the next stage, I actually want to keep a few of the variables and get rid of the rest. After playing with a few ideas (removing the unwanted variables one at a time, writing out the variables that I wanted to keep, etc.), I settled on writing a keep function. The keep function does just what it says: given variables as arguments, it keeps those and removes the rest. For example, consider the vectors "x" and "y", which are combined to give a matrix "initMod". Calling keep(initMod) keeps the matrix and eliminates every other object in the global environment (including "x" and "y"), letting me reuse the names "x" and "y" as temporary variables again (say, for "modifiedMod").
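The idea itself fits in a few lines; here is a minimal sketch (the full version linked below is what I actually use), written so that keep() also spares itself and can be reused:

keep <- function(...){
  # Names of the objects passed in, captured without evaluating them
  wanted <- sapply(substitute(list(...)), deparse)[-1]
  # Remove everything in the global environment except those objects (and keep itself)
  rm(list = setdiff(ls(envir = .GlobalEnv), c(wanted, "keep")), envir = .GlobalEnv)
}

With that in place, keep(initMod) behaves as described above.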

Code for keep can be found on GitHub, here. Note that the function will delete itself if it is defined in the global environment, so add a clause that always keeps keep. You could also simply grab the personal package that contains it (myStuff) off of GitHub using the code below.

if(!require(devtools)) install.packages("devtools", dependencies=TRUE)
devtools::install_github("flor3652/myStuff")
library(myStuff)

Note that when using the function from the package you don't have to worry about it being deleted as it isn't in the global environment.


Filed under R

Text bashing in R for SQL

Reading Time: < 1 minute

Fairly often, a coworker who is strong in Excel but weak in writing code will come to me for help getting specific details about customers in their datasets. Sometimes the reason is to call, email, or snail-mail a survey; other times it is to do some classification grouping on the customers. Whatever the reason, the coworker has a list of ID numbers and needs help getting something out of a SQL database.

When it isn't as simple as just adding quotes and commas to the cells in Excel before copying all the ID's into the WHERE clause of a very basic SELECT statement, I often fall back to R and let it do the work of putting together the SELECT statement and querying the data.

Suppose that you're given an Excel file with 1.2 million ID's and there's some transformation that you need to do first. Obviously, you first read the file in using your protocol and package of choice. Since we're ultimately doing SQL, let's take advantage of the RODBC package's cool features.

library(RODBC)

xlsxFile <- file.path("C:", "MyFile.xlsx")
xlsxChan <- odbcConnectExcel2007(xlsxFile)
xlsxSheets <- sqlTables(xlsxChan)
View(xlsxSheets)

# From here you can choose the table (worksheet) you need to pull the data from.

xlsxData <- sqlFetch(xlsxChan, 'Sheet1$')
odbcClose(xlsxChan)

Now that the data have been pulled into R any manipulations that need to be made can be done until you end up with a list of ID's that you need to query the SQL database for. Let's start by writing the basis of the query.

myQuery <- "SELECT ID, Name, Street, City, State, Zip, Phone, Email FROM CustomerTable WHERE ID in (\s)"

Notice the use of the %s placeholder. We're ultimately going to drop this into sprintf() with the ID's so that we don't have to clutter up the script with a stupidly-long list of hard-coded values. Next we need to make R build that stupidly-long list of values to put into the query.

id <- as.character(xlsxData$ID)
id <- sprintf("'%s'", unique(id))
id <- toString(id)

# Now we can put the two together and send the whole query off.
myQuery <- sprintf(myQuery, id)

# Here do what is appropriate for your database flavor. I'm pretending to use MySQL on Windows.
dbChan <- odbcConnect("CustDb", uid="jeremy", case="tolower")
dbData <- sqlQuery(dbChan, myQuery, stringsAsFactors=FALSE)
odbcCloseAll()

What you should end up with in dbData is the list of things that you are after and a happy colleague.


Filed under R, Sql Server

Keeping rows containing particular strings in R

Reading Time: 2 minutes

I was recently presented with the need to keep only the rows in my dataset that contained particular strings. I needed to retain any row that had a "utm_source", a "utm_medium", and a "utm_campaign". Each row in my dataset was a single string, and the idea is to parse the strings of interest. My approach was to use grepl and check each string for each condition it needed to satisfy. I consulted with my co-blogger to see if he had a more intelligent way of approaching this problem. He tackled it with a regular expression using look-aheads. You can see my 'checker' function below and Jeremy's function 'checker2'. Both seem to perform the required task correctly, so now it is simply a matter of performance.

#Sample Data
querystrings <- c("skuId=34567-02-S&qty=1&continueShoppingUrl=https://www.beardedanalytics.com/?utm_source=ER&utm_medium=email&utm_content=Main&utm_campaign=ER101914G_greenlogoupper&cm_lm=foo@person.invalid&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket1&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937", 
"skuId=6950K-02-S&qty=1&continueShoppingUrl=https://www.beardedanalytics.com/&utm_medium=email&utm_content=Main&utm_campaign=ER101914G_greenlogoupper&cm_lm=foo2@person.invalid&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket2&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937"
)

mydf <- as.data.frame(querystrings)


# This should return TRUE when all conditions have been satisfied
checker <- function(foo){
  grepl(pattern="utm_source", x=foo) &
    grepl(pattern="utm_medium", x=foo) &
    grepl(pattern="utm_campaign", x=foo)
}

checker2 <- function(foo){
  grepl(pattern="^(?=.*utm_source)(?=.*utm_medium)(?=.*utm_campaign).*$",
        x=foo, perl=TRUE)
}
# This is the loop that was run to repeatedly test each function with a much larger dataset.
# Yes, I know this is not an efficient way to do this, but it is easy to read.
ttime <- c()
for(i in 1:100){
  tt <- system.time(tresult <- mydf[checker(mydf[, 1]), ])
  ttime <- rbind(ttime, tt[3])
}

jtime <- c()
for(i in 1:100){
  jt <- system.time(jresult <- mydf[checker2(mydf[, 1]), ])
  jtime <- rbind(jtime, jt[3])
}

mean(ttime)
mean(jtime)
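As an aside, the same comparison can be written more compactly with the microbenchmark package; this is a sketch of an alternative (assuming the package is installed), not what produced the timings quoted below:

library(microbenchmark)
microbenchmark(
  checker  = mydf[checker(mydf[, 1]), ],
  checker2 = mydf[checker2(mydf[, 1]), ],
  times = 100
)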

I am not able to share the full dataset that I was using, due to privacy concerns. The dataset that I tested both functions against had 26,746 rows. The 'checker' function which I wrote took on average 0.0801 seconds, and Jeremy's approach took 0.1488 seconds. I decided to stick with my checker function, but not because of speed; I would have happily accepted the increased computation time if the times had been reversed. The reason is that I find mine easier to read, which means there is a chance I could come back to this code in six months and have a clue about what it is supposed to be doing. Regular expressions can sometimes be quite hard to come back to and say, "oh yeah, I wanted to check whether all the characters that occupy prime digits in my string are vowels!". I think my simple grepl statement will be easier to change if that becomes necessary in the future, so I will stick with the 'checker' approach. Do you have a better way to approach this using R? If so, make sure to post a comment.


Filed under R

Summary Function that is compatible with xtable

Reading Time: < 1 minute

If you like to make nice-looking documents using LaTeX, I highly recommend the 'xtable' package. In most instances it works quite well for producing a reasonable-looking table from an R object. However, I recently wanted a LaTeX table from the 'summary' function in base R, so naturally I tried:

library(xtable)
set.seed(123)
foo = rnorm(10)
summary(foo)
xtable(summary(foo))

Which gave me the following error:

Error in xtable.table(summary(foo)) :
  xtable.table is not implemented for tables of > 2 dimensions

So I decided to create a simple function that would return a dataframe which is easy to use with xtable. Here is what I came up with.

 

summaryfunction <- function(x){
  if(!is.numeric(x)) stop("Supplied x is not numeric")
  mysummary <- data.frame(
    "Min."    = as.numeric(min(x)),
    "1st Qu." = quantile(x)[2],
    "Median"  = median(x),
    "Mean"    = mean(x),
    "3rd Qu." = quantile(x)[4],
    "Max."    = max(x),
    row.names = ""
  )
  # data.frame() mangles the column names, so restore them for xtable
  names(mysummary) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  return(mysummary)
}

Now, when I try to use xtable I get the following output:

% latex table generated in R 3.1.1 by xtable 1.7-4 package
% Tue Nov 04 21:58:07 2014
\begin{table}[ht]
\centering
\begin{tabular}{rrrrrrr}
  \hline
 & Min. & 1st Qu. & Median & Mean & 3rd Qu. & Max. \\
  \hline
 & -1.27 & -0.53 & -0.08 & 0.07 & 0.38 & 1.72 \\
   \hline
\end{tabular}
\end{table}

This should lead to an easier way to incorporate more summaries when you are writing your paper, using R and knitr of course. If you do use knitr, make sure to try the results = 'asis' chunk option when printing the xtable output.
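For instance, in a knitr .Rnw document the chunk might look something like this sketch (the chunk label is arbitrary):

<<summary-table, results='asis', echo=FALSE>>=
print(xtable(summaryfunction(foo)))
@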


Filed under R

Sending Email From R

Reading Time: 2 minutes

When I am working in SQL Server and need to send an email, I use "sp_send_dbmail", but when working in R I didn't know how to send one. I often use an email as notification that a process has finished; it also works nicely as a text to your cell phone. I had one additional reason for wanting to email from R: I wanted to send an email to my Evernote account with just a few keystrokes. The goal was to accomplish this with a simple wrapper function. Below is the solution I came up with. It works, but there are serious security implications, so I offer it merely as a proof of concept. Hopefully someone can show me a better way to handle passing the email password to the R function.

# install.packages("mailR")
library(mailR)
en <- function(s="Message From R",tags="#todo",b=" ",r = c("myevernoteaddress@m.evernote.com")){
    sender <- "myemail@mydomain.com"
    password <- readLines("pathtopassword/password.txt")
    if(r=="me") r <- "myemail@mydomain.com"
    send.mail(
        from = sender,
        to = r,
        subject=paste(s,tags),
        body =b,
        smtp = list(host.name = "mail.mydomain.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
        authenticate = TRUE,
        send = TRUE
        )
}
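On the password question: one alternative I would consider (an assumption on my part, not something this function uses) is the keyring package, which keeps the credential in the operating system's credential store instead of a plain-text file:

# install.packages("keyring")
library(keyring)

# One-time setup: prompts for the password and stores it in the OS keychain
# (the service name here is arbitrary)
key_set(service = "smtp-mydomain", username = "myemail@mydomain.com")

# Inside en(), the readLines() call could then be replaced with:
password <- key_get(service = "smtp-mydomain", username = "myemail@mydomain.com")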

I often have an R terminal open, so when I have a great idea for a research project I can add a note to my Evernote account easily. For instance:

en(s="Read up on current imputation methods",tags="#Research")

Then a new entry is added for me in Evernote with the tag 'Research'. (I have noticed that the tagging only seems to work if the tag already exists in my Evernote account.) Often I have a task that just needs doing that I don't want to forget about; I can issue a quick command and then I have a record of it.

en(s="email adviser about research")

That is all I need to do and the note is added to my account. I have found this to be quite useful and hopefully you will as well.


Filed under email, R

Unexpected behavior with summary function in R

Reading Time: 2 minutes

I often find myself working with data that includes dates and times. Sometimes I am interested in looking at what happened on a particular calendar day. I usually avoid dealing with actual datetime formats when working at the day level and prefer to represent a day as an integer. For instance, the first day of each quarter can be represented as integers (numeric also works for this example), and if I want to know the oldest date in my dataset I can just take the minimum, since I am using a numerical data structure.

mydates  <- c(20140101L,20140401L,20140701L,20141001L)
min(mydates)

[1] 20140101

For some reason which I don't recall, I tried using the summary() function on my dates. The only values that would be valid are the minimum and the maximum, or so I thought.

summary(mydates)

The output:
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# 20140000 20140000 20140000 20140000 20140000 20140000

This behavior, which seemed odd to me, is caused by the way summary() deals with numerical data, so I decided to look at what summary() is actually doing. To view the code behind the summary function, type:

summary.default

The portion of code that we are interested in is this:

else if (is.numeric(object)) {
  nas <- is.na(object)
  object <- object[!nas]
  qq <- stats::quantile(object)
  qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
  names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.",
                 "Max.")
  if (any(nas))
    c(qq, `NA's` = sum(nas))
  else qq
}

This code chunk gets executed when the object that we pass to summary is numeric.  If we substitute in our object 'mydates' we get the following code.

digits <- max(3L, getOption("digits") - 3L)
nas <- is.na(mydates)
mydates <- mydates[!nas]
qq <- stats::quantile(mydates)
qq <- signif(c(qq[1L:3L], mean(mydates), qq[4L:5L]), digits)

If you step through the code line by line, you will notice that after line 4 (the quantile() call) the minimum and maximum look exactly as we would expect. However, after you execute line 5, the numbers change: signif() rounds them to the requested number of significant digits rather than leaving the actual values. For example, try:

signif(20140101,digits)

[1] 20140000
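If the goal is just to see the actual minimum and maximum, summary() also accepts a digits argument that is passed to signif(), so asking for enough significant digits sidesteps the truncation:

# 8 significant digits is enough for a YYYYMMDD integer,
# so Min. and Max. now display as 20140101 and 20141001
summary(mydates, digits = 8)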

So be careful when using generic functions if you don't know what they are doing. I would encourage you to take a look at the code behind some of the R functions you use the most. For instance, the fivenum() function does not change my min and max values the way summary() did.


Filed under R

Building a productivity system in R, Part 1

Reading Time: 3 minutes

I recently came to the conclusion that I need a more meaningful way to track my productivity than the spreadsheet I am currently using, so my next few posts are going to be about building a system in R to track this.  If you're building your own productivity tracking system then by all means take this as inspiration, but don't expect it to suit your needs.  I'm making it to suit my needs using terminology that is common in my workplace and you'll have to figure out what will work for your needs in your workplace.

As with all such endeavors, the thing that is really going to make or break this tracking is the data model, so let's define that first.

At the very top level I have projects. Each client will have one or more projects. I'm not interested in tracking work for particular clients (at least for now) so I'm skipping that level, but it is necessary to note that each client has a 4-digit number. Each project also has a 4-digit number, so the combination of the client digits and the project digits forms a partial billing code. Adding the task-level 4-digit number makes a complete billing code that can be entered into my timesheet, but we're not there yet. At the project level, the first two quartets are all that is necessary. Additionally, we're going to have a name for the project, the date the project gets added, and the date the project gets removed. Projects can often be multi-year endeavors, so understanding just how long you've been working on various tasks for a project can be useful. For referencing across the different datasets in this data model, a project ID will also be defined.

Below the project level, as mentioned, are tasks. Each task is a concrete goal that has been assigned for me to work on for that project. Sometimes I only have one task for an entire project; other times I might have several tasks simultaneously. Some tasks may also depend on the completion of other tasks. So we're going to want the following things: task ID, task name, project ID, the complete 12-digit billing code, which task (if any) it depends on, add date, complete date, budgeted hours, total used hours (cumulative), impact, effort, and notes. I'm using the impact and effort fields to automatically assign priorities. They will each be given a value from 1 to 10, with 10 being the highest. I'm not going to get into how impact and effort will be used to create the priority, since I will go into more detail about that in a future post, but see this article for my inspiration.

Finally, I want to track the actual hours in the day that I do the work.  So for this dataset I just want the task ID, the date/time in, and the date/time out.

Since I want all of this to appear as a single object I'm going to use a list containing three data frames.  Below is a function that will actually generate this object.  I expect I'll only ever have to use it once, but it's still useful to me to think in this way.  My next post will get into adding projects and tasks.

createStructure <- function() {
  require(lubridate)
  Projects <- data.frame(ProjectID = character(),
                         ProjectName = character(),
                         BillingCode = character(), #(possibly partial)
                         AddDate = ymd(),
                         RemoveDate = ymd(),
                         stringsAsFactors=FALSE)
  Tasks <- data.frame(TaskID = character(),
                      TaskName = character(),
                      ProjectID = character(),
                      BillingCode = character(), #(should be complete)(multiple codes spill into Notes field)
                      DependsOn = character(), #TaskID of any prerequisite task
                      AddDate = ymd(),
                      CompleteDate = ymd(),
                      BudgetHours = numeric(),
                      TotalUsedHours = numeric(),
                      Impact = integer(),
                      Effort = integer(),
                      Notes = character(),
                      stringsAsFactors=FALSE)
  Hours <- data.frame(TaskID = character(),
                      TimeIn = ymd_hms(),
                      TimeOut = ymd_hms(),
                      stringsAsFactors=FALSE)
  return(list(Projects, Tasks, Hours))
}
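Using it should then be a one-liner. Here is a sketch of what I expect the first call to look like; naming the list elements is my own addition, and if your version of lubridate complains about ymd() or ymd_hms() with no arguments, as.Date(character()) and as.POSIXct(character()) are drop-in replacements for the empty date columns:

tracker <- createStructure()
names(tracker) <- c("Projects", "Tasks", "Hours")
str(tracker$Tasks)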

 


Filed under Functional Programming, Productivity, R, Uncategorized

Assumption Checking - Part I

Reading Time: 2 minutes

Often when working we are under deadlines to produce results in a reasonable timeframe, and sometimes an analyst may not check his assumptions if the deadline is tight. A simple example to illustrate this would be a one-sample t-test. You might need to test your sample to see if the mean is different from a specific number. One assumption of a t-test that is often overlooked is that the sample needs to be drawn randomly from the population and the population is supposed to follow a Gaussian distribution. When was the last time in the workplace that you heard of someone performing a normality test before running a t-test? It is considered an extra step that is not usually taken. It really should not be considered a burden, and it can easily be accomplished with a wrapper function in R.

mytest <- function(x, value=0) {
  xx <- as.character(substitute(x))
  if(!is.numeric(x)) stop(sprintf('%s is not numeric', xx))
  # If the data look normal (Shapiro-Wilk p > .10), use a t-test;
  # otherwise fall back to the Wilcoxon signed rank test.
  if(shapiro.test(x)$p.value > .10){
    print(t.test(x, mu=value))
  } else {
    print(wilcox.test(x, mu=value))
  }
}

We can combine that with another function to produce a density plot.

myplot <- function(x, color="blue"){
  xx <- as.character(substitute(x))
  if(!is.numeric(x)) stop(sprintf('%s is not numeric', xx))
  title <- paste("Density Plot", "\n", "Dataset = ", deparse(substitute(x)))
  mydens <- density(x)
  plot(mydens, main=title, las=1)
  polygon(mydens, col=color)
}

Now, let's see how our functions work. If we generate some random values from a Gaussian distribution, we would expect them to "normally" pass a normality test and a t-test to be performed. However, if we had data generated from another distribution that is not 'normal', then typically we would expect to see the results of the Wilcoxon test.

set.seed(123)
n <- 1000
normal <- rnorm(n,0,1)
chisq <- rchisq(n,df=5)

mytest(normal)
myplot(normal)

#Test for difference from 5 for chi-square data
mytest(chisq,value=5)
myplot(chisq ,color="orange")

[Figure: density plots for the normal and chi-square samples]

Results from 'mytest(normal)':

One Sample t-test
data: x
t = 0.5143, df = 999, p-value = 0.6072
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.04541145 0.07766719
sample estimates:
mean of x
0.01612787

Results from 'mytest(chisq,value=5)':

Wilcoxon signed rank test with continuity correction

data: x
V = 214385, p-value = 8.644e-05
alternative hypothesis: true location is not equal to 5

Conclusion
The benefit of working ahead can be seen. Once you have these functions written, you can add them to a personal R package that you host on GitHub. Then you will be able to use them whenever you have an internet connection, and the whole R community has a chance to benefit. Also, it is easy to combine the two functions into one.

 

#Combine the functions
PlotAndTest <- function(x){
  mytest(x)
  myplot(x)
}

 



Filed under Assumption Checking, R

Read a bunch of csv's quickly

Reading Time: 3 minutes

Let's say you have a whole lot of csv files in your working directory. By some convenient act of divine grace they also happen to have the same column structures. Reading them into R can be a slow matter, as a new R user may try to write out the name of every file, assign each to a variable, and then rbind() it all together later on. A slightly more experienced user might choose to automate it a bit by using list.files() with a for loop to iterate through every csv file in the directory. A yet more advanced user could figure out, via much cursing and pain, how to do this using the apply() family of functions, which may actually be the quickest way to do it. For myself, I like to take headache-saving shortcuts when possible while still maintaining some semblance of code efficiency, so naturally I use the plyr package for this task.

Just for example, let's make a bunch of fake csv files that will all have the same structure.

for(i in 1:100) {
  df <- data.frame(x=rnorm(100), z=runif(100))
  write.csv(df, sprintf('file%d.csv', i))
}

Then we can write up a convenience function that loads plyr, finds all the csv's, and builds the call we will run based on the user-entered arguments. Keep in mind that you're going to want to assign the output of this function to a variable.

read.directory <- function(dir, stringsAsFactors=F, keepMeta=F) {
  require(plyr)
  files <- list.files(dir, pattern='\\.csv', ignore.case=T)
  # Build the mdply() call as a string so the column subset can be
  # appended only when the metadata columns should be dropped.
  toExec <- "mdply(files, read.csv, stringsAsFactors=stringsAsFactors)"
  if(!keepMeta) {
    toExec <- paste0(toExec, "[, -c(1, 2)]")
  }
  return(eval(parse(text=toExec)))
}
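With the fake files from above sitting in the working directory, usage then collapses to a single assignment:

allData <- read.directory(".")  # stacks the 100 fake files into one data frame
head(allData)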

You're probably wondering about the "keepMeta" argument. When you run mdply() without adding the column subset to the end, you end up with two extra columns: one for the index of the file it came from in list.files(), the other for the row number that record occupied within that file. I find that info unnecessary most of the time, hence the default of keepMeta=F, which builds the call so that it excludes the offending columns.

As a side note, I ran my function against a for loop to see how well it performed. I believe the results are quite clear and provide yet another piece of evidence for why you should vectorize your code whenever possible. Additionally, the *ply functions in plyr support multi-core execution, which could make the function run even faster; that is an area I'll have to research, perhaps for a "Part 2". The for loop used is shown after the figure. In both cases, the calls were simply dropped into system.time() and the "User" time recorded. This was done for 10, 100, 500, 1000, 2000, ..., 10000 files.

[Figure: timing of read.directory() versus the for loop as the number of files grows]

for (i in list.files('.', pattern='\\.csv', ignore.case=T)) {
  myData <- rbind(try(myData, TRUE), read.csv(i, stringsAsFactors=FALSE))
}

One additional thing I learned is that when testing for loops you really should write a script to automate it. It took more time for all of the for loops to finish than it did to come up with the idea for this post, write the function, test the function, get badgered by my co-blogger for taking forever to get this post up, and write all the portions of the blog post not relating to testing the for loop.

Also, I apologize for the weird formatting.  I was hoping to figure out how to get WordPress to respect my code indentation, but it doesn't seem to agree with me on that.  Hopefully you can forgive this fact as a first post while we try to figure that bit out.


Filed under Functional Programming, R