Author Archives: Jeremy Jackson

About Jeremy Jackson

Jeremy enjoys working as a fairly new data analyst. He has a background working with SQL, R, and likes to mess around with a number of languages. While his posts are probably going to fairly analytics-lite while he continues to learn statistics and econometrics, expect him to (try to) post interesting functional ways to perform various data manipulations that can lead to insights and analysis.

Text bashing in R for SQL

Fairly often, a coworker who is strong in Excel, but weak in writing code will come to me for help in special details about customers in their datasets. Sometimes the reason is to call, email, or snail mail a survey, other times to do some classification grouping on the customer. Whatever the reason, the coworker has a list of ID numbers and needs help getting something out of a SQL database.

When it isn't as simple as just adding quotes and commas to the cells in Excel before copying all the ID's into the WHERE clause of a very basic SELECT statement, I often fall back to R and let it do the work of putting together the SELECT statement and querying the data.

Suppose that you're given an Excel file with 1.2 million ID's and there's some transformation that you need to do first. Obviously, you first read the file in using your protocol and package of choice. Since we're ultimately doing SQL, let's take advantage of the RODBC package's cool features.

Now that the data have been pulled into R any manipulations that need to be made can be done until you end up with a list of ID's that you need to query the SQL database for. Let's start by writing the basis of the query.

Notice the use of the \s switch. We're ultimately going to be dropping this into sprintf() with the ID's so that we don't have to clutter up the script with a stupidly-long list of hard-coded values. Next we need to make R build that stupidly-long list of values to put into the query.

What you should end up with in dbData is the list of things that you are after and a happy colleague.

Leave a Comment

Filed under R, Sql Server

Building a productivity system in R, Part 1

I recently came to the conclusion that I need a more meaningful way to track my productivity than the spreadsheet I am currently using, so my next few posts are going to be about building a system in R to track this.  If you're building your own productivity tracking system then by all means take this as inspiration, but don't expect it to suit your needs.  I'm making it to suit my needs using terminology that is common in my workplace and you'll have to figure out what will work for your needs in your workplace.

As with all such endeavors, the thing that is really going to make or break this tracking is the data model, so let's define that first.

At the very top level I have projects.  Each client will have one or more projects.  I'm not interested in tracking work for particular clients (at least for now) so I'm skipping that level, but it is necessary to note that each client has a 4 digit number.  Each project also has a 4 digit number, so the combination of the client digits and the project digits form a partial billing code.  The addition of the task-level 4 digit number makes a complete billing code that can be entered into my timesheet, but we're not there yet.  At the project level, the first two quartets is all that is necessary.  Additionally, we're going to have a name for the project, the date the project gets added, and the date the project gets removed.  Projects can often be multi-year endeavors, so understanding just how long you've been working on various tasks for a project can be useful.  For referencing across different datasets in this data model a project ID will also be defined.

Below the project level, as mentioned, are tasks.  Each task is a concrete goal that has been assigned for me to work on for that project.  Sometimes I only have one task for an entire project, other times I might have several tasks simultaneously. Some tasks may also depend on the completion of other tasks.   So we're going to want the following things: task ID, task name, project ID, complete 12 digit billing code, if the task depends on the completion of another task, add date, complete date, budgeted hours, total used hours (will be cumulative), impact, effort, and notes.  I'm using the impact and effort fields to automatically assign priorities.  They will each be given a value from 1 to 10, with 10 being the highest.  I'm not going to get into how impact and effort will be used to create the priority since I will go into more detail about that in a future post, but see this article for my inspiration.

Finally, I want to track the actual hours in the day that I do the work.  So for this dataset I just want the task ID, the date/time in, and the date/time out.

Since I want all of this to appear as a single object I'm going to use a list containing three data frames.  Below is a function that will actually generate this object.  I expect I'll only ever have to use it once, but it's still useful to me to think in this way.  My next post will get into adding projects and tasks.

 

Leave a Comment

Filed under Functional Programming, Productivity, R, Uncategorized

Read a bunch of csv's quickly

Let's say you have a whole lot of csv files in your working directory.  By some convenient act of divine grace they also happen to have the same column structures.  Reading them into R can be a slow matter as a new R user may try to write out the name of every file, assign it to a variable and then rbind() it all together later on.  A slightly more experienced user might choose to automate it a bit by using list.files() with a for loop to iterate through every csv file in the directory.  A yet more advanced user could figure out via much cursing and pain how to do this using the apply() family of functions, which may actually be the quickest way to do this.  For myself, I like to take headache-saving shortcuts when possible and still maintain some semblance of code efficiency, so naturally I use the eponymous plyr package for this task.

Just for example, let's make a bunch of fake csv files that will all have the same structure.

for(i in 1:100) {
df <- data.frame(x=rnorm(100), z=runif(100))
write.csv(df, sprintf('file%d.csv', i))
}

Then we can write up a convenience function to load plyr, find all the csv's, and define a function that we will run based on the user-entered arguments.  Keep in mind you're going to want to pass the output of this function into a variable.
read.directory <- function(dir, stringsAsFactors=F, keepMeta=F) {
require(plyr)
files <- list.files(dir, pattern='\\.csv', ignore.case=T)
toExec <- "mdply(files, read.csv, stringsAsFactors=stringsAsFactors)"
if(!keepMeta) {
toExec <- paste0(toExec, "[, -c(1, 2)]")
}
return(eval(parse(text=toExec)))
}

You're probably wondering about the "keepMeta" argument. When you run mdply() without adding the column subset to the end of it you end up with two extra columns: one for the index number of the file it came from in list.files(), the second being the actual row number that record resided in within that file. I find that info to be unnecessary most of the time, hence the default of "keepMeta=F" re-writing the function so that it excludes the offending columns.

As a side note about this, I ran my function against a for loop to see how well it performed.  I believe the results are quite clear and provide yet another piece of evidence as to why you should vectorize your code whenever possible.  Additionally, the *ply functions in plyr support multi-core execution which could cause the function to execute even quicker.  This is an area I'll have to research perhaps for a "Part 2".  The for loop used is after the image.  In both cases, they were simply dropped into system.time() and the "User" time recorded.  It was done for 10, 100, 500, 1000, 2000, ..., 10000 files.

funcVSfor

for (i in list.files('.', pattern='\\.csv', ignore.case=T)) {
myData <- rbind(try(myData, TRUE), read.csv(i, stringsAsFactors=FALSE))
}

One additional thing I learned is that when testing for loops you really should write a script to automate it.  It took more time for all of the for loops to finish than it did to come up with the idea for this post, write the function, test the function, get badgered my co-blogger for taking forever to get this post up, and write all the portions of the blogpost not relating to testing the for loop.

Also, I apologize for the weird formatting.  I was hoping to figure out how to get WordPress to respect my code indentation, but it doesn't seem to agree with me on that.  Hopefully you can forgive this fact as a first post while we try to figure that bit out.

Leave a Comment

Filed under Functional Programming, R