Monthly Archives: July 2014

Building a productivity system in R, Part 1

Reading Time: 3 minutes

I recently concluded that I need a more meaningful way to track my productivity than the spreadsheet I currently use, so my next few posts will be about building a system in R to do that.  If you're building your own productivity tracking system, by all means take this as inspiration, but don't expect it to suit your needs.  I'm building it for my own needs, using terminology that is common in my workplace, and you'll have to figure out what works for you in yours.

As with all such endeavors, the thing that is really going to make or break this tracking is the data model, so let's define that first.

At the very top level I have projects.  Each client will have one or more projects.  I'm not interested in tracking work for particular clients (at least for now), so I'm skipping that level, but it is worth noting that each client has a 4-digit number.  Each project also has a 4-digit number, so the combination of the client digits and the project digits forms a partial billing code.  Adding the task-level 4-digit number makes a complete billing code that can be entered into my timesheet, but we're not there yet.  At the project level, the first two quartets are all that is needed.  Additionally, each project gets a name, the date it was added, and the date it was removed.  Projects are often multi-year endeavors, so knowing just how long you've been working on the various tasks for a project can be useful.  A project ID will also be defined so that the different datasets in this data model can reference each other.
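
As a rough sketch of how the partial and complete billing codes fit together, the snippet below pads each number to four digits and joins the pieces.  The helper name `makeBillingCode`, the example numbers, and the plain concatenation with no separator are assumptions for illustration; the real timesheet format may differ.

```r
# Hypothetical helper: join 4-digit client, project, and (optionally) task numbers.
makeBillingCode <- function(client, project, task = NULL) {
  parts <- c(client, project, task)
  stopifnot(is.numeric(parts), all(parts >= 0), all(parts <= 9999))
  # Pad each piece to 4 digits, then concatenate with no separator (assumed format).
  paste(sprintf("%04d", as.integer(parts)), collapse = "")
}

partial  <- makeBillingCode(12, 345)        # project level: client + project
complete <- makeBillingCode(12, 345, 6789)  # task level: client + project + task
nchar(c(partial, complete))                 # lengths of the two codes
```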

Below the project level, as mentioned, are tasks.  Each task is a concrete goal that has been assigned to me for that project.  Sometimes I have only one task for an entire project; other times I might have several tasks simultaneously.  Some tasks may also depend on the completion of other tasks.  So we're going to want the following things: task ID, task name, project ID, complete 12-digit billing code, whether the task depends on the completion of another task, add date, complete date, budgeted hours, total used hours (cumulative), impact, effort, and notes.  I'm using the impact and effort fields to assign priorities automatically.  Each will be given a value from 1 to 10, with 10 being the highest.  I won't get into how impact and effort are combined into a priority here, since I will cover that in more detail in a future post, but see this article for my inspiration.
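
The constraints on a task record can be checked before the record is stored.  This is only a sketch: `checkTask` is a hypothetical helper, and the digits-only pattern for the billing code is an assumption based on the description above.

```r
# Hypothetical validator for the task fields described above.
checkTask <- function(billingCode, impact, effort) {
  problems <- character()
  # A complete billing code is assumed to be exactly 12 digits.
  if (!grepl("^[0-9]{12}$", billingCode)) {
    problems <- c(problems, "billing code is not 12 digits")
  }
  # Impact and effort are whole numbers from 1 to 10.
  scores <- c(impact = impact, effort = effort)
  for (field in names(scores)) {
    if (!(scores[[field]] %in% 1:10)) {
      problems <- c(problems, sprintf("%s must be between 1 and 10", field))
    }
  }
  problems
}

ok  <- checkTask("001203456789", impact = 8, effort = 3)  # valid record
bad <- checkTask("00120345", impact = 11, effort = 3)     # short code, impact too high
```

An empty character vector means the record passed every check; otherwise each element describes one problem.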

Finally, I want to track the actual hours in the day that I do the work.  So for this dataset I just want the task ID, the date/time in, and the date/time out.
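
To illustrate how the in/out times could feed the cumulative used-hours figure, here is a small base R sketch.  The task IDs and timestamps are made up, and the column name `Worked` is an assumption for illustration.

```r
# Made-up clock-in/clock-out records for two hypothetical tasks.
hours <- data.frame(
  TaskID  = c("T1", "T1", "T2"),
  TimeIn  = as.POSIXct(c("2014-07-01 09:00:00", "2014-07-02 13:00:00",
                         "2014-07-01 14:00:00"), tz = "UTC"),
  TimeOut = as.POSIXct(c("2014-07-01 11:30:00", "2014-07-02 14:00:00",
                         "2014-07-01 16:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Length of each work session in hours.
hours$Worked <- as.numeric(difftime(hours$TimeOut, hours$TimeIn, units = "hours"))

# Total hours per task, which is what a cumulative used-hours field would hold.
usedHours <- aggregate(Worked ~ TaskID, data = hours, FUN = sum)
```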

Since I want all of this to appear as a single object I'm going to use a list containing three data frames.  Below is a function that will actually generate this object.  I expect I'll only ever have to use it once, but it's still useful to me to think in this way.  My next post will get into adding projects and tasks.

library(lubridate)  # provides ymd() and ymd_hms()

# Build the empty tracking object: a list of three zero-row data frames.
createStructure <- function() {
  Projects <- data.frame(ProjectID = character(),
                         ProjectName = character(),
                         BillingCode = character(), # possibly partial
                         AddDate = ymd(),
                         RemoveDate = ymd(),
                         stringsAsFactors = FALSE)
  Tasks <- data.frame(TaskID = character(),
                      TaskName = character(),
                      ProjectName = character(),
                      BillingCode = character(), # should be complete; multiple codes spill into the Notes field
                      AddDate = ymd(),
                      CompleteDate = ymd(),
                      BudgetHours = numeric(),
                      TotalUsedHours = numeric(),
                      Impact = integer(),
                      Effort = integer(),
                      Notes = character(),
                      stringsAsFactors = FALSE)
  Hours <- data.frame(TaskID = character(),
                      TimeIn = ymd_hms(),
                      TimeOut = ymd_hms(),
                      stringsAsFactors = FALSE)
  # Name the elements so each data frame can be reached with $
  return(list(Projects = Projects, Tasks = Tasks, Hours = Hours))
}
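
To see what such an object looks like, here is a cut-down sketch of the same pattern using only base R.  The names `makeTinyStructure` and `tracker`, the reduced column set, and the named list elements are assumptions for illustration; `as.Date(character())` and `as.POSIXct(character())` stand in for the zero-length date columns.

```r
# Cut-down, base-R-only version of the structure (illustrative, not the full model).
makeTinyStructure <- function() {
  list(
    Projects = data.frame(ProjectID = character(),
                          AddDate = as.Date(character()),
                          stringsAsFactors = FALSE),
    Hours = data.frame(TaskID = character(),
                       TimeIn = as.POSIXct(character(), tz = "UTC"),
                       stringsAsFactors = FALSE)
  )
}

tracker <- makeTinyStructure()
names(tracker)                                           # elements of the list
nrow(tracker$Projects)                                   # no rows yet
sapply(tracker$Projects, function(col) class(col)[1])    # but the columns are typed
```

The point of the zero-length columns is that rows added later have to match types that were declared up front.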


Filed under Functional Programming, Productivity, R, Uncategorized

Assumption Checking - Part I

Reading Time: 2 minutes

Often when working, we are under deadlines to produce results in a reasonable timeframe. An analyst under a tight deadline may not check the assumptions behind a test. A simple example is the one-sample t-test. You might need to test your sample to see if its mean differs from a specific number. One assumption of the t-test that is often overlooked is that the sample needs to be drawn randomly from the population and the population is supposed to follow a Gaussian distribution. When was the last time in the workplace that you heard of someone performing a normality test before running a t-test? It is considered an extra step and is usually skipped. It really should not be considered a burden, and it can easily be handled with a wrapper function in R.

mytest <- function(x, value = 0) {
  # Capture the argument as typed, for use in the error message
  xx <- deparse(substitute(x))
  if (!is.numeric(x)) stop(sprintf('%s is not numeric', xx))
  print(t.test(x, mu = value))
  print(wilcox.test(x, mu = value))
}
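
One way to let a normality check decide which test is reported is sketched below.  The function name `chooseTest`, the use of the Shapiro-Wilk test, and the 0.05 cutoff are all assumptions for illustration rather than the approach used in this post.

```r
# Hypothetical variant: pick the test based on a Shapiro-Wilk normality check.
chooseTest <- function(x, value = 0, alpha = 0.05) {
  stopifnot(is.numeric(x))
  # shapiro.test() accepts between 3 and 5000 non-missing values.
  normalP <- shapiro.test(x)$p.value
  if (normalP > alpha) {
    result <- t.test(x, mu = value)       # normality not rejected
  } else {
    result <- wilcox.test(x, mu = value)  # normality rejected
  }
  result
}

set.seed(1)
skewed <- rchisq(1000, df = 5)            # clearly non-normal data
chosen <- chooseTest(skewed, value = 5)
chosen$method                             # name of the test that was run
```

Returning the test object instead of printing it lets the caller inspect the p-value or method afterwards.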

We can combine that with another function to produce a density plot.

myplot <- function(x, color = "blue") {
  xx <- deparse(substitute(x))
  if (!is.numeric(x)) stop(sprintf('%s is not numeric', xx))
  title <- paste("Density Plot", "\n", "Dataset = ", xx)
  mydens <- density(x)
  # Draw the estimated density with the chosen title and color
  plot(mydens, main = title, col = color)
}

Now, let's see how our functions work.  If we generate some random values from a Gaussian distribution, we would expect them to "normally" pass a normality test, so a t-test would be performed. However, if the data were generated from a distribution that is not normal, then we would typically expect to see the results of the Wilcoxon test.

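n <- 1000
normal <- rnorm(n, 0, 1)
chisq <- rchisq(n, df = 5)

#Test for difference from 0 for normal data
mytest(normal)
myplot(normal)

#Test for difference from 5 for chi-square data
mytest(chisq, value = 5)
myplot(chisq, color = "orange")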

Density Plots

Results from 'mytest(normal)':

One Sample t-test
data: x
t = 0.5143, df = 999, p-value = 0.6072
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.04541145 0.07766719
sample estimates:
mean of x

Results from 'mytest(chisq,value=5)':

Wilcoxon signed rank test with continuity correction

data: x
V = 214385, p-value = 8.644e-05
alternative hypothesis: true location is not equal to 5

The benefit of working ahead is clear. Once you have written these functions, you can add them to a personal R package hosted on GitHub. Then you can use them whenever you have an internet connection, and the whole R community has the chance to benefit. It is also easy to combine the two functions into one.


#Combine the functions
PlotAndTest <- function(x){
  myplot(x)
  mytest(x)
}

Filed under Assumption Checking, R