Category Archives: Uncategorized

Analyzing Lyft Data - Part I

Reading Time: 3 minutes

Ride sharing is a popular way to get around and much cheaper than some alternatives.  How much do the people who choose to participate in ride sharing actually make? We will start this series of examining Lyft driving data by getting the data loaded and trying to understand which hours are the best to drive.

The first step is to get our data loaded. The data is contained in a Github repository . You can clone the repo to get started or just directly read in the csv file. We will be just read in the data file directly. If you want to see the whole script in one place, try LyftData.R.

# Load libraries ----
library(lubridate) 
library(sqldf)
library(skimr)

# Read in Data ---
lyft <- read.csv(file="https://raw.githubusercontent.com/Spoted21/lyft/master/lyft2.csv",
stringsAsFactors = FALSE)


# Examine Data ----
head(lyft)
dim(lyft)
str(lyft)

# My new favorite function
skim(lyft)

After getting the data loaded there is some cleaning that is needed. After cleaning the data we will be looking at which hours are the best to drive. It will be useful to create a variable to indicate which rides are part of the same driving session. This will allow us to have a nice level to analyze the ride data.

# Calculate StartTime and EndTime ----
lyft$StartTime <- as.POSIXct(
  paste(
    strptime(lyft$Date,format = "%m/%d/%Y"),
    format(strptime(lyft$Time, "%I:%M %p"), format="%H:%M:%S")
  ),format="%Y-%m-%d %H:%M:%OS")

lyft$EndTime <-lyft$StartTime+(lyft$Time_Min*60+lyft$Time_Sec)

################################################################################
# Calculate a driving session ----
# If more than 4 hours since last ride a new session 
# is assumed to have started
################################################################################


lyft$rideSession <- 0
for(i in 1:nrow(lyft) ){
  #First Row - no comparison
  if(i==1) {lyft$rideSession[1] <- 1 } else if(i== nrow(lyft) ){ #Last Row
    lyft$rideSession[i] <- lyft$rideSession[i-1] }  else {
      timedifference <- as.numeric(difftime( lyft[i+1,]$StartTime, lyft[i,]$EndTime,units = "mins"))
      if(timedifference <= (60*4) ) { lyft$rideSession[i] <-lyft$rideSession[i-1] } else {
        lyft$rideSession[i] <- max(lyft$rideSession)+1 }
    }
}

lyft$TotalMoney <- lyft$Amount + lyft$Tip
lyft$starthour <- hour(lyft$StartTime)

# Night time driving 5pm to 3AM
night <- lyft[lyft$starthour %in% c(17:23,(0:3)) , ]

After we have the data loaded and cleaned, we can now attempt visualization. We want to see the variability of money earned by hour. A boxplot will work nicely for our purpose.

# Used for formatting the plot
HourLabels <- data.frame(hour = 0L:23L, label =
                           c(paste0(c(12,1:11),"AM") ,
                             paste0(c(12,1:11),"PM"))
)

#Distribution of Money Made by start hour ----
plotData <- night
myLabels <- HourLabels[HourLabels$hour %in% unique(plotData$starthour),]$label 
myColors <- c("lightblue","wheat","lightgreen","gray")
totalrides <- nrow(night)

# Make Boxplot To Show Money Made by Hour ----
png(filename = "MoneyByHour.png")
with(
  plotData,
  boxplot(
    TotalMoney ~ starthour ,
    col= myColors,
    las=1,
    main=paste0("Distribution of Money Made by Start Hour\n",
                "(n = ",totalrides,")") ,
    names=myLabels,
    yaxt="n"
  )
)#End With Statement

# Add formatted axis 
axis(side=2,
     at = axTicks(2),
     labels =paste0("$ ",axTicks(2)),
     las=1)
dev.off()

 

We end up with the plot below which shows us the variability by hour. The most profitable hour is 2AM with a median of $11.58.

Our data is still fairly sparse. A quick look with the table command will show this.

# Examine Frequency
table(lyft[c("day","starthour")])

This is due to the lower number of rides given. As this dataset grows we will be able to perform more interesting analysis, such as predicting the estimated earning for a given hour and day. Stay tuned for Part II where we look to answer some more questions and begin the modeling process.

 

 

 

Leave a Comment

Filed under Uncategorized

Job Market For Statisticians

Reading Time: 2 minutes

I have been forced to think about the job market lately. It started with a class assignment which was meant to simply open my eyes to current job market. I felt that I was already familiar enough but completed the assignment to be a good student. I completed the assignment and outlined the skills I need improve upon and so forth. With in day of completing my assignment I came across "A Guide and Advice for Economists on the U.S. Junior Academic Job Market: 2014-2015 Edition" after clicking through some links on facebook. I found it a great read and it caused me to starting thinking about a few things that will likely prove helpful down the road. I had originally intended to pursue a Ph.D. in Economics after finishing my M.S. in Statistics. However, life took a turn and I ended up working full time and then started working on my Ph.D. in Statistics part time while continuing to work. Make sure that you take a look at the salary tables that are included. The table below is for full time working White Males by which Ph.D. they obtained. There is more variation associated with the Economics degree, but not enough to not make it look better than Math or Statistics based solely upon salary.

For White Males Median Salary SE       95 % Range
Mathematics/Statistics  $100,000   1,500  (97,000 - 103,000)
Economics   $126,000   5,500 (115,000 -137,000)

(Data taken from: Table 50 , the 95% range is mine based on the assumption of a normal distribution.)

There is also an article in the American Statistician recently about career paths, "Which Career Path Will You Follow?". Between these three events that occurred within a week, I thought that it merited a post. Have some other useful job advice or interesting statistics that current graduate students should know? Post a comment.

Leave a Comment

Filed under Uncategorized

Building a productivity system in R, Part 1

Reading Time: 3 minutes

I recently came to the conclusion that I need a more meaningful way to track my productivity than the spreadsheet I am currently using, so my next few posts are going to be about building a system in R to track this.  If you're building your own productivity tracking system then by all means take this as inspiration, but don't expect it to suit your needs.  I'm making it to suit my needs using terminology that is common in my workplace and you'll have to figure out what will work for your needs in your workplace.

As with all such endeavors, the thing that is really going to make or break this tracking is the data model, so let's define that first.

At the very top level I have projects.  Each client will have one or more projects.  I'm not interested in tracking work for particular clients (at least for now) so I'm skipping that level, but it is necessary to note that each client has a 4 digit number.  Each project also has a 4 digit number, so the combination of the client digits and the project digits form a partial billing code.  The addition of the task-level 4 digit number makes a complete billing code that can be entered into my timesheet, but we're not there yet.  At the project level, the first two quartets is all that is necessary.  Additionally, we're going to have a name for the project, the date the project gets added, and the date the project gets removed.  Projects can often be multi-year endeavors, so understanding just how long you've been working on various tasks for a project can be useful.  For referencing across different datasets in this data model a project ID will also be defined.

Below the project level, as mentioned, are tasks.  Each task is a concrete goal that has been assigned for me to work on for that project.  Sometimes I only have one task for an entire project, other times I might have several tasks simultaneously. Some tasks may also depend on the completion of other tasks.   So we're going to want the following things: task ID, task name, project ID, complete 12 digit billing code, if the task depends on the completion of another task, add date, complete date, budgeted hours, total used hours (will be cumulative), impact, effort, and notes.  I'm using the impact and effort fields to automatically assign priorities.  They will each be given a value from 1 to 10, with 10 being the highest.  I'm not going to get into how impact and effort will be used to create the priority since I will go into more detail about that in a future post, but see this article for my inspiration.

Finally, I want to track the actual hours in the day that I do the work.  So for this dataset I just want the task ID, the date/time in, and the date/time out.

Since I want all of this to appear as a single object I'm going to use a list containing three data frames.  Below is a function that will actually generate this object.  I expect I'll only ever have to use it once, but it's still useful to me to think in this way.  My next post will get into adding projects and tasks.

createStructure <- function() {
  require(lubridate)
  Projects <- data.frame(ProjectID = character(),
                         ProjectName = character(),
                         BillingCode = character(), #(possibly partial)
                         AddDate = ymd(),
                         RemoveDate = ymd(),
                         stringsAsFactors=FALSE)
  Tasks <- data.frame(TaskID = character(),
                      TaskName = character(),
                      ProjectName = character(),
                      BillingCode = character(), #(should be complete)(multiple codes spill into Notes field)
                      AddDate = ymd(),
                      CompleteDate = ymd(),
                      BudgetHours = numeric(),
                      TotalUsedHours = numeric(),
                      Impact = integer(),
                      Effort = integer(),
                      Notes = character(),
                      stringsAsFactors=FALSE)
  Hours <- data.frame(TaskID = character(),
                      TimeIn = ymd_hms(),
                      TimeOut = ymd_hms(),
                      stringsAsFactors=FALSE)
  return(list(Projects, Tasks, Hours))
}

 

Leave a Comment

Filed under Functional Programming, Productivity, R, Uncategorized