Keeping rows containing particular strings in R

Reading Time: 2 minutes

I recently needed to filter the rows of a dataset based on whether they contained certain strings. Specifically, I had to retain any row that contained all three of "utm_source", "utm_medium", and "utm_campaign"; each row in my dataset was a single string. My approach was to use grepl and check each string against each condition it needed to satisfy. I consulted my co-blogger to see if he had a more elegant way of approaching the problem, and he tackled it with a single regular expression using look-aheads. You can see my 'checker' function below along with Jeremy's 'checker2'. Both perform the required task correctly, so choosing between them is simply a matter of performance.

#Sample Data
querystrings <- c("skuId=34567-02-S&qty=1&continueShoppingUrl=https://www.beardedanalytics.com/?utm_source=ER&utm_medium=email&utm_content=Main&utm_campaign=ER101914G_greenlogoupper&cm_lm=foo@person.invalid&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket1&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937", 
"skuId=6950K-02-S&qty=1&continueShoppingUrl=https://www.beardedanalytics.com/&utm_medium=email&utm_content=Main&utm_campaign=ER101914G_greenlogoupper&cm_lm=foo2@person.invalid&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket2&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937"
)

mydf <- as.data.frame(querystrings)


# This should return TRUE when all conditions have been satisfied
checker <- function(foo){
  grepl(pattern = "utm_source", x = foo) &
    grepl(pattern = "utm_medium", x = foo) &
    grepl(pattern = "utm_campaign", x = foo)
}

checker2 <- function(foo){
  grepl(pattern = "^(?=.*utm_source)(?=.*utm_medium)(?=.*utm_campaign).*$",
        x = foo, perl = TRUE)
}
# This is the loop that was run to repeatedly time each function against a much larger dataset.
# Yes, I know this is not an efficient way to do this, but it is easy to read.
ttime <- c()
for (i in 1:100){
  tt <- system.time( tresult <- mydf[checker(mydf[, 1]), ] )
  ttime <- rbind(ttime, tt[3])
}

jtime <- c()
for (i in 1:100){
  jt <- system.time( jresult <- mydf[checker2(mydf[, 1]), ] )
  jtime <- rbind(jtime, jt[3])
}

mean(ttime)
mean(jtime)
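
As an aside, a more standard way to time small expressions is the 'microbenchmark' package. This is a sketch of mine, assuming the package is installed; it is not how the timings in this post were produced.

library(microbenchmark)
microbenchmark(checker(mydf[, 1]), checker2(mydf[, 1]), times = 100)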

I am not able to share the full dataset that I was using, due to privacy concerns, but the dataset I tested both functions against had 26,746 rows. My 'checker' function took 0.0801 seconds on average and Jeremy's approach took 0.1488 seconds. I decided to stick with my 'checker' function, but not because of speed; I would have happily accepted the longer computation time if the timings had been reversed. The reason is that I find mine easier to read, which means there is a chance I could come back to this code in six months and have a clue about what it is supposed to be doing. Regular expressions can be quite hard to come back to and say, "oh yeah, I wanted to check whether all the characters occupying prime positions in my string are vowels!". My simple grepl statements will also be easier to change if that becomes necessary, so I will stick with the 'checker' approach. Do you have a better way to approach this in R? If so, make sure to post a comment.
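
If the set of required substrings is likely to change, one option (a sketch of mine, not part of the original comparison) is to keep the patterns in a vector and reduce over them, so adding a condition means editing data rather than the function body:

# Generalized checker: TRUE where every pattern in 'patterns' is found.
# The default pattern vector mirrors the three conditions used above.
checker_all <- function(foo, patterns = c("utm_source", "utm_medium", "utm_campaign")) {
  Reduce(`&`, lapply(patterns, function(p) grepl(p, foo, fixed = TRUE)))
}

mydf[checker_all(mydf[, 1]), ]  # same rows as checker()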

Leave a Comment

Filed under R

Summary Function that is compatible with xtable

Reading Time: < 1 minute

If you like to make nice looking documents using LaTeX, I highly recommend the 'xtable' package. In most instances it works quite well, producing a reasonable looking table from an R object. However, I recently wanted a LaTeX table from the 'summary' function in base R. So naturally I tried:

library(xtable)
set.seed(123)
foo = rnorm(10)
summary(foo)
xtable(summary(foo))

Which gave me the following error:

Error in xtable.table(summary(foo)) :
  xtable.table is not implemented for tables of > 2 dimensions

So I decided to create a simple function that would return a dataframe which is easy to use with xtable. Here is what I came up with.

 

summaryfunction <- function(x){
  if (!is.numeric(x)) stop("Supplied x is not numeric")
  mysummary <- data.frame(
    "Min."    = as.numeric(min(x)),
    "1st Qu." = quantile(x)[2],
    "Median"  = median(x),
    "Mean"    = mean(x),
    "3rd Qu." = quantile(x)[4],
    "Max."    = max(x),
    row.names = ""
  )
  # data.frame() mangles names like "1st Qu.", so restore them explicitly
  names(mysummary) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  return(mysummary)
}

Now, when I try to use xtable I get the following output:

% latex table generated in R 3.1.1 by xtable 1.7-4 package
% Tue Nov 04 21:58:07 2014
\begin{table}[ht]
\centering
\begin{tabular}{rrrrrrr}
  \hline
 & Min. & 1st Qu. & Median & Mean & 3rd Qu. & Max. \\
  \hline
 & -1.27 & -0.53 & -0.08 & 0.07 & 0.38 & 1.72 \\
   \hline
\end{tabular}
\end{table}

This should make it easier to incorporate more summaries when you are writing your paper, using R and knitr of course. If you do use knitr, make sure to try the results = 'asis' chunk option with xtable.
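
For example, a knitr chunk along these lines (a sketch of mine; the chunk label is made up) drops the LaTeX for the table straight into an .Rnw document:

<<summarytable, results='asis'>>=
library(xtable)
print(xtable(summaryfunction(foo)))
@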

Leave a Comment

Filed under R

Sending Email From R

Reading Time: 2 minutes

When I am working in SQL Server and need to send an email, I use "sp_send_dbmail". When I am working in R, however, I didn't have an equivalent. I often use an email as notification that a process has finished, and it also works nicely as a text to your cell phone. I had one additional reason to want email from R: I wanted to send notes to my Evernote account with just a few keystrokes. The goal was to accomplish this with a simple wrapper function. Below is the solution I came up with. It works, but there are serious security implications, so I offer it merely as a proof of concept. Hopefully someone can show me a better way to handle passing your email password to the R function.

# install.packages("mailR")
library(mailR)
en <- function(s = "Message From R", tags = "#todo", b = " ",
               r = c("myevernoteaddress@m.evernote.com")){
    sender <- "myemail@mydomain.com"
    # Read the password from a plain text file -- see the security caveat above
    password <- readLines("pathtopassword/password.txt")
    if (r == "me") r <- "myemail@mydomain.com"
    send.mail(
        from = sender,
        to = r,
        subject = paste(s, tags),
        body = b,
        smtp = list(host.name = "mail.mydomain.com", port = 465,
                    user.name = sender, passwd = password, ssl = TRUE),
        authenticate = TRUE,
        send = TRUE
        )
}
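
One small improvement over a plain text file, sketched here as an assumption rather than a vetted solution, is to keep the password in an environment variable (for example, set in ~/.Renviron) so it never lives inside the project directory:

# The variable name 'SMTP_PASSWORD' is hypothetical; Sys.getenv() returns "" if it is unset
password <- Sys.getenv("SMTP_PASSWORD")
if (identical(password, "")) stop("SMTP_PASSWORD is not set")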

I often have an R terminal open, so when I have a great idea for a research project I can easily add a note to my Evernote account. For instance:

en(s="Read up on current imputation methods",tags="#Research")

Then a new entry is added for me in Evernote with the tag 'Research'. (I have noticed that the tagging only seems to work if the tag already exists in my Evernote account.) Often I have a task that just needs to be done that I don't want to forget about; I can issue a quick command and then I have a record of it.

en(s="email adviser about research")

That is all I need to do, and the note is added to my account. I have found this quite useful, and hopefully you will as well.

Leave a Comment

Filed under email, R

Job Market For Statisticians

Reading Time: 2 minutes

I have been forced to think about the job market lately. It started with a class assignment that was meant simply to open my eyes to the current job market. I felt I was already familiar enough with it, but I completed the assignment to be a good student, outlining the skills I need to improve upon and so forth. Within a day of completing the assignment, I came across "A Guide and Advice for Economists on the U.S. Junior Academic Job Market: 2014-2015 Edition" after clicking through some links on Facebook. I found it a great read, and it started me thinking about a few things that will likely prove helpful down the road. I had originally intended to pursue a Ph.D. in Economics after finishing my M.S. in Statistics; however, life took a turn, and I ended up working full time and then starting my Ph.D. in Statistics part time while continuing to work. Make sure you take a look at the salary tables included in the guide. The table below is for full-time working white males, by the Ph.D. they obtained. There is more variation associated with the Economics degree, but not enough to keep it from looking better than Math or Statistics based solely on salary.

For White Males           Median Salary    SE       95% Range
Mathematics/Statistics    $100,000         1,500    (97,000 - 103,000)
Economics                 $126,000         5,500    (115,000 - 137,000)

(Data taken from Table 50; the 95% range is mine, based on the assumption of a normal distribution.)
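
For the curious, the ranges can be reproduced in R (this snippet is mine, not from the report):

# 95% range as estimate +/- 1.96 * SE, rounded to the nearest thousand
ci <- function(est, se) round(est + c(-1, 1) * qnorm(0.975) * se, -3)
ci(100000, 1500)  # Mathematics/Statistics: 97000 103000
ci(126000, 5500)  # Economics: 115000 137000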

There was also a recent article in The American Statistician about career paths, "Which Career Path Will You Follow?". Between these three events occurring within a week, I thought the topic merited a post. Have some other useful job advice or interesting statistics that current graduate students should know? Post a comment.

Leave a Comment

Filed under Uncategorized

Helper Stored Procedures in SQL Server - Part 2

Reading Time: 3 minutes

Continuing with last week's post about helper stored procedures, I wanted to introduce two more that I use frequently. The first is a built-in stored procedure (in SQL Server) called 'sp_helptext'. It allows you to view the contents of other stored procedures (SPs), functions, and more; see the Microsoft documentation for details. Here is an example of how you use it:

sp_helptext @objname='sp_helptext'

This shows you the contents of sp_helptext by using it. It comes in very handy as a keyboard shortcut; I assigned mine to 'Ctrl+3'. Often I have an SP whose last step is to put the final data into a table, and I may not remember which table the data goes into. I can simply highlight the SP name (I typically double-click it), press 'Ctrl+3', and the contents of the SP appear in my results tab. This also works nicely for functions and triggers.

Another piece of information I am often curious about is how large a particular table is. If you like going through menus, you can right-click a table name in the object explorer, choose Properties, and then select 'Storage' from the left navigation menu. Under the 'General' section you will see "Data Space", "Row Count", and "Index Space". While this is okay if you only perform the task once a month, it takes way too long if you are running it five or more times per day. So I wrote a stored procedure called 'TableSize_Sp' that is a wrapper for another Microsoft SP, sp_spaceused.
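
For reference, sp_spaceused can also be called directly on any existing table; the wrapper below mainly reshapes its output into megabytes and adds a timestamp. (The table name here is a placeholder.)

-- Returns one row with: name, rows, reserved, data, index_size, unused
EXEC sp_spaceused 'dbo.YourTable'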

USE [YourDatabase]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO


Create Procedure [dbo].[TableSize_Sp]
@TableName varchar(500)
as


If object_id(@Tablename,'U') is not null
Begin
If Object_ID('tempdb..#TableSummary','U') Is Not Null  
Drop Table #TableSummary

CREATE TABLE #TableSummary
(
[database] varchar(500) default db_name(),
table_name sysname ,
row_count BIGINT,
reserved_size VARCHAR(50),
data_size VARCHAR(50),
index_size VARCHAR(50),
unused_size VARCHAR(50),
[timestamp] datetime default getdate() not null,
Calendardate as convert(datetime,convert(varchar,[timestamp],101))
)

Declare @Sql varchar(max)
Set @Sql='
INSERT #TableSummary ([table_name],[row_count],[reserved_size],[data_size],[index_size],[unused_size])
EXEC sp_spaceused''' + @TableName + '''
'

Exec (@Sql)

If Object_ID('tempdb..#TableSummaryCleaned','U') is not null
Drop Table #TableSummaryCleaned
Select
	a.[Database],
	a.table_name [TableName],
	round(sum(cast(a.row_count as bigint)),2)[Rows],
	round(sum(cast(reserved_size_kb as float)/1024),2)ReservedSizeMB,
	round(sum(cast(data_size_kb as float)/1024),2)DataSizeMB,
	round(sum(cast(index_size_kb as float)/1024),2)IndexSizeMB,
	round(sum(cast(unused_size_kb as float)/1024),2)UnusedSizeMB,
	[Timestamp],
	Calendardate
Into #TableSummaryCleaned	
From
	(
	Select 
		[database],
		table_name ,
		row_count,
		cast(replace(reserved_size, ' KB','') as bigint) reserved_size_kb,
		cast(replace(data_size, ' KB','') as bigint) data_size_kb,
		cast(replace(index_size, ' KB','') as bigint) index_size_kb,
		cast(replace(unused_size, ' KB','') as bigint) unused_size_kb,
		[timestamp],
		[calendardate]
	From #TableSummary
	)a
Group By 
	a.[database],
	a.table_name,
	[Timestamp],
	Calendardate

Alter Table #TableSummaryCleaned
Add TotalMB float

Update #TableSummaryCleaned 
Set TotalMB = ReservedSizeMB 

Alter Table #TableSummaryCleaned
Add TotalGB float

Update #TableSummaryCleaned
Set TotalGB =TotalMB/1024 

Select *
From #TableSummaryCleaned
End
Else
Begin
	Declare @ErrorText varchar(500)
	Set @ErrorText = @TableName+' is not a valid table.'
 RAISERROR(@ErrorText, 16, 1)
End

GO

After you create 'TableSize_Sp', you can assign it to a shortcut in SSMS; I chose 'Ctrl+4'. We can create a table and add data and an index to make sure that our new SP works.

 

Create Table dbo.TestTable
( numbers float )


-- Populate the table with 1,000 rows (1000, 2000, ..., 1000000)
Declare @Counter int
Set @Counter=1
While @Counter <=1000
Begin
    Insert Into dbo.TestTable Select @counter*1000
    Set @Counter=@Counter+1
End

-- Check Table Size
exec TableSize_Sp 'TestTable'

-- Create Index
Create NonClustered Index [NC_Numbers] on dbo.TestTable
(numbers)

-- Check size after creating index
exec TableSize_Sp 'TestTable'

You will notice that the index size increased after adding one, which is to be expected. Hopefully this lets you quickly check the size of a table. Combine it with 'sp_helptext' and 'sp_top10' and you can find out a lot about a table very quickly. Let me leave you with a quick reference guide for which SP to use depending on what you want to know.

 

SP Name        Use                                                     Shortcut
sp_helptext    View the contents of a stored procedure or function    Ctrl+3
TableSize_Sp   Check the size of a table, including indices           Ctrl+4
sp_top10       View the top 10 records of a table                     Ctrl+1
sp_help        View information about an object, e.g. a table schema  Alt+F1

Note: SSMS 2012 was used for this example.

Leave a Comment

Filed under Sql Server

Helper Stored Procedures in SQL Server - Part 1

Reading Time: 2 minutes

Often I have days where a large amount of my time is spent writing SQL code in SQL Server Management Studio (SSMS). Building tables, creating indices, and verifying data are all common tasks. However, I often find myself needing to check which columns are in a table or to take a quick look at its data. I got tired of typing Select Top 10 * From [MyTable] more than five times per day, so to get around typing it repeatedly I made a stored procedure called 'sp_top10'.

CREATE Procedure [dbo].[sp_top10]
@Table as varchar(100)

as
Begin
Set NoCount On

Declare @dbname as varchar(100)
Set @dbname = db_name()

-- Build and run a dynamic query against the current database
Declare @MyQuery varchar(max)
Set @MyQuery = '
Select Top 10 *
From ' + @dbname + '..' + @Table + ' with(nolock)'

Exec (@MyQuery)
End

This stored procedure can be invoked by typing

exec sp_top10 'YourTableHere'

where you substitute in your own table name. The real power comes when you assign your new stored procedure to a shortcut. In SSMS you can do this by going to Tools > Options > Keyboard > Query Shortcuts. Simply add 'sp_top10' to an open key combination and then restart SSMS. You should now be able to highlight (or double-click) a table name and then use your new shortcut to view the top 10 records in that table. I set my shortcut to 'Ctrl+1'.

An example:

-- Create Table To Hold Our Dataset
If Object_ID('tempdb..#TestData','U') is not null
Drop Table #TestData

Create Table #TestData
(ID bigint not null identity(1,1) ,
Letter varchar(1)
)

-- Quick Loop to get all lower case letters into our table
Declare @Counter Int;Set @Counter=97
While @Counter <=122
Begin
Declare @Sql Varchar(Max)
Set @SQL ='
Insert into #TestData (letter)
Select char(' + cast(@Counter as varchar)  + ')
'
--print @SQL
Exec (@Sql)
Set @Counter = @Counter+1
End

-- View all 26 records
Select *
From #TestData

-- Just view the top 10
exec sp_top10 '#testdata'

Your output should look similar to Figure 1.

[Figure 1: sp_top10 returning the first ten rows of #TestData]

Feel free to edit the SP to suit your particular situation. I use this in conjunction with 'sp_help', which has a default shortcut of 'Alt+F1' in SSMS, almost every day. So I can select a table name and press 'Alt+F1' if I need to see what type of columns I have, and then 'Ctrl+1' to view the first ten records of that table. Hopefully this will help save you time in your daily work.

Note: SSMS 2012 was used for this example.

1 Comment

Filed under Sql Server

Making a Code Book in Sql Server

Reading Time: 4 minutes

While working through a Coursera course recently (https://www.coursera.org/course/getdata), I started to think about how a code book could be implemented in SQL Server. Jeff Leek makes three points regarding a code book, as follows:

  1. Information about the variables ...
  2. Information about the summary choices you made
  3. Information about the study design you used

My focus for this article will be on point number one, information about the variables. Oftentimes it would be nice to know more about your data than simply the data structure it is stored in. For a simple working example, load the iris dataset into SQL Server and then add some metadata about the columns to provide further information for the users.

 

Use YourDataBaseName
Go

-- Create Table 
If Object_Id('YourDataBaseName.dbo.Iris','U') is not null 
Drop Table dbo.Iris

Create Table dbo.Iris
(
IrisID bigint not null identity(1,1),
SepalLength numeric(3,1), -- plain 'numeric' defaults to numeric(18,0) and would round these values
SepalWidth numeric(3,1),
PetalLength numeric(3,1),
PetalWidth numeric(3,1),
Species nvarchar(100)
)
-- Insert Values
Insert Into dbo.Iris Values('5.1','3.5','1.4','0.2','setosa')
Insert Into dbo.Iris Values('4.9','3','1.4','0.2','setosa')
Insert Into dbo.Iris Values('4.7','3.2','1.3','0.2','setosa')
Insert Into dbo.Iris Values('4.6','3.1','1.5','0.2','setosa')
Insert Into dbo.Iris Values('5','3.6','1.4','0.2','setosa')
Insert Into dbo.Iris Values('5.4','3.9','1.7','0.4','setosa')
Insert Into dbo.Iris Values('4.6','3.4','1.4','0.3','setosa')
Insert Into dbo.Iris Values('5','3.4','1.5','0.2','setosa')
Insert Into dbo.Iris Values('4.4','2.9','1.4','0.2','setosa')
Insert Into dbo.Iris Values('4.9','3.1','1.5','0.1','setosa')
Insert Into dbo.Iris Values('5.4','3.7','1.5','0.2','setosa')
Insert Into dbo.Iris Values('4.8','3.4','1.6','0.2','setosa')
Insert Into dbo.Iris Values('4.8','3','1.4','0.1','setosa')
Insert Into dbo.Iris Values('4.3','3','1.1','0.1','setosa')
Insert Into dbo.Iris Values('5.8','4','1.2','0.2','setosa')
Insert Into dbo.Iris Values('5.7','4.4','1.5','0.4','setosa')
Insert Into dbo.Iris Values('5.4','3.9','1.3','0.4','setosa')
Insert Into dbo.Iris Values('5.1','3.5','1.4','0.3','setosa')
Insert Into dbo.Iris Values('5.7','3.8','1.7','0.3','setosa')
Insert Into dbo.Iris Values('5.1','3.8','1.5','0.3','setosa')
Insert Into dbo.Iris Values('5.4','3.4','1.7','0.2','setosa')
Insert Into dbo.Iris Values('5.1','3.7','1.5','0.4','setosa')
Insert Into dbo.Iris Values('4.6','3.6','1','0.2','setosa')
Insert Into dbo.Iris Values('5.1','3.3','1.7','0.5','setosa')
Insert Into dbo.Iris Values('4.8','3.4','1.9','0.2','setosa')
Insert Into dbo.Iris Values('5','3','1.6','0.2','setosa')
Insert Into dbo.Iris Values('5','3.4','1.6','0.4','setosa')
Insert Into dbo.Iris Values('5.2','3.5','1.5','0.2','setosa')
Insert Into dbo.Iris Values('5.2','3.4','1.4','0.2','setosa')
Insert Into dbo.Iris Values('4.7','3.2','1.6','0.2','setosa')
Insert Into dbo.Iris Values('4.8','3.1','1.6','0.2','setosa')
Insert Into dbo.Iris Values('5.4','3.4','1.5','0.4','setosa')
Insert Into dbo.Iris Values('5.2','4.1','1.5','0.1','setosa')
Insert Into dbo.Iris Values('5.5','4.2','1.4','0.2','setosa')
Insert Into dbo.Iris Values('4.9','3.1','1.5','0.2','setosa')
Insert Into dbo.Iris Values('5','3.2','1.2','0.2','setosa')
Insert Into dbo.Iris Values('5.5','3.5','1.3','0.2','setosa')
Insert Into dbo.Iris Values('4.9','3.6','1.4','0.1','setosa')
Insert Into dbo.Iris Values('4.4','3','1.3','0.2','setosa')
Insert Into dbo.Iris Values('5.1','3.4','1.5','0.2','setosa')
Insert Into dbo.Iris Values('5','3.5','1.3','0.3','setosa')
Insert Into dbo.Iris Values('4.5','2.3','1.3','0.3','setosa')
Insert Into dbo.Iris Values('4.4','3.2','1.3','0.2','setosa')
Insert Into dbo.Iris Values('5','3.5','1.6','0.6','setosa')
Insert Into dbo.Iris Values('5.1','3.8','1.9','0.4','setosa')
Insert Into dbo.Iris Values('4.8','3','1.4','0.3','setosa')
Insert Into dbo.Iris Values('5.1','3.8','1.6','0.2','setosa')
Insert Into dbo.Iris Values('4.6','3.2','1.4','0.2','setosa')
Insert Into dbo.Iris Values('5.3','3.7','1.5','0.2','setosa')
Insert Into dbo.Iris Values('5','3.3','1.4','0.2','setosa')
Insert Into dbo.Iris Values('7','3.2','4.7','1.4','versicolor')
Insert Into dbo.Iris Values('6.4','3.2','4.5','1.5','versicolor')
Insert Into dbo.Iris Values('6.9','3.1','4.9','1.5','versicolor')
Insert Into dbo.Iris Values('5.5','2.3','4','1.3','versicolor')
Insert Into dbo.Iris Values('6.5','2.8','4.6','1.5','versicolor')
Insert Into dbo.Iris Values('5.7','2.8','4.5','1.3','versicolor')
Insert Into dbo.Iris Values('6.3','3.3','4.7','1.6','versicolor')
Insert Into dbo.Iris Values('4.9','2.4','3.3','1','versicolor')
Insert Into dbo.Iris Values('6.6','2.9','4.6','1.3','versicolor')
Insert Into dbo.Iris Values('5.2','2.7','3.9','1.4','versicolor')
Insert Into dbo.Iris Values('5','2','3.5','1','versicolor')
Insert Into dbo.Iris Values('5.9','3','4.2','1.5','versicolor')
Insert Into dbo.Iris Values('6','2.2','4','1','versicolor')
Insert Into dbo.Iris Values('6.1','2.9','4.7','1.4','versicolor')
Insert Into dbo.Iris Values('5.6','2.9','3.6','1.3','versicolor')
Insert Into dbo.Iris Values('6.7','3.1','4.4','1.4','versicolor')
Insert Into dbo.Iris Values('5.6','3','4.5','1.5','versicolor')
Insert Into dbo.Iris Values('5.8','2.7','4.1','1','versicolor')
Insert Into dbo.Iris Values('6.2','2.2','4.5','1.5','versicolor')
Insert Into dbo.Iris Values('5.6','2.5','3.9','1.1','versicolor')
Insert Into dbo.Iris Values('5.9','3.2','4.8','1.8','versicolor')
Insert Into dbo.Iris Values('6.1','2.8','4','1.3','versicolor')
Insert Into dbo.Iris Values('6.3','2.5','4.9','1.5','versicolor')
Insert Into dbo.Iris Values('6.1','2.8','4.7','1.2','versicolor')
Insert Into dbo.Iris Values('6.4','2.9','4.3','1.3','versicolor')
Insert Into dbo.Iris Values('6.6','3','4.4','1.4','versicolor')
Insert Into dbo.Iris Values('6.8','2.8','4.8','1.4','versicolor')
Insert Into dbo.Iris Values('6.7','3','5','1.7','versicolor')
Insert Into dbo.Iris Values('6','2.9','4.5','1.5','versicolor')
Insert Into dbo.Iris Values('5.7','2.6','3.5','1','versicolor')
Insert Into dbo.Iris Values('5.5','2.4','3.8','1.1','versicolor')
Insert Into dbo.Iris Values('5.5','2.4','3.7','1','versicolor')
Insert Into dbo.Iris Values('5.8','2.7','3.9','1.2','versicolor')
Insert Into dbo.Iris Values('6','2.7','5.1','1.6','versicolor')
Insert Into dbo.Iris Values('5.4','3','4.5','1.5','versicolor')
Insert Into dbo.Iris Values('6','3.4','4.5','1.6','versicolor')
Insert Into dbo.Iris Values('6.7','3.1','4.7','1.5','versicolor')
Insert Into dbo.Iris Values('6.3','2.3','4.4','1.3','versicolor')
Insert Into dbo.Iris Values('5.6','3','4.1','1.3','versicolor')
Insert Into dbo.Iris Values('5.5','2.5','4','1.3','versicolor')
Insert Into dbo.Iris Values('5.5','2.6','4.4','1.2','versicolor')
Insert Into dbo.Iris Values('6.1','3','4.6','1.4','versicolor')
Insert Into dbo.Iris Values('5.8','2.6','4','1.2','versicolor')
Insert Into dbo.Iris Values('5','2.3','3.3','1','versicolor')
Insert Into dbo.Iris Values('5.6','2.7','4.2','1.3','versicolor')
Insert Into dbo.Iris Values('5.7','3','4.2','1.2','versicolor')
Insert Into dbo.Iris Values('5.7','2.9','4.2','1.3','versicolor')
Insert Into dbo.Iris Values('6.2','2.9','4.3','1.3','versicolor')
Insert Into dbo.Iris Values('5.1','2.5','3','1.1','versicolor')
Insert Into dbo.Iris Values('5.7','2.8','4.1','1.3','versicolor')
Insert Into dbo.Iris Values('6.3','3.3','6','2.5','virginica')
Insert Into dbo.Iris Values('5.8','2.7','5.1','1.9','virginica')
Insert Into dbo.Iris Values('7.1','3','5.9','2.1','virginica')
Insert Into dbo.Iris Values('6.3','2.9','5.6','1.8','virginica')
Insert Into dbo.Iris Values('6.5','3','5.8','2.2','virginica')
Insert Into dbo.Iris Values('7.6','3','6.6','2.1','virginica')
Insert Into dbo.Iris Values('4.9','2.5','4.5','1.7','virginica')
Insert Into dbo.Iris Values('7.3','2.9','6.3','1.8','virginica')
Insert Into dbo.Iris Values('6.7','2.5','5.8','1.8','virginica')
Insert Into dbo.Iris Values('7.2','3.6','6.1','2.5','virginica')
Insert Into dbo.Iris Values('6.5','3.2','5.1','2','virginica')
Insert Into dbo.Iris Values('6.4','2.7','5.3','1.9','virginica')
Insert Into dbo.Iris Values('6.8','3','5.5','2.1','virginica')
Insert Into dbo.Iris Values('5.7','2.5','5','2','virginica')
Insert Into dbo.Iris Values('5.8','2.8','5.1','2.4','virginica')
Insert Into dbo.Iris Values('6.4','3.2','5.3','2.3','virginica')
Insert Into dbo.Iris Values('6.5','3','5.5','1.8','virginica')
Insert Into dbo.Iris Values('7.7','3.8','6.7','2.2','virginica')
Insert Into dbo.Iris Values('7.7','2.6','6.9','2.3','virginica')
Insert Into dbo.Iris Values('6','2.2','5','1.5','virginica')
Insert Into dbo.Iris Values('6.9','3.2','5.7','2.3','virginica')
Insert Into dbo.Iris Values('5.6','2.8','4.9','2','virginica')
Insert Into dbo.Iris Values('7.7','2.8','6.7','2','virginica')
Insert Into dbo.Iris Values('6.3','2.7','4.9','1.8','virginica')
Insert Into dbo.Iris Values('6.7','3.3','5.7','2.1','virginica')
Insert Into dbo.Iris Values('7.2','3.2','6','1.8','virginica')
Insert Into dbo.Iris Values('6.2','2.8','4.8','1.8','virginica')
Insert Into dbo.Iris Values('6.1','3','4.9','1.8','virginica')
Insert Into dbo.Iris Values('6.4','2.8','5.6','2.1','virginica')
Insert Into dbo.Iris Values('7.2','3','5.8','1.6','virginica')
Insert Into dbo.Iris Values('7.4','2.8','6.1','1.9','virginica')
Insert Into dbo.Iris Values('7.9','3.8','6.4','2','virginica')
Insert Into dbo.Iris Values('6.4','2.8','5.6','2.2','virginica')
Insert Into dbo.Iris Values('6.3','2.8','5.1','1.5','virginica')
Insert Into dbo.Iris Values('6.1','2.6','5.6','1.4','virginica')
Insert Into dbo.Iris Values('7.7','3','6.1','2.3','virginica')
Insert Into dbo.Iris Values('6.3','3.4','5.6','2.4','virginica')
Insert Into dbo.Iris Values('6.4','3.1','5.5','1.8','virginica')
Insert Into dbo.Iris Values('6','3','4.8','1.8','virginica')
Insert Into dbo.Iris Values('6.9','3.1','5.4','2.1','virginica')
Insert Into dbo.Iris Values('6.7','3.1','5.6','2.4','virginica')
Insert Into dbo.Iris Values('6.9','3.1','5.1','2.3','virginica')
Insert Into dbo.Iris Values('5.8','2.7','5.1','1.9','virginica')
Insert Into dbo.Iris Values('6.8','3.2','5.9','2.3','virginica')
Insert Into dbo.Iris Values('6.7','3.3','5.7','2.5','virginica')
Insert Into dbo.Iris Values('6.7','3','5.2','2.3','virginica')
Insert Into dbo.Iris Values('6.3','2.5','5','1.9','virginica')
Insert Into dbo.Iris Values('6.5','3','5.2','2','virginica')
Insert Into dbo.Iris Values('6.2','3.4','5.4','2.3','virginica')
Insert Into dbo.Iris Values('5.9','3','5.1','1.8','virginica')

-- Add Extended Properties
EXEC sys.sp_addextendedproperty @name=N'Notes', @value=N'
Sepal Length of Iris plant in centimetres.
' , @level0type=N'SCHEMA',
	@level0name=N'dbo', 
	@level1type=N'TABLE',
	@level1name=N'Iris', 
	@level2type=N'COLUMN',
	@level2name=N'SepalLength'
GO


EXEC sys.sp_addextendedproperty @name=N'Notes', @value=N'
Sepal width of Iris plant in centimetres.
' , @level0type=N'SCHEMA',
	@level0name=N'dbo', 
	@level1type=N'TABLE',
	@level1name=N'Iris', 
	@level2type=N'COLUMN',
	@level2name=N'SepalWidth'
GO



EXEC sys.sp_addextendedproperty @name=N'Notes', @value=N'
Petal Length of Iris plant in centimetres.
' , @level0type=N'SCHEMA',
	@level0name=N'dbo', 
	@level1type=N'TABLE',
	@level1name=N'Iris', 
	@level2type=N'COLUMN',
	@level2name=N'PetalLength'
GO

EXEC sys.sp_addextendedproperty @name=N'Notes', @value=N'
Petal Width  of Iris plant in centimetres.
' , @level0type=N'SCHEMA',
	@level0name=N'dbo', 
	@level1type=N'TABLE',
	@level1name=N'Iris', 
	@level2type=N'COLUMN',
	@level2name=N'PetalWidth'
GO

 
EXEC sys.sp_addextendedproperty @name=N'Notes', @value=N'
Species of Iris plant.
' , @level0type=N'SCHEMA',
	@level0name=N'dbo', 
	@level1type=N'TABLE',
	@level1name=N'Iris', 
	@level2type=N'COLUMN',
	@level2name=N'Species'
GO

Now you should have a table called Iris with some basic metadata about its columns. While you can navigate to the extended properties through the object explorer, it is nicer to access this information through a query.

-- View the code book for the Iris table plus some information about the data structure
SELECT
	ep.[major_id],
	ep.[minor_id],
	t.name AS [Table Name],
	c.name AS [Column Name],
	ep.[value] AS [Extended Property],
	infos.[Data_Type],
	infos.[Is_Nullable],
	infos.[Numeric_Precision],
	infos.[Numeric_Scale]
FROM
	sys.extended_properties AS ep
	inner join sys.tables AS t
	  ON ep.major_id = t.object_id
	inner join sys.columns AS c
	  ON ep.major_id = c.object_id
	  and ep.minor_id = c.column_id
	inner join INFORMATION_SCHEMA.COLUMNS infos
	  on infos.table_name = t.name
	  and infos.column_name = c.name
Where
	ep.class = 1
	and t.name = 'Iris' /* This is our table name */
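
If a note needs revising later, sys.sp_updateextendedproperty (and sp_dropextendedproperty) take the same argument pattern as sp_addextendedproperty; here is a quick sketch for the Species column:

EXEC sys.sp_updateextendedproperty @name = N'Notes',
	@value = N'Species of Iris plant (setosa, versicolor, or virginica).',
	@level0type = N'SCHEMA', @level0name = N'dbo',
	@level1type = N'TABLE',  @level1name = N'Iris',
	@level2type = N'COLUMN', @level2name = N'Species'
GO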



-- Drop Iris Table to avoid cluttering up your database
Drop Table dbo.Iris

Now you have a framework for creating a data code book that is self-contained within the table itself. This will prove most useful when you share a SQL Server table with someone else.

 

Leave a Comment

Filed under Sql Server

Unexpected behavior with summary function in R

Reading Time: 2 minutes

I often find myself working with data that includes dates and times, and sometimes I am interested in what happened on a particular calendar day. I usually avoid actual datetime formats when working at the day level and prefer to represent a day as an integer; for instance, the first day of each quarter can be represented as an integer (numeric also works for this example). If I want the oldest date in my dataset, I can just take the minimum, since I am using a numerical data structure.

mydates  <- c(20140101L,20140401L,20140701L,20141001L)
min(mydates)

[1] 20140101
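
(As an aside, not in the original post: if real Date objects are ever needed, these integers convert directly.)

as.Date(as.character(mydates), format = "%Y%m%d")
# [1] "2014-01-01" "2014-04-01" "2014-07-01" "2014-10-01"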

For some reason which I don't recall, I tried using the summary() function on my dates. The only values that would be meaningful are the minimum and the maximum, or so I thought.

summary(mydates)

The output:
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# 20140000 20140000 20140000 20140000 20140000 20140000

This behavior, which seemed odd to me, is caused by the way summary() deals with numerical data. So I decided to look at what summary() is actually doing. To view the code behind the summary function, type:

summary.default

The portion of code that we are interested in is this:

else if (is.numeric(object)) {
    nas <- is.na(object)
    object <- object[!nas]
    qq <- stats::quantile(object)
    qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
    names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.",
        "Max.")
    if (any(nas))
        c(qq, `NA's` = sum(nas))
    else qq
}

This code chunk is executed when the object we pass to summary() is numeric. If we substitute in our object 'mydates', we get the following code (with digits filled in from the function's default argument):

digits <- max(3L, getOption("digits") - 3L)
nas <- is.na(mydates)
mydates <- mydates[!nas]
qq <- stats::quantile(mydates)
qq <- signif(c(qq[1L:3L], mean(mydates), qq[4L:5L]), digits)

If you step through the code line by line, you will notice that after line 4 (the quantile() call), the result still shows the min and max values we would expect. After you execute line 5, however, the numbers change: they are no longer the actual values, because signif() rounds them to significant figures (four, under R's default options). For example, try:

signif(20140101,digits)

[1] 20140000

So be careful when using generic functions if you don't know what they are doing, and I would encourage you to take a look at the code behind some of the R functions you use the most. For instance, the fivenum() function does not alter my min and max values the way summary() did.
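
Two quick workarounds, for what it's worth (my addition, not from the original post):

# Ask summary() for more significant digits, or use fivenum(),
# which leaves the values unrounded.
summary(mydates, digits = 8)
fivenum(mydates)
# [1] 20140101 20140251 20140551 20140851 20141001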

Leave a Comment

Filed under R

Building a productivity system in R, Part 1

Reading Time: 3 minutes

I recently came to the conclusion that I need a more meaningful way to track my productivity than the spreadsheet I am currently using, so my next few posts are going to be about building a system in R to track it. If you're building your own productivity tracking system then by all means take this as inspiration, but don't expect it to suit your needs; I'm making it to suit mine, using terminology that is common in my workplace, and you'll have to figure out what will work in yours.

As with all such endeavors, the thing that is really going to make or break this tracking is the data model, so let's define that first.

At the very top level I have projects. Each client will have one or more projects. I'm not interested in tracking work for particular clients (at least for now), so I'm skipping that level, but it is worth noting that each client has a 4-digit number. Each project also has a 4-digit number, so the combination of the client digits and the project digits forms a partial billing code. Adding a task-level 4-digit number makes a complete billing code that can be entered into my timesheet, but we're not there yet; at the project level, the first two quartets are all that is necessary. Additionally, we're going to have a name for the project, the date the project gets added, and the date the project gets removed. Projects can often be multi-year endeavors, so understanding just how long you've been working on various tasks for a project can be useful. For referencing across the different datasets in this data model, a project ID will also be defined.

Below the project level, as mentioned, are tasks. Each task is a concrete goal that has been assigned for me to work on for that project. Sometimes I only have one task for an entire project; other times I might have several tasks simultaneously. Some tasks may also depend on the completion of other tasks. So we're going to want the following things: task ID, task name, project ID, complete 12-digit billing code, the ID of any task this one depends on, add date, complete date, budgeted hours, total used hours (cumulative), impact, effort, and notes. I'm using the impact and effort fields to automatically assign priorities; they will each be given a value from 1 to 10, with 10 being the highest. I'm not going to get into how impact and effort will be used to create the priority, since I will go into more detail about that in a future post, but see this article for my inspiration.

Finally, I want to track the actual hours in the day that I do the work.  So for this dataset I just want the task ID, the date/time in, and the date/time out.

Since I want all of this to appear as a single object I'm going to use a list containing three data frames.  Below is a function that will actually generate this object.  I expect I'll only ever have to use it once, but it's still useful to me to think in this way.  My next post will get into adding projects and tasks.

createStructure <- function() {
  # Zero-row data frames; as.Date(character()) and as.POSIXct(character())
  # give empty date columns (lubridate's ymd() needs something to parse).
  Projects <- data.frame(ProjectID = character(),
                         ProjectName = character(),
                         BillingCode = character(), # (possibly partial)
                         AddDate = as.Date(character()),
                         RemoveDate = as.Date(character()),
                         stringsAsFactors = FALSE)
  Tasks <- data.frame(TaskID = character(),
                      TaskName = character(),
                      ProjectID = character(),
                      BillingCode = character(), # (should be complete; multiple codes spill into Notes)
                      DependsOn = character(),   # TaskID of any prerequisite task
                      AddDate = as.Date(character()),
                      CompleteDate = as.Date(character()),
                      BudgetHours = numeric(),
                      TotalUsedHours = numeric(),
                      Impact = integer(),
                      Effort = integer(),
                      Notes = character(),
                      stringsAsFactors = FALSE)
  Hours <- data.frame(TaskID = character(),
                      TimeIn = as.POSIXct(character()),
                      TimeOut = as.POSIXct(character()),
                      stringsAsFactors = FALSE)
  return(list(Projects = Projects, Tasks = Tasks, Hours = Hours))
}
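
A hypothetical first use of the function, with a made-up file name:

tracker <- createStructure()
str(tracker, max.level = 1)            # three empty data frames
saveRDS(tracker, "productivity.rds")   # persist the empty structure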

 

Leave a Comment

Filed under Functional Programming, Productivity, R, Uncategorized

Assumption Checking - Part I

Reading Time: 2 minutes

Often when working, we are under deadlines to produce results in a reasonable timeframe, and an analyst under a tight deadline may not check his assumptions. A simple example to illustrate this is a one-sample t-test: you might need to test whether your sample mean is different from a specific number. An assumption of the t-test that is often overlooked is that the sample needs to be drawn randomly from a population that follows a Gaussian distribution. When was the last time in the workplace you heard of someone performing a normality test before running a t-test? It is considered an extra step that is not usually taken, but it really should not be a burden, and it can easily be accomplished with a wrapper function in R.

mytest <- function(x, value = 0) {
  xx <- as.character(substitute(x))
  if (!is.numeric(x)) stop(sprintf('%s is not numeric', xx))
  if (shapiro.test(x)$p.value > .10) {
    print(t.test(x, mu = value))
  } else {
    print(wilcox.test(x, mu = value))
  }
}

We can combine that with another function to produce a density plot.

myplot <- function(x, color = "blue"){
  xx <- as.character(substitute(x))
  if (!is.numeric(x)) stop(sprintf('%s is not numeric', xx))
  title <- paste("Density Plot", "\n", "Dataset = ", deparse(substitute(x)))
  mydens <- density(x)
  plot(mydens, main = title, las = 1)
  polygon(mydens, col = color)
}

Now, let's see how our functions work. If we generate random values from a Gaussian distribution, we would expect them to "normally" pass the normality test, so a t-test is performed. If the data were instead generated from a distribution that is not normal, then we would typically expect to see the results of the Wilcoxon test.

set.seed(123)
n <- 1000
normal <- rnorm(n,0,1)
chisq <- rchisq(n,df=5)

mytest(normal)
myplot(normal)

#Test for difference from 5 for chi-square data
mytest(chisq,value=5)
myplot(chisq ,color="orange")

[Figure: density plots of the normal and chi-square samples]

Results from 'mytest(normal)':

One Sample t-test
data: x
t = 0.5143, df = 999, p-value = 0.6072
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.04541145 0.07766719
sample estimates:
mean of x
0.01612787

Results from 'mytest(chisq,value=5)':

Wilcoxon signed rank test with continuity correction

data: x
V = 214385, p-value = 8.644e-05
alternative hypothesis: true location is not equal to 5

Conclusion
The benefit of working ahead can be seen: once you have these functions written, you can add them to a personal R package hosted on GitHub. Then you will be able to use them whenever you have an internet connection, and the whole R community has a chance to benefit. It is also easy to combine the two functions into one.

 

# Combine the functions
PlotAndTest <- function(x){
  mytest(x)
  myplot(x)
}

 


Leave a Comment

Filed under Assumption Checking, R