The first day started off with a great lecture on basic software engineering principles that all Statisticians should know. Paul Teetor, gave the talk "What Can We Learn from Software Engineers?". He covered some basic but very important principles including:
- Coding Standards
- Defensive Programming
- Version Control
- Unit Testing
I appreciated him introducing me to the difference between "programming in the small and programming in the large". Often I find myself writing code "in the small" and in hindsight, that is not great. Paul walked our group through building a basic R package and putting it under version control in less than 20 minutes! So far, this has been my favorite session. The only other thing that is worth mentioning was a conversation that I had with a representative from Wolfram Alpha. This gentleman was explaining the features of their products and how great their software is. I was listening to his pitch when he caught me off guard with an odd statistic about how many lines of code their software had. It went something like this:
Sales_Rep : We have more code than the human genome!
Me: You guys should really write more efficient code
He quickly explained that their code is efficient, but I had to leave shortly after that to avoid laughing out loud. We ended the first day with a poster session and some socializing. Overall, it was a good first day.
I often find myself needing to check if a table is unique with respect to a particular column. For instance, if I have a table of customer data I want to ensure that a customer only exists once. There are several ways that you can do this. If I want to write a quick select if usually looks like this.
-- Method 1
Select count(*),Count(Distinct ColumnofInterest)
A while back I got tired of writing the above query over and over, so I created a stored procedure (SP). The SP below take two arguments, a table name and a column name and will determine if your column is unique. Currently it is written using a
count(*) , which is not meant for large tables, i.e. > 2,000,000,000 rows .
-- Method 2
SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
Create Procedure [dbo].[DupeChecker]
-- Construct SQL Statement Using Name Provided
Declare @MyQuery as nvarchar(max)='
Select @DupeFlag =case when count(*) = count(Distinct ' + @Column + ') then 0 else 1 end
From ' + @Table + ' with(nolock)
Declare @MyNumber smallint
Set @MyNumber = 0
exec sp_executesql @MyQuery, N'@DupeFlag smallint out', @MyNumber out
-- Raise Error if Table is not uqniue
Declare @ErrorText Varchar(100)
Set @ErrorText = 'Your column ' + @Column + ' on table ' + @Table + ' is not unique.'
If (@MyNumber =1)
-- If unique print a message to confirm
Declare @ErrorText2 Varchar(100)
Set @ErrorText2 = 'Your column ' + @Column + ' on table ' + @Table + ' is unique.'
If (@MyNumber =0)
There is one more method that I commonly use to determine if my table is unique and that is an index.
-- Method 3
Create Unique NonClustered Index [NCI_Dupe_Checker] on TableName
( MyColumn )
So, now you have three different ways to check your table to make sure that your column is unique. Keep in mind that adding an index will take up additional space, where methods 1 and 2 will not.