Are you dealing with large (100 million row +) datasets that live in a sql database? Have you found your old methods not to be satisfactory? Here are a couple of tips from my own experience.
- Do not use count(*) to figure out how big of a table you have.
- Do not use the max function to figure out when the last record was inserted
I found this out the hard way when I actually got an error when I tried to run
Select Count(*) From Table and received an arithmetic overflow error. I was puzzled at first and after some searching I found the problem. A 'count' in sql server returns the datatype int, which means it can only be 2^31-1 or 2,147,483,647. The table that I was working with had more than 2.1 billion rows, so that caused a problem. Now, you may be thinking that you could just use a Count_Big instead, but that is probably not the right answer. Try using sp_spaceused instead. If you are interested in turning this into a shortcut for SSMS, look here.
Sometimes I need to figure out when the latest record was inserted. Instead of taking a max on a datetime field. I can often get to my answer by using information about an index. Hopefully your table has a unique auto incrementing primary key that can aid you in finding the last record inserted. Make sure you understand the process that builds or alters your table. It could be that the maximum value of your primary key is not related to the most recent records.
Select max(LocalTimeStamp) From Table ( Slow )
Select LocalTimeStamp From Table Where PrimaryKey = ( Select Max(PrimaryKey) From Table ) ( Quick )
Have some useful tips to add? Please post them in the comments.
Occasionally when I am jotting some code I find myself creating several temporary variables with the intention of later getting rid of them. These variables involve quick names that are defined in a local scope and get quite confusing out of their context. If the project expands, however, I find myself with three options for playing with the next level of the project: remember what I've already used and try to avoid it, rename all of the temporary variables (e.g. rewrite the code of the base level of the project), or wipe the variables for use later. This decision is usually made based on how the project is going. If the project is going well, I'll go back and dutifully rewrite the initial code to track variables in a more unique way. If the project is still a bit shaky, I will clear the variable names that I tend to use as temporary variables and keep exploring. Remembering variable names never turns out well for me; I inevitably forget that a variable was defined in a previous section, use it thinking I had redefined it (when I didn't), and wonder at the strange results I get.
Clearing all variables names can be a task, though. The problem comes as, in order to move on to the next stage, I actually wish to keep a few of the variables and get rid of all of the rest. After playing with a few ideas (removing the unwanted variables one at a time, writing out the variables that I wanted to keep, etc.), I decided on the idea of writing a
keep function. The
keep function does just what it says: given a list of variables as arguments, it keeps those variables and removes the rest. For an example, consider the vectors "x" and "y", which are combined to give a matrix "initMod." Using
keep(initMod) keeps the matrix and will eliminate all other objects in the global environment (including "x" and "y"), allowing me to reuse the variable names "x" and "y" as temporary variables again (say, for "modifiedMod").
keep can be found on GitHub, here. Note that the function will self delete if defined in the global environment, so add in a segment to always leave
keep. You could also simply grab the containing personal package (myStuff) off of GitHub using the code below.
if(!require(devtools)) install.packages("devtools", dependencies=TRUE)
Note that when using the function from the package you don't have to worry about it being deleted as it isn't in the global environment.