Monthly Archives: March 2016

Correctly Reporting P-Values in Summary Tables Reported with xtable


Often when writing a manuscript using knitr and xtable, I am flustered by my p-values. In simple summary tables, R conveniently rounds my p-values to 0: mathematically inappropriate, since a p-value is never exactly zero. A colleague recently commented on the poor reporting in my table (shown below using print.xtable with the type="html" argument), inspiring a much needed change.

              Estimate   Std.err           Wald  Pr(>|W|)
(Intercept)   0.001704  0.000005  100409.770956  0.000000
sizemedium    0.000046  0.000005      90.534705  0.000000
sizesmall     0.000003  0.000005       0.294331  0.587458
time         -0.000004  0.000001      11.614917  0.000654
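For context, a table like the one above typically comes from handing a model's coefficient matrix straight to xtable. A minimal sketch of that pattern (the fitted model `fit` is hypothetical, since the original model isn't shown in the post):

library(xtable)
# 'fit' is a hypothetical fitted model; its p-values are far smaller than 1e-6
print(xtable(coef(summary(fit)), digits = 6), type = "html")
# very small p-values get rounded and displayed as 0.000000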


The fix is actually fairly straightforward and can be summarized in a simple function, "fixp", with the code shown below:

fixp <- function(x, dig=3){
  x <- as.data.frame(x)
  
  # Warn if the last column doesn't look like a p-value column
  if(substr(names(x)[ncol(x)],1,2) != "Pr")
    warning("The name of the last column didn't start with Pr. This may indicate that p-values weren't in the last column, and thus, that this function is inappropriate.")
  
  # Round the p-values to the requested number of digits
  x[,ncol(x)] <- round(x[,ncol(x)], dig)
  
  # Replace anything that rounded to 0 with "<" the smallest reportable value
  for(i in 1:nrow(x)){
    if(x[i,ncol(x)] == 0)
      x[i,ncol(x)] <- paste0("< .", paste0(rep(0,dig-1), collapse=""), "1")
  }
  
  x
}

All that's going on: the function pulls in the summary table (usually through a $coef or coef(summary(fit))), tries to turn it into a data frame (some already are, though some tables, e.g. from lm, are numeric matrices), throws a warning if the last heading doesn't begin with "Pr" (as it may not be the column that contains p-values), rounds the last column to the requested number of digits, and edits any values that rounded to 0 to read "<" the smallest value reportable at that precision (e.g. "< .001" when dig = 3). Then we output the edited table, all ready for reporting! To mimic what was above, we set our digits equal to 6 (so the p-values go out six decimal places) and re-run:
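Concretely, the re-run amounts to a call like the following (a minimal sketch; `fit` stands in for the original model object, which isn't shown in the post):

# 'fit' is a hypothetical stand-in for the model behind the table above
fixp(coef(summary(fit)), dig = 6)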

              Estimate   Std.err           Wald   Pr(>|W|)
(Intercept)   0.001704  0.000005  100409.770956  < .000001
sizemedium    0.000046  0.000005      90.534705  < .000001
sizesmall     0.000003  0.000005       0.294331   0.587458
time         -0.000004  0.000001      11.614917   0.000654


Much better! Also, to report a two-digit p-value (as some writing styles require), we simply set dig = 2:

              Estimate   Std.err           Wald  Pr(>|W|)
(Intercept)   0.001704  0.000005  100409.770956     < .01
sizemedium    0.000046  0.000005      90.534705     < .01
sizesmall     0.000003  0.000005       0.294331      0.59
time         -0.000004  0.000001      11.614917     < .01


By design, the p-values can be manipulated independently of the estimates. This allows the estimated coefficients to be reported in meaningful units (in the above example, very small units) while the p-values are reported on the scale that many writing styles request.

Want to try this yourself? Here's an example that you can run with just a built-in dataset in R:

#this gives a summary table with small p-values. Trying to report this with xtable would result in an R rounding issue!
(mod <- coef(summary(lm(uptake ~ conc + Treatment + Type + Plant, data=CO2))))

#this fixes the p-value to 2 digits, correctly reporting p-values that would have been rounded to 0
fixp(mod,dig=2)
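To render the fixed table as HTML like the output below, it can be passed to xtable and printed; a minimal sketch (assuming the xtable package is installed; this exact call isn't spelled out in the original post):

library(xtable)
# digits controls the estimates; dig inside fixp already handled the p-values
print(xtable(fixp(mod, dig = 2), digits = 2), type = "html")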

Here's the final output via print.xtable (dig=2 for fixp and xtable):

                  Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          37.42        4.67     8.00     < .01
conc                  0.02        0.00     7.96     < .01
Treatmentchilled    -12.50        5.10    -2.45      0.02
TypeMississippi     -23.33        6.01    -3.88     < .01
Plant.L              21.58       11.14     1.94      0.06
Plant.Q              -4.62        2.27    -2.03      0.05
Plant.C               1.46        5.10     0.29      0.78
Plant^4               2.34        2.27     1.03      0.31
Plant^5              -0.48        5.77    -0.08      0.93
Plant^6              -0.04        2.27    -0.02      0.99
Plant^7              -1.91        3.64    -0.53      0.60
Plant^8              -3.28        2.27    -1.44      0.15
Plant^10              0.55        2.27     0.24      0.81


Limitations (ish):

  1. Again, this assumes that the last column is the one to be transformed. This is by design, though it may be inconvenient in some situations. If needed, the change is easily made in the definition of the function.
  2. When any entry in the last column is replaced, the whole column becomes a character column in the data frame; if every entry survives rounding without hitting 0, the column stays numeric.
  3. This assumes a data-frame-style format for your table. Thus, this method will NOT correct the reported p-value for an individual test: say a t-test, where only the statistic is reported (and not a table). Personally this is not a concern, as I deal with those situations in other ways, but for users seeking an overall "p-value fixing" method this may not be the answer (though see the note below).
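For the single-test case mentioned in point 3, one option worth knowing about (not part of the original post) is base R's format.pval(), which performs a similar "< eps" substitution on individual p-values:

# format.pval() ships with base R; p-values below 'eps' are reported as "< eps"
format.pval(c(0.0000004, 0.587458), digits = 2, eps = 0.001)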

As with other functions I write posts on, this function is available in my package (creatively named "myStuff") via GitHub. If you'd like to play with the most current version of the function, I'd encourage you to check it out here. Alternatively, to get access to other fun functions, install the package directly from GitHub with the code below (requires devtools):

devtools::install_github("flor3652/myStuff")


Filed under R

Quick Bar Graph in Sql Server


Have you ever needed a bar graph and didn't want to leave SQL Server? Well, if you don't have SQL Server 2016 yet, I have some code for you. The code below creates a table with the numbers 1-10, samples repeatedly (with replacement) from that table, and makes a text histogram of the sample. We should see something approaching a uniform distribution when we are done.


If Object_id('tempdb..#Test','U') is not null 
Drop Table #Test
Create Table #Test ( numbers float )

If Object_id('tempdb..#RandomNumbers','U') is not null 
Drop Table #RandomNumbers
Create Table #RandomNumbers ( numbers float )


-- Create Table with numbers 1 -10
Declare @Counter int
Set @Counter=1
While @Counter <=10
Begin
    Insert Into #Test Select @Counter
    Set @Counter=@Counter+1
End

-- Repeated Sample with replacement from table
Declare @Counter2 int
Set @Counter2=1
While @Counter2 <=1000
Begin
    Insert Into #RandomNumbers Select Top 1 Numbers From #test Order By newid()
    Set @Counter2=@Counter2+1
End

-- Make histogram
Select
    numbers,
    replicate('>',count(*)) [Levels]
From
    #RandomNumbers
Group By
    numbers
Order By
    numbers

You should now have a nice little histogram showing you the distribution of your data. Another useful thing is to approach it like a Pareto chart and order the bars by count, descending:

-- Pareto-style: most frequent values first
Select
    numbers,
    replicate('>',count(*))[Levels]
From
    #RandomNumbers
Group By
    numbers
Order By
    count(*) desc

[Figure: sqlhist, sample output of the histogram query]


Filed under Sql Server

Vectorize a Function


I was recently working on an analysis and decided to write a function to assist in the process. It assigns a label to a number based on its value. My first attempt worked, but only for one value at a time.

KMO_adequacy2 <- function(x){
  if (x > .90)            {result <- "Marvelous"}
  if (x <= .90 & x > .80) {result <- "Meritorious"}
  if (x <= .80 & x > .70) {result <- "Middling"}
  if (x <= .70 & x > .60) {result <- "Mediocre"}
  if (x <= .60 & x > .50) {result <- "Miserable"}
  if (x <= .50)           {result <- "Unacceptable"}
  return(result)
}

This works.

 KMO_adequacy2(.95)
#[1] "Marvelous"

This does not work.

 KMO_adequacy2(c(.95,.50))
#[1] "Marvelous"
# Warning messages:
# 1: In if (x > 0.9) { :
#   the condition has length > 1 and only the first element will be used
# 2: In if (x <= 0.9 & x > 0.8) { :
#   the condition has length > 1 and only the first element will be used
# 3: In if (x <= 0.8 & x > 0.7) { :
#   the condition has length > 1 and only the first element will be used
# 4: In if (x <= 0.7 & x > 0.6) { :
#   the condition has length > 1 and only the first element will be used
# 5: In if (x <= 0.6 & x > 0.5) { :
#   the condition has length > 1 and only the first element will be used
# 6: In if (x <= 0.5) { :
#   the condition has length > 1 and only the first element will be used

I should have thought more about the end goal of the function before I started coding, but I didn't. I started searching for good ways to vectorize a function in R and found that there is a function in base R called 'Vectorize'. All I needed was to create a new function using 'Vectorize', and I was done.

KMO_adequacy <- Vectorize(KMO_adequacy2) 
KMO_adequacy(c(.95,.50))
#[1] "Marvelous"    "Unacceptable"

This allowed me to easily add a column of labels to the data frame containing my individual KMO measures. I reached out to my fellow blogger Jeremy, and he gave me a quick rewrite of my original function. Here is his approach.

# Jeremy's re-write
KMO_J <- function(x){
   names(x) <- x
   names(x) <- ifelse(x > .90, "Marvelous", names(x))
   names(x) <- ifelse(x <=.9 & x >.80, "Meritorious", names(x))
   names(x) <- ifelse(x <=.8 & x >.70, "Middling", names(x))
   names(x) <- ifelse(x <=.7 & x>.60, "Mediocre", names(x))
   names(x) <- ifelse(x <=.6 & x>.50, "Miserable", names(x))
   names(x) <- ifelse(x <=.5, "Unacceptable", names(x))
   return(names(x))
} 

KMO_J(c(.95,.50))
#[1] "Marvelous"    "Unacceptable"

We can do a quick check to make sure that we are getting the same output.

test <- seq(0,1,.1)
all.equal ( KMO_adequacy(test) , KMO_J(test) )
#[1] TRUE
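Back to the original use case of adding a label column to a data frame of KMO values. This snippet isn't from the post; the data frame and column names are hypothetical:

# 'efa_dat' is a hypothetical data frame with a numeric column of KMO/MSA values
efa_dat <- data.frame(item = c("q1", "q2", "q3"), kmo = c(.95, .72, .48))
efa_dat$kmo_label <- KMO_adequacy(efa_dat$kmo)
# kmo_label is now "Marvelous", "Middling", "Unacceptable"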

My next question: is there a major performance difference between the two? I ran a short simulation, summarized in the plot below, which shows that there is not a large difference in performance for the samples tested.
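The simulation code wasn't included in the post, but a rough sketch of that kind of comparison, using only base R, might look like this:

# Time both versions on the same random inputs (results will vary by machine)
test_vals <- runif(10000)
system.time(replicate(100, KMO_adequacy(test_vals)))
system.time(replicate(100, KMO_J(test_vals)))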

[Plot: Rplot, timing comparison of KMO_adequacy and KMO_J]

Have a better way to solve this problem? Post it in the comments below. If you are wondering what this KMO thing is all about, it is a measure of sampling adequacy (MSA) used when conducting exploratory factor analysis (EFA). The cutoffs and names were taken from:

  1. Cerny, B. A., & Kaiser, H. F. (1977). A study of a measure of sampling adequacy for factor-analytic correlation matrices. Multivariate Behavioral Research, 12(1).


Filed under R_local, Statistics