So I think I’m going to move into more of an ad hoc format for these blogs. Much easier to manage betwixt classwork and other projects.

Anyway, the subject of this post is on various sorts of heteroskedasticity robust standard errors in R. In Stata, this is trivially easy: you simply add “, robust” to the end of your regression command (or , cluster(var) for cluster robust SEs). In R, there’s a bit more flexibility, but this comes at the cost of a little added complication. Simplest first.
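Here is a minimal, self-contained sketch of that simplest approach. The data are simulated, and the names model and vc are just for illustration:

```r
# Illustrative example: simulated data, fitted model saved as "model"
library(sandwich)  # provides vcovHC() for robust variance estimation
library(lmtest)    # provides coeftest() for retesting coefficients

set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, sd = abs(x))  # error variance scales with x: heteroskedastic
model <- lm(y ~ x)

vc <- vcovHC(model)  # heteroskedasticity-robust variance-covariance matrix
coeftest(model, vc)  # coefficient table computed with the robust SEs
```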


This code generates the default heteroskedastic robust standard errors in R. The sandwich package is necessary for re-estimating the variance–covariance matrix, and the lmtest package for easily performing tests on the coefficients using the corrected variance estimate. If you don’t need to retest the model’s output, you might not need lmtest. First, I should lay out the appropriate matrix notation for the asymptotic distribution of our estimator. In particular, the variance may be written as follows:

$$\sqrt{N}(\hat{\beta} - \beta) \xrightarrow{d} N\!\left(0,\; E[x_i x_i']^{-1}\, E[u_i^2\, x_i x_i']\, E[x_i x_i']^{-1}\right)$$

With fixed regressors, we can rewrite this more simply as:

$$\operatorname{Var}(\hat{\beta} \mid X) = (X'X)^{-1}\, X' \Omega X\, (X'X)^{-1}, \qquad \Omega = E[uu' \mid X]$$

where the bit of interest is $\Omega$. If we can get away with assuming homoskedasticity, it’s trivial to see that $\Omega = \sigma^2 I$, so the whole expression collapses to $\sigma^2 (X'X)^{-1}$. When we can’t assume homoskedasticity, we can’t simplify so easily, and we must come up with a smart way to estimate $X' \Omega X$. Which is exactly what heteroskedastic robust SEs try to do!

By default, vcovHC will return HC0 standard errors. That is,

$$\widehat{\operatorname{Var}}_{HC0}(\hat{\beta}) = (X'X)^{-1} \left( \sum_{i=1}^{N} \hat{u}_i^2\, x_i x_i' \right) (X'X)^{-1}$$

This may cause some confusion, though, especially if you are surrounded by people using Stata. This is not the same robustness procedure Stata’s robust option uses. Instead, Stata uses HC1 robust SEs, which include a degrees-of-freedom correction:

$$\widehat{\operatorname{Var}}_{HC1}(\hat{\beta}) = \frac{N}{N-K}\,(X'X)^{-1} \left( \sum_{i=1}^{N} \hat{u}_i^2\, x_i x_i' \right) (X'X)^{-1}$$
Getting this output in R is simple: add the option “type” to vcovHC with the argument “HC1”. There are several other ways to estimate robust SEs which I won’t address right now.
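Reusing a simulated model (the data and the name model are illustrative, as before), the following sketch reproduces Stata-style robust SEs:

```r
# Simulated data; "model" is an illustrative name
library(sandwich)
library(lmtest)

set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, sd = abs(x))  # heteroskedastic errors
model <- lm(y ~ x)

# type = "HC1" applies the N/(N - K) correction, matching Stata's ", robust"
coeftest(model, vcovHC(model, type = "HC1"))
```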

The main point of this post was actually to get at cluster robust SEs, however. This is when you believe that errors are correlated within certain groups, but not between groups. An example would be in panel data. You would likely expect errors to be correlated for a given state (and you would probably also use fixed or random effects in this situation. Fixed effects and cluster robust SEs go together like milk and Oreos). Another example would be if you are trying to estimate a treatment effect, but you weren’t able to randomize treatment at the individual level. Perhaps you were only able to randomize over certain groups (perhaps by neighborhood, county, or family).

In this case, we want to correct for the fact that errors within certain groups are correlated. We do this using the following estimator:

$$\widehat{\operatorname{Var}}(\hat{\beta}) = (X'X)^{-1} \left( \sum_{h=1}^{H} X_h' \hat{u}_h \hat{u}_h' X_h \right) (X'X)^{-1}$$

where $h$ indexes the clusters/groups, and $H$ is the number of clusters. Stata also applies a degrees-of-freedom correction, however, so it uses the estimator:

$$\widehat{\operatorname{Var}}(\hat{\beta}) = \frac{H}{H-1}\,\frac{N-1}{N-K}\,(X'X)^{-1} \left( \sum_{h=1}^{H} X_h' \hat{u}_h \hat{u}_h' X_h \right) (X'X)^{-1}$$
But we don’t particularly care about how Stata does things, since we want to know how to do it in R. Mahmood Arai provides a function for doing so, which I have modified along the lines of J. Harden (although his version does not work directly as written). My version is as below (I’ll call it robust.se here; it uses the same degrees-of-freedom calculation as does Stata):

robust.se <- function(model, cluster){
 M <- length(unique(cluster))  # number of clusters
 N <- length(cluster)          # number of observations
 K <- model$rank               # number of estimated parameters
 dfc <- (M/(M - 1)) * ((N - 1)/(N - K))  # Stata's degrees-of-freedom correction
 uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum))  # sum the score contributions within each cluster
 rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
 rcse.se <- coeftest(model, rcse.cov)
 return(list(rcse.cov, rcse.se))
}

To execute this function requires the existence (again) of both the sandwich and lmtest packages, available on CRAN. After running a model, simply call the function with the saved model output and the variable on which you wish to cluster, as below (the names model and data$group are illustrative).

#To save only the variance-covariance matrix
vcov.cl <- robust.se(model, data$group)[[1]]

#To save the coeftest object,
# which is a table containing the results
# of a test of the coefficients for significance
# using the corrected SEs
coefs.cl <- robust.se(model, data$group)[[2]]

I personally like attaching the variance–covariance matrix to the model object in, for instance, the element $se. This allows for easy interoperability with the apsrtable package.

A word of warning on cluster robust standard error calculation, however: it is necessary that you be careful in crafting the cluster variable. The factor variable passed in should have levels which are the same as the names of the associated dummy variables in your data frame. For instance, I recently was working with a dataset which required state-level clustered SEs (but did not use state-level fixed effects). I had first added the state-level dummy variables using the dummies package, generating variables named in the format of “state.washington”, “state.georgia”, and so forth. My factor variable on which I wanted to cluster had levels named “washington”, “georgia”, and so on. This will return an error, as the estimation routine cannot find the appropriate dummies in the data frame (and the error isn’t particularly clear as to the specific problem). There are two ways to fix this issue directly (and possibly an easier fix of which I am unaware): you can rename all of the variable names in your data frame to match the factor levels, or you can just create a new variable with the appropriate variable names as factors. I chose the latter. This may easily be accomplished with the following code:

#Create the new variable with appropriate level names
# (the data frame and variable names here are illustrative)
data$state.new <- as.factor(paste("state.", data$state, sep = ""))

#Save the coefficient test output to an element in the model object
model$coeftest <- robust.se(model, data$state.new)[[2]]

This was something that confused me for a while, but once you realize how the functions operate under the hood, it’s a no-brainer. I did want to get this fix up on the web, though, as it stumped me for a while, and I’d imagine other people have been in a similar situation. To be honest, I was nearly to the point of writing an entirely new estimation routine from the ground up, just so I knew exactly what was going on beneath the layers and layers of function calls.