Sampling weights are used to correct for the over-representation or under-representation of key groups in a survey. For example, if 51% of a population are female, but a sample is only 40% female, then weighting is used to correct for this imbalance. Most statistical software gives incorrect statistical testing results when used with sampling weights. This blog post first shows that standard tests in both R and SPSS are incorrect. Then, it introduces a solution (Taylor Series Linearization). Finally, the weaknesses of some common hacks are reviewed - using the unweighted sample size, testing using unweighted data, scaling weights to have an average of 1, and using the effective sample size.
A simple case study is used to illustrate the issue. It contains two categorical variables, one measuring favorite cola brand (Pepsi Max vs Other), and the other measuring gender. The data set contains five weights, each of which is used to illustrate a different aspect of the problem. The data is below. All the calculations illustrated in this post are in this Displayr document.
Consider first an analysis that doesn't use any of the weight variables (an unweighted analysis). The output below shows a crosstab created in SPSS. A few key points to note:
If you studied statistics, there's a good chance you were taught to do a Z-test to compare the two proportions. The calculations below perform this test. They are written in the R language, but if you take the time you will find it easy to follow the calculations. The computation of the statistic is different from the chi-square test statistic but the tests are equivalent in this unweighted case with identical p-values.
The output below shows the SPSS results with a weight ( wgt1 ) applied. These results are incorrect for two reasons. The first reason is that, by default, SPSS has rounded the cell counts, and used these rounded cell counts in the chi-square test and the percentages, meaning that every single number in the analysis is, to a small extent, wrong. (In fairness to SPSS, this routine was presumably written in the mid-1960s, and this type of hack may have been necessary for computational reasons back then.)
The output below uses the un-rounded results. The first thing to note is that that the p-value is much lower than with the unweighted result (.016 versus .08). If testing at the .05 level, you would conclude the difference is significant. So, you may end up with a completely different conclusion. Which, if either, is correct? In this case, the unweighted result is correct. Let me explain:
The example above can't be reproduced in R, because its standard tools for comparing proportions don't even allow you to use weights of any kind. However, we can illustrate that the problem exists in routines in R that do support weights, such as regression.
It's possible to redo the statistical test above using a logistic regression. This is illustrated below. The yellow highlighted number of 0.0887 is the p-value of interest. It's basically the same as we got with the standard chi-square test used at the beginning of the post (it had a p-value of .083). This statistical test in the logistic regression is assessing something technically slightly different and it's typical that tests that present slightly different technical answers get slightly different results.
R's logistic regression does allow us to provide a weight. The output is shown below. Note that once again the p-value has gotten very small. The reason is that the weighted regression is, in its internals, making exactly the same mistake we saw with SPSS's chi-square test: it's assuming that the weighted sample size is the same thing as the actual sample size. Note that it basically gets the same result as with the chi-square test as well. So, we get the wrong answer by supplying a weight.
In the world of statistical theory, this is a solved problem. All of the techniques above correctly compute the percentages with the weights. More generally, all the parameters calculated using traditional statistical techniques, be they percentages, regression coefficients, cluster means, or pretty much anything else are typically computed correctly with weights. What goes wrong relates to the calculation of the standard errors. Below I reproduce the proportions test code shown earlier in the post. When the data is weighted, the lines of the code that are wrong are on lines 5-6. You can see based on first principles that it must be wrong, as it has no place to insert the weights. A correct solution is to rewrite lines 5-6, using a technique known as Taylor Series Linearization to calculate the standard error (se) in a way that addresses the weights appropriately. This is, sadly, not a trivial thing to do. I will disappoint you and not provide the math describing how it works, as the math will take a page, and, unless you have good math skills it's not a page that will help you at all. Last time I checked there were no good web pages about this topic. If you want to dig into how the math works, as well explore alternatives to Taylor Series Linearization, please read Chapter 9 of Sharon Lohr's (2019): Sampling: Design and Analysis. But, warning, it's hard going.
So, how do we do the math properly? The secret is to use software designed for the problem. This means:
For example, the table below on the left shows the data in a Displayr crosstab that is unweighted. This gives the same result as we saw for the unweighted chi-square test in SPSS. The table on the right is using Taylor Series Linearization as a part of a test called the Second-Order Rao-Scott Chi-Square Test. It basically gets the same answer. That is, the Taylor Series Linearization gets the correct answer. You don't need to do anything special in Displayr and Q to have the Taylor Series Linearization applied; when you apply the weight it does this automatically.
Similarly, here's the same calculation done using R:
And, if we use the appropriate logistic regression, designed for survey weights, it also gives the same conclusion (again, it is using slightly different assumptions, so the result is slightly different):
Most experienced survey researchers are familiar with these problems, and they typically employ one of four hacks as an attempt to fix the problem. These hacks are often better than ignoring the problem, but they are all inaccurate. In each case, they are attempts to limit the size of the mistakes that are made with using software not designed for sampling weights. None of the hacks solves the real problem, which is to validly compute the standard error(s).
In the example above, we saw that the result was wrong because the weighted sample was used in the analysis, and this was clearly wrong. Further, in this example, we can compute the correct result using the weighted proportions and the unweighted sample size. However, this doesn't usually work. It only works in the example above because we were comparing percentages between the genders, and gender was the variable that was used in the weighting. It is easy to construct examples where using the unweighted sample size fails.
To appreciate the problem I re-run the same example, but using a different weight, wgt2 , which weights based on the row variable in the table. You will recall before that we have been comparing 10.4% with 19.1%. With the new weight, the percentages have changed dramatically, and we now have preferences for Pepsi Max of 44.8% and 62.6%. Note that the p-value in this example remains at .086.
However, when we compare the weighted proportions, but use the unweighted sample size, we get a very different result. Again, this incorrect analysis suggests a highly significant relationship, which is wrong.
If you have been carefully reading through the above analyses, you will have realized that in each of the examples, the correct answer was obtained by performing the test using the unweighted data. However, this isn't generally a smart thing to do. The only reason that it works in the examples above is because we are testing two proportions and the weights are perfectly correlated with one of those variables in each of the tests above. It is easy to construct an example where using the unweighted data for testing gives the wrong answer. In the table below, after the weight has been applied ( wgt3 ), there is no difference between the genders. As you would expect, the p-value, computed using the Taylor Series Linearization, is 1. However, the unweighted data still would show a p-value of .086, which is not sensible given there is no difference between the proportions.
When using statistical routines not designed for sampling weights, it is easy to get obviously wrong results in situations where the weights are large or small. In academic and government statistics it is common to create a weight that sums up to 1. When this is done, if using software that treats the weighted sample size as the actual sample size, nothing is ever shown as statistically significant. It is also common to creates weights that sum to the population size, as this means that all analyses are extrapolated to the population. This has the effect of making all analyses statistically significant. A common hack for both of these problems is to scale the weight to have an average value of 1. This is certainly better than using the unscaled weights, but it doesn't fix the problem. This has already been shown above, as all the calculations above have been performed using a weight with an average value of 1, and, as shown, the p-value was not computed correctly.
In addition to correctly computing the p-values, the useful thing about using routines designed for sampling weights, is that you don't need to rescale the weights at all. This means that you can use weights that gross up to the population size. In the table below, for example, which is from Displayr, the weighted sample size of each gender is 100,000,000, and the p-value is unchanged.
The last of the hacks is to use Kish's effective sample size approximation when computing statistical significance. This is typically implemented in one of three ways:
This hack is widespread in software. It is used in all stat tests in most of the older survey software packages (e.g., Quantum, Survey Reporter). We use it in some places in Q and Displayr, where the math for Taylor Series Linearization is intractable, inappropriate, or is yet to be derived. For example, there is no nice solution to dealing with weights when computing correlations, so we rescale the weights to the effective sample size.
This approach is typically better than any of the other hacks. But, it is still a hack and will almost always get a different (and thus wrong) answer to that obtained using Taylor Series Linearization. For example, below I've used the effective sample size for the first weight, and same (incorrect) result as obtained when using the unweighted sample size (they will not always coincide).
To summarize: if you have sampling weights you are best advised to use statistical routines specifically designed for sampling weights. Using software not designed for sampling weights leads to the wrong result, as do all the hacks.
All the non-SPSS calculations illustrated in this post are in this Displayr document.