Stats help would be very much appreciated...

This topic has been archived, and won't accept reply postings.
 Bilberry 19 Feb 2021

I'm way out of my depth on some, superficially quite simple, data analysis.  I'd be happy to ship some beers (or mint tea/charitable giving/whatever) to anyone who can give me some structured help please?

I have some rows of data (about 125).  Each row is a geographic population (like a county or city); collectively it's the population of England.

I have categorised each into one of three sets, depending on the prevailing policy in that area: set A = places with policy A; B = policy B; and C = no policy in place.

I have a rate/100,000 of population/year of an event for each row.

Imagine, for example, house fires: some places provide free smoke alarms, in others you pay, and others are silent on the matter (it's not this, but it's an easier example to explain).

There's variation within each policy set.  I can get the mean (population weighted) and standard deviation.  I can show that A, B and C are all statistically different from each other (at p<0.01 or better).  So policy makes a clear difference.

I also have a weak correlation of the event with deprivation score (R^2 ≈ 0.25).  More deprived areas; more fires.

What I want to work out is *how much* of the variation in the total population is accounted for by policy, how much by deprivation, and how much by "other things".  I have no idea how to do this.

I'm kinda daydreaming that someone will say "send me the Excel (125 rows; population, category, rate) and some beer and I can tell you"....or even just pointers about what tools to try.  Thanks for your help!

 wintertree 19 Feb 2021
In reply to Bilberry:

> What I want to work out is *how much* of the variation in the total population is accounted for by policy, how much by deprivation , and how much by "other things".  I have no idea how to do this.

Can the "policy" dimension fairly be given a monotonic numeric score? If so, if you remove the sets and instead assign each row policy and deprivation scores, a "principal component analysis" will look at how policy and deprivation contribute to the variance in your data over your 125 rows.  It might for example say that the policy index contributes 70% of the variance in your data, the deprivation index 20% and other, unidentified sources (which can include measurement noise, however that translates to your measure) the other 10%.
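For a flavour of what that looks like in practice, here's a minimal Python sketch with entirely made-up data (the A=1/B=2/C=3 scoring, the quintiles and every coefficient are invented for illustration, not taken from the OP's table):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 125
policy = rng.integers(1, 4, n)           # hypothetical policy score: 1..3
deprivation = rng.integers(1, 6, n)      # hypothetical deprivation quintile: 1..5
rate = 10 + 5 * policy + 2 * deprivation + rng.normal(0, 2, n)  # fake rate/100k

# Standardise the columns before PCA so each variable contributes on the
# same scale, then look at how variance splits across components.
X = np.column_stack([policy, deprivation, rate])
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA()
pca.fit(X)
print(pca.explained_variance_ratio_)  # share of total variance per component
```

One interpretation wrinkle: each component is a *mixture* of the input variables (see `pca.components_`), so reading off "policy explains X%" takes a bit more work than just looking at the ratios.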

I don't know about Excel, but you'll find a lot of examples in Python and "R" if you google around.  

 Jon Read 19 Feb 2021
In reply to Bilberry:

Sounds like you need a generalised linear regression model, where you can estimate the association of each explanatory variable with your outcome of interest, so adjusting for deprivation to see the effect of policy group. The residuals would be the variation remaining. You may have to make the assumption that each location can be treated as independent (ie the closer places are to each other doesn't affect how similar they are) in the first instance, unless you want to get into geospatial stats! You would want to account for the population size within each location (larger populations may generate more events per time period, all else being equal), so I would be thinking a Poisson regression model with an offset term for population size....

 Jon Read 19 Feb 2021
In reply to Bilberry:

Can I reframe your question? I would be *much* more interested in knowing if places where the policy was to freely install fire alarms had a lower rate of fires (after adjusting for deprivation and population size) than places with no policy or a pay-for-it-yourself policy ... than what % of variation that variable mopped up. Of course, you may not be that interested in the relative effect size of policies .

 wintertree 19 Feb 2021
In reply to Jon Read:

> You may have to make the assumption that each location can be treated as independent (ie the closer places are to each other doesn't affect how similar they are)

Good point.  The OP could do a plot and regression of (distance from location A to location B) vs (measure for A - measure for B) using population normalised measures.  Is there a pattern in that?  
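A quick way to sketch that pairwise check in Python (coordinates and rates invented; real area centroids would go in their place):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 125
lat = rng.uniform(50, 55, n)    # hypothetical area centroids
lon = rng.uniform(-5, 1, n)
rate = rng.normal(40, 8, n)     # hypothetical population-normalised measure

# Every unordered pair of areas: distance vs absolute difference in measure
i, j = np.triu_indices(n, k=1)
dist = np.hypot(lat[i] - lat[j], lon[i] - lon[j])
diff = np.abs(rate[i] - rate[j])

# Least-squares slope: near zero suggests no obvious distance dependence
slope, intercept = np.polyfit(dist, diff, 1)
print(slope)
```

A scatter plot of `dist` against `diff` is worth eyeballing too, since a non-linear spatial pattern wouldn't show up in the slope alone.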

If there is, a crude way of incorporating this (assuming it's the UK) into a PCA, and I assume a Poisson regression, is to use the latitude in particular and the longitude as variables going into the analysis.  This isn't a test for correlation with distance between areas, but given the geographical relationships, it's not unrelated.  

Post edited at 16:29
 Jon Read 19 Feb 2021
In reply to wintertree:

Or map the residuals?

If you really are using deprivation at an LTLA or city level, there may not be that much variation between locations at that aggregated scale, and the aggregation is possibly causing all sorts of ecological fallacy issues (apologies, ecologists!). Just something to be wary of, rather than a deal-breaker.

 wintertree 19 Feb 2021
In reply to Jon Read:

> Or map the residuals?

I like it.  

 Bilberry 19 Feb 2021
In reply to wintertree:

Thanks for this; starting to go over my head!

> Can the "policy" dimension fairly be given a monotonic numeric score?

Yes - A, B and C are fixed types.  They're not quantifiable and there are variations in policy wording, but they fall into those categories.  Is that what you mean?

> If so, if you remove the sets and instead assign each row policy and deprivation scores, a "principal component analysis" will look at how the data and the policy contribute to the variance in your data over your 125 rows.    It might for example say that the policy index contributes 70% of the variance in your data, the deprivation index 20% and other, unidentified sources (which can include measurement noise, however that translates to your measure) the other 10%.

That's the output I want, but the deprivation scores are essentially quintiles of deprivation and the sets are type groupings. So I could do 1-3-5 and 1-2-3-4-5

> I don't know about Excel, but you'll find a lot of examples in Python and "R" if you google around.  

 Bilberry 19 Feb 2021
In reply to Jon Read:

> Sounds like you need a generalised linear regression model, where you can estimate the association of each explanatory variable with your outcome of interest, so adjusting for deprivation to see the effect of policy group. The residuals would be the variation remaining. You may have to make the assumption that each location can be treated as independent (ie the closer places are to each other doesn't affect how similar they are) in the first instance, unless you want to get into geospatial stats! You would want to account for the population size within each location (larger populations may generate more events per time period, all else being equal), so I would be thinking a Poisson regression model with an offset term for population size....


Thanks Jon.  Afraid you are losing me.  The populations are independent - the policy applies to the residents; no proximity impact.  There's no scale impact on frequency.

I can cut/paste the table here if folk want!

 Bilberry 19 Feb 2021
In reply to Jon Read:

> Can I reframe your question? I would be *much* more interested in knowing if places where the policy was to freely install fire alarms had a lower rate of fires (after adjusting for deprivation and population size) than places with no policy or a pay-for-it-yourself policy

I already have this from ANOVA

> ... than what % of variation that variable mopped up. Of course, you may not be that interested in the relative effect size of policies .

This bit is what I need!

 Bilberry 19 Feb 2021
In reply to Jon Read:

> Or map the residuals?

> If you really are using deprivation at a LTLA or city level, there may be not that much variation between locations at that aggregated scale, and the aggregation is possibly causing all sorts of ecological fallacy issues (apologies ecologists!). just something to be wary of, rather than a deal-breaker.


I've got a deprivation score for each area.

What does mapping residuals involve?  I'm out of my area by 1000 miles here!

 rif 19 Feb 2021
In reply to Bilberry:

Sounds like it could be done by multiple regression using dummy variables to code for policy. Dependent variable is rate/100k, and the three predictors are deprivation rate, dummy variable A, and dummy variable B. Dummy variable A is set to 1 for policy A, else 0; dummy variable B = 1 for B, else 0 (so both zero implies policy C). The p values for the predictors tell you whether each has a significant effect with the others held constant, and the coefficient values for the two dummy variables tell you how much higher/lower the rate is compared to policy C at the same deprivation level. The Rsq for this three-predictor regression tells you the overall % variance explained.

Then leave deprivation rate out; Rsq will be lower, and the difference is its contribution after allowing for policy.
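A minimal sketch of that dummy-variable regression and the R² comparison in Python, using `statsmodels` with invented data (policy C as the reference level, as described; all effect sizes made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 125
df = pd.DataFrame({
    "policy": rng.choice(["A", "B", "C"], n),
    "deprivation": rng.integers(1, 6, n),
})
# Invented ground truth: A and B reduce the rate relative to C
effect = df["policy"].map({"A": -8.0, "B": -4.0, "C": 0.0})
df["rate"] = 40 + effect + 2.5 * df["deprivation"] + rng.normal(0, 3, n)

# Full model: C(policy, Treatment('C')) expands into exactly the two dummy
# variables described above, with policy C absorbed into the intercept.
full = smf.ols("rate ~ C(policy, Treatment(reference='C')) + deprivation",
               data=df).fit()
# Reduced model with deprivation left out
reduced = smf.ols("rate ~ C(policy, Treatment(reference='C'))", data=df).fit()

print(full.params)      # dummy coefficients = shift in rate vs policy C
print(full.rsquared)    # overall % variance explained
print(full.rsquared - reduced.rsquared)  # deprivation's extra contribution
```

The same trick the other way round (dropping the policy dummies instead) gives policy's contribution after allowing for deprivation.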

Of course this ignores any geographical effects, which as others have said could be investigated by inspecting the residuals.

Caveat: you ought really to do checks on linearity, heteroscedasticity, etc. before trusting the results of a regression analysis.

Rob F

 wintertree 19 Feb 2021
In reply to Bilberry:

> Yes - A B and C are fixed types.  They're not quantifiable and there are variations in policy wording, but they fall in those categories.  Is that what you mean? [...] the deprivation scores are essentially quintiles of deprivation and the sets are type groupings. So I could do 1-3-5 and 1-2-3-4-5

Exactly.  No need to go for 1-3-5 rather than 1-2-3 - but the key thing is if you can stand by that assignment.  You'd need to do this for the PCA and I assume you'd need to do it for the multivariate regression approaches as well.

> What does mapping residuals involve?  I'm out of my area by 1000 miles here!

The PCA and I assume (don't know) the multivariate regression approaches give you the quantity of variance in the data that is not explained by your variables.  You then take this unexplained variance for each area and use it to colour a heat map of the areas, and see what pops out at you when you look at it.  It's a qualitative analysis, but a powerful one, and one that might clue you up about what questions to ask.  Here's an example of some data being turned into a heat map (it's not residual variance, though)...

https://www.ukclimbing.com/forums/off_belay/friday_night_covid_plotting_5-729286?v=1#x9362864
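A bare-bones sketch of the idea (coordinates, deprivation and rates all invented; a proper version would colour actual area boundaries rather than points):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n = 125
lat = rng.uniform(50, 55, n)       # hypothetical area centroids
lon = rng.uniform(-5, 1, n)
deprivation = rng.integers(1, 6, n)
rate = 40 + 2.5 * deprivation + rng.normal(0, 3, n)  # fake rate/100k

# Fit a simple model, then keep what it fails to explain
coef = np.polyfit(deprivation, rate, 1)
residuals = rate - np.polyval(coef, deprivation)

# "Map" the residuals: one point per area, coloured by residual size/sign
plt.scatter(lon, lat, c=residuals, cmap="coolwarm")
plt.colorbar(label="residual (per 100k)")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.savefig("residual_map.png")
```

Clusters of same-coloured points would hint at a geographic effect the model isn't capturing.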

In reply to rif:

> Sounds like it could be done by multiple regression using dummy variables to code for policy. Dependent variable is rate/100k, and the three predictors are deprivation rate, dummy variable A, and dummy variable B. Dummy variable A is set to 1 for policy A, else 0; dummy variable B = 1 for B, else 0 (so both zero implies policy C)

I prefer that to assigning a numerical score to ABC as I suggested.  Your approach is more defensible although interpreting the results might need a bit more thought?

Post edited at 17:29
 duncan b 19 Feb 2021
In reply to rif:

> Sounds like it could be done by multiple regression using dummy variables to code for policy. Dependent variable is rate/100k, and the three predictors are deprivation rate, dummy variable A, and dummy variable B. Dummy variable A is set to 1 for policy A, else 0; dummy variable B = 1 for B, else 0 (so both zero implies policy C). The p values for the predictors tell you whether each has a significant effect with the others held constant, and the coefficient values for the two dummy varables tell you how much higher/lower the rate is compared to policy C at the same deprivation level. The Rsq for this three-predictor regression tells you the overall % variance explained.

> Then leave deprivation rate out; Rsq will be lower, and the difference is its contribution after allowing for policy.

This is broadly how I'd tackle the problem. However, I think you'll need to carefully account for the correlation between the independent variables. I think, by construction, the dummy variables A and B will have non-zero correlation. This StackExchange post discusses the issue: https://stats.stackexchange.com/questions/79399/calculate-variance-explained-by-each-predictor-in-multiple-regression-using-r

