UKC

Maximum entropy probability distribution

 ablackett 10 Feb 2022

I was teaching Y13 A Level Maths Normal Distribution, when the resource I was using from 

https://www.drfrostmaths.com/resource.php?rid=337

said that "The distribution, with a given mean 𝜇 and given standard deviation 𝜎, that ‘assumes the least’ (i.e. has the maximum possible ‘entropy’) is… the Normal Distribution!"

Now, I've got a maths degree from 20 years ago and some understanding of the 2nd law of thermodynamics: that entropy, 'the degree of disorder', will always increase.  I'm really struggling to put the above statement together with my (rather patchy) understanding of entropy and work out what it is saying.

Can anyone convince me the above statement is true in such a way that I can explain it to my students?

 Xharlie 10 Feb 2022
In reply to ablackett:

I don't have time to do the research but you might find a satisfactory explanation to give your students if you look to the field of Information Theory, not Thermodynamics. Thermodynamic Entropy and Information Theory Entropy *are* related but, when one is talking about distributions, it is probably the Information Theory concept that you want.

 wintertree 10 Feb 2022
In reply to ablackett:

I watch the thread with interest.   Both to get a better explanation than mine and hopefully to get mine marked for correctness by someone with a better understanding...

My understanding is that this is an information-theory generalisation of the concept of entropy from a discrete, "real" or "physical" system to a continuous distribution.  It is not entropy as we know it.

I would say that in this context, "maximum entropy" is a property which means that a discrete set of y-axis values drawn from the normal distribution using random x-axis positions will be maximally random compared to any other distribution with the same number of orthogonal parameters.    (Key point: the entropy is only comparable between distributions with the same number of independent parameters)

Consider instead a step-function distribution that has a value of 1/(2h) for values within a half-width h either side of the mean µ and 0 elsewhere.  Draw values at random from that and they will be much less random than if drawn from the normal distribution.

For the normal distribution, I can only make an abysmally poor job of predicting the next value from a random draw.  For the step function I can make a pretty good guess, having only two choices...  The higher the entropy of the distribution, the more likely I am to be wrong at predicting the next value.

I suppose the most predictable two parameter distribution is a pair of Dirac-deltas at (µ ± h) although that breaks my random draw test as you'd never hit the precise values with random sampling; shows the limits of my noddy way of explaining it vs the information theory maths. 
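For anyone who wants numbers, the information-theory "differential entropy" has standard closed forms for both shapes, so the comparison can be done directly. A quick sketch in Python (the function names are mine, not from any library):

```python
import math

# Differential entropy (in nats) of a normal distribution with std dev sigma:
# the standard closed form is 0.5 * ln(2 * pi * e * sigma^2).
def normal_entropy(sigma):
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# Differential entropy of the step distribution of half-width h is ln(2h).
# To match sigma = 1 the half-width must be sqrt(3), because a uniform
# distribution on [-h, h] has variance h^2 / 3.
def step_entropy(sigma):
    return math.log(2 * sigma * math.sqrt(3))

print(normal_entropy(1.0))  # about 1.42 nats
print(step_entropy(1.0))    # about 1.24 nats - lower, as expected
```

So with µ and 𝜎 pinned, the normal distribution really does carry more entropy than the step function.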

I could be spouting absolute nonsense here!

Post edited at 17:16
 jonny taylor 10 Feb 2022
In reply to wintertree:

Wintertree, did you look at the slides? There is more to the footnote than ablackett quoted - it links to https://en.wikipedia.org/wiki/Differential_entropy#Maximization_in_the_norm.... That certainly looks like it's backing up what you're saying. I can't claim to have even attempted to follow most of the arguments in that link, but I do get the sense that it's talking about the same concepts you are angling at with:

> For the normal distribution, I can only make an abysmally poor job of predicting the next value from a random draw.  

> The higher the entropy of the distribution, the more likely I am to be wrong at predicting the next value.

Blackett, for Y13 you could maybe pitch it informally that the normal distribution is the most "boring" histogram that has a particular µ and 𝜎? You could offer them a selection of other distributions that are shaped like e.g. the profile of an animal, a house, or whatever. All have the same µ and 𝜎, but are in some sense "telling you more" than a boring normal distribution is.

edit: this is UKC - you should probably include the profile of the Matterhorn or something, too. If you were in Glasgow I would say use Ben Lomond. For you... Roseberry Topping?

Post edited at 18:30
 DaveHK 10 Feb 2022
In reply to jonny taylor:

>  You could offer them a selection of other distributions that are shaped like e.g. the profile of an animal, a house, or whatever. All have the same µ and 𝜎, but are in some sense "telling you more" than a boring normal distribution is.

Does the fact that it's a recognisable pattern "tell you more", or is it just a case of pattern recognition? Could that be the sort of explanatory simplification that creates problems later?

Edit: I've thought about it more and I'm wrong.

Post edited at 18:39
 petwes 10 Feb 2022
In reply to ablackett:

There is hope for us after all if this level of statistics is being taught to upper sixth (showing my age) at A levels. Back in the '70s, statistics, even at university, was delivered at a very basic level. It is only in the tail end of my career that I really grasped stats and understood how important it is in the modern world of manufacturing, and in having a realistic concept of chance and probability when "facts" are delivered by politicians and the media. 

 wintertree 10 Feb 2022
In reply to jonny taylor:

I’ve previously failed to be illuminated by that wiki page beyond my poor explanation above.  

Want to bet that the maximum entropy distribution has the minimum surface free energy if you treat the shape of the probability curve as an interface between two dissimilar liquids?

In reply to DaveHK:

> Does the fact that it's a recognisable pattern "tell you more" or is just a case of pattern recognition?

The pattern is recognisable (eg Matterhorn vs Snowdon) because it has information in it beyond the mean and the standard deviation; if you try and find a distribution with the least extra information in it, you end up with the Gaussian - least information so most random, most entropic.  I think that’s what it’s angling at.  One way of thinking about it is if you “melted” and “annealed” the Matterhorn or Snowdon profiles with the only physical constraints being the mean and standard deviation, their differentiating information would be “lost” (that’s another discussion) and they’d both revert to the same Gaussian.  

Post edited at 18:48
 DaveHK 10 Feb 2022
In reply to wintertree:

> The pattern is recognisable (eg Matterhorn vs Snowdon) because it has information in it beyond the mean and the standard deviation; 

Yes, that was what I realised with further reflection. Also, with a more complex pattern there are fewer ways to rearrange the data (that's maybe the wrong word?) and still keep the same shape, but with the normal distribution there are lots of ways to rearrange it and still keep that shape. And that's what is meant when people equate entropy with disorder.

Is that right? 

Post edited at 19:29
 wintertree 10 Feb 2022
In reply to DaveHK:

>  Also, with a more complex pattern there are fewer ways to rearrange the data (that's maybe the wrong word?) and still keep the same shape

I think it's that sort of thing.

> but with the normal distribution there are lots of ways to rearrange it and still keep that shape. And that's what is meant when people equate entropy with disorder.

Again I think it's that sort of thing, but I'm never very happy trying to match words to maths, especially when I've muddled through the higher maths.

So, the Wintertree brute-force-and-ignorance "I'll model it" approach to understanding.

I just made an annealer of sorts for probability distributions.  Give it a series of discrete values drawn from a probability distribution and it repeatedly re-arranges them by:

  1. Add random noise to all their values (move them left or right a bit)
  2. Pick a value at random, and move it left if the mean is too +ve, or right if the mean is too -ve
  3. Pick a value at random and move it towards 0 if the stdev is too large, or away from 0 if the stdev is too small.

I fed this annealer a distribution made of two pillars that had 𝜇=0 and 𝜎=1.  Within a thousand iterations or so it has sloughed out into an approximation of a normal distribution (the black curve).  If the annealer had been given a bit more thought and finesse I think it would be a lot smoother...  (The values are discretised in steps of 0.04 to give good sampling across a 𝜎=1 Gaussian.  There is some biasing effect, with the annealer failing to elegantly handle items at x=0, that's sort of plastered over by an egregious level of randomisation...)

Any other distribution input into this will anneal to the same result.  It's a lot like how, if you make wax figurines of different mountains, each with the same mass of wax, it doesn't matter which one you melt: you end up with the same shaped blob of wax.  
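A minimal sketch of such an annealer in Python (not the actual code used above; it replaces steps 2 and 3 with a cruder trick, re-centring and re-scaling the whole set after each round of noise, which pins 𝜇 and 𝜎 exactly):

```python
import random
import statistics

def anneal(samples, mu=0.0, sigma=1.0, iterations=1000, noise=0.1):
    # Sketch only: jiggle every value, then re-centre and re-scale the
    # whole set so the mean and standard deviation stay pinned at mu, sigma.
    xs = list(samples)
    for _ in range(iterations):
        xs = [x + random.gauss(0.0, noise) for x in xs]  # random noise
        m = statistics.fmean(xs)
        s = statistics.pstdev(xs)
        xs = [(x - m) / s * sigma + mu for x in xs]      # pin mu and sigma
    return xs

random.seed(1)
# Two pillars at -1 and +1: mu = 0, sigma = 1, but as un-normal as it gets.
pillars = [-1.0] * 500 + [1.0] * 500
out = anneal(pillars)
# About 68% of the annealed values now lie within one sigma of the mean,
# as you'd expect for a Gaussian.
inside = sum(1 for x in out if abs(x) < 1.0) / len(out)
print(inside)
```

Start it from a distribution that is already normal and it just stays normal, apart from the fuzz.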

In reply to ablackett: 

It wouldn't be much work to make a few videos of this distribution and a Roseberry Topping distribution sloughing out to a Gaussian, but I don't know if you could explain the annealing process within the bounds of the course or not?

Post edited at 19:59

 coinneach 10 Feb 2022
In reply to wintertree:

My brain hurts 

 mbh 10 Feb 2022
In reply to wintertree:

That is really cool. And presumably, if you start off with a standard normal distribution, it doesn't go anywhere?

Is there some link of reasoning here to the central limit theorem, whereby the distribution of means of samples drawn from any distribution ends up itself being normally distributed, this being the distribution that imposes least 'order' on the collection of sample means and thus maximises entropy?  I can't reason this mathematically, and even intuitively it seems odd. You would think that a uniform distribution had less 'order' than a normal distribution, where things are clumped together in the middle.

Seems maybe so:

https://mathoverflow.net/questions/182752/central-limit-theorem-via-maximal...

Post edited at 20:31
 DaveHK 10 Feb 2022
In reply to wintertree:

> I'm never very happy trying to match words to maths

Me neither but I don't have the maths so words is all I got.

> I just made an annealer of sorts for probability distributions...

Yeah, that doesn't help me at all I'm afraid.  

 wintertree 10 Feb 2022
In reply to mbh:

> And presumably, if you start off with a standard normal distribution, it doesn't go anywhere?

Exactly; it stays normal, except for the fuzz from the annealer.

> is there some link of reasoning here to the central limit theorem, whereby the distribution of means of samples drawn from any distribution end up themselves being normally distributed, this being the distribution that imposes least 'order' on the collection of sample means and thus maximises entropy? 

I have to go away and think on that.  It sounds sensible.

> you would think that a uniform distribution had less 'order' than a normal distribution, where things are clumped together in the middle.

My probably broken thinking on that earlier tonight was that a uniform distribution is zero-valued everywhere if it extends to ± infinity, and so is kind of meaningless to consider. If it has finite bounds to give it a non-zero value, then there is a high degree of order at the sharp cutoffs.

Re: your edit and the MathOverflow link - now my brain hurts too.  Some good reading there for daylight.

In reply to DaveHK:

Are you a fan of Terry Pratchett?  Recall the way the ghosts in the castle in Lancre become indistinct blobs as they lost their morphogenic fields?  The gaussian function is the blob any other probability distribution ends up with.

>> I'm never very happy trying to match words to maths

> Me neither but I don't have the maths so words is all I got.

How are we ever going to have an internet argument this way?  

Post edited at 20:41
 DaveHK 10 Feb 2022
In reply to wintertree:

> Are you a fan of Terry Pratchett?  

Nope.  

> How are we ever going to have an internet argument this way?

2735. And that's my last word on the matter.

 Richard J 10 Feb 2022
In reply to ablackett:

As others have said, this is using entropy in the information theoretic sense rather than the thermodynamic sense (although the concepts are linked at a deep level).

A piece of jargon I was taught about this, which I think is actually very graphic, is that the maximum entropy probability distribution is the "maximally non-committal" distribution.  So, say there's some quantity that has to be between 0 and 1, and you know nothing at all about the probability distribution apart from that: your best bet is simply to assume that the probability distribution for that quantity is flat - it's as likely to be 0.2 as 0.6 or any other value in the range.  But as you find out more information about the distribution, this constrains the shape the distribution takes.  If you know the mean and variance, then the maximum entropy distribution turns out to be a Gaussian.  For any value of mean and variance, you could think of other distributions with different shapes that have the same mean and variance, but if you chose one of these you'd be assuming knowledge that you haven't got - you might choose a distribution that skews left - but how do you know the skew isn't rightwards? The maximum entropy distribution subject to some constraints is the distribution that embodies the minimum information possible: no more information than is contained in the constraints you put in.

Formally, you use the calculus of variations to find the function f that gives a stationary value of the integral of −f(x) ln f(x), subject to any constraints you have (e.g. the integral of x f(x) = the mean, if that's what you know).
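In symbols (a standard calculation, sketched with one Lagrange multiplier per constraint):

```latex
% Maximise the differential entropy subject to the known constraints:
H[f] = -\int f \ln f \,\mathrm{d}x, \qquad
\int f\,\mathrm{d}x = 1, \quad \int x f\,\mathrm{d}x = \mu, \quad
\int (x-\mu)^2 f\,\mathrm{d}x = \sigma^2 .

% Stationarity of  H + \lambda_0 \!\int\! f + \lambda_1 \!\int\! x f
%                    + \lambda_2 \!\int\! (x-\mu)^2 f   with respect to f:
-\ln f(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 (x-\mu)^2 = 0
\;\Longrightarrow\;
f(x) = \exp\!\big(\lambda_0 - 1 + \lambda_1 x + \lambda_2 (x-\mu)^2\big).
```

The exponential of a quadratic is a Gaussian, and the three constraints fix the three multipliers, giving exactly the normal density.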

But just remember, it's about being maximally non-committal.  A good principle for life, really.

Post edited at 21:02
 mbh 10 Feb 2022
In reply to ablackett:

Another thought. Why are variables in the natural and physical world so often normally distributed? If they are the result of many independent binary processes (eg gene expresses or not), then the overall distribution will surely reflect that and you will end up with whichever distribution imposes least order on the data.

This, one might guess, is going to be the normal distribution, since that is what you get if you toss enough coins or roll enough dice and look at the distribution of heads or scores.

That's not reasoning as such, more of an observation, but it's as far as I have managed to get.

Post edited at 21:16
 jonny taylor 10 Feb 2022
In reply to wintertree:

> a uniform distribution is zero-valued everywhere if it extends to ± infinity and so is kind of meaningless to consider.

And it definitely doesn’t have a stdev of 1.

OP ablackett 10 Feb 2022
In reply to ablackett:

Thanks all, I’ve just got in and there are too many complicated words for this time of night, I’ll have a look in the morning. Much obliged for the time folk have put in thinking about this.

To answer one point up thread, it’s very much beyond the scope of the course. I’m just trying to understand it out of interest and so that perhaps I can teach it better in future.

OP ablackett 11 Feb 2022
In reply to wintertree:

For anyone else reading Wintertree’s comment and trying to understand what he has done.

”Anealer (n): A device or process that anneals.”

You’re welcome.

 CantClimbTom 11 Feb 2022
In reply to ablackett:

Yes, it should be simple to explain as most things are at a vague concept level without getting technical and discussing annealing (sorry wintertree, maybe if it was quantum computing..). If my Bayesian viewpoint offends... tough

You could mention that "stuff" in the universe tends to get less ordered and naturally evens out given time. Increasing entropy is just a fancy way of saying all mixed up and evened out.

The normal distribution curve is the most evened out shape we see when looking at stuff that's happened.

So the two things are the same: the normal distribution shape is what we see when everything is mixed up and evened out... maximum entropy

 wintertree 11 Feb 2022
In reply to CantClimbTom:

> it should be simple to explain as most things are at a vague concept level without getting technical and discussing annealing

> the normal distribution shape is what we see when everything is mixed up and evened out.

To my reading you’ve just explained, in non technical terms, the process of annealing any probability distribution and getting the normal distribution out.  

So, discussing annealing without actually naming it!

Yours is a great explanation for kids I’d have thought.  But then I’ve never taught in a school…

 Robert Durran 11 Feb 2022
In reply to mbh:

> Another thought. Why are variables in the natural and physical world so often normally distributed? 

But are they? (I don't know the answer).

The central limit theorem means that if, say, a plant grows by an amount each day which follows some distribution, then its height after a reasonable length of time will be approximately normally distributed (this is the sort of thing I talk about when teaching). 

However, I suspect that normal distributions occur far less often than you might think from questions in school exam papers; if the heights of elephants are normally distributed then their weights and surface areas cannot be (though they would be related), yet all three would, I am sure, be routinely assumed to be by textbook writers and exam setters. 

I am very far from being an expert (I have self taught just enough stats to be able to teach it at school level), but I have pondered these things.

 Jamie Wakeham 11 Feb 2022
In reply to ablackett:

...and this is why I refuse to teach anything but the mechanics component of Maths A-level!

 Robert Durran 11 Feb 2022
In reply to Jamie Wakeham:

> ...and this is why I refuse to teach anything but the mechanics component of Maths A-level!

It is certainly true that the theory behind school level stats is way, way beyond school level maths. Ever tried to derive a t, chi squared or even the normal distribution? As for where those degrees of freedom come from😱.

 wintertree 11 Feb 2022
In reply to ablackett:

> ”Anealer (n): A device or process that anneals.”

As opposed to a person who kneels!

It turns out Roseberry Topping was pretty normal anyhow and its annealing was pretty undramatic.  The annealer definitely needs a bit more finesse - probably lowering the temperature (random displacements) as it progresses - and not doing that is going to bug me all day as I do work-work.

Post edited at 09:44

 DaveHK 11 Feb 2022
In reply to Robert Durran:

> It is certainly true that the theory behind school level stats is way, way beyond school level maths. Ever tried to derive a t, chi squared or even the normal distribution? As for where those degrees of freedom come from😱.

I teach a bit about statistics for the Advanced Higher Geography course. I think I've got a sort of good lay person's grasp on it, in that I can explain how to use the various tests and what the results mean. However, I'm acutely aware that I'm dipping a toe into some very, very deep water and I'm at that dangerous stage of understanding where you know just enough to lead you out of your depth...

I know there are a lot of teachers delivering AH Geography who really don't have a clue about the stats side of it.

 yeti 11 Feb 2022
In reply to coinneach:

aye mine too, and no one has explained yet why Japanese currency is involved (Y13)

 ; )

 Jamie Wakeham 11 Feb 2022
In reply to Robert Durran:

It does strike me that the basis for an awful lot of the subject (certainly at A level) is 'trust Karl Pearson'...

 CantClimbTom 11 Feb 2022
In reply to ablackett:

At risk of offending people with huge generalisation based on my own ignorance and stereotypes (aren't those the best kind of generalisation?) since I'm neither a teacher nor mathematician... Oh and this most definitely doesn't apply to the OP who came here asking how to explain. This is intended as good natured banter, so stay calm!

My son (Y9) was moaning only the other day how rubbish and difficult maths is as a subject and it's all irrelevant rubbish anyway, etc, etc, moan, moan.

I told him that actually it was a great subject and really useful too. The problem was that maths teachers are almost always people who have maths degrees, and people who have maths degrees aren't normal; they live on a different plane. They understand it all so deeply, but because they're different, they just can't explain it to normal people. The problem is that maths is good but it's the worst taught subject - and maths teachers are the worst people to teach it.

OK, yes, mildly offensive stuff and pretty stereotyped. But in his case this explanation nudged his attitude slightly towards maybe maths isn't so bad. So even if not quite accurate, the explanation had utility, in his case anyway

 toad 11 Feb 2022
In reply to CantClimbTom:

Once upon a time I had to teach basic stats to social science undergrads. They absolutely refused to engage, the "there's no point in maths" ethos had become so deeply entrenched by school that it was a lost cause

 DaveHK 11 Feb 2022
In reply to CantClimbTom:

That's the difference between being good at a subject and good at teaching. Some people are both.

 Robert Durran 11 Feb 2022
In reply to DaveHK:

> I teach a bit about statistics for the Advanced Higher Geography course. I think I've got a sort of good lay persons grasp on it in that I can explain how to use the various tests.

> I know there are a lot of teachers delivering AH Geography who really don't have a clue about the stats side of it.

The trouble with stats is that it is used by so many people who don't know very much about it. A friend of mine doing a biology PhD got help with her stats from her ex, who was a professional statistician. When she submitted her PhD she was told that her stats were all wrong and that she had used the wrong tests. So she did them again and they were accepted. It is probably anyone's guess who was right. And it does make you wonder how much inappropriate or wrong stats goes on out there.

 Doug 11 Feb 2022
In reply to Robert Durran:

sounds familiar. I remember having a session with the head of the statistics dept when I started my PhD (plant ecology) & he helped me plan the experimental design. A couple of years later I had another session with the new head of stats who basically said the design was stupid.

Thankfully my examiners were biologists rather than statisticians

 Richard J 11 Feb 2022
In reply to Robert Durran and mbh:

>Why are variables in the natural and physical world so often normally distributed? 

>> But are they? (I don't know the answer).  The central limit theorem means that, if say, a plant grows by an amount each day which follows some distribution, then its height after a reasonable length of time will be approximately normally distributed...However, I suspect that normal distributions occur far less often than you might think from questions in school exam papers;

It is the magic of the central limit theorem that you're seeing here.  If you have some quantity that is the sum of two uncorrelated variables, the probability distribution for that quantity is essentially a convolution of the two probability distributions for the two variables.  It turns out that as you convolve more and more distributions, no matter what their starting shape (as long as they're vaguely pointy), the function this process converges on is a Gaussian - this is the central limit theorem.  It's worth trying this out (e.g.) with starting functions that are triangles.  (I suspect Wintertree's annealing process amounts in effect to a cheap and cheerful numerical convolution algorithm).
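The triangle experiment is easy to run numerically. A quick sketch in Python (the particular triangle pmf and the number of convolutions are my choices, purely for illustration):

```python
import math

def convolve(p, q):
    """Discrete convolution: the pmf of the sum of two independent variables."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# A 'vaguely pointy' triangle pmf on 0..4
tri = [w / 9 for w in (1, 2, 3, 2, 1)]
mu, var = 2.0, 4.0 / 3.0   # mean and variance of the triangle

# Convolve it with itself 7 times -> distribution of a sum of 8 copies
p = tri
for _ in range(7):
    p = convolve(p, tri)

# Compare with the Gaussian that has the matching mean and variance
diff = max(abs(pi - gaussian_pdf(i, 8 * mu, 8 * var)) for i, pi in enumerate(p))
print(diff)  # small: the sum is already very close to Gaussian
```

The single triangle already misses its matching Gaussian by a few percent at the edges; after seven convolutions the worst pointwise disagreement falls well below that.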

So the prevalence of Gaussian/normal distributions is just down to the fact that lots of things in nature arise as the combined effect of lots of uncorrelated variables, each of which may well be drawn from non-Gaussian distributions.

There is a twist to this that isn't widely enough appreciated, though.  The central limit theorem is, as its name suggests, a limit, which applies rigorously when the number of variables/distributions you're combining goes to infinity.  If you look at how this limit is approached, what you usually find is that, while it only takes a few convolutions to get the middle of the distribution looking pretty Gaussian, the tails take a lot longer to converge.  

In fact, lots of distributions you find in nature look Gaussian near their peaks, but have fat tails - that is, the probability decays to zero much more slowly as you move away from the centre than you'd predict for a Gaussian.  

This means that for many random processes, extreme events are often much less rare than you might think.  You probably should know this if you run an insurance company or a hedge fund...

 profitofdoom 11 Feb 2022
In reply to toad:

> Once upon a time I had to teach basic stats to social science undergrads. They absolutely refused to engage, the "there's no point in maths" ethos had become so deeply entrenched by school that it was a lost cause

EXACTLY the same thing happened with me and humanities students. They all, without exception, said they were hopeless with numbers and with statistics. I wondered then what was going on, where this came from. It was so pervasive. However, I managed to lead them through it 

 Richard J 11 Feb 2022
In reply to Robert Durran:

> It is certainly true that the theory behind school level stats is way, way beyond school level maths. Ever tried to derive a t, chi squared or even the normal distribution? As for where those degrees of freedom come from😱.

I think the trouble here is that the classical foundations of statistics, based on "frequentist" definitions of probability, are a bit of a conceptual mess.  Bayesian approaches, which explicitly take a subjective definition of probability - i.e. that it is a measure of one's knowledge of a system, or lack of it, and can be updated as more information comes in - are conceptually clearer and more appealing.  But they are technically more tricky to implement for practical problem solving than the classical methods, hence the rather unsatisfactory cook-book approach you end up having to teach at A level.

 Jamie Wakeham 11 Feb 2022
In reply to CantClimbTom:

You do have a bit of a point.  Some mathematicians are, perhaps, a little disconnected from reality, and knowing your subject and being able to teach it are very different things.  That's true in all subjects but maybe it's more pronounced in physics and maths...

For me, as a physicist (who has ended up teaching a lot of maths because that's where the demand is) the key has always been in being able to answer the question 'But why?' for anything I taught.  The depth of that explanation is context dependent - I'll answer it differently for a Y9 and a Y13 - but I think a student's confidence in you and what you're telling them is terribly linked to their knowing that they can always ask you to say why this thing happens and get a decent response.  In some cases this takes you down some pretty serious rabbit holes (!) but as long as they know you're answering with knowledge and conviction, they'll trust you.

So, for me, that's stats out.  Anything more serious than GCSE stats and I can't justify it to myself, so I wouldn't dare try to justify it to a student!  I did once try to learn how it all worked, when I was helping my wife with an MSc that had a serious stats component, and found it so alien to the way I like to think, that I ran away and never went back.  Probably a character flaw.  

Incidentally, I think this is why many students hate GCSE chemistry with a passion.  If you ask the question 'Why does the periodic table look like that?' then there's no good answer at that level.  You have to get to A level before it can even begin to make sense.

 wintertree 11 Feb 2022
In reply to Richard J:

>  (I suspect Wintertree's annealing process amounts in effect to a cheap and cheerful numerical convolution algorithm).

My intuition is disagreeing but the rest of my brain is on leave today.  I would say the two approaches both remove all extraneous information from the distribution, corresponding to a removal of the higher-frequency Fourier components, but that they do this by different methods.  (Consider how a distribution with two pillars and zero elsewhere anneals rapidly to a Gaussian but takes a long time to ring down to something resembling one through repeated convolution with the original.)

Both are like blurring operations, but I don't see how a random walk process could equate to a convolutional one.  This might be a flaw on my behalf...  

>  turns out that as you convolute more and more distributions, no matter what their starting shape (as long as they're vaguely pointy) the function this process converges on is a Gaussian - this is the central limit theorem.  It's worth trying this out (e.g.) with starting functions that are triangles

Or with Roseberry Topping, which looks pretty Gaussian after its first convolution with itself.  I normalised the height of each peak (which lowers in reality as they slough out) for a nicer plot, below.

Post edited at 14:27

 Robert Durran 11 Feb 2022
In reply to Jamie Wakeham:

> You do have a bit of a point.  Some mathematicians are, perhaps, a little disconnected from reality, and knowing your subject and being able to teach it are very different things.  That's true in all subjects but maybe it's more pronounced in physics and maths...

I think you have to know maths to teach it but knowing it does not mean you can teach it.

> For me, as a physicist (who has ended up teaching a lot of maths because that's where the demand is) the key has always been in being able to answer the question 'But why?' for anything I taught. 

> So, for me, that's stats out.  Anything more serious than GCSE stats and I can't justify it to myself, so I wouldn't dare try to justify it to a student! 

As a matter of interest, how do you answer the question: "But why is there an n-1, not n in the sample standard deviation formula?"?

 Robert Durran 11 Feb 2022
In reply to CantClimbTom:

> The problem was that maths teachers are almost always people who have maths degrees and people who have maths degrees aren't normal they live on a different plane. They understand it all so deeply, but because they're different, they just can't explain it to normal people. The problem is that maths is good but it's the worst taught subject - and maths teachers are the worst people to teach it.

No, I think the problem (or one of them) is that we are forced (and over my career increasingly) by exam systems to teach pupils how to pass maths exams, not to do or understand maths or to enjoy it. 

The trouble with the way maths is taught is that almost everyone is taught it to their breaking point, where they are just trying to get through one last exam before giving the subject up. So most people's last year or two's experience of maths is a rubbish one, leaving them with a bad impression of the subject, struggling with procedures they don't understand and will probably never use. I firmly believe that in an ideal world everyone would stop learning new maths a year before they drop the subject and instead apply what they do know, develop problem solving skills and generally have the sort of fun that maths should be about.

 Jamie Wakeham 11 Feb 2022
In reply to Robert Durran:

You have identified, precisely, the point at which I stop understanding stats.  I can (just about) justify why SD is a useful measure of dispersion about the mean (I usually use a picture of several stickmen extending above and below their mean height to do this) but what the hell that n-1 is doing there... <mumbles something something sample mean versus population mean?>

I'm a physicist who does maths because all my clients ask me to.  For a long time I taught Physics up to A level but only did GCSE maths (and mechanics only up to A level, because that's just physics).  A few years ago I taught myself how to teach pure maths up to AS, and actually rather enjoyed it - I'd forgotten how lovely the graphical derivation of calculus was, for example.  Then I looked at AS stats, and ran away bravely.

 wintertree 11 Feb 2022
In reply to Robert Durran:

> As a matter of interest, how do you answer the question: "But why is there an n-1, not n in the sample standard deviation formula?"?

Determining a mean from the samples disregards the information from one sample, as it’s 100% determinable from the other samples and the sample mean.  If it’s bringing no information to the party, it’s not bringing any variance either and so it can be disregarded and shouldn’t be counted…
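A toy example of that point (the sample values are made up, purely for illustration): once you know the sample mean and all but one of the samples, the last sample is fully determined.

```python
# Hypothetical sample values, purely for illustration.
samples = [2.0, 5.0, 3.0, 6.0]
n = len(samples)
sample_mean = sum(samples) / n

# Knowing the mean and the first n-1 samples pins down the last one exactly:
reconstructed = n * sample_mean - sum(samples[:-1])
print(reconstructed == samples[-1])   # True: it brings no new information
```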

(edit: marks out of 10?)

Post edited at 18:44
 Robert Durran 11 Feb 2022
In reply to wintertree:

I see you have deleted your post after I had been thinking about it, so I'll answer anyway! 

Is what you are saying sort of equivalent to saying that using the residuals from the sample mean tends to make them smaller than they should be because the sample mean is by definition in the "middle" of the sample and so dividing by N-1 rather than N tends to correct this? 

What you say seems to be saying that we have lost a degree of freedom (something I've never really got my head around in general!).

I have wondered whether it is the fact that samples can include a member of the population more than once (the names are put back in the hat). The theory assumes this, even if, in practice, sampling is never done with replacement, because it makes the maths much simpler than not replacing. So lots of possible samples will be bunched up by having repeated members of the population in them and so will tend to underestimate the population variance. Dividing by N-1 rather than N will compensate for this. Consider the extreme case of a sample size of 2 from a population of 2; two of the four possible samples (AA, AB, BA, BB) will have zero variance.
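That extreme case is small enough to enumerate directly; a quick sketch, taking the population values to be 0 and 1 (my arbitrary stand-ins for A and B):

```python
from itertools import product

pop = [0, 1]                                                # stand-ins for A and B
pop_mean = sum(pop) / len(pop)                              # 0.5
pop_var = sum((x - pop_mean) ** 2 for x in pop) / len(pop)  # 0.25

n = 2
vars_n, vars_n1 = [], []
for sample in product(pop, repeat=n):      # AA, AB, BA, BB (with replacement)
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    vars_n.append(ss / n)                  # divide by n
    vars_n1.append(ss / (n - 1))           # divide by n-1

print(sum(vars_n) / len(vars_n))     # 0.125: biased low (true value is 0.25)
print(sum(vars_n1) / len(vars_n1))   # 0.25: matches the population variance
```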

This may be nonsense though!

Edit: I see you have rephrased your post now!

Post edited at 18:50
 wintertree 11 Feb 2022
In reply to Robert Durran:

Sorry yes re posted as I wanted to be more concise.

> Is what you are saying sort of equivalent to saying that using the residuals from the sample mean tends to make them smaller than they should be because the sample mean is by definition in the "middle" of the sample

Yes; the true mean is additional information to the samples; the sample mean brings no new information, being computed from them.  The samples won’t be as well balanced about the true mean as they (perfectly) are about the sample mean.  So using a sample mean underestimates the variance.

> and so dividing by N-1 rather than N tends to correct this?

More precise than “tends to”; using a mean computed from the samples effectively removes one sample’s worth of information (and so variance) compared with a calculation using the true mean.

That’s my take anyhow.

OP ablackett 11 Feb 2022
In reply to Robert Durran:

> As a matter of interest, how do you answer the question: "But why is there an n-1, not n in the sample standard deviation formula?"?

I know this isn't right, but I tell the students that dividing by n-1 rather than n makes the standard deviation bigger to compensate for the fact that you don't have all the members of the population in the sample.

If your sample is small it makes it quite a bit bigger, if your sample is large then it only makes it a little bit bigger.  They seem happy with this and we move on. 

 mbh 11 Feb 2022
In reply to ablackett:

From a practical point of view, if the difference between n and n-1 matters, then you probably don't have a big enough sample.

From another point of view, the difference of each of the observations from the sample mean will on average be less than its difference from the population mean. Dividing by n-1 rather than n compensates for that.

On reading more thoroughly, I see that my second point more or less repeats what you and Robert have said and is what wintertree is getting at in a more precise way.

Post edited at 19:16
 Jamie Wakeham 11 Feb 2022
In reply to ablackett:

That seems reasonable - and I think that's the explanation I half remembered.  

Wintertree's explanation of why there are n-1 degrees of freedom makes perfect sense to me, but I have no feeling at all for why I'd want to divide by the number of degrees of freedom.  If there were ten people in my sample then surely I'd want to divide by ten..?

I am so much happier working out where the point-like projectile lands when the light inextensible string snaps!

 wintertree 11 Feb 2022
In reply to Jamie Wakeham:

I find the DOFs a bit of a diversion when thinking about it; you’re doing an average and so you divide by the number of data points in that average.  The loss of a DOF by using a sample mean means that you in effect only have variance contributions from one fewer sample (as the last one is fully defined by the mean and the other samples) and so that’s the number you divide by.  

In reply to mbh:

> From a practical point of view, if the difference between n and n-1 matters, then you probably don't have a big enough sample.

Good answer! More and better data is always my preferred solution…

 Robert Durran 11 Feb 2022
In reply to wintertree:

> I find the DOFs a bit of a diversion when thinking about it; you’re doing an average and so you divide by the number of data points in that average.  The loss of a DOF by using a sample mean means that you in effect only have variance contributions from one fewer sample (as the last one is fully defined by the mean and the other samples) and so that’s the number you divide by.  

While it makes qualitative sense this whole idea of replacing N by N-x when you lose x degrees of freedom seems like witchcraft rather than mathematics to me!

 Robert Durran 11 Feb 2022
In reply to ablackett:

> I know this isn't right....

😃

> ......but I tell the students that dividing by n-1 rather than n makes the standard deviation bigger to compensate for the fact that you don't have all the members of the population in the sample.

Yes, but why does it compensate?

>  They seem happy with this.....

Well they shouldn't be!

I just carefully explain that it tends to give a better estimate of the population standard deviation (I discuss the difference), but come clean and say that the proof is too technical and to ask me again in a couple of years if they want to.

 RobAJones 11 Feb 2022
In reply to CantClimbTom:

I appreciated the caveats made before these points

> The problem was that maths teachers are almost always people who have maths degrees

Not in the schools I've worked in and it's becoming less likely, is that a good thing? 

>and people who have maths degrees aren't normal; they live on a different plane.

You might have a point there 😊

>They understand it all so deeply, but because they're different, they just can't explain it to normal people.

They exist and would certainly be happiest teaching motivated A level students. I'd argue however that they only have a superficial understanding, a deep understanding is required to make progress with a 15 year old who is essentially caring for his drug addict mother and can't yet do his times tables 

>The problem is that maths is good but it's the worst taught subject

Most popular A level and one of the few without a gender imbalance? 

> and maths teachers are the worst people to teach it.

From having to manage departments that required non specialists to teach maths that was certainly a minority view based on parental concerns I had to deal with. 

 CantClimbTom 11 Feb 2022
In reply to RobAJones:

There was some seriousness in my comments, but at least as much tongue in cheek, if not more. It seemed to "work" as an explanation to my son, so that was the best outcome for me.

Although I still contend that maths has more of a problem about this than other subjects for whatever reason

 RobAJones 11 Feb 2022
In reply to CantClimbTom:

> Although I still contend that maths has more of a problem about this than other subjects for whatever reason

Nationally or just in a couple of schools? 

The cynic in me would suggest that could be because we removed GCSE coursework (in 2005?) so maths teachers couldn't do the work for those who didn't want to.

Edit: Being serious, the poor retention of maths teachers compared to some other subjects probably has an effect. Although less than 50% of teachers teaching maths have a related degree, this is significantly less (30%) in disadvantaged areas

Post edited at 21:45
 Jamie Wakeham 11 Feb 2022
In reply to Robert Durran:

> ...but come clean and say that the proof is too technical and to ask me again in a couple of years if they want to.

That's exactly how I deal with questions about the periodic table, too

 Robert Durran 12 Feb 2022
In reply to RobAJones:

> Although less than 50% of teachers teaching maths have a related degree, this is significantly less (30%) in disadvantaged areas.

Which is a shocking statistic and probably much better explains why so many kids are put off maths than saying it is because they are taught by people with maths degrees.

 Robert Durran 12 Feb 2022
In reply to mbh:

> From another point of view, the difference of each of the observations from the sample mean will on average be less than its difference from the population mean. Dividing by n-1 rather than n compensates for that.

> On reading more thoroughly, I see that my second point more or less repeats what you and Robert have said and is that wintertree is getting at in a more precise way.

Having lain awake part of the night thinking about it, I am no longer at all convinced by this. I need to be convinced that the smaller the sample the more it will tend to underestimate the variance, but I am not. I may, of course, just be being a bit dim.

 RobAJones 12 Feb 2022
In reply to Robert Durran:

>much better explains why so many kids are put off maths  

I think it may be because they think it's hard or are finding it hard. That view is certainly supported by the drop in A level numbers following "harder" exams (2001!). In general I think kids say it's boring to deflect from finding it difficult.

Regarding your n n-1 issue I seem to remember AQA producing some information in the days when there was a coursework element to S1 (2005ish?) Although my, perhaps faulty, memory was the using n-1 in their coursework was always OK as far as the exam board was concerned. I do however have some recollection of discussing the reasons behind it with some of the more inquiring students. 

 jk25002 12 Feb 2022
In reply to wintertree:

> Both are like blurring operations, but I don't see how a random walk process could equate to a convolutional one.  This might be a flaw on my behalf...  

Interesting point. Perhaps this:

The pdf of the sum of two random variables is the convolution of the pdfs of each variable.

A random walk process involves taking a start point, and repeatedly adding a random variable (a step). 

The pdf at the end of the random walk is therefore the repeated convolution of the pdf of the steps.

??

If the steps have a uniform distribution, the pdf of the random walk tends to a Gaussian through piecewise polynomials of increasing order.
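For what it's worth, this is easy to see numerically; a rough sketch (the discretisation and the choice of 10 steps are arbitrary picks of mine):

```python
import numpy as np

dx = 0.01
step = np.ones(100)            # uniform step pdf on [-0.5, 0.5], discretised
step /= step.sum()

pdf = step.copy()
for _ in range(9):             # pdf of the sum of 10 steps:
    pdf = np.convolve(pdf, step)   # pdf of a sum = convolution of the pdfs

# Gaussian with the same mean (0) and variance (10 steps x 1/12 per step)
x = (np.arange(pdf.size) - (pdf.size - 1) / 2) * dx
gauss = np.exp(-x ** 2 / (2 * 10 / 12))
gauss /= gauss.sum()

print(np.abs(pdf - gauss).max())   # tiny: already very close to Gaussian
```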

 Robert Durran 12 Feb 2022
In reply to jk25002:

> Interesting point. Perhaps this:

> The pdf of the sum of two random variables is the convolution of the pdfs of each variable.

> A random walk process involves taking a start point, and repeatedly adding a random variable (a step). 

> The pdf at the end of the random walk is therefore the repeated convolution of the pdf of the steps.

> If the steps have a uniform distribution, the pdf of the random walk tends to a Gaussian through piecewise polynomials of increasing order.

Yes, for example Bin(n, 0.5) is just the sum of n independent copies of the r.v. taking the values 0 and 1 each with probability 0.5 (which is a simple random walk - either take a step or don't). And it converges on N(0.5n, 0.25n).
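A quick numerical check of that convergence, comparing the Bin(n, 0.5) pmf with the N(0.5n, 0.25n) density (n = 100 is just an arbitrary pick):

```python
from math import comb, exp, pi, sqrt

n, p = 100, 0.5
mu, var = n * p, n * p * (1 - p)     # 50 and 25

for k in (40, 50, 60):
    binom = comb(n, k) * p ** k * (1 - p) ** (n - k)      # exact pmf
    normal = exp(-((k - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)
    print(f"k={k}: Bin={binom:.5f}, Normal={normal:.5f}")  # near-identical
```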

Post edited at 10:34
 Robert Durran 12 Feb 2022

Well, this morning's chat in Chulilla was about the use of Bayesian statistics and beta to update the chances of the flash. Thinking about 0.1 unless someone tells us about a hidden pocket or something.

OP ablackett 12 Feb 2022
In reply to Robert Durran:

> Having laid awake part if the night thinking about it, I am no longer at all convinced by this. I need to be convinced that the smaller the sample the more it will tend to underestimate the variance, but I am not. I may, of course, just be being a bit dim.

Robert, you are a far far better mathematician than me, so I would assume it is me being a bit dim here, but here goes.

If you think of variance as how spread out the readings are, to me it is clear that with a smaller sample they will be less spread out than with a larger sample. Consider dropping 10 grains of rice on the floor: you won’t have to hunt as far to find them as if you dropped all the rice.

If I’ve entirely missed the point you are making, I apologise.

 mbh 12 Feb 2022
In reply to Robert Durran:

>> From another point of view, the difference of each of the observations from the sample mean will on average be less than its difference from the population mean. Dividing by n-1 rather than n compensates for that.

> Having lain awake part of the night thinking about it, I am no longer at all convinced by this. I need to be convinced that the smaller the sample the more it will tend to underestimate the variance, but I am not. I may, of course, just be being a bit dim.

It is true. I have attached images for a series of 1000 samples of various sizes drawn from populations that were distributed either uniformly on the interval [0,1] or as a standard normal N(0,1). In each case I show on the left the distribution of sample variances if calculated using sample size N as the denominator, and on the right using N-1 as the denominator. The population variances (1 for the normally distributed population, 1/12 for the uniformly distributed one) are shown as a blue line.

You can see that the sample variance underestimates the population one for small N if we use N as the denominator - ie it is a biased estimator for small sample sizes, whereas if we use N-1 as the denominator, it remains an unbiased estimator even for small sample sizes.
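For anyone wanting to reproduce this, a minimal version of the sort of simulation I mean (normal population only; the sample sizes and repeat count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 100_000                       # samples drawn per sample size

results = {}
for n in (2, 3, 5, 10, 30):
    x = rng.standard_normal((reps, n))       # population is N(0, 1), variance 1
    v_n = x.var(axis=1, ddof=0).mean()       # denominator n
    v_n1 = x.var(axis=1, ddof=1).mean()      # denominator n-1
    results[n] = (v_n, v_n1)
    print(f"n={n:2d}: mean variance /n = {v_n:.3f}, /(n-1) = {v_n1:.3f}")
```

The /n column comes out near (n-1)/n, i.e. biased low for small n, while the /(n-1) column sits near the true value of 1 at every sample size.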

Post edited at 14:06

 mbh 12 Feb 2022
In reply to ablackett:

Samples drawn from a population will have a variance that wobbles around that of the population, however small they are. So you are just as likely to find a fraction of your rice grains on the far side of the kitchen regardless of whether you spill a few or most of the packet.

 Robert Durran 12 Feb 2022
In reply to mbh:

> >> From another point of view, the difference of each of the observations from the sample mean will on average be less than its difference from the population mean.

But isn't what is needed that the difference of each observation from the sample mean is less on average than the average distance of all possible observations from the population mean?

> >> Dividing by n-1 rather than n compensates for that.

> It is true.

Yes, I know it is true. I can prove it rigorously (that the mean of the variances of all possible samples of size n using n-1 is equal to the population variance, ie that it is an unbiased estimator of the the population variance). But it would probably take me an hour or so to reconstruct the proof now and it doesn't provide any intuitive insight into why using n tends to underestimate the variance (which is what I would like), and I'm now not convinced anyone on here has given any either!

I'd be interested to know what anyone thinks about my argument about repeated observed values which I gave at 18.49 yesterday. 

 Robert Durran 12 Feb 2022
In reply to mbh:

> Samples drawn from a population will have a variance that wobbles around that of the population, however small they are. So you are just as likely to find a fraction of your rice grains on the far side of the kitchen regardless of whether you spill a few or most of the packet.

Obviously with more grains of rice you are more likely to get one on the far side of the kitchen, but you will also get a big pile in the middle, so I am not convinced by this argument!

 Robert Durran 12 Feb 2022
In reply to wintertree:

> Determining a mean from the samples disregards the information from one sample, as it’s 100% determinable from the other samples and the sample mean.  If it’s bringing no information to the party, it’s not bringing any variance either and so it can be disregarded and shouldn’t be counted…

How can it be disregarded when it is being used in the calculation of the sample mean? I may be missing something....

 mbh 12 Feb 2022
In reply to Robert Durran:

> Obviously with more grains of rice you are more likely to get one on the far side of the kitchen, but you will also get a big pile in the middle, so I am not convinced by this argument!

What is the probability that I will get one grain of rice going right over to the fridge (and probably under it) if I spill a) a tea spoon or b) the whole packet on the floor? That is not quite how I posed it, which was about the fraction of the spill that would make it over there.

Clearly I would rather spill a tea spoon full rather than a packet as it would be less work to clear up, but the proportion of the mess ending up wherever would on average be the same.

 mbh 12 Feb 2022
In reply to Robert Durran:

>But it would probably take me an hour or so to reconstruct the proof now and it doesn't provide any intuitive insight into why using n tends to underestimate the variance (which is what I would like), and I'm now not convinced anyone on here has given any either!

What do you mean by intuitive? Several of us have given an argument around the fact of observations in a sample being on average closer to the sample mean than they are to the population mean and we can all prove the n-1 thing algebraically given time. What do you want?

> I have wondered whether it is the fact that samples can include a member of the population more than once (the names are put back in the hat). The theory assumes this, even if, in practice, sampling is never done with replacement, because it makes the maths much simpler than not replacing. So lots of possible samples will be bunched up by having repeated members of the population in them and so will tend to underestimate the population variance. Dividing by N-1 rather than N will compensate for this. Consider the extreme case of a sample size of 2 from a population of 2; Two of the four possible samples (AA, AB, BA, BB)  will have zero variance.

In practice, sampling may or may not be done with replacement. It depends what you are looking at. Birds flit about and are hard to tell apart if the same species so may well be sampled more than once. Mussels, less so.

In reply to ablackett:

Proakis, Digital Communications has a section where he defines entropy in the context of information theory.  Probably it has all the maths you need, but don't ask me to explain it.  Not sure it will help explain it to sixth year students.  Maybe Post Docs!

 Robert Durran 12 Feb 2022
In reply to mbh:

> What do you mean by intuitive? Several of us have given an argument around the fact of observations in a sample being on average closer to the sample mean than they are to the population mean and we can all prove the n-1 thing algebraically given time.

As I said, that is obvious but irrelevant. What is needed is that they are closer on average to the sample mean than the whole population is on average to the population mean, but I have not seen a convincing intuitive argument that this is the case. Maybe intuitive is the wrong word, but what I mean is a non-technical and non-algebraic explanation which I could explain to a reasonably bright and interested 15 year old. 

> In practice, sampling may or may not be done with replacement. It depends what you are looking at. Birds flit about and are hard to tell apart if the same species so may well be sampled more than once. Mussels, less so.

The n-1 correction assumes replacement. There is a correction for non-replacement which I found harder to prove. It involves the population size N which in practice you often do not know, but for a large population it is a negligible correction anyway.

Incidentally, did your simulations allow replacement? If not it presumably means my argument that bunching up due to replacement allowing repeated observations explains the lower mean sample variance is not the answer (though it seems pretty convincing for the trivial extreme example I gave for a sample of size 2 from a population of size 2, and for other simple cases I have done by hand).

 mbh 12 Feb 2022
In reply to Robert Durran:

>The n-1 correction assumes replacement. 

Does it? Why?

The simulations I put up here did not allow for replacement, but when I make them do so, it does not alter that the estimator of the variance calculated with a denominator of N underestimates the population variance for small sample size N, whereas that which uses N-1 does not.

 wintertree 12 Feb 2022
In reply to Robert Durran:

> What is needed is that they are closer on average to the sample mean than the whole population is on average to the population mean, but I have not seen a convincing intuitive argument that this is the case. 

I don't think this is the problem, in particular the "than" part.  Consider instead:

  1. Consider the sum of the square displacement (ΣSD) of the points in a sample from a location Q; do you accept that this is minimal when Q is the sample mean?
  2. If you accept (1), do you accept that the ΣSD of all points in the population from Q is minimal when Q is the population mean?
  3. Do you accept that the sample mean gets closer to the population mean with more samples (on average)?
  4. If you accept (2) and (3), then it should be clear that the ΣSD of sample points from the population mean must be larger than from the sample mean, because the sample mean is the turning point of the ΣSD vs Q relationship and represents the position giving the lowest ΣSD.  (3) puts the population mean away from the sample mean, and so a measure of variance using the true population mean would be higher.  
    • So, the measured ΣSD is biased low because using the sample mean in the calculation instead of the population mean removes some true variance from the sum going in to the ΣSD calculation - because the ΣSD will be lower for the sample mean than for the population mean.
  5. The larger N is, the closer the sample mean is to the population mean (on average) because you're better sampled and you're tending to the formal definition of the sample mean.  So, the larger N is, the less the measured variance is biased low. 

Now the key point is that the quantity of variance removed from the sum is equivalent to the variance contributed by one data point, hence dividing by N-1 to compensate.

If (1) or (2) aren't coming across intuitively I'd suggest working out the algebraic expansion of the square displacement from a point Q on paper, taking the derivative and setting it to 0.  This makes it clear that the minimum ΣSD for a sample is from the sample mean, and the minimum ΣSD for the population is from the population mean.  Using the sample mean as a stand-in for the population mean therefore gives a lower ΣSD than when using the population mean (it's the minimal value to measure ΣSD from, and is not generally coincident with the population mean).

Proving that the arithmetic mean of a set of numbers is the position from which their ΣSD is minimal is a simple proof - expand the squares, differentiate with respect to Q, set to zero, and solve for Q.
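Point (1) can also be checked numerically rather than algebraically; a throwaway sketch with an arbitrary random sample:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(size=5)          # arbitrary sample
m = sample.mean()

def ssd(q):
    """Sum of squared displacements of the sample from a location q."""
    return ((sample - q) ** 2).sum()

# Scan candidate locations Q: the minimum lands on the sample mean.
qs = np.linspace(m - 2, m + 2, 4001)
best = qs[np.argmin([ssd(q) for q in qs])]
print(abs(best - m))                 # ~0: minimal ΣSD is at the sample mean
```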

Post edited at 19:12
 wintertree 12 Feb 2022
In reply to jk25002:

> Interesting point. Perhaps this:

> The pdf of the sum of two random variables is the convolution of the pdfs of each variable.

> A random walk process involves taking a start point, and repeatedly adding a random variable (a step). 

> The pdf at the end of the random walk is therefore the repeated convolution of the pdf of the steps.

> ??

> If the steps have a uniform distribution, the pdf of the random walk tends to a Gaussian through increasing orders of polynomial.

You've just about convinced me.  The annealer also implants a force landscape to preserve the mean and the variance which clearly isn't convolutional, but I'm happy to set that aside.  

I suspect the distribution of the random walk's steps doesn't actually matter (provided it has finite variance), given the central limit theorem.

 Robert Durran 12 Feb 2022
In reply to mbh:

> >The n-1 correction assumes replacement. 

> Does it? Why?

To make the maths easier I presume! Once you don't allow replacement, two observations are not independent so you lose even simple convenient things like Var(X+Y)=Var(X)+Var(Y) for the sum of two observations and things get rapidly more complicated.

> The simulations I put up here did not allow for replacement, but when I make them do so, it does not alter that the estimator of the variance calculated using a denominator of 1/N   underestimates population variance for small sample size N whereas that which uses 1/(N-1) does not.

Ok, so it looks like replacement does not cause the underestimation.

 Robert Durran 12 Feb 2022
In reply to wintertree:

> > What is needed is that they are closer on average to the sample mean than the whole population is on average to the population mean, but I have not seen a convincing intuitive argument that this is the case. 

> I don't think this is the problem ,in particular the "than" part.

Isn't the "than" crucial? We are trying to argue that the sample variance is on average less "than" the population variance if you use n.

> Consider the sum of the square displacement (ΣSD) of the points in a sample from a location Q; do you accept that this is minimal when Q is the sample mean?

Yes, easily proved.

> If you accept (1), do you accept that ΣSD of all all points in the population from Q is minimal when Q is the population mean?

Yes

> Do you accept that the sample mean gets closer to the population mean with more samples (on average)?

Yes, a standard result, easily proved.

> If you accept (2) and (3), then it should be clear that ΣSD of sample points from the population mean must be larger than from the sample mean, because the sample mean is the turning point of the ΣSD vs Q relationship and represents the position giving the lowest ΣSD.  (3) Puts the population mean away from the sample mean, and so a measure of variance using the true population mean would be higher.  

Yes.

> So, the measured ΣSD is biased low because using the sample mean in the calculation instead of the population mean removes some true variance from the sum going in to the ΣSD calculation - because the ΣSD will be lower for the sample mean than for the population mean.

I'm afraid you have lost me here. However I have, with a bit of algebra, arrived at the same conclusion from (1) and (2) but I can't see it without trusting the algebra.

> Now the key point is that the quantity of variance removed from the sum is equivalent to the variance contributed by one data point, hence dividing by N-1 to compensate.

Sorry, can't see that leap at all!

 wintertree 12 Feb 2022
In reply to Robert Durran:

>>  What is needed is that they are closer on average to the sample mean than the whole population is on average to the population mean, but I have not seen a convincing intuitive argument that this is the case. 

> Isn't the "than" crucial? We are trying to argue that the sample variance is on average less "than" the population variance if you use n.

The intuitive argument seems trivial and is that the samples don't cover the full range of the population; the more samples you have, the more of the range you cover.  I muddled my reading and thought you said closeness of the samples to the two different means; sorry.

> > So, the measured ΣSD is biased low because using the sample mean in the calculation instead of the population mean removes some true variance from the sum going in to the ΣSD calculation - because the ΣSD will be lower for the sample mean than for the population mean.

> I'm afraid you have lost me here. However I have, with a bit of algebra arrived at the same conclusion from (1) snd (2) but I can't see it without trusting the algebra.

It should make clear sense without the algebra.

  • If you measure the ΣSD between the sample points and the sample mean, that is the lowest possible ΣSD value they can have from any point, which you accept in accepting (1).
  • The sample mean is not the same as the population mean, it contains a random sampling error.  
  • If we measure the ΣSD between the sample points and the population mean, it must therefore be larger than that measured from the sample mean, because the sample mean provably gives the lowest value, and the population mean is at a different point (no sampling error unlike in the sample mean) and so it must give a higher value of the ΣSD, because it can't give a lower one, because the sample mean provably gives the lowest.
    • This can be used for an estimate of σ that contains the sampling error of the limited number of points but not the effect of that sampling error on the position of the mean; so we know it's a better estimate of σ than one using the sample mean, and we know it's larger than one using the sample mean, so we know that using the sample mean under-estimates the measure.

If you calculated σ using the population mean you would get a random sampling error in σ.  If you repeated this process many times and averaged σ over the repeats, the random sampling errors would average out and your value would tend to the population value.

Now, in the real world we often don’t have the population mean, we have the sample mean, and so we use that instead.  But we know that the sample mean is some distance from the population mean (sampling error), and we know that the sample mean delivers the lowest possible measure of ΣSD and so σ (my point 1), and we know that the sample mean is wrong due to sampling error, so we know the ΣSD and σ measured this way are too low, because the only possible way they can be wrong is by being too low (my point 1).  So, if we repeat the whole process many times, the new error introduced by using the sample mean will not average out of σ over the repeats, because every repeat has a low bias introduced by using the sample mean instead of the population mean.  So, our final value of σ will be too low.

Does that explain it clearly?  I'm struggling to put it in clearer words.

> Now the key point is that the quantity of variance removed from the sum is equivalent to the variance contributed by one data point, hence dividing by N-1 to compensate.

That’s the DOFs way of looking at it.  To calculate σ from N samples, we need N+1 pieces of information - the N x samples and 1 x population mean.  Without the population mean, we estimate it from the N samples.  We’re a piece of information short; that’s got to be balanced.  With the sample mean determined entirely by the samples, 1/Nth of the displacement from the mean of each sample is effectively absorbed into defining the sample mean.  In effect, the sample mean is allowed to wander around away from the population mean in a way that minimises the ΣSD and so σ vs using the population mean.  The sample mean will always land in the place that delivers the lowest possible ΣSD; it is constrained by the data points and not the population mean.  The more samples we have, the less the sample mean can wander away from the population mean. 

One way to see this...

  • What's the square displacement of one sample from the population mean?  Excepting the rare chance you happen to draw the mean, it's always a positive value.
  • Now, what's the square displacement of one sample from the sample mean?  It's always zero, because the displacement of that sample has been entirely absorbed into the calculation of the mean.  If we had two samples, half the displacement from each would go into the mean, leaving us with one sample's worth; and so on.

So, using the population mean, making N=1 sampling runs, I could make very poor estimates of σ, but given enough repeats those estimates would tend to the correct population value.  But using the sample mean, I would measure σ=0 because all the variance is absorbed by the sample mean moving about to minimise the ΣSD.  

Another way is to consider that this comes down to what the difference is between the sample mean and the population mean, and how this varies with N.  Obviously as N gets larger, the magnitude of the sampling error in the sample mean gets smaller.  
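If anyone wants to see the bias rather than take my word for it, here's a quick Monte Carlo sketch (toy Python; the parameters - normal population with σ = 1, sample size 5, seed, trial count - are all just illustrative choices):

```python
import random

random.seed(0)

N, TRIALS = 5, 20000
SIGMA = 1.0          # true population standard deviation

avg_biased, avg_bessel = 0.0, 0.0
for _ in range(TRIALS):
    xs = [random.gauss(0.0, SIGMA) for _ in range(N)]
    m = sum(xs) / N                            # sample mean
    ss = sum((x - m) ** 2 for x in xs)         # ΣSD about the sample mean
    avg_biased += ss / N / TRIALS              # divide by N: biased low
    avg_bessel += ss / (N - 1) / TRIALS        # divide by N-1: Bessel's correction

print(avg_biased, avg_bessel)   # ≈ 0.8 and ≈ 1.0 (true variance is 1)
```

The divide-by-N estimate settles near σ²(N−1)/N = 0.8, i.e. exactly one sample's worth of variance has been absorbed by the sample mean; dividing by N−1 recovers the population value.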

Post edited at 21:05
 DaveHK 12 Feb 2022
In reply to ablackett:

I'm going to read all of this thread when I'm not drunk. I expect I'll either learn a lot or be utterly bamboozled. I've already learned a lot from the bits I have read.

Post edited at 20:53
In reply to ablackett:

If I was teaching normal distributions, I would keep it very simple and start with physical explanations, e.g., Brownian motion and diffusion. I would show how Brownian Motion can lead to Gaussian distributions, emphasising that the maths follows the physics and not the other way round. Only then would I introduce entropy, if at all. Entropy is a wonderful concept in closed, isolated systems, but not so useful elsewhere, i.e., in open systems. In my view, entropy has been put on a god-like pedestal, which leads to many misconceptions. It doesn't always increase, even in a closed system, and when a system is open it can be doing anything (and can be mighty hard to compute). We can't really say that the entropy of the (observable) universe is always increasing because we don't know the boundary conditions. The initial condition (the Big Bang) is even more problematical: the popular idea is that at time 0, a great dollop of mass/energy is instantaneously dumped down in one place (at a "singularity"), and this mass (according to the popular viewpoint) must have low entropy and then increase thereafter. I think that is very implausible... sorry to go off on a big tangent!

 Robert Durran 12 Feb 2022
In reply to wintertree:

> Does that explain it clearly?  I'm struggling to put it in clearer words.

Thanks. I think your clarification is equivalent to the rigorous algebra I did earlier. 

I doubt it could be put much clearer and given that it is dependent on the slightly technical least squares result (for a start), I am beginning to think there is probably not an intuitive way of seeing that "a sample is on average less spread out than the whole population" without doing maths beyond most 15 or 16 year olds.

 Robert Durran 12 Feb 2022
In reply to John Stainforth:

> If I was teaching normal distributions, I would keep it very simple and start with physical explanations, e.g., Brownian motion and diffusion. I would show how Brownian Motion can lead to Gaussian distributions, emphasising that the maths follows the physics and not the other way round.  

What do you actually mean by "the maths follows the physics"? 

In reply to Robert Durran:

I mean that a Gaussian distribution is just a description of a probability distribution, and in a similar way entropy is just a measure of thermodynamic probability; neither entropy nor the Gaussian distribution are fundamental.

 Fellover 12 Feb 2022
In reply to wintertree:

This is a really good explanation (the two posts with the various bullet points about N-1). Thanks.

In reply to John Stainforth:

John, as a non-scientist I'm (like Robert Durran) a bit baffled by what you mean by "the maths follows the physics". I think you must mean, "In this case the particular maths follows the physics". My problem is that, as a layman, I can't see how maths ever follows anything. All human beings did was in a sense discover it, the properties of numbers. This was nothing we in any way "invented". It's just out there, and has been since eternity, so utterly primal. Even before "God" said "Let there be light" there was mathematics, surely, even if no living being was yet doing it?

Post edited at 00:11
 Robert Durran 13 Feb 2022
In reply to Gordon Stainforth:

So are you of the school of thought that mathematics exists outside of the universe in the sense that it is outside of reality itself? Might you even go further to say that all there is are mathematical objects, with sufficiently complex ones sort of spontaneously breathing life into themselves as universes (as Max Tegmark postulates)? Or does that require "God" to do the breathing bit?

 Robert Durran 13 Feb 2022
In reply to John Stainforth:

> I mean that a Gaussian distribution is just a description of a probability distribution, and in a similar way entropy is just a measure of thermodynamic probability; neither entropy nor the Gaussian distribution are fundamental.

Is it not more a case of, given the fundamental laws of physics, we can predict that certain measurements will follow certain probability distributions. There are any number of probability distributions out there which will not occur in nature, but we tend to study ones which do occur, such as the normal one (because of the central limit theorem, or even its maximal entropy!).

Post edited at 01:44
 Robert Durran 13 Feb 2022
In reply to John Stainforth:

Another interesting question is whether physics follows reality (ie is just a description of what is out there) or does reality follow physics (ie the laws of physics exist outside of reality and give rise to it).

Post edited at 01:49
 Richard J 13 Feb 2022
In reply to John Stainforth:

> ... neither entropy nor the Gaussian distribution are fundamental.

I'm not sure what you mean by fundamental, but I'd certainly want to argue that (thermodynamic, Boltzmann) entropy is a centrally important concept in trying to understand the world.  It's at the heart of any understanding of the arrow of time, why there's a difference between the past and the future, which is surely where there is the biggest gulf between our own lived experience as beings moving forward in time and the timeless (i.e. time reversible) equations of physics.

But to return to the original question about Gaussians, I think that's about information (Shannon) entropy rather than thermodynamic entropy - it rests on the idea that the Gaussian is, in the rather picturesque jargon, the "maximally non-committal" distribution that has a mean and a variance, i.e. the distribution that contains no more information than those two quantities.
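That "maximally non-committal" claim can be made concrete with the closed-form differential entropies (standard results, in nats) of three distributions that all have mean zero and variance 1 - a toy comparison, nothing more:

```python
import math

# Differential entropy (nats) of three unit-variance distributions:
h_gauss   = 0.5 * math.log(2 * math.pi * math.e)   # normal: (1/2) ln(2πeσ²)
h_laplace = 1 + math.log(2 / math.sqrt(2))         # Laplace, b = 1/√2 so var = 2b² = 1
h_uniform = math.log(math.sqrt(12))                # uniform of width √12 so var = 1

print(h_gauss, h_laplace, h_uniform)   # ≈ 1.419 > 1.347 > 1.242
```

Same mean and variance in every case, but the Gaussian comes out on top, as the maximum-entropy result says it must.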

There are those that claim that the connection between the two types of entropy reflects a central role for information and knowledge in the laws of physics.  Interpretations like quantum Bayesianism point us towards wondering whether the question might not be, as Robert puts it, "does physics follow reality, or does reality follow physics", but rather does physics tell us, not about the nature of reality itself, but only what it is possible to know about reality?

 Robert Durran 13 Feb 2022
In reply to Robert Durran:

> Thanks. I think your clarification is equivalent to the rigorous algebra I did earlier. 

> I doubt it could be put much clearer and given that it is dependent on the slightly technical least squares result (for a star), I am beginning to think there is probably not an intuitive way of seeing that "a sample is on average less spread out than the whole population" without doing maths beyond most 15 or 16 year olds.

So, following last night's insomniac ponderings and this morning's discussion on the drive to the airport from Chulilla (sandbagging polished crap - don't bother*):

If there is an intuitive way of seeing that "a sample is on average less spread out than the whole population", it ought to apply to other measures of spread than variance/standard deviation. The most intuitive measure of spread is, I think, mean deviation, and we only use standard deviation (which, after all, exaggerates contributions from data points with larger deviations via the squaring) because squaring is algebraically much simpler to manipulate than the instruction "ignore negative signs"**.  However, the least squares result crucial to wintertree's argument does not have an equivalent with deviations: the sum of the absolute deviations from a number is not necessarily minimised by making that number the mean (it is easy to come up with a data set which shows this). By playing with some toy data sets which we could cope with mentally while driving, we do, however, suspect strongly that the sample mean deviation will still on average underestimate the population mean deviation. Because of the intractability of the algebra of "ignoring negative signs" I suspect that a proof would be somewhat challenging (I haven't tried yet!). So, what I now suspect is that there ought, in fact, to be a more general, less technical (intuitive, if you like) way of seeing that "a sample is on average less spread out than the whole population".

*If "disliking" this post please be courteous enough to tell me whether you think my views on Chulilla are rubbish, or whether you think my maths is rubbish, or whether I am evil to be about to get on a plane and trash the planet 

** I actually always teach mean deviation briefly because it is so intuitive and simple, but then explain that people actually tend to use standard deviation because, although it is more complicated for the examples the pupils will be doing by hand, the algebra in more advanced work is easier.

Post edited at 10:25
In reply to Robert Durran:

> Another interesting question is whether physics follows reality (ie is just a description of what is out there) or does reality follow physics (ie the laws of physics exist outside of reality and give rise to it).

I thought the consensus (which could be wrong) was that reality had to follow the physics, because reality could only work with a particular set of laws, and even more mysteriously, a particular set of universal constants. (But are the constants really constant? - yet another question). If the Big Bang is the start of everything, it means the laws have to come into existence instantaneously at that moment. Much simpler if the Big Bang is not the start of everything and that our Universe is one of a multitude in time and space, i.e., "it's turtles all the way down". 

In reply to Robert Durran:

> So, following last night's insomniac ponderings and this morning's discussion on the drive to the airport from Chulilla (sandbagging polished crap - don't bother*):

Aren't rock climbs, in general, gradually becoming "sandbagging polished crap" - an inevitable result of "climbing entropy"!

I may be missing the point of your last para, which is quite heavy, but isn't this just another way of saying that a sample is always a somewhat inadequate representation of an entire population?

 Jamie Wakeham 13 Feb 2022
In reply to John Stainforth:

For what it's worth, my take is this.  The universe follows laws of physics, and my feeling is that they are immutable and always present - as you say, they are part of reality itself.

I've always been partial to the idea that the universe is cyclic, and each big bang is the result of a previous big crunch.  And I also quite like the conception that perhaps, although the laws are fixed, the values of the physical constants might not be, so in the next universe the value of G might be a bit bigger, or e a bit smaller.  It's surprising how little you'd need to tinker with a few constants to produce a universe in which life wouldn't work - you can describe a universe in which all stars burn out in only a few million years (so not long enough for evolution to take place), or won't form at all.  So the anthropic principle has to come into play, and we can only explain the 'lucky' coincidence that our universe has these constants so well tuned by realising that the only situation in which we might observe the universe would be if those constants were already so.

Of course, there is another meaning to the word physics, and that's the ragtag collection of laws and theorems that a bunch of intelligent apes have strung together to try to explain what the hell is going on around them.  The universe is under no compunction to follow these, and we frequently have to update them as the universe demonstrates that it does not!

OP ablackett 13 Feb 2022
In reply to ablackett:

Well, my esoteric stats question seems to have sparked a discussion about why life exists and the nature of reality.

Does that make me a good teacher? Or have you all gone off task?

 Jamie Wakeham 13 Feb 2022
In reply to ablackett:

Gotta love UKC thread drift.

Perhaps we could develop some sort of metric that measured the divergence of each reply from the thread topic, and then summed their squares.  I still have no idea if I should then divide by n or n-1, though.

In reply to Jamie Wakeham:

Isn't there already a law - Hubble's Law - for thread shift?

 Robert Durran 13 Feb 2022
In reply to ablackett:

> Well, my esoteric stats question seems to have sparked a discussion about why life exists and the nature of reality.

> Does that make me a good teacher? Or have you all gone off task?

If your lessons stimulate as much interesting discussion as you have here, then I think you must be a brilliant teacher!

 seankenny 13 Feb 2022
In reply to Robert Durran:

> I am beginning to think there is probably not an intuitive way of seeing that "a sample is on average less spread out than the whole population" without doing maths beyond most 15 or 16 year olds.

Here's my stab at this - might well be wrong/too complex but see what you think.

Imagine we have our population which is the numbers 1 - 10. Say a set of ten scores on a test marked out of ten. If we were to lop off the highest and lowest scores, then the variance of this subset is clearly going to be less than the original population, as it spans fewer scores. There is less "spread out-ness". Lop off the next highest and lowest scores, the variance is less again. And so on till we get to just 5 and 6 which has the lowest variance possible.

Smarter kids may notice what happens if you remove two numbers from the middle, say 5 and 6. The variance goes up. But... it goes up by less than the variance went down in the first example. (Quite obvious if you work out the variance of each sub-sample, obviously treating each as a population because you want to show where the bias comes from.) This matters because you're looking at the samples on average, so although some may be a bit bigger most are smaller. This effectively brings the average down when considering all the samples. (Remember that it doesn't bring the average down by much - the bias is quite small.)

Of course there is plenty of handwaving going on here but even just the first bit on its own gives a basic intuition, I hope!
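For anyone wanting less handwaving: with the population 1-10 and tiny samples, the bias can be checked exhaustively. A toy Python sketch (sampling with replacement, i.e. i.i.d. draws, which is the setting in which the n−1 correction is exactly unbiased):

```python
from itertools import product

pop = list(range(1, 11))
mu = sum(pop) / len(pop)                                # 5.5
pop_var = sum((x - mu) ** 2 for x in pop) / len(pop)    # 8.25, dividing by N

# Every possible ordered sample of size 2, drawn with replacement
biased_total, bessel_total, count = 0.0, 0.0, 0
for a, b in product(pop, repeat=2):
    m = (a + b) / 2
    ss = (a - m) ** 2 + (b - m) ** 2
    biased_total += ss / 2          # divide by n
    bessel_total += ss / 1          # divide by n - 1
    count += 1

print(pop_var, biased_total / count, bessel_total / count)
# 8.25, then 4.125 (half the true variance!), then 8.25 on the nose
```

With n = 2, dividing by n averages exactly half the true variance over all possible samples; dividing by n − 1 repairs it exactly, which is why the correction matters most for small samples.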

Edit: is Chulilla really that polished? If you want a little more excitement and less polish I can recommend nearby Montanejos… well old skool.

Post edited at 19:13
 seankenny 13 Feb 2022
In reply to Gordon Stainforth:

> My problem is that, as a layman, I can't see how maths ever follows anything. All human beings did was in a sense discover it, the properties of numbers. This was nothing we in any way "invented". It's just out there, and has been since eternity, so utterly primal. Even before "God" said "Let there be light" there was mathematics, surely, even if no living being was yet doing it?

This is what people generally thought, until non-Euclidean geometry was invented. Suddenly it turns out that "eternal truths" which we effectively found "out there" could be contradicted by equally valid "truths" which we had just invented. So there is some room for considerable debate* about what mathematical objects really are.

* Enough to easily fill a 3rd year undergrad course in the philosophy of mathematics with complex ideas that I have mostly forgotten in the quarter century since I took it.

 mbh 13 Feb 2022
In reply to ablackett:

I have been mulling this idea that, for a given mean (why does that matter?) and variance, a Gaussian random variable is the one that maximises the Shannon entropy.

This surely can be understood through the fact that a binomial distribution approximates to a Gaussian distribution for large n. Thus, a Gaussian random variable is the result of many independent binary processes, each with an equal chance of success. 

How could you possibly impose less order on a system than that?
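The binomial-to-Gaussian convergence is easy to eyeball numerically: for n = 100 fair coin flips, the binomial pmf and the matching Gaussian density are already almost indistinguishable (a toy sketch, n chosen arbitrarily):

```python
import math

n, p = 100, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

max_err = 0.0
for k in range(n + 1):
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    gauss = math.exp(-((k - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    max_err = max(max_err, abs(binom - gauss))

print(max_err)   # on the order of 1e-4, against peak probabilities near 0.08
```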

Post edited at 21:00
OP ablackett 13 Feb 2022
In reply to John Stainforth:

When are we going to discover who the dark energy is?

OP ablackett 13 Feb 2022
In reply to seankenny:

I think that’s it.

If you ‘miss out’ the extreme values it has a more significant effect (downward) on the SD than if you miss out the middle values (upward). So on average the effect of sampling is to underestimate the SD, so the n-1 is needed.

Brilliant.

Now, if this is right, my intuition is telling me that for Robert Duran’s measure of Average Deviation which he described up thread, the n-1 wouldn’t be needed as it doesn’t have the squaring problem which gives additional weighting to extreme values.

I’ll find some time to check this tomorrow unless anyone knows the answer now?

 seankenny 13 Feb 2022
In reply to ablackett:

> I think that’s it.

> If you ‘miss out’ the extreme values it has a more significant effect (downward) on the SD than if you miss out the middle values (upward). So on average the effect of sampling is to underestimate the SD, so the n-1 is needed.

It’s not entirely that simple, as you can get very large variances with just the extreme values. But that’s a rare sample so effects are muted. 

> Now, if this is right, my intuition is telling me that for Robert Duran’s measure of Average Deviation which he described up thread, the n-1 wouldn’t be needed as it doesn’t have the squaring problem which gives additional weighting to extreme values.

> I’ll find some time to check this tomorrow unless anyone knows the answer now?

If you do the algebra for this (I just followed the method in an introductory econometrics text book) then it’s easy to see that the squaring gives rise to a negative covariance term, between each random variable in the sample  and the sample mean. Clearly this covariance itself is always positive but gets smaller the bigger the sample gets.

I find with this sort of thing it’s easier to work it through and then try to get the intuition from that.

 Robert Durran 13 Feb 2022
In reply to seankenny:

> Imagine we have our population which is the numbers 1 - 10. Say a set of ten scores on a test marked out of ten. If we were to lop off the highest and lowest scores, then the variance of this subset is clearly going to be less than the original population, as it spans fewer scores. There is less "spread out-ness". Lop off the next highest and lowest scores, the variance is less again. And so on till we get to just 5 and 6 which has the lowest variance possible.

> Smarter kids may notice what happens if you remove two numbers from the middle, say 5 and 6. The variance goes up. But... it goes up by less than the variance went down in the first example. (Quite obvious if you work out the variance of each sub-sample, obviously treating each as a population because you want to show where the bias comes from.) This matters because you're looking at the samples on average, so although some may be a bit bigger most are smaller. This effectively brings the average down when considering all the samples. (Remember that it doesn't bring the average down by much - the bias is quite small.)

But consider the even simpler case of removing just one value rather than a pair. Certainly removing a value near the mean increases the variance by less than removing a value a long way from the mean decreases it (due to the disproportionate effect of the squaring of larger deviations). However, there will be more numbers whose removal increases the variance than there will be whose removal decreases the variance (the mean of the squares of the numbers 1,2,3,4,5 is bigger than the square of 3), so I am not convinced that it is clear that removing a number on average decreases the variance.

Both your argument and wintertree's depend crucially on the squaring of the deviations, so I do wonder whether not squaring the deviations (ie using the mean deviation as the measure of spread) still results in samples on average giving an underestimate. As I said, the tests we did on some simple sets of numbers seemed to suggest it does. Perhaps mbh could run simulations using the mean deviation rather than the variance to see what happens.

 Robert Durran 13 Feb 2022
In reply to ablackett:

> Now, if this is right, my intuition is telling me that for Robert Duran’s measure of Average Deviation which he described up thread, the n-1 wouldn’t be needed as it doesn’t have the squaring problem which gives additional weighting to extreme values.

> I’ll find some time to check this tomorrow unless anyone knows the answer now?

As I said, wintertree's argument doesn't work for the mean deviation (the minimal sum of deviations does not necessarily occur when the deviations are from the mean), but I think it should work if we use "median deviation" where the deviations are measured from the median (it is easy to show that the sum of deviations is minimised when they are measured from the median). But seankenny's argument does not work because the squaring is crucial to get the disproportionate contributions for values with larger deviations.

I really would like to see mbh run simulations for the mean deviation (I'm not sure what would happen) and for "median deviation" where I think wintertree's argument means that samples would on average give an underestimate.
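(In case no one gets round to it, here is a sketch of what such a simulation might look like - toy Python, all parameters illustrative: normal population, sample size 5, for which the population mean deviation is σ√(2/π) ≈ 0.798:)

```python
import math
import random

random.seed(1)

SIGMA = 1.0
pop_md = SIGMA * math.sqrt(2 / math.pi)    # population mean deviation of N(0, σ²)

N, TRIALS = 5, 20000
total = 0.0
for _ in range(TRIALS):
    xs = [random.gauss(0.0, SIGMA) for _ in range(N)]
    m = sum(xs) / N
    total += sum(abs(x - m) for x in xs) / N   # mean deviation about the sample mean

sample_md = total / TRIALS
print(pop_md, sample_md)   # ≈ 0.798 vs ≈ 0.714: the sample still underestimates
```

So, at least for a normal population, the underestimation does not appear to be an artefact of the squaring: the mean deviation measured about the sample mean comes out biased low too.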

 seankenny 13 Feb 2022
In reply to Robert Durran:

> But consider the even simpler case of removing just one value rather than a pair. Certainly removing a value near the mean increases the variance by less than removing a value a long way from the mean decreases it (due to the disproportionate effect of the squaring of larger deviations). However, there will be more numbers whose removal increases the variance than there will be whose removal decreases the variance (the mean of the squares of the numbers 1,2,3,4,5 is bigger than the square of 3), so I am not convinced that it is clear that removing a number on average decreases the variance.

My explanation was just to give some intuitive sense of why the result holds. It’s clear, if you do a few of these calculations (which schoolkids could easily do) that the result is finely balanced. My argument was much simpler and coarser than wintertree’s, but it’s just to give a sense of why the result could be what it is. If one wants to see exactly how it works, one does the proof! Which shows exactly the size of the bias in the estimator and how it changes with the sample size. 

> Both your argument and wintertree's depend crucially on the squaring of the deviations,

Well that’s from the definition of variance! “Your argument about the implications of X depend crucially upon the properties of X” and I’m sitting here thinking “no shit!” I only know this stuff because it’s a tool to answer cool economics questions. Going off piste to think about different ways of measuring variance simply isn’t as interesting to me as using these basic (but still deceptively difficult) ideas as part of the toolset to tackle fascinating social science problems… 

In reply to John Stainforth:

> I mean that a Gaussian distribution is just a description of a probability distribution, and in a similar way entropy is just a measure of thermodynamic probability; neither entropy nor the Gaussian distribution are fundamental.

In the case of 'entropy' in information theory the equation starts from an alphabet and the probability of seeing one symbol given that you just saw another one.  It's not starting from physics.

The equation they get looks like one from thermodynamics so they decided to call it 'entropy' by analogy but there isn't any physics behind the derivation.

OP ablackett 14 Feb 2022
In reply to Robert Durran:

> But consider the even simpler case of removing just one value rather than a pair. Certainly removing a value near the mean increases the variance by less than removing a value a long way from the mean decreases it (due to the disproportionate effect of the squaring of larger deviations). However, there will be more numbers whose removal increases the variance than there will be whose removal decreases the variance (the mean of the squares of the numbers 1,2,3,4,5 is bigger than the square of 3), so I am not convinced that it is clear that removing a number on average decreases the variance.

yes, I realised that this morning.

 wintertree 14 Feb 2022
In reply to Robert Durran:

> As I said, wintertree's argument doesn't work for the mean deviation (the minimal sum of deviations does not necessarily occur when the deviations are from the mean) 

Two points:

1. I am confused.  The discussion was about the standard deviation, and how using the sample mean instead of the population mean biasses a measure low.  The standard deviation is calculated using square deviations and not deviations.

2. The minimal deviation does occur for the mean.  Write out the total deviation from a point Q, set it equal to zero (the minimal value), rearrange and.... Q is the mean.

 wintertree 14 Feb 2022
In reply to ablackett:

> Now, if this is right, my intuition is telling me that for Robert Duran’s measure of Average Deviation which he described up thread, the n-1 wouldn’t be needed as it doesn’t have the squaring problem which gives additional weighting to extreme values.

No, it's a far more fundamental correction that comes down to a biassing error introduced by using the sample mean not the population mean.  The sample mean becomes a more accurate measure with more samples.  The error in the sample mean always reduces the measured variances.

Measuring any kind of deviation from a mean biasses the result down when the sample mean is used instead of the population mean.  There is a sampling error that moves the sample mean away from the population mean, and the sample mean always gives the lowest mean deviation (*) and mean square deviation compared to the population mean.  It does this because it provably has the lowest mean square deviation (and deviation) of any possible point, meaning all other points including the population mean would give a larger measure. So, if the erroneous value is always the lowest, the real value is always higher.

(*) Robert's assertion that this is not the case is wrong.  This should be a very intuitive picture about what the mean means.  Think  about a physical representation such as the centre of mass of some objects for example...

Post edited at 09:30
 wintertree 14 Feb 2022
In reply to seankenny:

> > Both your argument and wintertree's depend crucially on the squaring of the deviations,

> Well that’s from the definition of variance! “Your argument about the implications of X depend crucially upon the properties of X” and I’m sitting here thinking “no shit!”

Indeed.  Some clarity has been lost somewhere...

 Dave Garnett 14 Feb 2022
In reply to wintertree:

> So, using the population mean, making N=1 sampling runs, I could make very poor estimates of σ, but given enough repeats those estimates would tend to the correct population value.  But using the sample mean, I would measure σ=0 because all the variance is absorbed by the sample mean moving about to minimise the ΣSD.  

This thread precisely illustrates why I became a biologist.  And why I had to do the 'stats for dummies/biologists' course at two universities. 

I can't even work out how to insert Greek characters into posts, which might be useful when I feel the need to refer to protein complex subunits, or gamma/delta T-cells...

 Robert Durran 14 Feb 2022
In reply to seankenny:

> My explanation was just to give some intuitive sense of why the result holds.

Yes, but, as I explained, I am not convinced it does.

> It’s clear, if you do a few of these calculations (which schoolkids could easily do) that the result is finely balanced.

Yes, I think that is a problem. You talked about a sample removing 2 data points from the population and I simplified that to one. The n-1 correction is much the most significant for small samples - replacing dividing by n with n-1 when n=2 doubles your answer, so I think it makes most sense to seek an intuitive sense of what is going on with small samples. Obviously a sample size of 1 always gives a variance of zero which is not helpful, but it might be better looking for an insight with n=2 or 3. However, even with n=2 and the large effect of n-1 I am struggling!

> Well that’s from the definition of variance! “Your argument about the implications of X depend crucially upon the properties of X” and I’m sitting here thinking “no shit!”.

Yes, but there are other measures of spread than variance and I was wondering whether other sensible measures of spread also have the property that the sample value on average underestimates the population value (is it a sort of universal property?). With mean deviation (not squaring the deviations and then adding but simply adding their absolute values), which is a simpler and very intuitive way of measuring spread (the squaring of deviations for variance is only a convenient algebraic device for getting rid of negatives), obviously any argument that depends on the squaring is lost. Most importantly, the property that the sum of the absolute deviations from Q is minimised when Q is the mean is lost, so wintertree's argument which depended on this fact does not work for mean deviation. If the absolute deviations are measured from the median it is not lost. So wintertree's argument should still work for a measure of spread which is the mean of the absolute deviations from the median. But, because it doesn't for the mean deviation, I do not know whether sample underestimation of the mean deviation occurs (as I said earlier, a proof either way seems to be problematical due to the algebraic difficulties caused by the use of absolute values to get rid of negatives).

> I only know this stuff because it’s a tool to answer cool economics questions. Going off piste to think about different ways of measuring variance simply isn’t as interesting to me as using these basic (but still deceptively difficult) ideas as part of the toolset to tackle fascinating social science problems… 

Have you ever considered that the standard adoption of variance as a measure of spread (adopted for the convenience of the algebra of its mathematics) might be less than ideal because it amplifies the contributions to the variance of outliers? Using the mean deviation instead avoids this. There are probably situations where it would actually make sense to downplay the influence of outliers - one could do this not by squaring but by adding the square roots of the absolute values of the deviations. There is nothing special about variance other than its mathematical convenience; I would have thought that people doing stats in the real world rather than playing around with algebra should be a bit unhappy about this!
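A toy illustration of the outlier point (hypothetical numbers): look at what share of each measure of spread a single outlier contributes.

```python
data = [1, 2, 3, 4, 50]
m = sum(data) / len(data)               # mean = 12

sq_devs  = [(x - m) ** 2 for x in data]
abs_devs = [abs(x - m) for x in data]

share_variance = sq_devs[-1] / sum(sq_devs)     # outlier's share of Σ(squared deviations)
share_mean_dev = abs_devs[-1] / sum(abs_devs)   # outlier's share of Σ|deviations|

print(share_variance, share_mean_dev)
# ≈ 0.80 vs 0.50: squaring hands the outlier 80% of the variance but only half the mean deviation
```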

 Robert Durran 14 Feb 2022
In reply to wintertree:

> 1. I am confused.  The discussion was about the standard deviation, and how using the sample mean instead of the population mean biasses a measure low.  The standard deviation is calculated using square deviations and not deviations.

See my last post. I am wondering whether the property of samples underestimating a population measure of spread is more universal. If it is more universal, then there ought to be an intuitive argument which explains why this happens (specifically one which does not depend on squaring deviations and in particular on the least squares result).

> 2. The minimal deviation does occur for the mean.  Write out the total deviation from a point Q, set it equal to zero (the minimal value), rearrange and.... Q is the mean.

But not the sum of the absolute values of the deviations (as summed when calculating mean deviation) eg the data set 1,2,3,4,5,27 has mean 7. The sum of the absolute deviations from this mean is clearly larger than the sum of the absolute deviations from 6. It is interesting that the effect will be less pronounced for more symmetrical distributions when the mean will be not far from the median (the sum of the absolute deviations from the median is minimal).
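That example is quick to check numerically. A small sketch (the helper name is mine, just for illustration):

```python
# Robert's example: the sum of absolute deviations is NOT minimised at the mean.
data = [1, 2, 3, 4, 5, 27]
mean = sum(data) / len(data)  # 7.0

def sum_abs_dev(xs, q):
    """Sum of the absolute deviations of xs from the point q."""
    return sum(abs(x - q) for x in xs)

print(sum_abs_dev(data, mean))  # 40.0 - measured from the mean
print(sum_abs_dev(data, 6))     # 36.0 - smaller, from a point nearer the median
print(sum_abs_dev(data, 3.5))   # 30.0 - smallest of the three, from the median
```

The median (3.5 here) does minimise the sum of absolute deviations, which is the point made above.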

Post edited at 18:19
 Robert Durran 14 Feb 2022
In reply to wintertree:

> No, it's a far more fundamental correction that comes down to a biassing error introduced by using the sample mean not the population mean. 

So are you claiming that a sample will always, on average, underestimate a population measure of spread (not necessarily the variance, and in particular, since I have brought it up, the mean deviation)?

> Measuring any kind of deviation from a mean biasses the result down when the sample mean is used instead of the population mean.  There is a sampling error that moves the sample mean away from the population mean, and the sample mean always gives the lowest mean deviation (*) 

> (*) Robert's assertion that this is not the case is wrong.

See my example at the end of my last post.

Or are we talking at cross purposes?

 mbh 14 Feb 2022
In reply to Robert Durran:

>There is nothing special about variance other than its mathematical convenience;

I doubt that. There is a linear algebra justification for variance that eludes me for the moment. Perhaps a maths graduate on here can enlighten us? Something to do with L2 (Euclidean) rather than L1 (Manhattan) distances.

Besides that, one thing is that outliers in your data are very bad for the power of a test, which is its ability to detect an effect if there really is one. Using variance rather than absolute deviation makes your tests (I think; I haven't done the algebra right here) more sensitive to outliers, so you can view the preference for its use as an aversion to outliers.

Doing so is a conservative approach that would rather not make the mistake of seeing an effect where none exists, preferring to make the mistake of seeing none where one does exist. Sometimes that is the better thing to do, sometimes it is not.

Whether a choice of variance over absolute deviation is a better way to nuance this choice is moot however. You can just alter the significance level to suit.

To the OP: what do think to my comment at 8:48 last night which addressed your original question?

Post edited at 19:03
 Richard J 14 Feb 2022
In reply to mbh and Robert:

> >There is nothing special about variance other than its mathematical convenience;

> I doubt that. There is a linear algebra justification for variance that eludes me for the moment. 

For data fitting, I think the variance is the right creature to use if you're confident your measurements are drawn from a Gaussian distribution (that's the underlying assumption in classical least squares fitting procedures, where you can show, given that assumption, that the least squares fit does have the maximum likelihood).  

As Robert says, if you want your results to be robust to outliers, you can use the mean absolute deviation.  Outliers of course are important if your measurements are drawn from distributions with fatter tails than a Gaussian (i.e. where larger deviations are more likely).  For example a Lorentzian distribution doesn't even have a well-defined variance at all - if you do the integral it diverges.  

There's a whole field of "robust estimation" devoted to deciding what measure of the width of a distribution to use for various assumptions about the shape of the distribution.  For what it's worth, it turns out that the absolute deviation is actually the rigorously correct one to use if the distribution your measurements are drawn from is a double exponential.  

In summary, there is something special about the variance, but only when you are confident that all your distributions are Gaussian.
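The Lorentzian point is easy to demonstrate: the sample variance of Cauchy/Lorentzian draws never settles down, while a median-based measure of width does. A rough sketch (the names and the choice of a median-based width are mine, just for illustration):

```python
import random

def median(xs):
    s = sorted(xs)
    k = len(s) // 2
    return s[k] if len(s) % 2 else (s[k - 1] + s[k]) / 2

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def median_abs_dev(xs):
    """Median of the absolute deviations from the median - a robust width."""
    m = median(xs)
    return median([abs(x - m) for x in xs])

rng = random.Random(0)
# A standard Cauchy (Lorentzian) is the ratio of two independent standard normals.
draws = [rng.gauss(0, 1) / rng.gauss(0, 1) for _ in range(100000)]

for n in (1000, 10000, 100000):
    print(n, sample_variance(draws[:n]), median_abs_dev(draws[:n]))
# The variance column tends to keep jumping as n grows rather than converging
# (the integral diverges); the median-based column settles near 1, the exact
# value for a standard Cauchy.
```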

 seankenny 14 Feb 2022
In reply to Robert Durran:

> Yes, but, as I explained, I am not convinced it does.

I'm getting the feeling that what you're looking for is not a rough intuition but a proof in words.

> Yes, I think that is a problem. You talked about a sample removing 2 data points from the population and I simplified that to one. The n-1 correction is much the most significant for small samples - replacing dividing by n with n-1 when n=2 doubles your answer, so I think it makes most sense to seek an intuitive sense of what is going on with small samples. Obviously a sample size of 1 always gives a variance of zero which is not helpful, but it might be better looking for an insight with n=2 or 3. However, even with n=2 and the large effect of n-1 I am struggling!

I think you might have misread my method: it's about making a sample size n=8 by lopping off the highest and lowest values. Like I say, it's just a simple way to illustrate to kids where this idea comes from.

As an aside, surely a sample of size one gives an undefined variance (as opposed to a population of size one which gives a variance of zero). This makes intuitive as well as mathematical sense: if I pull one sample out of a population I simply don't have the information necessary to estimate the variance at all.

> Have you ever considered that the standard adoption of variance as a measure of spread (adopted for the convenience of the algebra of its mathematics) might be less than ideal because it amplifies the contributions to the variance of outliers?

No, us social scientists are such thickies that none of us have ever considered it.

>>rolls eyes

> There is nothing special about variance other than its mathematical convenience; I would have thought that people doing stats in the real world rather than playing around with algebra should be a bit unhappy about this!

"There is nothing special about variance, aside from the fact you can use it a lot. I would have thought that people who want to use it a lot should be unhappy about this!"

 Robert Durran 14 Feb 2022
In reply to seankenny:

> I'm getting the feeling that what you're looking for is not a rough intuition but a proof in words.

No, definitely looking for rough intuition - something that would convince a class of 15 year olds with limited algebra. I think that wintertree's explanation was more like a proof in words (just establishing the inequality, not the exact n-1 correction). To follow it I had to do the algebra and then work back to it!

> As an aside, surely a sample of size one gives an undefined variance.

Yes, if you decide to try to divide by n-1=0 (which I was looking at as a sort of failed attempt to correct the zero when dividing by n=1).

Post edited at 19:25
 Robert Durran 14 Feb 2022
In reply to Richard J:

> In summary, there is something special about the variance, but only when you are confident that all your distributions are Gaussian.

So I still want to know which is Gaussian: heights, volumes or surface areas of elephants (at school level this is my sort of currency!)

 Robert Durran 14 Feb 2022
In reply to mbh:

> I doubt that. There is a linear algebra justification for variance that eludes me for the moment. Perhaps a maths graduate on here can enlighten us? Something to do with L2 (Euclidean) rather than L1 (Manhattan) distances.

Thanks, I might look that up.

Any chance you could run your simulation using mean deviation? I would be very interested to know whether sampling tends to underestimate it.

 wintertree 14 Feb 2022
In reply to Robert Durran:

> Then there ought to be an intuitive argument which explains why this happens 

>  (specifically one which does not depend on squaring deviations and in particular on the least squares result).

I thought I made it clear that variance is under-estimated (absent an n-1 correction factor) for a mean distance as well as for a mean square distance.  

Ergo, I have already made it clear that this is not specific to the square distance.  I have also spelt out, in a reply to someone else, that I have shown your suggestion that this is an aspect of square deviations to be false.

In both instances, this is due a bias introduced by the use of the sample mean.

The sample mean is closer to the samples than is the population mean.  I have an intuitive picture of this - a little mental animation of a sample mean connected to the samples by little stretched springs.  If I put the sample mean on the population mean and let go, the springs contract and whip it off in to the middle of the samples and get shorter in the process.  Because the samples are never perfectly distributed about the population mean.  The more samples there are, the better the approximation.  That is the intuitive argument.  The more samples, the better the estimate of the mean.  The better the estimate of the mean, the less the measure low-balls reality.
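The springs picture is easy to check numerically. A quick sketch (names are mine; it assumes a standard normal population, so the population mean is 0 and the variance is 1):

```python
import random

def mean_sq_dev(xs, q):
    """Mean squared deviation of xs from the point q."""
    return sum((x - q) ** 2 for x in xs) / len(xs)

def simulate(n, repeats=20000, seed=1):
    """Average the mean square deviation measured from the sample mean
    and from the true population mean (0 for a standard normal)."""
    rng = random.Random(seed)
    from_sample, from_pop = 0.0, 0.0
    for _ in range(repeats):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        xbar = sum(xs) / n
        from_sample += mean_sq_dev(xs, xbar)  # springs pulled tight: biased low
        from_pop += mean_sq_dev(xs, 0.0)      # unbiased
    return from_sample / repeats, from_pop / repeats

print(simulate(5))  # roughly (0.8, 1.0): measuring from the sample mean
                    # low-balls the true variance by a factor of (n-1)/n
```

The sample mean always sits closer to the samples than the population mean does, which is exactly why the first number comes out low.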

> But not the sum of the absolute values of the deviations (as summed when calculating mean deviation

I reel in horror at this. My snap judgement is that absolute deviations are an abomination compared to square deviations.  Searching my soul for why:

  • Ugly mathematical properties - absolute deviations not continuously differentiable
  • Lack of generality - absolute deviations apply only in 1-dimensional space but square distances apply in n-dimensional space. 

So, this measure is not going to be so amenable to proofs.  I'll call it the MAD for Mean Absolute Difference. 

We don't need a simulation to understand how this measure will behave without any "n-1 correction".  Start at the limits for n= (sample numbers):

  • n= 1 - the MAD still low-balls  because all the absolute deviation of the sample from the population mean is absorbed in to the calculated sample mean.  The deviation from the mean is always 0.  This is the exact same argument I used for mean square distance and for mean distance.  Two pieces of information are needed for an n=1 measure, the sample and the population mean.  By using a sample mean instead, the DOFs take their share of the MAD.  All of it - as if there were n-1 = 0 samples involved.  
    • This will happen for any measure of deviation from the mean you can conceive.
  • n = lots - the MAD tends to the population estimate
    • This will happen for any measure of deviation from the mean you can conceive.

Between the limits - we expect some sort of sane behaviour, so the importance of sample number is clearly going to fade away as n rises.  Again.  Just like the other measures.  I think the correction factor might be slightly different, due to the abominable qualities of the MAD. 

The abominable properties of this measure might allow one to find a position that gives a smaller MAD than the sample mean, but it does not - statistically speaking - prevent the sample mean from producing a lower MAD than the population mean.  I put a plot in below of the average MAD measured over each of 5000 repeats for various sample sizes (n=) using both the sample mean (black) and population mean (red).   

(Although I find it mathematically ugly, the close relation the Sum Absolute Difference or SAD does have a big real world benefit of underlying lots of motion based video compression and so having good SIMD support in CPUs, such as _mm256_mpsadbw_epu8)

>  I am wondering whether the property of samples underestimating a population measure of spread is more universal

There is a universal property you seek, and it's that the sample mean absorbs some of the dispersion of the datapoints from the population mean.  Because the sample mean falls in the middle of the samples, and not on the population mean.  Smaller samples => further from the population mean (on average) => more spread absorbed.

It's not the samples themselves that underestimate the dispersion, it's the error introduced by estimating the population mean from the same samples.

Edit, sorry missed your second post...

> So are you claiming that a sample will always, on average, underestimate a population measure of spread (not necessarily the variance and in particular, since I have brougt it up, the mean deviation?

Indeed.  A measurement of deviation from the mean using samples and a mean derived from those samples will (statistically speaking) under-estimate any measure of spread, with a bias that decreases with increasing sample size.  This comes down to the limits of n=0 and n=lots. The calculation of a sample mean from the samples is where this happens.  For n=0, all deviation is lost, as the sample mean is the value of the sole sample.  As you add more and more samples, the sample mean gets closer and closer to the population mean.  It's down to constraints and degrees of freedom, rather than anything to do with outliers or the specifics of the chosen measure of spread.

One sample's worth of distance and one sample's worth of square distance are lost in calculation of the sum of distances and sum of square distances by looking at distances from the sample mean not the population mean, hence dividing by n-1 and not n when calculating the mean of these.  The limiting cases of n=1 and n=lots are trivial.

Post edited at 20:04

 Richard J 14 Feb 2022
In reply to Robert Durran:

> So I still want to know which is Gaussian: heights, volumes or surface areas of elephants (at school level this is my sort of currency!)

You're the mathematician, I'm only a grubby experimental physicist!  So my answer's going to be, you can't know until you make the measurements.

 mbh 14 Feb 2022
In reply to Robert Durran:

For that you have to recall the CLT, that a Gaussian will eventually result from the combined effect of many independent random variables, and ask a biologist what those variables are and what they would most directly affect: height, or volume. Probably the latter, I would guess, in spherical-ish things like elephants, and probably height in linear-ish things like people. 
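The CLT part is quick to demonstrate: the sum of even a modest number of independent uniform variables is already close to normal. A small sketch (names are mine):

```python
import random

rng = random.Random(2)

def summed_uniforms(k):
    """Sum of k independent U(0,1) draws: mean k/2, variance k/12."""
    return sum(rng.random() for _ in range(k))

samples = [summed_uniforms(12) for _ in range(50000)]
m = sum(samples) / len(samples)
v = sum((s - m) ** 2 for s in samples) / (len(samples) - 1)
print(m, v)  # close to 6 and 1; a histogram of `samples` looks very Gaussian
```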

 wintertree 14 Feb 2022
In reply to Richard J:

>  I'm only a grubby experimental physicist! 

I call shenanigans!  

Experimental physicists hopefully have a much deeper understanding of statistics than theoretical physicists.  Statistically speaking, of course...  Sometimes they even put errorbars on their graphs...

 wintertree 14 Feb 2022
In reply to wintertree:

Bugger, penultimate paragraph should be n=1 and not n=0….

 Robert Durran 14 Feb 2022
In reply to wintertree:

> I thought I made it clear that variance is under-estimated (absent an n-1 correction factor) for a mean distance as well as for a mean square distance.  

Ok, but with mean absolute deviation there will be some samples for which the sum of the absolute deviations from the sample mean is greater than the sum of the absolute deviations from the population mean, because, unlike with squared deviations, the sum is not minimised at the sample mean. So I just have to convince myself that this is not a significant factor (which seems, on first thoughts, very reasonable).

> I reel in horror at this. My snap judgement is that absolute deviations are an abomination compared to square deviations.  Searching my soul for why:

> Ugly mathematical properties.

> So, this measure is not going to be so amenable to proofs. 

Yes, like I said, mathematically inconvenient at anything other than a trivial level. But I am coming from a position of teaching mathematics at a trivial level, where, for instance, for many pupils, the fact that there is a difference between summing then squaring and squaring then summing is a source of problems which often obscures the point of what we are doing (trying to measure spread) - mean absolute deviation would be so much more straightforward!

 Robert Durran 14 Feb 2022
In reply to wintertree:

> Bugger, penultimate paragraph should be n=1 and not n=0….

Yes, I had gathered that!

 wintertree 14 Feb 2022
In reply to Robert Durran:

> Ok, but with mean absolute deviation 

Indeed, so the elegant proof that the mean deviation and mean square deviation must be minimal from the sample means fails for absolute deviations, but the point I was addressing was that you were claiming "specifically one which does not depend on squaring deviations" when I had already given you a proof for one which did not depend on squaring deviations.

If you want more proof that it's not about outliers or squaring the distance, but fundamentally about the sample mean being "too close" to the samples compared to the population mean, consider a distribution consisting of the two values -1 and +1 with equal weighting, and squaring or not squaring the distance from the population mean makes no difference.

Using the sample mean always under-estimates the distance - both absolute and square - from the population mean.   To see this, start at the n=1 limit and work up in n.

  • Trivial to see for n=1 that the sample mean fully eliminates all distance from the samples.  
  • For n=2, consider the four possible sets of samples (-1, -1), (-1, +1), (+1, -1), (+1,+1).  In two of these cases the distance to the samples is measured as 0, in two of them it is measured correctly.   So, better than n=1 but still not great.  On average, you get half the displacement, so that's corrected by applying a factor of 2 - equivalent to dividing by (n-1 = 1) instead of (n=2) in the average of the measure.
  • You can work the next few on paper to see how the accuracy improves with n...
    • Indeed, working out the average biassing factor by measuring the error for each permutation of samples and propagating it through your function of choice (linear, absolute, square) you can determine the correction factor that gives the best measurement for each value of n=
    • Someone smarter than me must be able to do this formulaically...

Someone really smarter than me has almost assuredly done this fully generically applicable to all (symmetric?) distributions, but I doubt I could follow that proof easily.  Edit: Although I suppose the central limit theorem could come in to such a proof...

But there are trivial, A-level friendly proofs covered on here for quite a few pieces of the jigsaw.  Use of sample mean always under-estimating sum of actual/square displacement, n=1 delivering zero displacement for the sample mean for actual/absolute/square displacement, calculating the correction factor needed for a specific distribution for actual/absolute/square displacement when using the sample mean, which can be expressed as a function of n.  (We know it comes to n/(n-1) for the square displacement).
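The n=2 counting argument above can be enumerated directly, and extended to n=3 and 4. A sketch (function name is mine):

```python
from itertools import product

def enumerate_bias(n, values=(-1, 1)):
    """Average the /n and /(n-1) variance estimates over every equally
    likely ordered sample of size n from the two-point population
    {-1, +1}, whose true variance is 1."""
    biased = corrected = 0.0
    cases = list(product(values, repeat=n))
    for xs in cases:
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)  # sum of square deviations
        biased += ss / n
        corrected += ss / (n - 1)
    return biased / len(cases), corrected / len(cases)

for n in (2, 3, 4):
    print(n, enumerate_bias(n))
# n=2 gives (0.5, 1.0): dividing by n recovers exactly half the true
# variance, as argued above, and dividing by n-1 corrects it exactly.
# n=3 and n=4 give (2/3, 1.0) and (3/4, 1.0): the bias factor is (n-1)/n.
```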

Post edited at 20:53
 mbh 15 Feb 2022
In reply to Robert Durran:

> Any chance you could run your simulation using mean deviation? I would be very interested to know whether sampling tends to underestimate it.

I am not sure I understand what you want, but here are plots of the normalised absolute deviations from the sample mean, in each case for 1000 samples of different sizes N drawn from a population that is distributed as a standard normal, where normalisation is achieved by dividing the sum of the absolute deviations from the sample mean by N-1, N-0.5 and N. The sample mean itself is calculated in the usual way.

The population value of the mean absolute deviation, 0.8, is shown as a solid line. The mean values of the normalised sample absolute deviations are shown as a dashed line.

It seems that normalising by N underestimates the population value, while normalising by N-1 overestimates it. Doing so by N-0.5 appears to give an unbiased estimator. I don't know why. 

I get the same results if the population is uniformly distributed on [0,1].
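For what it's worth, there is an exact result lurking behind the N-0.5 observation, at least for a normal population: each residual x_i - x̄ is itself normal with variance σ²(1 - 1/n), so E[Σ|x_i - x̄|] = σ√(2/π)·√(n(n-1)), and √(n(n-1)) is close to, but not exactly, N - 0.5. A sketch of the simulation (my own code, assuming a standard normal population, whose mean absolute deviation is √(2/π) ≈ 0.798):

```python
import math
import random

def avg_estimate(n, divisor, repeats=20000, seed=3):
    """Average of (sum of absolute deviations from the sample mean) / divisor
    over many samples of size n from a standard normal population."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        xbar = sum(xs) / n
        total += sum(abs(x - xbar) for x in xs) / divisor
    return total / repeats

n = 10
pop_mad = math.sqrt(2 / math.pi)  # population MAD of a standard normal
for divisor in (n, n - 1, n - 0.5, math.sqrt(n * (n - 1))):
    print(round(divisor, 4), avg_estimate(n, divisor))
# Dividing by n under-estimates 0.798..., dividing by n-1 over-estimates it,
# and dividing by sqrt(n(n-1)) = 9.4868... (close to n - 0.5) lands on it.
```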


 Robert Durran 15 Feb 2022
In reply to mbh:

Thanks. I think that is what I was looking for. So it confirms that using N does underestimate. Interesting about N-0.5. I wonder whether that is an exact provable thing.

OP ablackett 15 Feb 2022
In reply to Robert Durran:

If it is, I wonder if it is known? Perhaps the UKC collective could have their first academic maths paper?

 Robert Durran 15 Feb 2022
In reply to ablackett:

> If it is, I wonder if it is known? Perhaps the UKC collective could have their first academic maths paper?

I'm not sure how you can lose half a degree of freedom (explaining things in terms of degrees of freedom still feels like witchcraft to me anyway!). Maybe wintertree would have an idea what it might mean.

 wintertree 15 Feb 2022
In reply to Robert Durran:

You throw half of it away when you take the absolute.

(Edit: I made that up whilst my mind is on the base gravy I’m making for a chilli experiment, but it sounds well clever like…)

Post edited at 18:08
In reply to Robert Durran:

Has anyone seen the n-1/2 in the textbooks (I haven't)?

 mbh 15 Feb 2022
In reply to John Stainforth:

Neither have I. Despite my checking, it may be an error. If anyone else wants to check by simulation or calculation, feel free.

 mbh 15 Feb 2022
In reply to thread:

Do any of you understand the Estimator paragraph at the bottom of this Wikipedia page on Average Absolute deviation?

https://en.wikipedia.org/wiki/Average_absolute_deviation

In particular, the sentence:

"... The average of all the sample absolute deviations about the mean of size 3 that can be drawn from the population is 44/81, while the average of all the sample absolute deviations about the median is 4/9. (Both of these numbers are not far from 1/2.)"

OP ablackett 15 Feb 2022
In reply to mbh:

I think that’s saying that if you draw all possible samples of size 3 with replacement from the population {1,2,3} the average of absolute deviations from the mean is 44/81.

OP ablackett 15 Feb 2022
In reply to ablackett:

(3 * 44/81)/2.5 is very close to 2/3, which is the mean absolute deviation of the population from its mean.

OP ablackett 15 Feb 2022
In reply to ablackett:

> I think that’s saying that if you draw all possible samples of size 3 with replacement from the population {1,2,3} the mean of all absolute deviations from the mean is 44/81.

Yes, I've just checked this and it is right.

Just to clarify my previous point, if you take all 27 absolute deviations from the mean and divide by 2.5 rather than 3, then average them you get 0.65185, which is very close to the population average deviation from the mean of 2/3.
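The 44/81 figure, and the 0.65185 above, can be verified by brute force (a quick sketch of mine, using Fraction to keep the arithmetic exact):

```python
from fractions import Fraction
from itertools import product

population = (1, 2, 3)
samples = list(product(population, repeat=3))  # all 27 ordered samples

def sample_mad(xs):
    """Mean absolute deviation of the sample about its own mean."""
    xbar = Fraction(sum(xs), len(xs))
    return sum(abs(Fraction(x) - xbar) for x in xs) / len(xs)

avg_mad = sum(sample_mad(xs) for xs in samples) / len(samples)
print(avg_mad)  # 44/81, matching the Wikipedia figure

# The variant above: divide each of the 27 sums of absolute deviations
# by 2.5 rather than 3, then average.
avg_over_2_5 = sum(sample_mad(xs) * 3 / Fraction(5, 2) for xs in samples) / 27
print(float(avg_over_2_5))  # 0.65185..., close to the population MAD of 2/3
```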


 

 Robert Durran 15 Feb 2022
In reply to ablackett:

> Just to clarify my previous point, if you take all 27 absolute deviations from the mean and divide by 2.5 rather than 3, then average them you get 0.65185, which is very close to the population average deviation from the mean of 2/3.

So there is not an exact n-½ rule. I don't think I find this surprising given the awkwardness of absolute deviations. Maybe some sort of approximate rule.


 mbh 16 Feb 2022
In reply to ablackett:

I have played around with various populations and sample sizes. I can't yet work out quite what is going on, but dividing by N consistently underestimates the population absolute deviation, dividing by N-1 over-estimates it, and the x you need to subtract to get it dead on varies with sample size and is not exactly 1/2. Research adjourned...(I have a day job!)

 wintertree 16 Feb 2022
In reply to mbh:

You’re doing the MAD from a position which is not central - the arithmetic mean does not minimise the MAD.  Find the central value that minimises the MAD and use that, then see what the correction is.

 Robert Durran 16 Feb 2022
In reply to wintertree:

> You’re doing the MAD from a position which is not central - the arithmetic mean does not minimise the MAD.  Find the central value that minimises the MAD and use that, then see what the correction is.

Isn't the central position which minimises the MAD the median? So it might make more sense to look for something significant using the median. This will be the same as the mean for symmetrical populations and the population {1,2,3} used as an example above is symmetrical. 

 mbh 16 Feb 2022
In reply to wintertree:

I know. Haven't yet got round to trying anything else.

 Robert Durran 16 Feb 2022
In reply to ablackett:

> I think that’s saying that if you draw all possible samples of size 3 with replacement from the population {1,2,3} the average of absolute deviations from the mean is 44/81.

Obviously without replacement there is only one possible sample of 3 (the whole population) and so, trivially, the average sample MAD will be equal to the population MAD (likewise average sample variance and population variance). It is only the repeated values allowed by replacement which brings the averages down. I commented on this much earlier in the thread and speculated that it might be significant in explaining the underestimates. I now don't think it is in general but it is interesting that it does seem to be crucial in pulling down the sample averages of MAD and variance for small populations. 

 wintertree 16 Feb 2022
In reply to Dave Garnett:

> I can't even work out how to insert Greek characters into posts

The same way I'm inserting them in to my computer code this morning...

Step 1 - google "Sigma"
Step 2 - google "little Sigma"
Step 3 - select the symbol in the top of the google result and paste it in to my code.

I should probably get an app for it, or at least make a text file with the ones I regularly use  to hand.  I particularly like the idea of a little Arduino project that has a touchscreen display of symbols and presents as a USB keyboard to the computer.  The call of procrastination...

cb294 16 Feb 2022
In reply to wintertree:
 

> Sometimes they even put errorbars on their graphs...

I simply love it when my students choose the SEM rather than the SD for decorating their graphs, because "it makes the error bars nice and short".

I am not joking, this was an actual answer to my question what the error bars on their home assignments should indicate!

It is also a depressingly common reason for me rejecting manuscripts or, at least, demanding major revision.

Anyway, DEATH TO BAR GRAPHS! Beeswarm or box, whisker, and outlier plots are almost always so much better for honestly displaying your data.

CB

 wintertree 16 Feb 2022
In reply to cb294:

Funnily enough, today I am making some measurements from data across a statistical axis, and I'm plotting the measurements for my own ends.  I'm doing this with plots like the one below, with a box and whisker plot, an SD errorbar (black) and an SE errorbar (blue).

I find this a useful way of thinking about things - obviously wouldn't put something like this in to a paper.  Yes,  I realise there're no axis labels or legends, purposefully cropped out for UKC... 

I should probably be all over bee-swarms, but the random or structural (e.g. hex packing) fuzzing of y-axis values just feels so wrong, even if the plots look so right.

> choose the SEM rather than the SD for decorating their graphs

> It is also a depressingly common reason for me rejecting manuscripts or, at least, demand major revision.

Surely the problem is not the choice of the metric, but of what it is being used to show - a significance to the difference of the mean behaviour over the populations (SEM) or a significance to the difference between the statistically dispersed distributions of the populations (SD).  So long as the measure is appropriate for the question, and the question is relevant to the biology...

> I am not joking, this was an actual answer to my question what the error bars on their home assignments should indicate!

You have smart students.  One of ours asked a former colleague "What does that symbol mean?" in a lecture.  "Which symbol?"  "The two crossed lines between v and at".  It took several goes for the chap to step down to their level and figure out they meant the plus symbol.


cb294 17 Feb 2022
In reply to wintertree:

beeswarm vs. box and whisker really depends on the number of data points.

Columns with error bars representing mean +/- SD are only really meaningful for data somewhat smoothly distributed around a mean, whereas in biology very often we generate data with two or more peaks, which is really best shown by plotting the actual data points.

In the last 8 years or so my go-to experimental setup has generated data sets where in the controls I expect to get cell numbers of 0 or 1 in half the samples, and a distribution around 15ish cells in the others. The experimental question is then whether some manipulation shifts more samples to the 0 to 1 class. 

Yes I could calculate means and SD, but that would be just as useful as learning that the average human has slightly more than one testicle and just below one breast.....

CB

 wintertree 17 Feb 2022
In reply to cb294:

Totally agree that mean ± SD or SEM is not appropriate for the kind of distributions you're describing.  

> In the last 8 years or so my go-to experimental setup has generated data sets where in the controls I expect to get cell numbers [...]

I suppose one way of looking at it is that biologists doing sub-cellular stuff often exist before the central limit theorem comes in to play, and deal with the various underlying distributions.  I quite like that as a way of classifying life scientists - which side of the normal are they?  I can then smugly proclaim you're sub-normal and I'm super-normal.  Others may disagree...

> beeswarm vs. box and whisker really depends on the number of data points.

Yes, because if you have a dozen or so you can do a box plot, and if you have hundreds you can do a proper histogram instead of a beeswarm....

