Help me, stats people!

May 11, 2021

I have several small ordered sequences of data, each of about five to ten elements. For each of them, I want to calculate a metric which captures how much they vary along the sequence. I don’t want standard deviation, or anything like it, because that would consider the sequences 1 5 2 7 4 and 1 2 4 5 7 equally variable, whereas for my purposes the first of these is much more variable.

Here is a metric that I think does what I want, and will allow me to compare different sequences for variability-along-the-sequence.

For the n-1 pairs along the sequence of n elements, I take the difference (absolute value, so always positive) between elements i and i+1. Then I average all those differences. Then I divide the result by the average of the values themselves, to normalise for magnitude.

Some example calculations:

  • For the sequence 1 5 2 7 4, the differences are 4 3 5 3, for a total of 15 and an average of 3.75. The average of the values is (1+5+2+7+4)/5 = 19/5 = 3.8, which gives me a metric of 3.75/3.8 = 0.987.
  • For the sequence 1 2 4 5 7, the differences are 1 2 1 2, for a total of 6 and an average of 1.5. The average of the values is again 3.8, which gives me a metric of 1.5/3.8 = 0.395.
  • So the first sequence is 0.987/0.395 = 2.5 times as sequentially variable as the second sequence.
  • And for the sequence 10 20 40 50 70 (which is the same as the previous one, but all values ten times greater), the differences are 10 20 10 20, for a total of 60 and an average of 15. The average of the values is 38, which gives me a metric of 15/38 = 0.395, the same as before — which is as it should be.
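The calculation above can be sketched in a few lines of Python. The function name `sequential_variability` is made up for illustration, and the sketch assumes positive values so that the mean is never zero.

```python
def sequential_variability(values):
    """Mean absolute difference between neighbours, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    # Absolute differences between elements i and i+1 (n-1 of them).
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    # Normalise by the average of the values themselves.
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


for seq in ([1, 5, 2, 7, 4], [1, 2, 4, 5, 7], [10, 20, 40, 50, 70]):
    print(seq, round(sequential_variability(seq), 3))
```

This prints 0.987, 0.395 and 0.395 for the three example sequences.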

And now, my question! Does this metric, or something similar, already exist? If so, what is it called? Or if I should be using something else instead, what is it?

(It happens that my sequences are the aspect ratios of the cotyles of consecutive vertebrae, but that’s not important: whatever metric we land on should work for any sequences.)

Taylor 2015: Figure 8. Cervical vertebrae 4 (left) and 6 (right) of Giraffatitan brancai lectotype MB.R.2180 (previously HMN SI), in posterior view. Note the dramatically different aspect ratios of their cotyles, indicating that extensive and unpredictable crushing has taken place. Photographs by author.

13 Responses to “Help me, stats people!”

  1. Nathan Myers Says:

    I can’t tell you what they will call it, but it will be something to do with “first differences”. First difference is very often determinative in discrete sequences. Here, “first-difference magnitude”.

    Anyway that’s my story and I am sticking with it.

  2. Jake Says:

    The first part is ‘total variation’.

  3. Mike Taylor Says:

    That doesn’t sound right, Jake. According to this document:

    The total variation about a regression line is the sum of the squares of the differences between the y-value of each ordered pair and the mean of y.

    total variation = Σ(y−ȳ)²

    That takes no account of the order of the observations.

  4. Gerrick Says:

    A disclaimer: I read your blog because I love learning about dinosaurs, but I am not a paleontologist, or even a biologist. I’m going to approach this from my background as a chemical physicist.

    Also, I’ll say I really like your average sequential deviation idea, and it seems like that encapsulates what you’re after really well.

    But, if you wanted something that captures this a little differently, the way I’d usually approach understanding an ordered series like this is with correlation functions. In this, we assume there is a relationship between the data embedded within its sequence and we try to extract that information (or conversely you know the correlation and you can make predictions of subsequent data points). These are commonly done in time, so familiar examples would be predicting the stock markets based on some past indicators or predicting the value of a savings account given a certain interest rate. Anyway, a simple form of this would be

    C(t) = <d(0).d(t)> / d(0)^2

    Where C(t) is our correlation function, d(0) is an initial data point, and d(t) are subsequent data points. The variable t is the dimension of interest; time is common for such an analysis, so I used “t”, but it could be anything, spatial or an integer index of a vertebra. Finally the <...> indicates an average, and the product inside it is usually written as a dot product, “.”, since we could have a vector of input information. In your example, we just have scalar information, so this would be multiplication. The average arises because we can consider starting at any point in the set to maximize our use of the data (the origin can be placed anywhere in a lot of these types of problems). In your case, knowing that you’re starting at vertebra C1 is important vs C4 or whatever, but we could think more about that if this is actually useful.

    Further, in many cases in physics, biology, finance, …, the amount at some point depends on the amount at some previous point, which would be an exponential process. Therefore, the correlation function is often fit to a function of the form

    C(t)=C(0) exp( k t)

    Where k is a rate constant, or growth constant, and C(0) is 1 if the data is normalized. In this form, k is inversely related to how the population changes, so often k is expressed as its inverse so that larger means “faster” growth or vice-versa. k has units that are inverse of the units of t. The inverse k is typically called lifetime (same units as t) for time-dependent processes and given the symbol tau (spelled out in case Greek is handled in the comments), so tau=1/k. Note, sometimes you’ll see a negative sign in this if it’s explicitly a decay process, so C(t) is always decreasing as t increases. Note 2, exponential is common because of the prevalence of that functional dependence, but it need not be that and we can even look directly at the correlation function. In fact, I think trying to tease out a better functional form for your system would make more sense than the exponential assumption here.

    Now, what does this look like? For the first set of data:
    set1=(1, 5, 2, 7, 4 )
    The correlation function is then C1(t)=(1, 2.367857143, 1.8, 3.9, 4) and tau= 3.46740638

    And for the second set of data:
    set2=(1, 2, 4, 5, 7 )
    The correlation function is then C2(t)=(1, 1.6625, 2.75, 4.25, 7) and tau= 2.810567735

    So we see a larger value for the first data set, just like your (more elegant) method.
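One reading of the correlation function that reproduces the C(t) values quoted above is to average d(i+t)/d(i) over every possible starting point i. Here is a minimal Python sketch under that assumption. The function name is hypothetical, and the exponential fit for tau is not attempted.

```python
def correlation_function(data):
    """C(t) for t = 0..n-1, averaging d(i+t)/d(i) over all start points i.

    Assumption: d(i)*d(i+t)/d(i)^2 simplifies to d(i+t)/d(i), and the
    average runs over every origin i for which i+t is still in range.
    """
    n = len(data)
    result = []
    for t in range(n):
        ratios = [data[i + t] / data[i] for i in range(n - t)]
        result.append(sum(ratios) / len(ratios))
    return result


print([round(c, 4) for c in correlation_function([1, 5, 2, 7, 4])])
print([round(c, 4) for c in correlation_function([1, 2, 4, 5, 7])])
```

The two lines printed are [1.0, 2.3679, 1.8, 3.9, 4.0] and [1.0, 1.6625, 2.75, 4.25, 7.0], which agree with C1(t) and C2(t) as given.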

    Anyway, I thought this was interesting to think about.

  5. The slope. You’re talking about the slope.

    You’re asking how does some measurement (the numbers you’re giving) vary along some other value (the sequence they’re given in). That is, how does y vary along x.

    For monotonic data (y goes up when x goes up) like your second example (1 2 4 5 7), the normal linear slope is *exactly* the same as what you’re calculating (eg, 1,2,4,5,7 = y, 1,2,3,4,5 = x, gives a slope of 1.5, same as you find).

    For non-monotonic data (y does not always increase as x does), like your first example, other functional forms are what are needed.

    But basically it seems like you’re asking: as X (sequence) changes, how does Y (measurement) change. Which is a regression slope.
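As a quick check of the slope idea, here is a small Python sketch (the function name is hypothetical) that fits an ordinary least-squares slope of the values against their positions 1..n.

```python
def regression_slope(ys):
    """Least-squares slope of ys against positions 1, 2, ..., n."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two values")
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


print(round(regression_slope([1, 2, 4, 5, 7]), 3))
print(round(regression_slope([1, 5, 2, 7, 4]), 3))
```

It gives 1.5 for the monotonic sequence, matching its mean absolute difference, but 0.8 for 1 5 2 7 4, where the mean absolute difference is 3.75. So the two measures only coincide when the sequence does not zigzag.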

  6. A slope is a rate of change in y as a function of x. That is, how much does cotyle width change as sequence progresses. A first derivative then is what you’re calculating (functionally) with your differences.

    The issue is that no single value metric is ever going to perfectly capture all possible sequence variations. So in some pair of hypothetical sauropods with sequences of (7, 1, 1, 1, 1) and (5, 3, 4, 2, 1) both have an average difference of 1.5. The average values differ (2.2 vs 3) but you could play with the numbers to make them equal on both without being at all the same.

    So think of a wave line, going up and down. The amplitude then is the difference between the height of each crest/depth of each trough and the average value. You’re basically asking about the average |amplitude|. Which is fine, but I suspect that thinking of it explicitly as a regression is more useful.

  7. Oliver Says:

    I think you’re after the autocorrelation function (here the emphasis is on similarity rather than variability). You’re interested in the autocorrelation at a lag of 1, i.e. how much neighbouring points are correlated.

    Interestingly, if you look at your examples, your first sequence is _less_ variable than the second at a lag of 2, i.e. next-to-neighbouring points are actually more similar (there’s a slight periodicity). That would be a very interesting thing to detect in a vertebral sequence!
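For anyone who wants to try this, here is a minimal Python sketch of the lag-k sample autocorrelation. It uses one common estimator (deviations from the overall mean, normalised by the total sum of squares); other conventions exist, the function name is hypothetical, and with only five points the estimates are very noisy.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation at the given lag (one common convention)."""
    n = len(values)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(values) - 1")
    mean = sum(values) / n
    devs = [v - mean for v in values]
    denom = sum(d * d for d in devs)
    if denom == 0:
        raise ValueError("sequence is constant")
    # Products of deviations for points that are `lag` positions apart.
    numer = sum(devs[i] * devs[i + lag] for i in range(n - lag))
    return numer / denom


for seq in ([1, 5, 2, 7, 4], [1, 2, 4, 5, 7]):
    print(seq, [round(autocorrelation(seq, k), 3) for k in (1, 2)])
```

For 1 5 2 7 4 this gives about -0.467 at lag 1 and 0.374 at lag 2. For 1 2 4 5 7 it gives about 0.384 at lag 1 and -0.091 at lag 2, so the first sequence is indeed the more self-similar one at lag 2.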

  8. Hi Mike,

    It looks like you are using first-difference fluctuation or so-called “poor man’s” wavelet (later normalized by the average magnitude) applied to a single (smallest) scale. You can use it further than that – without normalization you can reveal the whole spectrum of fluctuations, not only at the smallest scale. For example finding differences between consecutive pairs of observations, triplets and so on. In this way you can detect variability not just at the smallest but at all scales.

    Enjoying your content,

    Andrej Spiridonov
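One possible reading of the multi-scale idea in this comment is to repeat the mean absolute difference at increasing separations (lag 1, lag 2, and so on), without normalising. The Python sketch below does that. Both the function name and the interpretation of "pairs, triplets and so on" as lags are assumptions, not necessarily what the commenter had in mind.

```python
def fluctuation_by_scale(values):
    """Mean absolute difference between points `lag` apart, for every lag.

    Assumption: "scale" is interpreted as the separation between the two
    observations being compared. Returns a dict mapping lag -> mean |diff|.
    """
    n = len(values)
    return {
        lag: sum(abs(values[i + lag] - values[i]) for i in range(n - lag)) / (n - lag)
        for lag in range(1, n)
    }


for seq in ([1, 5, 2, 7, 4], [1, 2, 4, 5, 7]):
    print(seq, {k: round(v, 3) for k, v in fluctuation_by_scale(seq).items()})
```

For 1 5 2 7 4 the fluctuation is largest at lag 1 (3.75) and drops to about 1.667 at lag 2, whereas for 1 2 4 5 7 it grows steadily with lag (1.5, 3.0, 4.5, 6.0), as expected for a trend.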

  9. Mike Taylor Says:

    Thank you all for really useful pointers on this metric I am trying to figure out!

  10. Kieran Says:

    What I would do is compute the RMSE of the first differences in the logged aspect ratio.

    Doing this gives you 1.152 for your first series, and 0.530 for the second.

    The advantage of using logs is that the measure becomes scale invariant – if you double all the ratios the measure is unaffected, and it is also unaffected by taking the inverse of the aspect ratios instead. And it also becomes more sensitive to small absolute changes that are large in proportion – i.e. from 0.1 to 0.2 will be taken as a dramatic and not small change.

    The advantage of using the RMSE is that it is more sensitive to abrupt changes than gradual changes, i.e. 1 1 3 1 1 is taken to be more variable than 1 2 3 2 1. In this case we have 0.777 for the former and 0.567 for the latter.

    Now if you take the exponential of the RMSE you will get a measure of the average of the magnitude of the percentage change in the aspect ratio. For your first series you will get e^1.152 = 317 %, for the second series, e^0.530 = 170 %.

    Let A(v) be the aspect ratio. Then you have

    RMSE = [{(Ln[A(v2)] – Ln[A(v1)])^2 + (Ln[A(v3)] – Ln[A(v2)])^2 + . . . + (Ln[A(vn)] – Ln[A(vn-1)])^2}/(n-1)]^0.5
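A minimal Python sketch of this measure follows. The function name is hypothetical, and the sum of squares is divided by the number of differences (one fewer than the number of values), since that is the divisor that reproduces the 1.152 and 0.530 quoted above. All inputs are assumed to be positive.

```python
import math


def log_difference_rms(ratios):
    """Root mean square of first differences of the logged values."""
    if len(ratios) < 2:
        raise ValueError("need at least two values")
    logs = [math.log(r) for r in ratios]
    # First differences of the logs, i.e. logs of the successive ratios.
    diffs = [b - a for a, b in zip(logs, logs[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


for seq in ([1, 5, 2, 7, 4], [1, 2, 4, 5, 7], [1, 1, 3, 1, 1], [1, 2, 3, 2, 1]):
    rms = log_difference_rms(seq)
    print(seq, round(rms, 3), round(math.exp(rms), 2))
```

The first three sequences give 1.152, 0.530 and 0.777. The last comes out at 0.5678, which is the 0.567 quoted above when truncated. Doubling every value, or inverting every value, leaves the result unchanged.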

  11. Mike Taylor Says:

    Thanks, Kieran, your thoughts are appreciated.

  12. […] amniotes. I came up with a way of calculating this, but wondered if it already existed. In my post Help me, stats people! I asked if anyone knew of it, but it seemed no-one did. (In the end, the resubmitted paper offered […]
