[Note: Mike asked me to scrape a couple of comments on his last post – this one and this one – and turn them into a post of their own. I’ve edited them lightly to hopefully improve the flow, but I’ve tried not to tinker with the guts.]

This is the fourth in a series of posts on how researchers might better be evaluated and compared. In the first post, Mike introduced his new paper and described the scope and importance of the problem. Then in the next post, he introduced the idea of the LWM, or Less Wrong Metric, and the basic mathematical framework for calculating LWMs. Most recently, Mike talked about choosing parameters for the LWM, and drilled down to a fundamental question: (how) do we identify good research?

Let me say up front that I am fully convicted about the problem of evaluating researchers fairly. It is a question of direct and timely importance to me. I serve on the Promotion & Tenure committees of two colleges at Western University of Health Sciences, and I want to make good decisions that can be backed up with evidence. But anyone who has been in academia for long knows of people who have had their careers mangled, by getting caught in institutional machinery that is not well-suited for fairly evaluating scholarship. So I desperately want better metrics to catch on, to improve my own situation and those of researchers everywhere.

For all of those reasons and more, I admire the work that Mike has done in conceiving the LWM. But I’m pretty pessimistic about its future.

I think there is a widespread misapprehension that we got here because people and institutions were looking for good metrics, like the LWM, and we ended up with things like impact factors and citation counts because no-one had thought up anything better. Implying a temporal sequence of:

1. Deliberately looking for metrics to evaluate researchers.
2. Finding some.
3. Trying to improve those metrics, or replace them with better ones.

I’m pretty sure this is exactly backwards: the metrics that we use to evaluate researchers are mostly simple – easy to explain, easy to count (the hanky-panky behind impact factors notwithstanding) – and therefore they spread like wildfire, and therefore they became used in evaluation. Implying a very different sequence:

1. A metric is invented, often for a reason completely unrelated to evaluating researchers (impact factors started out as a way for librarians to rank journals, not for administration to rank faculty!).
2. Because a metric is simple, it becomes widespread.
3. Because a metric is both simple and widespread, it makes it easy to compare people in wildly different circumstances (whether or not that comparison is valid or defensible!), so it rapidly evolves from being trivia about a researcher, to being a defining character of a researcher – at least when it comes to institutional evaluation.

If that’s true, then any metric aimed for wide-scale adoption needs to be as simple as possible. I can explain the h-index or i10 index in one sentence. “Citation count” is self-explanatory. The fundamentals of the impact factor can be grasped in about 30 seconds, and even the complicated backstory can be conveyed in about 5 minutes.
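To make the simplicity concrete, here is a minimal sketch of the h-index, the i10 index and a raw citation count, using the standard definitions. The function names and the list of citation counts are made up for illustration; nothing here comes from a real researcher's record.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts, one entry per paper.
papers = [25, 12, 10, 8, 3, 3, 0]
print("h-index:", h_index(papers))
print("i10 index:", i10_index(papers))
print("citation count:", sum(papers))
```

Each metric is a few lines with no parameters to choose, which is exactly the property that lets them spread.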

In addition to being simple, the metric needs to work the same way across institutions and disciplines. I can compare my h-index with that of an endowed chair at Cambridge, a curator at a small regional museum, and a postdoc at Podunk State, and it Just Works without any tinkering or subjective decisions on the part of the user (other than What Counts – but that affects all metrics dealing with publications, so no one metric is better off than any other on that score).

I fear that the LWM as conceived in Taylor (2016) is doomed, for the following reasons:

  • It’s too complex. It would probably be doomed if it had just a single term with a constant and an exponent (which I realize would defeat the purpose of having either a constant or an exponent), because that’s more math than either an impact factor or an h-index requires (perceptually, anyway – in the real world, most people’s eyes glaze over when the exponents come out).
  • Worse, it requires loads of subjective decisions and assigning importance on the part of the users.
  • And fatally, it would require a mountain of committee work to sort that out. I doubt if I could get the faculty in just one department to agree on a set of terms, constants, and exponents for the LWM, much less a college, much less a university, much less all of the universities, museums, government and private labs, and other places where research is done. And without the promise of universal applicability, there’s no incentive for any institution to put itself through the hell of work it would take to implement.

Really, the only way I think the LWM could get into place is by fiat, by a government body. If the EPA comes up with a more complicated but also more accurate way to measure, say, airborne particle output from car exhausts, they can theoretically say to the auto industry, “Meet this standard or stop selling cars in the US” (I know there’s a lot more legislative and legal push and pull than that, but it’s at least possible). And such a standard might be adopted globally, either because it’s a good idea so it spreads, or because the US strong-arms other countries into following suit.

Even if I trusted the US Department of Education to fill in all of the blanks for an LWM, I don’t know that they’d have the same leverage to get it adopted. I doubt that the DofE has enough sway to get it adopted even across all of the educational institutions. Who would want that fight, for such a nebulous pay-off? And even if it could be successfully inflicted on educational institutions (which sounds negative, but that’s precisely how the institutions would see it), what about the numerous and in some cases well-funded research labs and museums that don’t fall under the DofE’s purview? And that’s just in the US. The culture of higher education and scholarship varies a lot among countries. Which may be why the one-size-fits-all solutions suck – I am starting to wonder if a metric needs to be broken, to be globally applicable.

The problem here is that the user base is so diverse that the only way metrics get adopted is voluntarily. So the challenge for any LWM is to be:

  1. Better than existing metrics – this is the easy part – and,
  2. Simple enough to be both easily grasped, and applied with minimal effort. In Malcolm Gladwell Tipping Point terms, it needs to be “sticky”. Although a better adjective for passage through the intestines of academia might be “smooth” – that is, having no rough edges, like exponents or overtly subjective decisions*, that would cause it to snag.

* Calculating an impact factor involves plenty of subjective decisions, but it has the advantages that (a) the users can pretend otherwise, because (b) ISI does the ‘work’ for them.

At least from my point of view, the LWM as Mike has conceived it is awesome and possibly unimprovable on the first point (in that practically any other metric could be seen as a degenerate case of the LWM), but dismal and possibly pessimal on the second one, in that it requires mounds of subjective decision-making to work at all. You can’t even get a default number and then iteratively improve it without investing heavily in advance.

An interesting thought experiment would be to approach the problem from the other side: invent as many new simple metrics as possible, and then see if any of them offer advantages over the existing ones. Although I have a feeling that people are already working on that, and have been for some time.

Simple, broken metrics like impact factor are the prions of scholarship. Yes, viruses are more versatile and cells more versatile still, by orders of magnitude, but compared to prions, cells take an awesome amount of effort to build and maintain. If you just want to infect someone and you don’t care how, prions are very hard to beat. And they’re so subtle in their machinations that we only became aware of them comparatively recently – much like the emerging problems with “classical” (i.e., non-alt) metrics.

I’d love to be wrong about all of this. I proposed the strongest criticism of the LWM I could think of, in hopes that someone would come along and tear it down. Please start swinging.

You’ll remember that in the last installment (before Matt got distracted and wrote about archosaur urine), I proposed a general schema for aggregating scores in several metrics, terming the result an LWM or Less Wrong Metric. Given a set of n metrics that we have scores for, we introduce a set of n exponents $e_i$ which determine how we scale each kind of score as it increases, and a set of n factors $k_i$ which determine how heavily we weight each scaled score. Then we sum the scaled results:

$$\mathrm{LWM} = k_1 \cdot x_1^{e_1} + k_2 \cdot x_2^{e_2} + \cdots + k_n \cdot x_n^{e_n}$$
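As a minimal sketch, the sum can be written as one short function. The name `lwm`, the argument order and the example numbers are illustrative assumptions, not anything specified in the paper, and scores are assumed to be non-negative.

```python
def lwm(scores, factors, exponents):
    """Sum of k_i * x_i ** e_i over all metrics (scores assumed non-negative)."""
    if not (len(scores) == len(factors) == len(exponents)):
        raise ValueError("scores, factors and exponents must have the same length")
    return sum(k * x ** e for x, k, e in zip(scores, factors, exponents))


# Hypothetical researcher: H-index of 20 and 10,000 Twitter followers.
scores = [20, 10000]
factors = [1.0, 0.1]      # made-up weights k_i
exponents = [1.5, 0.5]    # made-up exponents e_i
print(lwm(scores, factors, exponents))
```

The calculation itself is trivial; all of the difficulty lives in choosing the 2n numbers passed in.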

“That’s all very well”, you may ask, “But how do we choose the parameters?”

Here’s what I proposed in the paper:

One approach would be to start with subjective assessments of the scores of a body of researchers – perhaps derived from the faculty of a university confidentially assessing each other. Given a good-sized set of such assessments, together with the known values of the metrics $x_1, x_2, \ldots, x_n$ for each researcher, techniques such as simulated annealing can be used to derive the values of the parameters $k_1, k_2, \ldots, k_n$ and $e_1, e_2, \ldots, e_n$ that yield an LWM formula best matching the subjective assessments.

Where the results of such an exercise yield a formula whose results seem subjectively wrong, this might flag a need to add new metrics to the LWM formula: for example, a researcher might be more highly regarded than her LWM score indicates because of her fine record of supervising doctoral students who go on to do well, indicating that some measure of this quality should be included in the LWM calculation.
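For readers who have not met simulated annealing, here is a rough sketch of how such a fit might look. Everything specific in it is an assumption made for illustration: the mean-squared-error cost, the linear cooling schedule, the step size, the starting point of all parameters at 1, and the tiny invented data set of (H-index, peer-reviews) pairs with subjective scores out of 10. The paper names the technique but does not specify any of these details.

```python
import math
import random


def lwm(scores, factors, exponents):
    """Sum of k_i * x_i ** e_i over all metrics."""
    return sum(k * x ** e for x, k, e in zip(scores, factors, exponents))


def mean_squared_error(params, researchers, targets):
    """Cost: how far the LWM scores are from the subjective assessments."""
    n = len(params) // 2
    factors, exponents = params[:n], params[n:]
    errors = [(lwm(x, factors, exponents) - t) ** 2
              for x, t in zip(researchers, targets)]
    return sum(errors) / len(errors)


def anneal(researchers, targets, steps=20000, start_temp=1.0, seed=0):
    """Search for factors and exponents that best match the targets."""
    rng = random.Random(seed)
    n = len(researchers[0])
    current = [1.0] * (2 * n)  # start from k_i = 1 and e_i = 1
    current_cost = mean_squared_error(current, researchers, targets)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Linear cooling; the small constant avoids division by zero.
        temp = start_temp * (1 - step / steps) + 1e-9
        # Nudge one randomly chosen parameter, keeping it non-negative.
        candidate = list(current)
        i = rng.randrange(2 * n)
        candidate[i] = max(0.0, candidate[i] + rng.gauss(0, 0.1))
        cost = mean_squared_error(candidate, researchers, targets)
        # Always accept improvements; sometimes accept a worse candidate,
        # less often as the temperature falls.
        if cost < current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
    return best, best_cost


# Invented data: (H-index, peer-reviews submitted) for five researchers,
# and the confidential score out of 10 that colleagues gave each of them.
researchers = [(10, 4), (25, 1), (5, 9), (40, 12), (15, 0)]
targets = [3.0, 5.5, 2.5, 9.0, 3.5]

params, cost = anneal(researchers, targets)
print("factors:", params[:2])
print("exponents:", params[2:])
print("mean squared error:", cost)
```

A researcher whose fitted score stays far from her subjective score even after fitting would be the kind of outlier that flags a missing metric.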

I think as a general approach that is OK: start with a corpus of well understood researchers, or papers, whose value we’ve already judged a priori by some means; then pick the parameters that best approximate that judgement; and let those parameters control future automated judgements.

The problem, really, is how we make that initial judgement. In the scenario I originally proposed, where say the 50 members of a department each assign a confidential numeric score to all the others, you can rely to some degree on the wisdom of crowds to give a reasonable judgement. But I don’t know how politically difficult it would be to conduct such an exercise. Even if the individual scorers were anonymised, the person collating the data would know the total scores awarded to each person, and it’s not hard to imagine that data being abused. In fact, it’s hard to imagine it not being abused.

In other situations, the value of the subjective judgement may be close to zero anyway. Suppose we wanted to come up with an LWM that indicates how good a given piece of research is. We choose LWM parameters based on the scores that a panel of experts assign to a corpus of existing papers, and derive our parameters from that. But we know that experts are really bad at assessing the quality of research. So what would our carefully parameterised LWM be approximating? Only the flawed judgement of flawed experts.

Perhaps this points to an even more fundamental problem: do we even know what “good research” looks like?

It’s a serious question. We all know that “research published in high-Impact Factor journals” is not the same thing as good research. We know that “research with a lot of citations” is not the same thing as good research. For that matter, “research that results in a medical breakthrough” is not necessarily the same thing as good research. As the new paper points out:

If two researchers run equally replicable tests of similar rigour and statistical power on two sets of compounds, but one of them happens to have in her batch a compound that turns out to have useful properties, should her work be credited more highly than the similar work of her colleague?

What, then? Are we left only with completely objective measurements, such as statistical power, adherence to the COPE code of conduct, open-access status, or indeed correctness of spelling?

If we accept that (and I am not arguing that we should, at least not yet), then I suppose we don’t even need an LWM for research papers. We can just count these objective measures and call it done.

I really don’t know what my conclusions are here. Can anyone help me out?

I said last time that my new paper on Better ways to evaluate research and researchers proposes a family of Less Wrong Metrics, or LWMs for short, which I think would at least be an improvement on the present ubiquitous use of impact factors and H-indexes.

What is an LWM? Let me quote the paper:

The Altmetrics Manifesto envisages no single replacement for any of the metrics presently in use, but instead a palette of different metrics laid out together. Administrators are invited to consider all of them in concert. For example, in evaluating a researcher for tenure, one might consider H-index alongside other metrics such as number of trials registered, number of manuscripts handled as an editor, number of peer-reviews submitted, total hit-count of posts on academic blogs, number of Twitter followers and Facebook friends, invited conference presentations, and potentially many other dimensions.

In practice, it may be inevitable that overworked administrators will seek the simplicity of a single metric that summarises all of these.

This is a key problem of the world we actually live in. We often bemoan the fact that people evaluating research will apparently do almost anything rather than actually read the research. (To paraphrase Dave Barry, these are important, busy people who can’t afford to fritter away their time in competently and diligently doing their job.) There may be good reasons for this; there may only be bad reasons. But what we know for sure is that, for good reasons or bad, administrators often do want a single number. They want it so badly that they will seize on the first number that comes their way, even if it’s as horribly flawed as an impact factor or an H-index.

What to do? There are two options. One is to change the way these overworked administrators function, to force them to read papers and consider a broad range of metrics – in other words, to change human nature. Yeah, it might work. But it’s not where the smart money is.

So perhaps the way to go is to give these people a better single number. A less wrong metric. An LWM.

Here’s what I propose in the paper.

In practice, it may be inevitable that overworked administrators will seek the simplicity of a single metric that summarises all of these. Given a range of metrics $x_1, x_2, \ldots, x_n$, there will be a temptation to simply add them all up to yield a “super-metric”, $x_1 + x_2 + \cdots + x_n$. Such a simply derived value will certainly be misleading: no-one would want a candidate with 5,000 Twitter followers and no publications to appear a hundred times stronger than one with an H-index of 50 and no Twitter account.

A first step towards refinement, then, would weight each of the individual metrics using a set of constant parameters $k_1, k_2, \ldots, k_n$ to be determined by judgement and experiment. This yields another metric, $k_1 \cdot x_1 + k_2 \cdot x_2 + \cdots + k_n \cdot x_n$. It allows the down-weighting of less important metrics and the up-weighting of more important ones.

However, even with well-chosen $k_i$ parameters, this better metric has problems. Is it really a hundred times as good to have 10,000 Twitter followers than 100? Perhaps we might decide that it’s only ten times as good – that the value of a Twitter following scales with the square root of the count. Conversely, in some contexts at least, an H-index of 40 might be more than twice as good as one of 20. In a search for a candidate for a senior role, one might decide that the value of an H-index scales with the square of the value; or perhaps it scales somewhere between linearly and quadratically – with $\text{H-index}^{1.5}$, say. So for full generality, the calculation of the “Less Wrong Metric”, or LWM for short, would be configured by two sets of parameters: factors $k_1, k_2, \ldots, k_n$, and exponents $e_1, e_2, \ldots, e_n$. Then the formula would be:

$$\mathrm{LWM} = k_1 \cdot x_1^{e_1} + k_2 \cdot x_2^{e_2} + \cdots + k_n \cdot x_n^{e_n}$$

So that’s the idea of the LWM — and you can see now why I refer to this as a family of metrics. Given n metrics that you’re interested in, you pick 2n parameters to combine them with, and get a number that to some degree measures what you care about.
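The three stages in the quoted passage (plain sum, weighted sum, full LWM) can be compared on the paper's own example of a candidate with 5,000 Twitter followers and no publications versus one with an H-index of 50 and no Twitter account. This is only a sketch: the factor 0.001 and the exponents 1.5 and 0.5 are values picked for illustration, echoing the square-root and H-index^1.5 suggestions, and are not recommendations from the paper.

```python
def lwm(scores, factors, exponents):
    """Sum of k_i * x_i ** e_i over all metrics."""
    return sum(k * x ** e for x, k, e in zip(scores, factors, exponents))


# Metric order: (H-index, Twitter followers).
tweeter = (0, 5000)   # no publications, 5,000 followers
scholar = (50, 0)     # H-index of 50, no Twitter account

# Each stage is a (factors, exponents) pair; the numbers are hypothetical.
stages = {
    "plain sum": ((1.0, 1.0), (1.0, 1.0)),
    "weighted sum": ((1.0, 0.001), (1.0, 1.0)),
    "full LWM": ((1.0, 0.001), (1.5, 0.5)),
}

for name, (factors, exponents) in stages.items():
    a = lwm(tweeter, factors, exponents)
    b = lwm(scholar, factors, exponents)
    print(f"{name}: tweeter={a:.3f} scholar={b:.3f}")

# The scaling claims in the text, checked directly.
print("10,000 vs 100 followers, square-root scaling:", 10000 ** 0.5 / 100 ** 0.5)
print("H-index 40 vs 20, exponent 1.5:", 40 ** 1.5 / 20 ** 1.5)
```

With the plain sum the Twitter-only candidate comes out a hundred times stronger; once a small factor is applied to the follower count the ordering reverses, and the exponents then control how quickly each metric's contribution grows.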

(How do you choose your 2n parameters? That’s the subject of the next post. Or, as before, you can skip ahead and read the paper.)

Like Stephen Curry, we at SV-POW! are sick of impact factors. That’s not news. Everyone now knows what a total disaster they are: how they are significantly correlated with retraction rate but not with citation count; how they are higher for journals whose studies are less statistically powerful; how they incentivise bad behaviour including p-hacking and over-hyping. (Anyone who didn’t know all that is invited to read Brembs et al.’s 2013 paper Deep impact: unintended consequences of journal rank, and weep.)

It’s 2016. Everyone who’s been paying attention knows that impact factor is a terrible, terrible metric for the quality of a journal, a worse one for the quality of a paper, and not even in the park as a metric for the quality of a researcher.

Unfortunately, “everyone who’s been paying attention” doesn’t seem to include such figures as search committees picking people for jobs, department heads overseeing promotion, tenure committees deciding on researchers’ job security, and I guess granting bodies. In the comments on this blog, we’ve been told time and time and time again — by people who we like and respect — that, however much we wish it weren’t so, scientists do need to publish in high-IF journals for their careers.

What to do?

It’s a complex problem, not well suited to discussion on Twitter. Here’s what I wrote about it recently:

The most striking aspect of the recent series of Royal Society meetings on the Future of Scholarly Scientific Communication was that almost every discussion returned to the same core issue: how researchers are evaluated for the purposes of recruitment, promotion, tenure and grants. Every problem that was discussed – the disproportionate influence of brand-name journals, failure to move to more efficient models of peer-review, sensationalism of reporting, lack of replicability, under-population of data repositories, prevalence of fraud – was traced back to the issue of how we assess works and their authors.

It is no exaggeration to say that improving assessment is literally the most important challenge facing academia.

This is from the introduction to a new paper which came out today: Taylor (2016), Better ways to evaluate research and researchers. In eight short pages – six, really, if you ignore the appendix – I try to get to grips with the historical background that got us to where we are, I discuss some of the many dimensions we should be using to evaluate research and researchers, and I propose a family of what I call Less Wrong Metrics – LWMs – that administrators could use if they really absolutely have to put a single number on things.

(I was solicited to write this by SPARC Europe, I think in large part because of things I have written around this subject here on SV-POW! My thanks to them: this paper becomes part of their Briefing Papers series.)

Next time I’ll talk about the LWM and how to calculate it. Those of you who are impatient might want to read the actual paper first!

In praise of Jack McIntosh

December 14, 2015

A short one today, and a sad one.

I heard last night on Twitter that Jack McIntosh has died at the age of 92. It would be hard to overstate what an inspiration he’s been to me. As a professional in a non-palaeo field who went on to do crucial work in sauropod palaeontology, he blazed a trail that I have tried in my small way to follow. I think it’s true to say that, without his example, I would never have got into palaeo research — never even considered it a possibility.

Jack McIntosh, still going strong at a conference late in life. Picture from this tweet by ReBecca Hunt-Foster. Hans-Dieter Sues for scale.

Others have written more about McIntosh’s crucial work – for example, determining the correct skull for Apatosaurus (McIntosh and Berman 1975), his careful historical work in collections (McIntosh 1981), his detailed monographic descriptions (e.g. McIntosh et al. 1996) and most recently his re-evaluation of Barosaurus (McIntosh 2005). When I made my own start in palaeo, around 2000, his chapter in The Dinosauria (McIntosh 1990) was the definitive overview of the sauropods.

Perhaps the best overview of his life and work is the interview that Jeff Wilson and Kristi Curry Rogers conducted with him for the afterword of the volume that they edited in his honour in 2005 (Wilson and Curry Rogers 2005). It’s well worth reading.

Pittsburgh, Pennsylvania, USA — Leading sauropod expert Jack McIntosh beneath Apatosaurus Louisae at the Carnegie Museum of Natural History, a forty-ton vegetarian named after Andrew Carnegie’s wife, which is over seventy-seven feet (over 23 meters) long and is the longest mounted dinosaur in the world. — Image by © Louie Psihoyos/Corbis

I’ll close with my own brief experience of meeting Jack, a privilege that I had only once. It was the 2007 SVP meeting in Austin, Texas. I somehow got invited to a sauropod workers’ lunch one day. By careful manoeuvring, I managed to sit myself next to Jack. At that stage I had two very minor papers to my name — the 2005 note on the phylogenetic taxonomy of diplodocoids and the 2006 Mesozoic Terrestrial Ecosystems short-paper on dinosaur diversity. In short, I was a nobody.

But Jack was fascinated by what I was working on. At that time, the Xenoposeidon paper was in press — no-one had seen it but Darren (my co-author), the handling editor and three peer-reviewers. I sketched the holotype dorsal vertebra — literally on a napkin, if I remember rightly — and explained all the unique features. At this point, Jack was 84 years old and could certainly have been forgiven for just wanting to have his lunch in peace, but he was deeply interested. Even at the time I was aware of the honour of showing this work to a man who’d been at the forefront of my field for four decades.

I don’t remember whether we discussed it at the time, but I’d spent the previous week, with Matt, Randy Irmis and Sarah Werning, in the collections at the Sam Noble Oklahoma Museum of Natural History, working on the remains of a sauropod from the Hotel Mesa quarry in the Cedar Mountain Formation. When the paper finally came out four years later (Taylor, Wedel and Cifelli 2011), we named the new dinosaur Brontomerus mcintoshi in Jack’s honour. Very nearly but not quite a year earlier, Chure et al. (2010) had beaten us to the punch by naming their brachiosaurid Abydosaurus mcintoshi after him.

To the best of my knowledge, that makes Jack the only person in history to have had two sauropods named after him in a year. A fitting tribute indeed.

Update 1 (16 December)

Ken Carpenter writes: “Mike, Here is the electronic card I made for McIntosh’s 90th birthday. I’d like to have posted at SVPoW.”

For Jack

Update 2 (16 December)

Jeff Wilson has written a piece that goes into much more detail about McIntosh’s scientific achievements. Well worth a read.

References

  • Chure, Daniel, Brooks B. Britt, John A. Whitlock and Jeffrey A. Wilson. 2010. First complete sauropod dinosaur skull from the Cretaceous of the Americas and the evolution of sauropod dentition. Naturwissenschaften 97(4):379-91. doi:10.1007/s00114-010-0650-6
  • McIntosh, John S. 1981. Annotated catalogue of the dinosaurs (Reptilia, Archosauria) in the collections of Carnegie Museum of Natural History. Bulletin of the Carnegie Museum 18:1-67.
  • McIntosh, John S. 1990. Sauropoda. pp. 345-401 in: D. B. Weishampel, P. Dodson and H. Osmólska (eds.), The Dinosauria. University of California Press, Berkeley and Los Angeles.
  • McIntosh, John S. 2005. The Genus Barosaurus Marsh (Sauropoda, Diplodocidae). pp. 38-77 in: Virginia Tidwell and Ken Carpenter (eds.), Thunder Lizards: the Sauropodomorph Dinosaurs. Indiana University Press, Bloomington, Indiana. 495 pp.
  • McIntosh, John S., and David S. Berman. 1975. Description of the palate and lower jaw of the sauropod dinosaur Diplodocus (Reptilia: Saurischia) with remarks on the nature of the skull of Apatosaurus. Journal of Paleontology 49(1):187-199.
  • McIntosh, John S., Wade E. Miller, Kenneth L. Stadtman and David D. Gillette. 1996. The osteology of Camarasaurus lewisi (Jensen, 1988). BYU Geology Studies 41:73-115.
  • Taylor, Michael P., Mathew J. Wedel and Richard L. Cifelli. 2011. A new sauropod dinosaur from the Lower Cretaceous Cedar Mountain Formation, Utah, USA. Acta Palaeontologica Polonica 56(1):75-98. doi: 10.4202/app.2010.0073
  • Wilson, Jeffrey A., and Kristina A. Curry Rogers. 2005. A conversation with Jack McIntosh. pp. 327-333 in: Kristina A. Curry Rogers and Jeffrey A. Wilson (eds.), The Sauropods: Evolution and Paleobiology, University of California Press, Berkeley, Los Angeles and London. 349 pages.

Many SV-POW! readers will already be aware that the entire editorial staff of the Elsevier journal Lingua has resigned over the journal’s high price and lack of open access. As soon as they have worked out their contracts, they will leave en bloc and start a new open access journal, Glossa — which will in fact be the old journal under a new name. (Whether Elsevier tries to keep the Lingua ghost-ship afloat under new editors remains to be seen.)

Today I saw Elsevier’s official response, “Addressing the resignation of the Lingua editorial board”. I just want to pick out one tiny part of this, which reads as follows:

The article publishing charge at Lingua for open access articles is 1800 USD. The editor had requested a price of 400 euros, an APC that is not sustainable. Had we made the journal open access only and at the suggested price point, it would have rendered the journal no longer viable – something that would serve nobody, least of which the linguistics community.

The new Lingua will be hosted at Ubiquity Press, a well-established open-access publisher that started out as UCL’s in-house OA publishing arm and has spun off into a private company. The APC at Ubiquity journals is typically £300 (€375, $500), which is less than the level that Elsevier describe as “not sustainable” (and a little over a quarter of what Elsevier currently charge).

Evidently Ubiquity Press finds it sustainable.

You know what’s not sustainable? Dragging around the carcass of a legacy barrier-based publisher, with all its expensive paywalls, authentication systems, Shibboleth/Athens/Kerberos integration, lawyers, PR departments, spin-doctors, lobbyists, bribes to politicians, and of course 37.3% profit margins.

The biggest problem with legacy publishers? They’re just a waste of money.

When I separated my cat’s head from its body, the first five cervical vertebrae came with it. Never one to waste perfectly good cervicals, I prepped them as well as the skull. Here they are, nicely articulated. (Click through for high resolution.) Dorsal view at the top, then right lateral (actually, slightly dorsolateral) and ventral view at the bottom.

cat-first-five-cervicals-white

Or you may prefer the same image on a black background:

cat-first-five-cervicals

For those of us used to sauropod necks, where the atlas (C1) is a tiny, fragile ring, mammal atlases look bizarre, with their grotesque over-engineering and gigantic wings.
