I recently discovered the blog Slime Mold Time Mold, which is largely about the science of obesity — a matter of more than academic interest to me, and, if I may say so, to Matt.

I discovered SMTM through its fascinating discussions of scurvy and citrus-fruit taxonomy. But what’s really been absorbing me recently is a series of twenty long, detailed posts under the banner “A Chemical Hunger”, in which the author contends that the principal cause of the modern obesity epidemic is chemically induced changes to the “lipostat” that tells our bodies what level of mass to maintain.

I highly recommend that you read the first post in this series, “Mysteries”, and see what you think. If you want to read on after that, fine; but even if you stop there, you’ll still have read something fascinating, counter-intuitive, well referenced and (I think) pretty convincing.

Anyway. The post that fascinates me right now is one of the digressions: “Interlude B: The Nutrient Sludge Diet”. In this post, the author tells us about “a 1965 study in which volunteers received all their food from a ‘feeding machine’ that pumped a ‘liquid formula diet’ through a ‘dispensing syringe-type pump which delivers a predetermined volume of formula through the mouthpiece’”, but they were at liberty to choose how many hits of this neutral-tasting sludge they took.

This study had an absolutely sensational outcome: among the participants with healthy body-weight, the amount of nutrient sludge that they chose to feed themselves was almost exactly equal in caloric content to their diets before the experiment. But the grossly obese participants (weighing about 400 lb, or 180 kg) chose to feed themselves a tiny proportion of their usual intake — about one tenth — and lost an astonishing amount of weight. All without feeling hunger.

Please do read the Slime Mold Time Mold write-up for the details. But I will let you in right now on the study’s very, very significant flaw. The sample-size was two. That is, two obese participants, plus a control-group of two healthy-weight individuals. And clearly whatever conclusion we can draw from a study of that size is merely anecdotal, having no statistical power worth mentioning.

And now we come to the truly astonishing part of this. It seems no-one has tried to replicate this study with a decent-sized sample. The blog says:

If this works, why hasn’t someone replicated it by now? It would be pretty easy to run a RCT where you fed more than five obese people nutrient sludge ad libitum for a couple weeks, so this means either it doesn’t work as described, or it does work and for some reason no one has tried it. Given how huge the rewards for this finding would be, we’re going to go with the “it doesn’t work” explanation.

In a comment, I asked:

OK, I’ll bite. Why hasn’t anyone tried to replicate the astounding and potentially valuable findings of these studies? It beggars belief that it’s not been tried, and multiple times. Do you think it has been tried, but the results weren’t published because they were unimpressive? That would be an appalling waste.

The blog author replied:

Our guess is that it simple hasn’t been tried! Academia likes to pretend that research is one-and-done, and rarely checks things once they’re in the literature. We agree, someone should try to replicate!

I’m sort of at a loss for words here. How can it possibly be that, 58 years after a pilot study that potentially offers a silver bullet to the problem of obesity, no-one has bothered to check whether it works? I mean, the initial study is so old that when it was run, Revolver hadn’t yet been released. Yet it seems to have just lain there, unloved, as the Beatles moved on through Sergeant Pepper, the White Album, Abbey Road et al., broke up, pursued their various solo projects, died (50% of the sample) and watched popular music devolve into whatever the heck it is now.

Why aren’t obesity researchers all over this?

Darren, the silent partner at SV-POW!, pointed me to this tweet by Duc de Vinney, displaying a tableau of “A bunch of Boners (people who study bones) Not just paleontologists, some naturalists and cryptozoologists too”, apparently commissioned by @EDGEinthewild:

As you can see, Darren, Matt and I (as well as long-time Friend Of SV-POW! Mark Witton) somehow all made it into the cartoon, ahead of numerous far more deserving people. Whatever the criterion was, and whatever reason Edge In The Wild had for wanting this, I am delighted to be included alongside the likes of Owen, Osborn, Cope, Marsh, and Bob Bakker. Even if the caricatures are not especially flattering.

Here is an edit showing only the three of us, which I am sure I will find many fruitful uses for:

My thanks to Duc de Vinney for creating this!

 

I’m currently working on a paper about the AMNH’s rearing Barosaurus mount. (That’s just one of the multiple reasons I am currently obsessed by Barosaurus.) It’s a fascinating process: more of a history project than a scientific one. It’s throwing up all sorts of things. Here’s one.

In 1992, the year after the mount went up, S. O. Landry gave a talk at the annual meeting of American Zoologist about this mount. I don’t even remember now where I saw a reference to this, or how I found it, but the untitled abstract is on JSTOR, as part of the society’s abstracts volume. Here it is, in its entirety:

I thought he’d made some good points, so I wanted to figure out whether he’d ever gone on from this 31-year-old abstract and published a paper about it.

Based on the surname, initials and affiliation, I searched here and there, and turned up a few bits and pieces. I learned that he was a Professor of Biology at SUNY at Binghamton, specialising in hystricomorph rodents. I found out that his wife Helen died in 2007 after 57 years’ marriage. (That’s not just idle curiosity: it’s how I discovered that his first name was Stuart.) I found a photograph of him, taken in 1975, with Assemblyman James L. Tallon, and learned in the process that his middle name was Omer. I found that he was at one time the Graduate Dean at SUNY Binghamton, and opposed the 1972 rise in tuition fees from $800 per year to $1200–$1500. I learned that his BS was from Harvard College and his Ph.D. from UC Berkeley, and that he is still listed as a professor emeritus at SUNY Binghamton. I discovered that he “pooh-poohs the idea that young students’ minds are ‘tabula rasas’ – blank slates”. I know that in 1966 he translated C. C. Robin’s Voyage to the Interior of Louisiana from its original French. I learned that he was born in 1924 and died in 2015 at the age of 90, and served in the Battle of the Bulge. More troublingly, I discovered that his father, also named Stuart Omer Landry, was known for writing racist tracts for the Pelican Publishing Company, but that he himself rose above that heritage and became known for his progressive politics.

I don’t know what to make of any of this. It seems that he never published anything substantive about Barosaurus, so in that sense, I have lost interest in him. But isn’t it strange that trying to answer the simple question “Did the S. O. Landry who wrote an abstract about rearing Barosaurus write anything else on the subject?” has wound up opening the book of someone’s life like this?

And how strange that someone with 90 years of rich, complex life and numerous academic achievements should be, to me, just the guy who wrote an untitled abstract about Barosaurus that one time.

Imposter syndrome revisited

September 13, 2018

My wife Fiona is a musician and composer, and she’s giving a talk at this year’s TetZooCon on “Music for Wildlife Documentaries – A Composer’s Perspective”. (By the way, it looks like some tickets are still available: if you live near or in striking distance of London, you should definitely go! Get your tickets here.)

With less than four weeks to go, she’s starting to get nervous — to feel that she doesn’t know enough about wildlife to talk to the famously knowledgeable and attractive TetZooCon audience. In other words, it’s a classic case of our old friend imposter syndrome.

Wanting to reassure her about how common this is, I posted a Twitter poll:

Question for academics, including grad-students.
(Please RT for better coverage.)

Have you ever experienced Imposter Syndrome?
(And feel free to leave comments with more detail.)

Here are the results at the end of the 24-hour voting period:

Based on a sample of nearly 200 academics, just one in 25 claims not to have experienced imposter syndrome; nearly two thirds feel it all the time.

The comments are worth reading, too. For example, Konrad Förstner responded:

Constantly. I would not be astonished if at some point a person from the administration knocks at my door and tells me that my work was just occupational therapy to keep me busy but that my healthcare insurance will not pay this any longer.

What does this mean? Only this: you are not alone. Outside of a tiny proportion of people, everyone else you know and work with sometimes feels that way. Most of them always feel that way. And yet, think about the work they do. It’s pretty good, isn’t it? Despite how they feel? From the outside, you can see that they’re not imposters.

Guess what? They can see that you’re not an imposter, either.

The previous post (Every attempt to manage academia makes it worse) has been a surprise hit, and is now by far the most-read post in this blog’s nearly-ten-year history. It evidently struck a chord with a lot of people, and I’ve been surprised — amazed, really — at how nearly unanimously people have agreed with it, both in the comments here and on Twitter.

But I was brought up short by this tweet from Thomas Koenig:

That is the question, isn’t it? Why do we keep doing this?

I don’t know enough about the history of academia to discuss the specific route we took to the place we now find ourselves in. (If others do, I’d be fascinated to hear.) But I think we can fruitfully speculate on the underlying problem.

Let’s start with the famous true story of the Hanoi rat epidemic of 1902. In a town overrun by rats, the authorities tried to reduce the population by offering a bounty on rat tails. Enterprising members of the populace responded by catching live rats, cutting off their tails to collect the bounty, then releasing the rats to breed, so more tails would be available in future. Some people even took to breeding rats for their tails.

Why did this go wrong? For one very simple reason: because the measure optimised for was not the one that mattered. What the authorities wanted to do was reduce the number of rats in Hanoi. For reasons that we will come to shortly, the proxy that they provided an incentive for was the number of rat tails collected. These are not the same thing — optimising for the latter did not help the former.

The badness of the proxy measure applies in two ways.

First, consider those who caught rats, cut their tails off and released them. They stand as counter-examples to the assumption that harvesting a rat-tail is equivalent to killing the rat. The proxy was bad because it assumed a false equivalence. It was possible to satisfy the proxy without advancing the actual goal.

Second, consider those who bred rats for their tails. They stand as counter-examples to the assumption that killing a rat is equivalent to decreasing the total number of live rats. Worse, if the breeders released their de-tailed captive-bred progeny into the city, their harvests of tails not only didn’t represent any decrease in the feral population, they represented an increase. So the proxy was worse than neutral because satisfying it could actively harm the actual goal.

So far, so analogous to the perverse academic incentives we looked at last time. Where this gets really interesting is when we consider why the Hanoi authorities chose such a terribly counter-productive proxy for their real goal. Recall their object was to reduce the feral rat population. There were two problems with that goal.

First, the feral rat population is hard to measure. It’s so much easier to measure the number of tails people hand in. A metric is seductive if it’s easy to measure. In the same way, it’s appealing to look for your dropped car-keys under the street-lamp, where the light is good, rather than over in the darkness where you dropped them. But it’s equally futile.

Second — and this is crucial — it’s hard to properly reward people for reducing the feral rat population because you can’t tell who has done what. If an upstanding citizen leaves poison in the sewers and kills a thousand rats, there’s no way to know what he has achieved, and to reward him for it. The rat-tail proxy is appealing because it’s easy to reward.

The application of all this to academia is pretty obvious.

First, the things we really care about are hard to measure. The reason we do science — or, at least, the reason societies fund science — is to achieve breakthroughs that benefit society. That means important new insights, findings that enable new technology, ways of creating new medicines, and so on. But all these things take time to happen. It’s difficult to look at what a lab is doing now and say “Yes, this will yield valuable results in twenty years”. Yet that may be what is required: trying to evaluate it using a proxy of how many papers it gets into high-IF journals this year will most certainly militate against its doing careful work with long-term goals.

Second, we have no good way to reward the right individuals or labs. What we as a society care about is the advance of science as a whole. We want to reward the people and groups whose work contributes to the global project of science — but those are not necessarily the people who have found ways to shine under the present system of rewards: publishing lots of papers, shooting for the high-IF journals, skimping on sample-sizes to get spectacular results, searching through big data-sets for whatever correlations they can find, and so on.

In fact, when a scientist who is optimising for what gets rewarded slices up a study into multiple small papers, each with a single sensational result, and shops them around Science and Nature, all they are really doing is breeding rats.

If we want people to stop behaving this way, we need to stop rewarding them for it. (Side-effect: when people are rewarded for bad behaviour, people who behave well get penalised, lose heart, and leave the field. They lose out, and so does society.)

Q. “Well, that’s great, Mike. What do you suggest?”

A. Ah, ha ha, I’d been hoping you wouldn’t bring that up.

No-one will be surprised to hear that I don’t have a silver bullet. But I think the place to start is by being very aware of the pitfalls of the kinds of metrics that managers (including us, when wearing certain hats) like to use. Managers want metrics that are easy to calculate, easy to understand, and quick to yield a value. That’s why articles are judged by the impact factor of the journal they appear in: the calculation of the article’s worth is easy (copy the journal’s IF out of Wikipedia); it’s easy to understand (or, at least, it’s easy for people to think they understand what an IF is); and best of all, it’s available immediately. No need for any of that tedious waiting around for five years to see how often the article is cited, or waiting ten years to see what impact it has on the development of the field.

Wise managers (and again, that means us when wearing certain hats) will face up to the unwelcome fact that metrics with these desirable properties are almost always worse than useless. Coming up with better metrics, if we’re determined to use metrics at all, is real work and will require an enormous educational effort.

One thing we can usefully do, whenever considering a proposed metric, is actively consider how it can and will be hacked. Black-hat it. Invest a day imagining you are a rational, selfish researcher in a regimen that uses the metric, and plan how you’re going to exploit it to give yourself the best possible score. Now consider whether the course of action you mapped out is one that will benefit the field and society. If not, dump the metric and start again.

Q. “Are you saying we should get rid of metrics completely?”

A. Not yet; but I’m open to the possibility.

Given metrics’ terrible track-record of hackability, I think we’re now at the stage where the null hypothesis should be that any metric will make things worse. There may well be exceptions, but the burden of proof should be on those who want to use them: they must show that they will help, not just assume that they will.

And what if we find that every metric makes things worse? Then the only rational thing to do would be not to use any metrics at all. Some managers will hate this, because their jobs depend on putting numbers into boxes and adding them up. But we’re talking about the progress of research to benefit society, here.

We have to go where the evidence leads. Dammit, Jim, we’re scientists.

I’ve been on Twitter since April 2011 — nearly six years. A few weeks ago, for the first time, something I tweeted broke the thousand-retweets barrier. And I am really unhappy about it. For two reasons.

First, it’s not my own content — it’s a screen-shot of Table 1 from Edwards and Roy (2017):

[Screen-shot of Table 1 from Edwards and Roy (2017)]

And second, it’s so darned depressing.

The problem is a well-known one, and indeed one we have discussed here before: as soon as you try to measure how well people are doing, they will switch to optimising for whatever you’re measuring, rather than putting their best efforts into actually doing good work.

In fact, this phenomenon is so very well known and understood that it’s been given at least three different names by different people:

  • Goodhart’s Law is most succinct: “When a measure becomes a target, it ceases to be a good measure.”
  • Campbell’s Law is the most explicit: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”
  • The Cobra Effect refers to the way that measures taken to improve a situation can directly make it worse.

As I say, this is well known. There’s even a term for it in social theory: reflexivity. And yet we persist in doing idiot things that can only possibly have this result:

  • Assessing school-teachers on the improvement their kids show in tests between the start and end of the year (which obviously results in their doing all they can to depress the start-of-year tests).
  • Assessing researchers by the number of their papers (which can only result in slicing into minimal publishable units).
  • Assessing them — heaven help us — on the impact factors of the journals their papers appear in (which feeds the brand-name fetish that is crippling scholarly communication).
  • Assessing researchers on whether their experiments are “successful”, i.e. whether they find statistically significant results (which inevitably results in p-hacking and HARKing).

What’s the solution, then?

I’ve been reading the excellent blog of economist Tim Harford for a while. That arose from reading his even more excellent book The Undercover Economist (Harford 2007), which gave me a crash-course in the basics of how economies work, how markets help, how they can go wrong, and much more. I really can’t say enough good things about this book: it’s one of those that I feel everyone should read, because the issues are so important and pervasive, and Harford’s explanations are so clear.

In a recent post, Why central bankers shouldn’t have skin in the game, he makes this point:

The basic principle for any incentive scheme is this: can you measure everything that matters? If you can’t, then high-powered financial incentives will simply produce short-sightedness, narrow-mindedness or outright fraud. If a job is complex, multifaceted and involves subtle trade-offs, the best approach is to hire good people, pay them the going rate and tell them to do the job to the best of their ability.

I think that last part is pretty much how academia used to be run a few decades ago. Now I don’t want to get all misty-eyed and rose-tinted and nostalgic — especially since I wasn’t even involved in academia back then, and don’t know from experience what it was like. But could it be … could it possibly be … that the best way to get good research and publications out of scholars is to hire good people, pay them the going rate and tell them to do the job to the best of their ability?

[Read on to Why do we manage academia so badly?]

References

Bonus

Here is a nicely formatted full-page version of the Edwards and Roy table, for you to print out and stick on all the walls of your university. My thanks to David Roberts for preparing it.

The European Commission is putting together a Commission Expert Group to provide advice about the development and implementation of open science policy in Europe. It will be known as the Open Science Policy Platform (OSPP).

This is potentially excellent news. The OSPP’s primary goal is to “advise the Commission on how to further develop and practically implement open science policy”.

But there’s potentially a downside here. We can be sure that the legacy publishers will attempt to stuff the committee with their own people, just as they did with the Finch committee — and that, if they succeed, they will do everything they can to retard all forms of progress that hurt their bottom line, just as they did with the Finch committee.

Unfortunately, multinational corporations with £2 billion annual revenue and £762 million annual profit (see page 17 of Elsevier’s 2014 annual report) are very well positioned to dedicate resources to getting their people onto influential committees. Those of us without a spare £762 million to spend on marketing are at a huge operational disadvantage when it comes to influencing policy. Happily, though, we do have one important thing on our side: we’re right.

So we should do what we can to get genuinely progressive pro-open candidates onto the OSPP. I know of several people who have put themselves forward, and I am briefly describing them below (in the order I heard about their candidacy). I have publicly endorsed the first few, and will go on to endorse the others just as soon as I have a moment. If you know and admire these people, please consider leaving your own endorsement — it will help their case to be taken on to the OSPP.


Björn Brembs is a neuroscientist who has been a tireless advocate for open access, and open science more generally, for many years. He has particularly acute insights into the wastefulness of our present scholarly communication mechanisms. His candidacy is announced on his blog, and I left my endorsement as a comment.

Cameron Neylon falls into the needs-no-introduction category. Every time I’ve talked to him, I’ve come away better informed and wiser, thanks to his exhaustive knowledge and understanding of the issues surrounding openness: both the opportunities it presents, and the difficulties that slow our progress. His candidacy is announced on his blog, and I left my endorsement as a comment.

Chris Hartgerink is an active researcher in text and data mining, whose work has repeatedly been disrupted by impediments deliberately imposed by barrier-based publishers. He knows what it’s like on the ground in the content-mining wars. His candidacy is announced on his blog, and I left my endorsement as a comment.

Daniel Mietchen both practices and advocates openness at every stage in the scientific process, with a special focus on the use of Wikipedia and the ways its free content can be enhanced. Fittingly, his candidacy bid is itself a wiki page, and endorsements are invited on the corresponding discussion page.

Konrad Förstner develops open-source software for research, works on how to make analyses reproducible, promotes the use of pre-print servers, and creates open educational resources. His candidacy is announced on his blog, and I left my endorsement as a comment. [H/T Daniel Mietchen]

Finally (for now), Jenny Molloy is the manager of Content Mine and co-ordinator of OKFN, the Open Knowledge Foundation. She has announced her candidacy on a mailing list, but doesn’t yet have a web-page about it, to my knowledge. I’ll update this page as soon as I hear that this has changed.


 

That’s it for now: get out there and endorse the candidates that you like!

Have I missed anyone? Let me know, and I’ll update this post.

 

[Note: Mike asked me to scrape a couple of comments on his last post – this one and this one – and turn them into a post of their own. I’ve edited them lightly to hopefully improve the flow, but I’ve tried not to tinker with the guts.]

This is the fourth in a series of posts on how researchers might better be evaluated and compared. In the first post, Mike introduced his new paper and described the scope and importance of the problem. Then in the next post, he introduced the idea of the LWM, or Less Wrong Metric, and the basic mathematical framework for calculating LWMs. Most recently, Mike talked about choosing parameters for the LWM, and drilled down to a fundamental question: (how) do we identify good research?

Let me say up front that I am fully convinced of the seriousness of the problem of evaluating researchers fairly. It is a question of direct and timely importance to me. I serve on the Promotion & Tenure committees of two colleges at Western University of Health Sciences, and I want to make good decisions that can be backed up with evidence. But anyone who has been in academia for long knows of people who have had their careers mangled by getting caught in institutional machinery that is not well suited to fairly evaluating scholarship. So I desperately want better metrics to catch on, to improve my own situation and those of researchers everywhere.

For all of those reasons and more, I admire the work that Mike has done in conceiving the LWM. But I’m pretty pessimistic about its future.

I think there is a widespread misapprehension that we got here because people and institutions were looking for good metrics, like the LWM, and we ended up with things like impact factors and citation counts because no-one had thought up anything better. Implying a temporal sequence of:

1. Deliberately looking for metrics to evaluate researchers.
2. Finding some.
3. Trying to improve those metrics, or replace them with better ones.

I’m pretty sure this is exactly backwards: the metrics that we use to evaluate researchers are mostly simple – easy to explain, easy to count (the hanky-panky behind impact factors notwithstanding) – and therefore they spread like wildfire, and therefore they became used in evaluation. Implying a very different sequence:

1. A metric is invented, often for a reason completely unrelated to evaluating researchers (impact factors started out as a way for librarians to rank journals, not for administration to rank faculty!).
2. Because a metric is simple, it becomes widespread.
3. Because a metric is both simple and widespread, it makes it easy to compare people in wildly different circumstances (whether or not that comparison is valid or defensible!), so it rapidly evolves from being trivia about a researcher, to being a defining character of a researcher – at least when it comes to institutional evaluation.

If that’s true, then any metric aimed for wide-scale adoption needs to be as simple as possible. I can explain the h-index or i10 index in one sentence. “Citation count” is self-explanatory. The fundamentals of the impact factor can be grasped in about 30 seconds, and even the complicated backstory can be conveyed in about 5 minutes.
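
To give a sense of just how little machinery those metrics need, here’s a quick illustrative sketch in Python (the citation counts are invented, and the code is just a toy, not anything any index provider actually runs):

    def h_index(citations):
        # Largest h such that at least h papers each have at least h citations.
        counts = sorted(citations, reverse=True)
        h = 0
        for rank, cites in enumerate(counts, start=1):
            if cites >= rank:
                h = rank
            else:
                break
        return h

    def i10_index(citations):
        # Number of papers with at least 10 citations.
        return sum(1 for cites in citations if cites >= 10)

    print(h_index([48, 33, 20, 15, 9, 6, 3, 1]))    # 6
    print(i10_index([48, 33, 20, 15, 9, 6, 3, 1]))  # 4

A dozen lines, no judgement calls, no committee. That is the kind of simplicity any new metric is competing against.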

In addition to being simple, the metric needs to work the same way across institutions and disciplines. I can compare my h-index with that of an endowed chair at Cambridge, a curator at a small regional museum, and a postdoc at Podunk State, and it Just Works without any tinkering or subjective decisions on the part of the user (other than What Counts – but that affects all metrics dealing with publications, so no one metric is better off than any other on that score).

I fear that the LWM as conceived in Taylor (2016) is doomed, for the following reasons:

  • It’s too complex. It would probably be doomed if it had just a single term with a constant and an exponent (which I realize would defeat the purpose of having either a constant or an exponent), because that’s more math than either an impact factor or an h-index requires (perceptually, anyway – in the real world, most people’s eyes glaze over when the exponents come out).
  • Worse, it requires loads of subjective decisions and assigning importance on the part of the users.
  • And fatally, it would require a mountain of committee work to sort that out. I doubt if I could get the faculty in just one department to agree on a set of terms, constants, and exponents for the LWM, much less a college, much less a university, much less all of the universities, museums, government and private labs, and other places where research is done. And without the promise of universal applicability, there’s no incentive for any institution to put itself through the hell of work it would take to implement.

Really, the only way I think the LWM could get into place is by fiat, by a government body. If the EPA comes up with a more complicated but also more accurate way to measure, say, airborne particle output from car exhausts, they can theoretically say to the auto industry, “Meet this standard or stop selling cars in the US” (I know there’s a lot more legislative and legal push and pull than that, but it’s at least possible). And such a standard might be adopted globally, either because it’s a good idea so it spreads, or because the US strong-arms other countries into following suit.

Even if I trusted the US Department of Education to fill in all of the blanks for an LWM, I don’t know that they’d have the same leverage to get it adopted. I doubt that the DofE has enough sway to get it adopted even across all of the educational institutions. Who would want that fight, for such a nebulous pay-off? And even if it could be successfully inflicted on educational institutions (which sounds negative, but that’s precisely how the institutions would see it), what about the numerous and in some cases well-funded research labs and museums that don’t fall under the DofE’s purview? And that’s just in the US. The culture of higher education and scholarship varies a lot among countries. Which may be why the one-size-fits-all solutions suck – I am starting to wonder if a metric needs to be broken, to be globally applicable.

The problem here is that the user base is so diverse that the only way metrics get adopted is voluntarily. So the challenge for any LWM is to be:

  1. Better than existing metrics – this is the easy part – and,
  2. Simple enough to be both easily grasped and applied with minimal effort. In Malcolm Gladwell’s Tipping Point terms, it needs to be “sticky”. Although a better adjective for passage through the intestines of academia might be “smooth” – that is, having no rough edges, like exponents or overtly subjective decisions*, that would cause it to snag.

* Calculating an impact factor involves plenty of subjective decisions, but it has the advantages that (a) the users can pretend otherwise, because (b) ISI does the ‘work’ for them.

At least from my point of view, the LWM as Mike has conceived it is awesome and possibly unimprovable on the first point (in that practically any other metric could be seen as a degenerate case of the LWM), but dismal and possibly pessimal on the second one, in that it requires mounds of subjective decision-making to work at all. You can’t even get a default number and then iteratively improve it without investing heavily in advance.
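
(To make that degenerate-case point concrete: a raw citation count is just the LWM with a single metric and k1 = 1, e1 = 1, and the naive “add everything up” super-metric from Mike’s paper is the LWM with every k set to 1 and every e set to 1. The generality is real; the cost is all those parameters.)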

An interesting thought experiment would be to approach the problem from the other side: invent as many new simple metrics as possible, and then see if any of them offer advantages over the existing ones. Although I have a feeling that people are already working on that, and have been for some time.

Simple, broken metrics like impact factor are the prions of scholarship. Yes, viruses are more versatile and cells more versatile still, by orders of magnitude, but compared to prions, cells take an awesome amount of effort to build and maintain. If you just want to infect someone and you don’t care how, prions are very hard to beat. And they’re so subtle in their machinations that we only became aware of them comparatively recently – much like the emerging problems with “classical” (i.e., non-alt) metrics.

I’d love to be wrong about all of this. I proposed the strongest criticism of the LWM I could think of, in hopes that someone would come along and tear it down. Please start swinging.

You’ll remember that in the last installment (before Matt got distracted and wrote about archosaur urine), I proposed a general schema for aggregating scores in several metrics, terming the result an LWM or Less Wrong Metric. Given a set of n metrics that we have scores for, we introduce a set of n exponents e1, e2, …, en which determine how we scale each kind of score as it increases, and a set of n factors k1, k2, …, kn which determine how heavily we weight each scaled score. Then we sum the scaled results:

LWM = k1·x1^e1 + k2·x2^e2 + … + kn·xn^en

“That’s all very well”, you may ask, “But how do we choose the parameters?”

Here’s what I proposed in the paper:

One approach would be to start with subjective assessments of the scores of a body of researchers – perhaps derived from the faculty of a university confidentially assessing each other. Given a good-sized set of such assessments, together with the known values of the metrics x1, x2, …, xn for each researcher, techniques such as simulated annealing can be used to derive the values of the parameters k1, k2, …, kn and e1, e2, …, en that yield an LWM formula best matching the subjective assessments.

Where the results of such an exercise yield a formula whose results seem subjectively wrong, this might flag a need to add new metrics to the LWM formula: for example, a researcher might be more highly regarded than her LWM score indicates because of her fine record of supervising doctoral students who go on to do well, indicating that some measure of this quality should be included in the LWM calculation.

I think as a general approach that is OK: start with a corpus of well understood researchers, or papers, whose value we’ve already judged a priori by some means; then pick the parameters that best approximate that judgement; and let those parameters control future automated judgements.
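
To make that concrete, here’s a minimal sketch in Python of the kind of fitting step I have in mind. Everything in it (the toy annealing schedule, the starting values, the helper names) is an illustrative assumption on my part, not anything prescribed in the paper:

    import math
    import random

    def lwm(metrics, ks, es):
        # LWM = k1*x1^e1 + k2*x2^e2 + ... + kn*xn^en
        return sum(k * (x ** e) for x, k, e in zip(metrics, ks, es))

    def fit_lwm_parameters(researchers, target_scores, n_steps=20000, temp=1.0, cooling=0.9995):
        # Toy simulated annealing: nudge the k and e parameters at random,
        # keeping changes that bring the LWM scores closer to the confidential
        # subjective assessments (and occasionally keeping worse ones, to
        # escape local minima).
        n = len(researchers[0])
        ks, es = [1.0] * n, [1.0] * n

        def error(ks, es):
            # Sum of squared differences between LWM and the subjective scores.
            return sum((lwm(r, ks, es) - t) ** 2
                       for r, t in zip(researchers, target_scores))

        best = (list(ks), list(es), error(ks, es))
        current_err = best[2]
        for _ in range(n_steps):
            cand_ks, cand_es = list(ks), list(es)
            i = random.randrange(n)
            if random.random() < 0.5:
                cand_ks[i] = max(0.0, cand_ks[i] + random.gauss(0, 0.1))
            else:
                cand_es[i] = max(0.1, cand_es[i] + random.gauss(0, 0.05))
            cand_err = error(cand_ks, cand_es)
            if cand_err < current_err or random.random() < math.exp((current_err - cand_err) / temp):
                ks, es, current_err = cand_ks, cand_es, cand_err
                if current_err < best[2]:
                    best = (list(ks), list(es), current_err)
            temp *= cooling
        return best[0], best[1]

Feed it the known metric values for, say, fifty researchers plus their confidential peer scores, and it hands back k and e values that can then be applied to everyone else.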

The problem, really, is how we make that initial judgement. In the scenario I originally proposed, where, say, the 50 members of a department each assign a confidential numeric score to all the others, you can rely to some degree on the wisdom of crowds to give a reasonable judgement. But I don’t know how politically difficult it would be to conduct such an exercise. Even if the individual scorers were anonymised, the person collating the data would know the total scores awarded to each person, and it’s not hard to imagine that data being abused. In fact, it’s hard to imagine it not being abused.

In other situations, the value of the subjective judgement may be close to zero anyway. Suppose we wanted to come up with an LWM that indicates how good a given piece of research is. We choose LWM parameters based on the scores that a panel of experts assign to a corpus of existing papers, and derive our parameters from that. But we know that experts are really bad at assessing the quality of research. So what would our carefully parameterised LWM be approximating? Only the flawed judgement of flawed experts.

Perhaps this points to an even more fundamental problem: do we even know what “good research” looks like?

It’s a serious question. We all know that “research published in high-Impact Factor journals” is not the same thing as good research. We know that “research with a lot of citations” is not the same thing as good research. For that matter, “research that results in a medical breakthrough” is not necessarily the same thing as good research. As the new paper points out:

If two researchers run equally replicable tests of similar rigour and statistical power on two sets of compounds, but one of them happens to have in her batch a compound that turns out to have useful properties, should her work be credited more highly than the similar work of her colleague?

What, then? Are we left only with completely objective measurements, such as statistical power, adherence to the COPE code of conduct, open-access status, or indeed correctness of spelling?

If we accept that (and I am not arguing that we should, at least not yet), then I suppose we don’t even need an LWM for research papers. We can just count these objective measures and call it done.

I really don’t know what my conclusions are here. Can anyone help me out?

I said last time that my new paper on Better ways to evaluate research and researchers proposes a family of Less Wrong Metrics, or LWMs for short, which I think would at least be an improvement on the present ubiquitous use of impact factors and H-indexes.

What is an LWM? Let me quote the paper:

The Altmetrics Manifesto envisages no single replacement for any of the metrics presently in use, but instead a palette of different metrics laid out together. Administrators are invited to consider all of them in concert. For example, in evaluating a researcher for tenure, one might consider H-index alongside other metrics such as number of trials registered, number of manuscripts handled as an editor, number of peer-reviews submitted, total hit-count of posts on academic blogs, number of Twitter followers and Facebook friends, invited conference presentations, and potentially many other dimensions.

In practice, it may be inevitable that overworked administrators will seek the simplicity of a single metric that summarises all of these.

This is a key problem of the world we actually live in. We often bemoan the fact that people evaluating research will apparently do almost anything rather than actually read the research. (To paraphrase Dave Barry, these are important, busy people who can’t afford to fritter away their time in competently and diligently doing their job.) There may be good reasons for this; there may only be bad reasons. But what we know for sure is that, for good reasons or bad, administrators often do want a single number. They want it so badly that they will seize on the first number that comes their way, even if it’s as horribly flawed as an impact factor or an H-index.

What to do? There are two options. One is to change the way these overworked administrators function, to force them to read papers and consider a broad range of metrics — in other words, to change human nature. Yeah, it might work. But it’s not where the smart money is.

So perhaps the way to go is to give these people a better single number. A less wrong metric. An LWM.

Here’s what I propose in the paper.

In practice, it may be inevitable that overworked administrators will seek the simplicity of a single metric that summarises all of these. Given a range of metrics x1, x2, …, xn, there will be a temptation to simply add them all up to yield a “super-metric”, x1 + x2 + … + xn. Such a simply derived value will certainly be misleading: no-one would want a candidate with 5,000 Twitter followers and no publications to appear a hundred times stronger than one with an H-index of 50 and no Twitter account.

A first step towards refinement, then, would weight each of the individual metrics using a set of constant parameters k1, k2, …, kn to be determined by judgement and experiment. This yields another metric, k1·x1 + k2·x2 + … + kn·xn. It allows the down-weighting of less important metrics and the up-weighting of more important ones.

However, even with well-chosen k parameters, this better metric has problems. Is it really a hundred times as good to have 10,000 Twitter followers as to have 100? Perhaps we might decide that it’s only ten times as good – that the value of a Twitter following scales with the square root of the count. Conversely, in some contexts at least, an H-index of 40 might be more than twice as good as one of 20. In a search for a candidate for a senior role, one might decide that the value of an H-index scales with the square of the value; or perhaps it scales somewhere between linearly and quadratically – with H-index^1.5, say. So for full generality, the calculation of the “Less Wrong Metric”, or LWM for short, would be configured by two sets of parameters: factors k1, k2, …, kn, and exponents e1, e2, …, en. Then the formula would be:

LWM = k1·x1^e1 + k2·x2^e2 + … + kn·xn^en

So that’s the idea of the LWM — and you can see now why I refer to this as a family of metrics. Given n metrics that you’re interested in, you pick 2n parameters to combine them with, and get a number that to some degree measures what you care about.
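
To make that concrete, here’s a tiny illustrative sketch in Python. The two metrics, the weights and the exponents are all made-up examples, not recommendations:

    def lwm(metrics, ks, es):
        # LWM = k1*x1^e1 + k2*x2^e2 + ... + kn*xn^en
        return sum(k * (x ** e) for x, k, e in zip(metrics, ks, es))

    # Two example metrics: H-index and Twitter followers.
    # H-index is weighted heavily and scaled super-linearly (e = 1.5);
    # followers are down-weighted and square-rooted (e = 0.5).
    ks = [10.0, 0.1]
    es = [1.5, 0.5]

    print(lwm([50, 0], ks, es))    # H-index 50, no Twitter: about 3535
    print(lwm([0, 5000], ks, es))  # 5,000 followers, no papers: about 7

With those (entirely arbitrary) parameters, the candidate with 5,000 Twitter followers and no publications no longer looks a hundred times stronger than the one with an H-index of 50, which is exactly the failure of the naive super-metric we started from.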

(How do you choose your 2n parameters? That’s the subject of the next post. Or, as before, you can skip ahead and read the paper.)

References