Simple metrics in a complex world: Matt’s pessimistic take on the LWM
January 30, 2016
[Note: Mike asked me to scrape a couple of comments on his last post – this one and this one – and turn them into a post of their own. I’ve edited them lightly to hopefully improve the flow, but I’ve tried not to tinker with the guts.]
This is the fourth in a series of posts on how researchers might better be evaluated and compared. In the first post, Mike introduced his new paper and described the scope and importance of the problem. Then in the next post, he introduced the idea of the LWM, or Less Wrong Metric, and the basic mathematical framework for calculating LWMs. Most recently, Mike talked about choosing parameters for the LWM, and drilled down to a fundamental question: (how) do we identify good research?
Let me say up front that I am fully convinced of the importance of evaluating researchers fairly. It is a question of direct and timely importance to me. I serve on the Promotion & Tenure committees of two colleges at Western University of Health Sciences, and I want to make good decisions that can be backed up with evidence. But anyone who has been in academia for long knows of people who have had their careers mangled by getting caught in institutional machinery that is not well suited for fairly evaluating scholarship. So I desperately want better metrics to catch on, to improve my own situation and those of researchers everywhere.
For all of those reasons and more, I admire the work that Mike has done in conceiving the LWM. But I’m pretty pessimistic about its future.
I think there is a widespread misapprehension that we got here because people and institutions were looking for good metrics, like the LWM, and we ended up with things like impact factors and citation counts because no-one had thought up anything better. Implying a temporal sequence of:
1. Deliberately looking for metrics to evaluate researchers.
2. Finding some.
3. Trying to improve those metrics, or replace them with better ones.
I’m pretty sure this is exactly backwards: the metrics that we use to evaluate researchers are mostly simple – easy to explain, easy to count (the hanky-panky behind impact factors notwithstanding) – and therefore they spread like wildfire, and therefore they became used in evaluation. Implying a very different sequence:
1. A metric is invented, often for a reason completely unrelated to evaluating researchers (impact factors started out as a way for librarians to rank journals, not for administration to rank faculty!).
2. Because a metric is simple, it becomes widespread.
3. Because a metric is both simple and widespread, it makes it easy to compare people in wildly different circumstances (whether or not that comparison is valid or defensible!), so it rapidly evolves from being trivia about a researcher, to being a defining character of a researcher – at least when it comes to institutional evaluation.
If that’s true, then any metric aimed for wide-scale adoption needs to be as simple as possible. I can explain the h-index or i10 index in one sentence. “Citation count” is self-explanatory. The fundamentals of the impact factor can be grasped in about 30 seconds, and even the complicated backstory can be conveyed in about 5 minutes.
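To underline how little machinery these metrics need, here is a minimal sketch in Python (purely illustrative; the citation counts are invented):

```python
def h_index(citations):
    """Largest h such that h of the papers have at least h citations each."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for c in citations if c >= 10)

# Hypothetical citation counts for one researcher's papers.
papers = [52, 30, 17, 12, 9, 4, 4, 1, 0]
print(h_index(papers))    # 5: five papers have at least 5 citations each
print(i10_index(papers))  # 4: four papers have at least 10 citations
```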
In addition to being simple, the metric needs to work the same way across institutions and disciplines. I can compare my h-index with that of an endowed chair at Cambridge, a curator at a small regional museum, and a postdoc at Podunk State, and it Just Works without any tinkering or subjective decisions on the part of the user (other than deciding What Counts – but that affects all metrics dealing with publications, so no one metric is better off than any other on that score).
I fear that the LWM as conceived in Taylor (2016) is doomed, for the following reasons:
- It’s too complex. It would probably be doomed if it had just a single term with a constant and an exponent (which I realize would defeat the purpose of having either a constant or an exponent), because that’s more math than either an impact factor or an h-index requires (as perceived, anyway – in the real world, most people’s eyes glaze over when the exponents come out; see the sketch just after this list).
- Worse, it requires loads of subjective decisions and assigning importance on the part of the users.
- And fatally, it would require a mountain of committee work to sort that out. I doubt if I could get the faculty in just one department to agree on a set of terms, constants, and exponents for the LWM, much less a college, much less a university, much less all of the universities, museums, government and private labs, and other places where research is done. And without the promise of universal applicability, there’s no incentive for any institution to put itself through the hell of work it would take to implement.
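To make the objection concrete, here is a hedged toy version of the kind of calculation involved. The exact formulation is in Taylor (2016); this sketch only assumes the general shape described above – a sum of terms, each a metric value scaled by a constant and raised to an exponent – and every number in it is invented:

```python
# Illustrative only: the constants and exponents below would have to be chosen
# and defended by some evaluating body - which is exactly the committee work at issue.
TERMS = {
    # metric name: (constant k, exponent e)
    "publications":   (1.0, 1.0),
    "citations":      (0.5, 0.7),
    "grant_income_k": (0.2, 0.5),
}

def lwm(metrics, terms=TERMS):
    """Hypothetical LWM-style score: sum of k * value**e over the chosen terms."""
    return sum(k * metrics.get(name, 0) ** e for name, (k, e) in terms.items())

# A made-up researcher record.
print(round(lwm({"publications": 25, "citations": 400, "grant_income_k": 150}), 1))
```

Even this toy version needs three constants and three exponents to be agreed on before it produces a single number.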
Really, the only way I think the LWM could get into place is by fiat, by a government body. If the EPA comes up with a more complicated but also more accurate way to measure, say, airborne particle output from car exhausts, they can theoretically say to the auto industry, “Meet this standard or stop selling cars in the US” (I know there’s a lot more legislative and legal push and pull than that, but it’s at least possible). And such a standard might be adopted globally, either because it’s a good idea so it spreads, or because the US strong-arms other countries into following suit.
Even if I trusted the US Department of Education to fill in all of the blanks for an LWM, I don’t know that they’d have the same leverage to get it adopted. I doubt that the DofE has enough sway to get it adopted even across all of the educational institutions. Who would want that fight, for such a nebulous pay-off? And even if it could be successfully inflicted on educational institutions (which sounds negative, but that’s precisely how the institutions would see it), what about the numerous and in some cases well-funded research labs and museums that don’t fall under the DofE’s purview? And that’s just in the US. The culture of higher education and scholarship varies a lot among countries. Which may be why the one-size-fits-all solutions suck – I am starting to wonder if a metric needs to be broken, to be globally applicable.
The problem here is that the user base is so diverse that the only way metrics get adopted is voluntarily. So the challenge for any LWM is to be:
- Better than existing metrics – this is the easy part – and,
- Simple enough to be both easily grasped, and applied with minimal effort. In Malcolm Gladwell’s Tipping Point terms, it needs to be “sticky”. Although a better adjective for passage through the intestines of academia might be “smooth” – that is, having no rough edges, like exponents or overtly subjective decisions*, that would cause it to snag.
* Calculating an impact factor involves plenty of subjective decisions, but it has the advantages that (a) the users can pretend otherwise, because (b) ISI does the ‘work’ for them.
At least from my point of view, the LWM as Mike has conceived it is awesome and possibly unimprovable on the first point (in that practically any other metric could be seen as a degenerate case of the LWM), but dismal and possibly pessimal on the second one, in that it requires mounds of subjective decision-making to work at all. You can’t even get a default number and then iteratively improve it without investing heavily in advance.
An interesting thought experiment would be to approach the problem from the other side: invent as many new simple metrics as possible, and then see if any of them offer advantages over the existing ones. Although I have a feeling that people are already working on that, and have been for some time.
Simple, broken metrics like impact factor are the prions of scholarship. Yes, viruses are more versatile and cells more versatile still, by orders of magnitude, but compared to prions, cells take an awesome amount of effort to build and maintain. If you just want to infect someone and you don’t care how, prions are very hard to beat. And they’re so subtle in their machinations that we only became aware of them comparatively recently – much like the emerging problems with “classical” (i.e., non-alt) metrics.
I’d love to be wrong about all of this. I proposed the strongest criticism of the LWM I could think of, in hopes that someone would come along and tear it down. Please start swinging.
January 31, 2016 at 8:10 pm
I have read these two recent articles (Mike’s and Matt’s) with interest. I have to say I am not involved in or affected by this issue personally, but I can see that it is a very important one in academia, with huge impact on people’s lives.
The LWM is a lovely, fair idea. I do think that Matt’s critique is a cogent one, and that imposition by government (or internationally, cross-government) offers the best or only hope for LWM to be adopted.
February 3, 2016 at 9:40 am
I suppose where Matt and I are seeing this differently is that he’s assuming that LWMs must be imposed from the top down — which I agree seems very unlikely — whereas I see a more promising route being from the bottom up. With the increasing availability of many different metrics, including whole sets of them served via nice APIs at AltMetric and ImpactStory, I think there’s potential for people to run their own LWM experiments on publicly accessible data.
(I should totally do that. But I am so, so short of time. If anyone else wants to do this, please feel free to go ahead.)
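For anyone who does want a concrete toy to poke at, here is a rough, entirely hypothetical sketch of such an experiment. In practice the metric values would come from public sources (pulled by hand or via the Altmetric/ImpactStory APIs); here both they and the candidate parameter sets are simply invented, and the question is whether different parameter choices reorder the same people:

```python
# Hypothetical experiment: do different LWM parameter choices change the ranking?
# All values below are invented; real ones would come from publicly accessible data.
researchers = {
    "A": {"publications": 40, "citations": 900, "altmetric": 120},
    "B": {"publications": 12, "citations": 300, "altmetric": 600},
    "C": {"publications": 80, "citations": 250, "altmetric": 30},
}

candidate_params = {
    "citation-heavy":  {"publications": (1.0, 1.0), "citations": (1.0, 0.8), "altmetric": (0.1, 1.0)},
    "altmetric-heavy": {"publications": (1.0, 1.0), "citations": (0.2, 0.8), "altmetric": (1.0, 0.8)},
}

def lwm(metrics, params):
    """Sum of k * value**e over the chosen terms, as in the toy sketch above."""
    return sum(k * metrics.get(name, 0) ** e for name, (k, e) in params.items())

for label, params in candidate_params.items():
    ranking = sorted(researchers, key=lambda r: lwm(researchers[r], params), reverse=True)
    print(label, ranking)
```

With these made-up numbers the two parameter sets do not even agree on who comes first, which is exactly the kind of sensitivity such an experiment would probe.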
February 3, 2016 at 9:20 pm
I tried (without any success) to trigger a related kind of discussion at academia.SE: http://academia.stackexchange.com/questions/60774/how-should-we-evaluate-our-peers and http://academia.stackexchange.com/questions/61596/when-should-we-use-publication-list-for-peer-evaluation
But once one takes a more open point of view, trying to change the system rather than just adjust to it, there seems to be another approach besides designing a less wrong metric: mitigate the need for a metric by reducing the amount of evaluation. Less evaluation, done right.
This would follow from reducing the weight of grants in favour of direct recurrent lab funding; from mostly automating the careers of tenured faculty (if we can’t decide who is better, then let them all have pay raises at the same rate); and from basically leaving researchers as free as possible, with uniformly good working conditions, without conditioning any of that on evaluation. Of course there would be free riders, but at least they would not be incentivized to damage science in the process; of course there would still be evaluations, such as at hiring and tenure, but if we concentrate all our evaluation time on those, we should be able to do it relatively right, and f**** read the papers.
February 3, 2016 at 9:48 pm
I admit that I had missed that possibility – that there will be a sort of grass-roots program of LWM testing that might trickle upward into academia.
I was about to write, “It still seems like a lot of work to invest for an unknown payoff”, but then that is true for almost all research. It seems plausible that there are people out there for whom cranking LWMs may be as rewarding as looking at sauropod vertebrae is for me. If so, I wish them luck.
February 3, 2016 at 10:07 pm
Well, Benoit, I think you’re on to something. My feeling about almost all forms of evaluation is that we tend to overthink it and overcomplicate it, because really what you’re mostly looking for are outliers. Whether it’s students in a course or faculty in a department, IME the great mass of them are competent, and although they could all use pointers on how to improve, it’s a waste of time trying to sort them too finely. But it’s crucial to be able to identify the bottom 10-15%, who are either struggling or being a drag on the institution or both, and the top 10-15%, where the investment of some positive recognition and reward will often be returned manyfold. Obviously the cutoff points are arbitrary and will shift from cohort to cohort – the key thing is the philosophy.
And this is actually pretty easy to put into practice, because it’s usually not difficult to identify outliers – if you’re familiar with the cohort, you probably have a good idea of who they are before you even start the formal evaluation process (although the process must be rigorous enough to properly vet those ‘nominees’, and to catch any potentially excellent or disastrous candidates hidden in the big middle).
And it has the salutary effect of de-stressing the process for many of those being evaluated. In effect you are saying to the cohort, “Just go do your work and don’t worry too much about evaluation. We’ll only get on your case if you do really poorly. And if you have the drive, the chops, and the opportunities to excel, we’ll make sure that’s recognized, too.” IME, people work better under those conditions than when they are constantly worrying about how they stack up next to everyone around them.
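To put a toy number on that philosophy (the scores below are purely hypothetical, and the 15% figure is just the upper end of the range mentioned above):

```python
# Hypothetical: flag only the bottom and top ~15% of a cohort for closer attention,
# and leave the broad middle alone.
def flag_outliers(scores, frac=0.15):
    ordered = sorted(scores, key=scores.get)  # lowest overall score first
    n = max(1, round(len(ordered) * frac))
    return {"needs_support": ordered[:n], "recognize": ordered[-n:]}

cohort = {"p1": 62, "p2": 71, "p3": 55, "p4": 90, "p5": 68,
          "p6": 74, "p7": 31, "p8": 66, "p9": 83, "p10": 70}
print(flag_outliers(cohort))
# {'needs_support': ['p7', 'p3'], 'recognize': ['p9', 'p4']}
```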
February 6, 2016 at 1:19 pm
Thanks for that comment in the last paragraph, Matt. Made me feel a tiny bit better.