## Choosing parameters for the Less Wrong Metric (LWM)

### January 29, 2016

You’ll remember that in the last installment (before Matt got distracted and wrote about archosaur urine), I proposed a general schema for aggregating scores in several metrics, terming the result an LWM or Less Wrong Metric. Given a set of $n$ metrics that we have scores for, we introduce a set of $n$ exponents $e_i$ which determine how we scale each kind of score as it increases, and a set of $n$ factors $k_i$ which determine how heavily we weight each scaled score. Then we sum the scaled results:

$$\mathrm{LWM} = k_1 x_1^{e_1} + k_2 x_2^{e_2} + \dots + k_n x_n^{e_n}$$
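As a minimal sketch of that sum in code (the function name `lwm`, the choice of two metrics, and every number below are made up purely for illustration, not taken from the paper):

```python
from typing import Sequence


def lwm(scores: Sequence[float],
        factors: Sequence[float],
        exponents: Sequence[float]) -> float:
    """Return the sum of k_i * x_i ** e_i over all metrics."""
    if not (len(scores) == len(factors) == len(exponents)):
        raise ValueError("scores, factors and exponents must have the same length")
    return sum(k * x ** e for x, k, e in zip(scores, factors, exponents))


# Hypothetical researcher with two metrics: 16 papers and 100 citations.
# Papers count linearly (e = 1) with weight 2; citations are damped by a
# square root (e = 0.5) with weight 1. These parameters are arbitrary.
example = lwm(scores=[16, 100], factors=[2.0, 1.0], exponents=[1.0, 0.5])
print(example)  # 42.0
```

An exponent below 1 damps a metric as it grows, so the hundredth citation adds less than the first; the factor then sets how much the damped value counts against the other terms.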

“That’s all very well”, you may ask, “But how do we choose the parameters?”

Here’s what I proposed in the paper:

One approach would be to start with subjective assessments of the scores of a body of researchers – perhaps derived from the faculty of a university confidentially assessing each other. Given a good-sized set of such assessments, together with the known values of the metrics $x_1, x_2, \dots, x_n$ for each researcher, techniques such as simulated annealing can be used to derive the values of the parameters $k_1, k_2, \dots, k_n$ and $e_1, e_2, \dots, e_n$ that yield an LWM formula best matching the subjective assessments.

Where the results of such an exercise yield a formula whose results seem subjectively wrong, this might flag a need to add new metrics to the LWM formula: for example, a researcher might be more highly regarded than her LWM score indicates because of her fine record of supervising doctoral students who go on to do well, indicating that some measure of this quality should be included in the LWM calculation.

I think as a general approach that is OK: start with a corpus of well understood researchers, or papers, whose value we’ve already judged a priori by some means; then pick the parameters that best approximate that judgement; and let those parameters control future automated judgements.
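To make the fitting step concrete, here is a rough sketch of simulated annealing over the $2n$ parameters. Everything specific in it is an assumption of mine rather than anything from the paper: the mean-squared-error loss, the linear cooling schedule, the Gaussian step size, the constraint that parameters stay non-negative, and the toy data.

```python
import math
import random


def lwm(scores, factors, exponents):
    """Sum of k_i * x_i ** e_i over all metrics."""
    return sum(k * x ** e for x, k, e in zip(scores, factors, exponents))


def loss(params, metrics, targets):
    """Mean squared gap between LWM scores and the subjective scores."""
    n = len(params) // 2
    factors, exponents = params[:n], params[n:]
    return sum((lwm(row, factors, exponents) - t) ** 2
               for row, t in zip(metrics, targets)) / len(targets)


def anneal(metrics, targets, steps=20000, start_temp=1.0, seed=0):
    """Search for factors and exponents by simulated annealing."""
    rng = random.Random(seed)
    n = len(metrics[0])
    current = [1.0] * (2 * n)            # start with every k and e equal to 1
    current_loss = loss(current, metrics, targets)
    best, best_loss = list(current), current_loss
    for step in range(steps):
        temp = start_temp * (1 - step / steps) + 1e-9   # linear cooling
        candidate = list(current)
        i = rng.randrange(2 * n)                        # perturb one parameter
        candidate[i] = max(0.0, candidate[i] + rng.gauss(0, 0.1))
        cand_loss = loss(candidate, metrics, targets)
        delta = cand_loss - current_loss
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_loss = candidate, cand_loss
            if current_loss < best_loss:
                best, best_loss = list(current), current_loss
    return best[:n], best[n:], best_loss


# Toy data: three hypothetical researchers, two metrics each, plus the
# subjective score each was given. All values are invented.
metrics = [[10, 100], [20, 50], [5, 400]]
targets = [30.0, 47.0, 30.0]
factors, exponents, final_loss = anneal(metrics, targets, steps=5000)
print(factors, exponents, final_loss)
```

With real data the starting temperature would need to be scaled to the size of the loss, and a held-out set of researchers would be needed to check that the fitted parameters generalise rather than merely memorise the assessments.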

The problem, really, is how we make that initial judgement. In the scenario I originally proposed, where say the 50 members of a department each assign a confidential numeric score to all the others, you can rely to some degree on the wisdom of crowds to give a reasonable judgement. But I don’t know how politically difficult it would be to conduct such an exercise. Even if the individual scorers were anonymised, the person collating the data would know the total scores awarded to each person, and it’s not hard to imagine that data being abused. In fact, it’s hard to imagine it not being abused.

In other situations, the value of the subjective judgement may be close to zero anyway. Suppose we wanted to come up with an LWM that indicates how good a given piece of research is. We choose LWM parameters based on the scores that a panel of experts assign to a corpus of existing papers, and derive our parameters from that. But we know that experts are really bad at assessing the quality of research. So what would our carefully parameterised LWM be approximating? Only the flawed judgement of flawed experts.

Perhaps this points to an even more fundamental problem: do we even know what “good research” looks like?

It’s a serious question. We all know that “research published in high-Impact Factor journals” is not the same thing as good research. We know that “research with a lot of citations” is not the same thing as good research. For that matter, “research that results in a medical breakthrough” is not necessarily the same thing as good research. As the new paper points out:

If two researchers run equally replicable tests of similar rigour and statistical power on two sets of compounds, but one of them happens to have in her batch a compound that turns out to have useful properties, should her work be credited more highly than the similar work of her colleague?

What, then? Are we left only with completely objective measurements, such as statistical power, adherence to the COPE code of conduct, open-access status, or indeed correctness of spelling?

If we accept that (and I am not arguing that we should, at least not yet), then I suppose we don’t even need an LWM for research papers. We can just count these objective measures and call it done.

I really don’t know what my conclusions are here. Can anyone help me out?

### 17 Responses to “Choosing parameters for the Less Wrong Metric (LWM)”

1. Frosted Flake Says:

Who watches the watchers?

I mean, who decides who decides? Who rates the raters? And what makes them qualified? And who decides that?

Read this again. With the thought foremost in mind that a good subset of the population thinks scientists are freeloaders who really should be doing something useful. Like making them a sandwich. Do you really want to hand them this tool?

2. Mike Taylor Says:

In general, the answer to this kind of question is a pagerank-like algorithm in which everyone can evaluate everyone else but the value of someone’s evaluation of me is determined by everyone’s evaluation of them.

How that would look in practice, I don’t know.
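One possible shape for it, offered only as a sketch: a damped power iteration of the kind used in PageRank, applied to a matrix of peer scores. The function name `peer_rank`, the damping value of 0.85, the rule that self-ratings are ignored, and the example scores are all assumptions made for illustration.

```python
def peer_rank(ratings, damping=0.85, iterations=100):
    """Weight each person's standing by the standing of those who rated them.

    ratings[i][j] is the non-negative score that person i gives person j.
    Self-ratings are ignored. Needs at least two people.
    """
    n = len(ratings)
    # Normalise each rater's scores so that every rater hands out exactly
    # one unit of credit, however generous or stingy their raw scores are.
    normalised = []
    for i, row in enumerate(ratings):
        total = sum(v for j, v in enumerate(row) if j != i)
        if total > 0:
            normalised.append([0.0 if j == i else v / total
                               for j, v in enumerate(row)])
        else:
            # A rater who scored nobody spreads credit evenly over the others.
            normalised.append([0.0 if j == i else 1.0 / (n - 1)
                               for j in range(n)])
    weights = [1.0 / n] * n
    for _ in range(iterations):
        weights = [
            (1 - damping) / n
            + damping * sum(weights[i] * normalised[i][j] for i in range(n))
            for j in range(n)
        ]
    return weights


# Three hypothetical colleagues: the first two rate each other highly and
# rate the third poorly; the third rates both of the others highly.
scores = [
    [0, 5, 1],
    [4, 0, 1],
    [5, 5, 0],
]
print(peer_rank(scores))
```

The weights always sum to 1, so they can be read as shares of the department's total esteem. Nothing here addresses collusion or the confidentiality problem raised in the post.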

3. I think that one can come to one firm conclusion from your work: that it’s impossible to summarise the value of a person in a single number, and impossible to predict the future value of individual papers.

4. […] do you choose your 2n parameters? That’s the subject of the next post. Or, as before, you can skip ahead and read the […]

5. Mike Taylor Says:

Well, I agree entirely of course. The problem is that people are going to do this, whether the number is meaningful or not. (If the history of impact factors has taught us nothing else, surely it has taught us that.) Which is why it’s of some value to make the number less wrong.

6. Michael Richmond Says:

Extreme position A is that every scientist reads many papers by many scientists and ranks every one of them in subjective ways, and then those rankings are combined in some PageRank-like way. Call this “the Google Model”.

Extreme position B is that one scientist reads all the papers by all other scientists and ranks them in subjective ways, and those rankings alone are adopted. Call this “the Yahoo Model” (those with long memories will understand).

I don’t think either of these models is more likely than the current system in which objective factors (citation count, number of figures, font size) are combined in arbitrary manners and put into spreadsheets by non-scientists. But if you are seeking to identify the best factors for ranking papers, _as judged by scientist(s)_, then I think the Yahoo Model is more likely to yield fruit.

So, it’s up to the community to identify some person who has the time and inclination to read nearly every paper in some field and then produce numerical values by which those papers can be ranked. This is at least easier than identifying tens or hundreds of such people, which is required for the Google Model.

Volunteers would be welcome, of course. Until some one(s) start reading and producing numerical values, this discussion is moot.

7. Mike Taylor Says:

I don’t understand why you think that the Yahoo! model would be more likely to yield good results here — especially in light of the fact that the Google model did in fact kick Yahoo!’s arse right over the horizon when the two were up against each other in the web-navigation arena.

8. Matt Wedel Says:

I admire the work you’ve done here, Mike, but I’m pretty pessimistic about its future.

I think there is a widespread misapprehension that people and institutions were looking for good metrics, like the LWM, and we ended up with things like impact factors and citation counts because no-one had thought up anything better. Implying a temporal sequence of:

1. Deliberately looking for a metric.
2. Finding some.
3. Trying to make those better, or replace them with better ones.

I’m pretty sure this is exactly backwards: the metrics that we use to evaluate researchers are mostly simple – easy to explain, easy to count (the hanky-panky behind impact factors notwithstanding) – and therefore they spread like wildfire, and therefore they became used in evaluation. Implying a very different sequence:

1. Metric is invented, often for a reason completely unrelated to evaluating researchers.
2. Because metric is simple, it becomes widespread.
3. Because metric is both simple and widespread, it makes it easy to compare people in wildly different circumstances (whether or not that comparison is valid or defensible!), so it rapidly evolves from being trivia about a researcher, to being a defining character of a researcher – at least when it comes to institutional evaluation.

If that’s true, then any metric aimed for wide-scale adoption needs to be as simple as possible. I can explain the h-index or i-10 index in one sentence. “Citation count” is self-explanatory. The fundamentals of the impact factor can be grasped in about 30 seconds, and even the complicated backstory can be conveyed in about 5 minutes.
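To show just how little machinery these metrics need, here is a sketch of both using their standard definitions: the h-index is the largest h such that h papers have at least h citations each, and the i10-index is the number of papers with at least 10 citations. The citation counts are invented.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    # In descending order, the condition holds for a prefix of the list,
    # so counting the ranks that satisfy it gives h.
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for c in citations if c >= 10)


# Hypothetical citation counts for six papers.
papers = [25, 12, 10, 6, 3, 0]
print(h_index(papers))    # 4
print(i10_index(papers))  # 3
```

Neither function takes a single tunable parameter, which is exactly the contrast with the $2n$ parameters of the LWM.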

And, crucially, the metric needs to work the same way across institutions and disciplines. I can compare my h-index with that of an endowed chair at Cambridge, a curator at a small regional museum, and a postdoc at Podunk State, and it Just Works* without any tinkering or subjective decisions on the part of the user (other than What Counts – but that affects all metrics dealing with publications, so no one metric is better off than any other on that score).

IMHO, the LWM as conceived here is doomed, for the following reasons. It’s too complex. It would probably be doomed if it had just a single term with a constant and an exponent (which I realize would defeat the purpose of having either a constant or an exponent), because that’s more math than either an impact factor or an h-index requires (perceptually, anyway – in the real world, most people’s eyes glaze over when the exponents come out). Worse, it requires loads of subjective decisions and assigning importance on the part of the users. And fatally, it would require a mountain of committee work to sort that out. I doubt if I could get the faculty in just one department to agree on a set of terms, constants, and exponents for the LWM, much less a college, much less a university, much less all of the universities, museums, government and private labs, and other places where research is done. And without the promise of universal applicability, there’s no incentive for any institution to put itself through the hell of work it would take to implement.

With all of that said, I support the notion 100%, and I’d really love to be proven wrong.

9. Mike Taylor Says:

Well, Matt, I would really like to disagree with some of that.

Really like to.

10. Matt Wedel Says:

HHOS. I proposed the strongest criticism I could think of, in hopes that someone would come along and tear it down.

Really, the only way I think the LWM could get into place is by fiat, by a government body. If the EPA comes up with a more complicated but also more accurate way to measure, say, airborne particle output from car exhausts, they can theoretically say to the auto industry, “Meet this standard or stop selling cars in the US” (I know there’s a lot more legislative and legal push and pull than that, but it’s at least possible). And such a standard might be adopted globally, either because it’s a good idea so it spreads, or because the US strong-arms other countries into following suit.

Even if I trusted the US Department of Education to fill in all of the blanks for an LWM, I don’t know that they’d have the same leverage to get it adopted. I doubt that the DofE has enough sway to get it adopted even across all of the educational institutions. Who would want that fight, for such a nebulous pay-off? And even if it did, what about the numerous and in some cases well-funded research labs and museums that don’t fall under the DofE’s purview? And that’s just in the US. The culture of higher education and scholarship varies a lot among countries (which may be why the one-size-fits-all solutions suck).

The problem here is that the user base is so diverse that the only way metrics get adopted is voluntarily. So the challenge for any LWM is to be:

1. Better than existing metrics – this is the easy part – and,

2. Simple enough to be both easily grasped, and applied with minimal effort. In Malcolm Gladwell Tipping Point terms, it needs to be “sticky”. Although a better adjective for passage through the intestines of academia might be “smooth” – that is, having no rough edges, like exponents or overtly subjective decisions*, that would cause it to snag.

* Calculating an impact factor involves plenty of subjective decisions, but it has the advantages that (a) the users can pretend otherwise, because (b) ISI does the ‘work’ for them.

At least from my point of view, the LWM as you’ve conceived it is awesome and possibly unimprovable on the first point (in that practically any other metric could be seen as a degenerate case of the LWM) BUT dismal and possibly pessimal on the second one, in that it requires mounds of subjective decision-making to work at all. You can’t even get a default number and then iteratively improve it without investing fully in advance.

An interesting thought experiment would be to approach the problem from the other side: invent as many new simple metrics as possible, and then see if any of them offer advantages over the existing ones. Although I have a feeling that people are already working on that, and have been for some time.

Simple, broken metrics like impact factor are the prions of scholarship. Yes, viruses are more versatile and cells more versatile still, by orders of magnitude, but compared to prions, cells take an awesome amount of effort to build and maintain. If you just want to infect someone and you don’t care how, prions are very hard to beat. And they’re so subtle in their machinations that we only became aware of them comparatively recently – much like the emerging problems with “classical” (e.g., non-alt) metrics.

11. 1. Arrow’s theorem
    2. Better perhaps to make a sample of researchers that are clearly differentiated (Fields medallists vs me, for instance, and some people scattered on either side of me) where one can make a fairly clear call and fit the data to that. No one wants to ask whether Atiyah or Bhargava is the better mathematician. Then one can sample more, rate them by experts and then see where the metric puts them.
    3. The previous point assumes that one wants to do this and thinks it’s better than the alternative (IF, h-values)
    4. Something something machine learning.

12. LeeB Says:

If machines can learn how to play Go at a high level they can probably learn how to rate researchers.
However the learning could be complicated if you can’t tell the machine what are good and poor researchers.
I guess Nobel prizes, Fields medals etc. would help in recognising the best researchers, and a study of the recipients’ early work compared to that of other researchers at the beginning of their careers might be interesting.

Of course if the machine’s predictive powers got really good, its predictions might be used by universities to pick researchers to give jobs to; I am not sure that the people applying for the jobs would necessarily be happy with that.

LeeB.

13. […] Mike asked me to scrape a couple of comments on his last post – this one and this one – and turn them into a post of their own. I’ve edited them lightly to […]

14. Mike Taylor Says:

However the learning could be complicated if you can’t tell the machine what are good and poor researchers.

That is the key issue right there, in a nutshell. I think the problem runs much deeper than our difficulty in teaching machines what a good researcher looks like. More fundamentally, we don’t know.

I guess Nobel prizes, Fields medals etc. would help in recognising the best researchers.

Not necessarily. At least some prominent and respected researchers think that Nobel prizes are divisive and uninformative: “The way prizes like the Nobel give disproportionate credit to a handful of individuals is an injustice to the way science really works. When accolades are given exclusively to only a few of the people who participated in an important discovery.”

15. LeeB Says:

Interesting.
Then we believe that there is such a thing as a good researcher but have difficulty in recognising or describing one.
Perhaps only history is able to tell in retrospect what was a good researcher by recognising those who significantly advanced or changed the course of their field of study.
The closer you are to the person or the event the less able you are to recognise their significance.

LeeB.

16. […] and the introduction of LWM (Less Wrong Metrics) by Mike Taylor. You can find the posts here, here, here, and […]

17. cf: people who miss out on a Fields medal because they turn 41 a day too early, but who did the work that would land them it between age 36 and 40, and so miss out on the previous round.
