September 18, 2016
I have before me the reviews for a submission of mine, and the handling editor has provided an additional stipulation:
Authority and date should be provided for each species-level taxon at first mention. Please ensure that the nominal authority is also included in the reference list.
In other words, the first time I mention Diplodocus, I should say “Diplodocus Marsh 1878”; and I should add the corresponding reference to my bibliography.
What do we think about this?
I used to do this religiously in my early papers, just because it was the done thing. But then I started to think about it. To my mind, it used to make a certain amount of sense 30 years ago. But surely in 2016, if anyone wants to know about the taxonomic history of Diplodocus, they’re going to go straight to Wikipedia?
I’m also not sure what the value is in providing the minimal taxonomic-authority information rather than, say, morphological information. Anyone who wants to know what Diplodocus is would do much better to go to Hatcher 1901, so wouldn’t we serve readers better if we referred to “Diplodocus (Hatcher 1901)”?
Now that I come to think of it, I included “Giving the taxonomic authority after first use of each formal name” in my list of Idiot things that we we do in our papers out of sheer habit three and a half years ago.
Should I just shrug and do this pointless busywork to satisfy the handling editor? Or should I simply refuse to waste my time adding information that will be of no use to anyone?
- Hatcher, Jonathan B. 1901. Diplodocus (Marsh): its osteology, taxonomy and probable habits, with a restoration of the skeleton. Memoirs of the Carnegie Museum 1:1-63 and plates I-XIII.
- Marsh, O. C. 1878. Principal characters of American Jurassic dinosaurs, Part I. American Journal of Science, series 3 16:411-416.
August 31, 2016
As explained in careful detail over at Stupid Patent of the Month, Elsevier has applied for, and been granted, a patent for online peer-review. The special sauce that persuaded the US Patent Office that this is a new invention is cascading peer review — an idea so obvious and so well-established that even The Scholarly Kitchen was writing about it as a commonplace in 2010.
Well. What can this mean?
A cynic might think that this is the first step an untrustworthy company would take preparatory to filing a lot of time-wasting and resource-sapping nuisance lawsuits on its smaller, faster-moving competitors. They certainly have previous in the courts: remember that they have brought legal action against their own customers as well as threatening Academia.edu and of course trying to take Sci-Hub down.
Elsevier representatives are talking this down: Tom Reller has tweeted “There is no need for concern regarding the patent. It’s simply meant to protect our own proprietary waterfall system from being copied” — which would be fine, had their proprietary waterfall system not been itself copied from the ample prior art. Similarly, Alicia Wise has said on a public mailing list “People appear to be suggesting that we patented online peer review in an attempt to own it. No, we just patented our own novel systems.” Well. Let’s hope.
But Cathy Wojewodzki, on the same list, asked the key question:
I guess our real question is Why did you patent this? What is it you hope to market or control?
We await a meaningful answer.
Long time readers may remember the stupid contortions I had to go through in order to avoid giving the Geological Society copyright in my 2010 paper about the history of sauropod research, and how the Geol. Soc. nevertheless included a fraudulent claim of copyright ownership in the published version.
The way I left it back in 2010, my wife, Fiona, was the copyright holder. I should have fixed this a while back, but I now note for the record that she has this morning assigned copyright back to me:
From: Fiona Taylor <REDACTED>
To: Mike Taylor <email@example.com>
Date: 15 August 2016 at 11:03
I, Fiona J. Taylor of Oakleigh Farm House, Crooked End, Ruardean, GL17 9XF, England, hereby transfer to you, Michael P. Taylor of Oakleigh Farm House, Crooked End, Ruardean, GL17 9XF, England, the copyright of your article “Sauropod dinosaur research: a historical review”. This email constitutes a legally binding transfer.
Sorry to post something so boring, after so long a gap (nearly a month!) Hopefully we’ll have some more interesting things to say — and some time to say them — soon!
That paper that says women are better coders than men but are judged on their gender? It doesn’t say that at all
February 20, 2016
As a long-standing proponent of preprints, it bothers me that of all PeerJ’s preprints, by far the one that has had the most attention is Terrell et al. (2016)’s Gender bias in open source: Pull request acceptance of women versus men. Not helped by a misleading abstract, we’ve been getting headlines like these:
- Study: Female Coders Better Than Men, But Perceived As Worse (LiveScience)
- Women accepted as better coders as long as no gender link (TechXplore)
- Women devs – want your pull requests accepted? Just don’t tell anyone you’re a girl (The Register)
But in fact, as Kate Jeffrey points out in a comment on the preprint (emphasis added):
The study is nice but the data presentation, interpretation and discussion are very misleading. The introduction primes a clear expectation that women will be discriminated against while the data of course show the opposite. After a very large amount of data trawling, guided by a clear bias, you found a very small effect when the subjects were divided in two (insiders vs outsiders) and then in two again (gendered vs non-gendered). These manipulations (which some might call “p-hacking”) were not statistically compensated for. Furthermore, you present the fall in acceptance for women who are identified by gender, but don’t note that men who were identified also had a lower acceptance rate. In fact, the difference between men and women, which you have visually amplified by starting your y-axis at 60% (an egregious practice) is minuscule. The prominence given to this non-effect in the abstract, and the way this imposes an interpretation on the “gender bias” in your title, is therefore unwarranted.
Your most statistically significant results seem to be that […] reporting gender has a large negative effect on acceptance for all outsiders, male and female. These two main results should be in the abstract. In your abstract you really should not be making strong claims about this paper showing bias against women because it doesn’t. For the inside group it looks like the bias moderately favours women. For the outside group the biggest effect is the drop for both genders. You should hence be stating that it is difficult to understand the implications for bias in the outside group because it appears the main bias is against people with any gender vs people who are gender neutral.
Here is the key graph from the paper:
(The legends within the figure are tiny: on the Y-axes, they both read “acceptance rate”; and along the X-axis, from left to right, they read “Gender-Neutral”, “Gendered” and then again “Gender-Neutral”, “Gendered”.)
So James Best’s analysis is correct: the real finding of the study is a truly bizarre one, that disclosing your gender, whatever that gender is, reduces the chance of code being accepted. For “insiders” (members of the project team), the effect is slightly stronger for men; for “outsiders” it is rather stronger for women. (Note by the way that all the differences are much less than they appear, because the Y-axis runs from 60% to 90%, not 0% to 100%.)
Why didn’t the authors report this truly fascinating finding in their abstract? It’s difficult to know, but it’s hard not to at least wonder whether they felt that the story they told would get more attention than their actual findings — a feeling that has certainly been confirmed by sensationalist stories like Sexism is rampant among programmers on GitHub, researchers find (Yahoo Finance).
I can’t help but think of Alan Sokal’s conclusion on why his obviously fake paper in the physics of gender studies was accepted by Social Text: “it flattered the editors’ ideological preconceptions”. It saddens me to think that there are people out there who actively want to believe that women are discriminated against, even in areas where the data says they are not. Folks, let’s not invent bad news.
Would this study have been published in its present form?
This is the big question. As noted, I am a big fan of preprints. But I think that the misleading reporting in the gender-bias paper would not make it through peer-review — as the many critical comments on the preprint certainly suggest. Had this paper taken a conventional route to publication, with pre-publication review, then I doubt we would now be seeing the present sequence of misleading headlines in respected venues, and the flood of gleeful “see-I-told-you” tweets.
(And what do those headlines and tweets achieve? One thing I am quite sure they will not do is encourage more women to start coding and contributing to open-source projects. Quite the opposite: any women taking these headlines at face value will surely be discouraged.)
So in this case, I think the fact that the study in its present form appeared on such an official-looking venue as PeerJ Preprints has contributed to the avalanche of unfortunate reporting. I don’t quite know what to do with that observation.
What’s for sure is that no-one comes out of this as winners: not GitHub, whose reputation has been wrongly slandered; not the authors, whose reporting has been shown to be misleading; not the media outlets who have leapt uncritically on a sensational story; not the tweeters who have spread alarm and despondency; not PeerJ Preprints, which has unwittingly lent a veneer of authority to this car-crash. And most of all, not the women who will now be discouraged from contributing to open-source projects.
February 17, 2016
Thirteen years ago, Kenneth Adelman photographed part of the California coastline from the air. His images were published as part of a set of 12,000 in the California Coastal Records Project. One of those photos showed the Malibu home of the singer Barbra Streisand.
In one of the most ill-considered moves in history, Streisand sued Adelman for violation of privacy. As a direct result, the photo — which had at that point been downloaded four times — was downloaded a further 420,000 times from the CCRP web-site alone. Meanwhile, the photo was republished all over the Web and elsewhere, and has almost certainly now been seen by tens of millions of people.
Last year, the tiny special-interest academic-paper search-engine Sci-Hub was trundling along in the shadows, unnoticed by almost everyone.
In one of the most ill-considered moves in history, Elsevier sued Sci-Hub for lost revenue. As a direct result, Sci-Hub is now getting publicity in venues like the International Business Times, Russia Today, The Atlantic, Science Alert and more. It’s hard to imagine any other way Sci-Hub could have reached this many people this quickly.
I’m not discussing at the moment whether what Sci-Hub is doing is right or wrong. What’s certainly true is (A) it’s doing it, and (B) many, many people now know about it.
It’s going to be hard for Elsevier to get this genie back into the bottle. They’ve already shut down the original sci-hub.com domain, only to find it immediately popping up again as sci-hub.io. That’s going to be a much harder domain for them to shut down, and even if they manage it, the Sci-Hub operators will not find it difficult to get another one. (They may already have several more lined up and ready to deploy, for all I know.)
So you’d think the last thing they’d want to do is tell the world all about it.
[Note: Mike asked me to scrape a couple of comments on his last post – this one and this one – and turn them into a post of their own. I’ve edited them lightly to hopefully improve the flow, but I’ve tried not to tinker with the guts.]
This is the fourth in a series of posts on how researchers might better be evaluated and compared. In the first post, Mike introduced his new paper and described the scope and importance of the problem. Then in the next post, he introduced the idea of the LWM, or Less Wrong Metric, and the basic mathematical framework for calculating LWMs. Most recently, Mike talked about choosing parameters for the LWM, and drilled down to a fundamental question: (how) do we identify good research?
Let me say up front that I am fully convicted about the problem of evaluating researchers fairly. It is a question of direct and timely importance to me. I serve on the Promotion & Tenure committees of two colleges at Western University of Health Sciences, and I want to make good decisions that can be backed up with evidence. But anyone who has been in academia for long knows of people who have had their careers mangled, by getting caught in institutional machinery that is not well-suited for fairly evaluating scholarship. So I desperately want better metrics to catch on, to improve my own situation and those of researchers everywhere.
For all of those reasons and more, I admire the work that Mike has done in conceiving the LWM. But I’m pretty pessimistic about its future.
I think there is a widespread misapprehension that we got here because people and institutions were looking for good metrics, like the LWM, and we ended up with things like impact factors and citation counts because no-one had thought up anything better. Implying a temporal sequence of:
1. Deliberately looking for metrics to evaluate researchers.
2. Finding some.
3. Trying to improve those metrics, or replace them with better ones.
I’m pretty sure this is exactly backwards: the metrics that we use to evaluate researchers are mostly simple – easy to explain, easy to count (the hanky-panky behind impact factors notwithstanding) – and therefore they spread like wildfire, and therefore they became used in evaluation. Implying a very different sequence:
1. A metric is invented, often for a reason completely unrelated to evaluating researchers (impact factors started out as a way for librarians to rank journals, not for administration to rank faculty!).
2. Because a metric is simple, it becomes widespread.
3. Because a metric is both simple and widespread, it makes it easy to compare people in wildly different circumstances (whether or not that comparison is valid or defensible!), so it rapidly evolves from being trivia about a researcher, to being a defining character of a researcher – at least when it comes to institutional evaluation.
If that’s true, then any metric aimed for wide-scale adoption needs to be as simple as possible. I can explain the h-index or i10 index in one sentence. “Citation count” is self-explanatory. The fundamentals of the impact factor can be grasped in about 30 seconds, and even the complicated backstory can be conveyed in about 5 minutes.
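To show just how little machinery these metrics need, here is a minimal Python sketch of the h-index and the i10 index. The function names and the citation counts are made up for illustration; only the standard definitions of the two indices are assumed.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    # With citations sorted in descending order, the condition
    # "cites >= rank" holds for exactly the first h papers.
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for cites in citations if cites >= 10)


# Hypothetical citation counts for one researcher's five papers.
papers = [10, 8, 5, 4, 3]
print(h_index(papers), i10_index(papers))
```

Neither function has a single parameter to argue about, which is rather the point.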
In addition to being simple, the metric needs to work the same way across institutions and disciplines. I can compare my h-index with that of an endowed chair at Cambridge, a curator at a small regional museum, and a postdoc at Podunk State, and it Just Works without any tinkering or subjective decisions on the part of the user (other than What Counts – but that affects all metrics dealing with publications, so no one metric is better off than any other on that score).
I fear that the LWM as conceived in Taylor (2016) is doomed, for the following reasons:
- It’s too complex. It would probably be doomed if it had just a single term with a constant and an exponent (which I realize would defeat the purpose of having either a constant or an exponent), because that’s more math than either an impact factor or an h-index requires (perceptually, anyway – in the real world, most people’s eyes glaze over when the exponents come out).
- Worse, it requires loads of subjective decisions and assigning importance on the part of the users.
- And fatally, it would require a mountain of committee work to sort that out. I doubt if I could get the faculty in just one department to agree on a set of terms, constants, and exponents for the LWM, much less a college, much less a university, much less all of the universities, museums, government and private labs, and other places where research is done. And without the promise of universal applicability, there’s no incentive for any institution to put itself through the hell of work it would take to implement.
Really, the only way I think the LWM could get into place is by fiat, by a government body. If the EPA comes up with a more complicated but also more accurate way to measure, say, airborne particle output from car exhausts, they can theoretically say to the auto industry, “Meet this standard or stop selling cars in the US” (I know there’s a lot more legislative and legal push and pull than that, but it’s at least possible). And such a standard might be adopted globally, either because it’s a good idea so it spreads, or because the US strong-arms other countries into following suit.
Even if I trusted the US Department of Education to fill in all of the blanks for an LWM, I don’t know that they’d have the same leverage to get it adopted. I doubt that the DofE has enough sway to get it adopted even across all of the educational institutions. Who would want that fight, for such a nebulous pay-off? And even if it could be successfully inflicted on educational institutions (which sounds negative, but that’s precisely how the institutions would see it), what about the numerous and in some cases well-funded research labs and museums that don’t fall under the DofE’s purview? And that’s just in the US. The culture of higher education and scholarship varies a lot among countries. Which may be why the one-size-fits-all solutions suck – I am starting to wonder if a metric needs to be broken, to be globally applicable.
The problem here is that the user base is so diverse that the only way metrics get adopted is voluntarily. So the challenge for any LWM is to be:
- Better than existing metrics – this is the easy part – and,
- Simple enough to be both easily grasped, and applied with minimal effort. In Malcolm Gladwell Tipping Point terms, it needs to be “sticky”. Although a better adjective for passage through the intestines of academia might be “smooth” – that is, having no rough edges, like exponents or overtly subjective decisions*, that would cause it to snag.
* Calculating an impact factor involves plenty of subjective decisions, but it has the advantages that (a) the users can pretend otherwise, because (b) ISI does the ‘work’ for them.
At least from my point of view, the LWM as Mike has conceived it is awesome and possibly unimprovable on the first point (in that practically any other metric could be seen as a degenerate case of the LWM), but dismal and possibly pessimal on the second one, in that it requires mounds of subjective decision-making to work at all. You can’t even get a default number and then iteratively improve it without investing heavily in advance.
An interesting thought experiment would be to approach the problem from the other side: invent as many new simple metrics as possible, and then see if any of them offer advantages over the existing ones. Although I have a feeling that people are already working on that, and have been for some time.
Simple, broken metrics like impact factor are the prions of scholarship. Yes, viruses are more versatile and cells more versatile still, by orders of magnitude, but compared to prions, cells take an awesome amount of effort to build and maintain. If you just want to infect someone and you don’t care how, prions are very hard to beat. And they’re so subtle in their machinations that we only became aware of them comparatively recently – much like the emerging problems with “classical” (i.e., non-alt) metrics.
I’d love to be wrong about all of this. I proposed the strongest criticism of the LWM I could think of, in hopes that someone would come along and tear it down. Please start swinging.
January 29, 2016
You’ll remember that in the last installment (before Matt got distracted and wrote about archosaur urine), I proposed a general schema for aggregating scores in several metrics, terming the result an LWM or Less Wrong Metric. Given a set of n metrics that we have scores for, we introduce a set of n exponents e_i which determine how we scale each kind of score as it increases, and a set of n factors k_i which determine how heavily we weight each scaled score. Then we sum the scaled results:
LWM = k_1·x_1^{e_1} + k_2·x_2^{e_2} + … + k_n·x_n^{e_n}
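As a minimal sketch of that sum in Python (the function name `lwm`, the choice of metrics and all the numbers are made up for illustration, not taken from the paper):

```python
def lwm(scores, factors, exponents):
    """Weighted sum of scaled scores: each score x is raised to its
    exponent e, multiplied by its factor k, and the terms are added up."""
    if not (len(scores) == len(factors) == len(exponents)):
        raise ValueError("scores, factors and exponents must have the same length")
    return sum(k * x ** e for x, k, e in zip(scores, factors, exponents))


# Hypothetical researcher with two metrics: 16 papers and 100 citations.
# The factors and exponents are arbitrary illustrative choices.
example = lwm(scores=[16, 100], factors=[2.0, 0.5], exponents=[0.5, 1.0])

# With a single metric, k = 1 and e = 1, the LWM collapses to that metric:
# here it is just the raw citation count.
citations_only = lwm(scores=[100], factors=[1.0], exponents=[1.0])
```

An exponent below 1 gives diminishing returns on a metric (the sixteenth paper counts for less than the first), while the factor sets how much that metric matters relative to the others.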
“That’s all very well”, you may ask, “But how do we choose the parameters?”
Here’s what I proposed in the paper:
One approach would be to start with subjective assessments of the scores of a body of researchers – perhaps derived from the faculty of a university confidentially assessing each other. Given a good-sized set of such assessments, together with the known values of the metrics x1, x2 … xn for each researcher, techniques such as simulated annealing can be used to derive the values of the parameters k1, k2 … kn and e1, e2 … en that yield an LWM formula best matching the subjective assessments.
Where the results of such an exercise yield a formula whose results seem subjectively wrong, this might flag a need to add new metrics to the LWM formula: for example, a researcher might be more highly regarded than her LWM score indicates because of her fine record of supervising doctoral students who go on to do well, indicating that some measure of this quality should be included in the LWM calculation.
I think as a general approach that is OK: start with a corpus of well understood researchers, or papers, whose value we’ve already judged a priori by some means; then pick the parameters that best approximate that judgement; and let those parameters control future automated judgements.
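To make that concrete, here is a rough Python sketch of fitting the parameters by simulated annealing. Nearly everything in it is an assumption made for illustration: the squared-error objective, the linear cooling schedule, the step size, the clamping of parameters to the range 0 to 5, and above all the data, which is synthetic rather than real assessments of real people.

```python
import math
import random


def lwm(scores, factors, exponents):
    return sum(k * x ** e for x, k, e in zip(scores, factors, exponents))


def squared_error(params, metrics, judgements):
    """Total squared gap between LWM scores and the subjective judgements.
    params holds the n factors followed by the n exponents."""
    n = len(params) // 2
    factors, exponents = params[:n], params[n:]
    return sum((lwm(xs, factors, exponents) - target) ** 2
               for xs, target in zip(metrics, judgements))


def fit_lwm(metrics, judgements, steps=5000, start_temp=1.0, seed=0):
    rng = random.Random(seed)
    n = len(metrics[0])
    current = [1.0] * (2 * n)  # start with every factor and exponent at 1
    current_err = squared_error(current, metrics, judgements)
    best, best_err = current[:], current_err
    for step in range(steps):
        # Linear cooling; the tiny offset avoids dividing by zero.
        temp = start_temp * (1 - step / steps) + 1e-9
        # Nudge one randomly chosen parameter, keeping it within [0, 5]
        # (an arbitrary bound chosen to keep the arithmetic well-behaved).
        candidate = current[:]
        i = rng.randrange(2 * n)
        candidate[i] = min(5.0, max(0.0, candidate[i] + rng.gauss(0, 0.1)))
        err = squared_error(candidate, metrics, judgements)
        # Always accept improvements; accept a worse candidate with a
        # probability that shrinks as it gets worse and as temp falls.
        if err < current_err or rng.random() < math.exp((current_err - err) / temp):
            current, current_err = candidate, err
            if err < best_err:
                best, best_err = candidate[:], err
    return best[:n], best[n:], best_err


# Synthetic data: two metrics for each of five imaginary researchers,
# plus the "subjective" score each one was given.
metrics = [[4, 10], [9, 20], [16, 5], [25, 40], [1, 1]]
judgements = [9.0, 16.0, 10.5, 30.0, 2.5]

factors, exponents, err = fit_lwm(metrics, judgements)
```

The fitted `factors` and `exponents` can then be passed to `lwm` to score researchers who were not in the original corpus, which is exactly where the quality of the initial judgements starts to matter.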
The problem, really, is how we make that initial judgement. In the scenario I originally proposed, where say the 50 members of a department each assign a confidential numeric score to all the others, you can rely to some degree on the wisdom of crowds to give a reasonable judgement. But I don’t know how politically difficult it would be to conduct such an exercise. Even if the individual scorers were anonymised, the person collating the data would know the total scores awarded to each person, and it’s not hard to imagine that data being abused. In fact, it’s hard to imagine it not being abused.
In other situations, the value of the subjective judgement may be close to zero anyway. Suppose we wanted to come up with an LWM that indicates how good a given piece of research is. We choose LWM parameters based on the scores that a panel of experts assign to a corpus of existing papers, and derive our parameters from that. But we know that experts are really bad at assessing the quality of research. So what would our carefully parameterised LWM be approximating? Only the flawed judgement of flawed experts.
Perhaps this points to an even more fundamental problem: do we even know what “good research” looks like?
It’s a serious question. We all know that “research published in high-Impact Factor journals” is not the same thing as good research. We know that “research with a lot of citations” is not the same thing as good research. For that matter, “research that results in a medical breakthrough” is not necessarily the same thing as good research. As the new paper points out:
If two researchers run equally replicable tests of similar rigour and statistical power on two sets of compounds, but one of them happens to have in her batch a compound that turns out to have useful properties, should her work be credited more highly than the similar work of her colleague?
What, then? Are we left only with completely objective measurements, such as statistical power, adherence to the COPE code of conduct, open-access status, or indeed correctness of spelling?
If we accept that (and I am not arguing that we should, at least not yet), then I suppose we don’t even need an LWM for research papers. We can just count these objective measures and call it done.
I really don’t know what my conclusions are here. Can anyone help me out?