What if we just post PDFs on the open web?
February 18, 2012
An interesting conversation arose in the comments to Matt’s last post — interesting to me, at least, but then since I wrote much of it, I am biased. I think it merits promotion to its own post, though. Paul Graham, among many others, has written about how one of the most important reasons to write about a subject is that the process of doing so helps you work through exactly what you think about it. And that is certainly what’s happening to me in this series of Open Access posts.
Liz Smith: Director of Global Internal Communications at Elsevier
Mike Taylor: me, your co-host here at SV-POW!
Andy Farke: palaeontologist, ceratopsian lover, and PLoS ONE volunteer academic editor
In a long and interesting comment, Liz wrote (among much else):
This is where there seems to be deliberate obtuseness. Sticking a single PDF up online is easy. But there are millions of papers published every year. It takes a hell of a lot of people and resources to make that happen. You can’t just sling it online and hope somebody can find it. The internet doesn’t happen by magic.
And I replied:
Actually, you can and I do. That is exactly how the Internet works. I don’t have to do anything special to make sure my papers are found — Google and other search engines pick them up, just like they do everything. So to pick an example at random, if you search for brachiosaurus re-evaluation, the very first hit will be my self-hosted PDF of my 2009 JVP paper on that subject. [Correction: I now see that it’s the third hit; the PDF of the correction is top.] Similarly, search for xenoposeidon pdf and the top hit is — get ready for a shock! — my self-hosted PDF of my 2007 Palaeontology paper on that subject.
So in fact, this is a fine demonstration of just how obsolete much of the work that publishers do has now become — all that indexing, abstracting and aggregation, work that used to be very important, but which is now done much faster, much better, for free, by computers and networks.
Really: what advantages accrue to me in having my Xenoposeidon paper available on Wiley’s site as well as mine? [It’s paywalled on their site, so useless to 99% of potential visitors, but ignore that for now. Let’s pretend it’s freely available.] What else does that get me that Google’s indexing of my self-hosted PDF doesn’t?
Liz is quite rightly taking a break over the weekend, so she’s not yet replied to this; but Andy weighed in with some important points:
To address your final statement, I see three main advantages to having a PDF on a publisher’s site, rather than just a personal web page (this follows some of our Twitter discussion the other day, but I post it here just to have it in an alternative forum):
1) Greater permanence. Personal web pages (even with the best of intentions) have a history of non-permanence; there is no guarantee your site will be around 40 or 50 years from now. Just ask my Geocities page from 1998. Of course, there also is no guarantee that Wiley’s website will be around in 2073 either, but I think it’s safe to say there’s a greater likelihood that it will be around in some incarnation than a personal website.
2) Document security. By putting archiving in the hands of the authors, there is little to prevent them from editing out embarrassing details, or adding in stuff they wanted published but the reviewers told them to take out, or whatever. I’m not saying this is something that most people would do, but it is a risk of not having an “official” copy somewhere.
3) Combating author laziness. You have an excellent track record of making your work available, but most other authors do not, for various reasons.
It is also important to note that none of the above requirements needs a commercial publisher – in fact, they would arguably be better served by taking them out of the commercial sector. My main point is that self-hosting, although a short-term solution for distribution and archival, is not a long-term one.
Finally, just as a minor pedantic note, search results depend greatly on the search engine used. Baidu – probably the most popular search engine in China – doesn’t give your self-hosted PDF anywhere in its three pages of search results (neither does it give Wiley’s version, though).
And now, here is my long reply — the one that, when I’d finished it, made me want to post this as an article:
On permanence, there are a few things to say. One is that with the rate of mergers, web-site “upgrades” and suchlike I am actually far from confident that (say) the Wiley URL for my Xenoposeidon paper will last longer than my own. In fact, let’s make it a challenge! :-) If theirs goes away, you buy me a beer; if mine does, I buy you one! But I admit that, as an IT professional who’s been running a personal website since the 1990s — no Geocities for me! — I am not a typical case.
But the more important point is that it doesn’t matter. The Web doesn’t actually run on permanent addresses, it runs on what gets indexed. If I deleted my Xenoposeidon PDF today and put it up somewhere else — say, directly on SV-POW! — within a few days it would be indexed again, and coming out at or near the top of search-engine results. Librarians and publishers used to have a very important curation role — abstracting and indexing and all that — but the main reason they keep doing these things now is habit.
And that’s because of the wonderful loosely coupled nature of the Internet. Back when people first started posting research papers on the web, there were no search engines — CERN, famously, maintained a list of all the world’s web-sites. Search engines and crawlers as we know them today were never part of the original vision of the web: they were invented and put together from spare parts. And that is the glory of the open web. The people at Yahoo and AltaVista and Google didn’t need anyone’s permission to start crawling and indexing — they didn’t need to sign up to someone’s Developer Partnership Program and sign a non-disclosure form before they were allowed to see the API documentation, and then apply for an API Key that is good for up to 100 accesses per day. All these encumberances apply when you try to access data in publishers’ silos (trust me: my day-job employers have just spent literally months trying to suck the information out of Elsevier that is necessary to use their crappy 2001-era SOAP-based web services to search metadata. Not even content.) And this is why I can’t get remotely excited about things like ScienceDirect and Scopus. Walled gardens can give us some specific functionality, sure, but they will always be limited by what the vendor thinks of, and what the vendor can turn a profit on. Whereas if you just shove things up on the open web, anyone can do anything with them.
With that said, your point about document security is well made — we do need some system for preventing people from tampering with versions of record. Perhaps something along the lines of the DOI register maintaining an MD5 checksum of the version-of-record PDF?
You are also right that not all authors will bother to post their PDFs — though frankly, heaven alone knows why not, when it takes five minutes to do something that will triple the accessibility of work you’ve spent a year on. This seems like an argument for repositories (whether institutional or subject-based) and mandatory deposition — e.g. as a condition of a grant.
Is that the same as the Green OA route? No, I want to see version-of-record PDFs reposited, not accepted manuscripts — for precisely the anti-tampering reason you mention above, among other reasons. Green OA is much, much better than nothing. But it’s not the real thing.
Finally: if Baidu lists neither my self-hosted Xenoposeidon PDF or Wiley’s version anywhere in its first three pages of search results, then it is Just Plain Broken. I can’t worry about the existence of broken tools. Someone will make a better one and knock it off its perch, just like Google did to AltaVista.
And there, for the moment, matters stand. I’m sure that Liz and Andy, and hopefully others, will have more to say.
One of the things I like about this is the way that a discussion that was originally about publisher behaviour mutated into one on the nature of the Open Web — really, where we ended up is nothing to do with Open Access per se. The bottom line is that free systems (and here I mean free-as-in-freedom, not zero-cost) don’t just open up more opportunities than proprietary ones, they open up more kinds of opportunities, including all kinds of ideas that the original group never even thought of.
And that, really — bringing it all back to where we started — is why I care about Open Access. Full, BOAI-compliant, Open Access. Not just so that people can read papers at zero cost (important though that is), but so that we and a million other groups around the world can use them to build things that we haven’t even thought of yet — things as far advanced beyond the current state of the art as Google is over CERN’s old static list of web-sites.