What if we just post PDFs on the open web?

February 18, 2012

An interesting conversation arose in the comments to Matt’s last post — interesting to me, at least, but then since I wrote much of it, I am biased.  I think it merits promotion to its own post, though.  Paul Graham, among many others, has written about how one of the most important reasons to write about a subject is that the process of doing so helps you work through exactly what you think about it.  And that is certainly what’s happening to me in this series of Open Access posts.

Dramatis personae

Liz Smith: Director of Global Internal Communications at Elsevier
Mike Taylor: me, your co-host here at SV-POW!
Andy Farke: palaeontologist, ceratopsian lover, and PLoS ONE volunteer academic editor

Script

In a long and interesting comment, Liz wrote (among much else):

This is where there seems to be deliberate obtuseness. Sticking a single PDF up online is easy. But there are millions of papers published every year. It takes a hell of a lot of people and resources to make that happen. You can’t just sling it online and hope somebody can find it. The internet doesn’t happen by magic.

And I replied:

Actually, you can and I do. That is exactly how the Internet works. I don’t have to do anything special to make sure my papers are found — Google and other search engines pick them up, just like they do everything. So to pick an example at random, if you search for brachiosaurus re-evaluation, the very first hit will be my self-hosted PDF of my 2009 JVP paper on that subject. [Correction: I now see that it’s the third hit; the PDF of the correction is top.] Similarly, search for xenoposeidon pdf and the top hit is — get ready for a shock! — my self-hosted PDF of my 2007 Palaeontology paper on that subject.

So in fact, this is a fine demonstration of just how obsolete much of the work that publishers do has now become — all that indexing, abstracting and aggregation, work that used to be very important, but which is now done much faster, much better, for free, by computers and networks.

Really: what advantages accrue to me in having my Xenoposeidon paper available on Wiley’s site as well as mine? [It’s paywalled on their site, so useless to 99% of potential visitors, but ignore that for now. Let’s pretend it’s freely available.] What else does that get me that Google’s indexing of my self-hosted PDF doesn’t?

Liz is quite rightly taking a break over the weekend, so she’s not yet replied to this; but Andy weighed in with some important points:

To address your final statement, I see three main advantages to having a PDF on a publisher’s site, rather than just a personal web page (this follows some of our Twitter discussion the other day, but I post it here just to have it in an alternative forum):

1) Greater permanence. Personal web pages (even with the best of intentions) have a history of non-permanence; there is no guarantee your site will be around 40 or 50 years from now. Just ask my Geocities page from 1998. Of course, there also is no guarantee that Wiley’s website will be around in 2073 either, but I think it’s safe to say there’s a greater likelihood that it will be around in some incarnation than a personal website.

2) Document security. By putting archiving in the hands of the authors, there is little to prevent them from editing out embarrassing details, or adding in stuff they wanted published but the reviewers told them to take out, or whatever. I’m not saying this is something that most people would do, but it is a risk of not having an “official” copy somewhere.

3) Combating author laziness. You have an excellent track record of making your work available, but most other authors do not, for various reasons.

It is also important to note that none of the above requirements needs a commercial publisher – in fact, they would arguably be better served by taking them out of the commercial sector. My main point is that self-hosting, although a short-term solution for distribution and archival, is not a long-term one.

Finally, just as a minor pedantic note, search results depend greatly on the search engine used. Baidu – probably the most popular search engine in China – doesn’t give your self-hosted PDF anywhere in its three pages of search results (neither does it give Wiley’s version, though).

And now, here is my long reply — the one that, when I’d finished it, made me want to post this as an article:

On permanence, there are a few things to say. One is that with the rate of mergers, web-site “upgrades” and suchlike I am actually far from confident that (say) the Wiley URL for my Xenoposeidon paper will last longer than my own. In fact, let’s make it a challenge! :-) If theirs goes away, you buy me a beer; if mine does, I buy you one! But I admit that, as an IT professional who’s been running a personal website since the 1990s — no Geocities for me! — I am not a typical case.

But the more important point is that it doesn’t matter. The Web doesn’t actually run on permanent addresses, it runs on what gets indexed. If I deleted my Xenoposeidon PDF today and put it up somewhere else — say, directly on SV-POW! — within a few days it would be indexed again, and coming out at or near the top of search-engine results. Librarians and publishers used to have a very important curation role — abstracting and indexing and all that — but the main reason they keep doing these things now is habit.

And that’s because of the wonderful loosely coupled nature of the Internet. Back when people first started posting research papers on the web, there were no search engines — CERN, famously, maintained a list of all the world’s web-sites. Search engines and crawlers as we know them today were never part of the original vision of the web: they were invented and put together from spare parts. And that is the glory of the open web. The people at Yahoo and AltaVista and Google didn’t need anyone’s permission to start crawling and indexing — they didn’t need to sign up to someone’s Developer Partnership Program and sign a non-disclosure form before they were allowed to see the API documentation, and then apply for an API Key that is good for up to 100 accesses per day. All these encumbrances apply when you try to access data in publishers’ silos (trust me: my day-job employers have just spent literally months trying to suck the information out of Elsevier that is necessary to use their crappy 2001-era SOAP-based web services to search metadata. Not even content.) And this is why I can’t get remotely excited about things like ScienceDirect and Scopus. Walled gardens can give us some specific functionality, sure, but they will always be limited by what the vendor thinks of, and what the vendor can turn a profit on. Whereas if you just shove things up on the open web, anyone can do anything with them.

With that said, your point about document security is well made — we do need some system for preventing people from tampering with versions of record. Perhaps something along the lines of the DOI register maintaining an MD5 checksum of the version-of-record PDF?
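A minimal sketch of that checksum idea, in Python. Everything here is an assumption made for illustration: the in-memory REGISTRY dictionary stands in for whatever a DOI register might actually store, the function names are hypothetical, and the DOI in the usage note is made up. MD5 is used because it is the algorithm suggested above; since MD5 collisions can be constructed deliberately, a real register would probably prefer something like SHA-256, which this sketch allows by changing one argument.

```python
import hashlib

# Hypothetical stand-in for a DOI register: maps a DOI string to the
# checksum of the version-of-record PDF recorded at publication time.
REGISTRY = {}


def checksum(data: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest of the given bytes using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def register(doi: str, pdf_bytes: bytes) -> None:
    """Record the checksum of the version-of-record PDF for this DOI."""
    REGISTRY[doi] = checksum(pdf_bytes)


def matches_version_of_record(doi: str, pdf_bytes: bytes) -> bool:
    """True only if the DOI is registered and the bytes hash to the recorded value."""
    expected = REGISTRY.get(doi)
    return expected is not None and checksum(pdf_bytes) == expected
```

With this in place, calling register("10.0000/example", original_bytes) once and then matches_version_of_record("10.0000/example", downloaded_bytes) on any copy found on the open web would tell a reader whether a self-hosted PDF is byte-for-byte the published one. It says nothing about who made a change or why; it only detects that the file differs.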

You are also right that not all authors will bother to post their PDFs — though frankly, heaven alone knows why not, when it takes five minutes to do something that will triple the accessibility of work you’ve spent a year on. This seems like an argument for repositories (whether institutional or subject-based) and mandatory deposition — e.g. as a condition of a grant.

Is that the same as the Green OA route? No, I want to see version-of-record PDFs reposited, not accepted manuscripts — for precisely the anti-tampering reason you mention above, among other reasons. Green OA is much, much better than nothing. But it’s not the real thing.

Finally: if Baidu lists neither my self-hosted Xenoposeidon PDF nor Wiley’s version anywhere in its first three pages of search results, then it is Just Plain Broken. I can’t worry about the existence of broken tools. Someone will make a better one and knock it off its perch, just like Google did to AltaVista.

And there, for the moment, matters stand.  I’m sure that Liz and Andy, and hopefully others, will have more to say.

One of the things I like about this is the way that a discussion that was originally about publisher behaviour mutated into one on the nature of the Open Web — really, where we ended up is nothing to do with Open Access per se.  The bottom line is that free systems (and here I mean free-as-in-freedom, not zero-cost) don’t just open up more opportunities than proprietary ones, they open up more kinds of opportunities, including all kinds of ideas that the original group never even thought of.

And that, really — bringing it all back to where we started — is why I care about Open Access.  Full, BOAI-compliant, Open Access.  Not just so that people can read papers at zero cost (important though that is), but so that we and a million other groups around the world can use them to build things that we haven’t even thought of yet — things as far advanced beyond the current state of the art as Google is over CERN’s old static list of web-sites.

14 Responses to “What if we just post PDFs on the open web?”

  1. Paul Barrett Says:

    Andy makes a number of good points. The permanency one is important, and actually one of the reasons why I prefer not to publish in electronic only venues. It’s worth noting that it is actually in the publisher’s interest to promote permanent access: the PDFs are assets and have financial value. Even if a publisher goes bust, whoever buys it will take these on as an asset and will want to continue to make them available commercially. There have been a lot of journals that have shifted publisher, yet the new publishers always make the backlists available. Finally, if you are hosting PDFs on your own website you are probably in violation of the copyright agreements you signed when you submitted the papers to these journals. I know lots of people do this as common practice, but it’s a violation just the same, as reprints/PDFs are made available for personal distribution, whereas mounting them on a website undermines the publisher by making the work generally available.


  2. I think there is another point here which is that publishers haven’t traditionally been the holders of the archive, that is what libraries are for. When we moved from purchasing print copies to renting electronic ones this job somehow got passed to the publishers, because libraries are no longer legally allowed to play this role.

    And some publishers have done a very poor job. Both LOCKSS and CLOCKSS have been put in place as stop gap measures in case publishers go under but a lot of the impetus for that came from the experience of the early digitization programs when it became clear that many publishers didn’t even have their own print back catalogue. So how could they be trusted with the electronic ones? These efforts are actually a pretty good example of how libraries and publishers have worked together to find a solution that is acceptable, but it still has huge holes. Recent story going around of a publisher purchased by another and no longer recognizing previous contracts for access…


  3. Paul, you’re bending over backwards to appease BigPublishing! Invoking “copyright agreements” for this hypothetical scenario is ridiculous: obviously, nobody suggests doing it for all past papers, but for future work. And for that you can ensure you have the right to post a PDF.

    Additionally, your bias against electronic only publishers is ridiculous: short of a global nuclear war it is far more likely that all PAPER copies of something are lost than that all PDF copies are destroyed.

  4. Mike Taylor Says:

    To be fair, Paul has a point on the permanency of paper (one that I hear a lot in my ICZN discussions): although it seems to me that a freely distributed PDF (with arbitrarily many copies replicated around the world at no cost) is going to have better preservation prospects than something that’s merely printed on paper, we know from experience that paper can be good for hundreds of years, whereas we just haven’t had the time to demonstrate that the same is true of electronic publications. I am very confident (as outlined in my 2009 BZN paper) but I can see why other people might not be.

    Also: “Finally, if you are hosting PDFs on your own website you are probably in violation of the copyright agreements you signed when you submitted the papers to these journals.” This is technically correct, and it’s the reason I won’t be signing any more copyright transfers or unreasonable publishing licences. (Much more on this to come in a future post.) Still, this is another example where the role of publishers is to hinder the free exchange of information — something that the publishing industry should be ashamed of, since it’s precisely the opposite of what they exist for.

  5. Paul Barrett Says:

    Publishers don’t exist to promote “free exchange” – they exist to make a profit. Isn’t this the whole point of your arguments against big publishers?

  6. Mike Taylor Says:

    Paul asks: “Publishers don’t exist to promote “free exchange” – they exist to make a profit. Isn’t this the whole point of your arguments against big publishers?”

    In the end, yes, it is. It’s not (as others have pointed out) that the publishers are deliberately evil; it’s that their goal is not the same as ours, and we have no moral mandate to keep supporting them in theirs. Our goal is to make science and communicate it to the world. When big publishers are helping us to do that, I am in favour of them; when (as now) they are hindering, I am against.


  7. Paul – yes, they exist to make a profit. But there is a healthy profit, and there is a total rip-off. If they do not want people to get pissed about their profits, they should not try to get legislation passed to protect absurd profits.

  8. Michael Richmond Says:

    The astronomical community has two great resources which might be relevant to this discussion. The first is the many papers which are placed onto the arXiv preprint server.

    http://arxiv.org/

    (Almost) anyone can post a paper onto this server. Other people can search via title or abstract words (but not words in the main body of the text). This service is very cheap, but it has no editorial control. Fast, cheap, but no guarantees of quality.

    NASA has been funding a different sort of archive for the past ten years or so: the Astrophysics Data System.

    http://www.adsabs.harvard.edu/

    The ADS stores copies of the full text of all papers published in the major, and even many minor, astronomical journals — over the past 100 years or more! It took a great deal of time and money to scan paper versions of thousands of journals to create this archive. Now, however, not only can one search for articles by abstract or full body text, one can also jump quickly to any of the references of a paper, or search for the papers which cite it, or papers which have similar abstracts, etc.

    The first option is cheap (a few thousand dollars) and easy to set up. The second is expensive and difficult, but gives the users much more power to track down information. Both are much, much more useful to the community than having each author posting copies of his papers to his personal website.

  9. cc Says:

    I would like to raise one other point – how do you define permanence? Should permanence only equal paper and do we defend the ivory towers of libraries or limited access journals? Or do we take steps to use the internet as a tool to move science back to an open discourse between the different fields?

    I bring this up because I have had the privilege to recently help 3 graduate students from small colleges (colleges who cannot afford to have access to all the journals out there) to find the papers they needed to do their dissertations. I would argue real permanence is not a paper in a library or a credible citation in some journal in your field. Real permanence is passing along the knowledge and thus creating a building block for somebody else to build upon.

  10. Mike Taylor Says:

    cc, I absolutely agree! It’s disturbing to think that technically you were guilty of copyright violation in helping your colleagues to access the papers they needed for their dissertations. Another reflection on how crazy the world has become.


  11. […] internet and we would have to wade through to find the good stuff (Mike has a post related to this here, in an email convo with Liz Smith, Director of Global Internal Communications at Elsevier). This […]

  12. ech Says:

    When I’ve suggested self-publishing in the past here, the reply (eg https://svpow.wordpress.com/2011/09/30/authors-versus-publishers/#comment-11231) has usually been that peer review is important. I’ve always felt that reply was a little thin, especially given the insistence that the publishers don’t really do much wrt peer review in the first place. But, having never published anything in a peer-reviewed journal myself, I have no idea what I’m talking about, so—how will peer review work in a post-pdfs-on-the-open-web world?

  13. Mike Taylor Says:

    Well, we need to separate several questions here (and we probably need to do a post on these separate questions at some point). Here are some of them: does peer-review improve papers by enough to be worth the pain? Is the willingness to undergo peer-review, even if it doesn’t actually achieve much, an indication of being serious about work? Should peer-review be (as at present) a filter that allows/denies publication, or (as some have proposed) a filter on published work that helps to evaluate it? How much of a role do publishers have in making peer-review happen? Do we even need the concept of the journal any more? And the truth is I am either completely undecided or unsure of my conclusions on every single one of those questions.

    If it’s not already apparent from what we’ve been posting, Matt’s and my ideas on all these Shiny Digital Future issues are shifting as we discuss them here and elsewhere. Which of course is as it should be: we wouldn’t want to land up in a fixed, immutable place where we’re invulnerable to new data and ideas.

    So the most I am prepared to commit to at this point is: peer-review as currently practiced by many journals is unnecessarily depleting and dispiriting, and will probably change; but we will near-certainly always need some form of peer-review.


  14. […] relative merits of formally published papers and more informal publications such as blog-posts a couple of times, but perhaps never really dug into what the differences are between […]

