Introducing The One Repo

June 30, 2015

You know what’s wrong with scholarly publishing?

Wait, scrub that question. We’ll be here all day. Let me cut straight to the chase and tell you the specific problem with scholarly publishing that I’m thinking of.

There’s nowhere to go to find all open-access papers, to download their metadata, to access it via an open API, to find out what’s new, to act as a platform for the development of new tools. Yes, there’s PubMed Central, but that’s only for work funded by the NIH. Yes, there’s Google Scholar, but that has no API, and at any moment could go the way of Google Wave and Google Reader when Google loses interest.

Instead, we have something like 4000 repositories out there, balkanised by institution, by geographical region, and by subject area. They have different UIs, different underlying data models, different APIs (if any). They’re built on different software platforms. It’s a jungle out there!

As researchers, we don’t need 4000 repos. You know what we need? One Repo.

Hey! That would be a good name for a project!

I’ve mentioned before how awesome and pro-open my employers, Index Data, are. (For those who are not regular readers, I’m a palaeontologist only in my spare time. By day, I’m a software engineer.) Now we’re working on an index of green/gold OA publishing. Metadata of every article across every repository and publisher. We want it to be complete, in the sense that we will be going aggressively for the long tail as opposed to focusing on some region or speciality, or things that are easily harvestable by OAI-PMH or other standards. We want it to be of a high, consistent quality in terms of metadata. We want it to be up to date. And most importantly, we want it to be fully open for all and any kind of re-use, by any other actor. This will include downloadable data files, OAI-PMH access, search-retrieve web services, embeddable widgets and more. We also envisage a Linked Data representation with a CRUD interface that allows third parties to contribute supplemental information, entity reconciliation, tagging, etc.
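For readers who haven’t met OAI-PMH, here is a minimal sketch of what harvesting from a single repository involves. It is an illustration only, not The One Repo’s own code. The endpoint `https://repo.example.org/oai`, the function names and the sample response are all assumptions made up for the example. The `ListRecords` verb, the `oai_dc` metadata prefix and the `resumptionToken` paging mechanism come from the OAI-PMH 2.0 standard.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Hand-written sample response (an assumption for illustration),
# standing in for what a real endpoint would send back.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    When a resumption token is given, it is sent on its own: the protocol
    treats resumptionToken as an exclusive argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) extracted from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI_NS + "record"):
        records.append({
            "identifier": rec.findtext(f"{OAI_NS}header/{OAI_NS}identifier"),
            "title": rec.findtext(f".//{DC_NS}title"),
        })
    # A missing or empty token means there are no further pages.
    token = root.findtext(f".//{OAI_NS}resumptionToken")
    return records, (token or None)


records, next_token = parse_list_records(SAMPLE_RESPONSE)
print(list_records_url("https://repo.example.org/oai"))
print(records, next_token)
```

A real harvester would fetch each URL, parse the response, and repeat with the returned token until none comes back. Multiply that by a few thousand repositories, many of which offer no such interface at all, and you have the long-tail problem.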

Instead of 4000 fragments, one big, meaty chunk of data.

Because we at Index Data have spent the last ten years helping aggregators, publishers and others get access to difficult-to-access information through all kinds of crazy mechanisms, we have a unique combination of the skills, the tools, and the desire to pursue this venture.

So The One Repo is born. At the moment, we have:

  • Harvesting set up for an initial set of 20 repositories.
  • A demonstrator of one possible UI.
  • A whitepaper describing the motivation and some of the technical aspects.
  • A blog about the project’s progress.
  • An advisory board of some of the brightest, most experienced and wisest people in the world of open access.

We’ve been flying under the radar for the last month and a bit. Now we’re ready for the world to know what we’re up to.

The One Repo is go!

24 Responses to “Introducing The One Repo”

  1. Frosted Flake Says:

    DANG !

    Let me get on my knees here now and wave my hands a bit. No, wait. That’s embarrassing. And probably not much fun to watch. How about instead I mention I find it inspiring to encounter a man able to solve an important (worldwide) problem, digging in and doing so, for no better reason than he wants that problem solved. And knows how.

    It’s a beautiful thing to see.

    If I could make one suggestion. Switch the name a bit. Repo One. It seems to roll off the tongue in a satisfying way that sounds a little like slang. And sounds a little like a destination. A place to go and do things. And a place to put things. And easily find them again. I suspect it will be easier for people to visualize Repo One as THE place for data and links to same, than folks would naturally do with One Repo.

    Just a thought. Trash it if you like.

  2. Frosted Flake Says:

    For the rest of the folks… Probably ought to review the previous blog post. Kent Anderson has paid us a visit. To call us (and Mike) an ignorant rabble. Or so it seems at first blush. I got to go try to read that, but it’s HARD to read insults. Even when they pretend not to be… actual insults.

  3. Jeff Says:

    We should chat. We’re doing the same thing at SHARE and are happy to have collaborators.

  4. Mark Robinson Says:

    Holy crap, Mike. That is an amazing announcement. Gob-smacking, actually. Fantastic to see Index Data just go ahead and do something to address the issue while a lot of other people are just talking about it. (Not that discussing a problem is a bad thing, and plenty of what has been said has been useful, but eventually someone has to start the ball rolling and act if the problem is to be resolved or, at least, mitigated.)

    Congrats to all. I really hope it kicks off. I presume that you’ll be limiting access to academics and publishers and not permitting peasants to get their grubby turnip-farming fingers on data that they wouldn’t be able to understand properly anyway?

  5. brembs Says:

    Excellent! This is fantastic news. I hope it’ll be better known and hence more used than the previous attempt from the UK, the CORE project:

    http://project.core.ac.uk/about-core-project

  6. Mike Taylor Says:

    Jeff, we would love to be working with SHARE. I have your email address from the metadata of your comment — I’ll drop you a line today.


  7. Mike,

    You may be interested in similar work I do at CrossRef. We have collected metadata on most scholarly content (OA or otherwise,) across countries, publishers, disciplines, so on.

    We provide it all for free, without license restriction, to anyone via a single API here: http://api.crossref.org .

    We don’t yet allow supplemental assertions from the general public, but we have thought on occasion that we would like to collect assertions, whose provenance could, perhaps, be verified by ORCID sign in.

    It will be interesting to see this new initiative progress.

    Karl


  8. We even have a UI for the API here: http://search.crossref.org although it supports nowhere near all the features of our REST API for content lookup, search and discovery.

  9. Mike Taylor Says:

    That is indeed interesting, Karl, many thanks! I will certainly look into this.

    My sense has always been that CrossRef is limited to the more established end of the journal market: for example, several of my own papers, listed here, don’t seem to be covered, including the ones at PaleoBios (where my fourth most-cited paper is) and at the Bulletin of Zoological Nomenclature. Is that still true, or has coverage increased and diversified?


  10. Coverage is diversifying. We are seeing new members from less economically developed countries (helped in part by our fee waiving partnerships with organisations such as PKP and INASP and our relationships with governmental entities, for example in Brazil.) So DOI coverage is less ‘western-centric’ than it used to be.

    We are seeing many smaller members, too, and our membership is long past a time when we could say large or medium sized commercial publishers and prominent societies and university presses make up most of our numbers. Entities with one or a few journals that are now able to work with CrossRef (ask Martin Eve about our tooling aimed directly at them) are becoming (probably already are) the norm.

    And of course, we definitely have a handle on the more prominent gold OA journals to appear in recent years.

    Unsure about palaeontology in particular or subject ‘holes’ more generally. Though I’m sure some such holes do still exist.


  11. I think the key distinction compared to Crossref Metadata is that One Repo will provide links back to full text in different places. Certainly the diversification of Crossref members is good and that adds a lot, particularly given many new members are OA publishers.

    The real problem OR tries to tackle though is the many articles available through IRs and other smaller repositories. There’s no record of author manuscripts in Crossref metadata at the moment at least and that’s for me where the added value comes.

    Of course the real end game is having well structured indexes that can be brought together. So for me One Repo helps to tackle that at one end, while Crossref is working at the other.


  12. Yes I think this a fantastic goal to aim towards.

    Diversity of location and diversity of stewardship of content – the concept of multiplicity of stewardship – is lacking within the CrossRef model as it stands right now. Yes, IRs, self-hosted copies, so on are essentially absent in CrossRef.

  13. Mike Taylor Says:

    Well, Karl, this sounds like there is excellent potential for real synergy between what CrossRef is already doing and what we want to bring to the party. We’ll be in touch soon to see what form that might take. Thanks for dropping by!

  14. Brian Nosek Says:

    Yes, nice to see this and SHARE pushing the same direction. Following up on Jeff’s note above, here is SHARE’s github repo: https://github.com/CenterForOpenScience/SHARE and a discovery portal to the dataset: https://osf.io/share/? It would be great to collaborate.

  15. Frosted Flake Says:

    I’d extend the discussion to say what might be too obvious.

    I hope a lot of thought has gone into how to make the system vandal resistant. And how to move large data sets, often, without being noticed by the other folks on the net. I think (this is just me, y’know) The best model might be the torrent.

    With the data spread out and redundant a meteor strike would not render Earth stupid again. And being spread out, a page can be taken from every server to assemble into a book at yours.

    You guys did that a few years ago. Didn’t ya?

  16. Thomas Munro Says:

    This is tremendously exciting news. Thanks for undertaking this huge task and bringing together such formidable people. The mooted connection with crossref would also be wonderful – google scholar’s unreliable metadata and hidden DOIs are exasperating flaws. Some thoughts:

    1. Another key source to integrate with would be BASE, the biggest repository index I know of:
    http://www.base-search.net/about/en/index.php
    It’s huge (70 million documents) but only ~70% OA. Setting up a sharing deal with them could save you a lot of work on repositories. Add journals and crossref integration, and you will rule OA search like an iron fist in a velvet glove.

    2. Why not run a crowdfunding campaign? I believe the major sites only take a 5% cut. I’d donate like a shot.


  17. Fantastic! That’s exactly what we’re doing in Paperity, a multidisciplinary open-access aggregator launched in 2014:

    http://paperity.org/

    Paperity aggregates 2500 journals and 830,000 papers already, with the aim to reach 100% coverage at some point. Until now, we’ve focused on journals, but we plan to add repositories in the next stage, as well.

    Hopefully we could join the efforts when One Repo is up and running? I think building an API in One Repo is a particularly great idea, potentially very useful and important for the scholarly community. I wish you good luck in developing this functionality.


  18. This is an amazing idea, and exactly what is needed. Let me know if you want any help on the communication end once you get to the point where it’s ready for end users to start exploring.

  19. gemstest Says:

    A welcome initiative – but it needs to engage with Google! The objections in the white paper are not show-stoppers if the end game is access, and a Google outlet shouldn’t stop receipt of funding.

  20. Mike Taylor Says:

    Actually, gemstest, we hope that one of the outcomes of The One Repo will be that Google Scholar is better able to index the open-access literature, because of the nice clear harvest we make available of the long-tail stuff. (And in an ideal world, Google would for that reason be one of the funding organisations.)

  21. gemstest Says:

    That’s good news. I’d hate to see this become a European Library thing. And maybe google.org could find a way to fund this.

  22. Mike Taylor Says:

    What do you mean by “a European Library thing”?


  23. […] seats when we put up too many posts in a row on open access or rabbits or…okay, mostly just OA and bunnies. If that’s you – or, heck even if it isn’t – your good day has […]


  24. […] spent much of the week squeezed into a conference room discussing software projects (including The One Repo, yay!) But we did have plenty of time to lounge around on the beach and in the pools, too — […]

