Introducing The One Repo
June 30, 2015
You know what’s wrong with scholarly publishing?
Wait, scrub that question. We’ll be here all day. Let me jump straight to the chase and tell you the specific problem with scholarly publishing that I’m thinking of.
There’s nowhere to go to find all open-access papers, to download their metadata, to access it via an open API, to find out what’s new, to act as a platform for the development of new tools. Yes, there’s PubMed Central, but that’s only for work funded by the NIH. Yes, there’s Google Scholar, but that has no API, and at any moment could go the way of Google Wave and Google Reader when Google loses interest.
Instead, we have something like 4000 repositories out there, balkanised by institution, by geographical region, and by subject area. They have different UIs, different underlying data models, different APIs (if any). They’re built on different software platforms. It’s a jungle out there!
As researchers, we don’t need 4000 repos. You know what we need? One Repo.
Hey! That would be a good name for a project!
I’ve mentioned before how awesome and pro-open my employers, Index Data, are. (For those who are not regular readers, I’m a palaeontologist only in my spare time. By day, I’m a software engineer.) Now we’re working on an index of green/gold OA publishing. Metadata of every article across every repository and publisher. We want it to be complete, in the sense that we will be going aggressively for the long tail as opposed to focusing on some region or speciality, or things that are easily harvestable by OAI-PMH or other standards. We want it to be of a high, consistent quality in terms of metadata. We want it to be up to date. And most importantly, we want it to be fully open for all and any kind of re-use, by any other actor. This will include downloadable data files, OAI-PMH access, search-retrieve web services, embeddable widgets and more. We also envisage a Linked Data representation with a CRUD interface that allows third parties to contribute supplemental information, entity reconciliation, tagging, etc.
Instead of 4000 fragments, one big, meaty chunk of data.
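For readers who have not met OAI-PMH, here is a minimal sketch of what harvesting one repository by that standard might look like. It is an illustration only, not the code behind The One Repo. The base URL (repo.example.org), the sample response and the function names are all made up for the example; the verb, the metadataPrefix and resumptionToken arguments, and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # When resuming a paged harvest, the token is the only other argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        records.append({"identifier": identifier, "title": title})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An empty or missing token means there are no more pages.
    return records, (token or None)


# A made-up response, trimmed to the elements the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://repo.example.org/oai"))
    print(parse_records(SAMPLE))
```

A real harvester would fetch each URL, follow resumption tokens until none is returned, and cope with each repository's quirks. And that only covers the repositories that speak OAI-PMH at all, which is exactly why the long tail is the hard part.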
Because we at Index Data have spent the last ten years helping aggregators and publishers and others get access to difficult-to-access information through all kinds of crazy mechanisms, we have a unique combination of the skills, the tools, and the desire to pursue this venture.
So The One Repo is born. At the moment, we have:
- Harvesting set up for an initial set of 20 repositories.
- A demonstrator of one possible UI.
- A whitepaper describing the motivation and some of the technical aspects.
- A blog about the project’s progress.
- An advisory board of some of the brightest, most experienced and wisest people in the world of open access.
We’ve been flying under the radar for the last month and a bit. Now we’re ready for the world to know what we’re up to.
The One Repo is go!
July 1, 2015 at 1:54 am
DANG !
Let me get on my knees here now and wave my hands a bit. No, wait. That’s embarrassing. And probably not much fun to watch. How about instead I mention I find it inspiring to encounter a man able to solve an important (worldwide) problem, digging in and doing so, for no better reason than he wants that problem solved. And knows how.
It’s a beautiful thing to see.
If I could make one suggestion. Switch the name a bit. Repo One. It seems to roll off the tongue in a satisfying way that sounds a little like slang. And sounds a little like a destination. A place to go and do things. And a place to put things. And easily find them again. I suspect it will be easier for people to visualize Repo One as THE place for data and links to same, than folks would naturally do with One Repo.
Just a thought. Trash it if you like.
July 1, 2015 at 2:04 am
For the rest of the folks… Probably ought to review the previous blog post. Kent Anderson has paid us a visit. To call us (and Mike) an ignorant rabble. Or so it seems at first blush. I got to go try to read that, but it’s HARD to read insults. Even when they pretend not to be… actual insults.
July 1, 2015 at 3:07 am
We should chat. We’re doing the same thing at SHARE and are happy to have collaborators.
July 1, 2015 at 3:35 am
Holy crap, Mike. That is an amazing announcement. Gob-smacking, actually. Fantastic to see Index Data just go ahead and do something to address the issue while a lot of other people are just talking about it. (Not that discussing a problem is a bad thing, and plenty of what has been said has been useful, but eventually someone has to start the ball rolling and act if the problem is to be resolved or, at least, mitigated.)
Congrats to all. I really hope it kicks off. I presume that you’ll be limiting access to academics and publishers and not permitting peasants to get their grubby turnip-farming fingers on data that they wouldn’t be able to understand properly anyway?
July 1, 2015 at 4:22 am
Excellent! This is fantastic news. I hope it’ll be better known, and hence more used, than the previous attempt from the UK, the CORE project:
http://project.core.ac.uk/about-core-project
July 1, 2015 at 6:59 am
Jeff, we would love to be working with SHARE. I have your email address from the metadata of your comment — I’ll drop you a line today.
July 1, 2015 at 5:03 pm
Mike,
You may be interested in similar work I do at CrossRef. We have collected metadata on most scholarly content (OA or otherwise) across countries, publishers, disciplines, and so on.
We provide it all for free, without license restriction, to anyone via a single API here: http://api.crossref.org .
We don’t yet allow supplemental assertions from the general public, but we have thought on occasion that we would like to collect assertions, whose provenance could, perhaps, be verified by ORCID sign in.
It will be interesting to see this new initiative progress.
Karl
July 1, 2015 at 5:05 pm
We even have a UI for the API here: http://search.crossref.org although it supports nowhere near all the features of our REST API for content lookup, search and discovery.
July 1, 2015 at 5:16 pm
That is indeed interesting, Karl, many thanks! I will certainly look into this.
My sense has always been that CrossRef is limited to the more established end of the journal market: for example, several of my own papers, listed here, don’t seem to be covered, including the ones at PaleoBios (where my fourth most-cited paper is) and at the Bulletin of Zoological Nomenclature. Is that still true, or has coverage increased and diversified?
July 1, 2015 at 6:56 pm
Coverage is diversifying. We are seeing new members from less economically developed countries (helped in part by our fee waiving partnerships with organisations such as PKP and INASP and our relationships with governmental entities, for example in Brazil.) So DOI coverage is less ‘western-centric’ than it used to be.
We are seeing many smaller members, too, and our membership is long past a time when we could say large or medium sized commercial publishers and prominent societies and university presses make up most of our numbers. Entities with one or a few journals that are now able to work with CrossRef (ask Martin Eve about our tooling aimed directly at them) are becoming (probably already are) the norm.
And of course, we definitely have a handle on the more prominent gold OA journals to appear in recent years.
Unsure about palaeontology in particular or subject ‘holes’ more generally. Though I’m sure some such holes do still exist.
July 1, 2015 at 8:14 pm
I think the key distinction compared to Crossref Metadata is that One Repo will provide links back to full text in different places. Certainly the diversification of Crossref members is good and that adds a lot, particularly given many new members are OA publishers.
The real problem OR tries to tackle though is the many articles available through IRs and other smaller repositories. There’s no record of author manuscripts in Crossref metadata at the moment at least and that’s for me where the added value comes.
Of course the real end game is having well structured indexes that can be brought together. So for me One Repo helps to tackle that at one end, while Crossref is working at the other.
July 1, 2015 at 8:24 pm
Yes, I think this is a fantastic goal to aim towards.
Diversity of location and diversity of stewardship of content – the concept of multiplicity of stewardship – is lacking within the CrossRef model as it stands right now. Yes, IRs, self-hosted copies, and so on are essentially absent in CrossRef.
July 1, 2015 at 9:55 pm
Well, Karl, this sounds like there is excellent potential for real synergy between what CrossRef is already doing and what we want to bring to the party. We’ll be in touch soon to see what form that might take. Thanks for dropping by!
July 2, 2015 at 1:10 pm
Yes, nice to see this and SHARE pushing the same direction. Following up on Jeff’s note above, here is SHARE’s github repo: https://github.com/CenterForOpenScience/SHARE and a discovery portal to the dataset: https://osf.io/share/? It would be great to collaborate.
July 3, 2015 at 7:51 am
I’d extend the discussion to say what might be too obvious.
I hope a lot of thought has gone into how to make the system vandal resistant. And how to move large data sets, often, without being noticed by the other folks on the net. I think (this is just me, y’know) the best model might be the torrent.
With the data spread out and redundant a meteor strike would not render Earth stupid again. And being spread out, a page can be taken from every server to assemble into a book at yours.
You guys did that a few years ago. Didn’t ya?
July 3, 2015 at 8:35 am
This is tremendously exciting news. Thanks for undertaking this huge task and bringing together such formidable people. The mooted connection with crossref would also be wonderful – google scholar’s unreliable metadata and hidden DOIs are exasperating flaws. Some thoughts:
1. Another key source to integrate with would be BASE, the biggest repository index I know of:
http://www.base-search.net/about/en/index.php
It’s huge (70 million documents) but only ~70% OA. Setting up a sharing deal with them could save you a lot of work on repositories. Add journals and crossref integration, and you will rule OA search like an iron fist in a velvet glove.
2. Why not run a crowdfunding campaign? I believe the major sites only take a 5% cut. I’d donate like a shot.
July 3, 2015 at 11:12 am
Fantastic! That’s exactly what we’re doing in Paperity, a multidisciplinary open-access aggregator launched in 2014:
http://paperity.org/
Paperity aggregates 2500 journals and 830,000 papers already, with the aim to reach 100% coverage at some point. Until now, we’ve focused on journals, but we plan to add repositories in the next stage, as well.
Hopefully we could join the efforts when One Repo is up and running? I think building an API in One Repo is a particularly great idea, potentially very useful and important for the scholarly community. I wish you good luck in developing this functionality.
July 3, 2015 at 3:03 pm
This is an amazing idea, and exactly what is needed. Let me know if you want any help on the communication end once you get to the point where it’s ready for end users to start exploring.
July 6, 2015 at 8:05 am
A welcome initiative – but it needs to engage with Google! The objections in the white paper are not show-stoppers if the end game is access, and a Google outlet shouldn’t stop receipt of funding.
July 6, 2015 at 8:24 am
Actually, gemstest, we hope that one of the outcomes of The One Repo will be that Google Scholar is better able to index the open-access literature, because of the nice clear harvest we make available of the long-tail stuff. (And in an ideal world, Google would for that reason be one of the funding organisations.)
July 6, 2015 at 9:47 am
That’s good news. I’d hate to see this become a European Library thing. And maybe google.org could find a way to fund this.
July 6, 2015 at 8:40 pm
What do you mean by “a European Library thing”?