Where do formatted references come from?

December 12, 2012

It’s an oddity to me that when publishers try to justify their existence with long lists of the valuable services they provide, they usually skip lightly over one of the few really big ones. For example, Kent Anderson’s exhausting 60-element list omitted it, and it had to be pointed out in a comment by Carol Anne Meyer:

One to add: Enhanced content linking, including CrossREF DOI reference linking, author name linking, cited-by linking, related content linking, updates and corrections linking.

(Anderson’s list sidles up to this issue in his #28, “XML generation and DTD migration” and #29, “Tagging”, but doesn’t come right out and say it.)

Although there are a few journals whose PDFs just contain references formatted as in the manuscript — as we did for our arXiv PDF — nearly all mainstream publishers go through a more elaborate process that yields more information and enables the linking that Meyer is talking about. (This is true of the new kids on the block as well as the legacy publishers.)

The reference-formatting pipeline

When I submit a manuscript with a formatted reference like:

Taylor, M.P., Hone, D.W.E., Wedel, M.J. and Naish, D. 2011. The long necks of sauropods did not evolve primarily through sexual selection. Journal of Zoology 285(2):150–161. doi:10.1111/j.1469-7998.2011.00824.x

(as indeed I did in that arXiv paper), the publisher will take that reference and break it down into structured data describing the specific paper I was referring to. It does this for various reasons: among them, it needs to provide this information for services like the Web of Knowledge.
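As a rough illustration of that breaking-down step, here is a minimal Python sketch. It is an assumption for illustration only, not how any publisher actually does it: the pattern and the function name `parse_reference` are made up here, and the regular expression understands just the one layout used in the example reference above. Real pipelines have to cope with many layouts and with author errors.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for ONE layout only:
#   Authors Year. Title. Journal Volume(Issue):Pages. doi:DOI
# The issue number and the DOI are optional; pages may use "-" or an en dash.
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?) (?P<year>\d{4})\. "
    r"(?P<title>.+?)\. "
    r"(?P<journal>.+?) (?P<volume>\d+)(?:\((?P<issue>\d+)\))?:"
    r"(?P<pages>\d+[-\u2013]\d+)\."
    r"(?: doi:(?P<doi>\S+))?$"
)


def parse_reference(text: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a formatted reference into named fields, or return None
    if the text does not follow the single layout this sketch knows."""
    match = REFERENCE_PATTERN.match(text.strip())
    return match.groupdict() if match else None


raw = (
    "Taylor, M.P., Hone, D.W.E., Wedel, M.J. and Naish, D. 2011. "
    "The long necks of sauropods did not evolve primarily through sexual "
    "selection. Journal of Zoology 285(2):150\u2013161. "
    "doi:10.1111/j.1469-7998.2011.00824.x"
)
fields = parse_reference(raw)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Once the fields are separated like this, the punctuation and ordering the author chose are gone, which is the whole point.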

Once it has this structured representation of the reference, the publication process plays it out in whatever format the journal prefers: for example, had our paper appeared in JVP, Taylor and Francis’s publication pipeline would have rendered it:

Taylor, M. P., D. W. E. Hone, M. J. Wedel, and D. Naish. 2011. The long necks of sauropods did not evolve primarily through sexual selection. Journal of Zoology 285:150–161.

(With spaces between multiple initials, initials preceding surnames for all authors except the first, an “Oxford comma” before the last author, no italics for the journal name, no bold for the volume number, the issue number omitted altogether, and the DOI inexplicably removed.)

What’s needed in a submitted reference

Here’s the key point: so long as all the relevant information is included in some format (authors, year, article title, journal title, volume, page-range), it makes no difference how it’s formatted. Because the publication process involves breaking the reference down into its component fields, thus losing all the formatting, before reassembling it in the preferred format.

And this leads us to the key question: why do journals insist that authors format their references in journal style at all? All the work that authors do to achieve this is thrown away anyway, when the reference is broken down into fields, so why do it?

And the answer of course is “there is no good reason”. Which is why several journals, including PeerJ, eLife, PLOS ONE and certain Elsevier journals have abandoned the requirement completely. (At the other end of the scale, JVP has been known to reject papers without review for such offences as using the wrong kind of dash in a page-range.)

Like so much of how we do things in scholarly publishing, requiring journal-style formatting at the submission stage is a relic of how things used to be done and makes no sense whatsoever in 2012. Before we had citation databases, the publication pipeline was much more straight-through, and the author’s references could be used “as is” in the final publication. Not any more.

How far can we go?

All of this leads me to wonder how far we can go in cutting down the author burden of referencing. Do we actually need to give all the author/title/etc. information for each reference?

In the case of references that have a DOI, I think not (though I’ve not yet discussed this with any publishers). I think that it suffices to give only the DOI. Because once you have a DOI, you can look up all the reference data. Go try it yourself: go to http://www.crossref.org/guestquery/ and paste my DOI “10.1111/j.1469-7998.2011.00824.x” into the DOI Query box at the bottom of the page. Select the “unixref” radio button and hit the Search button. Scroll down to the bottom of the results page, and voila! — an XML document containing everything you could wish to know about the referenced paper.

And the data in that structured document is of course what the publication process uses to render out the reference in the journal’s preferred style.
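For anyone who would rather script the lookup than use the web form, here is a minimal Python sketch. It assumes CrossRef's JSON REST endpoint at `https://api.crossref.org/works/` rather than the guest query form described above, and the function names are made up for illustration. The `sample_message` dictionary is hand-written from the reference quoted earlier in this post to show the general shape of a record; it is not real API output, and the network function is defined but never called here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint: CrossRef's public REST API (not the guest query form).
CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the lookup URL for a DOI, keeping the '/' in the DOI intact."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def fetch_metadata(doi: str) -> dict:
    """Fetch the record for a DOI. Makes a network call; not run in this sketch."""
    with urlopen(crossref_url(doi)) as response:
        return json.load(response)["message"]


def summarize(message: dict) -> dict:
    """Pull out the fields a reference list needs from a metadata record."""
    def first(values):
        return values[0] if values else None

    return {
        "surnames": [author.get("family") for author in message.get("author", [])],
        "year": message["issued"]["date-parts"][0][0],
        "title": first(message.get("title", [])),
        "journal": first(message.get("container-title", [])),
        "volume": message.get("volume"),
        "issue": message.get("issue"),
        "pages": message.get("page"),
    }


# Hand-written illustration of a record, NOT copied from a live response.
sample_message = {
    "author": [
        {"given": "M. P.", "family": "Taylor"},
        {"given": "D. W. E.", "family": "Hone"},
        {"given": "M. J.", "family": "Wedel"},
        {"given": "D.", "family": "Naish"},
    ],
    "issued": {"date-parts": [[2011]]},
    "title": ["The long necks of sauropods did not evolve primarily through sexual selection"],
    "container-title": ["Journal of Zoology"],
    "volume": "285",
    "issue": "2",
    "page": "150-161",
}

print(crossref_url("10.1111/j.1469-7998.2011.00824.x"))
print(summarize(sample_message))
```

To try it against the live service, call `summarize(fetch_metadata("10.1111/j.1469-7998.2011.00824.x"))`.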

Am I missing something? Or is this really all we need?


31 Responses to “Where do formatted references come from?”

  1. protohedgehog Says:

    Nice post Mike. Just a quick comment – even googling your DOI gives your paper as the top hit! It’s such a simple method of tracking down papers. Having never submitted a proper paper (for now), I can’t comment on the complexities of reference formatting.

  2. Andy Farke Says:

    One issue with “DOI only” is that they are easily machine-readable, but not human-readable. “Taylor et al. 2011” is fairly intuitive to me as a reader and a reviewer, but “10.1111/j.1469-7998.2011.00824.x” is not.

    So is your real proposal “reference-in-any-format plus DOI”?

  3. Mike Taylor Says:

    Well, I don’t exactly have a proposal at this point, just a thought experiment. But what you’d need to do to make it work is probably to have main text that says “Taylor et al. (2011) argued that the necks of sauropods were not sexually selected”, and the bibliography would have a line saying “Taylor et al. 2011. doi: 10.1111/j.1469-7998.2011.00824.x”.

  4. ech Says:

    Wikipedia has a tool where you can just add references like that: {{Cite doi|10.1111/j.1469-7998.2011.00824.x}} and it will automatically grab the details and format it for the article. With en-dashes for page ranges, *of course*. So in general it does seem like yes technically that should be all you need. It seems like every now and then a DOI can’t be found via http://dx.doi.org/ etc, but that could have been user error on my part.

  5. Matt Butler Says:

    I don’t think that the format matters. But since the publisher is not formatting references at the review stage, it seems important to include author/date/journal/title for the reviewers.

  6. Andy Farke Says:

    I otherwise mostly agree on just letting the journals add the rest of the stuff, if they’re so intent on adding value. But perhaps it’s not as simple as it seems? I would be curious to hear from those at the front lines of production who can elaborate what actually goes on. Are they using the stuff authors provide? Or just chucking it and generating something from scratch?

    As an aside that Mike probably anticipates me saying, a reference manager can take care of most of the grunt work – aside from the issues of formatting, they eliminate the problems of unreferenced citations and uncited references. Those are the real things that authors need to take care of, in my opinion, not adding spaces or em-dashes. (Yes, I know that there are still some issues with reference managers – particularly for manuscripts w/multiple authors on multiple computer systems)

  7. Andy Farke Says:

    Re: Mike’s response above, at least some basic bibliographic data are nice, particularly for prolific authors and labs. For instance, “Sereno 1999” refers to at least three different papers. But, I think we’re really arguing the same thing – it doesn’t make a lot of sense for authors to invest too much time on formatting references for a publication that may never end up in a particular journal.

  8. Mike Taylor Says:

    ech, good to see Wikipedia leading the way. Surely we’re going to end up doing something along these lines for journal articles sooner or later. Hopefully something made easy with slick tools.

    Matt Butler makes an excellent point for now, though! Yes, of course we do need to include some form of human-readable reference for the benefit of reviewers — a DOI alone would be horribly uninformative. Of course reviewers don’t care what format references are in, so long as they’re clear. There’s certainly no need to conform to journal style for them.

    As for reference managers: so far, I’ve found them all horrible to work with, but I agree that tools like this are the way to go. Crucially, as he says, they fix the real problem of ensuring that all citations are referenced and all references are cited.

    But of course the reference manager-based workflow introduces its own brain-damage. We prepare manuscripts with smart markers that point at the cited articles; then throw all that away by flattening the manuscripts before submitting them to the journals, which then painstakingly recreate all the information we threw away!

  9. Mike Taylor Says:

    It doesn’t make a lot of sense for authors to invest too much time on formatting references for a publication that may never end up in a particular journal.

    Well, I’m making a slightly stronger point. I’m saying there’s no point investing time in formatting references even if the paper does end up in a particular journal. Because those carefully formatted references will be torn apart and reconstituted anyway.

    As you know, I do agree that Author+Date citations have their own information value, aside from being pointers into the bibliography — which is why I so strongly prefer them over numbered references. If in a sauropod paper you see a citation to Wilson and Sereno 1998 you don’t need to consult the bibliography, you already know what paper it’s talking about.

  10. kaveh1000 Says:

    Congratulations, Mike. You are the winner of the “Find the elephant in the room” competition. ;-)

    To cut it short, you absolutely put your finger on it. Yes, it is as daft as it seems, and it is just a continuation of the traditional print model, where we simply needed to punctuate and abbreviate correctly, so it was less work if the author put the references and citations in the correct style to begin with. I think the publishers are not sure what to replace the old author instructions with, so they are leaving things as they are. (No distinction between subscription and OA models here.)

    The authors are probably under the impression that they are helping the publishing process by doing a bit more work but, unbeknown to them, they are hindering it. Because precisely as you have said, when the manuscript comes into the dark basement of “typesetting” (companies like ours), we have to reverse engineer those beautifully formed references, e.g. as you say by going to CrossRef, and get the full metadata, in order to provide XML. And we flush all the carefully placed author punctuations down the figurative loo.

    So the XML is the definitive archive of the client. They still need PDF according to their style. So for each journal we write a script to take the XML, and (re)create precisely the style they need, fully automatically from that XML.

    I caused a bit of a stir when I shouted my mouth off at the recent SpotOn conference, asking publishers to help put me out of business because I would rather do something more rewarding! Search for “Kaveh” here to listen

    In fact I gave two presentations a few days ago where I highlighted this very point, and with more detail. It was recorded and the recording will be up very soon. I will put a note here as soon as they are up.

    **Disclaimer** Please note that I can only speak from my own experience and with our own workflow. And I have been a bit provocative here I admit. But I would be interested in any comments from other typesetters or any publishers.

  11. 220mya Says:

    Mike – A clarification for your readers regarding JVP. Regardless of what you think about the nit-pickyness of the initial formatting review, your statement implies a hard rejection (i.e., you can’t resubmit). JVP only rejects in the sense that they want you to resubmit with the proper formatting.

  12. Bill Parker Says:

    PLoSONE has eliminated their reference formatting requirement!? (he reads after formatting the tons of references for a very large paper to submit to that venue). When!!??

  13. Nick Gardner Says:

    On the flip side of things, when you’re working on crocodylians and there are three Brochu papers in a single year alone (and often more than that!)….

  14. Mike Taylor Says:

    Thanks, Randy, important clarification on JVP — I’ll update the article to be more explicit (though, really, surely no-one could think that any journal would reject without the option to resubmit for such reasons).

    Bill, all we know (so far) about PLOS ONE’s not requiring reference formatting is in Matt Hodgkinson’s comments on the earlier SV-POW! article about reference formats. We await further instructions. (Ironically, if I’d known this, I might well have submitted our neck-anatomy paper to PLOS ONE before PeerJ opened. It has 150 references and I couldn’t face all the pointless editing.)

  15. Mike, thank you for this conversation, which is long overdue.

    It’s funny. In disciplines like the humanities and social sciences where CrossRef DOIs have not been an established part of the citation process for as long as they have been in the sciences, authors sometimes rebel against including DOIs in their references, thinking it is too much work to look them up. At CrossRef, we’re making tools available to make this easier for authors and publishers. In addition to the guest query form that you mention, we also have a prototype simple metadata search http://search.labs.crossref.org/.

    Publishers (and others) could encourage the practice by conforming to CrossRef’s DOI display guidelines (http://www.crossref.org/02publishers/doi_display_guidelines.html). If publishers always included their CrossRef DOIs with their bibliographic metadata, not only in publications, but in tables of contents, metadata feeds to third parties (like the Web of Science example you mentioned), and in the references, everyone who cares could avoid spending so much time looking them up.

    As long as I’m wishing, wouldn’t it be nice if PubMed would always include DOIs in its records?

    As Kaveh commented recently at the CrossRef annual meeting, the cleaner the data is coming in, the less work has to be done by the publisher and its vendors. Somebody countered that it makes more sense for the publisher or vendor to spend time on formatting than the author.

    I do have one quibble. In the past, actually very rarely were authors’ references used “as is”. Copyeditors have historically had to spend an inordinate amount of their time making references conform to house style, because not all authors are as careful as the commenters here seem to be. Of course, the degree of pickiness varies from publisher to publisher and even journal to journal.

    So indeed what exactly IS the value in continuing to support so many different styles?

    It’s like the legend of the cook who always cuts the end off the roast before putting it in the pan. Somebody asks her why. She shrugs and says, “I don’t know, Mom always did it that way.” She asks Mom, who shrugs and says HER mom always did it that way. Turns out Grandma had a pan that was too small.

    PeerJ and other new publishers have the advantage of not having to support legacy content, workflows, and styles. The plethora of citation formats is clearly left over from our print roots. Newer tools like PubGet, Mendeley, Zotero, and UtopiaDocs use DOIs to identify citations. So do article-level metrics.

    Until the styles are more unified, as you note, another option is to let a computer do the formatting. My colleague Karl Ward has written a clever tool that allows an author to paste in a CrossRef DOI, and get a formatted citation back in a choice of styles. This function is also available programmatically through content negotiation. What a brilliant use of computer power, saving the researcher’s time for, um, research. You can find out more about it here: http://labs.crossref.org/styled-5/citation_formatting_service.html. Karl is also responsible for the CrossRef Metadata Search interface I mentioned earlier.

    Warning: The metadata search and citation formatter tools are CrossRef Labs projects and so we must weaselly say that we don’t guarantee uptime yada yada.

    I agree with Andy that it does make sense for the author to include enough citation data in addition to the CrossRef DOI so that every human body in the creation/editing/production cycle can catch those user errors. (You know, typing the character for the letter “l” instead of the number “1”, or putting the DOI on the wrong reference, etc.)

    But please, let’s all stop cutting off the ends of the roast.
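A minimal Python sketch of the content-negotiation route mentioned in the comment above. It is not taken from the comment itself. It assumes the resolver at `https://doi.org/` (the current form of dx.doi.org) and the `text/x-bibliography` media type with a `style` parameter; the function names are made up for illustration. Only the request is built here; the function that actually goes to the network is defined but not called.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def citation_request(doi: str, style: str = "apa") -> Request:
    """Build a request asking the DOI resolver for a formatted citation
    instead of a redirect to the article page."""
    return Request(
        "https://doi.org/" + quote(doi, safe="/"),
        headers={"Accept": f"text/x-bibliography; style={style}"},
    )


def fetch_citation(doi: str, style: str = "apa") -> str:
    """Send the request and return the citation text. Makes a network call."""
    with urlopen(citation_request(doi, style)) as response:
        return response.read().decode("utf-8").strip()


request = citation_request("10.1111/j.1469-7998.2011.00824.x")
print(request.full_url)
print(request.get_header("Accept"))
```

Swapping the `style` argument is all it should take to get the same paper back in a different house style, subject to the uptime caveat above.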

  16. Oops–that last clause in the last parenthesis should have been “putting the DOI on the WRONG reference”)

    [Fixed it for you — Mike.]

  17. Mike Taylor Says:

    Thanks, Carol, lots of interesting stuff there — very helpful to see it from the CrossRef side of the fence. (Although is it really necessary to precede each occurrence of “DOI” with “CrossRef”? That’s not the way to build a brand!)

    I’d not seen those display guidelines before. Why is it better to display “http://dx.doi.org/10.1111/j.1469-7998.2011.00824.x” than “doi:10.1111/j.1469-7998.2011.00824.x“?

    PubMed omits DOIs?! Say it ain’t so! It horrifies me that JVP seems to remove them, too.

    It’s nice to know that historically it was copy-editors who did the drudge work of reformatting authors’ references into journals’ favoured styles. But it makes it even more inexplicable that this work has moved to authors now, just when it’s not needed at all any more! (Of course the broader question is why the heck the journals all need different styles anyway.)

    Thanks for the links to the Labs projects.

  18. Ah, branding… in the beginning, any DOI was very likely to have been assigned by CrossRef. And DOIs were supposed to be an under-the-hood thing that only publishers really needed to know about. So CrossRef didn’t give much thought to what they were called. But, as you point out here, they have become very useful for scholars, librarians and researchers as well.

    CrossRef isn’t the only registration agency (RA) for DOIs. The entertainment industry assigns DOIs to music and films (They call them EIDRs). Bowker can assign DOIs (they call them ISBN-As) to books for tracking sales.

    Nor is CrossRef the only DOI RA in the scholarly community. DataCite is assigning DOIs. figshare is assigning DOIs. medRA is assigning DOIs in Italy. WanFang Data assigns DOIs in China.

    Although DOI resolution is interoperable among RAs (which means http://dx.doi.org will work for any of them), the metadata that CrossRef has available for lookup is restricted to documents (books and book chapters, journal articles, proceedings papers, datasets, theses, etc) deposited into the CrossRef system–which is an application layer unique to CrossRef on top of the DOI infrastructure. Some RAs (like MedRA) deposit their metadata at CrossRef so a search of CrossRef will return their DOIs. Some do not.

    We’ve done some work on making DOI services from different RAs interoperable. For example, DataCite and CrossRef both support content negotiation. But as more RAs assign more DOIs, people will become confused about why a service works with one DOI rather than another.

    CrossRef does have about 90% of the DOIs. But that number is changing. And that’s fine. But it can create confusion between the DOI itself and the services offered by particular registration agencies such as the ones I mentioned. So for example, if you took a DOI from the entertainment industry and put it into the CrossRef citation formatter, you wouldn’t get a meaningful result. Hence the use of “CrossRef DOI”.

    In retrospect, maybe we should have called them something else. But we didn’t.

    Aren’t you glad you asked? :)

    The rationale for displaying the DOI as a full hypertext link is actually in the document I pointed to. In a nutshell, it makes it easier for computers to interpret DOIs correctly, it identifies to users what the heck a DOI is (you’d be surprised at the number of researchers who do not know), and it allows people to do stuff like use the “Copy link” features of their browsers.

  19. Interesting post Mike. However, you presuppose one thing. You are assuming that authors get their references right. As someone who has worked in journal editorial offices for 15 years I can tell you that is almost never the case (certainly in the journals that I have worked with).

    Virtually every submitted reference list contains errors, even if the author has used some sort of reference management software.

    These errors range from the relatively harmless, such as incorrect or missing authors or last page numbers, to more serious mistakes that could hamper discovery and/or linking such as the wrong year of publication or missing volume numbers through to “fatal” errors such as completely wrong journals or titles from one paper coupled with publication information from another paper altogether.

    While I totally agree that it is pointless to insist on a particular format from the author, for exactly the reasons you outline, it would be disastrous to allow authors to provide references as DOIs only.

    I don’t think there is any reason to assume that authors would be more accurate in providing DOIs than they are in providing references in journal style. However, a mistake in a DOI would be much more troublesome than most mistakes in formatted references.

    If a single character in the DOI is incorrect, the journal now has no idea what paper is being cited, other than perhaps what journal the article was published in, and would have to query the author for every reference error.

    By asking authors to provide a reasonably full reference in a generic format, it is possible in most cases of reference errors to figure out what paper is being cited without having to pester the author, who has better things to do, like write more papers.

  20. Isn’t the right approach here to provide the information in a tagged form, rather than specifically formatted? Use one of the available ontologies like the Bibliographic Ontology, or one of the SPAR ontologies, or one of various markup languages… write the semantics, not the formatting!

  21. Sorry – something I missed. A DOI-only reference would give you nothing to cross-check against (except the context of the citation, which is laborious and often open to interpretation), so in many cases the journal might not even know that the DOI was incorrect. Reasonably full journal references give several comparison points that often allow the incorrect element to be identified by the balance of probabilities amongst a large number of correct elements.
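The cross-checking idea in the comment above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any real editorial tool: the function names are invented, the field names are arbitrary, and the "fewer than half the fields disagree" threshold is a made-up stand-in for the "balance of probabilities".

```python
from typing import Dict, List


def normalise(value) -> str:
    """Lower-case and drop everything except letters and digits, so that
    '150-161' and '150\u2013161' compare as equal."""
    return "".join(ch for ch in str(value).lower() if ch.isalnum())


def disagreements(submitted: Dict[str, str], looked_up: Dict[str, str]) -> List[str]:
    """Names of the fields present in both records whose values differ."""
    shared = sorted(set(submitted) & set(looked_up))
    return [f for f in shared if normalise(submitted[f]) != normalise(looked_up[f])]


def diagnose(submitted: Dict[str, str], looked_up: Dict[str, str]) -> str:
    """Guess whether a mismatch is a slip in the reference or a wrong DOI.
    The one-half threshold is arbitrary, chosen only for this sketch."""
    shared = set(submitted) & set(looked_up)
    bad = disagreements(submitted, looked_up)
    if not bad:
        return "consistent"
    if len(bad) * 2 < len(shared):
        return "probable typo in: " + ", ".join(bad)
    return "DOI probably points at a different paper"


# What the author typed versus what the DOI lookup returned (hand-written).
typed = {"first_author": "Taylor", "year": "2012", "journal": "Journal of Zoology",
         "volume": "285", "pages": "150\u2013161"}
from_doi = {"first_author": "Taylor", "year": "2011", "journal": "Journal of Zoology",
            "volume": "285", "pages": "150-161"}
print(diagnose(typed, from_doi))
```

With a DOI-only reference there is no `typed` record to compare against, which is exactly the problem being described.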

  22. @Carol: Thank you for a nice “meaty” reply. I will certainly use the Cook and the Roast analogy again. ;-)

    Karl Ward has done great work on CSL (Citation Style Language). But my worry is that authors are going to start using that and dumping beautifully formed styles on us typesetters, and we still have to do the reverse engineering. In my opinion, CSL should be an automated styler that allows authors to read papers with any style they prefer. And journals should just dump the idea of “house style”. I have done a live demo of this and will share video soon.

    @Robin: You are right that it is possible for authors to get the DOI slightly wrong, but only if they type them in by hand. This post (and replies) is full of long URIs. Did we get any of them wrong? No, because we didn’t type them in. We copied and pasted. So the authoring system should have a one-click system of inserting any reference from their ref manager. And if they put the wrong URI in? Well tough. It’s up to them to check. ;-)

  23. @Kaveh – that’s certainly a valid point (i.e. errors can, mostly, be avoided by copy and paste, although I guarantee that authors can and will propagate incorrect DOIs by copying and pasting them from non-authoritative online resources, just as many incorrect or non-existent references circulate in the literature, but that’s splitting hairs).

    However, there are already tools that authors can use to avoid reference errors; EndNote has been around for more than 20 years, and resources such as PubMed and many online journals allow authors to download authoritative references directly into their citation management software.

    But, authors still make mistakes: they subvert these systems because they aren’t as easy as typing in references by hand.

    Publishers (etc.) can therefore take one of two courses (if they value the accuracy of their reference lists):

    (i) deal with author behaviour by investing in reference validation and correction tools, such as those available in the eXtyles software (note for transparency: I am representing Inera, the company behind eXtyles, in the UK), or

    (ii) build a tool, along the lines that you outline (although I’m not entirely clear how this differs from e.g. EndNote if it is used properly; presumably you are talking about incorporating this sort of tool into online submission systems, rather than into authors’ desktop applications), that is so easy to use that authors will use it in preference to typing their reference lists by hand.

    I guess which of (i) and (ii) is the easier course is open to discussion.

  24. Kaveh, The CrossRef citation formatter should include the DOI so that you don’t have to reverse engineer the metadata. We are trying to encourage style guides to include the url version of the DOI in their reference formats, and some have done so.

    Robin, For the record, I do believe that Editorial Manager offers its customers an option to use Inera eXtyles to check references in submissions. Not sure if the others do or not. (Disclosure–I used to work for Aries Systems, but my information may be old. Oh, and if you couldn’t tell from my post, I’m currently at CrossRef.)

  25. kaveh1000 Says:

    As promised, here’s the recording of my talk on how XML should be used, and why we should just dump the idea of the “house style” ;-)

  26. philliplord Says:

    Mike, this is a good idea. In fact, my tool called kcite http://knowledgeblog.org/kcite-plugin has supported this for several years. References are added as DOI, arXiv ID, PUBMED ID, or any URI. The reference list is generated automatically, using data from a variety of different sources. It even works against your blog. If you poke it a bit, it will even let the READERS choose the reference style.

    As well as being easy to read, it has another advantage. While you are authoring a blog post, if you use the wrong URL, you get the wrong reference (or none at all). So, it effectively means that you check your links are correct. And everyone else gets the correct reference, and correct DOI. So everyone wins.

    Of course, this is only useful for WordPress. There is a similar tool for Jekyll also. Why do the publishers not do this? I don’t know. It’s not that hard.

  27. philliplord Says:

    @kaveh CSL is all very well, but it codifies the idea that we need 3000 citation styles. We just don’t. Most of them make no sense on the web anyway, and supporting a generic process comes with a huge load in terms of processing.

    I think it’s time we grew up, and just threw away the silly notion that we need so many styles. There *are* real differences between numeric and author year styles (the latter is undefined where the author’s name is not first,last compliant, while the former is opaque to the reader). But most of the distinctions are about “italics here, indent three, make this font smaller”. None of these visual layout issues makes any sense at all on the web, and little sense in the face of highly varied screen sizes.

    Another piece of baggage from the paper based publishing industry we should throw away.

  28. kaveh1000 Says:

    @phillip I think we are on the same side. But we can have the best of both worlds. Have the content in structured format, on a server say, and generate the “style” on the fly. Someone may actually want a particular one of the 3000 styles, for their own reasons. Well, something like CSL (or a similar tool) can generate that at the point of delivery. The author can even produce their own new multicolored style if they want.

  29. […] that is still, for historical reasons, known as “typesetting” — that is, the transformation of the manuscript from an opaque form like an MS-Word file (or indeed a stack of hand-written sheets) into a […]

  30. Guido Frede Says:

    Hey Mike,

    Very nice article. I also often wonder why journals insist on a certain style if most of them use a certain tool to reformat the references anyway. However, it would be nice to have further information about those processes.
    Where did you get that information, e.g. about the reference formatting at T&F?
    Can you provide further information about this topic?
    Best wishes

  31. Mike Taylor Says:

    I’m afraid I don’t know any more than I put into the article when I wrote it. I know that publishers do this, and I think it’s about the most valuable service they provide. But I don’t know how they do it. Kaveh would know more: do read the comments if you haven’t already.

