Ian Mulvany

March 10, 2021

Why should publishers care about persistent identifiers (PIDs)?

#PIDs #scpb #infrastructure #ORCID #DOIs 


We are doing a big piece of metadata standardisation work at BMJ - the name of the internal group is the Research Data Integrity Group. We are also looking at where our processes diverge across our journals, and are aiming to have metadata and processes diverge only where they must, rather than where they might.

As part of these discussions persistent identifiers (PIDs) have come up. I've been asked to provide answers to a few short, very good questions on the topic, and I thought the responses were worth sharing more widely.

Before getting into these answers, though, let's look at one perspective on why publishers exist, as that informs my own thinking on the answers to these questions. I'll keep this brief, as otherwise the answer could spiral out a bit and I'm a bit pressed for time, but a core obligation we have is to uphold the integrity of the scholarly record. That means supporting the infrastructure to publish well-structured content, be that good Dublin Core metadata on our webpages, well-structured XML, or, dare I say it, easy-to-access PDFs. In a nutshell, by doing so we enable the content we publish to exist more easily within an ecosystem of knowledge, and the nature of scholarly work is that it is embedded in an ecosystem. It never stands alone; it explicitly calls to the past and looks to the future. Each paper, finding, review and commentary is, hopefully, enriching our tapestry of understanding of the universe around us.

Within this wider ecosystem, persistent identifiers have emerged as a tool for connecting bits of the scholarly record, and while some of them are well established, such as digital object identifiers (DOIs), there is still a lot of scope for improvement. One might say that in the end machine learning is going to crack the problem, but even then some form of synthetic identifier will need to be created, and having good PIDs in place can only help any efforts that folks like scite.ai or scholarly can bring to the party.

OK, on to the questions:


What are the industry-standard PIDs that we should be capturing?

Certainly:
ORCIDs
DOIs - of many stripes 
FundRef identifiers (now the Crossref Funder Registry)

Probably:
ROR
Connected, but not explicitly a PID - use of the CRediT taxonomy 

I'm not sure where I sit when it comes to identifiers for objects of study or concepts, e.g. gene ontologies or MeSH terms. They should be supported, but I think they are a slightly different beast from PIDs for scholarly infrastructure.
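If we are going to capture ORCIDs and DOIs, it helps to check them at the point of capture. Below is a minimal sketch of what that could look like, not a description of any system we run. The ORCID check digit follows the ISO 7064 MOD 11-2 scheme that ORCID documents; the DOI pattern is a commonly used heuristic rather than a formal definition, and the function names are my own. The sample ORCID is the well-known test record and the DOI is purely illustrative.

```python
import re
from typing import Optional

# Heuristic pattern for modern DOIs; an assumption, not a formal grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 ORCID digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Accept a bare ORCID iD or its https://orcid.org/ URL form."""
    bare = orcid.strip().rsplit("/", 1)[-1].replace("-", "").upper()
    if not re.fullmatch(r"\d{15}[\dX]", bare):
        return False
    return orcid_check_digit(bare[:15]) == bare[15]


def normalise_doi(raw: str) -> Optional[str]:
    """Strip common prefixes and lowercase; return None if it is not DOI-like."""
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", raw.strip(),
                 flags=re.IGNORECASE).lower()
    return doi if DOI_PATTERN.match(doi) else None


if __name__ == "__main__":
    print(is_valid_orcid("https://orcid.org/0000-0002-1825-0097"))
    print(normalise_doi("https://doi.org/10.5555/12345678"))
```

Lowercasing is safe because DOIs are case-insensitive, and storing one normalised form makes later matching across systems much easier.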


Proprietary PIDs are generally unpopular with researchers, should we be avoiding them?

The one that springs to mind here is Ringgold, so let's talk about that as an example. I would be led here by utility, and take a broad approach where possible. If we have systems that are Ringgold-centric, and we work well with those systems, I would not advise replacing them just for the sake of it, but rather seeing if we can enhance our records with a bag-of-identifiers approach, where the same record carries the Ringgold identifier alongside open identifiers such as ROR.

In addition, any organisation will have a range of internal unique identifiers, though these are not PIDs per se. Again, there is little reason to root out internal identifiers that are working well just for the sake of doing so; often the place where an identifier is coined is deep in some system that you do not want to unpick. Rather, enhance, and create a mapping to a PID where possible.
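To make the bag-of-identifiers idea concrete, here is a small sketch. Everything in it is hypothetical: the class names, the scheme labels and the identifier values are placeholders I have made up for illustration, not real Ringgold or ROR records. The point is only that one record can carry several identifiers, and that we can reach it through any of them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Organisation:
    """One organisation record carrying a bag of identifiers."""
    name: str
    identifiers: Dict[str, str] = field(default_factory=dict)


class OrganisationIndex:
    """Look up an organisation by any identifier it carries."""

    def __init__(self) -> None:
        self._by_identifier: Dict[Tuple[str, str], Organisation] = {}

    def add(self, org: Organisation) -> None:
        for scheme, value in org.identifiers.items():
            self._by_identifier[(scheme.lower(), value)] = org

    def enrich(self, org: Organisation, scheme: str, value: str) -> None:
        """Attach an extra identifier without replacing the existing ones."""
        org.identifiers[scheme.lower()] = value
        self._by_identifier[(scheme.lower(), value)] = org

    def find(self, scheme: str, value: str) -> Optional[Organisation]:
        return self._by_identifier.get((scheme.lower(), value))


if __name__ == "__main__":
    # Placeholder values only; none of these are real identifiers.
    org = Organisation("Example University",
                       {"ringgold": "RG-PLACEHOLDER", "internal": "org-42"})
    index = OrganisationIndex()
    index.add(org)
    index.enrich(org, "ror", "https://ror.org/PLACEHOLDER")
    print(index.find("internal", "org-42").identifiers)
```

The existing Ringgold and internal keys keep working for the systems that depend on them, while anything new can use the open identifier.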

What are 'common' PIDs that are currently appearing across a lot of our products that would help us build narratives across platforms?

Same answer as the first question.


Best practice for centralising PIDs? I'd guess Business Objects?

There is no one right answer to this. As long as our own friction in traversing our identifiers is as low as possible, we'll be good.

People working on PID graphs have been using systems that support GraphQL endpoints, and that looks really promising. Martin Fenner is the key person looking at this right now.
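As a rough illustration of what traversing a PID graph over GraphQL might look like, here is a sketch that builds a query asking for the works connected to one ORCID. Treat the details as assumptions: I am assuming the DataCite GraphQL endpoint at `https://api.datacite.org/graphql`, and the field names in the query (`person`, `works`, `titles`) should be checked against the published schema before relying on them. The helper names are my own, and the sample ORCID is the well-known test record.

```python
import json
import urllib.request

# Assumed endpoint; verify against the DataCite documentation.
DATACITE_GRAPHQL_URL = "https://api.datacite.org/graphql"

# Field names below are assumptions about the schema, for illustration only.
PERSON_WORKS_QUERY = """
query PersonWorks($orcid: ID!) {
  person(id: $orcid) {
    id
    name
    works(first: 5) {
      totalCount
      nodes { id titles { title } }
    }
  }
}
"""


def build_request(orcid_url: str) -> urllib.request.Request:
    """Build a POST request carrying the query and its variables as JSON."""
    payload = {"query": PERSON_WORKS_QUERY, "variables": {"orcid": orcid_url}}
    return urllib.request.Request(
        DATACITE_GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def fetch(orcid_url: str) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(orcid_url), timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    request = build_request("https://orcid.org/0000-0002-1825-0097")
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

What appeals to me is that one request follows the links from a person to their outputs, rather than us stitching together several REST calls ourselves.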

Data discovery platforms are emerging that have been built for a post-identifier world and that do some truly amazing things; have a look at https://metadataday2020.splashthat.com/ for a bag of inspiration on these. DataHub, Apache Gobblin and Amundsen are three newcomers here.

I'm much less concerned with the underlying technology than with whether whatever we use makes it easy to answer the kinds of questions we are interested in. Let's think about those first.

The one thing that is really important is that we embed PIDs well into our article XML, and ideally into the metadata of our webpages. That allows our content to interoperate with the wider scholarly ecosystem.
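Here is a small sketch of what embedding PIDs can look like in practice. It emits a JATS-style `article-id` and `contrib-id`, plus two HTML meta tags. The element and attribute names follow common JATS usage, and `citation_doi` and `DC.identifier` are widely used meta tag names, but the exact conventions we adopt (for example, whether `DC.identifier` holds `doi:...` or a full URL) are assumptions to settle against JATS4R and our own platform. The function names, the DOI and the ORCID are illustrative only.

```python
import xml.etree.ElementTree as ET
from html import escape
from typing import List


def jats_article_id(doi: str) -> str:
    """Serialise a DOI as a JATS <article-id> element."""
    el = ET.Element("article-id", {"pub-id-type": "doi"})
    el.text = doi
    return ET.tostring(el, encoding="unicode")


def jats_contrib_orcid(orcid_url: str) -> str:
    """Serialise an author's ORCID as a <contrib-id> inside <contrib>."""
    contrib = ET.Element("contrib", {"contrib-type": "author"})
    contrib_id = ET.SubElement(contrib, "contrib-id",
                               {"contrib-id-type": "orcid"})
    contrib_id.text = orcid_url
    return ET.tostring(contrib, encoding="unicode")


def html_meta_tags(doi: str) -> List[str]:
    """Build meta tags exposing the DOI in a webpage's <head>."""
    safe = escape(doi, quote=True)
    return [
        f'<meta name="citation_doi" content="{safe}">',
        f'<meta name="DC.identifier" content="doi:{safe}">',
    ]


if __name__ == "__main__":
    print(jats_article_id("10.5555/12345678"))
    print(jats_contrib_orcid("https://orcid.org/0000-0002-1825-0097"))
    print("\n".join(html_meta_tags("10.5555/12345678")))
```

Generating both from the same source record keeps the XML and the webpage from drifting apart.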



Any interesting products or developments that have leveraged PIDs that we should be aware of?

See below. 






Some useful links and resources:

Forum:
[The PID Forum - The PID Forum](https://www.pidforum.org/)


Registries:
[Funder Registry - Crossref](https://www.crossref.org/services/funder-registry/)
[ROR](https://ror.org/scope/)
[Dryad Home - Publish and Preserve your Data](https://datadryad.org/stash)
[Welcome to DataCite](https://datacite.org/)
[ORCID](https://orcid.org/)


Some services using PIDs
[Zenodo - Research. Shared.](https://zenodo.org/)
[Making Your Code Citable · GitHub Guides](https://guides.github.com/activities/citable-code/)


Projects building on top of PIDs
[Introducing the PID Graph — FREYA](https://www.project-freya.eu/en/blogs/blogs/the-pid-graph)
[Scholix](http://www.scholix.org/)
[Rescognito](https://rescognito.com/) - a way of distributing credit, built on top of the PID graph


Data Dumps Available
[Semantic Scholar Open Research Corpus](http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/legal/)
[Microsoft Academic Graph data schema - Microsoft Academic Services | Microsoft Docs](https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema)
[Open Academic Graph - Microsoft Research](https://www.microsoft.com/en-us/research/project/open-academic-graph/)

Also useful:
[JATS4R](https://jats4r.org)