h1. OpenAIRE entity identifier and PID mapping policy
(copied from https://docs.google.com/document/d/1PnvZpmhbanJu3AeOT-zdIyMKIHoGKC4_Z0UtDFDZAeM/edit#)
OpenAIRE assigns an internal identifier to each object it collects.
By default, the internal identifier is generated as @sourcePrefix::md5(localId)@ where:
- @sourcePrefix@ is a 12-character namespace prefix assigned to the data source at registration time
- @localId@ is the identifier assigned to the object by the data source
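The default scheme above can be sketched as follows. This is a minimal illustration, not the OpenAIRE implementation: the function name @default_identifier@, the UTF-8 encoding of @localId@, the hexadecimal rendering of the MD5 digest, and the sample prefix @example_____@ are all assumptions that this page does not specify.

```python
import hashlib

PREFIX_LENGTH = 12  # the page states that sourcePrefix is 12 characters long


def default_identifier(source_prefix: str, local_id: str) -> str:
    """Build sourcePrefix::md5(localId).

    Assumptions (not stated in the policy): localId is encoded as UTF-8
    and the MD5 digest is rendered as lowercase hexadecimal.
    """
    if len(source_prefix) != PREFIX_LENGTH:
        raise ValueError(f"sourcePrefix must be {PREFIX_LENGTH} characters long")
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{source_prefix}::{digest}"


# "example_____" is a made-up prefix used only for illustration.
print(default_identifier("example_____", "oai:repo.example.org:1234"))
```

Because the identifier depends only on the prefix and @localId@, any change to @localId@ at the source yields a different OpenAIRE identifier for the same object.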
After years of operation, we have observed that:
- @localId@ values are unstable
- objects can disappear from sources
- PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. pre-print with the DOI of the published version, DOIs with typos)
Therefore, when a record is collected from an authoritative source:
- the identifier of the record is generated from the PID, e.g. @pidTypePrefix::md5(lowercase(doi))@
- the PID is added in a @pid@ element of the data model.
When a record is collected from a source that is not authoritative for any type of PID:
- the identifier of the record is generated as usual from the local identifier;
- the PID, if available, is added as an @alternateIdentifier@.
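The two cases above can be sketched as a single function. This is a hedged illustration under stated assumptions, not the actual OpenAIRE code: the names @build_identity@ and @pid_prefix@, the dictionary layout of the result, and the padding of the PID type with underscores up to 12 characters are hypothetical. Lowercasing is documented here only for DOIs; applying it to every PID type is a simplification.

```python
import hashlib


def _md5_hex(value: str) -> str:
    # Assumption: UTF-8 input, hexadecimal digest.
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def pid_prefix(pid_type: str, width: int = 12) -> str:
    # Assumption: the PID type is right-padded with underscores to 12 chars.
    return pid_type.ljust(width, "_")


def build_identity(source_prefix, local_id, pid_type=None, pid_value=None,
                   authoritative=False):
    """Return the record identifier and where the PID is stored."""
    record = {"id": None, "pid": [], "alternateIdentifier": []}
    has_pid = pid_type is not None and pid_value is not None
    if authoritative and has_pid:
        # Authoritative source: the identifier is derived from the PID.
        # Lowercasing every PID type is an assumption; the page shows it for DOIs.
        record["id"] = f"{pid_prefix(pid_type)}::{_md5_hex(pid_value.lower())}"
        record["pid"].append((pid_type, pid_value))
    else:
        # Non-authoritative source: fall back to the local identifier.
        record["id"] = f"{source_prefix}::{_md5_hex(local_id)}"
        if has_pid:
            record["alternateIdentifier"].append((pid_type, pid_value))
    return record


# Hypothetical values, for illustration only.
print(build_identity("example_____", "oai:repo:1", "doi", "10.1234/ABC",
                     authoritative=True)["id"])
```

With this split, a mistyped DOI supplied by a non-authoritative repository only ends up among the alternate identifiers and never influences the identifier of the record.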
As of November 2021, the following data sources are used as "PID authorities":
|_. PID Type |_. Prefix (12 chars) |_. Authority |
| doi | @doi_________@ | Crossref, Datacite, Zenodo |
| pmc | @pmc_________@ | Europe PubMed Central, PubMed Central |
| pmid | @pmid________@ | Europe PubMed Central, PubMed Central |
| arXiv | @arXiv_______@ | arXiv.org e-Print Archive |
| handle | @handle______@ | any repository |
TODO: WHAT HAPPENS FOR RECORDS WITH BOTH pmc and pmid? pmc wins?
OpenAIRE also performs duplicate identification (see the dedicated section for details).
All duplicates are "merged" together into a "representative record", which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot reuse the identifier of one of the aggregated records).
The following strategy is applied to generate the OpenAIRE identifier of a representative record, to ensure it is as stable as possible:
TODO
Updated by Claudio Atzori about 3 years ago · 3 revisions