Alessia Bardi, 10/11/2021 05:15 PM


DOIBoost

DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID

The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:

  • La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: https://doi.org/10.5281/zenodo.1441071

In short, the goal is to enrich the records available on Crossref with information from Unpaywall, Microsoft Academic Graph, and ORCID, by intersecting all these datasets by DOI.

Inputs

  • Crossref: dump available to Crossref subscribers via MetadataPlus service, updated once a month.
  • Microsoft Academic Graph: version downloaded on 2021-02-15. We plan to download the latest version in December 2021, before MAG is retired.
  • ORCID: baseline dump obtained on 2020-10-13, updated every week from the ORCID public API
  • Unpaywall: public database snapshot downloaded in March 2021. Unpaywall updates it twice a year (https://unpaywall.org/products/snapshot)

The generation of DOIBoost consists of the following phases:

1 Filter out Crossref records that

  • have blank title
  • have one of the following publishers: "Test accounts", "CrossRef Test Account"
  • have no authors with valid names, where valid means: not blank and different from all strings in this list: List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")
  • have "Addie Jackson" as author and "Elsevier BV" as publisher (empirically, these are test records)
  • do not have one of the following values in the type field:
    • "book-section"
    • "book"
    • "book-chapter"
    • "book-part"
    • "book-series"
    • "book-set"
    • "book-track"
    • "edited-book"
    • "reference-book"
    • "monograph"
    • "journal-article"
    • "dissertation"
    • "other"
    • "peer-review"
    • "proceedings"
    • "proceedings-article"
    • "reference-entry"
    • "report"
    • "report-series"
    • "standard"
    • "standard-series"
    • "posted-content"
    • "dataset"

Records with type=dataset are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication.
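
The filtering rules above can be sketched as a single predicate. This is a minimal illustration, not the actual DOIBoost code: the flat record layout (a dict with "title", "publisher", "type" and a list of author full names under "authors") is an assumption made for the example, as is the case-insensitive comparison of author names.

```python
# ASSUMPTION: simplified record layout, not the real Crossref JSON schema.
INVALID_AUTHOR_NAMES = {
    ",", "none none", "none, none", "none &na;", "(:null)",
    "test test test", "test test", "test", "&na; &na;",
}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def has_valid_author(authors):
    """True if at least one name is not blank and not in the invalid list."""
    # ASSUMPTION: names are compared case-insensitively.
    return any(
        name.strip() and name.strip().lower() not in INVALID_AUTHOR_NAMES
        for name in authors
    )


def keep_record(record):
    """Return True if the record survives all the filters of phase 1."""
    if not (record.get("title") or "").strip():
        return False  # blank title
    publisher = record.get("publisher") or ""
    if publisher in TEST_PUBLISHERS:
        return False  # test publisher
    authors = record.get("authors") or []
    if not has_valid_author(authors):
        return False  # no author with a valid name
    if "Addie Jackson" in authors and publisher == "Elsevier BV":
        return False  # empirically identified test records
    return record.get("type") in ACCEPTED_TYPES


def openaire_result_type(record):
    """Datasets stay datasets; every other accepted type is a publication."""
    return "dataset" if record.get("type") == "dataset" else "publication"
```

For example, a record of type "journal-article" with a non-blank title and one valid author is kept and mapped to a publication, while the same record with publisher "Test accounts" is dropped.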

Mapping Crossref properties into the OpenAIRE Research Graph

Properties in OpenAIRE results are set according to the logic described in the following table:

TODO: ensure we use the field names of the public dump

OpenAIRE Result field path | Crossref path(s) | Notes
pid | doi, clinical-trial-number, alternative-id | the doi is normalised and lowercased
dateofcollection | indexed.datetime |
collectedfrom.name | | Default value "Crossref"
collectedfrom.id | | TODO Default value ID
publisher | publisher |
title | title | as main title
title | original-title, short-title | as alternative title
title | subtitle | as subtitle
description | abstract |
source | source | only if the record is not of type book
source | ${container-title.head} ISBN: ${ISBN.head} | only if the record is of type book
dateofacceptance | issued.datetime | or, if not available, created.datetime
relevantdate | created.datetime, posted.datetime, accepted.datetime, published-print, published-online |
subject | subject | with classid='keywords', i.e. no controlled vocabularies for Crossref subjects
author | author | if available, the sequence is mapped to rank and the ORCID is also mapped (as 'orcid_pending')
journal | | only for publications
journal.name | container-title.head |
journal.eissn | issn-type.value | if issn-type.type='electronic'
journal.issn | issn-type.value | if issn-type.type='print'
journal.vol | volume |
journal.sp | page | before '-'
journal.ep | page | after '-'
instance | | TODO One instance is created . . .
instance.license | license.URL | If there is a license.content-version='vor', then this is used. Otherwise the first license entry is used.
instance.pid | | the list of pids as in the first row of this table
instance.refereed | | set to 'peerReviewed' only if relation.has-review.id is not empty
instance.instancetype | subtype | mapped using the OpenAIRE vocabularies
instance.collectedfrom | | as in result.collectedfrom above
instance.dateofacceptance | | as in result.dateofacceptance above
instance.url | URL, link.URL | there may be different URLs in the same instance
instance.accessright.value | | based on license and dateofacceptance. UNKNOWN: if license is blank; OPEN ACCESS: if the license is a CC license or an ACS license or an APA license (considered OPEN also by Unpaywall, see Unpaywall FAQ for details), or if it is an OUP license, but only after 12 months from the publication date; EMBARGO: OUP license, before 12 months from the publication date; CLOSED: if there is a license not covered by the previous cases
instance.accessright.openaccessroute | | only if instance.accessright.value = 'OPEN ACCESS'. Default is 'hybrid'. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list.
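
The rules for instance.accessright.value and the default route can be read as a small decision function. The sketch below is illustrative only: the URL markers used to recognise CC, ACS, APA and OUP licenses are assumptions (the page does not say how licenses are recognised), and so is the choice to treat exactly 12 months as already open.

```python
from datetime import date

# ASSUMPTION: hypothetical substring markers for the license families.
OPEN_LICENSE_MARKERS = ("creativecommons.org", "pubs.acs.org", "apa.org")
OUP_MARKER = "academic.oup.com"


def add_months(d, months):
    """Shift a date by a number of months (day clamped to 28 to stay valid)."""
    month_index = d.month - 1 + months
    return date(d.year + month_index // 12, month_index % 12 + 1, min(d.day, 28))


def access_right(license_url, date_of_acceptance, today):
    """Return (accessright value, open access route) for one instance."""
    if not license_url or not license_url.strip():
        return ("UNKNOWN", None)
    url = license_url.lower()
    if any(marker in url for marker in OPEN_LICENSE_MARKERS):
        return ("OPEN ACCESS", "hybrid")  # 'hybrid' is the default route
    if OUP_MARKER in url:
        # OUP content becomes open 12 months after the publication date.
        if today >= add_months(date_of_acceptance, 12):
            return ("OPEN ACCESS", "hybrid")
        return ("EMBARGO", None)
    return ("CLOSED", None)  # a license not covered by the previous cases
```

The route returned here is only the starting value: as noted above, it may be changed later when intersecting with Unpaywall and when patching the hostedby.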

2 Map Crossref links to projects/funders

Links to funding available in Crossref are mapped as funding relationships (result -- isProducedBy --> project) applying the following mapping:

funder | grant code | Link to
DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665} or name: 'European Union’s Horizon 2020 research and innovation program' | series of 4-9 digits in award | Link to H2020 project
DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780} | series of 4-9 digits in award | Link to FP7 project
DOI: 10.13039/501100000781 OR name: 'European Union's' | series of 4-9 digits in award | Link to FP7 or H2020 project
DOI: 10.13039/100000001 | award | Link to NSF project
DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'} | award | Link to ANR project
DOI: 10.13039/501100002341 | award | Link to Academy of Finland project
DOI: 10.13039/501100001602 | award, removing the initial 'SFI' if present | Link to SFI project
DOI: 10.13039/501100000923 | award | Link to ARC project
DOI: 10.13039/501100000038 | award (ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE) | Link to NSERC (unidentified project)
DOI: 10.13039/501100000155 | award (ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE) | Link to SSHRC (unidentified project)
DOI: 10.13039/501100000024 | award (ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE) | Link to CIHR (unidentified project)
DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado' | award | Link to CONICYT project
DOI: 10.13039/501100003448 | series of 4-9 digits in award | Link to GSRT project
DOI: 10.13039/501100010198 | award | Link to SGOV project
DOI: 10.13039/501100004564 | series of 4-9 digits in award | Link to MESTD project
DOI: 10.13039/501100003407 | award | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (unidentified project) is also generated
DOI: {10.13039/501100006588, 10.13039/501100004488} | award, removing 'Project No' and 'HRZZ' prefix, if present | Link to HRZZ or MZOS project
DOI: 10.13039/501100006769 | award | Link to Russian Science Foundation project
DOI: 10.13039/501100001711 | award after '_' and before '/' | Link to SNSF project
DOI: 10.13039/501100004410 | award | Link to TUBITAK project
DOI: 10.13039/100004440 or name: 'Wellcome Trust Masters Fellowship' | award | Link to Wellcome Trust specific project and to the unidentified project.
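
Several of the rules above normalise the award string before it is used as a project code. The following sketch shows three of them (the 4-9 digit series, the SFI prefix removal, and the SNSF rule) for a handful of funder DOIs taken from the mapping. The helper names, the lookup structure and the sample award strings are hypothetical, and the regular expression assumes that a run longer than 9 digits should not be truncated into a match.

```python
import re


def first_digit_series(award):
    """Return the first run of 4 to 9 digits in the award, or None."""
    # ASSUMPTION: runs of 10 or more digits are not considered a match.
    match = re.search(r"(?<!\d)\d{4,9}(?!\d)", award)
    return match.group(0) if match else None


def sfi_code(award):
    """Remove the initial 'SFI', if present."""
    award = award.strip()
    return award[3:].strip() if award.startswith("SFI") else award


def snsf_code(award):
    """Keep the part of the award after '_' and before '/'."""
    # ASSUMPTION: an award without '_' is used as a whole.
    after = award.split("_", 1)[1] if "_" in award else award
    return after.split("/", 1)[0]


# Funder DOI -> (funder label, normalisation applied to the award).
AWARD_RULES = {
    "10.13039/501100003448": ("GSRT", first_digit_series),
    "10.13039/501100004564": ("MESTD", first_digit_series),
    "10.13039/501100001602": ("SFI", sfi_code),
    "10.13039/501100001711": ("SNSF", snsf_code),
}


def project_link(funder_doi, award):
    """Return (funder, project code) or None if no rule yields a code."""
    rule = AWARD_RULES.get(funder_doi)
    if rule is None:
        return None
    funder, normalise = rule
    code = normalise(award)
    return (funder, code) if code else None
```

With these helpers, an illustrative award such as "200021_169125/1" under the SNSF funder DOI yields the project code "169125".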

3 Intersect Crossref with Unpaywall by DOI (DOIBoost1)

The fields we consider from Unpaywall are:
  • is_oa
  • best_oa_location
  • oa_status

The results of Crossref that intersect by DOI with Unpaywall records are enriched with:

TODO: ensure we refer to json fields of the public dump

OpenAIRE Result field path | Unpaywall field path | Notes
result.instance | | created only if is_oa and a best_oa_location is available
result.instance.collectedfrom.name | | default value "Unpaywall"
result.instance.collectedfrom.id | | default value TODO
result.instance.url | best_oa_location |
result.instance.license | best_oa_location.license |
result.instance.pid | doi |
result.instance.accessright | | default value Open Access: we do not add instances if Unpaywall says there is no open version
result.instance.accessright.route | oa_status |

For the definition of Unpaywall's oa_status refer to the Unpaywall FAQ
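
As a rough sketch of this phase, the function below builds the additional instance from one Unpaywall record. The output layout is a simplified, hypothetical stand-in for the OpenAIRE model. Reading the URL from best_oa_location.url is an assumption about which part of best_oa_location is used, and collectedfrom.id is left out because its value is still marked as TODO above.

```python
def unpaywall_instance(record):
    """Build an instance from an Unpaywall record, or None if not open."""
    location = record.get("best_oa_location")
    if not record.get("is_oa") or not location:
        # No instance is added when Unpaywall reports no open version.
        return None
    return {
        "collectedfrom": {"name": "Unpaywall"},
        "url": location.get("url"),  # ASSUMPTION: url of the best location
        "license": location.get("license"),
        "pid": record.get("doi"),
        "accessright": {
            "value": "Open Access",
            "route": record.get("oa_status"),
        },
    }
```

A record with is_oa set to false, or without a best_oa_location, produces no instance, so the Crossref result is left unchanged in that case.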

4 Intersect DOIBoost1 with ORCID (DOIBoost2)

The fields we consider from ORCID are:
  • doi
  • authors: a list of authors, each of which comes with optional name, surname, creditName, and oid

OpenAIRE field path | ORCID path | Notes
pid | doi |
author.name | capitalize(name) | only mapped if not blank
author.surname | capitalize(surname) | only mapped if not blank
author.fullname | | if name and surname are not blank, they are concatenated (capitalize(name) capitalize(surname)), otherwise we use the creditName
author.pid | oid | as confirmed ORCID identifier (in contrast to the 'orcid_pending' set from Crossref and Unpaywall)

The records are enriched with the ORCID identifiers of their authors:
  • if the number of authors from Crossref equals the number of authors from ORCID, we pick the list of authors with more PIDs and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance between authors' names, surnames, or fullnames, depending on which properties are available;
  • if the numbers of authors differ, we take the longer list and try to enrich it with the PIDs from the other author list, based on the Jaro-Winkler distance between authors' names, surnames, or fullnames, depending on which properties are available
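
The author matching described above can be sketched as follows. This is an illustration under several assumptions, not the DOIBoost implementation: authors are plain dicts with a "fullname" and an optional "orcid" key, only full names are compared, the acceptance threshold of 0.9 is a hypothetical value, and ties in the number of PIDs are resolved in favour of the Crossref list.

```python
def jaro(a, b):
    """Jaro similarity between two strings, in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_a, matched_b = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(a)):
        if matched_a[i]:
            while not matched_b[k]:
                k += 1
            if a[i] != b[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3


def jaro_winkler(a, b, prefix_scale=0.1):
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def count_pids(authors):
    return sum(1 for author in authors if author.get("orcid"))


def enrich_authors(crossref_authors, orcid_authors, threshold=0.9):
    """Pick the base list and copy ORCID iDs from the other list into it."""
    if len(crossref_authors) == len(orcid_authors):
        crossref_first = count_pids(crossref_authors) >= count_pids(orcid_authors)
    else:
        crossref_first = len(crossref_authors) > len(orcid_authors)
    base, other = (
        (crossref_authors, orcid_authors) if crossref_first
        else (orcid_authors, crossref_authors)
    )
    candidates = [o for o in other if o.get("orcid")]
    enriched = []
    for author in base:
        author = dict(author)  # do not modify the input records
        if not author.get("orcid") and candidates:
            name = author["fullname"].lower()
            scored = [(jaro_winkler(name, o["fullname"].lower()), o) for o in candidates]
            best_score, best = max(scored, key=lambda pair: pair[0])
            if best_score >= threshold:
                author["orcid"] = best["orcid"]
        enriched.append(author)
    return enriched
```

Note that this sketch does not address the open question about an author that carries both an 'orcid_pending' from Crossref and a confirmed identifier from ORCID.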

TODO: How do we ensure that if an author comes with an orcid_pending from Crossref and an orcid from ORCID, the latter wins?

5 Intersect DOIBoost2 with Microsoft Academic Graph (DOIBoost3)

The records are enriched with:
  • abstracts
  • MAG identifiers of authors
  • affiliation relationships
  • subjects (MAG FieldsOfStudy)
  • conference or journal information (in the journal field) TODO: or container, in case of the dump?
  • [TO BE REMOVED] instances with URL from MAG

6 Enrich DOIBoost3 with hosting data sources (hostedby) and access right information

In this phase we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (issn, eissn, lissn) and, when available, a flag that tells if the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a journal.[l|e]issn that match are enriched as follows:
  • Each instance gains the hostedby information.
  • If the journal is open access, the access rights of the instances are also set to "Open Access" with "gold" route.

The hostedby of records that do not match are set to the "Unknown Repository".
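
A minimal sketch of this last phase is given below. It assumes a hypothetical in-memory lookup (journals_by_issn) that maps every known ISSN, EISSN or LISSN to a journal entry with a "name" and an optional "open_access" flag, and a simplified record layout with "journal" and "instances" keys. None of these names come from the actual code.

```python
UNKNOWN_REPOSITORY = "Unknown Repository"


def apply_hostedby(record, journals_by_issn):
    """Set hostedby on every instance and mark gold open access if it applies."""
    journal = record.get("journal") or {}
    issns = [journal.get(key) for key in ("issn", "eissn", "lissn")]
    # Take the first journal entry that matches any of the record's ISSNs.
    match = next(
        (journals_by_issn[issn] for issn in issns if issn in journals_by_issn),
        None,
    )
    for instance in record.get("instances", []):
        if match is None:
            instance["hostedby"] = UNKNOWN_REPOSITORY
            continue
        instance["hostedby"] = match["name"]
        if match.get("open_access"):
            instance["accessright"] = {"value": "Open Access", "route": "gold"}
    return record
```

Instances of a record whose journal is found but is not flagged as open access keep the access rights computed in the earlier phases.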

Updated by Alessia Bardi about 3 years ago · 22 revisions