⬇️ How a Reference Manager works, Part I: Importing.

The first step in building a Reference Manager is the Import phase, which takes user input and transforms it into a list of pre-references.

A Reference Manager that doesn’t allow importing of references is not a reference manager. Importing is arguably the most important task scientists perform in one, as it allows integration with other tools such as other reference managers, academic search engines and medical platforms.

Within PapersHive, the main goal of the Importer is to transform an input into a pre-reference: a potential reference that is not yet verified, but that has as many fields filled in as possible (title, abstract, year, authors, etc.).

Imagine you are using PubMed, Google Scholar or Web of Science and you just found a medical article that you would like to save. You are usually presented with several options, from a citation to the PDF itself.

📂 Pre-Step: Classify User Input

In order to transform the input into a pre-reference, we first need to classify it.

The input gets classified according to its file extension and content: Citation, ID, CSV or PDF.

PapersHive’s classifier considers the following categories:

  1. Citations in files with a Bib, EndNote, or RIS extension.
  2. Publication IDs such as a PubMed ID or a DOI.
  3. PDFs for full text.
  4. CSV files.
  5. User input.
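The classification step can be sketched in a few lines of Python. This is only an illustration, not PapersHive's actual classifier: the function name `classify_input`, the category labels and the ID pattern are all assumptions made for the example.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical extension table for citation files (bib, enw, ris).
CITATION_EXTENSIONS = {".bib", ".enw", ".ris"}

# Assumed, simplified shapes for a DOI, a PMC ID or a PMID.
ID_PATTERN = re.compile(r"^(10\.\d{4,9}/\S+|PMC\d+|\d{1,8})$", re.IGNORECASE)


def classify_input(filename: Optional[str], content: bytes) -> str:
    """Return 'citation', 'pdf', 'csv', 'id' or 'user_input'."""
    suffix = Path(filename).suffix.lower() if filename else ""
    if suffix in CITATION_EXTENSIONS:
        return "citation"
    # PDF files start with the bytes "%PDF-", so the content is checked too.
    if suffix == ".pdf" or content.startswith(b"%PDF-"):
        return "pdf"
    if suffix == ".csv":
        return "csv"
    text = content.decode("utf-8", errors="ignore").strip()
    if ID_PATTERN.match(text):
        return "id"
    # Anything else is treated as manually entered information.
    return "user_input"


print(classify_input("library.RIS", b""))         # citation
print(classify_input(None, b"10.1000/xyz123"))    # id
```

Checking the content as well as the extension matters because a renamed or extension-less file would otherwise be misrouted.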

⬇️ Importing

Importing Citations

Citation input means the user is submitting a file with one of the following extensions: bib, enw, ris. Once the file gets recognized, its content is submitted into a Citation-Parser component that outputs a pre-reference for each citation found. The fields of a pre-reference are the same fields considered in any academic or medical search engine: title, abstract, id, year, authors, etc.

Each imported file that is recognized as a Bib, EndNote or RIS citation gets parsed into a list of pre-references. Each pre-reference has all the typical academic fields.
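To make the parsing step concrete, here is a minimal sketch of a RIS parser. It is not the Citation-Parser component itself: the function name `parse_ris` and the pre-reference keys are assumptions, and only a handful of common RIS tags (TY, AU, TI, AB, PY, DO, ER) are handled.

```python
import re

# A RIS line looks like "TI  - Some title": a two-character tag,
# two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Parse RIS text into a list of pre-reference dicts (hypothetical keys)."""
    references, current = [], None
    for line in text.splitlines():
        match = RIS_LINE.match(line.rstrip())
        if not match:
            continue
        tag, value = match.group(1), match.group(2).strip()
        if tag == "TY":                      # start of a record
            current = {"type": value, "authors": []}
        elif current is None:                # ignore tags outside a record
            continue
        elif tag == "ER":                    # end of a record
            references.append(current)
            current = None
        elif tag in ("AU", "A1"):
            current["authors"].append(value)
        elif tag in ("TI", "T1"):
            current["title"] = value
        elif tag == "AB":
            current["abstract"] = value
        elif tag == "PY":
            current["year"] = value[:4]      # PY may look like "2020///"
        elif tag == "DO":
            current["doi"] = value
    return references


sample = """TY  - JOUR
AU  - Doe, Jane
AU  - Roe, Richard
TI  - An example article
PY  - 2020///
DO  - 10.1000/xyz123
ER  - 
"""
print(parse_ris(sample))
```

A production parser would also need to cover BibTeX and EndNote files and the many optional RIS tags; this sketch only shows the general shape of one citation in, one pre-reference out.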

Importing IDs

A publication ID is the easiest way to get a pre-reference. PapersHive uses an ID-Classifier to label the ID as corresponding to one of the following sources: DOI (doi.org and CrossRef), PMID (PubMed unique ID), PMC ID (PubMed Central unique ID), and pre-prints (bioRxiv, medRxiv, ChemRxiv, arXiv, etc.).

Once an ID is recognized as such, PapersHive checks different endpoints in order to retrieve the most accurate data for the pre-reference: PubMed, PubMed Central, doi.org, CrossRef, bioRxiv, medRxiv, ChemRxiv, arXiv, etc.

As in the previous step, the output is a valid pre-reference.
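An ID classifier of this kind can be approximated with a few regular expressions. The sketch below is an assumption about how such a component might look, not PapersHive's ID-Classifier: the labels, the function name `classify_id` and the patterns are illustrative, and only arXiv is covered among the pre-print servers.

```python
import re

# Strip a leading doi.org URL so that bare DOIs and DOI links are treated alike.
DOI_URL_PREFIX = re.compile(r"^https?://(dx\.)?doi\.org/", re.IGNORECASE)

# Ordered list of (label, pattern); the first match wins.
ID_PATTERNS = [
    ("arxiv", re.compile(r"^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$", re.IGNORECASE)),
    ("doi", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("pmcid", re.compile(r"^PMC\d+$", re.IGNORECASE)),
    ("pmid", re.compile(r"^\d{1,8}$")),
]


def classify_id(raw):
    """Return 'arxiv', 'doi', 'pmcid', 'pmid' or 'unknown'."""
    candidate = DOI_URL_PREFIX.sub("", raw.strip())
    for label, pattern in ID_PATTERNS:
        if pattern.match(candidate):
            return label
    return "unknown"


for value in ["10.1038/nature12373", "PMC3531190", "23193287", "arXiv:1706.03762"]:
    print(value, "->", classify_id(value))
```

Once the label is known, it can be used to decide which endpoint to query first; that lookup is left out here because it depends on network access and on each provider's API.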

Importing PDF

This is the most common case we have identified: one or multiple article PDFs are imported. We enforce a size limit of 50 MB and a maximum of 100 PDFs per import to avoid clogging the network bandwidth. These limits can be increased for Enterprise clients.

We have three server instances for PDF parsing deployed in production and ready to accept requests 24/7. Once a PDF is submitted, the PDF model transforms it into XML with labeled fields such as title, abstract, authors, etc.

Each PDF is converted into XML, and the XML is subsequently converted into a valid pre-reference JSON.

There is a margin of error: 5% of the articles fail to be mapped successfully.
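The XML-to-JSON step might look roughly like the sketch below. The XML layout (`<article>`, `<title>`, `<abstract>`, `<author>`) is a made-up stand-in, since the real schema produced by the PDF model is not described here, and the function name `xml_to_pre_reference` is hypothetical. Returning `None` represents an article that fails to be mapped.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_pre_reference(xml_text):
    """Map labeled XML to a pre-reference dict, or None if mapping fails.

    The tag names used here are assumptions for illustration only.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None                      # malformed XML counts as a failed mapping
    title = (root.findtext("title") or "").strip()
    if not title:
        return None                      # no title: treat the article as unmapped
    authors = [a.text.strip() for a in root.iter("author") if a.text and a.text.strip()]
    return {
        "title": title,
        "abstract": (root.findtext("abstract") or "").strip(),
        "authors": authors,
    }


sample_xml = (
    "<article><title>An example article</title>"
    "<abstract>A short abstract.</abstract>"
    "<authors><author>Jane Doe</author><author>Richard Roe</author></authors>"
    "</article>"
)
print(json.dumps(xml_to_pre_reference(sample_xml), indent=2))
```

Making the failure case explicit lets the importer report unmapped PDFs to the user instead of creating empty pre-references.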

Importing CSV

A CSV file has columns for fields and rows for articles. The file is parsed, transforming each row into a pre-reference.
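As a rough sketch, Python's standard `csv` module is enough to turn rows into pre-references. The column names (`title`, `year`, `authors`) and the semicolon used to separate authors are assumptions for this example; a real importer would have to map whatever headers the exporting tool uses.

```python
import csv
import io


def parse_csv(text):
    """Turn each CSV row into a pre-reference dict (assumed column names)."""
    reader = csv.DictReader(io.StringIO(text))
    pre_references = []
    for row in reader:
        title = (row.get("title") or "").strip()
        if not title:
            continue                     # skip rows without a title
        authors = [a.strip() for a in (row.get("authors") or "").split(";") if a.strip()]
        pre_references.append({
            "title": title,
            "year": (row.get("year") or "").strip(),
            "authors": authors,
        })
    return pre_references


sample_csv = (
    "title,year,authors\n"
    '"An example, with a comma",2020,"Doe, J.; Roe, R."\n'
    ",2018,Nobody\n"
)
print(parse_csv(sample_csv))
```

Using `csv.DictReader` rather than splitting on commas keeps quoted fields such as titles with commas intact.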

Importing User Input

Lastly, users can always enter the information manually, directly in the Reference Manager.

2️⃣ Removing Duplicates

At the end of the Import phase, the input is transformed into a set of pre-references. As a last step for this phase, the list of pre-references is de-duplicated.

Duplicates are removed before going into the second phase. This concludes the Import phase into pre-references.

This early deduplication avoids more complicated problems down the line.
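One simple way to de-duplicate pre-references is to build a key per record and keep the first record for each key. The sketch below assumes a particular strategy that is not stated in this article: match on DOI when one is present, otherwise on a normalized title plus year. The names `dedup_key` and `deduplicate` are hypothetical.

```python
import re


def dedup_key(ref):
    """Build a comparison key: the DOI if available, else normalized title and year."""
    doi = (ref.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    # Lowercase the title and collapse punctuation and whitespace.
    title = re.sub(r"[^a-z0-9]+", " ", (ref.get("title") or "").lower()).strip()
    return ("title", title, str(ref.get("year") or ""))


def deduplicate(pre_references):
    """Keep the first pre-reference seen for each key, preserving input order."""
    seen = set()
    unique = []
    for ref in pre_references:
        key = dedup_key(ref)
        if key in seen:
            continue
        seen.add(key)
        unique.append(ref)
    return unique


refs = [
    {"title": "An Example Article", "year": 2020, "doi": "10.1000/XYZ123"},
    {"title": "An example article.", "year": 2020, "doi": "10.1000/xyz123"},
    {"title": "Another Paper", "year": 2019},
    {"title": "another   paper!", "year": 2019},
]
print(len(deduplicate(refs)))
```

This exact-key approach will not match a record that has a DOI against a copy of the same article that lacks one; handling that case needs fuzzier matching than this sketch attempts.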

🚀 Conclusions

The first part of PapersHive’s Reference Manager is focused on transforming an input into a list of unique pre-references as in the following image:

The full Importing pipeline for PapersHive Reference Manager.


The next part focuses on how to map the pre-references into actual references. You can read about it in How a Reference Manager works, Part II: References (coming soon!).

Do you have hundreds, if not thousands, of references scattered across your workflow? We know the pain. Set up a free trial and test the smoothness of a Reference Manager that removes duplicates, automatically attaches PDFs, synchronizes with the whole team, and searches seamlessly across the full text of all your references.
