⬇️ How a Reference Manager works, Part I: Importing.
The first part of building a Reference Manager is the Import. This phase takes user input and transforms it into a list of pre-references.
A Reference Manager that doesn’t allow importing references is not a reference manager. Importing is arguably the most important task scientists perform in one, as it allows integration with other tools such as other reference managers, academic search engines and medical platforms.
Within PapersHive, the main goal of the Importer is to transform an input into a pre-reference: a potential reference that is not yet verified, but that has as many fields as possible (title, abstract, year, authors, etc).
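To make the idea of a pre-reference concrete, here is a minimal sketch of how one could be modeled. The class name, the `identifier` and `verified` fields, and the `filled_fields` helper are illustrative assumptions, not PapersHive's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PreReference:
    """A potential reference that has not been verified yet (hypothetical model)."""
    title: Optional[str] = None
    abstract: Optional[str] = None
    year: Optional[int] = None
    authors: List[str] = field(default_factory=list)
    identifier: Optional[str] = None  # assumed field: a DOI, PMID, etc. when known
    verified: bool = False            # assumed flag: stays False during the Import phase

    def filled_fields(self) -> List[str]:
        """Return the names of the bibliographic fields that carry a value."""
        names = ["title", "abstract", "year", "authors", "identifier"]
        return [name for name in names if getattr(self, name)]


pre_ref = PreReference(title="A study", year=2020)
print(pre_ref.filled_fields())
```

Every field is optional because an importer rarely recovers all of them at once; the goal is simply to fill in as many as possible.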
Imagine you are using PubMed, Google Scholar or Web of Science and you just found a medical article that you would like to save. You are usually presented with several options, from a citation to the PDF itself.
📂 Pre-Step: Classify User Input
In order to transform the input into a pre-reference, we first need to classify it.
PapersHive’s classifier considers the following kinds of input:
- Citations in files that have a Bib, Endnote, or RIS extension.
- Publication IDs such as a PubMed ID or a DOI.
- PDFs for full-text.
- CSV files.
- User input.
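The classification step above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the function name, the returned labels, and the single loose ID pattern are hypothetical, and a production classifier would likely inspect file contents rather than trust names.

```python
import re
from pathlib import Path

CITATION_EXTENSIONS = {".bib", ".enw", ".ris"}

# Assumed loose pattern: a DOI, a PMC ID, or a bare numeric PubMed ID.
PUBLICATION_ID = re.compile(r"^(10\.\d{4,9}/\S+|PMC\d+|\d{1,8})$", re.IGNORECASE)


def classify_input(value: str) -> str:
    """Label an input as citation, publication_id, pdf, csv, or user_input."""
    text = value.strip()
    if PUBLICATION_ID.match(text):
        return "publication_id"
    suffix = Path(text).suffix.lower()
    if suffix in CITATION_EXTENSIONS:
        return "citation"
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".csv":
        return "csv"
    # Anything else is treated as information typed in by the user.
    return "user_input"


for example in ["library.ris", "10.1000/xyz123", "paper.pdf", "A title typed by hand"]:
    print(example, "->", classify_input(example))
```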
Citation input means the user is submitting a file with one of the following extensions: bib, enw, ris. Once the file is recognized, its content is passed to a Citation-Parser component that outputs a pre-reference for each citation found. The fields of a pre-reference are the same fields found in any academic or medical search engine: title, abstract, id, year, authors, etc.
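As an illustration of what a citation parser does, here is a minimal sketch for RIS files only. It is not PapersHive's Citation-Parser: it handles a handful of common tags (`TY`, `TI`/`T1`, `AU`/`A1`, `AB`, `PY`/`Y1`, `DO`, `ER`), ignores continuation lines, and the output dictionary keys are assumptions.

```python
import re
from typing import Dict, List, Optional

# An RIS line is a two-character tag, two spaces, a hyphen, then the value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text: str) -> List[Dict[str, object]]:
    """Return one pre-reference dict per RIS record (TY ... ER)."""
    records: List[Dict[str, object]] = []
    current: Optional[Dict[str, object]] = None
    for raw in text.splitlines():
        match = TAG_LINE.match(raw.rstrip())
        if not match:
            continue  # blank and continuation lines are skipped in this sketch
        tag, value = match.group(1), match.group(2).strip()
        if tag == "TY":  # start of a new record
            current = {"title": None, "abstract": None, "year": None,
                       "authors": [], "id": None}
        elif current is None:
            continue  # tag found outside of any record
        elif tag == "ER":  # end of the record
            records.append(current)
            current = None
        elif tag in ("TI", "T1"):
            current["title"] = value
        elif tag in ("AU", "A1"):
            current["authors"].append(value)
        elif tag == "AB":
            current["abstract"] = value
        elif tag in ("PY", "Y1"):
            year = re.match(r"\d{4}", value)  # values may look like 2019/05/01/
            current["year"] = int(year.group()) if year else None
        elif tag == "DO":
            current["id"] = value
    return records
```

BibTeX and EndNote files would each need their own parser producing the same output shape.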
A publication ID is the easiest way to get a pre-reference. PapersHive uses an ID-Classifier to label the ID as belonging to one of the following sources: DOI (doi.org and CrossRef), PMID (PubMed unique ID), PMC ID (PubMed Central unique ID), or pre-prints (bioRxiv, medRxiv, ChemRxiv, arXiv, etc).
As in the previous step, the output is a valid pre-reference.
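An ID classifier of this kind can be approximated with a few regular expressions. The sketch below is an assumption about how such labeling might work, not the actual ID-Classifier: the patterns are deliberately loose, and treating certain DOI prefixes as pre-prints is a heuristic (for example, `10.1101` is also used for non-preprint content).

```python
import re
from typing import Optional

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMCID_RE = re.compile(r"^PMC\d+$", re.IGNORECASE)
PMID_RE = re.compile(r"^\d{1,8}$")
# New-style arXiv identifiers such as 2101.00001 or arXiv:2101.00001v2.
ARXIV_RE = re.compile(r"^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$", re.IGNORECASE)
# Heuristic: DOI prefixes commonly seen on bioRxiv/medRxiv, ChemRxiv and arXiv.
PREPRINT_DOI_PREFIXES = ("10.1101/", "10.26434/", "10.48550/")


def classify_id(raw: str) -> Optional[str]:
    """Return 'doi', 'pmid', 'pmcid', 'preprint', or None if unrecognized."""
    value = raw.strip()
    # Accept DOIs pasted as full resolver URLs.
    value = re.sub(r"^https?://(dx\.)?doi\.org/", "", value, flags=re.IGNORECASE)
    if ARXIV_RE.match(value):
        return "preprint"
    if DOI_RE.match(value):
        return "preprint" if value.startswith(PREPRINT_DOI_PREFIXES) else "doi"
    if PMCID_RE.match(value):
        return "pmcid"
    if PMID_RE.match(value):
        return "pmid"
    return None
```

Once the label is known, the matching service can be queried for the metadata that fills in the pre-reference.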
Importing PDFs is the most common case we have identified: one or multiple article PDFs are imported at once. We apply a size limit of 50 MB and a maximum of 100 PDFs per import to avoid hogging network bandwidth. The limits can be increased for Enterprise clients.
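A check for these limits could look like the sketch below. It assumes the 50 MB limit applies to each file and that 1 MB is 1024 × 1024 bytes; the function and constant names are hypothetical.

```python
from typing import List, Tuple

MAX_PDF_BYTES = 50 * 1024 * 1024   # assumed: per-file limit of 50 MB
MAX_PDFS_PER_IMPORT = 100


def check_pdf_batch(files: List[Tuple[str, int]]) -> List[str]:
    """Take (file name, size in bytes) pairs and return a list of problems.

    An empty list means the batch is within both limits.
    """
    problems = []
    if len(files) > MAX_PDFS_PER_IMPORT:
        problems.append(f"too many files: {len(files)} > {MAX_PDFS_PER_IMPORT}")
    for name, size in files:
        if size > MAX_PDF_BYTES:
            problems.append(f"{name} exceeds the size limit")
    return problems


print(check_pdf_batch([("small.pdf", 2_000_000), ("big.pdf", 60 * 1024 * 1024)]))
```

Running the check before upload keeps oversized batches from ever reaching the network.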
We have three server instances for PDF parsing deployed in production and ready to accept requests 24/7. Once a PDF is submitted, the PDF model transforms it into XML with labeled fields such as title, abstract, authors, etc.
There is a margin of error: 5% of the articles fail to be mapped successfully.
A CSV file has columns for fields and rows for articles. The file is parsed row by row, transforming each article row into a pre-reference.
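The row-to-pre-reference mapping can be sketched with Python's standard `csv` module. The column names (`title`, `abstract`, `year`, `authors`) and the semicolon-separated author list are assumptions made for this example; real exports vary and would need a column-mapping step.

```python
import csv
import io
from typing import Dict, List


def parse_csv(text: str) -> List[Dict[str, object]]:
    """Turn each data row of a CSV export into a pre-reference dict."""
    reader = csv.DictReader(io.StringIO(text))  # first row is the header
    pre_references = []
    for row in reader:
        year = (row.get("year") or "").strip()
        authors = row.get("authors") or ""
        pre_references.append({
            "title": (row.get("title") or "").strip() or None,
            "abstract": (row.get("abstract") or "").strip() or None,
            "year": int(year) if year.isdigit() else None,
            # Assumed convention: authors separated by semicolons.
            "authors": [a.strip() for a in authors.split(";") if a.strip()],
        })
    return pre_references
```

Missing or empty cells become `None` (or an empty author list) so that later phases can tell which fields still need to be filled in.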
Importing User Input
Finally, users can always enter the information manually, directly in the Reference Manager.
2️⃣ Removing Duplicates
At the end of the Import phase, the input is transformed into a set of pre-references. As a last step for this phase, the list of pre-references is de-duplicated.
Deduplicating this early avoids more complicated problems down the line.
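One simple way to de-duplicate a list of pre-references is to derive a key from each one and keep only the first occurrence. The matching rule below (same ID, or otherwise same normalized title and year) is an assumption for illustration; the actual criteria PapersHive uses are not described here.

```python
import re
from typing import Dict, List, Tuple


def dedup_key(ref: Dict[str, object]) -> Tuple:
    """Build a comparison key: the ID if present, else normalized title + year."""
    identifier = str(ref.get("id") or "").strip().lower()
    if identifier:
        return ("id", identifier)
    # Lowercase the title and collapse punctuation and whitespace.
    title = re.sub(r"[^a-z0-9]+", " ", str(ref.get("title") or "").lower()).strip()
    return ("title", title, ref.get("year"))


def deduplicate(refs: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep the first pre-reference seen for each key, preserving order."""
    seen = set()
    unique = []
    for ref in refs:
        key = dedup_key(ref)
        if key in seen:
            continue
        seen.add(key)
        unique.append(ref)
    return unique
```

Note that this sketch does not match a pre-reference that has an ID against a copy of the same article without one; a fuller approach would compare on several fields.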
The first part of PapersHive’s Reference Manager is focused on transforming an input into a list of unique pre-references as in the following image:
The next part focuses on how to map the pre-references into actual references and you can read about it in How a Reference Manager works, Part II: References (coming soon!).
Do you have hundreds, if not thousands, of references scattered across your workflow? We know the pain. Set up a free trial and test the smoothness of a Reference Manager that removes duplicates, automatically attaches PDFs, synchronizes with the whole team, and searches seamlessly across all your references with full-text search.