Streamline Literature Reviews with Paper QA and Zotero

Update: I have created a Hugging Face Space: https://huggingface.co/spaces/lifan0127/zotero-qa. For more details please see https://apex974.com/articles/zotero-qa-hugging-face-space.

When I first experimented with ChatGPT on scientific research questions, I was impressed by its seemingly coherent answers, each backed by a list of references. However, my jaw dropped when I realized that virtually all the references were fabricated. Over time, I came to understand the factuality problem of large language models (LLMs) and the various tactics that use chains and indexes for information retrieval and response synthesis in such scenarios.

Then I was delighted to learn that Prof. Andrew White, a scholar and software developer, had created a tool, paper-qa, to address this unmet need. This Python package uses ChatGPT (or other LLMs) to index a group of scholarly articles, and then matches a question with the most relevant document chunks for response synthesis. This is essentially how indexes work in both LangChain and LlamaIndex. But Paper QA also keeps track of citation metadata to produce a list of “real” references as part of the response, making it ideal for literature review purposes.
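To make the matching step concrete, here is a minimal sketch of retrieval over document chunks that carry their citation along. It is not how paper-qa is implemented: the embed function is a toy bag-of-words stand-in for a real embedding model, and the names embed, cosine and top_chunks are made up for this illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question, chunks, k=2):
    """Return the k (citation, text) chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query_vector, embed(chunk[1])),
        reverse=True,
    )
    return ranked[:k]
```

Because every chunk keeps its citation, whatever text is handed to the LLM as context can be traced back to a real paper, which is the property that makes the references trustworthy.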

Like many other researchers, I use Zotero for reference management, so I wanted to integrate paper-qa with my Zotero library. This would let me use ChatGPT for literature reviews without relying on additional tools. The process is shown in the following diagram.

Process to integrate Zotero with Paper QA

Thanks to the Zotero Web API and its Python binding (pyZotero), it was straightforward to interact with my Zotero library and implement a prototype in combination with paper-qa.
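As a rough idea of what the Zotero side involves, the sketch below connects to a personal library with pyZotero and lists the journal articles in one collection. The environment variable names and the two helper functions are assumptions made for this illustration, not part of the original script.

```python
import os


def make_zotero_client():
    """Build a pyZotero client for a personal library.

    The environment variable names are assumptions made for this sketch.
    """
    from pyzotero import zotero  # third-party: pip install pyzotero

    return zotero.Zotero(
        os.environ["ZOTERO_USER_ID"],  # numeric user ID for API calls
        "user",  # use "group" for a group library
        os.environ["ZOTERO_API_KEY"],
    )


def list_collection_titles(zot, collection_id):
    """Return (item key, title) pairs for the journal articles in a collection."""
    items = zot.collection_items(collection_id, itemType="journalArticle")
    return [(item["key"], item["data"].get("title", "")) for item in items]
```

Calling list_collection_titles(make_zotero_client(), collection_id) with the key of your own collection should list its articles. The API returns results in pages, so a larger collection would need a limit parameter or pyZotero's everything() wrapper.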

Below is my sample Zotero library collection for this prototype, containing 10 open-access articles retrieved from RSC Digital Discovery:

For demo purposes, let's go with a sample question: "What predictive models are used in materials discovery?" To convert the question into one or more search queries, I used the paper-qa built-in generate_search_query function, which helped generate 3 sets of keywords:

Materials discovery models
Predictive materials modeling
Machine learning materials discovery

The next step was to search my Zotero collection with these keywords. Since my question and the associated keywords were quite broad, the search returned 9 articles in total, and I was quite pleased with the comprehensiveness of the coverage. It's worth noting that searching against the Zotero full-text index yields the best results, so make sure to index all the PDF attachments in your Zotero library.
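The keyword search can be sketched roughly as follows. It runs each query against the collection with qmode="everything", which asks the Zotero API to search full-text content as well as metadata, and then removes duplicates across queries. The function name search_collection is hypothetical, and mapping a hit back to its parent article through the parentItem field is my assumption about how to treat hits that are PDF attachments.

```python
def search_collection(zot, collection_id, queries, limit=100):
    """Run several keyword searches and return unique article keys.

    `zot` is a pyZotero client (or any object with a compatible
    collection_items method). Keys are returned in first-seen order.
    """
    matched = []
    seen = set()
    for query in queries:
        hits = zot.collection_items(
            collection_id,
            q=query,
            qmode="everything",  # search full text, not only title/creator/year
            limit=limit,
        )
        for item in hits:
            # A hit may be a child attachment; fall back to the item's own key
            # when it has no parent.
            key = item["data"].get("parentItem") or item["key"]
            if key not in seen:
                seen.add(key)
                matched.append(key)
    return matched
```

Deduplicating on the article key matters here because broad, overlapping keyword sets tend to return the same paper several times.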

Afterwards, I exported the matched PDF files and their associated bibliographic information from Zotero and, with minor modifications, fed them into paper-qa. From there, paper-qa extracted the text from each PDF file, divided it into smaller document chunks (1-2 pages each) and created an embedding-based index using OpenAI embeddings. Once all the document chunks were indexed, the most relevant chunks from the top 5 articles were identified and used as the contexts to answer the original question. It then generated the following response.
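The export and chunking steps can be illustrated with a small sketch. It turns the data dictionary of a Zotero item into a short citation string and groups extracted page texts into chunks of up to two pages, each tagged with that citation. This is a simplified illustration and not paper-qa's own code: the helper names and the citation format are assumptions, and PDF text extraction is left out.

```python
import re


def format_citation(data):
    """Build a short citation string from a Zotero item's data dictionary."""
    names = [c.get("lastName") or c.get("name", "") for c in data.get("creators", [])]
    names = [name for name in names if name]
    if len(names) > 2:
        authors = f"{names[0]} et al."
    else:
        authors = " and ".join(names) or "Unknown author"
    # Zotero dates are free-form strings, so only the first 4-digit year is used.
    year_match = re.search(r"\d{4}", data.get("date", ""))
    year = year_match.group(0) if year_match else "n.d."
    parts = [f"{authors} ({year})", data.get("title", "Untitled")]
    if data.get("publicationTitle"):
        parts.append(data["publicationTitle"])
    return ". ".join(parts) + "."


def chunk_pages(pages, citation, pages_per_chunk=2):
    """Group page texts into chunks that remember their source and page range."""
    chunks = []
    for start in range(0, len(pages), pages_per_chunk):
        group = pages[start:start + pages_per_chunk]
        first, last = start + 1, start + len(group)
        label = f"p. {first}" if first == last else f"pp. {first}-{last}"
        chunks.append({"citation": citation, "pages": label, "text": "\n".join(group)})
    return chunks
```

Keeping the citation and page range on every chunk is what lets the final answer point to specific passages in real papers.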

To summarize, paper-qa makes use of LLMs such as ChatGPT to perform Q&A and summarization based on scholarly articles. By combining it with pyZotero, we can streamline literature review based on one's Zotero library. For more details, please check out the end-to-end Python script in this GitHub Gist: https://gist.github.com/lifan0127/e34bb0cfbf7f03dc6852fd3e80b8fb19