It’s more important than ever to come together, as companies, non-profits, governments, scientists, and clinicians, to bring our best information and technologies to bear on challenges with COVID-19.

Today, we announced a collaboration with colleagues to create the COVID-19 Open Research Dataset (CORD-19), a collection of scientific articles about the coronavirus family of viruses, for use by the worldwide research community. CORD-19 contains over 29,000 scholarly articles on COVID-19 and the coronavirus family more broadly, with full text available for over 13,000 of the articles.

The motivation behind the CORD-19 effort is to make research and discovery more efficient and to accelerate progress toward solutions to the pandemic. The machine-readable dataset was constructed with colleagues at the National Library of Medicine (NLM), the Allen Institute for AI, Georgetown University, the Chan Zuckerberg Initiative, Kaggle, and the White House Office of Science and Technology Policy (OSTP). Microsoft contributed the indexing and mapping of thousands of articles worldwide. We’ll continue to update the index to provide the global research community with a unified, continually updated resource that brings together all that we know about COVID-19.

A key aspect of aggregating scientific literature into a valuable unified data resource is gaining access to the full content of articles, including permissions to analyze the content with computational tools. Many medical articles are tucked behind paywalls. Even when text is made available, publishers may not grant researchers the rights to perform machine analysis and data mining. Much has been going on behind the scenes to open up the literature on the coronavirus family and on COVID-19 to create this kind of machine-readable resource.

It’s my hope that the machine-readable content will stimulate advances in computing methods that can help investigators develop a deeper understanding of the COVID-19 pandemic and better approaches to addressing it. Developing tools to help scientists do research and synthesize new understandings has been a long-term aspiration in AI. Work has been underway for years on methods that can answer questions, analyze and summarize the content of numerous scientific papers, assess the credibility of clinical trials, generate and test hypotheses, and guide experimentation. As an example of prior work on machine reading in biomedicine, research scientists at Microsoft have explored the use of natural language analysis and machine learning to analyze thousands of biomedical papers, construct a representation of cellular regulatory networks, and then leverage that representation to generate recommendations for cancer therapies.

It has been gratifying to see the fabulous cross-organizational teamwork that led to the creation of CORD-19. I’m excited to see what multiple communities of passionate and creative investigators will do with the resource.