Research Identifiers
Keywords
Funding (4)
Using the literature to build causal models of retrospective observational data ✓ NIH
United States National Library of Medicine (Bethesda, Maryland, US)
Homepage URL: https://reporter.nih.gov/search/agkD0LqiWk-OD1wThSJt3Q/project-details/10879451
GRANT_NUMBER: R00LM013367
📄 Project Abstract (from NIH)
records (EHRs), allow for the identification of links between health events, such as drug exposures and side-effects. Some of these links indicate stable dependencies deemed to be causes. Causal insight allows reverse-engineering of disease. If confounding is not addressed, it will be difficult to distinguish causative from correlative links. Our approach is to identify confounders explicitly. Graphical causal models (GCMs) can discover causal links from data and prior knowledge, and they summarize the causal links between variables. Automated selection of variables would allow GCMs to scale and yield more insight from data. Literature-based discovery (LBD) methods were developed to identify links between concepts in the literature. Advanced methods permit the search for concepts linked to each other through specific verbs, e.g., “causes”, “treats”. Our hypothesis is that we can exploit structured knowledge extracted from the literature to inform GCMs. In prior work, we found that LBD + GCM was better at identifying side-effects in EHR data than traditional methods. We hypothesize that, compared to methods that use data alone, our method will increase the ability to detect causal relationships from EHR data.

The first aim is to determine the extent to which LBD-informed GCM improves the identification of causal links for drug safety. We will build LBD-informed GCMs using publicly available reference datasets for drug safety. These reference datasets contain drug/side-effect pairs for performance benchmarking. (A) Test the ability of GCM algorithms to identify known causal links using data alone. We will systematically evaluate GCM algorithms based on their ability to re-discover causal links in a reference standard. Results will guide our studies on how GCM can be tuned. (B) Determine the effect of adding different subsets of LBD-derived information to GCMs on the identification of drug side-effects. We will build causal models using increasing numbers of confounders.

The second aim is to test the ability of LBD built with disease-specific literature to improve the relevance of LBD-derived confounders for Alzheimer's Disease (AD). We chose AD for its high prevalence and relative lack of effective pharmacologic treatment. (A) Compare LBD strategies in a disease-specific setting. We will test LBD variants that use disease-specific literature against LBD without subject-matter restrictions. (B) Define the ability of robust LBD-informed GCM to validate drug repurposing candidates for treating AD symptoms. We will test the ability of advanced methods to iteratively resolve latent confounding, when detected, to improve effect estimates.

The fulfillment of these aims will yield new methods to combine insights from the literature with causal modeling to uncover causal relationships between drug exposures and both adverse events and beneficial outcomes.
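The idea of searching the literature for concepts linked through specific verbs, and using the hits to nominate confounders, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the triples are toy placeholders rather than real extracted knowledge, the function name `candidate_confounders` is hypothetical, and "linked by the same predicate to both the exposure and the outcome" is only one simple selection rule, not necessarily the one used in the funded work.

```python
from collections import defaultdict

# Toy subject-predicate-object triples in the style of literature-derived
# predications. Placeholder names only; not real extracted knowledge.
PREDICATIONS = [
    ("concept_c", "CAUSES", "exposure_x"),
    ("concept_c", "CAUSES", "outcome_y"),
    ("concept_d", "CAUSES", "outcome_y"),
    ("concept_e", "CAUSES", "exposure_x"),
    ("concept_e", "TREATS", "outcome_y"),
    ("exposure_x", "CAUSES", "outcome_y"),
]


def candidate_confounders(predications, exposure, outcome, predicate="CAUSES"):
    """Return concepts linked by `predicate` to both exposure and outcome."""
    targets = defaultdict(set)
    for subj, pred, obj in predications:
        if pred == predicate:
            targets[subj].add(obj)
    return sorted(
        concept
        for concept, objs in targets.items()
        if exposure in objs and outcome in objs
        and concept not in (exposure, outcome)
    )


print(candidate_confounders(PREDICATIONS, "exposure_x", "outcome_y"))
# ['concept_c']
```

Only `concept_c` is returned: `concept_d` is linked to the outcome alone, and `concept_e` reaches the outcome through a different verb ("TREATS"), which is why restricting the search to a specific predicate matters.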
📅 Project Dates (from NIH)
End: 2026-07-31
Using the literature to build causal models of retrospective observational data ✓ NIH
United States National Library of Medicine (Bethesda, US)
Homepage URL: https://app.dimensions.ai/details/grant/grant.9844339
GRANT_NUMBER: K99LM013367
📅 Project Dates (from NIH)
End: 2023-07-31
Using Biomedical Knowledge to Identify Plausible Signals for Pharmacovigilance
United States National Library of Medicine (n/a, US)
GRANT_NUMBER: grant.R01LM011563
NLM Training Program in Biomedical Informatics & Data Science for Predoctoral and Postdoctoral Fellows
United States National Library of Medicine (n/a, US)
GRANT_NUMBER: grant.T15LM007093
Education and qualifications (4)
University of Pittsburgh: Pittsburgh, PA, US
University of Texas Health Science Center at Houston: Houston, Texas, US
Carnegie Mellon University: Pittsburgh, PA, US
University of Pittsburgh: Pittsburgh, PA, US
CausalKnowledgeTrace: A Novel Computational Framework for Automated Literature-Based Causal Graph Construction and Evidence-Based Variable Selection in Biomedical Research
Abstract
Background
Variable selection for causal inference from observational biomedical data is challenging, as overlooking confounders or conditioning on colliders leads to biased estimates. While vast causal knowledge exists in biomedical literature, manually extracting this information for principled variable selection is impractical at scale.
Methods
We developed CausalKnowledgeTrace, a Python-based computational framework with a Django web interface that systematically leverages structured causal knowledge from the Semantic MEDLINE Database (SemMedDB) to inform variable selection in causal studies. The system implements a six-stage analysis pipeline that uses NetworkX for graph operations: graph parsing, basic analysis, comprehensive cycle detection, systematic generic node removal, post-removal analysis, and formal causal inference with bias detection.
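Two of the pipeline stages named above, cycle detection and generic node removal, can be sketched as follows. This is a dependency-free stand-in, not the framework's code: the abstract says the real system uses NetworkX (where `networkx.simple_cycles` and `DiGraph.remove_nodes_from` would play similar roles), while the helper names `build_graph`, `find_cycle`, and `remove_nodes`, the toy edges, and the choice of "disease" as a generic node are all assumptions made for illustration.

```python
def build_graph(edges):
    """Build a node -> set-of-children mapping from (cause, effect) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    return graph


def find_cycle(graph):
    """Return one directed cycle as a list of nodes, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current DFS path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for child in graph[node]:
            if color[child] == GRAY:  # back edge closes a cycle
                return path[path.index(child):]
            if color[child] == WHITE:
                found = visit(child)
                if found:
                    return found
        color[node] = BLACK
        path.pop()
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def remove_nodes(graph, to_remove):
    """Return a copy of the graph without the given nodes or their edges."""
    return {
        node: {child for child in children if child not in to_remove}
        for node, children in graph.items()
        if node not in to_remove
    }


# Toy edges: the overly generic concept "disease" closes a causal loop.
EDGES = [
    ("exposure", "outcome"),
    ("outcome", "disease"),
    ("disease", "exposure"),
    ("confounder", "exposure"),
    ("confounder", "outcome"),
]
graph = build_graph(EDGES)
pruned = remove_nodes(graph, {"disease"})
print(find_cycle(graph))   # a cycle through exposure, outcome, disease
print(find_cycle(pruned))  # None
```

In this toy case, pruning the generic node is what turns a cyclic graph into an acyclic one, which is the precondition for the later structural analysis.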
Results
Analysis of the hypertension-Alzheimer’s relationship across three degree neighborhoods (degrees 1-3) showed causal complexity scaling systematically with degree: 361-866 variables and 429-1,442 relationships, with graph densities from 0.0033 to 0.0019. The graphs contained complex cyclic structures, with 54-606 baseline cycles across degree levels. Processing times ranged from 0.3 to 1.0 seconds across all three degrees, demonstrating computational efficiency for complex biomedical networks. Key confounders identified across all degrees included inflammation, diabetes, insulin resistance, obesity, and ischemia. In the third-degree graph, the pipeline structurally identified 39 confounders, 11 mediators, and 3 colliders. The key identified confounders and mediators, including obesity, oxidative stress, ischemia, and vascular diseases, all had strong supporting evidence in the established epidemiological and pathophysiological literature.
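The reported densities and the structural roles can be made concrete with a small sketch. It rests on stated assumptions: density is taken to be the directed-graph density m / (n(n-1)), the smallest variable and relationship counts are assumed to belong to the first-degree graph and the largest to the third-degree graph, and the role definitions below (a confounder reaches the outcome without passing through the exposure, a mediator lies on a directed exposure-to-outcome path, a collider is a common effect of both) are common textbook criteria applied to a toy acyclic graph, not necessarily the framework's exact rules.

```python
def density(n_nodes, n_edges):
    """Directed-graph density: edges divided by ordered node pairs."""
    return n_edges / (n_nodes * (n_nodes - 1))


def descendants(graph, start, blocked=None):
    """Nodes reachable from `start` without entering the `blocked` node."""
    seen, todo = set(), [start]
    while todo:
        for child in graph.get(todo.pop(), ()):
            if child != blocked and child not in seen:
                seen.add(child)
                todo.append(child)
    return seen


def classify_roles(graph, exposure, outcome):
    """Label nodes of an acyclic graph as confounder, mediator, or collider."""
    below_x = descendants(graph, exposure)
    below_y = descendants(graph, outcome)
    roles = {}
    for node in graph:
        if node in (exposure, outcome):
            continue
        reach = descendants(graph, node)
        if exposure in reach and outcome in descendants(graph, node, blocked=exposure):
            roles[node] = "confounder"  # causes both, not only via exposure
        elif node in below_x and outcome in reach:
            roles[node] = "mediator"    # on a path exposure -> ... -> outcome
        elif node in below_x and node in below_y:
            roles[node] = "collider"    # common effect of both
    return roles


# Toy graph (node -> children); "z" affects the outcome only through "x".
TOY = {
    "c": {"x", "y"},
    "x": {"m", "k"},
    "m": {"y"},
    "y": {"k"},
    "k": set(),
    "z": {"x"},
}

print(round(density(361, 429), 4))   # 0.0033
print(round(density(866, 1442), 4))  # 0.0019
print(classify_roles(TOY, "x", "y"))
# {'c': 'confounder', 'm': 'mediator', 'k': 'collider'}
```

Under these assumptions the formula reproduces both reported densities, and the node `z` receives no role because its only route to the outcome runs through the exposure.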
Conclusions
CausalKnowledgeTrace provides a scalable, evidence-based approach to causal graph construction that systematically identifies confounders and bias structures often missed by conventional approaches. The Python-Django architecture enables both standalone analysis and integration into larger computational workflows, representing a significant advance in computational support for causal inference in biomedical research.
Statement of Significance
Problem or Issue
Selecting proper confounders and variables for causal inference from observational biomedical datasets is challenging and often biased by limited expertise or manual review.
What is Already Known
Existing approaches rely on domain experts, statistical variable screening, or manual construction of causal graphs, but these often overlook literature-documented confounders and complex biases.
What this Paper Adds
This paper introduces an automated, literature-based framework for synthesizing and validating causal graphs, identifying critical variables and complex bias structures, such as M-bias and butterfly bias, with full evidentiary traceability.
Who would benefit from the new knowledge in this paper?
Epidemiologists, biomedical researchers, informaticians, and clinical investigators seeking reliable and transparent causal modeling for observational studies.
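The M-bias and butterfly bias structures mentioned in this statement can be recognized from graph shape alone. The sketch below is illustrative and uses the simplest direct-edge form of each pattern: M-bias as exposure <- u1 -> m <- u2 -> outcome, and butterfly bias as the same pattern in which m additionally causes both the exposure and the outcome. The function names and the toy graphs are hypothetical, and the paper's own detection procedure is not described in the abstract.

```python
def parents(graph, node):
    """Direct causes of `node` in a node -> children mapping."""
    return {p for p, children in graph.items() if node in children}


def m_structures(graph, exposure, outcome):
    """Find (u1, m, u2) with exposure <- u1 -> m <- u2 -> outcome."""
    found = []
    for m in graph:
        if m in (exposure, outcome):
            continue
        causes = parents(graph, m) - {exposure, outcome}
        for u1 in causes:
            for u2 in causes:
                if u1 != u2 and exposure in graph[u1] and outcome in graph[u2]:
                    found.append((u1, m, u2))
    return sorted(found)


def bias_type(graph, exposure, outcome, m):
    """Label the collider of an M-structure by whether it also confounds."""
    if exposure in graph[m] and outcome in graph[m]:
        return "butterfly bias"
    return "M-bias"


# Toy graphs (node -> children); placeholder variable names.
M_GRAPH = {
    "u1": {"x", "m"},
    "u2": {"m", "y"},
    "m": set(),
    "x": {"y"},
    "y": set(),
}
BUTTERFLY = {**M_GRAPH, "m": {"x", "y"}}

print(m_structures(M_GRAPH, "x", "y"))         # [('u1', 'm', 'u2')]
print(bias_type(M_GRAPH, "x", "y", "m"))       # M-bias
print(bias_type(BUTTERFLY, "x", "y", "m"))     # butterfly bias
```

The distinction matters for variable selection: in the plain M-structure, adjusting for `m` opens a non-causal path, whereas in the butterfly case `m` is also a confounder, so both adjusting and not adjusting carry a bias risk.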
