Research Identifiers

Keywords

Literature-based discovery; causal modeling; causal feature selection

Funding (4)

Using the literature to build causal models of retrospective observational data ✓ NIH

2023-08 to 2026-07 | Award

United States National Library of Medicine (Bethesda, Maryland, US)

Homepage URL: https://reporter.nih.gov/search/agkD0LqiWk-OD1wThSJt3Q/project-details/10879451

GRANT_NUMBER: R00LM013367

Organization identifiers

United States National Library of Medicine: Bethesda, Maryland, US

📄 Project Abstract (from NIH)

Health data contain a wealth of information for research. Health data, such as those found in electronic health
records (EHRs), allow for the identification of links between health events, such as drug exposures and
side-effects. Some of these links indicate stable dependencies deemed as causes. Causal insight allows
reverse-engineering disease. If confounding is not addressed, it will be difficult to distinguish causative from
correlative links. Our approach is to identify confounders explicitly. Graphical causal models (GCMs) can discover
causal links from data and prior knowledge. GCMs summarize causal links between variables. Automated
selection of variables would allow GCMs to scale and yield more insight from data. Literature-based discovery
(LBD) methods were developed to identify links between concepts in the literature. Advanced methods permit
the search for concepts linked to each other through specific verbs, e.g., “causes”, “treats”. Our hypothesis is
that we can exploit structured knowledge extracted from the literature to inform GCMs. In prior work, we found
that LBD + GCM was better at identifying side-effects in EHR data than traditional methods. Compared to
methods that use solely data, we hypothesize that our method will increase the ability to detect causal
relationships from EHR data. The first aim is to determine the extent to which LBD-informed GCM improves the
identification of causal links for drug safety. We will build LBD-informed GCMs using publicly available
reference datasets for drug safety. These reference datasets contain drug/side-effect pairs for performance
benchmarking. (A) Test the ability of GCM algorithms to identify known causal links solely using data. We will
systematically evaluate GCM algorithms based on their ability to re-discover causal links in a reference
standard. Results will guide our studies on how GCM can be tuned. (B) Determine the effect of adding different
subsets of LBD-derived information to GCMs on identifying drug side-effects. We will build causal models using
increasing numbers of confounders. The second aim is to test the ability of LBD built with disease-specific
literature to improve the relevance of LBD-derived confounders for Alzheimer's Disease (AD). We chose AD for
its high prevalence and relative lack of effective pharmacologic treatment. (A) Compare LBD strategies in a
disease-specific setting. We will test LBD variants using disease-specific literature or with LBD lacking
subject-matter restrictions. (B) Define the ability of robust LBD-informed GCM to validate drug repurposing
candidates for treating AD symptoms. We will test the ability of advanced methods to iteratively resolve latent
confounding, when detected, to improve effect estimates. The fulfillment of these aims will yield new methods
to combine insights from the literature with causal modeling to uncover causal effects of drug exposures
on adverse events and on beneficial outcomes.

👤 Principal Investigator(s) (from NIH)

Scott Alexander Malec

🏛️ Recipient Organization (from NIH)

UNIVERSITY OF NEW MEXICO HEALTH SCIS CTR (ALBUQUERQUE, NM, UNITED STATES)

📅 Project Dates (from NIH)

Start: 2021-08-01
End: 2026-07-31

💰 Award Amount (from NIH)

$248,671

📊 Fiscal Year (from NIH)

2025

🏷️ Activity Code (from NIH)

R00

🔢 Project Number (from NIH)

5R00LM013367-05

Added

2023-11-16

Last modified

2023-11-16
Source: Scott Alexander Malec | ✓ Enriched from NIH

Using the literature to build causal models of retrospective observational data ✓ NIH

2021-08-01 to 2023-07-31 | Grant

United States National Library of Medicine (Bethesda, US)

Homepage URL: https://app.dimensions.ai/details/grant/grant.9844339

GRANT_NUMBER: K99LM013367

Organization identifiers

United States National Library of Medicine: Bethesda, US

Funding project translated title (en)
Using the literature to build causal models of retrospective observational data

📄 Project Abstract (from NIH)

Health data contain a wealth of information for research. Health data, such as those found in electronic health
records (EHRs), allow for the identification of links between health events, such as drug exposures and
side-effects. Some of these links indicate stable dependencies deemed as causes. Causal insight allows
reverse-engineering disease. If confounding is not addressed, it will be difficult to distinguish causative from
correlative links. Our approach is to identify confounders explicitly. Graphical causal models (GCMs) can discover
causal links from data and prior knowledge. GCMs summarize causal links between variables. Automated
selection of variables would allow GCMs to scale and yield more insight from data. Literature-based discovery
(LBD) methods were developed to identify links between concepts in the literature. Advanced methods permit
the search for concepts linked to each other through specific verbs, e.g., “causes”, “treats”. Our hypothesis is
that we can exploit structured knowledge extracted from the literature to inform GCMs. In prior work, we found
that LBD + GCM was better at identifying side-effects in EHR data than traditional methods. Compared to
methods that use solely data, we hypothesize that our method will increase the ability to detect causal
relationships from EHR data. The first aim is to determine the extent to which LBD-informed GCM improves the
identification of causal links for drug safety. We will build LBD-informed GCMs using publicly available
reference datasets for drug safety. These reference datasets contain drug/side-effect pairs for performance
benchmarking. (A) Test the ability of GCM algorithms to identify known causal links solely using data. We will
systematically evaluate GCM algorithms based on their ability to re-discover causal links in a reference
standard. Results will guide our studies on how GCM can be tuned. (B) Determine the effect of adding different
subsets of LBD-derived information to GCMs on identifying drug side-effects. We will build causal models using
increasing numbers of confounders. The second aim is to test the ability of LBD built with disease-specific
literature to improve the relevance of LBD-derived confounders for Alzheimer's Disease (AD). We chose AD for
its high prevalence and relative lack of effective pharmacologic treatment. (A) Compare LBD strategies in a
disease-specific setting. We will test LBD variants using disease-specific literature or with LBD lacking
subject-matter restrictions. (B) Define the ability of robust LBD-informed GCM to validate drug repurposing
candidates for treating AD symptoms. We will test the ability of advanced methods to iteratively resolve latent
confounding, when detected, to improve effect estimates. The fulfillment of these aims will yield new methods
to combine insights from the literature with causal modeling to uncover causal effects of drug exposures
on adverse events and on beneficial outcomes.

👤 Principal Investigator(s) (from NIH)

Scott Alexander Malec

🏛️ Recipient Organization (from NIH)

UNIVERSITY OF PITTSBURGH AT PITTSBURGH (PITTSBURGH, PA, UNITED STATES)

📅 Project Dates (from NIH)

Start: 2021-08-01
End: 2023-07-31

💰 Award Amount (from NIH)

$68,663

📊 Fiscal Year (from NIH)

2022

🏷️ Activity Code (from NIH)

K99

🔢 Project Number (from NIH)

5K99LM013367-02

Added

2022-07-20

Last modified

2022-07-20
Source: DimensionsWizard via Scott Alexander Malec | ✓ Enriched from NIH

Using Biomedical Knowledge to Identify Plausible Signals for Pharmacovigilance

Organization identifiers

United States National Library of Medicine: n/a, US

Funding project translated title (en)
Using Biomedical Knowledge to Identify Plausible Signals for Pharmacovigilance

Added

2018-02-06

Last modified

2018-02-06
Source: DimensionsWizard via Scott Alexander Malec

NLM Training Program in Biomedical Informatics & Data Science for Predoctoral and Postdoctoral Fellows

Organization identifiers

United States National Library of Medicine: n/a, US

Funding project translated title (en)
NLM Training Program in Biomedical Informatics & Data Science for Predoctoral and Postdoctoral Fellows

Added

2018-02-06

Last modified

2018-02-06
Source: DimensionsWizard via Scott Alexander Malec

Education and qualifications (4)

University of Pittsburgh: Pittsburgh, PA, US

2018-08-01 to 2023-07-31 | Postdoctoral Scholar (Department of Biomedical Informatics)
Education

Department

Department of Biomedical Informatics

Added

2018-09-20

Last modified

2024-09-05
Source: Scott Alexander Malec

University of Texas Health Science Center at Houston: Houston, Texas, US

2015-08-25 to 2018-08-15 | PhD (School of Biomedical Informatics)
Education

Department

School of Biomedical Informatics

Added

2018-02-06

Last modified

2018-09-20
Source: Scott Alexander Malec

Carnegie Mellon University: Pittsburgh, PA, US

2005-08-26 to 2010-05-15 | MSIT (CMU)
Education

Organization identifiers

RINGGOLD: 6612
Carnegie Mellon University: Pittsburgh, PA, US

Department

CMU

Added

2018-02-06

Last modified

2018-02-06
Source: Scott Alexander Malec

University of Pittsburgh: Pittsburgh, PA, US

2002-01-01 to 2003-12-13 | MLIS (Department of Library and Information Science)
Education

Organization identifiers

RINGGOLD: 6614
University of Pittsburgh: Pittsburgh, PA, US

Department

Department of Library and Information Science

Added

2018-02-06

Last modified

2018-02-06
Source: Scott Alexander Malec

Works (15)

Predicting Alzheimer’s Disease Diagnosis, a Decade or more Years before Onset using the Electronic Health Record and Random Forest Machine Learning Models

2025-11-06 | Preprint
Contributors: Sanya B. Taneja; Richard D. Boyce; Scott A. Malec; Steven M. Albert; C. Elizabeth Shaaban (and 9 more)

Contributors

Sanya B. Taneja (Author)
Richard D. Boyce (Author)
Scott A. Malec (Author)
Steven M. Albert (Author)
C. Elizabeth Shaaban (Author)
Arthur S. Levine (Author)
Paul Munro (Author)
Jiang Bian (Author)
Jie Xu (Author)
Demetrius Maraganore (Author)
Karen Schliep (Author)
Jonathan C. Silverstein (Author)
Michelle Kienholz (Author)
Helmet T. Karim (Author)

Added

2025-11-07

Last modified

2025-11-12
Source: Validated source Crossref

Unsupervised Latent Pattern Analysis for Estimating Type 2 Diabetes Risk in Undiagnosed Populations

2025-10-12 | Conference paper
Contributors: Praveen Kumar; Vincent T. Metzger; Scott Alexander Malec

Contributors

Praveen Kumar (Author)
Vincent T. Metzger (Author)
Scott Alexander Malec (Author)

Added

2025-12-10

Last modified

2025-12-10
Source: Validated source Crossref

Detecting Opioid Use Disorder in Health Claims Data With Positive Unlabeled Learning

IEEE Journal of Biomedical and Health Informatics
2025-02 | Journal article
Contributors: Praveen Kumar; Fariha Moomtaheen; Scott A. Malec; Jeremy J. Yang; Cristian G. Bologa (and 9 more)

Contributors

Praveen Kumar (Author)
Fariha Moomtaheen (Author)
Scott A. Malec (Author)
Jeremy J. Yang (Author)
Cristian G. Bologa (Author)
Kristan A Schneider (Author)
Yiliang Zhu (Author)
Mauricio Tohen (Author)
Gerardo Villarreal (Author)
Douglas J. Perkins (Author)
Elliot M. Fielstein (Author)
Sharon E. Davis (Author)
Michael E. Matheny (Author)
Christophe G. Lambert (Author)

Added

2025-02-10

Last modified

2025-02-19
Source: Validated source Crossref

Unsupervised Latent Pattern Analysis for Estimating Type 2 Diabetes Risk in Undiagnosed Populations

The 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB)
2025 | Conference paper
Contributors: Praveen Kumar; Vincent T. Metzger; Scott A. Malec

Contributors

Praveen Kumar (Author)
Vincent T. Metzger (Author)
Scott A. Malec (Author)

Abstract

The global prevalence of diabetes, particularly type 2 diabetes mellitus (T2DM), is rapidly increasing, posing significant health and economic challenges. T2DM not only disrupts blood glucose regulation but also damages vital organs such as the heart, kidneys, eyes, nerves, and blood vessels, leading to substantial morbidity and mortality. In the US alone, the economic burden of diagnosed diabetes exceeded $400 billion in 2022. Early detection of individuals at risk is critical to mitigating these impacts. While machine learning approaches for T2DM prediction are increasingly adopted, many rely on supervised learning, which is often limited by the lack of confirmed negative cases. To address this limitation, we propose a novel unsupervised framework that integrates Non-negative Matrix Factorization (NMF) with statistical techniques to identify individuals at risk of developing T2DM. Our method identifies latent patterns of multimorbidity and polypharmacy among diagnosed T2DM patients and applies these patterns to estimate the T2DM risk in undiagnosed individuals. By leveraging data-driven insights from comorbidity and medication usage, our approach provides an interpretable and scalable solution that can assist healthcare providers in implementing timely interventions, ultimately improving patient outcomes and potentially reducing the future health and economic burden of T2DM.

Added

2025-10-01

Last modified

2025-10-07
Source: Scott Alexander Malec

An open source knowledge graph ecosystem for the life sciences.

Scientific Data
2024-04-11 | Journal article
Contributors: Callahan TJ; Tripodi IJ; Stefanski AL; Cappelletti L; Taneja SB (and 27 more)

Contributors

Callahan TJ (Author) [ORCID: 0000-0002-8169-9049]
Tripodi IJ (Author)
Stefanski AL (Author)
Cappelletti L (Author)
Taneja SB (Author) [ORCID: 0000-0003-1707-1617]
Wyrwa JM (Author)
Casiraghi E (Author) [ORCID: 0000-0003-2024-7572]
Matentzoglu NA (Author) [ORCID: 0000-0002-7356-1779]
Reese J (Author)
Silverstein JC (Author) [ORCID: 0000-0002-9252-6039]
Hoyt CT (Author) [ORCID: 0000-0003-4423-4370]
Boyce RD (Author)
Malec SA (Author)
Unni DR (Author) [ORCID: 0000-0002-3583-7340]
Joachimiak MP (Author)
Robinson PN (Author) [ORCID: 0000-0002-0736-9199]
Mungall CJ (Author)
Cavalleri E (Author) [ORCID: 0000-0003-1973-5712]
Fontana T (Author)
Valentini G (Author) [ORCID: 0000-0002-5694-3919]
Mesiti M (Author) [ORCID: 0000-0001-5701-0080]
Gillenwater LA (Author)
Santangelo B (Author)
Vasilevsky NA (Author) [ORCID: 0000-0001-5208-3432]
Hoehndorf R (Author) [ORCID: 0000-0001-8149-5890]
Bennett TD (Author)
Ryan PB (Author)
Hripcsak G (Author)
Kahn MG (Author) [ORCID: 0000-0003-4786-6875]
Bada M (Author)
Baumgartner WA Jr (Author)
Hunter LE (Author)

Abstract

Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.

Added

2025-11-17

Last modified

2025-11-17
Source: Scott Alexander Malec

Causal feature selection using a knowledge graph combining structured knowledge from the biomedical literature and ontologies: A use case studying depression as a risk factor for Alzheimer’s disease

Journal of Biomedical Informatics
2023-06 | Journal article
Contributors: Scott A. Malec; Sanya B. Taneja; Steven M. Albert; C. Elizabeth Shaaban; Helmet T. Karim (and 4 more)

Contributors

Scott A. Malec (Author)
Sanya B. Taneja (Author)
Steven M. Albert (Author)
C. Elizabeth Shaaban (Author)
Helmet T. Karim (Author)
Arthur S. Levine (Author)
Paul Munro (Author)
Tiffany J. Callahan (Author)
Richard D. Boyce (Author)

Added

2023-05-25

Last modified

2025-02-19
Source: Validated source Crossref

An Open-Source Knowledge Graph Ecosystem for the Life Sciences

arXiv
2023 | Preprint
Contributors: Tiffany J. Callahan; Ignacio J. Tripodi; Adrianne L. Stefanski; Luca Cappelletti; Sanya B. Taneja (and 27 more)

Contributors

Tiffany J. Callahan (Author)
Ignacio J. Tripodi (Author)
Adrianne L. Stefanski (Author)
Luca Cappelletti (Author)
Sanya B. Taneja (Author)
Jordan M. Wyrwa (Author)
Elena Casiraghi (Author)
Nicolas A. Matentzoglu (Author)
Justin Reese (Author)
Jonathan C. Silverstein (Author)
Charles Tapley Hoyt (Author)
Richard D. Boyce (Author)
Scott A. Malec (Author)
Deepak R. Unni (Author)
Marcin P. Joachimiak (Author)
Peter N. Robinson (Author)
Christopher J. Mungall (Author)
Emanuele Cavalleri (Author)
Tommaso Fontana (Author)
Giorgio Valentini (Author)
Marco Mesiti (Author)
Lucas A. Gillenwater (Author)
Brook Santangelo (Author)
Nicole A. Vasilevsky (Author)
Robert Hoehndorf (Author)
Tellen D. Bennett (Author)
Patrick B. Ryan (Author)
George Hripcsak (Author)
Michael G. Kahn (Author)
Michael Bada (Author)
William A. Baumgartner (Author)
Lawrence E. Hunter (Author)

Abstract

Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to automatically construct them. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoints and abstraction algorithms), and benchmarks (e.g., prebuilt KGs and embeddings). We evaluate the ecosystem by surveying open-source KG construction methods and analyzing its computational performance when constructing 12 large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.

Added

2023-11-16

Last modified

2025-02-19
Source: Scott Alexander Malec

Causal feature selection using a knowledge graph combining structured knowledge from the biomedical literature and ontologies: a use case studying depression as a risk factor for Alzheimer's disease

2022-07-20 | Preprint
Contributors: Scott Alexander Malec; Sanya B Taneja; Steven M Albert; C. Elizabeth Shaaban; Helmet T Karim (and 4 more)

Contributors

Scott Alexander Malec (Author)
Sanya B Taneja (Author)
Steven M Albert (Author)
C. Elizabeth Shaaban (Author)
Helmet T Karim (Author)
Art S Levine (Author)
Paul Wesley Munro (Author)
Tiffany J Callahan (Author)
Richard David Boyce (Author)

Added

2022-07-20

Last modified

2025-02-19
Source: Validated source Crossref

Does clinical data capture modifiable midlife risk factors for Alzheimer’s disease?

Alzheimer's & Dementia
2021-12 | Journal article
Contributors: C Elizabeth Shaaban; Sanya B Taneja; Kailyn F Witonsky; Scott A Malec; Helmet T Karim (and 5 more)

Contributors

C Elizabeth Shaaban (Author)
Sanya B Taneja (Author)
Kailyn F Witonsky (Author)
Scott A Malec (Author)
Helmet T Karim (Author)
Sheila Pratt (Author)
Arthur S Levine (Author)
Paul Munro (Author)
Richard D Boyce (Author)
Steven M Albert (Author)

Added

2023-04-12

Last modified

2025-02-19
Source: Scott Alexander Malec

Using computable knowledge mined from the literature to elucidate confounders for EHR-based pharmacovigilance

2020-07-10 | Other
Contributors: Scott A. Malec; Elmer V. Bernstam; Peng Wei; Richard D. Boyce; Trevor Cohen

Contributors

Scott A. Malec (Author)
Elmer V. Bernstam (Author)
Peng Wei (Author)
Richard D. Boyce (Author)
Trevor Cohen (Author)

Added

2020-07-14

Last modified

2025-02-19
Source: Validated source Crossref

Exploring Novel Computable Knowledge in Structured Drug Product Labels.

AMIA Joint Summits on Translational Science Proceedings
2020-05 | Journal article
Contributors: Malec SA; Boyce RD

Contributors

Malec SA (Author)
Boyce RD (Author)

External identifiers

PMID: 32477661
PMC: PMC7233092

Added

2020-07-08

Last modified

2025-02-19
Source: Validated source Europe PubMed Central

Literature-Based Discovery of Confounding in Observational Clinical Data.

AMIA Annual Symposium Proceedings
2016 | Journal article
Contributors: Malec SA; Wei P; Xu H; Bernstam EV; Myneni S (and 1 more)

Contributors

Malec SA (Author)
Wei P (Author)
Xu H (Author)
Bernstam EV (Author)
Myneni S (Author)
Cohen T (Author)

External identifiers

PMID: 28269951
PMC: PMC5333204

Added

2018-02-06

Last modified

2025-02-19
Source: Validated source Europe PubMed Central

Propp Revisited: Integration of Linguistic Markup into Structured Content Descriptors of Tales

Digital Humanities 2010
2010-07 | Conference paper

Added

2018-02-07

Last modified

2025-02-19
Source: Scott Alexander Malec

Integration of Linguistic Markup into Semantic Models of Folk Narratives: The Fairy Tale Use Case.

Unpublished
2010 | Conference paper
Contributors: Piroska Lendvai; Thierry Declerck; Sándor Darányi; Pablo Gervás; Raquel Hervás (and 2 more)

Contributors

Piroska Lendvai (Author)
Thierry Declerck (Author)
Sándor Darányi (Author)
Pablo Gervás (Author)
Raquel Hervás (Author)
Scott Malec (Author)
Federico Peinado (Author)

External identifiers

DOI: 10.13140/2.1.2365.2801

Added

2018-02-06

Last modified

2025-10-09
Source: Validated source DataCite

AutoPropp: Toward the Automatic Markup, Classification, and Annotation of Russian Magic Tales

Added

2018-02-06

Last modified

2025-02-19
Source: Scott Alexander Malec