lacuene About & Methodology

What is lacuene?

lacuene (from French lacune, meaning gap) is a multi-source biomedical data reconciliation tool that cross-references 95 neural crest genes across 16 public databases. It surfaces funding gaps: genes with established clinical relevance but insufficient experimental research coverage, helping program officers identify high-impact targets for craniofacial and dental research funding.

Key finding: Of 95 neural crest genes, 73 have Mendelian disease associations in OMIM but zero experimental datasets in the NIDCR-funded FaceBase repository. These represent concrete opportunities for new research investment.

Capabilities

lacuene provides sixteen interactive features, each designed around a question a program officer or PI might ask during grant review or portfolio planning.

  1. Funding Gap Finder — Identifies genes with confirmed Mendelian disease associations (OMIM) but no experimental coverage in FaceBase, ranked by weighted priority score (combines syndrome burden, phenotype count, genetic constraint, and publication scarcity). Click any gene to see its full profile.
  2. Source Coverage — Shows at a glance how completely each of the 16 databases covers the gene set. Immediately reveals which databases have the largest gaps.
  3. Understudied Gene Ranking — Disease genes sorted by craniofacial publication count (ascending). Low publication counts for genes with known pathogenic variants suggest high-impact, low-competition research opportunities. Priority badges mark the highest-value targets.
  4. Gene Landscape Graph — Interactive Cytoscape.js network visualization with 95 nodes and 2000+ edges across four relationship types: shared HPO phenotypes (gray edges), shared OMIM syndromes (pink edges), shared GO biological processes (blue edges), and STRING protein–protein interactions (green dashed edges). Click any node to highlight its neighborhood and open its detail panel. Supports force-directed, circle, concentric, and cluster layouts.
  5. Community Clustering — Label propagation community detection identifies groups of functionally related genes in the network. The cluster layout arranges communities spatially with convex hull boundaries, revealing which biological modules are well-studied vs. underserved.
  6. Cross-Source Anomaly Detection — CUE-computed rules identify cross-source inconsistencies: genes with OMIM disease associations but no ClinVar variants, high genetic constraint but no clinical trials, high publication counts but no FaceBase coverage, and ClinVar variants but no HPO phenotypes. Filterable by anomaly type.
  7. Syndrome-Centric View — Flips the analysis from gene-level to disease-level. Instead of asking “which databases cover SOX10?”, you can ask “how well is Waardenburg syndrome covered?” Shows every multi-gene syndrome, how many of its genes have FaceBase data, and aggregate publication and pathogenic variant counts. Click a syndrome to highlight all its genes in the graph simultaneously.
  8. Portfolio Overlay — Paste a list of gene symbols from your current funded portfolio (or a proposed grant) to instantly see which critical gaps your funding addresses and which remain uncovered. Separates your genes into covered gaps (green), unfunded gaps (red), and genes that are already well-covered.
  9. Cross-Source Filter — Interactive filter panel with tri-state toggle buttons for each database (any / required / excluded) plus numeric ranges for publication count and pathogenic variants. Ask compound questions like “every gene in OMIM but not in FaceBase with more than 100 pathogenic variants” and see the filtered results instantly.
  10. Gene Table Search — Real-time search across gene symbols, syndrome names, and protein names. Filters the per-gene coverage table as you type.
  11. Per-Gene Dossier — Click any gene to see all 16 sources, publications with trend analysis (rising/stable/declining), pathogenic variants, syndromes, tissue expression from GTEx, active NIH grants with PI names, genetic constraint scores (pLI, LOEUF), active clinical trials, and STRING protein interaction partners.
  12. Tissue Expression — GTEx expression data showing top tissues and craniofacial-specific TPM values in the gene detail panel. Confirms whether a gene is expressed in tissues relevant to craniofacial development.
  13. Active Grants — NIH Reporter project details with PI names and direct links. Reveals which gap genes already have federal research investment.
  14. Change History — The pipeline saves a timestamped snapshot of the gap state each time it runs. Once multiple snapshots exist, the change history shows which gaps opened or closed between runs — directly measuring the impact of research investments over time.
  15. Exportable Briefing — Generates a plain-text summary paragraph with top priority targets, suitable for pasting into emails or grant reviews. Copy to clipboard in one click.
  16. CSV Export — Full dataset export with 23 columns covering all 16 sources, publication counts, pathogenic variants, and syndrome associations. Four presets: All Genes, Critical Gaps Only, Top Priority (score ≥ 15), Understudied (< 20 publications). Exports the currently filtered view when filters are active.
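As a rough illustration of the Change History feature (item 14), two snapshots of the gap state can be compared with set differences. This is a minimal sketch under stated assumptions, not the tool's actual code: the snapshot shape (a timestamp plus a list of gap gene symbols), the function name diff_snapshots, and the placeholder gene symbols are all hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two gap snapshots and report which gaps opened or closed.

    Each snapshot is assumed (hypothetically) to look like
    {"timestamp": "...", "gaps": ["GENE_A", "GENE_B", ...]}.
    """
    before = set(previous["gaps"])
    after = set(current["gaps"])
    return {
        "from": previous["timestamp"],
        "to": current["timestamp"],
        "closed": sorted(before - after),  # gaps present earlier, gone now
        "opened": sorted(after - before),  # gaps that are new in the later run
    }


# Placeholder snapshots; symbols and timestamps are invented for the example.
old = {"timestamp": "2025-01-01T00:00:00Z", "gaps": ["GENE_A", "GENE_B", "GENE_C"]}
new = {"timestamp": "2025-02-01T00:00:00Z", "gaps": ["GENE_A", "GENE_C", "GENE_D"]}
changes = diff_snapshots(old, new)
print("closed:", changes["closed"], "opened:", changes["opened"])
```

Comparing sets rather than ordered lists keeps the diff independent of how each run happens to sort its gap list.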

What lacuene does that individual databases don’t

Each of lacuene’s 16 data sources is excellent at what it does. OMIM catalogs disease associations with unmatched depth. FaceBase curates craniofacial datasets with careful experimental metadata. PubMed indexes the literature comprehensively. The challenge isn’t the quality of any single source; it’s that no single source answers cross-cutting questions, for example which genes with OMIM disease associations have no experimental datasets in FaceBase.

Data Sources

Each gene is queried against 16 biomedical databases. The presence or absence of a gene in each source contributes to its coverage profile and gap severity assessment.

Source               Coverage
Gene Ontology        95/95
OMIM                 95/95
HPO                  81/95
UniProt              95/95
FaceBase             22/95
ClinVar              95/95
PubMed               95/95
gnomAD               91/95
NIH Reporter         95/95
GTEx                 95/95
ClinicalTrials       95/95
STRING               95/95
Orphanet             80/95
Open Targets         91/95
AlphaFold/PDB        95/95
MGI/ZFIN             95/95

Source Descriptions

  1. Gene Ontology (GO) — Provides standardized molecular function, biological process, and cellular component annotations. Every gene in our set is annotated with GO terms via the QuickGO API. Ashburner et al. (2000) Nature Genetics 25:25–29.
  2. OMIM — Online Mendelian Inheritance in Man. Catalogs human genes and genetic disorders. A gene's presence in OMIM with associated syndromes indicates established disease relevance. Amberger et al. (2019) Nucleic Acids Research 47:D1038–D1043.
  3. Human Phenotype Ontology (HPO) — Standardized vocabulary of phenotypic abnormalities. Provides the phenotype-to-gene associations used to compute shared-phenotype edges in the gene landscape graph. Köhler et al. (2021) Nucleic Acids Research 49:D1207–D1217.
  4. UniProt — Universal Protein Resource. Provides protein names, accession numbers, and functional annotations for each gene product. UniProt Consortium (2023) Nucleic Acids Research 51:D523–D531.
  5. FaceBase — NIDCR-funded data repository for craniofacial research. Contains experimental datasets (RNA-seq, ChIP-seq, imaging, etc.). A gene's absence from FaceBase despite disease relevance represents the core funding gap this tool identifies. Brinkley et al. (2020) Orthodontics & Craniofacial Research 23 Suppl 1:44–51.
  6. ClinVar — NCBI's archive of clinically significant genomic variants. We query pathogenic and likely pathogenic variants per gene, providing a measure of clinical genetic evidence. Landrum et al. (2020) Nucleic Acids Research 48:D845–D855.
  7. PubMed — NCBI's biomedical literature index. We query each gene combined with “craniofacial OR neural crest” to count domain-specific publications. Low publication counts for disease-associated genes indicate understudied targets. Publication data queried via NCBI E-utilities.
  8. gnomAD — Genome Aggregation Database. Provides population allele frequencies and gene-level constraint metrics (pLI, LOEUF) that quantify how intolerant a gene is to loss-of-function variation. A high pLI or a low LOEUF indicates strong constraint: loss-of-function mutations in such genes are strongly selected against, which suggests the gene is essential.
  9. NIH Reporter — NIH Research Portfolio Online Reporting Tools. Tracks active NIH-funded grants mentioning each gene, providing a direct measure of current federal research investment. Genes with disease relevance but no active grants represent funding opportunities.
  10. GTEx — Genotype-Tissue Expression project. Provides tissue-specific gene expression data across 54 human tissues. Used to confirm craniofacial-relevant expression patterns and identify genes with tissue-specific regulatory programs.
  11. ClinicalTrials.gov — Registry and results database of clinical studies. Queried via the v2 API for active interventional and observational trials mentioning each gene. Surfaces which disease genes have active translational research, complementing the basic science coverage from other sources.
  12. STRING — Search Tool for Retrieval of Interacting Genes/Proteins. Provides known and predicted protein–protein interactions with confidence scores. Used to build PPI edges in the gene landscape graph and identify interaction partners within the 95-gene network. Szklarczyk et al. (2023) Nucleic Acids Research 51:D483–D489.
  13. Orphanet — European reference portal for rare diseases and orphan drugs. Provides disorder-gene associations with prevalence estimates and inheritance patterns from the en_product6 XML dataset. Complements OMIM with European rare disease classification and epidemiological data. Rath et al. (2012) Human Mutation 33:803–808.
  14. Open Targets — Systematic drug target identification platform integrating genomic, transcriptomic, and chemical data. Provides drug tractability assessments, clinical pipeline phase (preclinical through approved), and known drug associations per gene. Surfaces which gap genes already have therapeutic development activity. Ochoa et al. (2023) Nucleic Acids Research 51:D1302–D1310.
  15. MGI/ZFIN (Alliance of Genome Resources) — Aggregates model organism data from the Mouse Genome Informatics (MGI) and Zebrafish Information Network (ZFIN) databases. Reports availability of mouse and zebrafish genetic models for each gene, indicating translational research readiness — genes with established animal models are closer to functional validation. Alliance of Genome Resources Consortium (2024) Genetics 227:iyae149.
  16. AlphaFold/PDB — Protein structure availability from AlphaFold predicted structures and the RCSB Protein Data Bank (PDB). Reports AlphaFold mean confidence (pLDDT) and count of experimental crystal/cryo-EM structures. Structural availability enables structure-based drug design and mechanistic understanding of disease variants. Jumper et al. (2021) Nature 596:583–589.
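The PubMed publication count (source 7) can be sketched against the NCBI E-utilities ESearch endpoint. This is an illustrative sketch only: the exact search term lacuene sends is not given here, so the term shape below (gene symbol AND the two domain terms) and the helper names are assumptions, and the sample response is a trimmed, made-up stand-in for a real reply.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(gene_symbol):
    """Build an ESearch URL that asks PubMed only for a hit count."""
    # Assumed query shape: the gene symbol combined with the domain terms.
    term = f'{gene_symbol} AND (craniofacial OR "neural crest")'
    params = {"db": "pubmed", "term": term, "rettype": "count", "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(response_text):
    """Extract the integer hit count from an ESearch JSON response."""
    return int(json.loads(response_text)["esearchresult"]["count"])


url = build_count_url("SOX10")
# Trimmed example of an ESearch JSON reply; the count value is invented.
sample = '{"header": {"type": "esearch"}, "esearchresult": {"count": "42"}}'
print(url)
print(parse_count(sample))
```

Separating URL construction from response parsing lets the parsing step be checked against cached responses without touching the network.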

Methodology

Gene Selection

The 95 genes span the neural crest gene regulatory network as described in the literature, organized into 8 developmental categories: border specification, neural crest specifiers, EMT/migration, signaling pathways, craniofacial patterning, melanocyte/pigmentation, enteric nervous system, and cardiac neural crest. Simoes-Costa & Bronner (2015) Development 142:242–257 and Martik & Bronner (2017) Developmental Biology 429:293–302.

Data Pipeline

Each source is fetched by a Python normalizer script that queries the source API, caches raw results locally, and emits a CUE data file. CUE’s lattice-based unification merges all 16 sources into a single typed model per gene — each source owns its fields, and CUE guarantees structural consistency across the full dataset without imperative merge logic.
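The final step of a normalizer, emitting a CUE data file, might look roughly like the sketch below. It relies on CUE being a superset of JSON, so json.dumps output is valid CUE for values. The package name, the emit_cue helper, and the sample values are assumptions for illustration; only the genes map and the pubmed_total field name are taken from the gap projection in this document.

```python
import json


def emit_cue(package, source_fields):
    """Render per-gene fields from one source as CUE text.

    source_fields maps gene symbol -> {field_name: value}. Field names are
    assumed to be valid CUE identifiers; values are serialized as JSON,
    which CUE accepts.
    """
    lines = [f"package {package}", ""]
    for symbol in sorted(source_fields):
        lines.append(f"genes: {json.dumps(symbol)}: {{")
        for field, value in sorted(source_fields[symbol].items()):
            lines.append(f"\t{field}: {json.dumps(value)}")
        lines.append("}")
    return "\n".join(lines) + "\n"


# Invented counts for two gene symbols; the package name is hypothetical.
cue_text = emit_cue("lacuene", {
    "SOX10": {"pubmed_total": 120},
    "PAX3": {"pubmed_total": 85},
})
print(cue_text)
```

Because each source writes only its own fields under genes, files produced this way by different normalizers can be unified by CUE without any merge code.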

Gap Detection

The “critical gap” definition is computed as a CUE projection:

critical: [for k, v in genes if v._in_omim && !v._in_facebase {
    symbol:    k
    syndromes: v.omim_syndromes
    pub_count: v.pubmed_total
}]

A gene is “critical” when it has Mendelian disease associations (OMIM) but lacks experimental datasets in the NIDCR-funded FaceBase repository. The gap list is sorted by publication count (ascending) to prioritize the most understudied genes.
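For readers less familiar with CUE, the same selection and ordering can be sketched in Python. This is an equivalent illustration, not part of the pipeline: the field names mirror the CUE projection, while the function name critical_gaps and the placeholder records are assumptions.

```python
def critical_gaps(genes):
    """Select genes in OMIM but not in FaceBase, least-published first."""
    gaps = [
        {
            "symbol": symbol,
            "syndromes": record["omim_syndromes"],
            "pub_count": record["pubmed_total"],
        }
        for symbol, record in genes.items()
        if record["_in_omim"] and not record["_in_facebase"]
    ]
    # Ascending publication count puts the most understudied genes first.
    return sorted(gaps, key=lambda gap: gap["pub_count"])


# Placeholder records; symbols, syndromes, and counts are invented.
genes = {
    "GENE_A": {"_in_omim": True, "_in_facebase": False,
               "omim_syndromes": ["Syndrome 1"], "pubmed_total": 40},
    "GENE_B": {"_in_omim": True, "_in_facebase": True,
               "omim_syndromes": ["Syndrome 2"], "pubmed_total": 5},
    "GENE_C": {"_in_omim": True, "_in_facebase": False,
               "omim_syndromes": ["Syndrome 3"], "pubmed_total": 3},
}
gaps = critical_gaps(genes)
print([gap["symbol"] for gap in gaps])
```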

Graph Construction

The gene landscape graph connects genes via four relationship types: shared HPO phenotypes (gray edges), shared OMIM syndromes (pink edges), shared GO biological processes (blue edges), and STRING protein–protein interactions (green dashed edges). Shared-phenotype edges are filtered to phenotypes present in 2–5 genes to avoid edge explosion from universal phenotypes like “Intellectual disability.” PPI edges are filtered to interactions within the 95-gene network with confidence scores above 0.4 (medium confidence). Node size reflects log-scaled craniofacial publication count; color indicates developmental role. Community detection via label propagation identifies clusters of functionally related genes.
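The edge filters and node sizing described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, input shapes, and the base and scale constants in node_size are hypothetical, while the 2–5 gene window and the 0.4 confidence threshold come from the text.

```python
import math
from itertools import combinations


def shared_phenotype_edges(gene_phenotypes, min_genes=2, max_genes=5):
    """Connect genes sharing a phenotype carried by min_genes..max_genes genes."""
    genes_by_phenotype = {}
    for gene, phenotypes in gene_phenotypes.items():
        for phenotype in phenotypes:
            genes_by_phenotype.setdefault(phenotype, set()).add(gene)

    edges = set()
    for carriers in genes_by_phenotype.values():
        # Skip phenotypes that are unique to one gene or too widespread.
        if min_genes <= len(carriers) <= max_genes:
            edges.update(combinations(sorted(carriers), 2))
    return sorted(edges)


def ppi_edges(interactions, gene_set, min_score=0.4):
    """Keep interactions inside the gene set with score above min_score."""
    return [
        (a, b, score)
        for a, b, score in interactions
        if a in gene_set and b in gene_set and score > min_score
    ]


def node_size(pub_count, base=10.0, scale=5.0):
    """Log-scaled node size; base and scale are illustrative constants."""
    return base + scale * math.log10(1 + pub_count)


# Placeholder data: six genes share a widespread phenotype, two share a rare one.
gene_phenotypes = {f"G{i}": ["P_common"] for i in range(1, 7)}
gene_phenotypes["G1"].append("P_rare")
gene_phenotypes["G2"].append("P_rare")
pheno = shared_phenotype_edges(gene_phenotypes)

interactions = [("G1", "G2", 0.9), ("G1", "G3", 0.3), ("G1", "OUTSIDE", 0.95)]
ppi = ppi_edges(interactions, set(gene_phenotypes))
print(pheno, ppi, node_size(99))
```

In this example the phenotype shared by all six genes contributes no edges, which is the effect the 2–5 window is meant to have on near-universal phenotypes.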

Technology

lacuene is built with CUE for data unification, Python for normalization and generation, and Cytoscape.js for graph visualization. The pipeline is fully reproducible from cached source data. Source code is available on GitHub.

References