What is lacuene?
lacuene (from French lacune, meaning gap) is a multi-source biomedical data
reconciliation tool that cross-references 95 neural crest genes across 16 public databases.
It surfaces funding gaps: genes with established clinical relevance but insufficient
experimental research coverage, helping program officers identify high-impact targets for
craniofacial and dental research funding.
Key finding: Of 95 neural crest genes, 73 have Mendelian disease
associations in OMIM but zero experimental datasets in the NIDCR-funded FaceBase repository.
These 73 genes are concrete opportunities for new research investment.
Capabilities
lacuene provides sixteen interactive features, each designed around a question a program officer
or PI might ask during grant review or portfolio planning.
- Funding Gap Finder — Identifies genes with confirmed Mendelian
disease associations (OMIM) but no experimental coverage in FaceBase, ranked by
weighted priority score (combines syndrome burden, phenotype count, genetic constraint,
and publication scarcity). Click any gene to see its full profile.
- Source Coverage — Shows at a glance how completely each of the
16 databases covers the gene set. Immediately reveals which databases have
the largest gaps.
- Understudied Gene Ranking — Disease genes sorted by craniofacial
publication count (ascending). Low publication counts for genes with known pathogenic
variants suggest high-impact, low-competition research opportunities. Priority badges
mark the highest-value targets.
- Gene Landscape Graph — Interactive Cytoscape.js network visualization
with 95 nodes and 2000+ edges across four relationship types: shared HPO phenotypes
(gray edges), shared OMIM syndromes (pink edges), shared GO biological processes
(blue edges), and STRING protein–protein interactions (green dashed edges). Click any
node to highlight its neighborhood and open its detail panel. Supports force-directed,
circle, concentric, and cluster layouts.
- Community Clustering — Label propagation community detection
identifies groups of functionally related genes in the network. The cluster layout arranges
communities spatially with convex hull boundaries, revealing which biological modules
are well-studied vs. underserved.
- Cross-Source Anomaly Detection — CUE-computed rules identify
cross-source inconsistencies: genes with OMIM disease associations but no ClinVar variants,
high genetic constraint but no clinical trials, high publication counts but no FaceBase
coverage, and ClinVar variants but no HPO phenotypes. Filterable by anomaly type.
- Syndrome-Centric View — Flips the analysis from gene-level to
disease-level. Instead of asking “which databases cover SOX10?”, you can ask
“how well is Waardenburg syndrome covered?” Shows every multi-gene syndrome,
how many of its genes have FaceBase data, and aggregate publication and pathogenic variant
counts. Click a syndrome to highlight all its genes in the graph simultaneously.
- Portfolio Overlay — Paste a list of gene symbols from your current
funded portfolio (or a proposed grant) to instantly see which critical gaps your funding
addresses and which remain uncovered. Separates your genes into covered gaps (green),
unfunded gaps (red), and genes that are already well-covered.
- Cross-Source Filter — Interactive filter panel with tri-state
toggle buttons for each database (any / required / excluded) plus numeric ranges for
publication count and pathogenic variants. Ask compound questions like “every gene
in OMIM but not in FaceBase with more than 100 pathogenic variants” and see the
filtered results instantly.
- Gene Table Search — Real-time search across gene symbols,
syndrome names, and protein names. Filters the per-gene coverage table as you type.
- Per-Gene Dossier — Click any gene to see all 16
sources, publications with trend analysis (rising/stable/declining), pathogenic variants,
syndromes, tissue expression from GTEx, active NIH grants with PI names, genetic constraint
scores (pLI, LOEUF), active clinical trials, and STRING protein interaction partners.
- Tissue Expression — GTEx expression data showing top tissues and
craniofacial-specific TPM values in the gene detail panel. Confirms whether a gene
is expressed in tissues relevant to craniofacial development.
- Active Grants — NIH Reporter project details with PI names and
direct links. Reveals which gap genes already have federal research investment.
- Change History — The pipeline saves a timestamped snapshot of
the gap state each time it runs. Once multiple snapshots exist, the change history
shows which gaps opened or closed between runs — directly measuring the impact
of research investments over time.
- Exportable Briefing — Generates a plain-text summary paragraph
with top priority targets, suitable for pasting into emails or grant reviews.
Copy to clipboard in one click.
- CSV Export — Full dataset export with 23 columns covering all
16 sources, publication counts, pathogenic variants, and syndrome associations.
Four presets: All Genes, Critical Gaps Only, Top Priority (score ≥ 15), Understudied
(< 20 publications). Exports the currently filtered view when filters are active.
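The compound question quoted in the Cross-Source Filter entry can be sketched in Python. This is a minimal illustration under stated assumptions, not lacuene's actual filter code: the record layout (`sources`, `pathogenic_variants`), the function name `filter_genes`, the toggle constants, and the sample values are all hypothetical.

```python
from typing import Dict, List, Optional

# Tri-state values for each database toggle (hypothetical names).
ANY, REQUIRED, EXCLUDED = "any", "required", "excluded"


def filter_genes(
    genes: Dict[str, dict],
    toggles: Dict[str, str],
    pathogenic_above: Optional[int] = None,
) -> List[str]:
    """Return the symbols that satisfy every toggle and the numeric bound."""
    matches = []
    for symbol, record in genes.items():
        present = record["sources"]
        keep = True
        for source, state in toggles.items():
            if state == REQUIRED and source not in present:
                keep = False
            elif state == EXCLUDED and source in present:
                keep = False
        # "More than N" is a strict comparison, so N itself is rejected.
        if pathogenic_above is not None and record["pathogenic_variants"] <= pathogenic_above:
            keep = False
        if keep:
            matches.append(symbol)
    return sorted(matches)


# Placeholder records; the counts are invented for the example.
sample = {
    "GENE_A": {"sources": {"omim", "clinvar"}, "pathogenic_variants": 250},
    "GENE_B": {"sources": {"omim", "facebase"}, "pathogenic_variants": 400},
    "GENE_C": {"sources": {"omim"}, "pathogenic_variants": 12},
}

# "Every gene in OMIM but not in FaceBase with more than 100 pathogenic variants"
result = filter_genes(
    sample, {"omim": REQUIRED, "facebase": EXCLUDED}, pathogenic_above=100
)
print(result)
```

A source left at `ANY` (or omitted from `toggles`) places no constraint on the result, which is what lets the three states compose into compound questions.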
What lacuene does that individual databases don’t
Each of lacuene’s 16 data sources is excellent at what it does. OMIM catalogs
disease associations with unmatched depth. FaceBase curates craniofacial datasets with
careful experimental metadata. PubMed indexes the literature comprehensively. The challenge
isn’t the quality of any single source — it’s that no single source answers
cross-cutting questions:
- Gap detection across sources. OMIM can tell you a gene causes
Waardenburg syndrome. FaceBase can tell you what datasets it has. Neither tells you which
disease genes lack FaceBase data. lacuene computes that automatically for every gene
in the set.
- Disease-level aggregation. A syndrome like Treacher Collins involves
multiple genes (TCOF1, POLR1C, POLR1D, POLR1B). Evaluating research coverage for the
syndrome as a whole — rather than gene by gene — requires combining OMIM,
FaceBase, and PubMed data in a way none of those databases do individually.
- Portfolio-aware analysis. Funding agencies maintain portfolios of
supported research. Knowing which gaps a proposed grant would fill — and which would
remain — requires overlaying portfolio data against the gap analysis. This is a
question no public database is designed to answer.
- Reproducible reconciliation. lacuene’s pipeline uses
CUE lattice unification to merge all 16 sources
structurally, not through ad-hoc scripts. Adding another source means adding one normalizer;
the type system guarantees it integrates cleanly with the existing model. The full pipeline
rebuilds from cached data in under 10 seconds.
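The disease-level aggregation described above amounts to a group-by over per-gene records. The sketch below assumes a hypothetical record layout: `omim_syndromes` and `pubmed_total` reuse field names from the gap projection, while `in_facebase`, `pathogenic_variants`, and the helper `aggregate_by_syndrome` are illustrative names only.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_by_syndrome(genes: Dict[str, dict]) -> Dict[str, dict]:
    """Group per-gene records by syndrome and total their coverage metrics."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for symbol, record in genes.items():
        for syndrome in record["omim_syndromes"]:
            grouped[syndrome].append(symbol)

    summary = {}
    for syndrome, symbols in grouped.items():
        if len(symbols) < 2:
            continue  # keep multi-gene syndromes only
        summary[syndrome] = {
            "genes": sorted(symbols),
            # Summing booleans counts the genes flagged True.
            "genes_in_facebase": sum(genes[s]["in_facebase"] for s in symbols),
            "pubmed_total": sum(genes[s]["pubmed_total"] for s in symbols),
            "pathogenic_variants": sum(genes[s]["pathogenic_variants"] for s in symbols),
        }
    return summary


# The syndrome grouping follows the text above; every count and every
# FaceBase flag below is INVENTED purely for illustration.
sample = {
    "TCOF1": {"omim_syndromes": ["Treacher Collins syndrome"],
              "in_facebase": True, "pubmed_total": 30, "pathogenic_variants": 40},
    "POLR1C": {"omim_syndromes": ["Treacher Collins syndrome"],
               "in_facebase": False, "pubmed_total": 5, "pathogenic_variants": 8},
    "POLR1D": {"omim_syndromes": ["Treacher Collins syndrome"],
               "in_facebase": False, "pubmed_total": 4, "pathogenic_variants": 6},
    "GENE_X": {"omim_syndromes": ["Hypothetical single-gene syndrome"],
               "in_facebase": False, "pubmed_total": 1, "pathogenic_variants": 2},
}

summary = aggregate_by_syndrome(sample)
print(summary["Treacher Collins syndrome"])
```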
Data Sources
Each gene is queried against 16 biomedical databases. The presence or absence of a gene
in each source contributes to its coverage profile and gap severity assessment.
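As a rough illustration of how presence and absence could be turned into a coverage profile and a per-source coverage figure, consider the sketch below. The function names, the lowercase source keys, and the placeholder genes are assumptions for the example; the actual pipeline computes these values in CUE.

```python
from typing import Dict, List, Set


def coverage_profile(found_in: Set[str], sources: List[str]) -> Dict[str, bool]:
    """Presence/absence of one gene across the listed sources."""
    return {source: source in found_in for source in sources}


def source_coverage(presence: Dict[str, Set[str]], sources: List[str]) -> Dict[str, float]:
    """Fraction of the gene set that each source covers."""
    total = len(presence)
    if total == 0:
        return {source: 0.0 for source in sources}
    return {
        source: sum(source in found for found in presence.values()) / total
        for source in sources
    }


# Three of the sixteen sources and four placeholder genes, for brevity.
sources = ["go", "omim", "facebase"]
presence = {
    "GENE_A": {"go", "omim", "facebase"},
    "GENE_B": {"go", "omim"},
    "GENE_C": {"go", "omim"},
    "GENE_D": {"go"},
}

profile = coverage_profile(presence["GENE_B"], sources)
coverage = source_coverage(presence, sources)
print(profile)
print(coverage)
```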
Source Descriptions
- Gene Ontology (GO) —
Provides standardized molecular function, biological process, and cellular component
annotations. Every gene in our set is annotated with GO terms via the
QuickGO API.
Ashburner et al. (2000) Nature Genetics 25:25–29.
- OMIM —
Online Mendelian Inheritance in Man. Catalogs human genes and genetic disorders.
A gene's presence in OMIM with associated syndromes indicates established
disease relevance.
Amberger et al. (2019) Nucleic Acids Research 47:D1038–D1043.
- Human Phenotype Ontology (HPO) —
Standardized vocabulary of phenotypic abnormalities. Provides the phenotype-to-gene
associations used to compute shared-phenotype edges in the gene landscape graph.
Köhler et al. (2021) Nucleic Acids Research 49:D1207–D1217.
- UniProt —
Universal Protein Resource. Provides protein names, accession numbers, and
functional annotations for each gene product.
UniProt Consortium (2023) Nucleic Acids Research 51:D523–D531.
- FaceBase —
NIDCR-funded data repository for craniofacial research. Contains experimental
datasets (RNA-seq, ChIP-seq, imaging, etc.). A gene's absence from FaceBase
despite disease relevance represents the core funding gap this tool identifies.
Brinkley et al. (2020) Orthodontics & Craniofacial Research 23 Suppl 1:44–51.
- ClinVar —
NCBI's archive of clinically significant genomic variants. We query pathogenic and
likely pathogenic variants per gene, providing a measure of clinical genetic evidence.
Landrum et al. (2020) Nucleic Acids Research 48:D845–D855.
- PubMed —
NCBI's biomedical literature index. We query each gene combined with “craniofacial
OR neural crest” to count domain-specific publications. Low publication counts
for disease-associated genes indicate understudied targets.
Publication data queried via NCBI E-utilities.
- gnomAD —
Genome Aggregation Database. Provides population allele frequencies and gene-level
constraint metrics (pLI, LOEUF) that quantify how intolerant a gene is to loss-of-function
variation. Strong constraint (pLI close to 1, low LOEUF) indicates likely essential genes
where loss-of-function mutations are strongly selected against.
- NIH Reporter —
NIH RePORTER (Research Portfolio Online Reporting Tools Expenditures and Results). Tracks active NIH-funded grants
mentioning each gene, providing a direct measure of current federal research investment.
Genes with disease relevance but no active grants represent funding opportunities.
- GTEx —
Genotype-Tissue Expression project. Provides tissue-specific gene expression data
across 54 human tissues. Used to confirm craniofacial-relevant expression patterns
and identify genes with tissue-specific regulatory programs.
- ClinicalTrials.gov —
Registry and results database of clinical studies. Queried via the v2 API for active
interventional and observational trials mentioning each gene. Surfaces which disease
genes have active translational research, complementing the basic science coverage
from other sources.
- STRING —
Search Tool for Retrieval of Interacting Genes/Proteins. Provides known and predicted
protein–protein interactions with confidence scores. Used to build PPI edges in
the gene landscape graph and identify interaction partners within the 95-gene network.
Szklarczyk et al. (2023) Nucleic Acids Research 51:D483–D489.
- Orphanet —
European reference portal for rare diseases and orphan drugs. Provides disorder-gene
associations with prevalence estimates and inheritance patterns from the en_product6
XML dataset. Complements OMIM with European rare disease classification and
epidemiological data.
Rath et al. (2012) Human Mutation 33:803–808.
- Open Targets —
Systematic drug target identification platform integrating genomic, transcriptomic,
and chemical data. Provides drug tractability assessments, clinical pipeline phase
(preclinical through approved), and known drug associations per gene. Surfaces
which gap genes already have therapeutic development activity.
Ochoa et al. (2023) Nucleic Acids Research 51:D1302–D1310.
- MGI/ZFIN (Alliance of Genome Resources) —
Aggregates model organism data from the Mouse Genome Informatics (MGI) and
Zebrafish Information Network (ZFIN) databases. Reports availability of mouse
and zebrafish genetic models for each gene, indicating translational research
readiness — genes with established animal models are closer to functional
validation.
Alliance of Genome Resources Consortium (2024) Genetics 227:iyae149.
- AlphaFold/PDB —
Protein structure availability from AlphaFold predicted structures and the RCSB
Protein Data Bank (PDB). Reports AlphaFold mean confidence (pLDDT) and count of
experimental crystal/cryo-EM structures. Structural availability enables
structure-based drug design and mechanistic understanding of disease variants.
Jumper et al. (2021) Nature 596:583–589.
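The PubMed entry above describes a per-gene count query through NCBI E-utilities. A minimal sketch of how such a request could be built and its response read is shown below. The ESearch endpoint and its `db`, `term`, `rettype`, and `retmode` parameters are part of E-utilities, but the exact search string, the helper names, and the canned response are assumptions; the real normalizer may phrase the query differently. No network call is made here.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_count_url(symbol: str) -> str:
    """Build an ESearch URL that asks only for the number of matching records."""
    # Assumed query shape: gene symbol AND (craniofacial OR "neural crest").
    term = f'{symbol} AND (craniofacial OR "neural crest")'
    params = {"db": "pubmed", "term": term, "rettype": "count", "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(payload: dict) -> int:
    """Extract the hit count from a decoded ESearch JSON response."""
    return int(payload["esearchresult"]["count"])


url = build_pubmed_count_url("SOX10")
query = parse_qs(urlparse(url).query)
print(query["term"][0])

# Canned response in the shape ESearch returns; the count is made up.
example_payload = {"esearchresult": {"count": "42"}}
print(parse_count(example_payload))
```

Caching the decoded payload on disk, as the pipeline description suggests, would let later runs call `parse_count` without contacting NCBI again.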
Methodology
Gene Selection
The 95 genes span the neural crest gene regulatory network as described in the
literature, organized into 8 developmental categories: border specification,
neural crest specifiers, EMT/migration, signaling pathways, craniofacial patterning,
melanocyte/pigmentation, enteric nervous system, and cardiac neural crest.
Simoes-Costa & Bronner (2015) Development 142:242–257 and
Martik & Bronner (2017) Developmental Biology 429:293–302.
Data Pipeline
Each source is fetched by a Python normalizer script that queries the source API,
caches raw results locally, and emits a CUE data
file. CUE’s lattice-based unification merges all 16 sources into a single typed
model per gene — each source owns its fields, and CUE guarantees structural
consistency across the full dataset without imperative merge logic.
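To give a feel for what unification buys over imperative merging, here is a simplified Python analogy. It is not how CUE is implemented: plain dictionaries stand in for CUE values, only concrete values are handled (no types or constraints), and `unify`, `unify_all`, and `UnificationError` are hypothetical names. The key property it imitates is that fragments merge in any order and a disagreement is an error rather than a silent overwrite.

```python
from typing import Any, Dict, Iterable


class UnificationError(ValueError):
    """Raised when two fragments assign different values to the same field."""


def unify(a: Dict[str, Any], b: Dict[str, Any], path: str = "") -> Dict[str, Any]:
    """Merge two fragments; equal values agree, unequal values conflict."""
    merged = dict(a)
    for key, value in b.items():
        where = f"{path}.{key}" if path else key
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = unify(merged[key], value, where)
        elif merged[key] != value:
            raise UnificationError(
                f"conflict at {where}: {merged[key]!r} vs {value!r}"
            )
    return merged


def unify_all(fragments: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold any number of per-source fragments into one model."""
    model: Dict[str, Any] = {}
    for fragment in fragments:
        model = unify(model, fragment)
    return model


# Each source owns its own fields; the publication count is invented.
omim_fragment = {"SOX10": {"symbol": "SOX10", "_in_omim": True,
                           "omim_syndromes": ["Waardenburg syndrome"]}}
pubmed_fragment = {"SOX10": {"symbol": "SOX10", "pubmed_total": 120}}

model = unify_all([omim_fragment, pubmed_fragment])
print(model["SOX10"])
```

Because the merge is order-independent, adding a new source only means producing one more fragment; nothing about the existing fragments has to change.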
Gap Detection
The “critical gap” definition is computed as a CUE projection:
    critical: [for k, v in genes
        if v._in_omim && !v._in_facebase {
            symbol:    k
            syndromes: v.omim_syndromes
            pub_count: v.pubmed_total
        }]
A gene is “critical” when it has Mendelian disease associations (OMIM) but
lacks experimental datasets in the NIDCR-funded FaceBase repository. The gap list is
sorted by publication count (ascending) to prioritize the most understudied genes.
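For readers more familiar with Python than CUE, the same projection and the ascending sort can be sketched as follows. The field names mirror the CUE snippet; the function name `critical_gaps`, the tie-break on symbol, and the placeholder records are assumptions for illustration.

```python
from typing import Dict, List


def critical_gaps(genes: Dict[str, dict]) -> List[dict]:
    """Select genes present in OMIM but absent from FaceBase, least-studied first."""
    gaps = [
        {
            "symbol": symbol,
            "syndromes": record["omim_syndromes"],
            "pub_count": record["pubmed_total"],
        }
        for symbol, record in genes.items()
        if record["_in_omim"] and not record["_in_facebase"]
    ]
    # Ascending publication count; ties are broken by symbol for stable output.
    return sorted(gaps, key=lambda gap: (gap["pub_count"], gap["symbol"]))


# Placeholder records; all values are invented for the example.
sample = {
    "GENE_A": {"_in_omim": True, "_in_facebase": False,
               "omim_syndromes": ["Syndrome 1"], "pubmed_total": 40},
    "GENE_B": {"_in_omim": True, "_in_facebase": True,
               "omim_syndromes": ["Syndrome 2"], "pubmed_total": 3},
    "GENE_C": {"_in_omim": True, "_in_facebase": False,
               "omim_syndromes": ["Syndrome 3"], "pubmed_total": 7},
    "GENE_D": {"_in_omim": False, "_in_facebase": False,
               "omim_syndromes": [], "pubmed_total": 0},
}

gaps = critical_gaps(sample)
print([gap["symbol"] for gap in gaps])
```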
Graph Construction
The gene landscape graph connects genes via four relationship types: shared HPO phenotypes
(gray edges), shared OMIM syndromes (pink edges), shared GO biological processes
(blue edges), and STRING protein–protein interactions (green dashed edges).
Shared-phenotype edges are drawn only for phenotypes annotated to 2–5 genes in the set,
which avoids an explosion of edges from near-universal phenotypes like “Intellectual disability.”
PPI edges are filtered to interactions within the 95-gene network with confidence
scores above 0.4 (medium confidence). Node size reflects log-scaled craniofacial
publication count; color indicates developmental role. Community detection via label
propagation identifies clusters of functionally related genes.
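The two edge filters described above can be sketched in a few lines of Python. The thresholds (2–5 genes per phenotype, confidence above 0.4) come from the text; the function names, input shapes, and placeholder data are assumptions, and the actual generator may structure this differently.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def shared_phenotype_edges(
    gene_phenotypes: Dict[str, Set[str]], min_genes: int = 2, max_genes: int = 5
) -> Set[Edge]:
    """Connect genes that share a phenotype annotated to 2-5 genes in the set."""
    genes_by_phenotype: Dict[str, Set[str]] = defaultdict(set)
    for gene, phenotypes in gene_phenotypes.items():
        for phenotype in phenotypes:
            genes_by_phenotype[phenotype].add(gene)

    edges: Set[Edge] = set()
    for genes in genes_by_phenotype.values():
        if min_genes <= len(genes) <= max_genes:
            # One undirected edge per gene pair, stored in sorted order.
            edges.update(combinations(sorted(genes), 2))
    return edges


def ppi_edges(
    interactions: Iterable[Tuple[str, str, float]],
    network: Set[str],
    threshold: float = 0.4,
) -> Set[Edge]:
    """Keep interactions inside the network whose score exceeds the threshold."""
    edges: Set[Edge] = set()
    for a, b, score in interactions:
        if a in network and b in network and score > threshold:
            first, second = sorted((a, b))
            edges.add((first, second))
    return edges


# Placeholder genes and scores, invented for the example.
phenotypes = {
    "G1": {"Intellectual disability", "Cleft palate"},
    "G2": {"Intellectual disability", "Cleft palate"},
    "G3": {"Intellectual disability", "Hearing impairment"},
    "G4": {"Intellectual disability"},
    "G5": {"Intellectual disability"},
    "G6": {"Intellectual disability"},
}
interactions = [
    ("G1", "G3", 0.9),
    ("G2", "G4", 0.4),        # not above the threshold
    ("G1", "OUTSIDE", 0.95),  # partner is outside the gene set
    ("G5", "G2", 0.7),
]

phenotype_edges = shared_phenotype_edges(phenotypes)
protein_edges = ppi_edges(interactions, set(phenotypes))
print(sorted(phenotype_edges))
print(sorted(protein_edges))
```

In this toy input, "Intellectual disability" is annotated to six genes and so contributes no edges, which is the behavior the 2–5 window is meant to produce.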
Technology
lacuene is built with CUE for data unification,
Python for normalization and generation, and Cytoscape.js
for graph visualization. The pipeline is fully reproducible from cached source data.
Source code is available on GitHub.
References
- [1] Ashburner M et al. “Gene Ontology: tool for the unification of biology.”
Nature Genetics 25:25–29 (2000).
doi:10.1038/75556
- [2] Amberger JS et al. “OMIM.org: leveraging knowledge across phenotype-gene relationships.”
Nucleic Acids Research 47:D1038–D1043 (2019).
doi:10.1093/nar/gky1151
- [3] Köhler S et al. “The Human Phenotype Ontology in 2021.”
Nucleic Acids Research 49:D1207–D1217 (2021).
doi:10.1093/nar/gkaa1043
- [4] UniProt Consortium. “UniProt: the Universal Protein Knowledgebase in 2023.”
Nucleic Acids Research 51:D523–D531 (2023).
doi:10.1093/nar/gkac1052
- [5] Brinkley JF et al. “The FaceBase Consortium: a comprehensive resource for craniofacial researchers.”
Orthodontics & Craniofacial Research 23 Suppl 1:44–51 (2020).
doi:10.1111/ocr.12385
- [6] Landrum MJ et al. “ClinVar: improvements to accessing data.”
Nucleic Acids Research 48:D845–D855 (2020).
doi:10.1093/nar/gkz972
- [7] Simoes-Costa M, Bronner ME. “Establishing neural crest identity: a gene regulatory recipe.”
Development 142:242–257 (2015).
doi:10.1242/dev.105445
- [8] Martik ML, Bronner ME. “Regulatory logic underlying diversification of the neural crest.”
Developmental Biology 429:293–302 (2017).
doi:10.1016/j.ydbio.2017.05.028
- [9] Szklarczyk D et al. “The STRING database in 2023: protein–protein association networks
and functional enrichment analyses for any sequenced genome of interest.”
Nucleic Acids Research 51:D483–D489 (2023).
doi:10.1093/nar/gkac1000
- [10] Rath A et al. “Representation of rare diseases in health information systems:
the Orphanet approach to serve a wide range of end users.”
Human Mutation 33:803–808 (2012).
doi:10.1002/humu.22078
- [11] Ochoa D et al. “The next-generation Open Targets Platform: reimagined, redesigned, rebuilt.”
Nucleic Acids Research 51:D1302–D1310 (2023).
doi:10.1093/nar/gkac1046
- [12] Alliance of Genome Resources Consortium. “The Alliance of Genome Resources: building a
modern data ecosystem for model organism databases.”
Genetics 227:iyae149 (2024).
doi:10.1093/genetics/iyae149
- [13] Jumper J et al. “Highly accurate protein structure prediction with AlphaFold.”
Nature 596:583–589 (2021).
doi:10.1038/s41586-021-03819-2