TaxonomyLLM: Automating Enterprise Knowledge Graph Generation with Topological Attention
A deep dive into TaxonomyLLM, a novel transformer architecture that leverages disentangled TopoAttention to automatically translate SQL schemas into valid RDF taxonomies, bridging the gap between relational data and semantic knowledge graphs.
The Challenge of Enterprise Data Semantics
Organizations today face a critical challenge: disparate data classifications across systems that require reconciliation. As machine learning adoption accelerates, consistent data understanding through unified semantics becomes essential for extracting meaningful value from information. However, manually crafting taxonomy tags is non-trivial, expensive, and fundamentally unscalable.
Consider a typical enterprise scenario: you have dozens of database schemas across different departments, each using its own terminology. The Customer table in Sales might correspond to Client in Support and Account in Finance. Without a unified semantic layer, building cross-functional analytics or training enterprise-wide ML models becomes a Sisyphean task.
TaxonomyLLM addresses this challenge head-on by providing an LLM extension tailored for systematic translation of logical schemas into standardized RDF taxonomy structures.
The Innovation: Disentangled Topological Attention
What makes TaxonomyLLM unique is its TopoAttention mechanism - a specialized self-attention approach designed to explicitly model alignments between input schema structure and output taxonomy topology.
Why Standard Attention Falls Short
In a vanilla transformer, self-attention computes alignment scores between all token pairs:
```
Attention(Q_i, K_j) = Q_i · K_j^T
```
This captures general relationships but lacks the specificity needed for schema-to-taxonomy translation. When mapping a database schema to an RDF graph, we need to explicitly correlate:
- Schema tables with taxonomy classes
- Schema columns with taxonomy properties
- Column relationships with property constraints
The TopoAttention Solution
TopoAttention introduces disentangled attention matrices that focus exclusively on structural and positional reasoning:
```python
# Schema structure attention: correlates schema structure patterns
Hs_attention = Qs @ Ws.T

# Taxonomy position attention: correlates taxonomy positional elements
Hp_attention = Qp @ Wp.T
```
The mathematical formulation:
```
TopoAttention(E_s) = (E_s W_q^s)(E_s W_k^s)^T   # Structure reasoning
TopoAttention(E_p) = (E_p W_q^p)(E_p W_k^p)^T   # Position reasoning
```
Where:
- `E_s` captures schema structural embeddings
- `E_p` captures taxonomy positional embeddings
- Separate learned projections (`W_q`, `W_k`) specialize in structure vs. position
This disentanglement enables the model to independently reason about:
- What schema elements exist (structure)
- Where they should map in the taxonomy (position)
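The disentangled score computation can be illustrated with a minimal sketch. Everything here is a toy: the 2-dimensional embeddings, the projection matrices, and the helper functions are illustrative stand-ins, not TaxonomyLLM's learned parameters.

```python
# Toy sketch of the structure-reasoning branch: (E_s W_q^s)(E_s W_k^s)^T.
# The position branch is computed identically from E_p with its own projections.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Toy schema structural embeddings (2 tokens x 2 dims)
E_s = [[1.0, 0.0],
       [0.0, 1.0]]

# Separate learned projections for structure queries and keys (illustrative values)
W_q_s = [[0.5, 0.0], [0.0, 0.5]]
W_k_s = [[1.0, 0.0], [0.0, 1.0]]

Q_s = matmul(E_s, W_q_s)
K_s = matmul(E_s, W_k_s)
topo_scores = matmul(Q_s, transpose(K_s))
print(topo_scores)  # [[0.5, 0.0], [0.0, 0.5]]
```

Because the structure and position branches use independent projections, gradients for "what exists" and "where it maps" never interfere.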
Architecture Deep Dive
The TaxonomyLLM architecture builds on T5's encoder-decoder foundation with custom modules optimized for schema ingestion and taxonomy generation.
*Figure: Knowledge Graph Architecture*
The Two-Phase Training Methodology
TaxonomyLLM employs a two-phase training approach: a pre-training phase for broad schema assimilation, followed by an instruction tuning phase that teaches the model to satisfy RDF ontology constraints.
Why T5? A Rigorous Model Selection
The TaxonomyLLM team evaluated multiple foundation models before selecting T5:
| Model | Schema Assimilation | Relational Reasoning | RDF Constraints |
|---|---|---|---|
| GPT-3 | Medium | Low | Minimal |
| PaLM | High | Medium | Partial |
| BLOOM | High | Medium | Partial |
| T5 | Excellent | High | Significant |
T5's encoder-decoder architecture provides several advantages:
- Bidirectional encoding - Full context awareness for schema understanding
- Span corruption pre-training - Natural fit for structured input/output
- Text-to-text framing - Schema-to-RDF maps cleanly to this paradigm
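To make the text-to-text point concrete, here is a sketch of how one training pair might be framed. The `"translate schema to taxonomy:"` prefix and the serialized target are assumed conventions for illustration, not the prompts TaxonomyLLM actually uses.

```python
# Illustrative source/target framing for schema-to-RDF as a text-to-text task.
schema_sql = "CREATE TABLE Customer (id INT PRIMARY KEY, name TEXT);"

# Hypothetical task prefix, in the style of T5's "translate English to German:"
source_text = f"translate schema to taxonomy: {schema_sql}"

# Hypothetical linearized RDF target the decoder would be trained to emit
target_text = (
    'ex:Customer a rdfs:Class ; rdfs:label "Customer" . '
    "ex:name rdfs:domain ex:Customer ; rdfs:range xsd:string ."
)

print(source_text)
print(target_text)
```

Framing both sides as plain text means the standard T5 training loop needs no architectural changes to consume schema/taxonomy pairs.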
Implementation Architecture
The core implementation leverages TensorFlow and the Transformers library:
```python
import tensorflow as tf
import transformers
from sqlparse import parse
from rdflib import Graph

# Schematic sketch: method bodies are elided, and the encoder/decoder
# subclasses are illustrative rather than drop-in Transformers classes.

class SchemaEncoder(transformers.T5EncoderModel):
    """
    Custom encoder that vectorizes SQL schema syntax patterns
    into structural hidden representations.
    """
    def forward(self, schema_tokens):
        # Parse SQL CREATE statements, distill tables, columns,
        # and data types, then emit schema encoding vectors.
        ...
        return structural_embeddings, positional_embeddings

class TaxonomyDecoder(transformers.T5DecoderModel):
    """
    Specialized decoder for taxonomic topological generation.
    Relates schema entities into taxonomy components.
    """
    def forward(self, encoder_hidden_states):
        # Apply TopoAttention reasoning and generate RDF triples.
        ...
        return rdf_output

class TaxonomyLLM(transformers.TFT5ForConditionalGeneration):
    def __init__(self, config):
        super().__init__(config)
        self.encoder = SchemaEncoder(config)
        self.decoder = TaxonomyDecoder(config)
```
End-to-End Pipeline
```python
# Complete SQL-to-RDF pipeline
schema = """
CREATE TABLE Customer (
    id INT PRIMARY KEY,
    name TEXT,
    email VARCHAR(255)
);

CREATE TABLE Order (
    id INT PRIMARY KEY,
    customer_id INT REFERENCES Customer(id),
    total DECIMAL(10,2),
    created_at TIMESTAMP
);
"""

# Parse and encode
parsed_schema = parse(schema)
input_vectors = model.encoder(parsed_schema)

# Generate RDF triples
output_triples = model.generate(input_vectors)

# Materialize and validate
graph = Graph().parse(data=output_triples, format="turtle")
print(graph.serialize(format="turtle"))
```
Output:
```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/ontology#> .

ex:Customer a rdfs:Class ;
    rdfs:label "Customer" ;
    rdfs:comment "Entity representing customer information" .

ex:Order a rdfs:Class ;
    rdfs:label "Order" ;
    rdfs:comment "Entity representing purchase orders" .

ex:customerId a rdf:Property ;
    rdfs:domain ex:Customer ;
    rdfs:range xsd:integer .

ex:hasOrder a rdf:Property ;
    rdfs:domain ex:Customer ;
    rdfs:range ex:Order .
```
Validity Scoring and Quality Metrics
TaxonomyLLM's instruction tuning phase teaches the model to differentiate valid from invalid RDF graphs. The model learns critical ontology constraints:
RDF Constraint Categories
```turtle
# Class membership axioms
rdfs:subClassOf      # Connects class hierarchies

# Property scoping rules
rdfs:domain          # Constrains property origins
rdfs:range           # Constrains property targets

# Relationship dependencies
rdfs:subPropertyOf   # Links property hierarchies
```
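A simple instance of what "differentiating valid from invalid graphs" means for property scoping can be sketched in a few lines. The triples and the `unscoped_properties` helper below are illustrative, not TaxonomyLLM's actual validator.

```python
# Minimal validity check: given RDF triples as (subject, predicate, object)
# tuples, flag declared properties missing rdfs:domain or rdfs:range.

triples = [
    ("ex:customerId", "rdf:type", "rdf:Property"),
    ("ex:customerId", "rdfs:domain", "ex:Customer"),
    ("ex:customerId", "rdfs:range", "xsd:integer"),
    ("ex:hasOrder", "rdf:type", "rdf:Property"),
    ("ex:hasOrder", "rdfs:domain", "ex:Customer"),
    # ex:hasOrder is missing an rdfs:range, so it should be flagged
]

def unscoped_properties(triples):
    # Properties are subjects typed as rdf:Property
    props = {s for s, p, o in triples if p == "rdf:type" and o == "rdf:Property"}

    def has(prop, pred):
        return any(s == prop and p == pred for s, p, _ in triples)

    return sorted(prop for prop in props
                  if not (has(prop, "rdfs:domain") and has(prop, "rdfs:range")))

print(unscoped_properties(triples))  # ['ex:hasOrder']
```

During instruction tuning, violations like this serve as negative examples the model learns to avoid emitting.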
Evaluation Metrics
The model achieves strong performance across multiple quality dimensions:
| Metric | Score | Description |
|---|---|---|
| RDF Validity | 0.86 | Confirms taxonomic modeling soundness |
| Mapping Precision | 0.81 | Accuracy against gold annotations |
| Vocabulary Alignment | 0.79 | Consistency with expected terms |
| Topology Comparability | 0.74 | Structural parallels preserved |
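As a rough illustration of how a score like the mapping precision above could be computed against gold annotations, consider this sketch; the table-to-tag mappings are invented for the example, and the real evaluation protocol is not specified here.

```python
# Hypothetical gold annotations and model predictions (table -> taxonomy tag)
gold = {
    "Member": "PersonalInformation",
    "Activity": "ActivityEvent",
    "Order": "TransactionRecord",
    "Invoice": "TransactionRecord",
}
predicted = {
    "Member": "PersonalInformation",
    "Activity": "ActivityEvent",
    "Order": "TransactionRecord",
    "Invoice": "ContactInformation",  # one wrong mapping
}

# Mapping precision: fraction of predicted mappings that match gold
correct = sum(1 for table, tag in predicted.items() if gold.get(table) == tag)
precision = correct / len(predicted)
print(precision)  # 0.75
```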
Practical Example: Schema to Taxonomy
Let us walk through a concrete example of how TaxonomyLLM processes a database schema.
Input Schema
```
Member(id, name, email)
Activity(id, type, time)
```
Step-by-Step Processing
Step 1: Schema Encoding
Each schema token gets embedded into a semantic vector:
```python
x_member   = [0.2, 1.3, 0.8, ...]  # Member table embedding
x_activity = [0.4, 0.9, 1.2, ...]  # Activity table embedding
```
Step 2: Taxonomy Tag Encoding
Target taxonomy concepts are similarly embedded:
```python
t_personal = [0.5, 0.1, 1.1, ...]  # PersonalInformation tag
t_event    = [0.3, 1.4, 0.6, ...]  # ActivityEvent tag
```
Step 3: Topology Attention
Compute compatibility scores between schema and tag vectors:
```python
A_personal = x_member @ t_personal.T   # High score = semantic similarity
A_event    = x_activity @ t_event.T
```
Step 4: Tag Decoding
Convert scores to probabilities via softmax:
```python
p(t | x_member) = softmax([A_personal, A_event, ...])  # softmax over all candidate tags
# High probability on t_personal indicates a Member -> PersonalInformation mapping
```
Step 5: Optimization
Gradient descent refines the mappings during training.
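Steps 1 through 4 can be condensed into one runnable sketch. The 3-dimensional vectors below are illustrative stand-ins chosen so that the example maps Member to PersonalInformation; they are not the document's embeddings.

```python
import math

# Step 1: toy embedding for the Member table (illustrative values)
x_member = [0.9, 0.1, 0.8]

# Step 2: toy embeddings for candidate taxonomy tags
tags = {
    "PersonalInformation": [1.0, 0.0, 0.9],
    "ActivityEvent":       [0.0, 1.0, 0.1],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Step 3: compatibility scores between the table and each candidate tag
scores = {name: dot(x_member, t) for name, t in tags.items()}

# Step 4: softmax over all candidate tag scores
z = sum(math.exp(s) for s in scores.values())
probs = {name: math.exp(s) / z for name, s in scores.items()}

best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

In training (Step 5), the cross-entropy loss on these probabilities is what gradient descent minimizes to refine the embeddings.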
Output Taxonomy
```turtle
ex:Member rdfs:subClassOf ex:PersonalInformation .
ex:Activity rdfs:subClassOf ex:ActivityEvent .

ex:name rdfs:domain ex:Member ;
    rdfs:range xsd:string .

ex:email rdfs:domain ex:Member ;
    rdfs:range xsd:string ;
    rdfs:subPropertyOf ex:ContactInformation .
```
The Knowledge Engineering Perspective
From a knowledge engineering standpoint, TaxonomyLLM represents a significant advancement in bridging the gap between:
- Syntactic schemas - The structural definitions of databases
- Semantic ontologies - The conceptual models of knowledge graphs
Traditional approaches require domain experts to manually:
- Analyze schema structures
- Map entities to ontology concepts
- Define property relationships
- Validate constraint satisfaction
TaxonomyLLM automates this entire pipeline while maintaining the rigor of formal RDF semantics.
Integration with Existing Knowledge Graphs
The generated taxonomies can be directly integrated with:
- Enterprise ontologies (FIBO, Schema.org, Dublin Core)
- Knowledge graph platforms (Neo4j, Amazon Neptune, Stardog)
- Semantic middleware (Apache Jena, RDF4J)
```python
from rdflib import Graph

# Load enterprise ontology
enterprise_onto = Graph()
enterprise_onto.parse("enterprise-ontology.ttl")

# Generate taxonomy from new schema
new_taxonomy = taxonomy_llm.generate(new_schema)

# Merge with alignment (rdflib graphs support set union via +)
merged = enterprise_onto + new_taxonomy
merged.serialize(destination="unified-knowledge-graph.ttl")
```
Technical Requirements
To run TaxonomyLLM, you will need:
```
tensorflow==2.8.0
transformers==4.10.0
sqlparse==0.4.2
rdflib==6.1.1
```
The model can be deployed via Kubernetes and Docker for enterprise-scale processing.
Conclusion
TaxonomyLLM demonstrates that large language models, when augmented with domain-specific attention mechanisms, can effectively automate complex knowledge engineering tasks. The key innovations are:
- Disentangled TopoAttention - Separately models structural and positional alignments
- Two-phase training - Combines broad schema assimilation with precise constraint learning
- End-to-end pipeline - From SQL parsing to valid RDF graph generation
As organizations continue to struggle with data silos and semantic fragmentation, tools like TaxonomyLLM offer a path toward automated, scalable knowledge graph construction.
The project is open source and available on GitHub. We encourage contributions and welcome feedback from the knowledge engineering community.