TaxonomyLLM: Automating Enterprise Knowledge Graph Generation with Topological Attention
A deep dive into TaxonomyLLM, a novel transformer architecture that leverages disentangled TopoAttention to automatically translate SQL schemas into valid RDF taxonomies, bridging the gap between relational data and semantic knowledge graphs.
The Challenge of Enterprise Data Semantics
Organizations today face a critical challenge: disparate data classifications across systems that require reconciliation. As machine learning adoption accelerates, consistent data understanding through unified semantics becomes essential for extracting meaningful value from information. However, manually crafting taxonomy tags is non-trivial, expensive, and fundamentally unscalable.
Consider a typical enterprise scenario: you have dozens of database schemas across different departments, each using its own terminology. The Customer table in Sales might correspond to Client in Support and Account in Finance. Without a unified semantic layer, building cross-functional analytics or training enterprise-wide ML models becomes a Sisyphean task.
TaxonomyLLM addresses this challenge head-on by providing an LLM extension tailored for systematic translation of logical schemas into standardized RDF taxonomy structures.
The Innovation: Disentangled Topological Attention
What makes TaxonomyLLM unique is its TopoAttention mechanism - a specialized self-attention approach designed to explicitly model alignments between input schema structure and output taxonomy topology.
Why Standard Attention Falls Short
In a vanilla transformer, self-attention computes alignment scores between all token pairs:
```
Attention(Q_i, K_j) = Q_i · K_j^T
```
This captures general relationships but lacks the specificity needed for schema-to-taxonomy translation. When mapping a database schema to an RDF graph, we need to explicitly correlate:
- Schema tables with taxonomy classes
- Schema columns with taxonomy properties
- Column relationships with property constraints
The TopoAttention Solution
TopoAttention introduces disentangled attention matrices that focus exclusively on structural and positional reasoning:
```python
# Schema structure attention: correlates schema structure patterns
Hs_attention = Qs @ Ws.T

# Taxonomy position attention: correlates taxonomy positional elements
Hp_attention = Qp @ Wp.T
```
The mathematical formulation:
```
TopoAttention(E_s) = (E_s W_q^s)(E_s W_k^s)^T   # Structure reasoning
TopoAttention(E_p) = (E_p W_q^p)(E_p W_k^p)^T   # Position reasoning
```
Where:
- `E_s` captures schema structural embeddings
- `E_p` captures taxonomy positional embeddings
- Separate learned projections (`W_q`, `W_k`) specialize in structure vs. position
This disentanglement enables the model to independently reason about:
- What schema elements exist (structure)
- Where they should map in the taxonomy (position)
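The disentangled score computation can be illustrated with a minimal sketch. Everything here is a toy: the 2-dimensional embeddings, the projection matrices, and the helper functions are illustrative stand-ins, not TaxonomyLLM's learned parameters.

```python
# Toy sketch of the structure-reasoning branch: (E_s W_q^s)(E_s W_k^s)^T.
# The position branch is computed identically from E_p with its own projections.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Toy schema structural embeddings (2 tokens x 2 dims)
E_s = [[1.0, 0.0],
       [0.0, 1.0]]

# Separate learned projections for structure queries and keys (illustrative values)
W_q_s = [[0.5, 0.0], [0.0, 0.5]]
W_k_s = [[1.0, 0.0], [0.0, 1.0]]

Q_s = matmul(E_s, W_q_s)
K_s = matmul(E_s, W_k_s)
topo_scores = matmul(Q_s, transpose(K_s))
print(topo_scores)  # [[0.5, 0.0], [0.0, 0.5]]
```

Because the structure and position branches use independent projections, gradients for "what exists" and "where it maps" never interfere.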
Architecture Deep Dive
The TaxonomyLLM architecture builds on T5's encoder-decoder foundation with custom modules optimized for schema ingestion and taxonomy generation.
*Figure: Knowledge Graph Architecture*
The Two-Phase Training Methodology
TaxonomyLLM employs a two-phase training approach: a pre-training phase for broad schema assimilation, followed by an instruction tuning phase that teaches the model to satisfy RDF ontology constraints.
Why T5? A Rigorous Model Selection
The TaxonomyLLM team evaluated multiple foundation models before selecting T5:
| Model | Schema Assimilation | Relational Reasoning | RDF Constraints |
|---|---|---|---|
| GPT-3 | Medium | Low | Minimal |
| PaLM | High | Medium | Partial |
| BLOOM | High | Medium | Partial |
| T5 | Excellent | High | Significant |
T5's encoder-decoder architecture provides several advantages:
- Bidirectional encoding - Full context awareness for schema understanding
- Span corruption pre-training - Natural fit for structured input/output
- Text-to-text framing - Schema-to-RDF maps cleanly to this paradigm
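To make the text-to-text point concrete, here is a sketch of how one training pair might be framed. The `"translate schema to taxonomy:"` prefix and the serialized target are assumed conventions for illustration, not the prompts TaxonomyLLM actually uses.

```python
# Illustrative source/target framing for schema-to-RDF as a text-to-text task.
schema_sql = "CREATE TABLE Customer (id INT PRIMARY KEY, name TEXT);"

# Hypothetical task prefix, in the style of T5's "translate English to German:"
source_text = f"translate schema to taxonomy: {schema_sql}"

# Hypothetical linearized RDF target the decoder would be trained to emit
target_text = (
    'ex:Customer a rdfs:Class ; rdfs:label "Customer" . '
    "ex:name rdfs:domain ex:Customer ; rdfs:range xsd:string ."
)

print(source_text)
print(target_text)
```

Framing both sides as plain text means the standard T5 training loop needs no architectural changes to consume schema/taxonomy pairs.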
Implementation Architecture
The core implementation leverages TensorFlow and the Transformers library:
```python
import tensorflow as tf
import transformers
from sqlparse import parse
from rdflib import Graph

# Schematic sketch: method bodies are elided, and the encoder/decoder
# subclasses are illustrative rather than drop-in Transformers classes.

class SchemaEncoder(transformers.T5EncoderModel):
    """
    Custom encoder that vectorizes SQL schema syntax patterns
    into structural hidden representations.
    """
    def forward(self, schema_tokens):
        # Parse SQL CREATE statements, distill tables, columns,
        # and data types, then emit schema encoding vectors.
        ...
        return structural_embeddings, positional_embeddings

class TaxonomyDecoder(transformers.T5DecoderModel):
    """
    Specialized decoder for taxonomic topological generation.
    Relates schema entities into taxonomy components.
    """
    def forward(self, encoder_hidden_states):
        # Apply TopoAttention reasoning and generate RDF triples.
        ...
        return rdf_output

class TaxonomyLLM(transformers.TFT5ForConditionalGeneration):
    def __init__(self, config):
        super().__init__(config)
        self.encoder = SchemaEncoder(config)
        self.decoder = TaxonomyDecoder(config)
```
End-to-End Pipeline
```python
# Complete SQL-to-RDF pipeline
schema = """
CREATE TABLE Customer (
    id INT PRIMARY KEY,
    name TEXT,
    email VARCHAR(255)
);

CREATE TABLE Order (
    id INT PRIMARY KEY,
    customer_id INT REFERENCES Customer(id),
    total DECIMAL(10,2),
    created_at TIMESTAMP
);
"""

# Parse and encode
parsed_schema = parse(schema)
input_vectors = model.encoder(parsed_schema)

# Generate RDF triples
output_triples = model.generate(input_vectors)

# Materialize and validate
graph = Graph().parse(data=output_triples, format="turtle")
print(graph.serialize(format="turtle"))
```
Output:
```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/ontology#> .

ex:Customer a rdfs:Class ;
    rdfs:label "Customer" ;
    rdfs:comment "Entity representing customer information" .

ex:Order a rdfs:Class ;
    rdfs:label "Order" ;
    rdfs:comment "Entity representing purchase orders" .

ex:customerId a rdf:Property ;
    rdfs:domain ex:Customer ;
    rdfs:range xsd:integer .

ex:hasOrder a rdf:Property ;
    rdfs:domain ex:Customer ;
    rdfs:range ex:Order .
```
Validity Scoring and Quality Metrics
TaxonomyLLM's instruction tuning phase teaches the model to differentiate valid from invalid RDF graphs. The model learns critical ontology constraints:
RDF Constraint Categories
```turtle
# Class membership axioms
rdfs:subClassOf      # Connects class hierarchies

# Property scoping rules
rdfs:domain          # Constrains property origins
rdfs:range           # Constrains property targets

# Relationship dependencies
rdfs:subPropertyOf   # Links property hierarchies
```
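A simple instance of what "differentiating valid from invalid graphs" means for property scoping can be sketched in a few lines. The triples and the `unscoped_properties` helper below are illustrative, not TaxonomyLLM's actual validator.

```python
# Minimal validity check: given RDF triples as (subject, predicate, object)
# tuples, flag declared properties missing rdfs:domain or rdfs:range.

triples = [
    ("ex:customerId", "rdf:type", "rdf:Property"),
    ("ex:customerId", "rdfs:domain", "ex:Customer"),
    ("ex:customerId", "rdfs:range", "xsd:integer"),
    ("ex:hasOrder", "rdf:type", "rdf:Property"),
    ("ex:hasOrder", "rdfs:domain", "ex:Customer"),
    # ex:hasOrder is missing an rdfs:range, so it should be flagged
]

def unscoped_properties(triples):
    # Properties are subjects typed as rdf:Property
    props = {s for s, p, o in triples if p == "rdf:type" and o == "rdf:Property"}

    def has(prop, pred):
        return any(s == prop and p == pred for s, p, _ in triples)

    return sorted(prop for prop in props
                  if not (has(prop, "rdfs:domain") and has(prop, "rdfs:range")))

print(unscoped_properties(triples))  # ['ex:hasOrder']
```

During instruction tuning, violations like this serve as negative examples the model learns to avoid emitting.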
Evaluation Metrics
The model achieves strong performance across multiple quality dimensions:
| Metric | Score | Description |
|---|---|---|
| RDF Validity | 0.86 | Confirms taxonomic modeling soundness |
| Mapping Precision | 0.81 | Accuracy against gold annotations |
| Vocabulary Alignment | 0.79 | Consistency with expected terms |
| Topology Comparability | 0.74 | Structural parallels preserved |
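As a rough illustration of how a score like the mapping precision above could be computed against gold annotations, consider this sketch; the table-to-tag mappings are invented for the example, and the real evaluation protocol is not specified here.

```python
# Hypothetical gold annotations and model predictions (table -> taxonomy tag)
gold = {
    "Member": "PersonalInformation",
    "Activity": "ActivityEvent",
    "Order": "TransactionRecord",
    "Invoice": "TransactionRecord",
}
predicted = {
    "Member": "PersonalInformation",
    "Activity": "ActivityEvent",
    "Order": "TransactionRecord",
    "Invoice": "ContactInformation",  # one wrong mapping
}

# Mapping precision: fraction of predicted mappings that match gold
correct = sum(1 for table, tag in predicted.items() if gold.get(table) == tag)
precision = correct / len(predicted)
print(precision)  # 0.75
```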
Practical Example: Schema to Taxonomy
Let us walk through a concrete example of how TaxonomyLLM processes a database schema.
Input Schema
```
Member(id, name, email)
Activity(id, type, time)
```
Step-by-Step Processing
Step 1: Schema Encoding
Each schema token gets embedded into a semantic vector:
```python
x_member   = [0.2, 1.3, 0.8, ...]  # Member table embedding
x_activity = [0.4, 0.9, 1.2, ...]  # Activity table embedding
```
Step 2: Taxonomy Tag Encoding
Target taxonomy concepts are similarly embedded:
```python
t_personal = [0.5, 0.1, 1.1, ...]  # PersonalInformation tag
t_event    = [0.3, 1.4, 0.6, ...]  # ActivityEvent tag
```
Step 3: Topology Attention
Compute compatibility scores between schema and tag vectors:
```python
A_personal = x_member @ t_personal.T   # High score = semantic similarity
A_event    = x_activity @ t_event.T
```
Step 4: Tag Decoding
Convert scores to probabilities via softmax:
```python
p(t | x_member) = softmax([A_personal, A_event, ...])  # softmax over all candidate tags
# High probability on t_personal indicates a Member -> PersonalInformation mapping
```
Step 5: Optimization
Gradient descent refines the mappings during training.
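Steps 1 through 4 can be condensed into one runnable sketch. The 3-dimensional vectors below are illustrative stand-ins chosen so that the example maps Member to PersonalInformation; they are not the document's embeddings.

```python
import math

# Step 1: toy embedding for the Member table (illustrative values)
x_member = [0.9, 0.1, 0.8]

# Step 2: toy embeddings for candidate taxonomy tags
tags = {
    "PersonalInformation": [1.0, 0.0, 0.9],
    "ActivityEvent":       [0.0, 1.0, 0.1],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Step 3: compatibility scores between the table and each candidate tag
scores = {name: dot(x_member, t) for name, t in tags.items()}

# Step 4: softmax over all candidate tag scores
z = sum(math.exp(s) for s in scores.values())
probs = {name: math.exp(s) / z for name, s in scores.items()}

best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

In training (Step 5), the cross-entropy loss on these probabilities is what gradient descent minimizes to refine the embeddings.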
Output Taxonomy
```turtle
ex:Member rdfs:subClassOf ex:PersonalInformation .
ex:Activity rdfs:subClassOf ex:ActivityEvent .

ex:name rdfs:domain ex:Member ;
    rdfs:range xsd:string .

ex:email rdfs:domain ex:Member ;
    rdfs:range xsd:string ;
    rdfs:subPropertyOf ex:ContactInformation .
```
The Knowledge Engineering Perspective
From a knowledge engineering standpoint, TaxonomyLLM represents a significant advancement in bridging the gap between:
- Syntactic schemas - The structural definitions of databases
- Semantic ontologies - The conceptual models of knowledge graphs
Traditional approaches require domain experts to manually:
- Analyze schema structures
- Map entities to ontology concepts
- Define property relationships
- Validate constraint satisfaction
TaxonomyLLM automates this entire pipeline while maintaining the rigor of formal RDF semantics.
Integration with Existing Knowledge Graphs
The generated taxonomies can be directly integrated with:
- Enterprise ontologies (FIBO, Schema.org, Dublin Core)
- Knowledge graph platforms (Neo4j, Amazon Neptune, Stardog)
- Semantic middleware (Apache Jena, RDF4J)
```python
from rdflib import Graph

# Load enterprise ontology
enterprise_onto = Graph()
enterprise_onto.parse("enterprise-ontology.ttl")

# Generate taxonomy from new schema
new_taxonomy = taxonomy_llm.generate(new_schema)

# Merge with alignment (rdflib graphs support set union via +)
merged = enterprise_onto + new_taxonomy
merged.serialize(destination="unified-knowledge-graph.ttl")
```
Technical Requirements
To run TaxonomyLLM, you will need:
```
tensorflow==2.8.0
transformers==4.10.0
sqlparse==0.4.2
rdflib==6.1.1
```
The model can be deployed via Kubernetes and Docker for enterprise-scale processing.
Conclusion
TaxonomyLLM demonstrates that large language models, when augmented with domain-specific attention mechanisms, can effectively automate complex knowledge engineering tasks. The key innovations are:
- Disentangled TopoAttention - Separately models structural and positional alignments
- Two-phase training - Combines broad schema assimilation with precise constraint learning
- End-to-end pipeline - From SQL parsing to valid RDF graph generation
As organizations continue to struggle with data silos and semantic fragmentation, tools like TaxonomyLLM offer a path toward automated, scalable knowledge graph construction.
The project is open source and available on GitHub. We encourage contributions and welcome feedback from the knowledge engineering community.