Overview¶
MetaboT is a modular multi-agent system for translating natural-language metabolomics questions into executable SPARQL queries over a knowledge graph. It is designed for researchers who want the power of semantic querying without having to write SPARQL manually for every task.
The current repository defaults to the ENPKG endpoint, but the design is intended to be portable to other RDF knowledge graphs with schema-aware prompt updates.
Core Idea¶
Instead of relying on a single prompt to infer everything, MetaboT decomposes the job into smaller steps:
- validate whether the question fits the graph
- resolve real-world entities to graph-compatible identifiers
- generate SPARQL with explicit schema context
- refine failed queries when needed
- summarize results for the user
This separation is especially useful in metabolomics, where taxa, chemical classes, structures, and biological targets all need precise identifiers.
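The staged decomposition above can be sketched as a sequence of small functions passing a shared state object along. This is an illustrative sketch only, not the actual MetaboT API; all names, the placeholder scope check, and the hard-coded Wikidata identifier are assumptions for demonstration.

```python
from dataclasses import dataclass, field

@dataclass
class QueryState:
    """Carries intermediate artifacts between pipeline stages."""
    question: str
    entities: dict = field(default_factory=dict)
    sparql: str = ""
    results: list = field(default_factory=list)
    answer: str = ""

def validate(state):
    # Scope check: reject questions that clearly do not fit the graph.
    if "weather" in state.question.lower():
        raise ValueError("question out of scope for the knowledge graph")
    return state

def resolve_entities(state):
    # Map real-world names to graph-compatible identifiers.
    known = {"Arabidopsis thaliana": "wd:Q158695"}  # stand-in for live resolution
    for name, ident in known.items():
        if name in state.question:
            state.entities[name] = ident
    return state

def generate_sparql(state):
    # In MetaboT an LLM writes this step using explicit schema context.
    state.sparql = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"
    return state

def summarize(state):
    state.answer = f"Query returned {len(state.results)} result(s)."
    return state

def run_pipeline(question):
    state = QueryState(question)
    for step in (validate, resolve_entities, generate_sparql, summarize):
        state = step(state)
    return state
```

Keeping each stage as an isolated function mirrors the agent separation: a failure in entity resolution can be reported without ever touching query generation.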
Main Agents¶
Entry Agent¶
The gateway to the system. It distinguishes between new knowledge requests and follow-up interpretation questions, and it can inspect user-provided files before routing.
Validator Agent¶
Checks whether a question is in scope for the knowledge graph. It uses prompt-level schema context and plant validation logic to reject clearly invalid questions early.
Supervisor Agent¶
Coordinates the workflow. It decides whether entity resolution is needed and routes the request between the specialized agents.
KG Agent¶
The manuscript refers to this component as the KG Agent; in the current codebase the role is implemented by ENPKG_agent. It resolves entities to authoritative identifiers before SPARQL generation.
Examples include:
- plant and taxon names via Wikidata
- chemical classes via NPClassifier-derived resources
- biological targets via ChEMBL
- SMILES strings via GNPS-linked resolution tools
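The per-kind resolution above can be modeled as a dispatch table mapping entity kinds to resolver functions. This is a hedged sketch: the function names, lookup tables, and the specific identifiers (`wd:Q158695`, `CHEMBL220`, `npc:Flavonoids`) are illustrative stand-ins for the live Wikidata, NPClassifier, and ChEMBL lookups that MetaboT's tools actually perform.

```python
def resolve_taxon(name):
    # Stand-in for a Wikidata taxon lookup.
    lookup = {"Arabidopsis thaliana": "wd:Q158695"}
    return lookup.get(name)

def resolve_chemical_class(name):
    # Stand-in for an NPClassifier-derived resource lookup.
    lookup = {"flavonoids": "npc:Flavonoids"}
    return lookup.get(name)

def resolve_target(name):
    # Stand-in for a ChEMBL target lookup.
    lookup = {"acetylcholinesterase": "CHEMBL220"}
    return lookup.get(name)

RESOLVERS = {
    "taxon": resolve_taxon,
    "chemical_class": resolve_chemical_class,
    "target": resolve_target,
}

def resolve(kind, name):
    resolver = RESOLVERS.get(kind)
    if resolver is None:
        raise KeyError(f"no resolver registered for entity kind {kind!r}")
    return resolver(name)
```

A dispatch table like this keeps each external service behind its own function, so a new identifier source can be added without touching the others.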
SPARQL Query Runner Agent¶
Prepares the full context needed for query generation and hands the work to GraphSparqlQAChain. This includes the question, resolved identifiers, and schema fragments relevant to the target graph.
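The context assembly described above might look like the following. The key names and prefix set are assumptions for illustration; the real chain consuming this context is LangChain's GraphSparqlQAChain.

```python
def build_query_context(question, entities, schema_fragments):
    """Assemble the context handed to the query-generation chain.
    Keys and prefixes here are illustrative, not MetaboT's actual layout."""
    prefix_map = {
        "wd": "http://www.wikidata.org/entity/",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    }
    prefixes = "\n".join(f"PREFIX {p}: <{uri}>" for p, uri in prefix_map.items())
    return {
        "question": question,
        "resolved_entities": entities,
        "schema": "\n".join(schema_fragments),
        "prefixes": prefixes,
    }
```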
Interpreter Agent¶
Turns raw outputs into user-facing summaries. When a visualization is explicitly requested, it generates a .json file containing the specification for a Plotly graph, which is rendered interactively in the Streamlit app or saved as a plain .json file when running locally or from the backend.
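Writing a Plotly figure as plain JSON can be done with the standard library alone, since a Plotly figure is just a `data`/`layout` dictionary. The helper name and the sample bar-chart data below are illustrative, not MetaboT's actual code.

```python
import json

def save_plotly_figure(figure_dict, path):
    """Write a Plotly figure specification as plain JSON.
    A Streamlit front end can load this file and render it interactively;
    a local/backend run simply keeps the .json on disk."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(figure_dict, fh, indent=2)
    return path

# A minimal bar-chart spec in Plotly's JSON structure (illustrative data).
figure = {
    "data": [{"type": "bar", "x": ["quercetin", "rutin"], "y": [12, 7]}],
    "layout": {"title": {"text": "Feature counts per compound"}},
}
```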
Workflow Diagram¶
```mermaid
graph TD
    A[User question] --> B[Entry Agent]
    B --> C[Validator Agent]
    C --> D[Supervisor Agent]
    D --> E[ENPKG_agent / KG Agent]
    D --> F[SPARQL Query Runner Agent]
    F --> G[GraphSparqlQAChain]
    G --> H[Knowledge graph endpoint]
    D --> I[Interpreter Agent]
    E --> D
    F --> D
    I --> D
    D --> J[Final answer]
```
Query Lifecycle¶
```mermaid
sequenceDiagram
    participant User
    participant Entry
    participant Validator
    participant Supervisor
    participant KG as ENPKG_agent
    participant SPARQL as Sparql_query_runner
    participant Graph
    participant Interpreter
    User->>Entry: Ask question
    Entry->>Validator: Validate scope
    Validator->>Supervisor: Approved question
    Supervisor->>KG: Resolve entities if needed
    KG-->>Supervisor: Return identifiers
    Supervisor->>SPARQL: Build query context
    SPARQL->>Graph: Execute SPARQL
    Graph-->>SPARQL: Return results
    SPARQL-->>Supervisor: Structured output
    Supervisor->>Interpreter: Summarize or visualize if needed
    Interpreter-->>Supervisor: User-facing explanation
    Supervisor-->>User: Final response
```
Key Tools¶
MetaboT's agents are backed by specialized tools, including:
- PlantDatabaseChecker
- ChemicalResolver
- SMILESResolver
- TargetResolver
- TaxonResolver
- GraphSparqlQAChain
- WikidataStructureSearch
- OutputMerger
- Interpreter
- SpectrumPlotter
Together, these tools help MetaboT ground identifiers before query generation and keep the system aligned with the target graph.
Query Generation Strategy¶
GraphSparqlQAChain follows a staged approach:
- Generate an initial SPARQL query using the user question, resolved entities, and schema context.
- Execute the query and inspect whether the result is useful.
- If the query fails or returns no results, try one refinement pass using schema hints and similar stored examples.
This refinement step is important because it helps distinguish between:
- a badly constructed query
- a legitimate absence of data in the knowledge graph
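The generate-execute-refine loop above can be sketched as a small control function with at most one refinement pass. The callables and their signatures are assumptions for illustration; the real logic lives inside GraphSparqlQAChain.

```python
def run_with_refinement(generate, execute, refine, question, context):
    """One initial attempt plus at most one refinement pass.
    An empty result that survives refinement is treated as a genuine
    absence of data rather than a badly constructed query."""
    query = generate(question, context)
    try:
        rows = execute(query)
    except Exception as err:
        # Invalid query: rewrite once using the error and schema hints.
        query = refine(query, error=str(err), context=context)
        return query, execute(query)
    if not rows:
        # Valid query, no hits: try one schema-aware rewrite.
        refined = refine(query, error=None, context=context)
        refined_rows = execute(refined)
        if refined_rows:
            return refined, refined_rows
    return query, rows
```

Capping the loop at a single refinement keeps cost bounded while still separating "bad query" from "no data" in most cases.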
Supported Outputs¶
Depending on the question, MetaboT can return:
- a textual answer
- the generated SPARQL query
- a path to a CSV result file
- a visualization request handled by the interpreter
- spectrum URLs when a USI is involved
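One way to picture these output kinds together is a single response container where only the fields relevant to the question are populated. This dataclass is illustrative only and is not MetaboT's actual return type.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetaboTResponse:
    """Illustrative container for the output kinds listed above."""
    text: str                                 # always present
    sparql: Optional[str] = None              # generated query, if requested
    csv_path: Optional[str] = None            # path to a CSV result file
    figure_json_path: Optional[str] = None    # Plotly JSON, if a plot was asked for
    spectrum_urls: tuple = ()                 # populated when a USI is involved
```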
Validation Results¶
In the current manuscript, the strongest evaluated configuration reached:
- 83.67% overall accuracy
- 78.95% accuracy on high-complexity questions
This was reported on a 49-question scored subset of a 50-question benchmark released with the project in app/data/evaluation_dataset.csv.
Scope and Limitations¶
MetaboT is strongest when:
- the question maps cleanly to the schema of the target graph
- entities can be resolved to authoritative identifiers
- the endpoint exposes rich metabolomics annotations
Current limitations include:
- dependence on a capable LLM for best performance
- occasional SPARQL generation errors on difficult questions
- single-graph querying rather than full federated SPARQL across many external resources
- evaluation that focuses mainly on query generation, not every downstream interpretation behavior
Where to Go Next¶
- Use the Quick Start for first runs
- Tune providers and endpoints in the Configuration Guide
- Explore practical prompts in Examples