DTDL (Digital Twins Definition Language) provides no mechanism for logical reasoning or constraint checking over digital twin models. We integrate DTDL with ASP Chef, a web-based Answer Set Programming (ASP) platform, via a structured DTDL-to-ASP mapping and three dedicated operations: @DTDL/Parse for fact generation, @DTDL/Analysis for structural metrics, and @DTDL/Debug for symbolic validation with LLM-guided repair. The key design decision is that error detection is symbolic and deterministic within the implemented set of constraint classes; a language model is invoked only after the ASP layer has produced a concrete, grounded diagnostic, keeping the correctness boundary with the symbolic layer. Soundness and completeness guarantees are scoped to these constraint classes; a formal proof is left as future work. We illustrate the framework on two agricultural use cases and report a proof-of-concept assessment on 99 diagnostics spanning 21 error classes across four domains. Three binary metrics are used: json_valid and entity_recall are computed mechanically; fix quality (judge_correct) is assessed by an independent LLM judge (Claude Sonnet 4.6). The complete grounded workflow achieves 90% judge_correct and 86% json_valid; a fair ablation baseline—same LLM and system message, but error type and entity name in natural language without structured diagnostics—achieves 77% and 75%, respectively. The gap is consistent across three independent judges and statistically significant (McNemar (Formula presented.)), but the inter-judge reliability of judge_correct is limited ((Formula presented.) ranging from 0.00 to 0.44), so results should be read as directional evidence rather than precise effect estimates. Excluding the dominant isolated_interface class ((Formula presented.), ceiling score), the conservative estimate is 87% vs. 79% on the remaining 71 diagnostics. These results constitute a preliminary proof-of-concept limited to a small number of models, a few application domains, and a single LLM configuration; results do not generalize beyond this specific setting. The judge_correct metric is assessed by LLM-as-judge and does not carry a perfect inter-annotator agreement.
Symbolic Analysis and LLM-Guided Debugging of Digital Twin Models with ASP Chef and DTDL
Alviano, Mario;Guarasci, Paola
2026-01-01
Abstract
DTDL (Digital Twins Definition Language) provides no mechanism for logical reasoning or constraint checking over digital twin models. We integrate DTDL with ASP Chef, a web-based Answer Set Programming (ASP) platform, via a structured DTDL-to-ASP mapping and three dedicated operations: @DTDL/Parse for fact generation, @DTDL/Analysis for structural metrics, and @DTDL/Debug for symbolic validation with LLM-guided repair. The key design decision is that error detection is symbolic and deterministic within the implemented set of constraint classes; a language model is invoked only after the ASP layer has produced a concrete, grounded diagnostic, keeping the correctness boundary with the symbolic layer. Soundness and completeness guarantees are scoped to these constraint classes; a formal proof is left as future work. We illustrate the framework on two agricultural use cases and report a proof-of-concept assessment on 99 diagnostics spanning 21 error classes across four domains. Three binary metrics are used: json_valid and entity_recall are computed mechanically; fix quality (judge_correct) is assessed by an independent LLM judge (Claude Sonnet 4.6). The complete grounded workflow achieves 90% judge_correct and 86% json_valid; a fair ablation baseline—same LLM and system message, but error type and entity name in natural language without structured diagnostics—achieves 77% and 75%, respectively. The gap is consistent across three independent judges and statistically significant (McNemar (Formula presented.)), but the inter-judge reliability of judge_correct is limited ((Formula presented.) ranging from 0.00 to 0.44), so results should be read as directional evidence rather than precise effect estimates. Excluding the dominant isolated_interface class ((Formula presented.), ceiling score), the conservative estimate is 87% vs. 79% on the remaining 71 diagnostics. These results constitute a preliminary proof-of-concept limited to a small number of models, a few application domains, and a single LLM configuration; results do not generalize beyond this specific setting. The judge_correct metric is assessed by LLM-as-judge and does not carry a perfect inter-annotator agreement.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


