Abu Dhabi’s Khalifa University, in partnership with global telecom association GSMA and US operator AT&T, has released TelcoAgent-Bench, a specialized benchmark designed to test if AI agents can reliably handle real-world telecom network troubleshooting. The framework reveals a significant gap between an AI’s ability to understand a problem and its capacity to execute the correct sequence of actions to solve it, raising important questions about deploying current models in live network environments.
Quick Facts
- New benchmark tests AI on 15 troubleshooting intents.
- Current AI models struggle with correct diagnostic sequences.
- Framework evaluates performance in both English and Arabic.
From Sounding Smart to Acting Smart
As telecom operators push towards autonomous network management, the reliability of AI agents is becoming a safety-critical issue. The new TelcoAgent-Bench framework was built to address a key distinction: the difference between an AI that sounds like a telecom engineer and one that can actually perform like one.
Findings from the benchmark suggest the industry should be cautious about deploying current AI models in operational settings without significant guardrails. While existing general-purpose AI benchmarks like AgentBench and GAIA test broad task completion, they were not designed for the specific operational constraints of telecom networks, such as resolution paths and structured troubleshooting flows.
Inside TelcoAgent-Bench: A Four-Point Stress Test
TelcoAgent-Bench is one of the first domain-specific frameworks built to rigorously evaluate AI agents in telecom network operations. It assesses AI across four core capabilities under realistic constraints.
The benchmark evaluates an AI agent’s ability to:
- Correctly identify the troubleshooting intent.
- Select the right diagnostic tools for the job.
- Execute those tools in the correct sequence.
- Generate an accurate final resolution summary.
The framework covers 15 telecom troubleshooting intents and 49 scenario blueprints, which expand into approximately 1,470 dialogues, roughly 30 variants per blueprint, to test whether an AI stays consistent when the same problem is phrased in different ways.
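To make that concrete, here is a minimal sketch of how a single dialogue might be scored against the four capabilities above. Everything in it, from the blueprint fields to the pass/fail rubric and the invented names, is an illustrative assumption rather than the published TelcoAgent-Bench harness:

```python
from dataclasses import dataclass

@dataclass
class ScenarioBlueprint:
    """Hypothetical blueprint: one troubleshooting intent plus its ground-truth plan."""
    intent: str                # e.g. "intermittent_packet_loss" (invented name)
    expected_tools: set[str]   # diagnostic tools a correct agent should select
    expected_order: list[str]  # the sequence those tool calls must follow

def score_dialogue(bp: ScenarioBlueprint, predicted_intent: str,
                   tool_calls: list[str], summary: str) -> dict[str, bool]:
    """Grade one agent run on the four axes the benchmark evaluates."""
    return {
        "intent_identified": predicted_intent == bp.intent,            # capability 1
        "right_tools_selected": set(tool_calls) == bp.expected_tools,  # capability 2
        "correct_sequence": tool_calls == bp.expected_order,           # capability 3
        # capability 4: a real harness would also judge the summary's accuracy,
        # not merely that a non-empty one was produced
        "summary_produced": bool(summary.strip()),
    }
```

Separating "right tools" from "correct sequence" is the detail that matters: as the findings below show, current models pass the first check far more often than the second.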
The Capability Gap: Where Current AI Models Fall Short
The headline finding from the research is a clear capability gap. Today’s AI models are reasonably good at understanding the initial problem and writing a plausible summary of the resolution. However, they consistently struggle with the most critical operational step: following the correct troubleshooting sequence.
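The distinction is easy to see in miniature: an agent can select exactly the right diagnostic tools and still fail by running them out of order. A toy illustration (the tool names are invented, and the benchmark's actual metrics may be more granular than a binary match):

```python
expected = ["check_line_status", "query_cell_load", "restart_cpe", "verify_throughput"]
actual   = ["check_line_status", "restart_cpe", "query_cell_load", "verify_throughput"]

print(set(actual) == set(expected))  # True:  the right tools were selected
print(actual == expected)            # False: they ran in the wrong sequence
```

In a live network, that second failure is the dangerous one: restarting customer equipment before checking cell load, for instance, can mask the original fault or cause avoidable disruption.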
This weakness was particularly evident once language was varied. The benchmark runs its tests in both English and Arabic to address the practical needs of regional telecom networks, and the results revealed performance gaps between the two languages, with bilingual scenarios proving especially challenging for current models.
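One simple way to surface such gaps is to run the same scenario in each language and compare the agent's behaviour directly. A self-contained sketch, where the language codes and tool names are placeholders rather than details from the benchmark:

```python
# Hypothetical cross-lingual consistency check: the same scenario, rendered
# in English and in Arabic, should drive the agent to the same tool sequence.
def consistent_across_languages(runs: dict[str, list[str]]) -> bool:
    """runs maps a language code to the tool sequence the agent produced."""
    sequences = list(runs.values())
    return all(seq == sequences[0] for seq in sequences[1:])

runs = {
    "en": ["check_line_status", "query_cell_load", "restart_cpe"],
    "ar": ["check_line_status", "restart_cpe", "query_cell_load"],  # order drifts
}
print(consistent_across_languages(runs))  # False: behaviour diverges by language
```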
A Broader Push for Open Telco AI
TelcoAgent-Bench is the latest collaboration between GSMA and Khalifa University’s Digital Future Institute. In March, both organizations were central to the launch of the Open Telco AI initiative at MWC Barcelona, a global program involving AT&T, AMD, and others to build open AI foundations for the telecom industry.
As part of that initiative, Khalifa University leads the Network Management and Configuration Group, cementing its role in shaping the future of AI within global telecommunications.
About TelcoAgent-Bench
TelcoAgent-Bench is a specialized benchmark developed by the GSMA, AT&T, and Khalifa University's Digital Future Institute. It is designed to evaluate how reliably AI agents can perform complex telecom network troubleshooting tasks by testing their ability to identify problems, select the right tools, follow correct sequences, and provide accurate summaries in both English and Arabic.
Source: Middle East AI News