Abu Dhabi’s Khalifa University has announced the launch of GSMA Open-Telco LLM Benchmarks 2.0, a significant new framework for evaluating the performance of large language models (LLMs) on real-world telecommunications tasks. Developed in collaboration with the GSMA Foundry community and hosted on the Hugging Face platform, the initiative aims to create an industry-wide standard for assessing AI capabilities in mission-critical network operations.
Addressing a Critical Industry Gap
The telecom sector is investing billions in AI, yet a wide gap remains between the capabilities of general-purpose AI models and the deep domain expertise required for complex network management. The new benchmarks address this shortfall by providing a systematic, evidence-based framework for telcos to evaluate AI models. This allows companies to move beyond vendor claims and make informed decisions about deploying AI for network automation, where errors can cause service disruption and substantial financial loss.
A Comprehensive Evaluation Framework
The benchmark rigorously assesses AI models across five complementary dimensions, covering 34 use cases submitted by global telecom operators. These dimensions test for a wide range of industry-specific skills:
- TeleYAML: Evaluates intent-based configuration generation, translating operator goals into standards-aligned YAML configurations for 5G core functions and network slicing (see the sketch after this list).
- TeleLogs: Assesses network troubleshooting skills, using synthetic data derived from real network traces to measure root-cause analysis capabilities.
- TeleMATH: Measures mathematical reasoning through 500 expert-curated, telecom-specific engineering problems.
- 3GPP-TSG: Tests the model’s comprehension of complex technical standards, drawing on documents produced by the 3GPP’s Technical Specification Groups.
- TeleQnA: Provides 10,000 multiple-choice questions to gauge knowledge of telecom terminology, research, and technical details.
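To make the intent-to-configuration task concrete, here is a minimal Python sketch of the kind of translation TeleYAML evaluates. The intent schema, field names, and helper function are illustrative assumptions, not the benchmark's actual format; only the Slice/Service Type values (eMBB=1, URLLC=2, mMTC=3) come from the 3GPP standard (TS 23.501).

```python
# Minimal sketch of an intent-to-configuration step, in the spirit of the
# TeleYAML task. The intent schema and YAML field names are illustrative
# assumptions, not the benchmark's actual format.
import yaml  # pip install pyyaml

# 3GPP-defined Slice/Service Types (TS 23.501): eMBB=1, URLLC=2, mMTC=3.
SST = {"embb": 1, "urllc": 2, "mmtc": 3}

def intent_to_slice_config(intent: dict) -> str:
    """Translate a structured operator intent into a YAML slice config."""
    service = intent["service_type"].lower()
    config = {
        "network_slice": {
            "name": intent["name"],
            "snssai": {"sst": SST[service]},  # Slice/Service Type
            "qos": {
                "latency_ms": intent["max_latency_ms"],
                "throughput_mbps": intent["min_throughput_mbps"],
            },
        }
    }
    return yaml.safe_dump(config, sort_keys=False)

# An intent an operator might express in natural language, shown here
# already parsed into structured form for brevity.
print(intent_to_slice_config({
    "name": "factory-automation",
    "service_type": "URLLC",
    "max_latency_ms": 5,
    "min_throughput_mbps": 100,
}))
```

In the benchmark itself, models must produce such configurations directly from free-text operator intents, a considerably harder task than this structured translation.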
Global Collaboration and UAE Leadership
This global initiative involves 15 leading mobile network operators, including AT&T, Deutsche Telekom, Orange, Vodafone, and the UAE’s du. Khalifa University’s 6G Research Centre plays a pivotal role, co-leading the Network Management & Configuration track alongside major technology partners. This track focuses on developing datasets like TeleYAML to automate the translation of operator intents into valid network configurations, a key challenge for the industry.
Initial Benchmark Results
Initial results show that frontier models such as GPT-5, Grok-4-fast, and Claude Sonnet 4.5 achieve the highest overall performance. GPT-5 led with an overall score of 65.55%, excelling in network troubleshooting and domain-specific Q&A. Domain-specific models nonetheless proved competitive on specialised tasks: AT&T’s customised Gemma model, for instance, outperformed all other systems in network troubleshooting scenarios. The results also showed that intent-to-configuration remains the hardest category, with even the top models scoring below 28%, underscoring the need for further innovation in network automation.
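For context on how headline figures such as GPT-5's 65.55% arise, an overall leaderboard score is typically an aggregate of per-track results. The sketch below assumes an unweighted mean over the five dimensions; the figures are invented placeholders, and the actual weighting may differ from the benchmark's methodology.

```python
# Illustrative aggregation of per-track scores into an overall score.
# Track names mirror the five dimensions above; the figures are invented
# placeholders, and the unweighted mean is an assumption on our part.
from statistics import mean

tracks = {
    "TeleYAML": 27.0,   # intent-to-configuration (hardest track)
    "TeleLogs": 71.0,
    "TeleMATH": 64.0,
    "3GPP-TSG": 68.0,
    "TeleQnA": 78.0,
}

overall = mean(tracks.values())
print(f"Overall: {overall:.2f}%")  # -> Overall: 61.60%
```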
About the GSMA Open-Telco LLM Benchmarks
The GSMA Open-Telco LLM Benchmarks is an initiative by the GSMA Foundry to establish a systematic, industry-wide framework for evaluating the performance of AI models on telecommunications-specific tasks. Co-led by Khalifa University, the project brings together mobile network operators, research institutions, and technology companies to develop robust standards for AI deployment in critical network operations, including configuration, troubleshooting, and automation. The benchmarks are publicly available on the Hugging Face AI community platform to foster transparency and collaboration.
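As a minimal sketch of how a practitioner might pull one of the public datasets, assuming the Hugging Face `datasets` library is installed; the repository ID shown is an assumption and should be verified on the benchmark's Hugging Face page:

```python
# Sketch: loading a public telecom benchmark dataset from Hugging Face.
# The repository ID below is an assumption; check the GSMA Open-Telco LLM
# Benchmarks page on Hugging Face for the exact dataset identifiers.
from datasets import load_dataset  # pip install datasets

ds = load_dataset("netop/TeleQnA")
print(ds)  # shows the available splits and row counts
```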
Source: Middle East AI News