Saudi AI Firm Navid Launches Human-Rated Arabic Text-to-Speech Leaderboard


Arabic is spoken by over 400 million people across 20 countries, yet synthetic Arabic voices have traditionally been evaluated with rigid laboratory algorithms rather than by human ears. Riyadh-based AI firm Navid, the generative AI division of Watad, aims to change that with the launch of the Arabic TTS Arena. Hosted on Hugging Face, the open platform lets native speakers blindly evaluate and rank AI voice models, building a text-to-speech leaderboard driven entirely by human preference.

Quick Facts

  • Hosts 15 commercial and open-source Arabic models.

  • Ranks models using the Bradley-Terry statistical framework.

  • Relies strictly on anonymous human A/B testing.

Shifting AI Voice Quality Metrics from Lab to Listener

Historically, text-to-speech evaluation has been driven by technical measures such as loss functions. Navid’s new leaderboard addresses the gap between mathematical accuracy and what human listeners actually prefer.

The platform presents Arabic speakers with two anonymous AI voice models reading identical text. Users vote on which output sounds more natural, effectively building a rankings table based on raw human preference.

To ensure statistical reliability, the Arabic TTS Arena uses the Bradley-Terry rating model, the statistical framework that powers the prominent LMArena leaderboard and is closely related to the Elo system used to rank chess players. Ratings are centered at 1,000, with confidence intervals computed across 200 rounds of bootstrap resampling. Model identities remain hidden until a vote is cast, so brand recognition cannot influence the final score.
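The rating step can be illustrated with a minimal sketch. It is not Navid's code: the vote data, the function names, the smoothing prior, and the 400-point Elo-style scale are all assumptions made for illustration. Only the Bradley-Terry model, the centering at 1,000, and the 200 bootstrap rounds come from the description above.

```python
import math
import random
from collections import defaultdict


def bradley_terry(votes, models, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) votes.

    Returns ratings centered at 1000. The 400 * log10 scale and the
    small smoothing prior are illustrative assumptions.
    """
    wins = defaultdict(float)
    # Smoothing: a small virtual win in every direction keeps strengths
    # positive even if a resample gives a model no wins.
    for a in models:
        for b in models:
            if a != b:
                wins[(a, b)] += prior
    for winner, loser in votes:
        wins[(winner, loser)] += 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(wins[(i, j)] for j in models if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = total_wins / denom
        # Normalize so the geometric mean strength is 1, which makes the
        # average rating exactly 1000.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}

    return {m: 1000.0 + 400.0 * math.log10(s) for m, s in strength.items()}


def bootstrap_intervals(votes, models, rounds=200, seed=0):
    """Approximate 95% percentile intervals by resampling votes with replacement."""
    rng = random.Random(seed)
    samples = {m: [] for m in models}
    for _ in range(rounds):
        resampled = rng.choices(votes, k=len(votes))
        for m, rating in bradley_terry(resampled, models).items():
            samples[m].append(rating)
    intervals = {}
    for m, values in samples.items():
        values.sort()
        low = values[int(0.025 * (rounds - 1))]
        high = values[int(0.975 * (rounds - 1))]
        intervals[m] = (low, high)
    return intervals


# Made-up votes for three placeholder models.
models = ["A", "B", "C"]
votes = (
    [("A", "B")] * 30 + [("B", "A")] * 10
    + [("B", "C")] * 25 + [("C", "B")] * 15
    + [("A", "C")] * 35 + [("C", "A")] * 5
)
ratings = bradley_terry(votes, models)
intervals = bootstrap_intervals(votes, models)
for m in sorted(models, key=ratings.get, reverse=True):
    low, high = intervals[m]
    print(f"{m}: {ratings[m]:.0f} (95% CI {low:.0f} to {high:.0f})")
```

Resampling the votes and refitting shows how far each rating would move under a slightly different set of voters, which is what the confidence intervals report.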

The TTS Triangle and Arabic Dialect Diversity

Navid’s research team has introduced a framework called the “TTS Triangle,” arguing that a complete text-to-speech system must address three simultaneous dimensions: content, identity, and delivery.

According to the team, reducing Arabic’s vast dialectal diversity to broad, country-level labels like “Saudi” or “Egyptian” is fundamentally flawed, as dialects vary drastically even within single cities. Instead, Navid advocates for building systems around specific reference speaker identities.

Furthermore, the team challenges the current industry standard of embedding discrete emotion tags like “[laugh]” into text. They argue that human emotion permeates an entire utterance rather than functioning as an isolated event. The team champions natural language delivery instructions, similar to how a director guides a voice actor. The approach echoes OpenAI’s TTS API, which they view as the correct foundation for capturing Arabic’s expressive breadth.

Scaling the Arabic AI Developer Community

The Arabic TTS Arena currently hosts 15 commercial and open-source systems, including Arabic F5-TTS, Silma TTS, SpeechT5 Arabic, and XTTS v2.

The leaderboard is designed for continuous expansion with minimal friction for developers. Adding a new model to the arena requires the implementation of a single Python class, allowing the platform to discover the model automatically.
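The article does not describe the arena's actual interface, so the following is only a sketch of one common way a "single class, automatic discovery" design can work in Python. Every name here (`TTSModel`, `REGISTRY`, `synthesize`, `SilentDemoModel`) is hypothetical and not taken from the Arabic TTS Arena codebase.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a model's display name to its class.
REGISTRY = {}


class TTSModel(ABC):
    """Hypothetical base class: subclasses register themselves on definition."""

    name = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if not cls.name:
            raise TypeError(f"{cls.__name__} must define a non-empty 'name'")
        REGISTRY[cls.name] = cls

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return raw audio bytes for the given text."""


class SilentDemoModel(TTSModel):
    """Placeholder model that emits silence instead of real speech."""

    name = "silent-demo"
    sample_rate = 16000

    def synthesize(self, text: str) -> bytes:
        # 10 ms of 16-bit mono silence per input character.
        n_samples = len(text) * self.sample_rate // 100
        return b"\x00\x00" * n_samples


# Defining the subclass was enough: the platform can now look it up.
model = REGISTRY["silent-demo"]()
audio = model.synthesize("مرحبا")
print(sorted(REGISTRY), len(audio))
```

In a design like this, a contributor never edits a central list of models. Defining the subclass is what makes it visible to the platform.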

Each model runs in an isolated, containerized environment. As human votes accumulate, an automated daily process recomputes the ratings and updates the public leaderboard.

About Navid

Navid is a Riyadh-based artificial intelligence company operating as the generative AI arm of the Watad group. The parent company is an established player in regional AI development. In March 2024, Watad launched Mulhem, a seven-billion-parameter large language model trained exclusively on Saudi datasets. The bilingual model was trained on 90 billion Arabic and 90 billion English tokens, prioritizing localized context through over 70,000 domain-specific data points and 500,000 conversational datasets.

Source: Middle East AI News
