Chinese tech giant Alibaba has just released Qwen 2.5-Max, an AI model it claims outperforms DeepSeek-V3 on several key benchmarks.
The Arena-Hard benchmark measures how closely a language model's responses align with human preferences. Qwen 2.5-Max achieved a score of 89.4, surpassing DeepSeek-V3's 85.5. This suggests that Qwen 2.5-Max is better at generating responses that human evaluators judge to be helpful, informative, and relevant to their needs.
The LiveBench benchmark evaluates a model's capabilities across a range of tasks, including math, coding, reasoning, and language comprehension. Qwen 2.5-Max outperformed DeepSeek-V3 on LiveBench with a score of 62.2 compared to 60.5, suggesting a more comprehensive understanding of language and a greater ability to apply it.
The LiveCodeBench benchmark is similar but specifically assesses coding. Qwen 2.5-Max achieved a score of 38.7, slightly higher than DeepSeek-V3's 37.6. This suggests Qwen 2.5-Max has a marginal advantage in code generation and comprehension.
The GPQA-Diamond benchmark focuses on general knowledge question-answering, using challenging questions that require deeper reasoning and knowledge retrieval. Qwen 2.5-Max scored 60.1, edging out DeepSeek-V3's 59.1, suggesting it is slightly better at accessing and applying its knowledge base to answer complex questions.
What is Qwen 2.5-Max?
Qwen 2.5-Max is a large language model from Alibaba, an upgraded version of the company's massively popular Qwen 2.5 model.
Mixture-of-experts architecture
Qwen 2.5-Max is built on a mixture-of-experts (MoE) architecture. This means the model is made up of smaller, specialized "expert" networks, each focusing on a particular aspect of language or knowledge.
A "gating network" then acts as a central router, analyzing incoming requests and activating only the relevant experts for the task. This "sparse activation" ensures efficiency and allows the model to scale to larger sizes and handle more complex tasks.
20 trillion token dataset
Qwen 2.5-Max has been trained on a massive dataset of over 20 trillion tokens, meaning it has processed vast amounts of text and code, including books, websites, articles, transcripts, and more.
Being trained on such an extensive dataset gives Qwen 2.5-Max a broad and comprehensive understanding.
This helps the model answer a wider range of questions, generate more creative content, and translate between more languages.
The massive dataset also helps Qwen 2.5-Max exhibit strong reasoning and problem-solving skills: it can analyze information, identify patterns, and draw logical conclusions, making it capable of tackling complex tasks that require multi-step reasoning and decision-making.
Supervised Fine-Tuning and Reinforcement Learning from Human Feedback
Qwen 2.5-Max was fine-tuned on a curated dataset of human-written text. By training on this refined data, the model learns to generate responses that are more accurate, coherent, stylistically appropriate, and better at following instructions.
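As a rough illustration of the supervised fine-tuning objective, the sketch below computes cross-entropy loss only on the response tokens, so the model learns to reproduce a human-written answer given its prompt. The toy model, vocabulary size, and random token IDs are stand-ins, not Qwen's actual training stack.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a language model: embeds tokens, predicts the next token.
vocab, dim = 1000, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))

def sft_loss(prompt_ids, response_ids):
    """Cross-entropy on response tokens only: the model is trained to
    reproduce the human-written answer, conditioned on the prompt."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids)             # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :]      # position t predicts token t+1
    shift_labels = input_ids[:, 1:].clone()
    # Ignore prompt positions so only the response contributes to the loss.
    shift_labels[:, : prompt_ids.size(-1) - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, vocab),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

prompt = torch.randint(0, vocab, (1, 5))    # e.g. an instruction
response = torch.randint(0, vocab, (1, 7))  # the human-written answer
print(sft_loss(prompt, response).item())
```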
Reinforcement Learning from Human Feedback (RLHF) was also employed. Here, human evaluators review the model's responses against criteria like accuracy, helpfulness, and safety. This feedback is then used to train a "reward model" that guides the AI's learning process.
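A common way such reward models are built, and plausibly (though not confirmed by Alibaba) the approach used here, is to train on pairs of responses where evaluators marked one as better, using a Bradley-Terry style preference loss. The sketch below uses random embeddings and a linear scorer purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps a response embedding to a single scalar score.
dim = 32
reward_model = nn.Linear(dim, 1)

def preference_loss(chosen_emb, rejected_emb):
    """Bradley-Terry style loss: push the score of the human-preferred
    response above the score of the rejected one."""
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# A batch of comparisons labeled by human evaluators
# (random embeddings stand in for encoded responses).
chosen, rejected = torch.randn(4, dim), torch.randn(4, dim)
loss = preference_loss(chosen, rejected)
loss.backward()  # gradients nudge the scorer toward human judgments
print(loss.item())
```

Once trained, the reward model scores candidate responses during reinforcement learning, substituting for a human rater at scale.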
What is Alibaba?
Alibaba is a massive Chinese tech company best known for e-commerce.
Alibaba is heavily invested in AI research and development, creating advanced language models like Qwen 2.5-Max.
Its latest model showcases the company's ability to compete with global leaders in AI innovation, like OpenAI and DeepSeek.