Salesforce's CRM Study Reveals AI Agents Face Challenges in Real-World Business Settings

Salesforce’s CRMArena-Pro Benchmark Highlights AI Challenges in Business

Salesforce has introduced its new CRMArena-Pro benchmark, which reveals significant hurdles AI agents face in business environments. Even highly advanced models like Gemini 2.5 Pro achieve only about a 58% success rate on straightforward single-turn tasks. In longer, multi-turn interactions, the success rate drops to about 35%.

CRMArena-Pro aims to assess how well large language models (LLMs) perform on actual business tasks, particularly in areas like sales, customer service, and pricing. The benchmark expands on the previous CRMArena by adding more business functions, multi-turn dialogues, and data privacy testing. The Salesforce team generated 4,280 task instances across 19 business activities using synthetic data.

Challenges with Longer Conversations

The findings shed light on the limitations of current LLMs. For simple, single-turn tasks, models like Gemini 2.5 Pro reach about 58% accuracy. However, in multi-turn conversations, where follow-up questions are necessary, performance drops to about 35%.
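To put the single-turn and multi-turn figures side by side, the short sketch below works out the gap both in percentage points and as a relative decline. It uses only the two rounded figures reported above (58% and 35%); the function name is illustrative and not part of the benchmark.

```python
def drop(single_turn: float, multi_turn: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = single_turn - multi_turn
    relative = absolute / single_turn * 100
    return absolute, relative


# Rounded figures reported for Gemini 2.5 Pro in the article.
absolute, relative = drop(58.0, 35.0)
print(f"{absolute:.0f} points, {relative:.1f}% relative")
# prints "23 points, 39.7% relative"
```

In other words, moving from single-turn to multi-turn tasks costs roughly 23 percentage points, which is close to two fifths of the single-turn success rate.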

Salesforce tested nine LLMs and found that many struggle to ask appropriate follow-up questions. In a review of 20 unsuccessful multi-turn tasks involving Gemini 2.5 Pro, nearly half failed because the model did not ask for information it needed. Models that are more proactive in asking questions perform better in these situations.

The best results were seen in automated workflows, like managing customer service cases, where Gemini 2.5 Pro achieved an impressive 83% success rate. However, accuracy significantly declined in tasks that required deeper understanding, such as identifying incorrect product configurations or extracting information from call logs.

Data Privacy Concerns

The benchmark also highlights shortcomings in data privacy. By default, LLMs generally neither recognize nor refuse requests for sensitive information, such as personal details or internal company data.

Models began to reject these sensitive requests only when the system prompt was adjusted to include explicit privacy guidelines, and this came at a cost to overall performance. For instance, GPT-4o improved its ability to detect confidential information from 0% to 34.2%, but its task completion rate fell by 2.7 percentage points. Open-source models like LLaMA-3.1 were even less responsive to prompt changes, indicating they require better training to prioritize instructions correctly.
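As a rough illustration of what adding explicit privacy guidelines to a system prompt can look like, the sketch below appends a guideline block to a base prompt. The guideline wording, the constant and function names, and the base prompt are all hypothetical; they are not the prompts used in the CRMArena-Pro study.

```python
# Hypothetical guideline text; NOT the wording used in the study.
PRIVACY_GUIDELINES = (
    "Privacy guidelines:\n"
    "- Do not reveal personal details about customers or employees.\n"
    "- Do not disclose internal company data.\n"
    "- If a request asks for such information, decline and explain why."
)


def build_system_prompt(base_prompt: str, privacy_aware: bool) -> str:
    """Return the base prompt, optionally followed by the privacy guidelines."""
    if not privacy_aware:
        return base_prompt
    return f"{base_prompt.rstrip()}\n\n{PRIVACY_GUIDELINES}"


# Hypothetical base prompt for a CRM agent.
base = "You are a CRM assistant that helps with sales and service tasks."
print(build_system_prompt(base, privacy_aware=True))
```

Comparing runs with `privacy_aware` switched on and off is one way to observe the trade-off described above: refusals of sensitive requests may rise while task completion falls slightly.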

Kung-Hsiang Steeve Huang, one of the authors of this study, emphasizes that data protection tests have often been overlooked in benchmarks until now. CRMArena-Pro represents a pioneering effort to systematically evaluate this aspect of AI performance.
