Salesforce AI Research advances enterprise readiness for AI agents

Salesforce AI Research has unveiled an enterprise simulation environment to test agents’ ability to perform in realistic business scenarios.

The company notes that in every high-stakes field, skill and consistency are honed not through live action but through deliberate preparation in a space where failure is a learning tool, not a costly mistake. This is why pilots train in flight simulators that prepare them for the most extreme challenges, surgeons practise high-risk procedures on synthetic models and cadavers before ever operating on a human, and athletes perfect their plays in drills and scrimmages ahead of a big game, Salesforce said.

AI agents also benefit from simulation testing and training, which prepares them to handle the unpredictability of daily business scenarios before they’re deployed. Building on the original CRMArena, which focused on single-turn B2C service tasks, Salesforce AI Research has launched CRMArena-Pro, which tests agent performance in complex, multi-turn, multi-agent scenarios including sales forecasting, service case triage, and configure, price, quote (CPQ) processes.

CRMArena-Pro provides a rigorous, context-rich simulated enterprise environment built on synthetic data, in which an agent’s application programming interface (API) calls to relevant systems, and its ability to safeguard personally identifiable information (PII), can be evaluated safely. This allows businesses to test an agent’s accuracy, efficiency, and consistency at scale across enterprise-specific use cases.

Acting much like a digital twin of a business, these environments go beyond simple test beds, capturing the full complexity of enterprise operations. Salesforce AI Research is advancing AI agent training with these simulations, enabling businesses to test agents in scenarios such as customer service escalations or supply chain disruptions before the agents go live.

By incorporating real-world “noise” into the test environment, enterprises can better evaluate performance, strengthen resilience against edge cases, and bridge the gap between training and live operations. The result is AI agents that are capable, consistent, trustworthy, and ready for the agentic enterprise.

Salesforce has also introduced the Agentic Benchmark for CRM, the first benchmarking tool designed to evaluate AI agents not on generic capabilities, but in the contexts that matter most to businesses, including customer service, field service, marketing, and sales. The benchmark measures agents across five essential enterprise metrics — accuracy, cost, speed, trust and safety, and sustainability — which together build a comprehensive, data-driven assessment of their readiness for real-world deployment.

With new AI models and updates emerging daily, Salesforce explained that enterprises find it difficult to decide which model — or combination of models — is best suited to help power agents in real-world business settings. This benchmark measures how agents perform within specific business workflows, helping enterprises match the right models to the right agents for reliable, enterprise-grade performance.

Sustainability, the newest metric in the agentic measurement tool, is especially important to track. This measure highlights the relative environmental impact of AI systems, which can demand significant computational resources. Businesses can minimise their environmental footprint and determine their AI sustainability — while achieving the performance they need — by aligning model size with the specific level of intelligence required to complete an enterprise-specific task.
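
The idea of aligning model size with the intelligence a task requires can be illustrated with a minimal sketch. This is not Salesforce's benchmark code: the `Candidate` class, the `pick_model` helper, the `relative_energy` field, and all of the numbers are hypothetical placeholders, assumed here only to show the selection logic of choosing the least resource-hungry model that still clears an accuracy floor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    accuracy: float         # fraction of benchmark tasks solved, 0.0 to 1.0
    relative_energy: float  # hypothetical cost relative to the smallest model


def pick_model(candidates: List[Candidate], min_accuracy: float) -> Optional[Candidate]:
    """Return the lowest-energy candidate that meets the accuracy floor, or None."""
    eligible = [c for c in candidates if c.accuracy >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.relative_energy)


# Placeholder values for illustration only; these are not measured results.
candidates = [
    Candidate("small", accuracy=0.82, relative_energy=1.0),
    Candidate("medium", accuracy=0.91, relative_energy=4.0),
    Candidate("large", accuracy=0.95, relative_energy=20.0),
]

choice = pick_model(candidates, min_accuracy=0.90)
print(choice.name if choice else "no model meets the floor")
```

With these placeholder figures, a task that needs 90% accuracy would be served by the mid-sized model rather than the largest one. A real assessment would also weigh the benchmark's other metrics, such as cost, speed, and trust and safety.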

MCP-Eval and MCP-Universe are complementary benchmarks published by Salesforce AI Research this quarter. They are designed to measure agents at different levels of rigour and to track large language models (LLMs) as they interact with Model Context Protocol (MCP) servers in real-world use-case environments.

MCP-Eval provides scalable, automatic evaluation through synthetic tasks, making it well-suited for testing across a wide range of MCP servers. MCP-Universe introduces challenging real-world tasks with execution-based evaluators that stress-test agents in complex scenarios, and offers an extendable framework for building and evaluating agents. Together, they form a powerful toolkit: MCP-Eval for broad, initial assessments, and MCP-Universe for deeper diagnosis and debugging.

This dual approach is critical for enterprises, Salesforce said. The company’s research found that most state-of-the-art LLMs on the market today face limitations that hold back enterprise-grade performance, from long-context challenges (where models lose track of information in complex inputs) to unknown-tool challenges (where they fail to adapt seamlessly to unfamiliar systems).

By leveraging MCP-Universe and MCP-Eval, enterprises can understand where agents break down and refine their frameworks or tool integrations accordingly. And with a platform that layers in context, enhanced reasoning, and trust guardrails, organisations can move beyond do-it-yourself (DIY) experimentation to deliver agents ready for real-world business impact.

High-quality, unified data is at the heart of reliable, scalable AI agent performance. It allows agents to understand context, follow business rules, and make accurate, compliant decisions that align with organisational goals.

But enterprise data is rarely clean or well-organised — a perennial challenge for businesses. Customer records are often duplicated across departments, fields are incomplete, and inconsistent formatting and naming conventions make it difficult to reconcile data across systems.

To tackle this, the Salesforce AI Research and product teams partnered to fine-tune large and small language models to power Account Matching, a capability that autonomously identifies and unifies accounts across scattered, inconsistent datasets. Instead of treating “The Example Company, Inc.” and “Example Co.” as separate entities, the system can now use AI to consolidate them into a single, authoritative record. Unlike static, rule-based systems that require heavy manual setup, Account Matching accurately reconciles millions of records.
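
To make the reconciliation problem concrete, here is a minimal sketch of the task itself. It is not how Account Matching works: the capability described above relies on fine-tuned language models, whereas this sketch swaps in a simple baseline of name normalisation plus string similarity from Python's standard `difflib` module. The `NOISE_WORDS` list, the function names, and both thresholds are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical set of filler words and legal suffixes to ignore when comparing names.
NOISE_WORDS = {"the", "inc", "co", "company", "corp", "corporation", "ltd", "llc"}


def normalize(name: str) -> str:
    """Lowercase the name, strip punctuation, and drop common legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in NOISE_WORDS)


def similarity(a: str, b: str) -> float:
    """Similarity of the two normalised names, between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def route(a: str, b: str, auto_threshold: float = 0.9, review_threshold: float = 0.6) -> str:
    """Merge confident matches, send borderline pairs to a person, keep the rest apart."""
    score = similarity(a, b)
    if score >= auto_threshold:
        return "merge"
    if score >= review_threshold:
        return "human_review"
    return "separate"


print(route("The Example Company, Inc.", "Example Co."))
```

Both names in the article's example reduce to the same normalised string, so this baseline would merge them. Hand-maintained word lists and thresholds like these are the kind of manual setup the article contrasts with a learned approach, and they tend to break down on abbreviations, translations, and renamed entities.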

In its first month, one customer’s proprietary tool built on Account Matching unified more than a million accounts with a 95% match success rate, reducing average handling time by 30 minutes. The tool automatically matched details such as account names, websites, addresses, and phone numbers across business units, surfaced a workflow in each organisation for sellers to connect, and routed only the top 5% of complex cases to humans.

By helping sellers quickly find counterparts covering the same or similar accounts, the solution helped eliminate duplicative work, accelerate sales cycles, and prevent missed opportunities. Best of all, the entire solution was implemented without the need for hard-coding, lowering costs and dramatically improving efficiency.
