Microsoft’s new tool lets developers run AI behavior tests using textual descriptions


AI researchers and laboratories have progressed very quickly in evaluating AI models of everything safety And comply with flatter and coordination. But companies and developers appear to be facing a new, specific need: ensuring that their AI system behaves as intended for their specific product or service.

In an attempt to make this testing process simpler, Microsoft revealed the matter on Tuesday to be surewhich is an abbreviation for Adaptive Specification-Based Scoring for Evaluation and Regression Testing.

Microsoft says the open source framework makes evaluating an application’s AI behavior easy by using AI to transform high-level natural language descriptions of goals, policies, or intended behaviors into comprehensive, scored tests that can be investigated.

ASSERT takes plain language descriptions of the expected behavior and policies of an AI model, converts them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them on the target system, and records the results. It can also record the paths taken by the AI ​​system, including intermediate actions and tool calls, so developers can examine where failures occurred.

Developers can provide system context, tools, and limitations as well, if they want to further customize what the evaluations cover.

For example, a developer could specify that a document research AI agent should not send emails to people outside the company, should limit confidential information to C-level executives and provide brief summaries with prior context in mind. ASSERT will use these rules to create test cases that check whether the system follows these rules consistently.

Image credits:Microsoft

According to Microsoft, the framework fills a gap that broader, more general evaluations cannot when AI models are intended to behave in a way shaped by the context, policies, and tools of an application or product.

“One thing we’ve learned is that evaluations are very important for making good decisions,” he said. Sarah Birdchief product officer of Microsoft’s Responsible AI division. “Because if you don’t understand the behavior of an AI system, it’s really hard to know if it meets your organization’s requirements… What we found is that if you really want to have a trustworthy system, you have to evaluate many other application-specific dimensions.”

ASSERT can be used to evaluate systems as they are built, after they are deployed, and even for ongoing monitoring, Baird said.

The release comes amid a gradual but broader shift in the artificial intelligence industry. As models become more powerful, researchers focus on reproducible tests and regression checks Helm Stanford, MLCommons’ AILluminateand evaluation groups such as meter Providing criteria to measure how models behave under different conditions.

When you make a purchase through the links in our articles, We may earn a small commission. This does not affect our editorial independence.

Leave a Reply

Your email address will not be published. Required fields are marked *