Anthropic on Tuesday announced an initiative to develop new benchmarks for testing the capabilities of advanced artificial intelligence (AI) models. The AI company will fund the project and has invited interested entities to apply. The company said existing benchmarks are insufficient to fully test the capabilities and impact of new large language models (LLMs). As a result, a new set of assessments focused on AI safety, advanced capabilities, and societal impact must be developed, Anthropic said.
Anthropic to Fund New Benchmarks for AI Models
In a newsroom post, Anthropic highlighted the need for a comprehensive third-party assessment ecosystem to move beyond the limited scope of current benchmarks. The AI company announced that, through its initiative, it will fund third-party organizations that want to develop new assessments for AI models focused on high quality and safety standards.
For Anthropic, high-priority areas include activities and questions that can measure an LLM’s AI safety levels (ASL), advanced capabilities in idea and response generation, and the societal impact of these capabilities.
In the ASL category, the company highlighted several metrics, including the ability of AI models to assist in or autonomously carry out cyberattacks, the potential of models to assist in creating, or to enhance knowledge about creating, chemical, biological, radiological and nuclear (CBRN) risks, national security risk assessment, and more.
In terms of advanced capabilities, Anthropic noted that benchmarks should be able to assess AI’s potential to transform scientific research, its engagement with and refusal of harmful requests, and its multilingual capabilities. Additionally, the AI firm said that it is necessary to understand an AI model’s potential to impact society. For this, assessments should be able to target concepts such as “harmful bias, discrimination, overdependence, addiction, attachment, psychological influence, economic impacts, homogenization and other broad societal impacts.”
The AI company also listed some principles for good assessments. It said that assessments should not appear in the training data used by AI models, since they otherwise often turn into a memory test rather than a test of capability. It encouraged including between 1,000 and 10,000 tasks or questions in each assessment, and asked organizations to use subject matter experts to create tasks that test performance in a specific domain.
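As a rough illustration of the first two principles, the sketch below checks a task list for overlap with a sample of training text and for size. It is not Anthropic's method: exact word n-gram matching is one simple, assumed way to spot leaked questions, and the function names and the default n-gram length of 8 are hypothetical choices made for this example.

```python
from typing import Iterable, List, Set, Tuple


def word_ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = text.lower().split()
    # Texts shorter than n words produce an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated_tasks(tasks: List[str], corpus: Iterable[str], n: int = 8) -> List[int]:
    """Return indices of tasks that share at least one n-gram with the corpus.

    Assumption: exact n-gram overlap stands in for "appears in training data".
    Paraphrased leaks would not be caught by this check.
    """
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        corpus_ngrams |= word_ngrams(doc, n)
    return [i for i, task in enumerate(tasks) if word_ngrams(task, n) & corpus_ngrams]


def size_in_range(tasks: List[str], low: int = 1_000, high: int = 10_000) -> bool:
    """True if the task count falls within the 1,000 to 10,000 range cited above."""
    return low <= len(tasks) <= high


if __name__ == "__main__":
    tasks = [
        "What is the boiling point of water at sea level?",
        "Name the largest planet in the solar system.",
    ]
    corpus = ["A quiz site asks: what is the boiling point of water at sea level? Answer below."]
    print(contaminated_tasks(tasks, corpus, n=5))  # indices of tasks overlapping the corpus
    print(size_in_range(tasks))                    # whether the task count is in range
```

In practice a benchmark developer would rarely have access to a model's full training corpus, so a check like this can only flag overlap with text that is known to be public.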