July 16, 2025
The BenchmarkCards framework, a structured collection of datasets, benchmarks, and mitigations that guides developers in building safe and transparent AI systems, was recently incorporated into IBM's Risk Atlas Nexus, the company's open-source AI toolkit for governing foundation models. With support from the Notre Dame-IBM Technology Ethics Lab, researchers at the University of Notre Dame's Lucy Family Institute for Data & Society and IBM Research jointly developed the framework. It targets the broader community of researchers and developers, offering practical guidance for evaluating and mitigating potential risks when developing AI models.
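For illustration only, a benchmark card of this kind could be represented as structured metadata. The sketch below is a hypothetical Python rendering; the class name, fields, and example values are assumptions made for this article and do not reflect the actual BenchmarkCards schema or the Risk Atlas Nexus API.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkCard:
    """Hypothetical structured record documenting an LLM benchmark.

    Field names are illustrative assumptions, not the actual
    BenchmarkCards schema or the Risk Atlas Nexus API.
    """
    name: str                   # benchmark identifier, e.g. "MMLU"
    dataset: str                # underlying dataset or source
    metrics: list[str]          # evaluation metrics reported
    intended_use: str           # what the benchmark is meant to measure
    known_limitations: list[str] = field(default_factory=list)
    risk_mitigations: list[str] = field(default_factory=list)

# Example: documenting a popular benchmark together with its caveats.
mmlu_card = BenchmarkCard(
    name="MMLU",
    dataset="Massive Multitask Language Understanding (57 subjects)",
    metrics=["accuracy"],
    intended_use="Breadth of multiple-choice academic knowledge",
    known_limitations=["High scores do not imply reasoning depth"],
    risk_mitigations=["Pair with task-specific and safety evaluations"],
)
print(mmlu_card.name, mmlu_card.known_limitations)
```

Recording limitations and mitigations alongside the benchmark itself is what lets such a card serve as a guide rather than just a leaderboard entry.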
The development of large language models (LLMs) and the assessment of their capabilities are guided by performance on benchmarks: combinations of datasets, evaluation metrics, and associated processing steps. Although benchmarks serve this critical role, when misused they can create a false sense of safety or performance, with serious ethical and practical implications.
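To make that definition concrete, here is a minimal sketch of a benchmark harness showing all three pieces (dataset, processing steps, and metric); the toy data, the normalization step, and the exact-match metric are illustrative assumptions rather than any specific published benchmark.

```python
# Minimal sketch of a benchmark: dataset + processing + metric.
# The toy data and the exact-match metric are illustrative assumptions.

def normalize(text: str) -> str:
    """Processing step: strip whitespace and lowercase the answer."""
    return text.strip().lower()

def exact_match(prediction: str, reference: str) -> float:
    """Evaluation metric: 1.0 if normalized answers agree, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0

def run_benchmark(model, dataset) -> float:
    """Score a model over (question, reference_answer) pairs."""
    scores = [exact_match(model(q), ref) for q, ref in dataset]
    return sum(scores) / len(scores)

# Toy usage with a trivial "model" (a dictionary lookup).
toy_dataset = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
toy_model = {"2 + 2 = ?": "4", "Capital of France?": "paris"}.get
print(f"accuracy: {run_benchmark(toy_model, toy_dataset):.2f}")  # 1.00
```

A perfect score here says nothing beyond these two questions and this one metric, which is exactly the gap between measured performance and real-world safety that misused benchmarks can obscure.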
In education, for example, heavy reliance on popular benchmarks such as Massive Multitask Language Understanding (MMLU) and Grade School Math 8K (GSM8K) to evaluate large language models like ChatGPT has shaped the design of AI tutors and test proctoring systems. While innovative, these tools may do little to promote deep conceptual understanding, sometimes yield inconsistent results, and raise important questions about the use of personal biometric data and informed consent.