Testing and auditing of AI applications: exploiting potential and reducing risks

Artificial intelligence is finding its way into more and more areas of the world of work. It assists workers with time-consuming or risky activities, for example, and relieves them by automating monotonous and tiring work processes. AI therefore has the potential to significantly improve the work of many people. However, the use of AI systems not only offers opportunities to increase the quality of work; its spread also entails risks such as safety hazards and discrimination – and these risks are a major obstacle to unlocking the positive effects of AI.

Like other technical systems, AI applications have to be tested and assessed in terms of their functionality and safety, and any risks have to be addressed by suitable measures. This applies in particular if they are to be used in critical contexts. But AI applications have a number of specific characteristics – such as their high complexity and the opacity of their interdependencies and decision-making processes – which raise the question of what form effective testing, auditing and certification processes for AI systems must take from a technical, organisational and legal perspective. This is where the interdisciplinary research project ‘ExamAI – AI Testing & Auditing’ came in.

Potential and risks of production automation and HR and talent management

To answer this question in as application-specific a manner as possible, and for various types of risk, the project team – comprising (social) computer scientists, software engineers, jurists and political scientists – started by evaluating the possibilities and limitations of using AI in eleven representative use cases in the areas of production automation and HR and talent management. In the area of production automation, AI offers particular potential for driverless transport systems and for autonomous and collaborative mobile robots. In the field of HR and talent management, AI can be used, for example, to automatically generate suggestions and matches on HR platforms and employment websites, perform personality assessments and background checks, or even predict workers’ propensity to resign.

The two areas differ fundamentally in terms of the criticality related to the use of AI: key conditions for the use of AI in the area of production automation are user safety and the prevention of damage to property. When it comes to HR and talent management, however, the focus lies on the fairness of the decisions prepared or even made by AI, involving aspects such as non-discrimination, data privacy and traceability for the data subjects.

Unclear what AI needs to be tested for

In both areas, there are currently no useful control and certification options. This is not for lack of tools with which such systems could be tested: there are now plenty of effective methods for analysing black boxes. Likewise, corresponding auditing procedures have already been regulated by law or are set to be with the forthcoming Artificial Intelligence Regulation. Rather, the key problem identified by the researchers is that there is currently a lack of standards specifying which criteria have to be met, or which measures are sufficient, for AI systems to be deemed safe and fair. Standards that regulate the safety of technical systems do already exist in the area of production automation, but as they stand they cannot be applied to AI systems. And as regards fairness, there is a lack not only of corresponding standards but also of a sufficiently binding definition – it is therefore not even possible to stipulate which aspects in particular need to be addressed.
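One family of black-box analysis methods alluded to above is metamorphic testing: without knowing a model’s internals, one checks that related inputs yield consistently related outputs. The following sketch is purely illustrative – the model and the relation are hypothetical stand-ins, not artefacts of the project:

```python
# Illustrative sketch of metamorphic testing, one black-box analysis
# technique. The "model" below is a hypothetical stand-in for an opaque
# AI system; the metamorphic relation checked is name-casing invariance.

def model(features):
    # Placeholder black box: score grows with years of experience,
    # and (correctly) ignores the applicant's name entirely.
    years, name = features
    return min(1.0, 0.1 * years)

def metamorphic_name_invariance(model, years, name):
    # Metamorphic relation: changing the casing of the applicant's name
    # must not change the model's score.
    return model((years, name)) == model((years, name.upper()))

print(metamorphic_name_invariance(model, 5, "alice"))  # True
```

The point of such relations is that they require no access to the model’s internals and no ground-truth labels – only an agreed expectation about how outputs may and may not vary.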

In the scenarios outlined in the project, the result is a considerable liability risk for anyone deploying AI systems, as users run the risk of being held accountable for any damage or discrimination the systems cause. This represents a major obstacle to unlocking the full potential of artificial intelligence.

How safe is safe enough? How fair is fair enough?

In view of this, the challenge lies in determining which measures need to be taken so that AI applications can be used in safety-critical environments and in the area of HR and talent management. Or to put it another way: how safe is safe enough? And how fair is fair enough?

One promising solution is what are known as assurance cases. These are processes already established for conventional safety-critical applications: starting from a defined safety target, a structured, fact-based argument is built to show that specific measures are suitable for guaranteeing adequate system safety. Beyond the assessment of individual systems, assurance cases offer the advantage that, over time, arguments which have proved useful across various cases can be generalised and translated into a standard.
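The structure of such an argument can be sketched as a tree of claims supported by evidence, loosely following Goal Structuring Notation. The following is a minimal, hypothetical illustration – the claims, evidence and thresholds are invented for this example, not taken from the project:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an assurance case as a claim-evidence tree
# (loosely following Goal Structuring Notation). All names, claims and
# thresholds are illustrative assumptions.

@dataclass
class Evidence:
    description: str
    satisfied: bool  # whether the supporting test or analysis passed

@dataclass
class Claim:
    statement: str
    evidence: list = field(default_factory=list)
    subclaims: list = field(default_factory=list)

    def holds(self) -> bool:
        # A claim holds if all its evidence passed and all subclaims hold.
        return all(e.satisfied for e in self.evidence) and all(
            c.holds() for c in self.subclaims
        )

top = Claim(
    "The collaborative robot is acceptably safe in shared workspaces",
    subclaims=[
        Claim(
            "The perception model detects humans reliably",
            evidence=[Evidence("Detection recall >= 99.9% on test data", True)],
        ),
        Claim(
            "The robot stops within the required distance",
            evidence=[Evidence("Braking-distance measurements within spec", True)],
        ),
    ],
)
print(top.holds())  # True only while every supporting argument is satisfied
```

Generalising such trees across many cases is exactly what would, over time, allow proven argument patterns to be distilled into a standard.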

In the area of HR and talent management, however, solutions need to start earlier, at the point at which the term ‘fairness’ is defined and operationalised. Only once a consensus has been reached on how ‘fairness’ is to be determined and measured can decisions be made about which measures and tests are necessary and useful given the criticality of a concrete application. A promising option here is a combination of assurance cases and acceptance test-driven development (ATDD). With ATDD, the customers’ and users’ expectations regarding an application’s functionality are identified as early as possible and defined very precisely through suitable communication and coordination measures. This approach can be used to determine more precisely what ‘fairness’ means for specific AI applications in fairness-critical areas, taking into account the different perspectives of users, AI experts, ethicists, jurists and other stakeholders. In this way, a consensus can also be reached on which measures are sufficient for an AI application to be deemed fair enough and, consequently, which testing procedures are relevant in each case.
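To make this concrete: once stakeholders have agreed on an operational fairness definition, it can be encoded as an automated acceptance test. The sketch below assumes, purely for illustration, that the agreed definition is demographic parity (selection rates across groups may differ by at most an agreed tolerance); the decisions, groups and tolerance are invented placeholders:

```python
# Illustrative ATDD-style acceptance test for a hypothetical fairness
# criterion: demographic parity with an agreed tolerance. The decisions,
# group labels and tolerance below are invented placeholders.

def selection_rate(decisions, groups, group):
    # Fraction of positive decisions (1 = selected) within one group.
    picks = [d for d, g in zip(decisions, groups) if g == group]
    return sum(picks) / len(picks)

def passes_demographic_parity(decisions, groups, tolerance):
    # Acceptance criterion: the gap between the highest and lowest
    # per-group selection rate must not exceed the agreed tolerance.
    rates = {g: selection_rate(decisions, groups, g) for g in set(groups)}
    return max(rates.values()) - min(rates.values()) <= tolerance

# Hypothetical screening decisions (1 = invited to interview) per group.
decisions = [1, 0, 1, 1, 0, 1, 0, 1]
groups    = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Group A rate is 0.75, group B rate is 0.5, so the gap is 0.25.
print(passes_demographic_parity(decisions, groups, tolerance=0.3))  # True
print(passes_demographic_parity(decisions, groups, tolerance=0.1))  # False
```

The value of the ATDD framing is that the tolerance and the metric itself are negotiated with all stakeholders up front, so the test encodes the consensus rather than a developer’s private assumption.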

Supporting activities, promoting research and boosting collaboration

Based on the findings, concrete recommendations for action were formulated in workshops with manufacturers, suppliers, union representatives and insurers (German Social Accident Insurance, DGUV) in the fields of industrial production and HR management. In the experts’ view, assurance cases are fit for purpose and should be established as the key element in the auditing and certification of AI until enough experience has been gathered to develop the necessary standards. So that this experience can be gathered both practically and in a protected space, experimental spaces (regulatory sandboxes) should be created for AI testing, in which the safety and fairness aspects of applications can be examined in case studies.

Such experimental spaces should also directly bring together safety experts, the developers of safety measures and conformity assessment experts, and should promote interdisciplinary collaboration and dialogue between science and practice.

Policymakers should promote and incentivise activities aimed at introducing standards, as well as the development of methods and tools for AI quality assurance. Support in the form of scientific studies, and the involvement of research institutes and AI experts in standardisation bodies, are also very important in order to achieve the best possible results. Conversely, to strengthen and advance research, not only should fundamental research projects be promoted – companies should also be encouraged to collaborate more transparently, for example by providing research data or giving others insight into their safety-critical processes. Finally, the practicability of the planned Artificial Intelligence Regulation must be examined from a technical and legal point of view in order to clarify how the regulatory requirements are to be met on the basis of technical standards, to understand its interaction with existing legislation, and to identify any necessary action early on.

This all also hinges on there being a better understanding of the correlations, challenges and possible solutions within the political institutions involved. This calls for sufficient resources, extensive expertise and, not least, the appropriate technical equipment.

The ‘ExamAI – AI Testing & Auditing’ consortia project comprised the German Informatics Society (GI) as the project coordinator, the Fraunhofer IESE, the think tank Stiftung Neue Verantwortung, the Algorithm Accountability Lab of TU Kaiserslautern and the Institute of Legal Informatics at Saarland University. The project, which ran from March 2020 to November 2021, was funded by Germany’s Federal Ministry of Labour and Social Affairs (BMAS) as part of the AI Observatory project of the ministry’s Policy Lab Digital, Work & Society.

Published on 18 Mar 2022 on the topic: Knowledge