Verification of Autonomous Systems Working Group Meeting
NRL, April 16-17, 2014
The original purpose of this meeting was to determine whether or not there existed a mathematical space that could ameliorate some of our state space problems. We ended up with a list of open research challenges (the starting point for the Challenges & Tools page). The report from that meeting is included below.
Meeting Report
Summary
The new Verification of Autonomous Systems Working Group held its kickoff meeting at NRL on April 16 and 17, 2014. At this meeting, the group concluded that there were extensive research challenges in this area and determined that a multi-disciplinary group was required in order to address them effectively. Participants agreed on common definitions for a number of potentially contentious terms and began working out how to address their common problems.
Terminology
The terms defined below are the default definitions within the group. They are not intended to be propagated outside the group; they exist merely to simplify and streamline discussion within it. Any of these terms may be defined differently for the purposes of a given discussion or contribution, as long as the person redefining it informs the rest of the group.
The terms defined at this meeting are the default terms in the Terminology Wiki.
Additional terms defined as part of the discussions included autonomy (the freedom to choose to do something the operator may not have done; if the mission is completed, the system worked) and cognitive element (which in the validation world relates more to the human side of the equation, but in the robotics world refers to the various elements that make up the perception and decision-making aspects of the robot). Users and operators are not the same: a user controls the system without understanding the underlying technology or decision-making capability, while an operator understands that underlying complexity and exploits it. Users may become operators as they define new tasks for the system to perform that the original designers did not envision; in that case, the users are operators for the purposes of that task.
Key words identified as central to the ways in which current verification tools fail to adequately address autonomous systems include uncertainty, complexity, and unstructured: the robots operate in unstructured, complex environments; the correct action for the system to take at any given time is uncertain; and the interaction between the environment and the system is intractably complex.
Interdisciplinary Awareness
Several scenarios were identified and used as explanatory examples to highlight the complexity of the problem and help define the terminology.
Scenario 1: A robot that will go to the store and bring back a gallon of milk. This enabled the group to begin exploring different aspects of the problem by highlighting the nature of implicit and explicit tasks: “get a gallon of milk” is an explicit task; “cross the street” is an implicit task that may or may not be part of the design specification, depending on the path to and from the store; and “don’t run over people” is an implicit task that should always be active and should be caught as part of the initial validation process. (An illustrative sketch of how such explicit and implicit tasks might be written as formal specifications follows Scenario 2.) Several aspects of the problem need to be addressed, including whether we can verify that a robot has enough information to make a decision and how well characterized its sensors are.
Scenario 2: A team of robots (or a single robot) that will survey an area, where the survey consists of a low-resolution, long-range pass to identify interesting objects and a secondary step producing high-resolution imagery of just those objects.
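As an illustrative aside (not something formalized at the meeting), the explicit and implicit tasks in Scenario 1 could be written as temporal-logic properties over atomic propositions such as have_milk, crossing, and collision; the propositions and formulas below are assumptions made purely for the sake of the example.

    \Diamond(\mathit{have\_milk} \land \mathit{at\_home})        % explicit task: eventually return home with the milk
    \Box(\mathit{crossing} \rightarrow \mathit{at\_crosswalk})   % implicit, path-dependent: only cross streets at crosswalks
    \Box\,\lnot\mathit{collision}                                % implicit, always active: never run over people

In this style, explicit tasks become goals to be eventually achieved, while always-active implicit tasks become invariants that any verification activity must cover regardless of the specific mission.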
Goals
Potential problem descriptions:
- to be able to mathematically (or otherwise) express the boundary conditions on the predictability of the robot’s behavior – even when the robot’s specific actions cannot be predicted in advance
- to be able to verify that a system has implicit understanding (possibly even eventually the ability to generate and understand the implicit goals that must be established and accomplished in order to achieve the larger task)
- to be able to verify, at run time, that the system will not perform actions outside some safe cage.
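As a minimal sketch of the third goal, a run-time monitor can sit between the planner and the actuators and veto any commanded action that would leave a declared safe cage. The cage parameters, state fields, and action format below are hypothetical, chosen only to make the idea concrete; a real envelope would come from the verified specification.

    from dataclasses import dataclass

    @dataclass
    class SafeCage:
        """Hypothetical safe cage: speed and workspace limits plus a keep-out distance."""
        max_speed: float             # m/s
        workspace: tuple             # (x_min, x_max, y_min, y_max) in metres
        min_person_distance: float   # m

        def permits(self, action, state):
            """Return True only if the commanded action stays inside the cage."""
            x, y = state["position"]
            x_min, x_max, y_min, y_max = self.workspace
            inside = x_min <= x + action["dx"] <= x_max and y_min <= y + action["dy"] <= y_max
            slow_enough = action["speed"] <= self.max_speed
            clear_of_people = all(d >= self.min_person_distance for d in state["person_distances"])
            return inside and slow_enough and clear_of_people

    def monitored_step(planner, cage, state, safe_fallback):
        """Execute the planner's proposed action only if the cage permits it; otherwise fall back."""
        action = planner(state)
        return action if cage.permits(action, state) else safe_fallback

    # Toy usage: the planner is a stand-in for the (unverified) autonomy stack.
    cage = SafeCage(max_speed=1.0, workspace=(0.0, 50.0, 0.0, 50.0), min_person_distance=2.0)
    state = {"position": (10.0, 10.0), "person_distances": [5.0, 7.5]}
    stop = {"dx": 0.0, "dy": 0.0, "speed": 0.0}
    planner = lambda s: {"dx": 0.5, "dy": 0.0, "speed": 0.8}
    print(monitored_step(planner, cage, state, safe_fallback=stop))

The appeal of this pattern is that only the monitor and the cage definition need to be verified, not the planner itself; the open question is whether such a cage can be specified at all for unstructured, intractably complex environments.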
Trust
Anecdotal evidence indicates that the tipping point from distrust to trust is disturbingly narrow and steep. When presented with a functioning system operating safely in a constrained environment, individuals who distrust that system based on hearsay will exhibit a binary shift into a level of trust that is completely undeserved. They will go from wary to wandering around in front of it in a matter of minutes.
This problem is complicated by the need to disambiguate between systems that are untrusted because of some element of operator interaction or existing bias and systems that are untrusted because they are simply not good enough at the task (they are not trustworthy).
Consensus
The group agreed that:
- not enough time is spent specifying behaviors
- it is possible to inspire user/operator confidence without transparency, and that there are cases in which less transparency is better than providing full and complete access to the complex elements in the system (although how much transparency is too much is an unanswered question)
- autonomous systems don't really have "corner cases" in the usual sense, since the entire system is effectively one massive corner case
- tolerance for risk is a function of the likelihood of a given outcome and of the severity of the consequences of that outcome
- all these systems will be operating with a human user or operator, even if that human’s only job is to define the parameters of the task, tell the system when to start, and accept the results. Introducing the human element into these systems magnifies the problem
- the primary focus of the group is scenarios in which it is not practical to constrain the environment to create a tractable problem (constraining the environment to reduce complexity while retaining intractably complex environmental interactions is fine)
- specifications need to be defined temporally – some specifications will be part of the task definition (e.g. get me ice cream instead of milk) and some will be part of the system’s typical operation (e.g. don’t run over people)
- the processes used to verify and validate autonomous systems cannot take years to perform: the technology is both flexible and changing extremely fast, and users are likely to repurpose systems in dangerous ways, so the ability to bound a system for liability purposes is important
- the problem is inherently multi-disciplinary, and requires the kind of group assembled at this meeting
The group also concluded that while it may be appropriate to require system developers to adhere to design practices that make verification easier (or even possible), in practice this is unlikely to work: developers will focus on effectiveness and performance first, and on verification only to the extent necessary and no further.
Research Challenges Identified
Several major research challenges within this area were defined, covering the development and debugging of models, how to determine an appropriate level of abstraction, new tools and techniques that must be developed, and how to test these new tools and techniques.
- (abstraction, model) Disambiguating between cases where the specification was wrong (the abstraction of the task description didn’t capture some required element of the actions the system should take) and cases where the environmental model was wrong (the abstraction of the interaction between the system and its environment didn’t capture something that affected the system’s performance against its goals)
- (task model) Creating a model from a human understanding of the task goals, the system, and the environment (note that humans can explain explicit goals, like “this is the important data” but often assume understanding of implicit goals, like “it doesn’t count if the data doesn’t make it back”)
- (abstraction, system, environment model) Determining the correct level of abstraction when trying to model a system/environment
- (model) Identifying and modeling things we don’t know how to model yet (what are the properties of the things we don’t know how to model? Are there specific classes, or is it just a big blob of “stuff we don’t know how to model”? And how do we model them?)
- (model) How do we, from empirical observations, produce a mathematical and/or logical model suitable for formal analysis?
- (tools) Figuring out which available tools are appropriate for which aspects of the problem (includes enumerating tools and their appropriate applications)
- (tools) Approaches to enable certification of autonomous systems in the same way we certify people or animals, rather than the way we certify machines
- (testing) How can we produce test sets that generate the data required to build mathematical/logical models?
- (tools) How do we create a rigorous process or methodology to allow the production of techniques that will combine the various verification tools available to verify a larger, complex system?
- (testing) How can we produce test sets that find the boundary cases or fault locations in the larger state space?
- (testing) How can we ensure that we can, in the physical world, test the simulated situations that result in boundary cases?
- (abstraction, model, testing, tools) Developing a structured process for feedback between the physical/ground truth layer, the simulation layer, and the formal methods/verification tools: what are the relevant aspects of the simulation and model that can and cannot be abstracted away? Currently, when problems are found in third-stage validation, those problems are typically fixed rather than folded back into the model/design. With an autonomous system, there are many complex interactions within the system, and fixing the problem is likely to require another verification stage to verify the fixed system.
- (core problem) What is ENOUGH evidence to say that something is verified? Running a few cases is not sufficient to verify something (unless those cases can be proven to encompass the range of situations and behaviors that need to be tested to characterize the system or its properties), but when a statistical or sampling approach must be taken, how many samples is enough? (A numerical sketch of one narrow version of this question follows this list.) In cases where evidence is not quantitative, what is the basis for setting the threshold for enough evidence? Existing regulatory authorities have thresholds for determining when a system has passed a certification process, but the foundations for those thresholds are murky at best; fundamentally, we don’t know why these existing processes work, so we can’t develop new ones that cover the currently unworkable cases.
- (design) Does designing a system to be modular actually help long-term supportability, in the sense that verifying a new module and its interactions is a less intensive and complex process than having to verify the entire system? This seems like it ought to be true for less complex automated systems, but is it still true for systems that are already intractably complex? In practice, research in this area has found that while theoretical approaches that satisfy certain assumptions do show a benefit, there are significant impediments to realizing it in practice. The biggest software problems arise at the module interfaces, where assumptions about the system change as different module creators submit their creations; defining the underlying assumptions in the interactions between modules that cause these problems remains unsolved.
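One narrow, quantitative slice of the core problem above has a standard answer when test outcomes can be treated as independent pass/fail trials, which is itself a strong assumption for autonomous systems and exactly the kind of assumption the group is questioning. The sketch below is illustrative only, showing how quickly the required sample counts grow even under that generous assumption.

    import math

    def trials_for_estimate(epsilon, delta):
        """Hoeffding bound: number of independent pass/fail trials needed so that the
        observed failure rate is within +/- epsilon of the true rate with probability
        1 - delta. Assumes i.i.d. trials, which real missions rarely satisfy."""
        return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

    def rule_of_three_upper_bound(n_failure_free_trials):
        """Approximate 95% upper confidence bound on the failure probability after
        n consecutive failure-free trials (the 'rule of three')."""
        return 3.0 / n_failure_free_trials

    # Pinning the failure rate to within 1% with 99% confidence already needs ~26,500
    # independent trials, and 1,000 clean runs only bound the failure rate near 0.3%.
    print(trials_for_estimate(epsilon=0.01, delta=0.01))   # 26492
    print(rule_of_three_upper_bound(1000))                 # 0.003

The harder parts of the question, such as non-quantitative evidence and correlated or unbounded environments, fall outside this kind of bound, which is part of why the thresholds used by existing regulatory processes remain murky.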
Current Approaches
The group noted several potential approaches to some of these problems:
- using crowdsourcing to identify problems and improve the quality of the final system,
- development of a large simulation tool to provide a bridge between the model/design and the physical system, enabling testing of a larger portion of the intractable state space. Ideally, this approach would identify the fault conditions and enable the verification process to ground-truth the accuracy of the simulation with respect to the physical system and its environment (by testing the fault conditions in hardware and verifying that they occur), and would support developing techniques that mitigate the existing faults without introducing new ones. It would also provide a platform against which new, verification-friendly design methodologies could be tested. Various testbeds were brought up, including the NUSSRC facility at Panama City, Missy Cummings’ simulator, Webots, Gazebo, and ANVEL.
- structured experiments to develop a more principled way to define task requirements and goals
- statistical approaches (Bayesian belief networks)
- state machine-based tools for proving safety and security properties
- observation-based model generation tools that may enable the verification of complex systems by taking their internal complexity and recasting it as a simpler system based on its actual actions within its environment (a toy sketch of this idea follows this list). Instead of attempting to verify something whose programming language is inherently unverifiable, and having to redo the verification when the processor, hardware, or compiler changes, the verification is done on the observed behavior (note that connecting the model generated from the observed behavior to the actual system relies heavily on testing)
- layered specifications, where a system is certified or verified at certain levels, much like a pilot’s training certifications. Instead of going from uncertified to certified at a single step, take gradual steps from general knowledge to specific expertise – break up the problem along the capability axis rather than along system/task/environment lines or verification tool type lines
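As a toy sketch of the observation-based idea above, logged traces of the system’s behavior could be folded into a finite transition model and a simple safety property checked against that model instead of against the implementation. The state labels, traces, and unsafe set below are invented purely for illustration.

    from collections import defaultdict

    def learn_transition_system(traces):
        """Fold observed traces (sequences of abstract state labels) into a finite
        transition relation: state -> set of successor states actually observed."""
        transitions = defaultdict(set)
        for trace in traces:
            for current, nxt in zip(trace, trace[1:]):
                transitions[current].add(nxt)
        return transitions

    def reachable_states(transitions, initial):
        """All states reachable from the initial state in the learned model."""
        seen, frontier = set(), [initial]
        while frontier:
            state = frontier.pop()
            if state not in seen:
                seen.add(state)
                frontier.extend(transitions.get(state, ()))
        return seen

    # Invented traces from a hypothetical survey robot (cf. Scenario 2).
    traces = [
        ["idle", "transit", "survey_low_res", "revisit_high_res", "return", "idle"],
        ["idle", "transit", "survey_low_res", "return", "idle"],
    ]
    model = learn_transition_system(traces)
    unsafe = {"collision", "outside_survey_area"}
    assert reachable_states(model, "idle").isdisjoint(unsafe), "learned model reaches an unsafe state"

Such a model only captures behaviors that were actually observed, which is why, as noted above, connecting it back to the real system leans so heavily on testing coverage.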
In all of these cases, we need a way to define relationships between what we want the system to do and what it actually does. Equally, we need a way to ensure that the developers are not writing the tests their system will be subject to.
Including verification at design time can improve the developers’ understanding of the system and help create more efficient designs.
Way Forward
Several next steps were identified:
- A SharePoint site with wiki/blog functionality will be set up (either through NRL or through ARL/Penn State) to host the group’s shared documents and manage terminology discussions.
- Individual members of the group will work on assembling tutorials and lists of key papers to enable members to get up to speed on the relevant aspects of the disciplines with which they are unfamiliar.
- NSWC PCD will provide NRL with one or two specific publicly available examples of scenarios in which autonomous systems are asked to operate.
- NRL will take those scenarios and develop models, in order to demonstrate what a model is and how it can be created.
- Everyone will contribute papers and anything else they see fit to the SharePoint site.
- Group members will pursue funding, publication, and research opportunities related to this topic as they become available. Specific research challenges will be identified that align more closely with 6.1/basic and 6.2/applied research.
- The group will stay in the loop with respect to special sessions and workshops. Marc invited group members to attend two technical interchange meetings at Imperial College in London during the third and fourth weeks of June, one focused on academics and one focused on NATO, both addressing problems associated with verification of autonomous systems.
- The group will meet again in August, most likely in the DC area.