Dynamical modeling provides critical-state detection along with much-needed dimensionality reduction near the critical region. Reinforcement learning is then used in the reduced space to steer the system away from criticality.
Dangerous critical state transitions in complex systems can be detected and avoided through a combination of novel dynamical system modeling and reinforcement learning. These methods will allow the development of tools for sustainable, safe, and affordable operation of future deep-space and planetary exploration vehicles, stations, and habitats. These tools combine physics-based system models with sophisticated reinforcement-learning techniques to intelligently monitor system operation, to identify possible critical (i.e., catastrophic) states being approached, and to steer the system away from such states by actively applying optimally timed and shaped control signals.
Recent advances in dynamical system modeling have provided the ability to reduce the state space near a critical region so that reinforcement learning can be effective. These models take the form of a set of multi-dimensional differential equations for system evolution and measurement equations relating sensor readings y(t) to the physical system state x(t). Such a system may deviate from nominal behavior toward critical or catastrophic states because of random perturbations as well as undesirable changes in the system parameters. Transitions to catastrophic states occur as a result of an ‘unlikely’ sequence of faults and/or deviations in system parameters. Once such a rare event occurs, the system trajectory is confined to the narrow vicinity of a certain ‘most probable’ critical path. Since the degrees of freedom in this region are reduced, the dimensionality of the state space becomes smaller.
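One generic way to write a model of this kind is sketched below. The text does not give the equations, so every symbol other than x(t) and y(t) is an assumption made for illustration: f is the drift governing system evolution, g is the measurement function, theta stands for the system parameters, c(t) is the control signal, and xi(t) and eta(t) are the random perturbation and the sensor noise.

```latex
\begin{aligned}
\dot{x}(t) &= f\bigl(x(t),\, c(t);\, \theta\bigr) + \xi(t), \\
y(t)       &= g\bigl(x(t)\bigr) + \eta(t).
\end{aligned}
```

In this notation, the deviations toward critical states described above would come from rare realizations of xi(t) or from drift in theta.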
Right: Dynamical system model and critical state diagram.
Intelligently applied reinforcement learning (RL), combined with the above algorithms for statistical inference of the nonlinear dynamical models, provides increased flexibility and robustness by giving an alternative control policy for moving the system away from a critical region. This is especially important when the available data are not sufficient to identify the dynamical model with enough fidelity for deterministic control. To form the policy, temporal-difference RL tackles the temporal credit assignment problem: how a control action taken at one time-step affects the system’s criticality many time-steps into the future. This is done with temporal update rules for the function Q(x_t, c_t), which represents the expected value of the control action c_t near a critical state x_t. This function is updated based on the immediate reward observed after the control action and on estimates, drawn from previous observations, of the value of the next state entered, x′.
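The temporal-difference update described above can be sketched in a few lines. The text does not say which TD variant is used, so this sketch assumes a tabular Q-learning-style rule. The state and action labels, the learning rate alpha, and the discount factor gamma are hypothetical values chosen only for illustration.

```python
from collections import defaultdict


def td_update(Q, x, c, reward, x_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning-style temporal-difference update to Q[(x, c)].

    Assumed rule (not taken from the source):
        Q(x, c) += alpha * (reward + gamma * max_a Q(x_next, a) - Q(x, c))
    """
    # Value estimate of the next state, based on what has been learned so far.
    best_next = max(Q[(x_next, a)] for a in actions)
    # Gap between the new one-step estimate and the stored estimate.
    td_error = reward + gamma * best_next - Q[(x, c)]
    Q[(x, c)] += alpha * td_error
    return td_error


# Hypothetical example: a corrective action taken near a critical state
# leads back to nominal operation and earns a reward of 1.0.
Q = defaultdict(float)            # unseen (state, action) pairs start at 0.0
actions = ("hold", "correct")
err = td_update(Q, x="near_critical", c="correct",
                reward=1.0, x_next="nominal", actions=actions)
```

Repeating this update over many observed transitions is what propagates reward information backward in time, which is how the credit assignment problem is addressed.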
The critical element of this scheme is the ability to sample the forward dynamics at time t using the Bayesian posterior distribution over the dynamical models obtained from sensor data up to the instant t. In this case, the state transition resulting from a control action can be estimated without having to actually take the action and observe the results. Unlike traditional model-free RL, our model-enhanced RL can benefit tremendously from (i) the reduction in the number of degrees of freedom (i.e., reduced model state space) as a characteristic of approaching criticality, (ii) intermediate rewards provided by the inferred dynamical model between catastrophic failure and nominal operating conditions, and (iii) optimal estimates of future states, allowing the reinforcement learner to converge more quickly on a control policy.
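As a rough illustration of sampling the forward dynamics from a posterior, the sketch below evaluates candidate control actions without applying them to the real system. Everything specific here is an assumption: the one-dimensional reduced model dx/dt = theta * x + c, the Gaussian stand-in for the posterior over theta, the critical state at x = 1.0, and the rule of picking the control whose expected next state lies farthest from it.

```python
import random


def step(x, c, theta, dt=0.1):
    """Euler step of a hypothetical 1-D reduced model dx/dt = theta * x + c."""
    return x + dt * (theta * x + c)


def expected_next_state(x, c, theta_samples):
    """Average the predicted next state over posterior samples of theta."""
    return sum(step(x, c, th) for th in theta_samples) / len(theta_samples)


def choose_control(x, controls, theta_samples, x_critical=1.0):
    """Pick the control whose expected next state is farthest from x_critical."""
    return max(
        controls,
        key=lambda c: abs(x_critical - expected_next_state(x, c, theta_samples)),
    )


# Stand-in for a Bayesian posterior over theta inferred from sensor data.
rng = random.Random(0)
theta_samples = [rng.gauss(0.5, 0.1) for _ in range(200)]

controls = (-1.0, 0.0, 1.0)
best = choose_control(0.8, controls, theta_samples)
```

The predicted transitions could equally be fed to a temporal-difference learner as simulated experience. The point of the sketch is only that each candidate action is scored against the inferred model rather than against the physical system.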
In addition, recent results from collective reinforcement learning provide mechanisms for scaling learning and control up to very large systems of interacting (sub)systems. These collective reinforcement learning methods provide coordinated (sub)system activity while preventing each reinforcement learner from being overwhelmed by the dynamical complexities of the other (sub)systems.
In large systems designed to sustain long-term space missions, accurately modeling the dynamics of the system to pinpoint state transitions leading to catastrophic failures is a particularly challenging problem. Currently, safety is increased only at the expense of reduced efficiency (e.g., physical separation among many subsystems that should be logically connected, sequential execution of many commands) and increased cost (e.g., duplication of systems, added complexity, repeated testing).
Neither purely model-based systems nor purely adaptive systems can currently detect and avoid dangerous state transitions with accuracy. Model-based methods are often inadequate because, although they can predict general trends, they cannot predict small-scale variations (e.g., at a variable level). In addition, model-based methods often suffer from incomplete models. Adaptive, agent-based methods, in contrast, get overwhelmed by the number of variables (e.g., the number of actions that impact system performance is too large for the system to “learn” the right moves).
However, recent results suggest that while catastrophic state transitions are rare events, once they occur the system follows a relatively narrow “path” in the state space. In other words, whereas describing the normal operation of the system may require thousands of variables, once the system enters a critical state, the system’s evolution can be described by a handful of variables: a reduction in state space. This reduction