REINFORCEMENT LEARNING WITH SELF-MODIFYING POLICIES
by Juergen Schmidhuber, Jieyu Zhao, Nicol N. Schraudolph
A learner's modifiable components are called its policy. An algorithm
that modifies the policy is a learning algorithm. If the learning
algorithm has modifiable components represented as part of the policy,
then we speak of a self-modifying policy (SMP). SMPs can modify the
way they modify themselves, and so on. They are of interest in situations
where the initial learning algorithm itself can be improved by
experience - this is what we call "learning to learn". How can we
force some (stochastic) SMP to trigger better and better
self-modifications? The success-story algorithm (SSA) addresses this
question in a lifelong reinforcement learning context. During the
learner's lifetime, SSA is occasionally invoked at times computed by the
SMP itself. SSA uses backtracking to undo those
SMP-generated SMP-modifications that have not been empirically
observed to trigger lifelong reward accelerations (measured up until the
current SSA call; this also evaluates the long-term effects of
SMP-modifications that set the stage for later
SMP-modifications). SMP-modifications that survive SSA represent a
lifelong success history. Until the next SSA call, they form the
basis for additional SMP-modifications. Solely by self-modifications,
our SMP/SSA-based learners solve a complex task in a partially
observable environment (POE) whose state space is far bigger than most
reported in the POE literature.
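
For concreteness, the following is a minimal Python sketch of the SSA
backtracking idea described above, assuming the policy is stored as a
dictionary of modifiable components; the class and method names
(SSALearner, modify, ssa_call) and the checkpoint bookkeeping are
illustrative assumptions, not the authors' implementation.

    class SSALearner:
        def __init__(self, policy):
            self.policy = dict(policy)  # modifiable components, addressed by key
            self.time = 0.0             # lifelong time counter
            self.reward = 0.0           # cumulative lifelong reward
            self.stack = []             # surviving checkpoints: (t, R, undo-list)
            self.pending = []           # undo info gathered since the last checkpoint

        def step(self, dt, r):
            """Advance lifelong time and collect reward."""
            self.time += dt
            self.reward += r

        def modify(self, key, new_value):
            """SMP-generated self-modification: save undo info, then apply it."""
            if not self.pending:
                # remember when (and at what cumulative reward) this block began
                self.block_time, self.block_reward = self.time, self.reward
            self.pending.append((key, self.policy[key]))
            self.policy[key] = new_value

        def ssa_call(self):
            """SSA backtracking, invoked at times the SMP itself computes."""
            def rate(t, r):
                # reward per time since checkpoint (t, r)
                return (self.reward - r) / max(self.time - t, 1e-12)

            # Undo checkpoints whose reward/time speed-up does not beat the
            # speed-up measured since the previous surviving checkpoint.
            while self.stack:
                t, r, undo = self.stack[-1]
                prev = rate(*self.stack[-2][:2]) if len(self.stack) > 1 else rate(0.0, 0.0)
                if rate(t, r) > prev:
                    break                         # success story still holds
                for key, old_value in reversed(undo):
                    self.policy[key] = old_value  # undo those modifications
                self.stack.pop()

            # Modifications made since the last call survive for now and become
            # a new checkpoint, the basis for further self-modifications.
            if self.pending:
                self.stack.append((self.block_time, self.block_reward, self.pending))
                self.pending = []

The stack thus retains only checkpoints whose reward intake per time has
kept accelerating, mirroring the "success history" described in the
abstract; the numeric details (e.g. the 1e-12 guard) are sketch-level
choices.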