problem solving methods for reinforcement learning

Learning to Optimize with Reinforcement Learning

Since we posted our paper on “Learning to Optimize” last year, the area of optimizer learning has received growing attention. In this article, we provide an introduction to this line of work and share our perspective on the opportunities and challenges in this area.

Machine learning has enjoyed tremendous success and is being applied to a wide variety of areas, both in AI and beyond. This success can be attributed to the data-driven philosophy that underpins machine learning, which favours automatic discovery of patterns from data over manual design of systems using expert knowledge.

Yet, there is a paradox in the current paradigm: the algorithms that power machine learning are still designed manually. This raises a natural question: can we learn these algorithms instead? This could open up exciting possibilities: we could find new algorithms that perform better than manually designed algorithms, which could in turn improve learning capability.

Doing so, however, requires overcoming a fundamental obstacle: how do we parameterize the space of algorithms so that it is both (1) expressive, and (2) efficiently searchable? Various ways of representing algorithms trade off these two goals. For example, if the space of algorithms is represented by a small set of known algorithms, it most likely does not contain the best possible algorithm, but does allow for efficient searching via simple enumeration of algorithms in the set. On the other hand, if the space of algorithms is represented by the set of all possible programs, it contains the best possible algorithm, but does not allow for efficient searching, as enumeration would take exponential time.

One of the most common types of algorithms used in machine learning is continuous optimization algorithms. Several popular algorithms exist, including gradient descent, momentum, AdaGrad and Adam. We consider the problem of automatically designing such algorithms. Why do we want to do this? There are two reasons. First, many optimization algorithms are devised under the assumption of convexity and then applied to non-convex objective functions; by learning the optimization algorithm under the same setting in which it will actually be used in practice, the learned optimization algorithm could hopefully achieve better performance. Second, devising new optimization algorithms manually is usually laborious and can take months or years; learning the optimization algorithm could reduce the amount of manual labour.

Learning to Optimize

In our paper last year (Li & Malik, 2016), we introduced a framework for learning optimization algorithms, known as “Learning to Optimize”. We note that soon after our paper appeared, (Andrychowicz et al., 2016) independently proposed a similar idea.

Consider how existing continuous optimization algorithms generally work. They operate in an iterative fashion and maintain some iterate, which is a point in the domain of the objective function. Initially, the iterate is some random point in the domain; in each iteration, a step vector is computed using some fixed update formula, which is then used to modify the iterate. The update formula is typically some function of the history of gradients of the objective function evaluated at the current and past iterates. For example, in gradient descent, the update formula is some scaled negative gradient; in momentum, the update formula is some scaled exponential moving average of the gradients.
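As a small illustration of these two update formulas, here is a sketch in Python; the function and parameter names (gradient_descent_step, momentum_step, grad_fn, lr, beta) are ours, chosen for the example rather than taken from any particular library.

```python
def gradient_descent_step(x, grad_fn, lr=0.1):
    # Gradient descent: the step is a scaled negative gradient.
    return x - lr * grad_fn(x)

def momentum_step(x, v, grad_fn, lr=0.1, beta=0.9):
    # Momentum: the step is a scaled exponential moving average of past gradients.
    v = beta * v + grad_fn(x)
    return x - lr * v, v

# Example: minimize f(x) = x^2, whose gradient is 2x.
x = 5.0
for _ in range(100):
    x = gradient_descent_step(x, lambda x: 2 * x)
print(x)  # approaches the minimum at 0
```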

What changes from algorithm to algorithm is this update formula. So, if we can learn the update formula, we can learn an optimization algorithm. We model the update formula as a neural net. Thus, by learning the weights of the neural net, we can learn an optimization algorithm. Parameterizing the update formula as a neural net has two appealing properties mentioned earlier: first, it is expressive, as neural nets are universal function approximators and can in principle model any update formula with sufficient capacity; second, it allows for efficient search, as neural nets can be trained easily with backpropagation.
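A minimal sketch of this parameterization, assuming (as an illustrative choice, not the exact architecture from the paper) a tiny two-layer network that maps per-coordinate features of the gradient history to a step vector:

```python
import numpy as np

class NeuralNetUpdateFormula:
    """Update formula parameterized by a small neural net (illustrative sketch)."""

    def __init__(self, rng, hidden=8):
        # Two input features per coordinate: the current gradient and its moving average.
        self.W1 = rng.normal(scale=0.1, size=(2, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, 1))

    def step(self, x, grad, grad_avg):
        feats = np.stack([grad, grad_avg], axis=-1)   # shape (dim, 2)
        update = np.tanh(feats @ self.W1) @ self.W2   # shape (dim, 1)
        return x + update[:, 0]                       # new iterate
```

Learning the optimizer then amounts to choosing the weights W1 and W2 (for instance, starting from an instance created with np.random.default_rng(0)) so that the resulting update formula minimizes the meta-loss introduced next.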

In order to learn the optimization algorithm, we need to define a performance metric, which we will refer to as the “meta-loss”, that rewards good optimizers and penalizes bad optimizers. Since a good optimizer converges quickly, a natural meta-loss would be the sum of objective values over all iterations (assuming the goal is to minimize the objective function), or equivalently, the cumulative regret. Intuitively, this corresponds to the area under the curve, which is larger when the optimizer converges slowly and smaller otherwise.
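Concretely, the meta-loss for a single objective function could be computed as in the following sketch; the names objective, grad_fn and optimizer_step are placeholders for illustration.

```python
def meta_loss(optimizer_step, objective, grad_fn, x0, num_iters=100):
    """Sum of objective values over all iterations -- the 'area under the curve'."""
    x, total = x0, 0.0
    for _ in range(num_iters):
        x = optimizer_step(x, grad_fn(x))   # one step of the learned optimizer
        total += objective(x)
    return total
```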

Learning to Learn

Consider the special case when the objective functions are loss functions for training other models. Under this setting, optimizer learning can be used for “learning to learn”. For clarity, we will refer to the model that is trained using the optimizer as the “base-model” and prefix common terms with “base-” and “meta-” to disambiguate concepts associated with the base-model and the optimizer respectively.

What do we mean exactly by “learning to learn”? While this term has appeared from time to time in the literature, different authors have used it to refer to different things, and there is no consensus on its precise definition. Often, it is also used interchangeably with the term “meta-learning”.

The term traces its origins to the idea of metacognition (Aristotle, 350 BC), which describes the phenomenon that humans not only reason, but also reason about their own process of reasoning. Work on “learning to learn” draws inspiration from this idea and aims to turn it into concrete algorithms. Roughly speaking, “learning to learn” simply means learning something about learning. What is learned at the meta-level differs across methods. We can divide various methods into three broad categories according to the type of meta-knowledge they aim to learn:

Learning What to Learn


These methods aim to learn some particular values of base-model parameters that are useful across a family of related tasks (Thrun & Pratt, 2012). The meta-knowledge captures commonalities across the family, so that base-learning on a new task from the family can be done more quickly. Examples include methods for transfer learning, multi-task learning and few-shot learning. Early methods operate by partitioning the parameters of the base-model into two sets: those that are specific to a task and those that are common across tasks. For example, a popular approach for neural net base-models is to share the weights of the lower layers across all tasks, so that they capture the commonalities across tasks. See this post by Chelsea Finn for an overview of the more recent methods in this area.

Learning Which Model to Learn

These methods aim to learn which base-model is best suited for a task (Brazdil et al., 2008). The meta-knowledge captures correlations between different base-models and their performance on different tasks. The challenge lies in parameterizing the space of base-models in a way that is expressive and efficiently searchable, and in parameterizing the space of tasks in a way that allows for generalization to unseen tasks. Different methods make different trade-offs between expressiveness and searchability: (Brazdil et al., 2003) uses a database of predefined base-models and exemplar tasks and outputs the base-model that performed the best on the nearest exemplar task. While this space of base-models is searchable, it does not contain good but yet-to-be-discovered base-models. (Schmidhuber, 2004) represents each base-model as a general-purpose program. While this space is very expressive, searching in it takes time exponential in the length of the target program. (Hochreiter et al., 2001) views an algorithm that trains a base-model as a black-box function that maps a sequence of training examples to a sequence of predictions, and models it as a recurrent neural net. Meta-training then simply reduces to training the recurrent net. Because the base-model is encoded in the recurrent net’s memory state, its capacity is constrained by the memory size. A related area is hyperparameter optimization, which aims for a weaker goal and searches over base-models parameterized by a predefined set of hyperparameters. It needs to generalize across hyperparameter settings (and by extension, base-models), but not across tasks, since multiple trials with different hyperparameter settings on the same task are allowed.

Learning How to Learn

While methods in the previous categories aim to learn about the outcome of learning, methods in this category aim to learn about the process of learning. The meta-knowledge captures commonalities in the behaviours of learning algorithms. There are three components under this setting: the base-model, the base-algorithm for training the base-model, and the meta-algorithm that learns the base-algorithm. What is learned is not the base-model itself, but the base-algorithm, which trains the base-model on a task. Because both the base-model and the task are given by the user, the base-algorithm that is learned must work on a range of different base-models and tasks. Since most learning algorithms optimize some objective function, learning the base-algorithm in many cases reduces to learning an optimization algorithm. This problem of learning optimization algorithms was explored in (Li & Malik, 2016), (Andrychowicz et al., 2016) and a number of subsequent papers. Closely related to this line of work is (Bengio et al., 1991), which learns a Hebb-like synaptic learning rule. The learning rule depends on a subset of the dimensions of the current iterate that encode the activities of neighbouring neurons, but it does not depend on the objective function and therefore cannot generalize to different objective functions.

Generalization

Learning of any sort requires training on a finite number of examples and generalizing to the broader class from which the examples are drawn. It is therefore instructive to consider what the examples and the class correspond to in our context of learning optimizers for training base-models. Each example is an objective function, which corresponds to the loss function for training a base-model on a task. The task is characterized by a set of examples and target predictions, or in other words, a dataset, that is used to train the base-model. The meta-training set consists of multiple objective functions and the meta-test set consists of different objective functions drawn from the same class. Objective functions can differ in two ways: they can correspond to different base-models, or different tasks. Therefore, generalization in this context means that the learned optimizer works on different base-models and/or different tasks.

Why is generalization important?

Suppose for a moment that we didn’t care about generalization. In this case, we would evaluate the optimizer on the same objective functions used to train it. If we used only one objective function, then the best optimizer would be one that simply memorizes the optimum: this optimizer always converges to the optimum in one step regardless of initialization. In our context, the objective function corresponds to the loss for training a particular base-model on a particular task, and so this optimizer essentially memorizes the optimal weights of the base-model. Even if we used many objective functions, the learned optimizer could still try to identify which objective function it is operating on and jump to the memorized optimum as soon as it does.

Why is this problematic? Memorizing the optima requires finding them in the first place, and so learning an optimizer takes longer than running a traditional optimizer like gradient descent. So, for the purposes of finding the optima of the objective functions at hand, running a traditional optimizer would be faster. Consequently, it would be pointless to learn the optimizer if we didn’t care about generalization.

Therefore, for the learned optimizer to have any practical utility, it must perform well on new objective functions that are different from those used for training.

What should be the extent of generalization?

If we only aim for generalization to similar base-models on similar tasks, then the learned optimizer could memorize parts of the optimal weights that are common across the base-models and tasks, like the weights of the lower layers in neural nets. This would be essentially the same as learning-what-to-learn formulations like transfer learning.

Unlike learning what to learn, the goal of learning how to learn is to learn not what the optimum is, but how to find it. We must therefore aim for a stronger notion of generalization, namely generalization to similar base-models on dissimilar tasks. An optimizer that can generalize to dissimilar tasks cannot just partially memorize the optimal weights, as the optimal weights for dissimilar tasks are likely completely different. For example, even the lower layer weights of neural nets trained on MNIST (a dataset consisting of black-and-white images of handwritten digits) and CIFAR-10 (a dataset consisting of colour images of common objects in natural scenes) are unlikely to have anything in common.

Should we aim for an even stronger form of generalization, that is, generalization to dissimilar base-models on dissimilar tasks? Since these correspond to objective functions that bear no similarity to objective functions used for training the optimizer, this is essentially asking if the learned optimizer should generalize to objective functions that could be arbitrarily different.

It turns out that this is impossible. Given any optimizer, we consider the trajectory followed by the optimizer on a particular objective function. Because the optimizer only relies on information at the previous iterates, we can modify the objective function at the last iterate to make it arbitrarily bad while maintaining the geometry of the objective function at all previous iterates. Then, on this modified objective function, the optimizer would follow the exact same trajectory as before and end up at a point with a bad objective value. Therefore, any optimizer has objective functions that it performs poorly on and no optimizer can generalize to all possible objective functions.

If no optimizer is universally good, can we still hope to learn optimizers that are useful? The answer is yes: since in practice we are typically interested in optimizing functions from certain special classes, it is possible to learn optimizers that work well on these classes of interest. The objective functions in a class can share regularities in their geometry, e.g., they might have in common certain geometric properties like convexity, piecewise linearity, Lipschitz continuity or other unnamed properties. In the context of learning-how-to-learn, each class can correspond to a type of base-model. For example, neural nets with ReLU activation units can form one class, as they are all piecewise linear. Note that when learning the optimizer, there is no need to explicitly characterize the form of geometric regularity, as the optimizer can learn to exploit it automatically when trained on objective functions from the class.

How to Learn the Optimizer

The first approach we tried was to treat the problem of learning optimizers as a standard supervised learning problem: we simply differentiate the meta-loss with respect to the parameters of the update formula and learn these parameters using standard gradient-based optimization. (We weren’t the only ones to have thought of this; (Andrychowicz et al., 2016) also used a similar approach.)

This seemed like a natural approach, but it did not work: despite our best efforts, we could not get any optimizer trained in this manner to generalize to unseen objective functions, even though they were drawn from the same distribution that generated the objective functions used to train the optimizer. On almost all unseen objective functions, the learned optimizer started off reasonably, but then quickly diverged. On the training objective functions, on the other hand, it exhibited no such issues and did quite well. Why is this?

It turns out that optimizer learning is not as simple a learning problem as it appears. Standard supervised learning assumes all training examples are independent and identically distributed (i.i.d.); in our setting, the step vector the optimizer takes at any iteration affects the gradients it sees at all subsequent iterations. Furthermore, how the step vector affects the gradient at the subsequent iteration is not known, since this depends on the local geometry of the objective function, which is unknown at meta-test time. Supervised learning cannot operate in this setting, and must assume that the local geometry of an unseen objective function is the same as the local geometry of training objective functions at all iterations.

Consider what happens when an optimizer trained using supervised learning is used on an unseen objective function. It takes a step, and discovers at the next iteration that the gradient is different from what it expected. It then recalls what it did on the training objective functions when it encountered such a gradient, which could have happened in a completely different region of the space, and takes a step accordingly. To its dismay, it finds out that the gradient at the next iteration is even more different from what it expected. This cycle repeats and the error the optimizer makes becomes bigger and bigger over time, leading to rapid divergence.

This phenomenon is known in the literature as the problem of compounding errors. It is known that the total error of a supervised learner scales quadratically in the number of iterations, rather than linearly as would be the case in the i.i.d. setting (Ross and Bagnell, 2010). In essence, an optimizer trained using supervised learning necessarily overfits to the geometry of the training objective functions. One way to solve this problem is to use reinforcement learning.
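Stated roughly, in the spirit of the analysis in (Ross and Bagnell, 2010): if the learned policy deviates from the behaviour seen during training with probability at most $\epsilon$ on the states it was trained on, then over a horizon of $T$ iterations its expected total cost $J$ can degrade as

$$ J(\hat{\pi}) \;\le\; J(\pi^{*}) + T^{2}\epsilon, $$

whereas under the i.i.d. assumption the excess cost would grow only on the order of $T\epsilon$.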

Background on Reinforcement Learning

Consider an environment that maintains a state, which evolves in an unknown fashion based on the action that is taken. We have an agent that interacts with this environment: it sequentially selects actions and, after each action is taken, receives feedback on how good or bad the new state is. The goal of reinforcement learning is to find a way for the agent to pick actions based on the current state that lead to good states on average.

More precisely, a reinforcement learning problem is characterized by the following components:

  • A state space, which is the set of all possible states,
  • An action space, which is the set of all possible actions,
  • A cost function, which measures how bad a state is,
  • A time horizon, which is the number of time steps,
  • An initial state probability distribution, which specifies how frequently different states occur at the beginning before any action is taken, and
  • A state transition probability distribution, which specifies how the state changes (probabilistically) after a particular action is taken.

While the learning algorithm is aware of what the first five components are, it does not know the last component, i.e., how states evolve based on the actions that are chosen. At training time, the learning algorithm is allowed to interact with the environment. Specifically, at each time step, it can choose an action to take based on the current state. Then, based on the action that is selected and the current state, the environment samples a new state, which is observed by the learning algorithm at the subsequent time step. The sequence of sampled states and actions is known as a trajectory. This sampling procedure induces a distribution over trajectories, which depends on the initial state and transition probability distributions and on the way the action is selected based on the current state; the latter is known as a policy. This policy is often modelled as a neural net that takes in the current state as input and outputs the action. The goal of the learning algorithm is to find a policy such that the expected cumulative cost of states over all time steps is minimized, where the expectation is taken with respect to the distribution over trajectories.
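A minimal sketch of this interaction loop, with placeholder names (sample_initial_state, transition, cost, policy) standing in for the components listed above; in practice the learner never gets to look inside transition, it only observes the states that come back.

```python
def rollout(sample_initial_state, transition, cost, policy, horizon):
    """Sample one trajectory and return it along with its cumulative cost."""
    state = sample_initial_state()           # drawn from the initial state distribution
    total_cost = cost(state)
    trajectory = [state]
    for _ in range(horizon):                 # time horizon
        action = policy(state)               # the policy maps the current state to an action
        state = transition(state, action)    # sampled from the (unknown) transition distribution
        total_cost += cost(state)            # the cost function measures how bad a state is
        trajectory.append((action, state))
    return trajectory, total_cost
```

The goal of the learning algorithm is then to find a policy whose expected total cost, averaged over many such rollouts, is as small as possible.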

Formulation as a Reinforcement Learning Problem

Recall the learning framework we introduced above, where the goal is to find the update formula that minimizes the meta-loss. Intuitively, we think of the agent as an optimization algorithm and the environment as being characterized by the family of objective functions that we’d like to learn an optimizer for. The state consists of the current iterate and some features along the optimization trajectory so far, which could be some statistic of the history of gradients, iterates and objective values. The action is the step vector that is used to update the iterate.

Under this formulation, the policy is essentially a procedure that computes the action, which is the step vector, from the state, which depends on the current iterate and the history of gradients, iterates and objective values. In other words, a particular policy represents a particular update formula. Hence, learning the policy is equivalent to learning the update formula, and hence the optimization algorithm. The initial state probability distribution is the joint distribution of the initial iterate, gradient and objective value. The state transition probability distribution characterizes what the next state is likely to be given the current state and action. Since the state contains the gradient and objective value, the state transition probability distribution captures how the gradient and objective value are likely to change for any given step vector. In other words, it encodes the likely local geometries of the objective functions of interest. Crucially, the reinforcement learning algorithm does not have direct access to this state transition probability distribution, and therefore the policy it learns avoids overfitting to the geometry of the training objective functions.

We choose the cost of a state to be the value of the objective function evaluated at the current iterate. Because reinforcement learning minimizes the cumulative cost over all time steps, it essentially minimizes the sum of objective values over all iterations, which is exactly the meta-loss.
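To make the correspondence concrete, here is a sketch of a single environment step when the environment is built from an objective function f; the names below are placeholders chosen for illustration, not the exact formulation used in the papers.

```python
def optimizer_env_step(f, grad_f, x, features, step_vector):
    """One step of the optimizer-learning MDP (illustrative sketch).

    State  : the current iterate x plus features of the trajectory so far.
    Action : the step vector proposed by the policy (i.e. the update formula).
    Cost   : the objective value at the new iterate; summed over iterations,
             this recovers the meta-loss.
    """
    x_new = x + step_vector
    features = features + [(grad_f(x_new), f(x_new))]   # history of gradients and objective values
    cost = f(x_new)
    return (x_new, features), cost
```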

We trained an optimization algorithm on the problem of training a neural net on MNIST, and tested it on the problems of training different neural nets on the Toronto Faces Dataset (TFD), CIFAR-10 and CIFAR-100. These datasets bear little similarity to each other: MNIST consists of black-and-white images of handwritten digits, TFD consists of grayscale images of human faces, and CIFAR-10/100 consist of colour images of common objects in natural scenes. It is therefore unlikely that a learned optimization algorithm could get away with memorizing, say, the lower layer weights on MNIST and still do well on TFD and CIFAR-10/100.

As shown, the optimization algorithm trained using our approach on MNIST (shown in light red) generalizes to TFD, CIFAR-10 and CIFAR-100 and outperforms other optimization algorithms.

To understand the behaviour of optimization algorithms learned using our approach, we trained an optimization algorithm on two-dimensional logistic regression problems and visualized its trajectory in the space of the parameters. It is worth noting that the behaviours of optimization algorithms in low dimensions and high dimensions may be different, and so the visualizations below may not be indicative of the behaviours of optimization algorithms in high dimensions. However, they provide some useful intuitions about the kinds of behaviour that can be learned.

The plots above show the optimization trajectories followed by various algorithms on two different unseen logistic regression problems. Each arrow represents one iteration of an optimization algorithm. As shown, the algorithm learned using our approach (shown in light red) takes much larger steps compared to other algorithms. In the first example, because the learned algorithm takes large steps, it overshoots after two iterations, but does not oscillate and instead takes smaller steps to recover. In the second example, due to vanishing gradients, traditional optimization algorithms take small steps and therefore converge slowly. On the other hand, the learned algorithm takes much larger steps and converges faster.

More details can be found in our papers:

Learning to Optimize. Ke Li, Jitendra Malik. arXiv:1606.01885, 2016, and International Conference on Learning Representations (ICLR), 2017.

Learning to Optimize Neural Nets. Ke Li, Jitendra Malik. arXiv:1703.00441, 2017.

I’d like to thank Jitendra Malik for his valuable feedback.


Perspective. Published: 25 January 2022

Intelligent problem-solving as integrated hierarchical reinforcement learning

Manfred Eppe, Christian Gumbsch, Matthias Kerzel, Phuong D. H. Nguyen, Martin V. Butz & Stefan Wermter

Nature Machine Intelligence, volume 4, pages 11–20 (2022)


Subjects: Cognitive control, Computational models, Computer science, Learning algorithms, Problem solving

According to cognitive psychology and related disciplines, the development of complex problem-solving behaviour in biological agents depends on hierarchical cognitive mechanisms. Hierarchical reinforcement learning is a promising computational approach that may eventually yield comparable problem-solving behaviour in artificial agents and robots. However, so far, the problem-solving abilities of many human and non-human animals are clearly superior to those of artificial systems. Here we propose steps to integrate biologically inspired hierarchical mechanisms to enable advanced problem-solving skills in artificial agents. We first review the literature in cognitive psychology to highlight the importance of compositional abstraction and predictive processing. Then we relate the gained insights with contemporary hierarchical reinforcement learning methods. Interestingly, our results suggest that all identified cognitive mechanisms have been implemented individually in isolated computational architectures, raising the question of why there exists no single unifying architecture that integrates them. As our final contribution, we address this question by providing an integrative perspective on the computational challenges to develop such a unifying architecture. We expect our results to guide the development of more sophisticated cognitively inspired hierarchical machine learning architectures.




Cite this article: Eppe, M., Gumbsch, C., Kerzel, M. et al. Intelligent problem-solving as integrated hierarchical reinforcement learning. Nature Machine Intelligence 4, 11–20 (2022). https://doi.org/10.1038/s42256-021-00433-9


What is Reinforcement Learning in AI?

  • Written by John Terra
  • Updated on May 21, 2024


Decision-making is often a tricky thing. If you make the wrong decision, you inevitably suffer consequences. Eventually, through experience, we learn to take actions that offer the best outcomes while avoiding negative results. Machines can be trained to do this, too. It’s called reinforcement learning.

This article answers the question, “What is reinforcement learning in AI?” We will define the term and show how reinforcement learning works, including its uses, benefits, and challenges. We will also explore commonly used terms in reinforcement learning as well as its pros and cons. We will round things out by speculating about its future and sharing an online AI and machine learning bootcamp you can take to boost your career in this exciting field.

So, let’s begin. What is reinforcement learning in AI?

What is Reinforcement Learning?

Reinforcement learning (RL) is a sub-category of Machine Learning that trains a model via trial and error to learn optimal behavior and devise the optimal solution for a problem by making a sequence of decisions.

In essence, reinforcement learning is the science of decision-making: it optimizes AI-driven systems by imitating natural intelligence and emulating human cognition, without human interaction and without the need for explicitly programmed AI systems.

In RL, data is accumulated from machine learning systems using trial-and-error methods. Reinforcement learning employs algorithms that learn from outcomes and decide what actions to take next. After each such action, the algorithm receives feedback that helps it decide whether the choice was correct, neutral, or incorrect.

So, reinforcement learning in AI is an autonomous, self-teaching system that learns by trial and error without humans getting involved.


Essential Terms in Reinforcement Learning

Here are some terms often encountered when working with reinforcement learning; a short code sketch tying them together follows the list.

  • Agent. The agent is the model being trained through reinforcement learning.
  • Environment. The environment is the training situation that the model must optimize.
  • Action. The action covers all possible steps the model can take.
  • State. The state is the current position or condition returned by the environment.
  • Reward. The model is rewarded with points for moving in the right direction; rewards appraise the actions the agent takes.
  • Policy. The policy determines how the agent behaves at any given time, acting as a mapping from the present state to an action.
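A minimal sketch tying these terms together; the class and method names below are illustrative placeholders, not taken from any particular RL library.

```python
class Environment:
    """The training situation the agent must optimize."""
    def reset(self):
        """Return the initial state."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action; return (next_state, reward)."""
        raise NotImplementedError


class Agent:
    """The model being trained through reinforcement learning."""
    def policy(self, state):
        """Map the present state to an action."""
        raise NotImplementedError

    def update(self, state, action, reward, next_state):
        """Learn from the reward feedback after each action."""
        raise NotImplementedError
```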

How Does Reinforcement Learning Work?

The reinforcement learning workflow encompasses training the agent while keeping the following key factors in mind:

  • Environment
  • Rewards
  • Agent
  • Training and validation
  • Policy deployment

Let’s understand each one in detail.

Step 1: Define and create the environment

The reinforcement learning process starts by defining the environment in which the agent operates. The environment can refer to an actual physical system or a simulated environment. Once you determine the environment, you can begin experimentation for the RL process.

Step 2: Specify rewards

In the next phase, you must define the reward for the agent. The reward acts as the agent’s performance metric and lets the agent evaluate its performance on the task against its goals. Additionally, crafting an appropriate reward may require several experimental iterations to finalize the right one for a specific action.

Step 3: Define the agent

Once you finalize the environment and rewards, you can define and create the agent, specifying its policy and the reinforcement learning training algorithm. The process typically includes these two steps:

  • Using the appropriate lookup tables or neural networks to represent the policy
  • Selecting the suitable RL training algorithm

Step 4: Train and validate the agent

Next, you train and validate the agent to fine-tune the training policy. You may also need to revisit the reward structure and the RL policy architecture design and continue the training process. Be aware that reinforcement learning training is time-intensive and can take anywhere from minutes to days, depending on the end application. You can achieve faster training for a complex set of applications by employing a system architecture in which several GPUs, CPUs, and computing systems run in parallel.

Step 5: Implement the policy

The policy in an RL-enabled system serves as the decision-making component; it can be deployed using C, C++, or CUDA development code. While implementing the policy, you may need to revisit the initial stages of the RL workflow; this is sometimes necessary when optimal results or decisions aren’t achieved. The following factors may need fine-tuning, followed by retraining the agent (a minimal end-to-end sketch of the whole workflow follows the list):

  • Action/state signal detection
  • Environmental variables
  • Policy framework
  • RL algorithm configuration
  • Reward definition
  • Training structure
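As an end-to-end illustration of the workflow above, here is a minimal sketch: a toy corridor environment and reward (steps 1–2), a lookup-table policy (step 3), a Q-learning training loop (step 4), and deployment of the greedy policy (step 5). Everything here is simplified for illustration and is not taken from any specific RL library.

```python
import random

N_STATES, GOAL = 5, 4                         # Step 1: a tiny corridor environment
def env_step(state, action):                  # action: 0 = move left, 1 = move right
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == GOAL else 0.0      # Step 2: reward only for reaching the goal
    return nxt, reward, nxt == GOAL

Q = [[0.0, 0.0] for _ in range(N_STATES)]     # Step 3: a lookup-table (tabular) policy
alpha, gamma, epsilon = 0.1, 0.9, 0.3

for _ in range(500):                          # Step 4: train the agent by trial and error
    s, done, t = 0, False, 0
    while not done and t < 100:               # cap episode length
        greedy = int(Q[s][1] > Q[s][0])
        a = random.randrange(2) if random.random() < epsilon else greedy
        s2, r, done = env_step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])   # Q-learning update
        s, t = s2, t + 1

policy = [int(Q[s][1] > Q[s][0]) for s in range(N_STATES)]      # Step 5: deploy the policy
print(policy)   # states before the goal should settle on 'move right' (1)
```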


The Various Uses for Reinforcement Learning

Reinforcement learning is designed to maximize the rewards agents earn as they accomplish a specific task. Consequently, RL benefits several real-life applications and scenarios, including self-driving cars, surgical robots, robotics, and AI bots.

Here are some critical reinforcement learning uses in our daily lives that shape the artificial intelligence field:

  • Addressing energy consumption problems. RL agents with no prior knowledge of server conditions can learn to control the physical parameters surrounding an organization’s servers. The necessary data is acquired via multiple sensors that collect power, temperature, and other measurements and is used to train deep neural networks, which in turn help cool data centers and regulate overall energy consumption.
  • Controlling self-driving cars. To operate autonomously in a city, vehicles need substantial support from ML models that simulate the scenarios the vehicle may encounter. Reinforcement learning shines here because these models must be trained in a dynamic environment, where possible pathways are explored and ranked through the learning process. Learning from experience makes reinforcement learning well suited to self-driving cars, which must quickly make near-optimal decisions. RL methods can handle multiple variables at once, such as negotiating traffic, managing driving zones, monitoring vehicle speed, and avoiding collisions.
  • Gaming. Reinforcement learning agents learn and adapt to gaming environments by applying logic gained from experience, achieving the desired results through a sequence of steps. Google DeepMind’s AlphaGo defeated a professional Go master in October 2015, for example. Beyond game-playing systems like AlphaGo that employ deep neural networks, reinforcement learning agents are also used for bug detection and game testing within gaming environments.
  • Healthcare. Reinforcement learning is valuable in healthcare as DTRs (Dynamic Treatment Regimes) have aided medical professionals in handling patients’ health. DTRs employ a sequence of decisions to generate a final solution. The sequential process typically involves these steps:
  • Determining the patient’s current status.
  • Deciding the type of treatment.
  • Discovering the appropriate medication dosage based on the patient’s condition.
  • Deciding dosage timings and other related variables.

Doctors can use this sequence of decisions to fine-tune their patient treatment strategies and diagnose complex diseases such as cancer or diabetes. In addition, DTRs can further help provide treatments at the correct time, avoiding complications that may arise from delayed actions.

  • Marketing. Reinforcement learning helps organizations maximize customer growth and streamline business strategies to achieve long-term goals. In marketing, RL helps professionals make personalized recommendations by predicting users’ behavior, choices, and reactions to specific products or services. Trained agents also account for variables such as a shifting customer mindset, dynamically learning changing user requirements from behavior. In this way, reinforcement learning lets businesses deliver high-quality recommendations and maximize profit margins.
  • Robotics. Robotics trains robots to mimic human behavior while performing a given task. However, today’s robots lack social, moral, and common-sense reasoning while accomplishing these jobs. In such cases, AI sub-fields like RL and deep learning can be combined (deep reinforcement learning) to achieve better results. For example, deep RL is vital for robots that handle warehouse navigation, supplying critical product parts, defect inspection, packaging, and assembly. Additionally, RL models can be trained on multimodal data to identify cracks, scratches, missing parts, and overall damage to warehouse machines by scanning images containing billions of data points. Deep RL also helps in inventory management, since agents can be trained to detect empty containers and immediately restock them.
  • Traffic signal controls. Reinforcement learning offers a possible solution to increased urbanization and rising automobile use, as RL models introduce traffic light control based on an area’s traffic status. The model considers traffic from different directions, then adapts, learns, and adjusts traffic light signals.

Reinforcement Learning vs. Supervised Learning vs. Unsupervised Learning

The key differences between the three primary machine learning sub-branches can be summarized as follows:

  • Supervised learning. The model learns from labeled examples, mapping inputs to known outputs; typical tasks are classification and regression.
  • Unsupervised learning. The model finds structure in unlabeled data, such as clusters or lower-dimensional representations.
  • Reinforcement learning. An agent learns by interacting with an environment, receiving rewards or penalties, and adjusting its policy to maximize long-term reward.

Reinforcement Learning Challenges

Although reinforcement learning algorithms have successfully solved complex problems in many simulated environments, their adoption in the real world has been slow. Here are some of the implementation obstacles RL faces:

  • An RL agent requires extensive experience. RL methods generate training data autonomously through interaction with the environment, so the rate of data collection is limited by the environment’s dynamics, and environments with high latency slow down learning. In addition, in complex environments with high-dimensional state spaces, extensive exploration is required before a good solution can be found.
  • Delayed rewards. Learning agents can trade off short-term rewards for long-term gains. Although this foundational principle makes reinforcement learning useful, it also makes it hard for the agent to discover optimal policies, particularly in environments where many sequential actions must be taken before the outcome is observed. Assigning credit to earlier actions is difficult and introduces significant variance during training.
  • A lack of interpretability. Once a reinforcement learning agent has learned the optimal policy and is deployed, it acts based on its experience, and the reasons for its actions may be hidden from an outside observer. This lack of interpretability makes it harder to build trust between the agent and the observer.


The Advantages and Disadvantages of Reinforcement Learning

Reinforcement learning has its share of pros and cons. For example:

Advantages of Reinforcement Learning

  • Reinforcement learning can be employed to tackle a diverse array of problems, including decision-making, control, and optimization
  • Reinforcement learning can solve complicated problems that conventional problem-solving techniques can’t otherwise solve
  • RL models can correct errors that happen during a training process
  • Reinforcement learning can handle non-deterministic environments, meaning the actions’ outcomes aren’t always predictable. This is especially helpful in real-world applications where the environment is uncertain or could change over time.
  • Reinforcement learning is a flexible problem-solving approach that can improve performance when used in conjunction with additional machine learning techniques, like deep learning.

Disadvantages of Reinforcement Learning

  • There are better choices than reinforcement learning for solving simple problems
  • Reinforcement learning requires a lot of data and computation
  • Reinforcement learning relies heavily on the quality of the reward function; if the reward function is poorly designed, the agent might not learn the desired behavior
  • Reinforcement learning can be challenging to debug and interpret. It’s not always apparent why an agent acts in a certain way, which can make diagnosing and resolving issues more difficult.

What is the Future of Reinforcement Learning?

Deep reinforcement learning employs deep neural networks to model the value function (“value-based” methods), the agent’s policy (“policy-based” methods), or both (“actor-critic” methods). Before the widespread success of deep neural networks, data scientists had to engineer complex features to train an RL algorithm, which reduced learning capacity and limited reinforcement learning to simple environments. With deep learning, however, models can be built with millions of trainable weights, freeing the user from tedious feature engineering. Relevant features are instead generated automatically during training, enabling the agent to learn good policies in complex environments.

Traditionally, reinforcement learning in AI is applied to one task at a time, with each task learned by a separate RL agent. These agents don’t share knowledge, which makes learning complex behaviors, such as driving a car, slow and inefficient. For problems that share a common information source, have related underlying structure, or are interdependent, performance can be boosted significantly by allowing multiple agents to work together. A3C (Asynchronous Advantage Actor-Critic) is an exciting development in this area, in which multiple agents concurrently learn related tasks. This multi-task learning scenario is gradually driving RL closer to Artificial General Intelligence (AGI), where meta-agents learn how to learn, making problem-solving more autonomous than ever.

Do You Want Training in Artificial Intelligence and Machine Learning?

Artificial Intelligence and Machine Learning are dynamic, exciting fields that offer much potential. If these disciplines sound like something you would like to explore further, consider this comprehensive program in AI and machine learning .

This online course delivers a high-engagement learning experience that teaches Python, Natural Language Processing, Machine Learning, and much more. According to Indeed.com , machine learning engineers earn an average yearly salary of $166,572. So, if you want to move into a more challenging, cutting-edge career that provides security and generous compensation, check out this online AI/ML bootcamp and prepare your skills to face the exciting challenges of today’s Machine Learning revolution.

Q: What are some examples of reinforcement learning in AI? A: Examples include:

  • Self-driving cars
  • Industry automation
  • Improved Natural Language Processing (NLP)

Q: What are the benefits of reinforcement learning in AI? A: Benefits include:

  • Quicker understanding
  • Reduced expenses
  • Better decision making

Q: What is the importance of reinforcement learning? A: This technology lets computers learn from vast data sets faster and with better results, a vital function in our increasingly data-saturated world.




Monte Carlo Methods in Reinforcement Learning

Recall that when using Dynamic Programming algorithms to solve RL problems, we assumed complete knowledge of the environment. With Monte Carlo methods, we only require experience: sample sequences of states, actions, and rewards from simulated or real interaction with an environment.

Monte Carlo Methods #

Monte Carlo , named after a casino in Monaco, simulates complex probabilistic events using simple random events, such as tossing a pair of dice to simulate the casino’s overall business model.

Monte Carlo methods have been used in several different tasks (a small numerical example of Monte Carlo integration follows this list):

  • Simulating a system and its probability distribution \begin{equation} x\sim\pi(x) \end{equation}
  • Estimating a quantity through Monte Carlo integration \begin{equation} c=\mathbb{E}_\pi\left[f(x)\right]=\int\pi(x)f(x)\,dx \end{equation}
  • Optimizing a target function to find its modes (maxima or minima) \begin{equation} x^*=\text{argmax}\,\pi(x) \end{equation}
  • Learning parameters from a training set by optimizing some loss function, such as maximum likelihood estimation from a set of examples $\{x_i,i=1,2,\dots,M\}$ \begin{equation} \Theta^*=\text{argmax}\sum_{i=1}^{M}\log p(x_i;\Theta) \end{equation}
  • Visualizing the energy landscape of a target function.
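As a small illustration of Monte Carlo integration, the sketch below estimates $c=\mathbb{E}_\pi\left[f(x)\right]$ for an assumed target $\pi=\mathcal{N}(0,1)$ and $f(x)=x^2$ (so the true value is $1$); both choices are ours, purely for illustration.

```python
import random

# Monte Carlo integration: estimate E_pi[f(x)] by averaging f over samples x ~ pi.
# Here pi is a standard normal and f(x) = x^2, so the true expectation is 1.
def f(x):
    return x * x

M = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(M)]
estimate = sum(f(x) for x in samples) / M
print(f"MC estimate of E[x^2] under N(0,1): {estimate:.4f} (true value 1.0)")
```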

Monte Carlo Methods in Reinforcement Learning #

Monte Carlo ( MC ) methods are ways of solving the reinforcement learning problem based on averaging sample returns. Here, we define Monte Carlo methods only for episodic tasks; in other words, they learn from complete episodes of experience.

Monte Carlo Prediction 1 #

Since the value of a state $v_\pi(s)=\mathbb{E}_\pi\left[G_t|S_t=s\right]$ is defined as the expected return when the process starts from the given state $s$, an obvious way of estimating this value from experience is to average the returns observed after visits to that state. As more returns are observed, the average should converge to the expected value. This is an instance of the so-called Monte Carlo method .

In particular, suppose we wish to estimate $v_\pi(s)$ given a set of episodes obtained by following $\pi$ and passing through $s$. Each time state $s$ appears in an episode, we call it a visit to $s$. There are two types of Monte Carlo methods:

  • First-visit MC : the first time $s$ is visited in an episode is referred to as the first visit to $s$; the method estimates $v_\pi(s)$ as the average of the returns that have followed first visits to $s$.
  • Every-visit MC : the method estimates $v_\pi(s)$ as the average of the returns that have followed all visits to $s$.

The sample mean return for state $s$ is computed as \begin{equation} v_\pi(s)=\dfrac{\sum_{t=1}^{T}𝟙\left(S_t=s\right)G_t}{\sum_{t=1}^{T}𝟙\left(S_t=s\right)}, \end{equation} where $𝟙(\cdot)$ is an indicator function. In the case of first-visit MC , $𝟙\left(S_t=s\right)$ returns $1$ only the first time $s$ is encountered in an episode. For every-visit MC , $𝟙\left(S_t=s\right)$ returns $1$ every time $s$ is visited.

Following is a sketch of first-visit MC prediction , for estimating $V\approx v_\pi$.
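A minimal Python sketch of this procedure, assuming a hypothetical `generate_episode(policy)` helper that returns one complete episode as a list of `(state, action, reward)` tuples:

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V ~ v_pi by averaging returns following first visits to each state."""
    returns_sum = defaultdict(float)   # cumulative first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)          # [(S_0, A_0, R_1), (S_1, A_1, R_2), ...]
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Work backwards so G accumulates the return from each time step onward.
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:                 # first visit to s in this episode
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```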

First-visit MC vs. every-visit MC #

Both methods converge to $v_\pi(s)$ as the number of visits (or first visits) to $s$ goes to infinity. In first-visit MC, each return is an independent, unbiased estimate of $v_\pi(s)$, and the standard deviation of the error of the average falls as $\frac{1}{\sqrt{n}}$, where $n$ is the number of returns averaged; every-visit MC is biased for finite $n$ but also converges asymptotically.

Monte Carlo Control 2 #

Monte Carlo estimation of action values #

When a model is not available, it is particularly useful to estimate action values rather than state values (which alone are insufficient to determine a policy). We must explicitly estimate the value of each action for the values to be useful in suggesting a policy. Thus, one of our primary goals for MC methods is to estimate $q_*$. To achieve this, we first consider the policy evaluation problem for action values.

As when using MC methods to estimate $v_\pi(s)$, we can use both first-visit MC and every-visit MC to approximate $q_\pi(s,a)$. The only thing to keep in mind is that, in this case, we work with visits to a state-action pair rather than to a state. Accordingly, we define two types of MC methods for estimating $q_\pi(s,a)$:

  • First-visit MC : estimates $q_\pi(s,a)$ as the average of the returns following the first time in each episode that the state $s$ was visited and the action $a$ was selected.
  • Every-visit MC : estimates $q_\pi(s,a)$ as the average of the returns that have followed all the visits to state-action pair $(s,a)$.

Exploring Starts #

Here, however, we must maintain exploration : many state-action pairs may never be visited, and if $\pi$ is a deterministic policy, returns will be observed for only one action in each state. As a consequence, the other actions will not be evaluated, since there are no returns to average.

One way to achieve this is called exploring starts : we assume that episodes start in a state-action pair , and that every pair has a nonzero probability of being selected as the start. This assumption assures that all state-action pairs will be visited an infinite number of times in the limit of an infinite number of episodes.

Monte Carlo Policy Iteration #

To learn the optimal policy by MC, we apply the idea of GPI : \begin{equation} \pi_0\overset{\small \text{E}}{\rightarrow}q_{\pi_0}\overset{\small \text{I}}{\rightarrow}\pi_1\overset{\small \text{E}}{\rightarrow}q_{\pi_1}\overset{\small \text{I}}{\rightarrow}\pi_2\overset{\small \text{E}}{\rightarrow}\dots\overset{\small \text{I}}{\rightarrow}\pi_*\overset{\small \text{E}}{\rightarrow}q_* \end{equation} Specifically,

  • Policy evaluation (denoted $\overset{\small\text{E}}{\rightarrow}$): estimates the action value function $q_\pi(s,a)$ using the episodes generated from $s, a$ by following the current policy $\pi$ \begin{equation} q_\pi(s,a)=\dfrac{\sum_{t=1}^{T}𝟙\left(S_t=s,A_t=a\right)G_t}{\sum_{t=1}^{T}𝟙\left(S_t=s,A_t=a\right)} \end{equation}
  • Policy improvement (denoted $\overset{\small\text{I}}{\rightarrow}$): makes the policy greedy w.r.t the current value function (the action value function in this case) \begin{equation} \pi(s)\doteq\underset{a\in\mathcal{A}(s)}{\text{argmax}}\,q(s,a) \end{equation} The policy improvement can be done by constructing each $\pi_{k+1}$ as the greedy policy w.r.t $q_{\pi_k}$ because \begin{align} q_{\pi_k}\left(s,\pi_{k+1}(s)\right)&=q_{\pi_k}\left(s,\underset{a}{\text{argmax}}\,q_{\pi_k}(s,a)\right) \\ &=\max_a q_{\pi_k}(s,a) \\ &\geq q_{\pi_k}\left(s,\pi_k(s)\right) \\ &\geq v_{\pi_k}(s) \end{align} Therefore, by the policy improvement theorem , we have that $\pi_{k+1}\geq\pi_k$.

To solve this problem with Monte Carlo policy iteration, the authors of the 1998 edition of “ Reinforcement Learning: An Introduction ” introduced Monte Carlo ES ( MCES ), for Monte Carlo with exploring starts .

In MCES, the value function is approximated from simulated returns and a greedy policy is selected at each iteration. Although MCES does not converge to any sub-optimal policy, its convergence to the optimal fixed point is still an open question. For results in particular settings, see Tsitsiklis (2002), Chen (2018), and Liu (2020). Below is a sketch of Monte Carlo ES.
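A minimal Python sketch of Monte Carlo ES under the same conventions; `env.step_from(state, action)` is a hypothetical environment hook that applies `action` in `state` and returns `(next_state, reward, done)`:

```python
import random
from collections import defaultdict

def monte_carlo_es(env, states, actions, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts: estimates pi ~ pi_* and Q ~ q_*."""
    Q = defaultdict(float)                      # Q[(s, a)]
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair has nonzero start probability.
        s, a = random.choice(states), random.choice(actions)
        episode, done = [], False
        while not done:
            next_s, r, done = env.step_from(s, a)   # hypothetical environment hook
            episode.append((s, a, r))
            if not done:
                s, a = next_s, policy.get(next_s, random.choice(actions))

        G = 0.0
        for t in reversed(range(len(episode))):
            st, at, r = episode[t]
            G = gamma * G + r
            if (st, at) not in [(x[0], x[1]) for x in episode[:t]]:   # first visit to (s, a)
                returns_sum[(st, at)] += G
                returns_count[(st, at)] += 1
                Q[(st, at)] = returns_sum[(st, at)] / returns_count[(st, at)]
                policy[st] = max(actions, key=lambda act: Q[(st, act)])   # greedy improvement
    return policy, Q
```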

On-policy Monte Carlo Control 3 #

In the previous section, we used the assumption of exploring starts (ES) to design a Monte Carlo control method called MCES. In this part, without making that impractical assumption, we will be talking about another Monte Carlo control method.

In on-policy control methods , the policy is generally soft (i.e., $\pi(a|s)>0,\forall s\in\mathcal{S},a\in\mathcal{A}(s)$), but gradually shifted closer and closer to a deterministic optimal policy. We cannot simply improve the policy by making it greedy, since no exploration would take place. To get rid of ES, we instead use the on-policy MC method with $\varepsilon$- greedy policies: most of the time they choose an action with maximal estimated action value, but with probability $\varepsilon$ they select an action at random. Specifically,

  • $Pr(\small\textit{non-greedy action})=\dfrac{\varepsilon}{\vert\mathcal{A(s)}\vert}$
  • $Pr(\small\textit{greedy action})=1-\varepsilon+\dfrac{\varepsilon}{\vert\mathcal{A(s)}\vert}$

The $\varepsilon$-greedy policies are examples of $\varepsilon$- soft policies, defined as ones for which $\pi(a\vert s)\geq\frac{\varepsilon}{\vert\mathcal{A(s)}\vert}$ for all states and actions, for some $\varepsilon>0$. Among $\varepsilon$-soft policies, $\varepsilon$-greedy policies are in some sense those that are closest to greedy.

That any $\varepsilon$-greedy policy w.r.t $q_\pi$ is an improvement over any $\varepsilon$-soft policy $\pi$ is assured by the policy improvement theorem .

Proof Let $\pi'$ be the $\varepsilon$-greedy policy. The conditions of the policy improvement theorem apply because for any $s\in\mathcal{S}$, we have: \begin{align} q_\pi\left(s,\pi'(s)\right)&=\sum_a\pi'(a|s)q_\pi(s,a) \\ &=\dfrac{\varepsilon}{\vert\mathcal{A}(s)\vert}\sum_a q_\pi(s,a)+(1-\varepsilon)\max_a q_\pi(s,a) \\ &\geq\dfrac{\varepsilon}{\vert\mathcal{A(s)}\vert}\sum_a q_\pi(s,a)+(1-\varepsilon)\sum_a\dfrac{\pi(a|s)-\frac{\varepsilon}{\vert\mathcal{A}(s)\vert}}{1-\varepsilon}q_\pi(s,a) \\ &=\dfrac{\varepsilon}{\vert\mathcal{A}(s)\vert}\sum_a q_\pi(s,a)+\sum_a\pi(a|s)q_\pi(s,a)-\dfrac{\varepsilon}{\vert\mathcal{A}(s)\vert}\sum_a q_\pi(s,a) \\ &=v_\pi(s) \end{align} where in the third step, we have used the fact that the latter sum is a weighted average over $q_\pi(s,a)$. Thus, by the theorem, $\pi'\geq\pi$. The equality holds when both $\pi'$ and $\pi$ are optimal policies among the $\varepsilon$-soft ones.

Pseudocode of the complete algorithm is given below.
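A minimal Python sketch of on-policy first-visit MC control with $\varepsilon$-greedy policies, again assuming a hypothetical `generate_episode(policy)` helper that plays one episode with the supplied (stochastic) policy:

```python
import random
from collections import defaultdict

def on_policy_mc_control(actions, generate_episode, num_episodes, epsilon=0.1, gamma=1.0):
    """On-policy first-visit MC control with epsilon-greedy (hence epsilon-soft) policies."""
    Q = defaultdict(float)
    returns_sum, returns_count = defaultdict(float), defaultdict(int)

    def policy(state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily w.r.t. Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = generate_episode(policy)          # [(S_t, A_t, R_{t+1}), ...]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:   # first visit to (s, a)
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q, policy
```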

Off-policy Monte Carlo Prediction 4 #

When working with control methods, we have to solve a dilemma about exploitation and exploration . In other words, we have to evaluate a policy from episodes generated by following an exploratory policy.

A straightforward way to solve this problem is to use two different policies: one that is learned about and becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy , whereas the behavior policy is the one used to generate behavior.

In this section, we consider the off-policy method for the prediction task, in which both the target policy (denoted $\pi$) and the behavior policy (denoted $b$) are fixed and given. In particular, we wish to estimate $v_\pi$ or $q_\pi$ from episodes obtained by following the other policy $b$, where $\pi\neq b$.

Assumption of Coverage #

In order to use episodes from $b$ to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under $b$. That is, we assume that $\pi(a|s)>0$ implies $b(a|s)>0$. This implies that $b$ must be stochastic in states where it is not identical to $\pi$, while $\pi$ itself may be deterministic. This is the assumption of coverage .

Importance Sampling #

Let $X$ be a variable (or set of variables) that takes on values in some space $\textit{Val}(X)$. Importance sampling (IS) is a general approach for estimating the expectation of a function $f(x)$ relative to some distribution $P(X)$, typically called the target distribution . We can estimate this expectation by generating samples $x[1],\dots,x[M]$ from $P$, and then estimating \begin{equation} \mathbb{E}_P\left[f\right]\approx\dfrac{1}{M}\sum_{m=1}^{M}f(x[m]) \end{equation} In some cases, it might be impossible or computationally very expensive to generate samples from $P$; we instead prefer to generate samples from a different distribution, $Q$, known as the proposal distribution (or sampling distribution ).

  • Unnormalized Importance Sampling . If we generate samples from $Q$ instead of $P$, we cannot simply average the $f$-value of the samples generated. We need to adjust our estimator to compensate for the incorrect sampling distribution. The most obvious way of adjusting our estimator is based on the observation that \begin{align} \mathbb{E}_{P(X)}\left[f(X)\right]&=\sum_x f(x)P(x) \\ &=\sum_x Q(x)f(x)\dfrac{P(x)}{Q(x)} \\ &=\mathbb{E}_{Q(X)}\left[f(X)\dfrac{P(X)}{Q(X)}\right]\tag{1}\label{1} \end{align} Based on this observation \eqref{1}, we can use the standard estimator for expectations relative to $Q$. We generate a set of samples $\mathcal{D}=\{x[1],\dots,x[M]\}$ from $Q$, and then estimate: \begin{equation} \hat{\mathbb{E}}_\mathcal{D}(f)=\dfrac{1}{M}\sum_{m=1}^{M}f(x[m])\dfrac{P(x[m])}{Q(x[m])}\tag{2}\label{2}, \end{equation} where $\hat{\mathbb{E}}$ denotes empirical expectation. We call this estimator the unnormalized importance sampling estimator ; this method is also often called unweighted importance sampling . The factor $\frac{P(x[m])}{Q(x[m])}$ (denoted as $w(x[m])$) can be viewed as a correction weight to the term $f(x[m])$, which we would have used had $Q$ been our target distribution. (A small numerical check of both estimators follows this list.)
  • Normalized Importance Sampling . In many situations, we have that $P$ is known only up to a normalizing constant $Z$. Particularly, what we have access to is a distribution $\tilde{P}(X)=ZP(X)$. Thus, rather than to define the weights relative to $P$ as above, we define: \begin{equation} w(X)\doteq\dfrac{\tilde{P}(X)}{Q(X)} \end{equation} We have that the weight $w(X)$ is a random variable, and has expected value equal to $Z$: \begin{equation} \mathbb{E}_{Q(X)}\left[w(X)\right]=\sum_x Q(x)\dfrac{\tilde{P}(x)}{Q(x)}=\sum_x\tilde{P}(x)=Z \end{equation} Hence, this quantity is the normalizing constant of the distribution $\tilde{P}$. We can now rewrite \eqref{1} as: \begin{align} \mathbb{E}_{P(X)}\left[f(X)\right]&=\sum_x P(x)f(x) \\ &=\sum_x Q(x)f(x)\dfrac{P(x)}{Q(x)} \\ &=\dfrac{1}{Z}\sum_x Q(x)f(x)\dfrac{\tilde{P}(x)}{Q(x)} \\ &=\dfrac{1}{Z}\mathbb{E}_{Q(X)}\left[f(X)w(X)\right] \\ &=\dfrac{\mathbb{E}_{Q(X)}\left[f(X)w(X)\right]}{\mathbb{E}_{Q(X)}\left[w(X)\right]}\tag{3}\label{3} \end{align} We can use an empirical estimator for both the numerator and denominator. Given $M$ samples $\mathcal{D}=\{x[1],\dots,x[M]\}$ from $Q$, we can estimate: \begin{equation} \hat{\mathbb{E}}_\mathcal{D}(f)=\dfrac{\sum_{m=1}^{M}f(x[m])w(x[m])}{\sum_{m=1}^{M}w(x[m])}\tag{4}\label{4} \end{equation} We call this estimator the normalized importance sampling estimator (or weighted importance sampling estimator ).
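As a small numerical check of both estimators (our own toy example, not from the text), the sketch below estimates $\mathbb{E}_P[f]$ for a discrete target $P$ using samples drawn from a different proposal $Q$.

```python
import random

# Toy discrete distributions over {0, 1, 2}; f(x) = x.
P = [0.1, 0.3, 0.6]          # target distribution
Q = [0.4, 0.4, 0.2]          # proposal distribution (covers the support of P)
f = lambda x: float(x)
true_value = sum(p * f(x) for x, p in enumerate(P))     # E_P[f] = 1.5

M = 100_000
samples = random.choices(range(3), weights=Q, k=M)
weights = [P[x] / Q[x] for x in samples]

# Unnormalized (unweighted) IS estimator: plain average of w(x) * f(x).
unnormalized = sum(w * f(x) for w, x in zip(weights, samples)) / M
# Normalized (weighted) IS estimator: weighted average of f(x).
normalized = sum(w * f(x) for w, x in zip(weights, samples)) / sum(weights)

print(f"true {true_value:.3f}  unnormalized {unnormalized:.3f}  normalized {normalized:.3f}")
```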

Off-policy Monte Carlo Prediction via Importance Sampling #

We apply IS to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance sampling ratio (which we denoted as $w$ above, but now change to $\rho$ in order to follow the book).

The probability of the subsequent state-action trajectory, $A_t,S_{t+1},A_{t+1},\dots,S_T$, occurring under any policy $\pi$ given starting state $S_t$ is: \begin{align} Pr(A_t,S_{t+1},\dots,S_T|S_t,A_{t:T-1}\sim\pi)&=\pi(A_t|S_t)p(S_{t+1}|S_t,A_t)\dots p(S_T|S_{T-1},A_{T-1}) \\ &=\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k) \end{align} Thus, the importance sampling ratio is defined as: \begin{equation} \rho_{t:T-1}\doteq\dfrac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)p(S_{k+1}|S_k,A_k)}=\prod_{k=t}^{T-1}\dfrac{\pi(A_k|S_k)}{b(A_k|S_k)} \end{equation} which depends only on the two policies and the sequence, not on the MDP.

Since $v_b(s)=\mathbb{E}\left[G_t|S_t=s\right]$, we have \begin{equation} \mathbb{E}\left[\rho_{t:T-1}G_t|S_t=s\right]=v_\pi(s) \end{equation} To estimate $v_\pi(s)$, we simply scale the returns by the ratios and average the results: \begin{equation} V(s)\doteq\dfrac{\sum_{t\in\mathcal{T}(s)}\rho_{t:T(t)-1}G_t}{\vert\mathcal{T}(s)\vert},\tag{5}\label{5} \end{equation} where $\mathcal{T}(s)$ is the set of all time steps in which $s$ is visited (for the every-visit method). For a first-visit method, $\mathcal{T}(s)$ would only include time steps that were first visits to $s$ within their episodes. $T(t)$ denotes the first time of termination following time $t$, and $G_t$ denotes the return after $t$ up through $T(t)$.

When importance sampling is done as a simple average in this way, we call it ordinary importance sampling (OIS) (which corresponds to unweighted importance sampling in the previous section).

And the one corresponding to weighted importance sampling (WIS), which uses a weighted average, is defined as: \begin{equation} V(s)\doteq\dfrac{\sum_{t\in\mathcal{T}(s)}\rho_{t:T(t)-1}G_t}{\sum_{t\in\mathcal{T}(s)}\rho_{t:T(t)-1}},\tag{6}\label{6} \end{equation} or zero if the denominator is zero.

Incremental Implementation for Off-policy MC Prediction using IS #

Incremental method #

The incremental method is a way of updating averages with a small, constant amount of computation per new reward, instead of maintaining a record of all the rewards and performing the computation whenever the estimated value is needed. It follows the general rule: \begin{equation} NewEstimate\leftarrow OldEstimate+StepSize\left[Target-OldEstimate\right] \end{equation}

Applying to Off-policy MC Prediction using IS #

In ordinary IS, the returns are scaled by the IS ratio $\rho_{t:T(t)-1}$ and then simply averaged, as in \eqref{5}. Thus, it is easy to apply the incremental method to OIS.

For WIS, as in equation \eqref{6}, we have to form a weighted average of the returns, and a slightly different incremental algorithm is required. Suppose we have a sequence of returns $G_1,G_2,\dots,G_{n-1}$, all starting in the same state and each with a corresponding random weight $W_i$, e.g. $W_i=\rho_{t_i:T(t_i)-1}$. We wish to form the estimate \begin{equation} V_n\doteq\dfrac{\sum_{k=1}^{n-1}W_kG_k}{\sum_{k=1}^{n-1}W_k},\hspace{1cm}n\geq2 \end{equation} and keep it up-to-date as we obtain a single additional return $G_n$. In addition to keeping track of $V_n$, we must maintain for each state the cumulative sum $C_n$ of the weights given to the first $n$ returns. The update rule for $V_n$ is \begin{equation} V_{n+1}\doteq V_n+\dfrac{W_n}{C_n}\big[G_n-V_n\big],\hspace{1cm}n\geq1, \end{equation} and \begin{equation} C_{n+1}\doteq C_n+W_{n+1}, \end{equation} where $C_0=0$. A sketch of the resulting algorithm follows.
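A minimal Python sketch of this incremental, weighted-importance-sampling prediction algorithm (written here for action values); both policies are assumed to expose a hypothetical `prob(state, action)` method, and `generate_episode(b)` plays one episode with the behavior policy:

```python
from collections import defaultdict

def off_policy_mc_prediction_wis(pi, b, generate_episode, num_episodes, gamma=1.0):
    """Incremental off-policy MC prediction of q_pi using weighted importance sampling."""
    Q = defaultdict(float)     # value estimates V_n (here for state-action pairs)
    C = defaultdict(float)     # cumulative sums of weights C_n

    for _ in range(num_episodes):
        episode = generate_episode(b)               # [(S_t, A_t, R_{t+1}), ...] following b
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            # Incremental update: V_{n+1} = V_n + (W_n / C_n) * (G_n - V_n)
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= pi.prob(s, a) / b.prob(s, a)       # extend the importance-sampling ratio
            if W == 0.0:                            # all earlier ratios stay zero; stop early
                break
    return Q
```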

Off-policy Monte Carlo Control #

Similarly, we develop the algorithm for off-policy MC control, based on GPI and WIS, for estimating $\pi_*$ and $q_*$, which is shown below.

The target policy $\pi\approx\pi_*$ is the greedy policy w.r.t $Q$, which is an estimate of $q_\pi$. The behavior policy, $b$, can be anything, but in order to assure convergence of $\pi$ to the optimal policy, an infinite number of returns must be obtained for each pair of state and action. This can be guaranteed by choosing $b$ to be $\varepsilon$-soft.

The policy $\pi$ converges to optimal at all encountered states even though actions are selected according to a different soft policy $b$, which may change between or even within episodes.
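A minimal Python sketch of this off-policy MC control algorithm with weighted importance sampling, under the same assumptions as the prediction sketch above (a hypothetical `generate_episode(b)` helper and a soft behavior policy `b` exposing `prob(state, action)`); the target policy is kept greedy w.r.t $Q$:

```python
from collections import defaultdict

def off_policy_mc_control(actions, b, generate_episode, num_episodes, gamma=1.0):
    """Off-policy MC control (GPI + weighted IS): estimates pi ~ pi_* and Q ~ q_*."""
    Q = defaultdict(float)
    C = defaultdict(float)
    target = {}                                      # greedy target policy pi(s)

    for _ in range(num_episodes):
        episode = generate_episode(b)                # behavior policy b must be soft
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target[s] = max(actions, key=lambda act: Q[(s, act)])   # greedy improvement
            if a != target[s]:
                break                                # pi(a|s) = 0, so the ratio becomes 0
            W *= 1.0 / b.prob(s, a)                  # pi(a|s) = 1 for the greedy action
    return target, Q
```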

Example - Racetrack #

(This example is taken from RL book , Exercise 5.12.)

Problem Consider driving a race car around a turn like that shown in Figure 4 . You want to go as fast as possible, but not so fast as to run off the track. In our simplified racetrack, the car is at one of a discrete set of grid positions, the cells in the diagram. The velocity is also discrete, a number of grid cells moved horizontally and vertically per time step. The actions are increments to the velocity components. Each may be changed by +1, -1, or 0 in each step, for a total of nine (3 x 3) actions. Both velocity components are restricted to be nonnegative and less than 5, and they cannot both be zero except at the starting line. Each episode begins in one of the randomly selected start states with both velocity components zero and ends when the car crosses the finish line. The rewards are -1 for each step until the car crosses the finish line. If the car hits the track boundary, it is moved back to a random position on the starting line, both velocity components are reduced to zero, and the episode continues. Before updating the car’s location at each time step, check to see if the projected path of the car intersects the track boundary. If it intersects the finish line, the episode ends; if it intersects anywhere else, the car is considered to have hit the track boundary and is sent back to the starting line. To make the task more challenging, with probability 0.1 at each time step the velocity increments are both zero, independently of the intended increments. Apply a Monte Carlo control method to this task to compute the optimal policy from each starting state. Exhibit several trajectories following the optimal policy (but turn the noise off for these trajectories).

Solution code (source code can be found here ).

We begin by importing some useful packages.

Next, we define our environment

We continue by defining our behavior policy and algorithm.

And wrapping everything up with the main function.
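Since the individual snippets are linked rather than reproduced here, the following single, heavily condensed sketch combines those pieces; the environment is a deliberately simplified stand-in for the full racetrack (a small rectangle with no inner boundary), the behavior policy is uniform random, and all names and parameters are our own illustrative choices.

```python
import random
from collections import defaultdict

# Simplified stand-in environment: a W x H grid, start line on the bottom row,
# finish line at the rightmost column; velocities in {0, ..., 4}, nine actions.
class SimpleTrack:
    W, H = 10, 10
    ACTIONS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

    def reset(self):
        self.x, self.y, self.vx, self.vy = random.randrange(self.W // 2), 0, 0, 0
        return (self.x, self.y, self.vx, self.vy)

    def step(self, a):
        dx, dy = (0, 0) if random.random() < 0.1 else self.ACTIONS[a]   # 0.1 noise
        self.vx = min(4, max(0, self.vx + dx))
        self.vy = min(4, max(0, self.vy + dy))
        if self.vx == self.vy == 0:
            self.vy = 1                              # simplification: forbid zero velocity
        self.x, self.y = self.x + self.vx, self.y + self.vy
        if self.x >= self.W - 1:                     # crossed the finish line
            return None, -1.0, True
        if self.y >= self.H:                         # off the track: back to the start line
            self.reset()
        return (self.x, self.y, self.vx, self.vy), -1.0, False

def behavior_policy(state):
    return random.randrange(9)                       # uniform random (epsilon-soft)

def off_policy_mc_control(env, num_episodes=20_000, gamma=1.0):
    Q = defaultdict(lambda: -1e6)                    # -1e6 marks unvisited pairs
    C = defaultdict(float)
    target = defaultdict(lambda: 8)                  # default action: accelerate right/up
    for _ in range(num_episodes):
        s, done, episode = env.reset(), False, []
        while not done:
            a = behavior_policy(s)
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target[s] = max(range(9), key=lambda act: Q[(s, act)])
            if a != target[s]:
                break
            W *= 9.0                                 # 1 / b(a|s) with b uniform over 9 actions
    return target

if __name__ == "__main__":
    policy = off_policy_mc_control(SimpleTrack())
    print("Learned actions for a few states:", {s: policy[s] for s in list(policy)[:5]})
```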

Running the code produces trajectories that follow the learned policy from each starting position (the resulting figure is not reproduced here).

Discounting-aware Importance Sampling #

Recall that in the section above, we defined the estimator for $\mathbb{E}_P[f]$ as: \begin{equation} \hat{\mathbb{E}}_\mathcal{D}(f)=\dfrac{1}{M}\sum_{m=1}^{M}f(x[m])\dfrac{P(x[m])}{Q(x[m])} \end{equation} This estimator is unbiased because each of the samples it averages is unbiased: \begin{equation} \mathbb{E}_{Q}\left[\dfrac{P(x[m])}{Q(x[m])}f(x[m])\right]=\int_x Q(x)\dfrac{P(x)}{Q(x)}f(x)\hspace{0.1cm}dx=\int_x P(x)f(x)\hspace{0.1cm}dx=\mathbb{E}_{P}\left[f(x[m])\right] \end{equation} Unfortunately, this IS estimate is often of unnecessarily high variance. To be more specific, suppose the episodes last 100 steps and $\gamma=0$. Then $G_0=R_1$ will be weighted by \begin{equation} \rho_{0:99}=\dfrac{\pi(A_0|S_0)}{b(A_0|S_0)}\dots\dfrac{\pi(A_{99}|S_{99})}{b(A_{99}|S_{99})} \end{equation} but it really only needs to be weighted by $\rho_{0:0}=\frac{\pi(A_0|S_0)}{b(A_0|S_0)}$. The other 99 factors $\frac{\pi(A_1|S_1)}{b(A_1|S_1)}\dots\frac{\pi(A_{99}|S_{99})}{b(A_{99}|S_{99})}$ are irrelevant because, after the first reward, the return has already been determined. These later factors are all independent of the return and have expected value $1$; they do not change the expected update, but they add enormously to its variance. They can even make the variance infinite in some cases.

One of the methods used to avoid this large extraneous variance is discounting-aware IS . The idea is to think of discounting as determining a probability of termination or, equivalently, a degree of partial termination.

We begin by defining flat partial returns : \begin{equation} \bar{G}_{t:h}\doteq R_{t+1}+R_{t+2}+\dots+R_h,\hspace{1cm}0\leq t<h\leq T, \end{equation} where flat denotes the absence of discounting, and partial denotes that these returns do not extend all the way to termination but instead stop at $h$, called the horizon . The conventional full return $G_t$ can be viewed as a sum of flat partial returns : \begin{align} G_t&\doteq R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots+\gamma^{T-t-1}R_T \\ &=(1-\gamma)R_{t+1} \\ &\hspace{0.5cm}+(1-\gamma)\gamma(R_{t+1}+R_{t+2}) \\ &\hspace{0.5cm}+(1-\gamma)\gamma^2(R_{t+1}+R_{t+2}+R_{t+3}) \\ &\hspace{0.7cm}\vdots \\ &\hspace{0.5cm}+(1-\gamma)\gamma^{T-t-2}(R_{t+1}+R_{t+2}+\dots+R_{T-1}) \\ &\hspace{0.5cm}+\gamma^{T-t-1}(R_{t+1}+R_{t+2}+\dots+R_T) \\ &=(1-\gamma)\sum_{h=t+1}^{T-1}\left(\gamma^{h-t-1}\bar{G}_{t:h}\right)+\gamma^{T-t-1}\bar{G}_{t:T} \end{align} Now we need to scale the flat partial returns by an IS ratio that is similarly truncated. As $\bar{G}_{t:h}$ only involves rewards up to a horizon $h$, we only need the ratio of the probabilities up to $h$. We define:

  • Discounting-aware OIS estimator \begin{equation} V(s)\doteq\dfrac{\sum_{t\in\mathcal{T}(s)}\left[(1-\gamma)\sum_{h=t+1}^{T(t)-1}\left(\gamma^{h-t-1}\rho_{t:h-1}\bar{G}_{t:h}\right)+\gamma^{T(t)-t-1}\rho_{t:T(t)-1}\bar{G}_{t:T(t)}\right]}{\vert\mathcal{T}(s)\vert} \end{equation}
  • Discounting-aware WIS estimator \begin{equation} V(s)\doteq\dfrac{\sum_{t\in\mathcal{T}(s)}\left[(1-\gamma)\sum_{h=t+1}^{T(t)-1}\left(\gamma^{h-t-1}\rho_{t:h-1}\bar{G}_{t:h}\right)+\gamma^{T(t)-t-1}\rho_{t:T(t)-1}\bar{G}_{t:T(t)}\right]}{\sum_{t\in\mathcal{T}(s)}\left[(1-\gamma)\sum_{h=t+1}^{T(t)-1}\left(\gamma^{h-t-1}\rho_{t:h-1}\right)+\gamma^{T(t)-t-1}\rho_{t:T(t)-1}\right]} \end{equation}

These two estimators take the discount rate $\gamma$ into account, but they have no effect (they reduce to the ordinary estimators \eqref{5} and \eqref{6}) if $\gamma=1$.

Per-decision Importance Sampling #

Besides discounting-aware IS, there is another way that may reduce variance, even when $\gamma=1$.

Recall that in the off-policy estimators \eqref{5} and \eqref{6}, each term of the sum in the numerator is itself a sum: \begin{align} \rho_{t:T-1}G_t&=\rho_{t:T-1}\left(R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{T-t-1}R_T\right) \\ &=\rho_{t:T-1}R_{t+1}+\gamma\rho_{t:T-1}R_{t+2}+\dots+\gamma^{T-t-1}\rho_{t:T-1}R_T\tag{7}\label{7} \end{align} We have that \begin{equation} \rho_{t:T-1}R_{t+k}=\dfrac{\pi(A_t|S_t)}{b(A_t|S_t)}\dots\dfrac{\pi(A_{t+k-1}|S_{t+k-1})}{b(A_{t+k-1}|S_{t+k-1})}\dots\dfrac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}R_{t+k} \end{equation} Of all these factors, only the first $k$ factors, $\frac{\pi(A_t|S_t)}{b(A_t|S_t)}\dots\frac{\pi(A_{t+k-1}|S_{t+k-1})}{b(A_{t+k-1}|S_{t+k-1})}$, and the last (the reward $R_{t+k}$) are related; all the other ratios are for events that occur after the reward. Moreover, we have that \begin{equation} \mathbb{E}\left[\dfrac{\pi(A_i|S_i)}{b(A_i|S_i)}\right]=\sum_a b(a|S_i)\dfrac{\pi(a|S_i)}{b(a|S_i)}=1 \end{equation} Therefore, we obtain \begin{align} \mathbb{E}\Big[\rho_{t:T-1}R_{t+k}\Big]&=\mathbb{E}\left[\rho_{t:t+k-1}R_{t+k}\right]\mathbb{E}\left[\dfrac{\pi(A_{t+k}|S_{t+k})}{b(A_{t+k}|S_{t+k})}\right]\dots\mathbb{E}\left[\dfrac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}\right] \\ &=\mathbb{E}\Big[\rho_{t:t+k-1}R_{t+k}\Big]\cdot 1\dots 1 \\ &=\mathbb{E}\Big[\rho_{t:t+k-1}R_{t+k}\Big] \end{align} Plugging this result into the expectation of \eqref{7}, we have \begin{align} \mathbb{E}\Big[\rho_{t:T-1}G_t\Big]&=\mathbb{E}\Big[\rho_{t:T-1}R_{t+1}+\gamma\rho_{t:T-1}R_{t+2}+\dots+\gamma^{T-t-1}\rho_{t:T-1}R_T\Big] \\ &=\mathbb{E}\Big[\rho_{t:t}R_{t+1}+\gamma\rho_{t:t+1}R_{t+2}+\dots+\gamma^{T-t-1}\rho_{t:T-1}R_T\Big] \\ &=\mathbb{E}\Big[\tilde{G}_t\Big], \end{align} where $\tilde{G}_t\doteq\rho_{t:t}R_{t+1}+\gamma\rho_{t:t+1}R_{t+2}+\dots+\gamma^{T-t-1}\rho_{t:T-1}R_T$.

We call this idea per-decision IS . Hence, we obtain the per-decision OIS estimator, using $\tilde{G}_t$: \begin{equation} V(s)\doteq\dfrac{\sum_{t\in\mathcal{T}(s)}\tilde{G}_t}{\vert\mathcal{T}(s)\vert} \end{equation}

References #

[1] Richard S. Sutton, Andrew G. Barto . Reinforcement Learning: An Introduction . MIT press, 2018.

[2] Adrian Barbu, Song-Chun Zhu. Monte Carlo Methods .

[3] David Silver. UCL course on RL .

[4] Csaba Szepesvári. Algorithms for Reinforcement Learning .

[5] Singh, S.P., Sutton, R.S. Reinforcement learning with replacing eligibility traces . Mach Learn 22, 123–158, 1996.

[6] John N. Tsitsiklis. On the Convergence of Optimistic Policy Iteration . Journal of Machine Learning Research 3 (2002) 59–72.

[7] Yuanlong Chen. On the convergence of optimistic policy iteration for stochastic shortest path problem , arXiv:1808.08763, 2018.

[8] Jun Liu. On the Convergence of Reinforcement Learning with Monte Carlo Exploring Starts . arXiv:2007.10916, 2020.

[9] Daphne Koller & Nir Friedman. Probabilistic Graphical Models: Principles and Techniques .

[10] A. Rupam Mahmood, Hado P. van Hasselt, Richard S. Sutton. Weighted importance sampling for off-policy learning with linear function approximation . Advances in Neural Information Processing Systems 27 (NIPS 2014).

Footnotes #

A prediction task in RL is where we are given a policy and our goal is to measure how well it performs.  ↩︎

Along with prediction, a control task in RL is where the policy is not fixed, and our goal is to find the optimal policy.  ↩︎

On-policy is a category of RL algorithms that attempts to evaluate or improve the policy that is used to make decisions.  ↩︎

In contrast to on-policy, off-policy methods evaluate or improve a policy different from that used to generate the data.  ↩︎

Computational Stochastic Optimization and Learning

Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions

Warren B. Powell, Professor Emeritus, Princeton University

Citation: Warren B. Powell, Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions, John Wiley and Sons, Hoboken, 2022 (1100 pages).


Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions is the first textbook to offer a comprehensive, unified framework of the rich field of sequential decisions under uncertainty. Up to now, this rich problem class has been fragmented into at least 15 distinct fields that have been studied under names such as dynamic programming, stochastic programming, optimal control, simulation optimization, optimal learning, and multiarmed bandit problems. Recently these have been primarily associated with “reinforcement learning” and “stochastic optimization.”

Purchasing: The book is available from Amazon (at a continually varying price) or Wiley . Both have their own e-book formats.

Audience: The book requires little more than a good course in probability and statistics/machine learning (and supporting linear algebra). There are occasional forays that draw on linear programming. The presentation is designed for people who want to work on sequential decision problems, with an emphasis on modeling and computation. Applications are drawn from numerous topics in engineering (electrical, civil, mechanical, chemical), physical sciences, computer science, social sciences, economics and finance, operations research and industrial engineering, business applications, and statistics/machine learning.

Tutorial : Below is a four-part video tutorial (recorded March 13, 2022, based on my presentation at the Informs Optimization Society conference):

Part I: Introduction and the universal framework

Part II: An energy storage example, bridging machine learning to sequential decisions, and an introduction to the four classes of policies

Part III: The first three classes of policies: Policy function approximations (PFAs), cost function approximations (CFAs), and value function approximations (VFAs).

Part IV: The fourth class, direct lookahead approximations (DLAs), followed by a discussion of how different communities in the “jungle of stochastic optimization” have been evolving to adopt all four classes of policies, ending with a pitch for courses (and even a new academic field) on sequential decision analytics.

Major book themes and features:

  • Central to the book is that we are addressing sequential decision problems (decision, information, decision, information, …), which combine making decisions over time (or over iterations/experiments) with uncertainty. Every problem addressed in the dramatically growing literature on reinforcement learning can be described in this way.
  • We use a universal modeling framework with five parts (state variables, decision variables, exogenous information, transition function, and objective function) that applies to any sequential decision problem (the framework draws heavily from the optimal control literature). Central to the framework is optimizing over policies. A toy code sketch of this five-part framework follows the list.
  • We introduce four (meta)classes of policies (PFAs, CFAs, VFAs and DLAs) that we claim are universal – they include any method proposed in the literature, or used in practice. The policies are distinguished by their computational characteristics. Each of the four classes can produce an optimal policy for special problem classes, but this is rare.
  • The book is written for people who want to write software for real-world problems, using methods that scale not just in terms of size, but also complexity. Mathematical notation is designed to be translated directly into software (there is a one-to-one relationship between software and our mathematical modeling framework).
  • To help first-time readers, many sections are marked with * indicating they can be skipped the first time through the book. Sections marked with ** indicate more advanced mathematics (we do not use measure-theoretic terminology, but we have a section that introduces the interested reader who would like to learn it).
  • There are over 370 exercises, organized into seven categories at the end of each chapter. These consist of review questions, modeling, computation, theory, problem solving, problems drawn from the companion volume Sequential Decision Analytics and Modeling (with supporting python modules ), and finally a “diary” problem that the reader chooses in chapter 1 and then uses to answer a question at the end of each chapter. This allows readers to work on an application familiar to them.
  • There is a supplementary materials webpage here .
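To illustrate how the five elements map onto code, here is a toy Python sketch of a simple inventory problem written in this five-part style; it is our own illustrative example, not taken from the book or its companion python modules.

```python
import random

# A toy inventory problem expressed with the five elements of the framework.
class InventoryModel:
    def __init__(self, capacity=20, price=5.0, cost=2.0):
        self.capacity, self.price, self.cost = capacity, price, cost

    # 1. State variable S_t: inventory on hand.
    def initial_state(self):
        return 5

    # 3. Exogenous information W_{t+1}: random demand revealed after the decision.
    def exogenous(self):
        return random.randint(0, 10)

    # 4. Transition function: S_{t+1} = S^M(S_t, x_t, W_{t+1}).
    def transition(self, state, decision, demand):
        available = min(self.capacity, state + decision)
        return available - min(available, demand)

    # 5. Objective function: contribution earned in one period.
    def contribution(self, state, decision, demand):
        available = min(self.capacity, state + decision)
        return self.price * min(available, demand) - self.cost * decision

# 2. Decision variable x_t, made by a policy X^pi(S_t); here a simple order-up-to
#    PFA with a tunable parameter theta.
def order_up_to_policy(state, theta=12):
    return max(0, theta - state)

# Simulate the policy to estimate its objective value; optimizing over policies would
# mean searching over theta, or switching to a different policy class entirely.
model = InventoryModel()
state, total = model.initial_state(), 0.0
for t in range(1000):
    x = order_up_to_policy(state)
    w = model.exogenous()
    total += model.contribution(state, x, w)
    state = model.transition(state, x, w)
print(f"Average per-period contribution of the order-up-to policy: {total / 1000:.2f}")
```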

Chapter summaries:

Below I briefly summarize each chapter. Three chapters can still be downloaded from this webpage (these are prepublication versions): Chapter 1 (which introduces the entire universal framework), chapter 9 (on modeling) and chapter 11 (which describes the four classes of policies in depth). The table of contents and index is also available for download.

Table of contents

Chapter 1 – Introduction – The book presents a new universal framework for sequential decisions under uncertainty, where the focus is designing policies (methods for making decisions). We present four classes of policies that cover all possible methods. Chapter 1 provides a one-chapter overview of this entire framework.

Chapter 2 – Canonical models and applications – This chapter provides very brief summaries of 14 different fields that all deal with sequential decisions. We give the five-part universal framework in a bit more detail, and a series of examples.

Chapter 3 – Online learning – This is a single-chapter tutorial of all the major methods for online (that is, adaptive) learning, spanning lookup tables (with independent and correlated beliefs), parametric and nonparametric models (including neural networks).

Chapter 4 – Introduction to stochastic search – This chapter describes three strategies for solving different types of sequential decision problems: methods using deterministic mathematics (where you can compute expectations), sampled methods (where we approximate expectations with samples), and adaptive methods, which dominate the rest of the book.

Chapter 5 – Derivative-based stochastic optimization – This is classic material on stochastic gradient methods, but includes a description of how to model a stochastic gradient algorithm as a sequential decision problem (the decision is the stepsize).

Chapter 6 – Stepsize policies – An entire chapter dedicated to stepsize policies. We cover deterministic policies, adaptive (stochastic) policies, and then some results on optimal policies.

Chapter 7 – Derivative-free stochastic optimization – We describe derivative-free stochastic search (also known as a multi-armed bandit problem) using our universal framework (since it is a sequential decision problem), and then outline examples of all four classes of policies that have been applied to this broad problem class.

Chapter 8 – State-dependent problems – This chapter provides a tour of a wide range of general dynamic problems which we call “state-dependent” problems, to distinguish them from “state-independent” problems which are the pure learning problems of chapters 5 and 7.

Chapter 9 – Modeling sequential decision problems – This chapter presents the five-part universal modeling framework, first in the context of simple problems, and then a much more detailed presentation that provides the foundation for modeling much more complex problems (think of energy systems, transportation systems, health applications, supply chains).

Chapter 10 – Uncertainty modeling – We describe 12 sources of uncertainty, and then provide a brief tutorial into Monte Carlo sampling and uncertainty quantification.

Chapter 11 – Designing policies – This is a much more in-depth presentation of the four classes of policies, and includes guidance on how to choose a policy class for a problem you might be working on. Chapters 12-19 cover each of the four classes in depth (chapters 14-18 are dedicated to value function approximations).

Chapter 12 – Policy function approximations and policy search – We discuss the idea of PFAs in much greater depth (PFAs consist of any function class covered in chapter 3 for online learning), and describe four methods for performing a search over tunable parameters spanning derivative-free stochastic search, and three types of derivative-based methods: numerical derivatives, backpropagation (for control problems), and the policy-gradient method.

Chapter 13 – Cost function approximations – Cost function approximations are parameterized optimization models. Widely used in industry on an ad-hoc basis, they have been largely overlooked by the research literature. This book is the first to deal with them as a fundamental class of policy. CFAs often raise simpler scaling issues than even the simpler PFAs.

Chapter 14 – Exact Dynamic Programming – This is classic material on Markov decision processes, as well as some other examples of dynamic programs that can be solved exactly, including linear quadratic regulation from optimal control.

Chapter 15 – Backward approximate dynamic programming – Backward approximate dynamic programming is a relatively recent methodology (it parallels fitted value iteration for infinite horizon problems), but we have had considerable success with it. It overcomes problems with high-dimensional and/or continuous states and uncertainties and, for some problems, high-dimensional and/or continuous decisions.

Chapter 16 – Forward ADP I: The value of a policy – This is from my 2011 ADP book, and introduces, for finite and infinite horizon problems, classical TD($\lambda$), approximate value iteration (single pass and double pass), LSTD and LSPE, projected Bellman minimization for linear architectures, Bayesian learning for value functions, and the design of stepsizes for approximate value iteration.

Chapter 17 – Forward ADP II: Policy optimization – Also from my 2011 ADP book, here we describe the rich methods for approximating value functions while simultaneously searching for policies. This, with chapter 16, is the material that was the original heart of reinforcement learning before the introduction (in this book) of the four classes of policies.

Chapter 18 – Forward ADP III: Convex functions – This is methods for approximate dynamic programming where we exploit convexity (concavity for maximizing), which arises frequently with high-dimensional resource allocation problems. We show how to use Benders cuts, piecewise linear separable and linear (in the resource variable) value function approximations that have been applied to real resource allocation problems.

Chapter 19 – Direct lookahead policies – We cover both deterministic lookahead policies (most often associated with “model predictive control”) and policies based on stochastic lookahead models. We describe six classes of approximation strategies, and then illustrate all four classes of policies in the context of policies to solve a lookahead model (the “policy-within-a-policy”).

Chapter 20 – Multiagent modeling and learning – Here we adopt our universal framework and illustrate it in the context of the rich domain of multiagent problems, beginning with two-agent problems for learning where we present our perspective on POMDPs.

Index – Check out the entry “Applications”

The roots of the book:

Note: This book used my 2011 book, Approximate Dynamic Programming: Solving the curses of dimensionality as a starting point.  Chapter 3 on online learning evolved from chapter 8 in ADP on approximating value functions. The modeling chapter 5 from ADP is now chapter 9 (with  major modifications) in RLSO.  Chapter 14 in RLSO is based on the old chapter 3 on Markov decision processes, but now includes a section on optimal control and examples of dynamic programs that can be solved exactly.  Chapter 15 is an entirely new chapter on backward approximate dynamic programming. Chapters 16-18 are based directly on the chapters in the ADP book for approximating value functions (they are now labeled as “Forward approximate dynamic programming”).  Everything else is completely new.

Warren Powell [email protected]


Research Article

Fostering human learning in sequential decision-making: Understanding the role of evaluative feedback

Piyush Gupta, Subir Biswas, and Vaibhav Srivastava

Affiliation: Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan, United States of America

  • Published: May 28, 2024
  • https://doi.org/10.1371/journal.pone.0303949


Cognitive rehabilitation, STEM (science, technology, engineering, and math) skill acquisition, and coaching games such as chess often require tutoring decision-making strategies. The advancement of AI-driven tutoring systems for facilitating human learning requires an understanding of the impact of evaluative feedback on human decision-making and skill development. To this end, we conduct human experiments using Amazon Mechanical Turk to study the influence of evaluative feedback on human decision-making in sequential tasks. In these experiments, participants solve the Tower of Hanoi puzzle and receive AI-generated feedback while solving it. We examine how this feedback affects their learning and skill transfer to related tasks. Additionally, treating humans as noisy optimal agents, we employ maximum entropy inverse reinforcement learning to analyze the effect of feedback on the implicit human reward structure that guides their decision making. Lastly, we explore various computational models to understand how people incorporate evaluative feedback into their decision-making processes. Our findings underscore that humans perceive evaluative feedback as indicative of their long-term strategic success, thus aiding in skill acquisition and transfer in sequential decision-making tasks. Moreover, we demonstrate that evaluative feedback fosters a more structured and organized learning experience compared to learning without feedback. Furthermore, our results indicate that providing intermediate goals alone does not significantly enhance human learning outcomes.

Citation: Gupta P, Biswas S, Srivastava V (2024) Fostering human learning in sequential decision-making: Understanding the role of evaluative feedback. PLoS ONE 19(5): e0303949. https://doi.org/10.1371/journal.pone.0303949

Editor: Rei Akaishi, RIKEN CBS: RIKEN Noshinkei Kagaku Kenkyu Center, JAPAN

Received: November 16, 2023; Accepted: May 2, 2024; Published: May 28, 2024

Copyright: © 2024 Gupta et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data and code is publicly available at https://github.com/piyushgupta221/Decision_making_ToH .

Funding: This work has been supported in part by the NSF awards IIS-1734272 and ECCS-2024649. The analysis portion of this work is supported in part by the ONR award N00014-22-1-2813. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

The integration of advanced Artificial Intelligence (AI) algorithms and affordable Internet of Things (IoT) devices has led to the widespread use of these technologies in various personal and professional devices. AI algorithms can handle complex decision-making challenges and support individuals in achieving their learning goals. However, it remains uncertain if embedding intelligent technology in these devices enhances individuals’ reasoning and decision-making abilities. To this end, we explore the potential benefits of offering feedback derived from AI’s optimal policies in the context of sequential decision-making tasks. Our primary goal is to evaluate whether this feedback can effectively enhance an individual’s performance in a specific task and whether the acquired knowledge and skills can be readily transferred to related tasks. Through this study, we aim to uncover the influence of AI-generated evaluative feedback on human decision-making.

The examination of the interaction between AI and embodied human intelligence has far-reaching implications for various domains such as cognitive rehabilitation after brain injuries or strokes [ 1 , 2 ], sports coaching, surgical training, driving instruction and human-supervisory systems [ 3 – 5 ]. The design of automated tutoring systems for assisting humans in learning new tasks has been a topic of significant interest [ 6 – 13 ]. Historically, these systems have been based on the manual coding of domain knowledge, which is then translated into a human-readable format. Recent works [ 8 ] have started to explore machine-learning approaches to design automated tutoring systems but do not account for human learning dynamics. Some researchers have also examined the role of cognitive architecture in the design of effective tutoring systems [ 14 , 15 ], yet these efforts still primarily rely on traditional methods of manual coding of domain knowledge.

In this work, we focus on sequential decision-making tasks [ 16 , 17 ] that inherently present significant cognitive challenges. They require continual decision-making at each time step, with each choice potentially influencing future states and overall outcomes. These tasks involve navigating the exploration-exploitation trade-off [ 18 – 20 ], which pertains to deciding whether to act based on current knowledge or to explore in order to enhance that knowledge. Proficiency in these tasks can significantly enhance problem-solving skills.

We selected the Tower of Hanoi (ToH) puzzle [ 21 – 23 ] as our choice for the sequential decision-making task. This choice is motivated by the simplicity of the ToH task, enabling efficient learning and evaluation within a reasonable timeframe for our experiment. Nonetheless, it’s essential to note that the framework discussed in our work is broadly applicable and can be generalized to other complex sequential decision-making tasks. In ToH, various-sized disks are arranged on three pegs, and the objective is to reach a specific disk configuration by moving one disk at a time. Importantly, only the uppermost disk on a peg can be moved, and larger disks cannot be placed on top of the smaller ones. Decision-making in ToH has been frequently employed in psychological research, serving as a valuable tool for examining developmental progress in children and adolescents [ 24 ]. In the cognitive assessment domain, ToH is instrumental for gauging visual-spatial and complex problem-solving capabilities in both adults [ 25 ] and children [ 26 ]. Solving the ToH task not only requires strong cognitive skills but also relies heavily on executive functions, especially planning [ 27 ]. Planning is essential for tackling complex reasoning tasks as it involves controlling impulsive actions and prioritizing strategic problem-solving.

In this study, we explore the impact of AI-generated evaluative feedback on human decision-making, specifically within the context of the ToH puzzle. The AI agent learns the optimal ToH policy and provides evaluative feedback to guide human participants. We evaluate various forms of feedback on decision-making performance and knowledge transfer, conduct experiments to visualize skill development with and without feedback, and investigate models for understanding how humans incorporate feedback into their decision-making processes. This research provides insights into the role of feedback in shaping human decisions.

This work makes three major contributions:

  • (i). Exploring Evaluative Feedback Strategies: We investigate the impact of different evaluative feedback strategies on the performance of individuals learning to solve ToH, a widely studied sequential decision-making task. Furthermore, we explore how individuals trained with different feedback strategies transfer their skills to a more challenging task.
  • (ii). Understanding Reward Structures Induced by Evaluative Feedback: Treating humans as noisy optimal agents, we study how various evaluative feedback strategies affect their reward functions. Our research highlights the influence of different forms of evaluative feedback on the implicit reward structure that explains human decisions.
  • (iii). Developing Computational Models for Human Decision-Making: We create a set of candidate computational models that may explain how humans integrate evaluative feedback into their sequential decision-making processes. Our goal is to identify the model that best explains human decision-making under evaluative feedback conditions.

The rest of the manuscript is structured as follows. Sec. 2 presents background and problem formulation, and includes a discussion of the ToH structure, the application of maximum entropy IRL for learning human rewards, and the development of computational models aimed at integrating evaluative feedback into human decision-making processes. In Sec. 3, we provide details of the ToH experiments, conducted through the Amazon Mechanical Turk (AMT) platform, alongside the discussion of the various evaluative feedback strategies employed during these experiments. We discuss and analyze the experiment results in Section 4 and finally conclude in Sec 5.

2 Background and problem formulation

We investigate the influence of evaluative feedback on human performance in a sequential decision-making task through experimental evaluations and computational modeling. To this end, we conducted experiments in which participants were asked to solve the ToH puzzle. ToH is a puzzle in which disks with a priority order are placed on three pegs. The priority order determines which disk can be placed on top of another disk, and each instance of admissible disk placement is referred to as a configuration. Thus, for a four-disk and a five-disk ToH, there are 3^4 = 81 and 3^5 = 243 possible configurations, respectively. The goal is to move one disk at a time and reach the desired configuration while maintaining the priority order at each step.

Consider the ToH puzzle with n disks, where the disks are numbered {0, 1, …, n − 1} in ascending order of size, and the three pegs are numbered {0, 1, 2} from left to right. The state of the n-disk ToH can be represented as S_n = (s_0 s_1 … s_{n−1}), where s_i ∈ {0, 1, 2} denotes the peg on which disk i is placed, for 0 ≤ i ≤ n − 1. Each state in an n-disk ToH has either two or three possible state transitions, as can be seen in the state space of the 4-disk ToH shown in Fig 1.
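To make this encoding concrete, here is a minimal Python sketch of the state representation and legal-move generation described above; the tuple mirrors the S_n notation, and the helper names `neighbors` and `states` are ours, not the paper's.

```python
from itertools import product

def neighbors(state):
    """Legal single-disk moves from a ToH state.

    `state` is a tuple (s_0, ..., s_{n-1}) giving the peg (0, 1, or 2) of each
    disk, with disks numbered in ascending order of size. Only the smallest
    disk on a peg can move, and it may not land on a smaller disk.
    """
    n = len(state)
    top = {0: None, 1: None, 2: None}      # smallest disk currently on each peg
    for disk in range(n):                  # ascending size: first hit is the top disk
        if top[state[disk]] is None:
            top[state[disk]] = disk
    moves = []
    for src in range(3):
        disk = top[src]
        if disk is None:
            continue
        for dst in range(3):
            if dst != src and (top[dst] is None or top[dst] > disk):
                moves.append(state[:disk] + (dst,) + state[disk + 1:])
    return moves

# All 3^4 = 81 states of the 4-disk puzzle; every state has 2 or 3 moves.
states = list(product(range(3), repeat=4))
assert len(states) == 81
assert all(len(neighbors(s)) in (2, 3) for s in states)
```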

Fig 1. Each state corresponds to a unique configuration of the disks on three pegs and edges encode allowed transitions between states. The task is to reach the configuration associated with a randomly selected target state (for example 2201 in this figure). Warmer colors are associated with the higher value function (see Sec. 2.1 for discussion).

https://doi.org/10.1371/journal.pone.0303949.g001

2.1 Evaluative feedback

Using the reward function in (5) results in an optimal value function that is proportional to the length of the shortest path from each state to the target state. The obtained optimal value function is utilized to provide evaluative feedback to the human player based on the change in the value at states before and after the move. We deploy several feedback mechanisms as detailed in Section 3.1 and systematically explore how human decision-making is influenced by different feedback mechanisms.
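Equation (5) itself is not reproduced in this excerpt, but the stated property, namely that the optimal value is proportional to the negative shortest-path distance to the target, can be sketched with a breadth-first search over the state graph. The snippet below reuses the hypothetical `neighbors` helper from the sketch in the previous section, and the scale of the feedback signal is an illustrative assumption rather than the experiment's exact scoring.

```python
from collections import deque

def value_function(target):
    """V(s) = -(shortest-path length from s to `target`), computed by BFS.
    A stand-in for the optimal value function described above; equation (5)
    itself is not reproduced here."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        s = queue.popleft()
        for t in neighbors(s):              # `neighbors` from the earlier sketch
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return {s: -d for s, d in dist.items()}

def evaluative_feedback(V, before, after):
    """Feedback as the change in value across a move: positive toward the
    target, negative away from it (the exact scaling used in the experiments
    is not shown in this excerpt)."""
    return V[after] - V[before]

V = value_function(target=(2, 2, 0, 1))                      # target state 2201 from Fig 1
print(evaluative_feedback(V, (1, 1, 1, 0), (1, 1, 1, 2)))    # a critical-state move
```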

Remark 1 The value function for each state in the ToH problem is proportional to the shortest path length to the target state, allowing for the application of simpler graph-search algorithms rather than RL. However, it’s crucial to recognize that this characteristic is unique to ToH’s finite and structured state space, governed by a recursive pattern, and does not apply to all sequential decision-making problems. In complex sequential decision-making problems like chess, characterized by larger or continuous state and action spaces, simpler algorithms might not be available, necessitating the use of advanced AI techniques like RL or deep neural networks to obtain the optimal policy. However, our framework is broadly applicable and can be generalized to other complex sequential decision-making tasks .

The state space of the ToH problem exhibits a recursive structure. Specifically, the state space of a ToH puzzle with n disks can be effectively illustrated using three interlocking triangles. Each of these triangles symbolizes the state space of a ToH puzzle with n − 1 disks. To illustrate this concept, let's examine the state space of a 4-disk ToH in Fig 2, which is highlighted in red. In the same figure, the blue and green squares are employed to represent the state spaces of 3-disk and 2-disk ToH puzzles, respectively. Hence, the state space of the ToH with n − 1 disks can be obtained simply by removing the last digit from each state in the upper triangle of the n-disk ToH. This digit corresponds to the position of the largest disk.

Fig 2. Each state corresponds to a unique configuration of the disks on three pegs and edges encode allowed transitions between states. The state space can be visualized as comprising three triangular structures. The states that connect different triangular structures are critical states for transitioning between triangles.

https://doi.org/10.1371/journal.pone.0303949.g002

As illustrated in Fig 2, the state space of the 4-disk ToH puzzle can be decomposed into three triangles labeled T1, T2, and T3. Throughout the remainder of the manuscript, we will consistently refer to the regions of the state space as follows: the top triangle will be denoted as T1, the lower left triangle as T2, and the lower right triangle as T3. These triangles are interconnected at their vertices through single edges. These vertex states are critical states: transitioning from one triangle to another necessitates passing through them. For instance, starting from an initial state in T1, the optimal path to reach a desired state in T2 or T3 must involve the state transitions 1110 → 1112 and 2220 → 2221, respectively. Indeed, to master the art of solving the ToH puzzle effectively, one must grasp its inherent recursive structure. Success in solving the puzzle relies on systematically working towards reaching the critical states within the state space.
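This critical-state structure can be checked mechanically: in the 4-disk graph, the only edges whose last digit differs between endpoints (that is, moves of the largest disk) are the bridges between the three triangles. A small check, again assuming the `states` and `neighbors` helpers from the earlier sketch:

```python
# Bridges between T1, T2, T3 are exactly the moves of the largest disk.
bridges = set()
for s in states:
    for t in neighbors(s):
        if s[-1] != t[-1]:                       # the largest disk changed peg
            bridges.add(frozenset((s, t)))

print(len(bridges))                              # 3 bridge edges in total
assert frozenset(((1, 1, 1, 0), (1, 1, 1, 2))) in bridges   # 1110 <-> 1112
assert frozenset(((2, 2, 2, 0), (2, 2, 2, 1))) in bridges   # 2220 <-> 2221
```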

2.2 Human rewards using maximum entropy inverse reinforcement learning

In the context of human participants solving the ToH puzzle, we can perceive them as noisy optimal agents striving to optimize an implicit reward function. Utilizing their demonstrations, we can leverage Inverse Reinforcement Learning (IRL) techniques [ 31 , 32 ] to deduce a reward function. This reward function is designed to align the optimal policy with the observed human demonstrations.

The maximum entropy IRL [ 33 , 34 ] assumes human demonstrations are not perfect and allows us to learn from sub-optimal demonstrations by incorporating a probabilistic model that captures the variability in human behavior. Maximum entropy IRL has gained significant traction in the literature as a means to effectively learn from human demonstrations [ 35 ].

Interested readers are referred to [ 35 ] for detailed derivations.

We employ maximum entropy IRL to infer the reward functions associated with human behavior. Detailed results are presented in Section 4.2.
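As a rough illustration of what this inference involves, maximum entropy IRL adjusts a weight vector on state features so that the expected feature counts under the induced soft-optimal policy match the empirical feature counts of the demonstrations. The sketch below is written for a deterministic finite MDP such as the ToH graph; the function signature, feature encoding, and hyperparameters are our assumptions, not the authors' implementation (their code is in the repository listed in the Data Availability statement).

```python
import numpy as np

def maxent_irl(features, transitions, trajectories, horizon, lr=0.05, iters=100):
    """Minimal maximum-entropy IRL sketch in the spirit of Ziebart et al.

    Assumes a deterministic finite MDP: `transitions[s]` lists the successor
    state indices of state s, `features` is an (n_states, n_features) matrix,
    and `trajectories` are lists of state indices.
    """
    n_states, n_features = features.shape
    theta = np.zeros(n_features)

    # Empirical feature expectations and start-state distribution.
    emp = np.mean([features[traj].sum(axis=0) for traj in trajectories], axis=0)
    start = np.bincount([traj[0] for traj in trajectories], minlength=n_states)
    start = start / start.sum()

    for _ in range(iters):
        r = features @ theta

        # Soft (max-ent) value iteration over a finite horizon.
        v = np.zeros(n_states)
        for _ in range(horizon):
            q = [r[s] + np.array([v[t] for t in transitions[s]]) for s in range(n_states)]
            v = np.array([np.logaddexp.reduce(q[s]) for s in range(n_states)])
        policy = [np.exp(q[s] - v[s]) for s in range(n_states)]

        # Forward pass: expected state-visitation frequencies under the policy.
        d = start.copy()
        visits = d.copy()
        for _ in range(horizon - 1):
            nxt = np.zeros(n_states)
            for s in range(n_states):
                for a, t in enumerate(transitions[s]):
                    nxt[t] += d[s] * policy[s][a]
            d = nxt
            visits += d

        # Gradient step: empirical minus expected feature counts.
        theta += lr * (emp - features.T @ visits)
    return theta
```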

2.3 Modeling human sequential decision-making under feedback

A central and challenging aspect of designing efficient tutoring systems lies in understanding the impact of evaluative feedback from AI on human decision-making. Precisely, the question of how humans incorporate feedback into their decision-making processes is of paramount importance. In modeling this process, a foundational challenge is understanding how they interpret the feedback, including whether it’s seen as an immediate reward or an evaluation of long-term impacts. Further, it’s important to explore if feedback relates only to the current action or spans the sequence of actions. Additionally, understanding if evaluative feedback affects the assessment of value functions over time or momentarily influences action choices is vital. To tackle these questions, we develop candidate models that embody different mechanisms for incorporating feedback into human decision-making processes. Our models are inspired by the Training an Agent Manually via Evaluative Reinforcement (TAMER) framework [ 36 – 39 ] developed to incorporate human feedback into the policy of an artificial RL agent.

  • Model 1 (Ignore feedback): This baseline model operates under the assumption that evaluative feedback isn't directly integrated into human decision-making processes. Instead, individuals are postulated to focus on maximizing the long-term value derived from their personal reward functions. In this framework, evaluative feedback plays an indirect role by shaping and refining these reward functions. This is the default model studied in Sec. 2.2. The model encompasses |f| learned parameters.

We investigate these models in Sec 4.3 to understand how humans incorporate evaluative feedback in their decision-making.
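Because the equations of Models 2-4 are not reproduced in this excerpt, the following is only a hedged sketch of the general form such candidate models can take: a softmax (noisy-optimal) choice over candidate moves whose utility combines long-term value with an optional feedback-dependent term. With the mixing weight set to zero it reduces to the Model 1 baseline; the parameter names `beta` and `w` are illustrative, not the paper's.

```python
import numpy as np

def choice_probabilities(values, feedback=None, beta=1.0, w=0.0):
    """Softmax (noisy-optimal) choice over candidate next states.

    `values[a]` is the long-term value of the state reached by action a;
    `feedback[a]` is an optional evaluative-feedback term. With w = 0 this
    reduces to the 'ignore feedback' baseline (Model 1). The parameterization
    is illustrative; the paper's Models 2-4 are not reproduced here.
    """
    utility = np.asarray(values, dtype=float)
    if feedback is not None:
        utility = utility + w * np.asarray(feedback, dtype=float)
    z = beta * utility
    z = z - z.max()                     # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Three candidate moves; the second leads closest to the target state.
print(choice_probabilities(values=[-5, -3, -4], beta=1.5))
```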

3 Human experiments

In this section, we discuss the human experiments conducted using AMT.

3.1 Experiment design

The only difference among the experiments was the feedback provided during the 4-disk ToH training task. No feedback was provided during the 5-disk ToH transfer task in any of the experiments. In each experiment, participants were asked to try their best to get the highest scores. The five training conditions were: Experiment 1 (no feedback), Experiment 2 (numeric feedback), Experiment 3 (optional numeric feedback, provided only on request), Experiment 4 (sub-goal configurations), and Experiment 5 (sub-goal configurations with numeric feedback).

Fig 3 shows the experimental interface utilized by participants during the training task of Experiment 5. As illustrated in Fig 3, the participants had access to the numeric feedback, the number of moves taken, the current score S (total reward), the maximum available moves m_allowed, the maximum possible reward, as well as information regarding intermediate and final goal configurations.

https://doi.org/10.1371/journal.pone.0303949.g003

The interface for the training tasks in the other experiments is similar to Experiment 5 with the following key differences with respect to Fig 3 :

  • (i). Experiment 1: Participants do not receive any numeric feedback (such as "Bad Move: -2" in Fig 3) or intermediate goal configurations during the training tasks.
  • (ii). Experiment 2: Participants receive numeric feedback during the training tasks, but no intermediate goal configurations are provided.
  • (iii). Experiment 3: Participants have access to a button labeled "Get Feedback" during the training tasks; numeric feedback is only provided upon request by pressing the designated button.
  • (iv). Experiment 4: Participants receive intermediate goal configurations during the training tasks, but no numeric feedback is provided.

3.2 Methods

After receiving IRB approval (MSU IRB #8421) from Michigan State University's IRB office, we recruited 238 participants using AMT for the study. Inclusion criteria were established as having completed a minimum of 500 prior studies and maintaining a 98% approval rate on the platform. Participants were compensated with a base payment of $6 and had the opportunity to earn additional performance-based bonuses ranging from $0 to $4. Of the recruited participants, 78 were excluded due to self-reported prior experience with the ToH task.

The recruitment of participants took place from July 3, 2023, to July 10, 2023. Before engaging in the experiment, each participant was required to give written informed consent online, which was then securely documented alongside their experimental data. Participation was restricted to individuals who were 18 years of age or older.

4 Results and discussion

In this section, we discuss the results of the experiments conducted on AMT.

4.1 Performance under evaluative feedback

First, we collect data from 20 participants for each of the five experiments detailed in Section 3.1.

In Fig 4a , we present box plots illustrating the percentage scores achieved in the training tasks (4-disk ToH). Notably, participants who underwent training with evaluative feedback in Experiment 2 (numeric feedback) and Experiment 5 (sub-goal with numeric feedback) exhibited significantly improved performance during these training tasks compared to participants in Experiment 1 (no feedback), who received no evaluative feedback.

Fig 4. Within each box plot, the median is represented by the red horizontal line, while the lower and upper edges of the box signify the 25th and 75th percentiles, respectively. Whiskers extend to encompass the most extreme data points that are not classified as outliers, and individual outliers are plotted using the symbol '+'.

https://doi.org/10.1371/journal.pone.0303949.g004

In Experiment 3 (optional feedback), participants seldom requested feedback to avoid the feedback penalty, resulting in performance levels akin to those observed in Experiment 1. Experiment 4 (sub-goal) introduced a unique approach, where participants were exclusively exposed to the sub-goal configuration (1110 or 2220) crucial for reaching the desired target state. In the absence of evaluative feedback, this method resembled the conditions of Experiment 1, where the sub-goal can effectively be thought of as a target state until the sub-goal state is reached. We hypothesize that supplying solely sub-goal configurations without evaluative feedback may induce confusion, as participants may now consider two target states simultaneously: the sub-goal and the target state. Consequently, participants in Experiment 4 exhibited a marginal decrease in performance compared to those in Experiment 1.

In Fig 4b , we present box plots illustrating the percentage scores achieved in the transfer tasks involving the 5-disk ToH. It’s important to note that solving the 5-disk ToH, with its 243 states, presents a significantly greater challenge compared to the training task, which involved the 4-disk ToH with 81 states. Furthermore, participants had no prior experience with the 5-disk ToH and relied solely on their training with the 4-disk ToH. Consequently, the transfer tasks yielded relatively lower scores, with many trials failing to solve the puzzle within the allotted number of moves, which can make it challenging to interpret the box plots in Fig 4b .

To focus on successful outcomes, we filtered for positive percentage scores in each experiment, representing the trials where participants successfully solved the ToH puzzle. Table 1 provides an overview of the percentage of successful trials for each experiment, both in the training and transfer tasks. Notably, Experiment 2 and Experiment 5 demonstrated a substantial improvement in successful trials, showing increases of 33.5% and 36%, respectively, compared to Experiment 1 in the training tasks. In the transfer tasks, Experiments 2 and 5 also showed notable improvements, with success rates increasing by 13% and 26%, respectively, compared to Experiment 1.

https://doi.org/10.1371/journal.pone.0303949.t001

To assess the statistical significance of these findings, we conducted a two-sample t-test comparing the results of Experiments 2 and 5 with the data from Experiment 1. Remarkably, the p values for Experiment 2 (in comparison to Experiment 1) and Experiment 5 (relative to Experiment 1) are 1.59 × 10^−12 and 1.71 × 10^−17, respectively, in the training tasks, indicating highly significant differences. In the transfer tasks, the p values are 3.9 × 10^−2 and 7.17 × 10^−4 for Experiments 2 and 5 compared to Experiment 1, respectively. Consistent with the commonly accepted significance level of 0.05, a p value below this threshold leads us to reject the null hypothesis, indicating that the data from the two experiments do not arise from the same distribution at a 5% significance level. These results underscore the substantial impact of evaluative feedback on performance, both in the training and transfer tasks.
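For readers who want to reproduce this kind of comparison, a two-sample t-test is a one-liner with SciPy; the arrays below are placeholder scores, not the study's data (which are in the authors' repository).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
exp1_scores = rng.normal(55, 15, size=200)   # placeholder: no-feedback scores
exp2_scores = rng.normal(75, 12, size=200)   # placeholder: numeric-feedback scores

t_stat, p_value = stats.ttest_ind(exp2_scores, exp1_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")  # reject H0 at the 5% level if p < 0.05
```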

Fig 5a and 5b display box plots representing successful trials after filtering for positive scores. Notably, the medians of these box plots closely align with each other, suggesting that participants’ performances in the experiments can be effectively compared solely through the percentage of successful trials. Once participants have successfully learned to solve the ToH puzzle, their scores exhibit relatively little variation across experiments during successful trials. This observation highlights the stability and consistency of participants’ performance once they have mastered the task.

https://doi.org/10.1371/journal.pone.0303949.g005

Recall that each participant completed 10 trials of training and 5 trials of transfer tasks. In Fig 6a and 6b , bar plots represent the mean percentage scores for different trials in the training and transfer tasks, respectively. It’s evident that participants who received no feedback exhibited relatively low scores compared to those who received either numeric feedback or sub-goals with numeric feedback. Furthermore, while there is no consistent improvement over the trials for participants who did not receive feedback, participants who received evaluative feedback demonstrated performance enhancement with increasing scores across trials. Similar trends are observable in the transfer tasks, indicating that participants who received evaluative feedback found it easier to transfer their skills to related tasks and showed improvement across trials.

https://doi.org/10.1371/journal.pone.0303949.g006

The results in Table 1 underscore significant improvements in human decision-making attributed to evaluative feedback during training tasks, along with effective skill transfer to related tasks. We employ maximum entropy IRL [ 34 ] to investigate the pivotal role of evaluative feedback in shaping human decision-making, as detailed in Sections 4.2 and 4.3. To enable this analysis, we conducted additional data collection sessions with 20 participants each, encompassing experiments devoid of feedback (Experiment 1) and those involving evaluative feedback (Experiments 2 and 5).

4.2 Human rewards under evaluative feedback

In this section, we treat humans solving the ToH puzzle as noisy optimal agents striving for optimal play with some implicit reward structure. We examine participants from three sets of experiments: (a) No feedback (Experiment 1), (b) Numeric feedback (Experiment 2), and (c) Sub-goal with numeric feedback (Experiment 5). To gain insights into human learning under these varying feedback conditions, we employ maximum entropy IRL analysis to uncover the underlying human reward structures. Visualizing these human rewards can offer valuable insights into the learning process with and without evaluative feedback.

Fig 7a and 7b display the learned IRL rewards in the training tasks for all states. While IRL typically assumes expert demonstrations, it’s important to note that participants may still be learning the task during the initial trials. Since the performance does not vary significantly in the latter half of the trials (see Fig 6a ), we assume that the human rewards are relatively stationary from trials 6 to 10 and, therefore, exclusively utilize these trials for our IRL analysis. From these latter trajectories, we derive IRL rewards, considering both (a) all available trajectories and (b) only the successful ones, where success is defined by reaching the target state.

IRL plots displaying learned human rewards in the training tasks for all states, using trajectory datasets (from trials 6-10 for each participant) from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. The red color represents high rewards close to 1 and dark blue represents close to 0 reward.

https://doi.org/10.1371/journal.pone.0303949.g007

Each of these plots is organized into a grid with 2 rows and 3 columns. The top row represents trajectories with the target state in triangle T2, while the bottom row represents trajectories with the target state in triangle T3. The columns correspond to the three sets of experiments: no feedback, numerical feedback, and sub-goal with numerical feedback, arranged from left to right.

In Fig 7a, it becomes apparent that participants' rewards in the experiment with no feedback (first column) exhibit a distribution across all states, encompassing both T2 and T3, despite the target state's placement in T2 for the first row and in T3 for the second row. The occurrence of high rewards in T3 (respectively T2) when the target state resides in T2 (respectively T3) primarily stems from the unsuccessful attempts to solve the ToH puzzle in each experiment. Consequently, we observe that as participants' performance improves across experiments from left to right, rewards increase within the triangle containing the target state while decreasing in the opposing triangle. Another noteworthy observation is the presence of high rewards at the critical states (vertices of the target triangle), which serve as pivotal entry points to the target triangle. These rewards become more pronounced as performance improves from left to right.

Fig 7b depicts the learned IRL rewards derived exclusively from successful trajectories in each experiment. Due to the absence of failed trajectories in each experiment, the disparities in IRL rewards across experiments, from left to right, become less pronounced. In each experiment, states within the target triangle and critical states exhibit higher rewards compared to the opposing triangles. In Experiments 1 and 2, the elevated rewards along the edge in the opposite triangle, which is closer to the target triangle, suggest that participants in these experiments occasionally complete the puzzle by opting for suboptimal routes. In contrast, participants in Experiment 5 predominantly solve the puzzle utilizing the optimal trajectory.

Fig 8a and 8b present the learned IRL rewards for all states within the transfer tasks, utilizing trajectory datasets that encompass (a) all available trajectories and (b) only successful trajectories. It is important to note that the transfer tasks pose significant challenges, with none of the participants receiving any feedback. Consequently, the trajectories for the transfer tasks in each experiment comprise numerous failed trajectories.

IRL plots displaying learned human rewards in the transfer tasks for all states, using trajectory datasets from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. The red color represents high rewards close to 1 and dark blue represents close to 0 reward.

https://doi.org/10.1371/journal.pone.0303949.g008

However, a noticeable trend emerges: participants from Experiment 5, who were trained using sub-goals with numeric feedback, exhibit faster learning in solving the transfer tasks compared to participants from Experiments 1 and 2, who received no feedback and only numerical feedback, respectively. This is evident from the higher rewards within the target triangle and lower rewards in the opposite triangle for Experiment 5. When considering only successful trajectories to derive the IRL rewards in Fig 8b , the differences across experiments become less pronounced due to the exclusion of failed trajectories in all experiments.

The results presented in Figs 7 and 8 offer valuable insights into how humans acquire puzzle-solving skills under various evaluative feedback strategies. However, it's worth noting that the learned rewards appear less sparse due to the predefined features, which permit non-zero rewards in all states. Consequently, while these learned IRL rewards for all states offer insights into critical states, they can complicate the comparison between experiments. Furthermore, reward functions in RL are often sparse. To this end, we modify the predefined features to encourage sparser rewards, allowing non-zero rewards in only 8 states for both the training and transfer tasks. These 8 states were selected as the vertices of the smaller triangles within the state space. In Fig 2, these states correspond to 2200, 1100, 1110, 2220, 0012, 2212, 1121, 0021.
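A sparse feature design of this kind is straightforward to express: one indicator feature per selected state, which forces the learned reward to zero everywhere else. A brief sketch, with state strings following the s_0 s_1 s_2 s_3 encoding from Sec. 2 and variable names of our choosing:

```python
from itertools import product
import numpy as np

# One indicator feature per selected state; all other states get zero reward.
selected = ["2200", "1100", "1110", "2220", "0012", "2212", "1121", "0021"]
all_states = ["".join(digits) for digits in product("012", repeat=4)]

features = np.zeros((len(all_states), len(selected)))
for j, s in enumerate(selected):
    features[all_states.index(s), j] = 1.0
```

Such a feature matrix can then be passed to a maximum entropy IRL routine (for example, the sketch in Sec. 2.2) so that only these 8 states can receive non-zero reward.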

Fig 9a and 9b illustrate the learned IRL rewards for a specific subset of 8 states during the training tasks. These rewards are derived from trajectory datasets obtained from the latter half of the trials (trials 6 to 10) for each participant. We consider two scenarios: (a) using all available trajectories and (b) using only the trajectories that resulted in successful task completion. It is evident that participants from Experiment 5 demonstrate non-zero rewards exclusively within the target triangle and the corresponding critical states. As we progress from left to right, the non-zero rewards in the opposite triangle diminish due to fewer instances of failure. These differences become less pronounced when we solely consider successful trajectories in Fig 9b .

IRL plots displaying learned human rewards in the training tasks for a subset of 8 states, using trajectory datasets (from trials 6-10 for each participant) from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. The red color represents high rewards close to 1 and dark blue represents close to 0 reward.

https://doi.org/10.1371/journal.pone.0303949.g009

Fig 10a and 10b depict the learned IRL rewards for a selected subset of 8 states within the transfer tasks, using trajectory datasets that encompass (a) all available trajectories and (b) only successful trajectories. While the distinctions are somewhat less pronounced due to the presence of numerous failure attempts in all experiments, the lower rewards in the opposite triangle indicate swifter learning when participants are trained with feedback, in contrast to participants who receive no feedback. These differences become less noticeable when we exclusively consider successful trajectories in Fig 10b , effectively eliminating most of the non-zero rewards in the opposite triangle.

IRL plots displaying learned human rewards in the transfer tasks for a subset of 8 states, using trajectory datasets from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. The red color represents high rewards close to 1 and dark blue represents close to 0 reward.

https://doi.org/10.1371/journal.pone.0303949.g010

The results of the max entropy IRL analysis underscore the significance of critical states and demonstrate how human learning in sequential decision-making tasks can be organized more effectively when evaluative feedback is provided, in contrast to participants solely learning through exploration without any feedback. The results further indicate that the participants trained with evaluative feedback exhibit an ability to transfer their learning to newer, related, and more demanding tasks at a significantly accelerated pace compared to those who learn without feedback.

4.3 Modeling human decision-making under evaluative feedback

In Sec. 4.1 and 4.2, we have demonstrated the pivotal role of evaluative feedback in enhancing learning and performance within the context of the ToH puzzle. In this section, we delve into exploring models that aim to elucidate the mechanisms through which humans integrate evaluative feedback into their decision-making processes.

It is important to note that this numeric feedback is determined based on the change in state value before and after the state transition. Consequently, it is intrinsically tied to the target state, given that the value function is contingent upon the target state.

Since the target state is randomized within triangles T2 and T3, we further segment these triangles into three sub-triangles each. This subdivision allows us to categorize the experimental data into six distinct groups, based on the location of the target state within these six sub-triangles. Within each group, we select the top vertex of the sub-triangle as the designated target state and truncate the trajectories at the point at which they first enter the target sub-triangle.

https://doi.org/10.1371/journal.pone.0303949.t002

Table 3 presents the AIC and BIC values (normalized by the number of observations) for different models within each group when non-zero rewards are allowed for only a subset of 8 states. This setting represents a more realistic scenario with sparse rewards. Notably, in this context, Model 2 consistently emerges as the best fit according to both the AIC and BIC criteria. This suggests that humans tend to interpret evaluative feedback as a strong indicator of the long-term effectiveness of their strategic actions.
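For reference, the two criteria are AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of parameters, n the number of observations, and L the maximized likelihood; lower values indicate a better fit, and the paper reports both normalized by n. A small sketch with made-up numbers:

```python
import numpy as np

def aic_bic(log_likelihood, n_params, n_obs):
    """Standard information criteria for comparing the candidate models,
    normalized by the number of observations as in the paper's tables."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * np.log(n_obs) - 2 * log_likelihood
    return aic / n_obs, bic / n_obs

# Hypothetical values purely to illustrate the comparison; lower is better.
print(aic_bic(log_likelihood=-1250.0, n_params=8, n_obs=600))
```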

https://doi.org/10.1371/journal.pone.0303949.t003

Remark 3 Even though Model 2 stands out as the preferred model according to both AIC and BIC criteria, there is some evidence of support for Model 4 as well (in the case of non-sparse rewards). This suggests that there might be instances where some individuals do not primarily learn through interaction but instead focus on maximizing their evaluative feedback directly. Such individuals could potentially encounter challenges in transfer tasks where evaluative feedback is not available.

4.4 Broader implications of results

Human learning and the acquisition of problem-solving skills in sequential decision-making tasks have broad implications. They can assist in cognitive rehabilitation post-injuries or strokes, enhance mathematical reasoning and STEM skill development in children, and improve performance in sports. However, mastering these skills is often challenging due to the cognitive demands of continuous decision-making. Our work introduces a systematic approach to designing advanced AI-driven tutoring systems to foster human learning in sequential decision-making tasks. As shown in Section 4.1, fostering human learning with AI-generated feedback not only promotes skill development but also facilitates the transfer of learned skills to more complex tasks. Additionally, as evidenced in Section 4.2, learning through evaluative feedback creates a more structured and organized learning experience compared to learning without feedback. Hence, these AI-based tutoring systems can improve the problem-solving skills and cognitive capabilities of the individuals while improving their learning experience.

Our findings in Section 4.3 suggest that humans perceive feedback as an indicator of the long-term effectiveness of their strategic actions. This insight can be utilized to influence human decision-making through the appropriate design of IoT devices. Specifically, by crafting feedback strategies geared towards fostering long-term behavioral enhancements, we can effectively influence individuals’ long-term actions and decision-making processes.

5 Conclusions

In this work, we study the influence of AI-generated evaluative feedback on human decision-making, with a specific focus on sequential decision-making tasks exemplified by the Tower of Hanoi. Our study demonstrates that individuals who receive training with evaluative feedback not only experience significant improvements in their decision-making abilities but also excel in transferring these enhanced skills to similar tasks. Through an analysis utilizing the maximum entropy inverse reinforcement learning framework, we show that human learning exhibits a more structured and organized implicit reward pattern when evaluative feedback is provided during the training process. This highlights the critical role played by AI-generated feedback in improving the cognitive and strategic abilities of individuals.

Furthermore, our investigation explores various models to better comprehend how humans integrate feedback into their decision-making processes. Our findings provide substantial evidence suggesting that individuals tend to interpret evaluative feedback as a valuable indicator of the long-term effectiveness of their strategic actions. This valuable insight can be leveraged to design intelligent IoT devices, capable of enriching human learning experiences and shaping human decision-making.

  • 3. P. Gupta and V. Srivastava, "Optimal fidelity selection for human-in-the-loop queues using semi-Markov decision processes," American Control Conference, pp. 5266–5271, 2019.
  • 5. P. Gupta, S. D. Bopardikar, and V. Srivastava, "Achieving efficient collaboration in decentralized heterogeneous teams using common-pool resource games," 58th Conference on Decision and Control, pp. 6924–6929, IEEE, 2019.
  • 6. B. M. McLaren, R. Kenneth, M. Schneider, A. Harrer, and L. Bollen, "Bootstrapping novice data: Semi-automated tutor authoring using student log files," Proceedings of the Workshop on Analyzing Student-Tutor Interaction Logs to Improve Educational Outcomes, Seventh International Conference on Intelligent Tutoring Systems, pp. 1–10, Aug. 2004.
  • 8. M. C. Gombolay, R. Jensen, J. Stigile, S.-H. Son, and J. A. Shah, "Learning to tutor from expert demonstrators via apprenticeship scheduling," The AAAI-17 Workshop on Human-Machine Collaborative Learning, pp. 664–669, 2017.
  • 9. M. K. Rahman, S. Sanghvi, and N. El-Moughny, "Enhancing an automated Braille writing tutor," International Conference on Intelligent Robots and Systems, pp. 2327–2333, 2009.
  • 10. P. Gupta and V. Srivastava, "Optimal fidelity selection for improved performance in human-in-the-loop queues for underwater search," arXiv preprint arXiv:2311.06381, 2023.
  • 13. P. Gupta, "Optimal & Game Theoretic Feedback Design for Efficient Human Performance in Human-Supervised Autonomy," PhD thesis, Michigan State University, 2023.
  • 15. M. W. Lewis, R. Milson, and J. R. Anderson, "The teacher's apprentice: Designing an intelligent authoring system for high school mathematics," Artificial Intelligence and Instruction: Applications and Methods, pp. 269–301, Addison-Wesley Publishing Company, 1987.
  • 17. P. Gupta and V. Srivastava, "On robust and adaptive fidelity selection for human-in-the-loop queues," European Control Conference, pp. 872–877, 2021.
  • 18. D. Bertsekas, Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, 2012.
  • 19. M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.
  • 20. P. Gupta and V. Srivastava, "Deterministic sequencing of exploration and exploitation for reinforcement learning," 61st Conference on Decision and Control, pp. 2313–2318, IEEE, 2022.
  • 28. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Second Edition. MIT Press, Nov. 2018.
  • 30. C. Szepesvári, Algorithms for Reinforcement Learning. Springer Nature, 2022.
  • 31. A. Y. Ng and S. J. Russell, "Algorithms for inverse reinforcement learning," Proceedings of the Seventeenth International Conference on Machine Learning, pp. 663–670, 2000.
  • 32. M. Lopes, F. Melo, and L. Montesano, "Active learning for reward estimation in inverse reinforcement learning," Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 31–46, Springer, 2009.
  • 35. B. D. Ziebart, "Modeling purposeful adaptive behavior with the principle of maximum causal entropy," PhD thesis, Carnegie Mellon University, 2010.
  • 36. W. B. Knox and P. Stone, "Interactively shaping agents via human reinforcement: The TAMER framework," International Conference on Knowledge Capture, pp. 9–16, 2009.
  • 39. W. B. Knox and P. Stone, "Learning non-myopically from human-generated reward," International Conference on Intelligent User Interfaces, pp. 191–202, 2013.

Deep Reinforcement Learning for Solving Vehicle Routing Problems With Backhauls

Combining Reinforcement Learning and Tensor Networks, with an Application to Dynamical Large Deviations

Affiliations.

  • 1 School of Physics and Astronomy, University of Nottingham, Nottingham NG7 2RD, United Kingdom.
  • 2 Centre for the Mathematics and Theoretical Physics of Quantum Non-Equilibrium Systems, University of Nottingham, Nottingham NG7 2RD, United Kingdom.
  • 3 Department of Physics and Astronomy, University College London, Gower Street, London WC1E 6BT, United Kingdom.
  • PMID: 38804929
  • DOI: 10.1103/PhysRevLett.132.197301

We present a framework to integrate tensor network (TN) methods with reinforcement learning (RL) for solving dynamical optimization tasks. We consider the RL actor-critic method, a model-free approach for solving RL problems, and introduce TNs as the approximators for its policy and value functions. Our "actor-critic with tensor networks" (ACTeN) method is especially well suited to problems with large and factorizable state and action spaces. As an illustration of the applicability of ACTeN we solve the exponentially hard task of sampling rare trajectories in two paradigmatic stochastic models, the East model of glasses and the asymmetric simple exclusion process, the latter being particularly challenging to other methods due to the absence of detailed balance. With substantial potential for further integration with the vast array of existing RL methods, the approach introduced here is promising both for applications in physics and to multi-agent RL problems more generally.

A Reinforcement Learning Method for Solving the Production Scheduling Problem of Silicon Electrodes

  • Conference paper
  • First Online: 30 July 2023

  • Yu-Fang Huang 13 ,
  • Rong Hu 13 , 14 ,
  • Xing Wu 14 ,
  • Bin Qian 13 &
  • Yuan-Yuan Yang 13  

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14086)

Included in the following conference series:

  • International Conference on Intelligent Computing

In this paper, a new silicon electrode production (SEP) scheduling problem arising from the silicon electrode manufacturing procedure is discussed. We model the problem as two coupled subproblems: a silicon rod cutting scheduling problem on parallel machines, and a silicon electrode product scheduling problem in a hybrid flowshop. We present a reinforcement learning (RL) method to address this SEP problem. The RL method uses the Q-learning algorithm to autonomously select heuristics from a pre-designed set of low-level heuristics (LLHs); the selected heuristic is then applied to search the solution space for better results. Considering the "single-batch" coupling relationship in the manufacturing process, a two-stage encoding strategy is used for the subproblems, and a corresponding decoding mechanism is designed for the silicon rod cutting stage and the silicon electrode production processing stage. Experimental results on newly introduced examples demonstrate that the suggested method effectively competes with cutting-edge algorithms.
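The abstract does not include the paper's operators or scoring details, so the following is only a hedged sketch of the general Q-learning hyper-heuristic pattern it describes: the agent's actions are low-level heuristics applied to an incumbent schedule, the state is the last heuristic applied, and the reward is the resulting cost improvement. The `llh` and `evaluate` arguments are placeholders for the problem-specific operators and objective.

```python
import random

def q_learning_hyper_heuristic(init_solution, llh, evaluate,
                               episodes=200, alpha=0.1, gamma=0.9, eps=0.2):
    """Q-learning hyper-heuristic sketch: select a low-level heuristic (LLH),
    apply it to the incumbent solution, and reward cost improvement."""
    n = len(llh)
    q = [[0.0] * n for _ in range(n)]          # state = index of the last LLH applied
    best = incumbent = init_solution
    state = 0
    for _ in range(episodes):
        if random.random() < eps:              # explore
            action = random.randrange(n)
        else:                                  # exploit the best-known LLH
            action = max(range(n), key=lambda a: q[state][a])
        candidate = llh[action](incumbent)
        reward = evaluate(incumbent) - evaluate(candidate)   # positive if cost dropped
        q[state][action] += alpha * (reward + gamma * max(q[action]) - q[state][action])
        if reward > 0:
            incumbent = candidate              # accept improving moves
        if evaluate(candidate) < evaluate(best):
            best = candidate
        state = action
    return best

# Toy usage: the "schedule" is just a number, two LLHs nudge it up or down,
# and the cost to minimize is its square.
print(q_learning_hyper_heuristic(
    init_solution=10.0,
    llh=[lambda x: x + 1.0, lambda x: x - 1.0],
    evaluate=lambda x: x * x,
))
```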

Acknowledgement

This research was supported by the National Natural Science Foundation of China (61963022 and 62173169) and the Basic Research Key Project of Yunnan Province (202201AS070030).

Author information

Authors and affiliations.

School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China

Yu-Fang Huang, Rong Hu, Bin Qian & Yuan-Yuan Yang

Faculty of Mechanical and Electrical Engineering, Kunming University of Science and Technology, Kunming, 650500, China

Rong Hu & Xing Wu

Corresponding author

Correspondence to Rong Hu .

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Citation: Huang, Y.-F., Hu, R., Wu, X., Qian, B., Yang, Y.-Y. (2023). A Reinforcement Learning Method for Solving the Production Scheduling Problem of Silicon Electrodes. In: Huang, D.-S., Premaratne, P., Jin, B., Qu, B., Jo, K.-H., Hussain, A. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol 14086. Springer, Singapore. https://doi.org/10.1007/978-981-99-4755-3_36

COMMENTS

  1. Reinforcement Learning Made Simple (Part 2): Solution Approaches

    'Solving' a Reinforcement Learning problem basically amounts to finding the Optimal Policy (or Optimal Value). There are many algorithms, which we can group into different categories: model-based vs. model-free. Very broadly, solutions are either model-based (aka planning) or model-free (aka reinforcement learning).

  2. PDF Reinforcement Learning: An Introduction

    1 The Reinforcement Learning Problem 1 ... approximation, policy-gradient methods, and methods designed for solving off-policy learning problems. Part IV surveys some of the frontiers of reinforcement learning in biology and applications. This book was designed to be used as a text in a one- or two-semester

  3. Reinforcement Learning Made Simple (Part 1): Intro to Basic Concepts

    You've probably started hearing a lot more about Reinforcement Learning in the last few years, ever since the AlphaGo model, which was trained using reinforcement-learning, stunned the world by beating the then reigning world champion at the complex game of Go. ... and how to apply an RL problem-solving framework to it using techniques from ...

  4. An Introduction to Reinforcement Learning

    More broadly, the marriage of reinforcement learning and artificial neural networks is termed deep reinforcement learning. These models incorporate the strengths of deep learning within reinforcement learning techniques. The most popular of these algorithms include the Deep Q-Networks (DQN), which were introduced by DeepMind in 2013. This ...

  5. Learning to Optimize with Reinforcement Learning

    In essence, an optimizer trained using supervised learning necessarily overfits to the geometry of the training objective functions. One way to solve this problem is to use reinforcement learning. Background on Reinforcement Learning. Consider an environment that maintains a state, which evolves in an unknown fashion based on the action that is ...

  6. Intelligent problem-solving as integrated hierarchical reinforcement

    We address this question from the perspective of reinforcement learning (RL) [6, 7, 8]. Several studies suggest that RL is biologically and cognitively plausible [8, 9]. Many existing RL-based methods are ...

  7. What is Reinforcement Learning in AI?

    Reinforcement learning is a flexible problem-solving approach that can improve performance when used in conjunction with additional machine learning techniques, like deep learning. Disadvantages of Reinforcement Learning. There are better choices than reinforcement learning for solving simple problems

  8. Reinforcement learning

    Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward.Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

  9. Introduction to Optimal Control and Reinforcement Learning

    Reinforcement learning was introduced in Sect. 1.3.1 as an approach to solving optimal control problems without requiring full model information. The idea behind this approach is to approximate the solution of the HJB equation by performing some functional approximations using tools such as neural networks .

  10. Monte Carlo Methods in Reinforcement Learning

    Monte Carlo Methods in Reinforcement Learning# Monte Carlo (MC) methods are ways of solving the reinforcement learning problem based on averaging sample returns. Here, we define Monte Carlo methods only for episodic tasks. Or in other words, they learn from complete episodes of experience. ... Problem Consider driving a race car around a turn ...

  11. Reinforcement Learning and Stochastic Optimization: A unified framework

    Reinforcement Learning and Stochastic Optimization: ... modeling, computation, theory, problem solving, problems drawn from the companion volume Sequential Decision Analytics and ... (PFAs consist of any function class covered in chapter 3 for online learning), and describe four methods for performing a search over tunable parameters spanning ...

  12. Challenges of real-world reinforcement learning: definitions ...

    Reinforcement learning (RL) (Sutton and Barto 2018) is a powerful algorithmic paradigm encompassing a wide array of contemporary algorithmic approaches (Mnih et al. 2015; Silver et al. 2016; Hafner et al. 2018).RL methods have been shown to be effective on a large set of simulated environments (Mnih et al. 2015; Silver et al. 2016; Lillicrap et al. 2015; OpenAI 2018), but uptake in real-world ...

  13. PDF Reinforcement Learning to Solve NP-hard Problems: an Application to the

    to solve routing problems with Reinforcement Learning. 1.2 Research Objectives Solving combinatorial optimization problems with the Reinforcement Learning (RL) framework is a fairly new trend compared to the decades of research and the hundreds of algorithms developed for exact and heuristics methods to solve this type of problem.

  14. Reinforcement Learning to Solve NP-hard Problems: an Application to the

    In this paper, we evaluate the use of Reinforcement Learning (RL) to solve a classic combinatorial optimization problem: the Capacitated Vehicle Routing Problem (CVRP). We formalize this problem in the RL framework and compare two of the most promising RL approaches with traditional solving techniques on a set of benchmark instances. We measure the different approaches with the quality of the ...

  15. PDF Reinforcement Learning in Control Theory: A New Approach to

    Reinforcement Learning in Control Theory: A New Approach to Mathematical Problem Solving Kala Agbo Bidi a, Jean-Michel Coron , Amaury Hayatb, and Nathan Lichtléb,c aLaboratoire Jacques-Louis Lions, Sorbonne Université, Université de Paris, CNRS, INRIA, équipe Cage, Paris, France ([email protected]).

  16. Learning Global Optimization by Deep Reinforcement Learning

    3.3 Multi-Task Reinforcement Learning. Multi-task reinforcement learning is a subfield of reinforcement learning that deals with the problem of learning to solve multiple sequential-decision tasks at once [10, 23]. It is related to L2O by Deep RL due to the characteristic of training a (Deep) RL agent in a distribution of similar tasks.

  17. Sensors

    Reinforcement learning (RL) has emerged as a dynamic and transformative paradigm in artificial intelligence, offering the promise of intelligent decision-making in complex and dynamic environments. This unique feature enables RL to address sequential decision-making problems with simultaneous sampling, evaluation, and feedback. As a result, RL techniques have become suitable candidates for ...

  18. Reinforcement Learning Explained Visually (Part 3): Model-free

    This is the third article in my series on Reinforcement Learning (RL). Now that we understand what an RL Problem is, and the types of solutions available, we'll now learn about the core techniques used by all solutions. ... and how to apply an RL problem-solving framework to it using techniques from Markov Decision Processes and concepts such ...
