
Notes to strengthen the basics of reinforcement learning



Elements: actor (the part we control; it decides the behavior), environment, and reward (the environment and the reward function are beyond our control).

Main taxonomy: model-based (model the environment, so the actor can understand and plan with it) vs. model-free (policy-based, value-based); on-policy (the learning actor and the interacting actor are the same) vs. off-policy (the learning actor differs from the interacting actor); MC (episodic) vs. TD.

These notes mainly cover Policy Gradient, Q-learning, Actor-Critic, Sparse Reward, and Imitation Learning.

Policy Gradient

Consider an actor π interacting with the environment, collecting tuples (s, a, r, s'). π(a|s) is the policy we can control; after receiving the actor's action, the environment returns the next state s' and the reward r, and that part is beyond our control. r is generally a random variable, so we consider the total reward of a completed trajectory, R = Σ_t r_t (note that the variance of R is extremely large). Since R is random, we maximize its expectation: sample some trajectories with π and average their total rewards to obtain R̄.

Here the state transition probability p(s'|s, a) is given by the environment and cannot be controlled; all we can adjust is the policy π(a|s).

Use gradient ascent to maximize R̄. The parameter θ enters only through π(a|s), which finally yields a formula that can be estimated by sampling:

∇R̄_θ ≈ (1/N) Σ_n Σ_t R(τ^n) ∇ log π_θ(a_t^n | s_t^n)

The data from each trajectory can be used only once: after the actor is updated, the old data no longer matches the policy. In practice the update can be treated as a weighted classification problem s -> a with a cross-entropy loss whose per-sample weight is R; note that R itself does not need to be differentiable.

Tip 1: add a baseline. R may always be positive, which pushes up the probability of every sampled a. With enough samples this is not a problem, but with few samples it implicitly lowers the probability of actions that happen not to be sampled. Subtract a baseline b, generally a function of s (for example the mean of R), so that the weight R - b can be negative.

Tip 2: assign credit. Weighting every a in a trajectory by the whole-trajectory R is unreasonable; only count the rewards after time t, optionally with a discount rate. With enough samples this would also not be a problem, since the data could separate the effects of different actions. The weight R - b is the advantage function: measured under the model with parameters θ, it says how good a_t is relative to the other actions at s_t (this is what a Critic estimates). Finally:

∇R̄_θ ≈ (1/N) Σ_n Σ_t ( Σ_{t'=t}^{T_n} γ^{t'-t} r_{t'}^n - b ) ∇ log π_θ(a_t^n | s_t^n)
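Tips 1 and 2 can be sketched as a small helper that computes each step's weight. This is a minimal sketch with an invented reward sequence; the choice of γ and of the mean return as the baseline b are assumptions for illustration.

```python
# Tip 2: G_t = sum_{t'>=t} gamma^(t'-t) * r_t' (rewards after t only, discounted).
def returns_to_go(rewards, gamma=0.9):
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Tip 1: subtract a baseline b (here: the mean return) so weights can be negative.
def advantages(rewards, gamma=0.9):
    G = returns_to_go(rewards, gamma)
    b = sum(G) / len(G)
    return [g - b for g in G]

rewards = [0.0, 0.0, 1.0]                 # reward only at the last step
print(returns_to_go(rewards, gamma=0.9))  # earlier steps get discounted credit
```

With the mean-return baseline, the advantages sum to zero, so some sampled actions are pushed down even when all rewards are non-negative.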

================================================================


PPO (Proximal Policy Optimization)

Motivation: the policy gradient is on-policy. After each update the actor changes, the previously collected data can no longer be used, and we must interact with the environment again to collect new data.

Importance sampling: use samples from another distribution q to estimate the expectation of a function under a distribution p that we cannot sample from directly: E_{x~p}[f(x)] = E_{x~q}[f(x) p(x)/q(x)]. Note that p and q must not be too different: the estimator is unbiased, but its variance can be very large. With enough samples this problem disappears.
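The identity behind importance sampling can be checked exactly on a tiny discrete example. The distributions p, q and the function f below are made up; the point is that reweighting by p(x)/q(x) recovers E_p[f] exactly, while the spread of the weights hints at the variance problem.

```python
# Estimate E_{x~p}[f(x)] using probabilities from q, reweighted by p(x)/q(x).
p = {0: 0.1, 1: 0.9}    # target distribution we cannot sample from
q = {0: 0.5, 1: 0.5}    # proposal distribution we can sample from
f = {0: 10.0, 1: 1.0}   # function whose expectation we want

direct   = sum(p[x] * f[x] for x in p)                  # E_p[f]
weighted = sum(q[x] * f[x] * (p[x] / q[x]) for x in q)  # E_q[f * p/q]

print(direct, weighted)  # equal: the estimator is unbiased
# But the weights p/q (0.2 and 1.8 here) grow as p and q diverge,
# which inflates the variance of a finite-sample estimate.
```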

Objective: use data sampled with θ' to train the actor θ. During this process θ' is fixed, so the θ' data can be reused to train θ many times, increasing data utilization and training speed. This is OpenAI's default algorithm. Clearly, this turns the method off-policy.

Note: since θ' is the actor that interacts with the environment, the advantage function must be estimated from θ' data.

The gradient for the new objective then becomes (assuming the state distributions seen by the two actors are the same, p_θ(s) ≈ p_θ'(s)):

∇J ≈ E_{(s,a)~π_θ'} [ (π_θ(a|s) / π_θ'(a|s)) A^{θ'}(s, a) ∇ log π_θ(a|s) ]

Using ∇f(x) = f(x) ∇ log f(x), the objective function can be read back from this gradient: J^{θ'}(θ) = E_{(s,a)~π_θ'} [ (π_θ(a|s) / π_θ'(a|s)) A^{θ'}(s, a) ].

Constraint: since θ' and θ must not differ too much, a constraint is needed: either add -βKL(θ, θ') after the objective function (PPO), or impose KL(θ, θ') < δ as an external constraint (TRPO, the predecessor of PPO, which handles the constrained optimization less conveniently).

Note: a small change in parameters can produce a large change in behavior, so a distance measured on parameters is meaningless. KL(θ, θ') is therefore not a distance between parameters but a distance between behaviors: it asks that the distributions π(a|s) be close.

Tip: adaptive KL penalty, which tunes β. Define KL_min and KL_max. When KL > KL_max, the penalty is not strong enough, so increase β; when KL < KL_min, decrease it.

PPO2: control the difference between θ and θ' without computing KL at all, by clipping the probability ratio:

J^{θ'}(θ) ≈ Σ_{(s,a)} min( (π_θ(a|s)/π_θ'(a|s)) A, clip(π_θ(a|s)/π_θ'(a|s), 1-ε, 1+ε) A )

Explanation: when A > 0 (a is good), we want to increase π_θ(a|s), but not so much that it moves far from π_θ', which would make the off-policy estimate unreliable; when A < 0, the ratio is likewise not pushed far below 1.
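The per-sample clipped objective can be sketched in a few lines; ε = 0.2 and the ratio/advantage values are invented for illustration. The key property is that once the ratio leaves the [1-ε, 1+ε] band in the direction the advantage favors, the objective stops rewarding further movement.

```python
# PPO2 per-sample objective: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
def ppo2_term(ratio, advantage, eps=0.2):
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A > 0: a ratio above 1+eps earns no extra objective, so theta stops chasing it.
print(ppo2_term(1.5, advantage=1.0))
# A < 0: pushing the ratio below 1-eps is likewise not rewarded further.
print(ppo2_term(0.5, advantage=-1.0))
```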

================================================================


Q-learning

Value-based methods learn a Critic, which evaluates how good the current actor's behavior is.

State value function V_π(s): the expected cumulative reward from state s onward, given π. Crucially it depends on π: it measures this actor, not the state alone.

  1. MC (episodic): run π through whole episodes and estimate from the observed returns. Since r is a random variable and whole returns are summed, the variance of this method is large, but it is unbiased.
  2. TD: based on V_π(s_t) = r_t + V_π(s_{t+1}), so a complete episode is not needed each time. Since the V_π values used on the right are themselves estimates, the result may be inaccurate, but only a single r is involved, so the variance is small.
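The two estimators can be contrasted on a tiny tabular example. This is a sketch only; the states, the episode data, and the hyperparameters α and γ are all invented.

```python
gamma, alpha = 1.0, 0.5
V = {"s0": 0.0, "s1": 0.0}

# TD(0): after one transition (s_t, r_t, s_{t+1}),
# move V(s_t) toward the bootstrapped target r_t + gamma * V(s_{t+1}).
def td_update(V, s, r, s_next):
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

# MC: after a whole episode, move V(s_t) toward the observed return G_t.
def mc_update(V, s, G):
    V[s] += alpha * (G - V[s])

td_update(V, "s0", 1.0, "s1")   # low variance, but biased by the estimate V(s1)
mc_update(V, "s0", 2.0)         # unbiased, but G is a noisy whole-episode sum
print(V["s0"])
```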

State-action value function (Q function) Q_π(s, a): how good it is for π to take a in s. π does not necessarily take a; this only measures the situation if a were taken here.


  • Input s and a, output Q(s, a)
  • Input s, output Q(s, a) for every a; this form only works for discrete a

Goal: given Q_π, find a π' better than π, where "better" means V_π'(s) >= V_π(s) for all s.

Decision: π'(s) = argmax_a Q_π(s, a). There is no explicit policy; π' depends entirely on Q, and this form is not easy to solve for continuous a.

Appendix: proof that V_π'(s) >= V_π(s).

Since π'(s) = argmax_a Q_π(s, a), we have V_π(s) = Q_π(s, π(s)) <= max_a Q_π(s, a) = Q_π(s, π'(s)). Expanding Q_π(s, π'(s)) = E[r + V_π(s')] and applying the same inequality again at s', and so on, gives V_π(s) <= E[r + V_π(s')] <= E[r + r' + V_π(s'')] <= ... <= V_π'(s).

Essential tips

  1. Target network

The training data are transitions ...(s_t, a_t, r_t, s_{t+1})..., and the training target is

Q_π(s_t, a_t) <- r_t + Q_π(s_{t+1}, π(s_{t+1}))

Since the target changes after every update, training is very unstable.

So we use two networks: one is trained, the other (the right-hand side of the target) is the fixed target network, which stays frozen during training. After a certain number of updates, the target network's parameters are replaced with the trained network's.
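The target-network trick can be sketched on a tiny tabular example: TD targets are computed with a frozen copy Q_target, which is only synced to the trained Q every C updates. The one-state MDP, α, γ, and C below are all invented for illustration.

```python
import copy

gamma, alpha, C = 0.9, 0.5, 2
ACTIONS = ["a"]
Q = {("s0", "a"): 0.0, ("s1", "a"): 1.0}
Q_target = copy.deepcopy(Q)             # frozen copy used for targets

transitions = [("s0", "a", 1.0, "s1")] * 4
for step, (s, a, r, s_next) in enumerate(transitions, 1):
    # the target uses the frozen network, so it does not move while Q trains
    target = r + gamma * max(Q_target[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    if step % C == 0:                   # periodic sync: copy Q into Q_target
        Q_target = copy.deepcopy(Q)

print(Q[("s0", "a")])
```

Without the freeze, every update to Q would also move the target it is chasing, which is the instability described above.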

2. Exploration

The Q-learning policy depends entirely on Q: always selecting the a with maximal Q is bad for data collection. If some a is not sampled at the start and its Q stays small, that a will never be taken again, and Q(s, a) can never be estimated accurately.

Method: ε-greedy: with probability ε choose a at random; usually ε shrinks as learning progresses.

Boltzmann exploration: turn the Q values into probabilities and sample a from them, e.g. P(a|s) = exp(Q(s, a)) / Σ_a exp(Q(s, a)).
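Both exploration schemes fit in a few lines. The Q row and the temperature parameter below are made-up illustrations.

```python
import math, random

Q = {"left": 1.0, "right": 2.0, "stay": 0.0}   # invented Q row for one state

def epsilon_greedy(Q, eps, rng):
    if rng.random() < eps:                     # explore with probability eps
        return rng.choice(list(Q))
    return max(Q, key=Q.get)                   # otherwise exploit max-Q action

def boltzmann_probs(Q, temperature=1.0):
    """P(a) proportional to exp(Q(a)/T): high-Q actions likelier, never certain."""
    z = {a: math.exp(q / temperature) for a, q in Q.items()}
    total = sum(z.values())
    return {a: v / total for a, v in z.items()}

print(epsilon_greedy(Q, eps=0.0, rng=random.Random(0)))  # greedy pick
print(boltzmann_probs(Q))                                # every a keeps mass
```

Unlike ε-greedy, Boltzmann exploration keeps the exploration proportional to how promising each action currently looks.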

3. Replay Buffer

Put every experience ...(s_t, a_t, r_t, s_{t+1})... into a buffer; the experiences in the buffer may come from different actors, and the oldest experience is replaced when the buffer is full. During training, each batch is sampled from the buffer.

Result: the method becomes off-policy (the data may come from other actors).

Benefits: interaction with the environment is the time-consuming part of RL training, so reusing the data reduces the number of environment interactions; sampling also breaks the correlation within a batch, making training more stable (more diverse data). Since we train on single experiences rather than whole trajectories, being off-policy does not matter here.
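A minimal replay buffer sketch, assuming a fixed capacity with oldest-first eviction and uniform sampling (the capacity and the pushed transitions are invented):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)    # full buffer drops the oldest exp

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # uniform sampling of single transitions, not whole trajectories
        return rng.sample(list(self.buf), batch_size)

buf = ReplayBuffer(capacity=3)
for t in range(5):                            # 5 pushes into capacity 3
    buf.push(t, "a", 0.0, t + 1)
print(len(buf.buf))                           # only the newest 3 survive
```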

Advanced tips

4. Double DQN

The Q value is generally overestimated: if some a is overrated, the max in the target will choose that overrated a, so the target tends to be too large.

Use two functions: Q selects the action (the trained network) and Q' evaluates it (the target network), so the target becomes r + Q'(s', argmax_a Q(s', a)). If Q overestimates some a and selects it, Q' will likely still evaluate it accurately; conversely, if Q' overestimates some a, Q simply won't select it. This double decision alleviates the overestimation problem, and reusing the existing target and main networks means no extra parameters are needed.
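The decoupling can be shown with invented Q tables, where the target network Q' happens to overestimate one action:

```python
# Plain DQN target: max over the target network's own values.
def dqn_target(r, s_next, Q_t, gamma=0.9):
    return r + gamma * max(Q_t[s_next].values())

# Double DQN target: the trained Q selects, the target Q' evaluates.
def double_dqn_target(r, s_next, Q, Q_t, gamma=0.9):
    a_star = max(Q[s_next], key=Q[s_next].get)   # Q selects the action...
    return r + gamma * Q_t[s_next][a_star]       # ...Q' evaluates it

Q   = {"s1": {"a": 2.0, "b": 1.0}}   # trained net: prefers "a"
Q_t = {"s1": {"a": 1.0, "b": 5.0}}   # target net: overestimates "b"

print(dqn_target(0.0, "s1", Q_t))             # propagates the inflated 5.0
print(double_dqn_target(0.0, "s1", Q, Q_t))   # Q never selects the inflated "b"
```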

5. Dueling DQN

Change the architecture so that Q(s, a) = A(s, a) + V(s). Sometimes the Q value is nearly independent of a and depends only on s, for example a bad s from which no action helps. When the action barely affects the outcome, there is no need to estimate a separate q for every a (actions only matter at critical moments!).

Benefits: even actions never sampled in a state get updated through V(s), so data is used more efficiently and training is faster.
Note: to prevent the model from ignoring V and training only A, impose a constraint on A(s, a), e.g. that it sums to 0 over actions; in practice, add a normalization op on the last layer of A(s, a).
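The aggregation step can be sketched directly; the V and A numbers are invented, and mean-subtraction stands in for the normalization op mentioned above:

```python
# Dueling aggregation: Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)).
# Normalizing A to zero mean stops the net from shoving everything into A.
def dueling_q(V, A):
    mean_A = sum(A.values()) / len(A)
    return {a: V + (adv - mean_A) for a, adv in A.items()}

Q = dueling_q(V=2.0, A={"left": 1.0, "right": -1.0, "stay": 0.0})
print(Q)
# Updating V alone now shifts every Q(s,a) at once,
# including actions never sampled in this state.
```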

6. Prioritized Replay

Change the sampling distribution: increase the probability of sampling hard-to-train data, i.e. data with large TD error; the training process changes accordingly.

7. Multi-step

A trade-off between MC and TD that combines their strengths and weaknesses: MC has large variance; TD has small variance but inaccurate estimates. Adjusting N trades them off. Store N-step experiences in the replay buffer, and the target becomes

Q(s_t, a_t) <- Σ_{t'=t}^{t+N-1} γ^{t'-t} r_{t'} + max_a Q'(s_{t+N}, a)
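The N-step target is a short computation: sum N real rewards, then bootstrap from the target network at s_{t+N}. The rewards, γ, and the bootstrap value below are invented.

```python
# N-step target: real rewards for N steps, then one bootstrapped tail.
def n_step_target(rewards, q_bootstrap, gamma=0.9):
    """rewards = [r_t, ..., r_{t+N-1}] taken from the replay buffer."""
    target = sum(gamma ** i * r for i, r in enumerate(rewards))
    return target + gamma ** len(rewards) * q_bootstrap

print(n_step_target([1.0, 1.0, 1.0], q_bootstrap=10.0))
# N = 0 degenerates to pure bootstrapping (TD); large N approaches MC.
```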

8. Noisy Net (State-dependent Exploration)

In the original exploration (ε-greedy), noise is added to the action, which is unrealistic (given the same s, a different a is taken; a real π does not behave like this). Here, Gaussian noise is added to the network parameters instead: sample the noise once at the start of each episode, then keep the noisy net fixed while acting. The same s then yields the same a within an episode, which is more reasonable and explores the environment systematically (explore in a consistent way).

9. Distributional Q-function

The Q function outputs the expected cumulative reward. In fact, the returns obtained from taking a in s form a distribution, and different distributions can share the same expectation, so representing the return by its expectation alone loses information.

The original Q function outputs the Q (an expectation) of each a; now output the distribution of Q for each a. In practice, assume the return lies within a range, split that range into bins, and output the probability that the return of taking a in s falls into each bin. At test time, either choose the action with the largest expectation, or also consider the variance of the distribution to choose lower-risk actions. This method does not suffer from Q-value overestimation, because q is limited to a range from the start (it may instead underestimate).

10. Rainbow

All the above methods are combined!

Q-learning summary: more stable; only the Q function is needed to find a better π; easy to operate. But it does not handle continuous actions easily.

How to use Q-learning to handle the case of continuous a:

  1. Sample actions: sample N actions and choose the one with max Q(s, a).
  2. Gradient ascent: solve max_a Q(s, a) as an optimization problem; there are local optima, and choosing a amounts to training a network each time.
  3. Design the network so the maximization has a closed-form solution, e.g. output μ(s), a positive-definite Σ(s), and V(s), with Q(s, a) = -(a - μ(s))^T Σ(s) (a - μ(s)) + V(s).

In this way, the optimal a is μ(s).
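Option 3 can be checked in one dimension, where Σ(s) is just a positive scalar; the values of μ, Σ, and V below are invented. Since the quadratic term is non-positive and vanishes at a = μ(s), the argmax is available without any search.

```python
# NAF-style head in 1-D: Q(s,a) = -(a - mu) * sigma * (a - mu) + V, sigma > 0.
def q_value(a, mu, sigma, V):
    return -(a - mu) * sigma * (a - mu) + V

mu, sigma, V = 0.3, 2.0, 1.0
print(q_value(mu, mu, sigma, V))    # the maximum V, attained exactly at a = mu
print(q_value(0.8, mu, sigma, V))   # any other a scores strictly lower
```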

================================================================

Actor-Critic (A2C, A3C, PDPG)

The gradient in the original policy-based method is

∇R̄_θ ≈ (1/N) Σ_n Σ_t (G_t^n - b) ∇ log π_θ(a_t^n | s_t^n)

where G_t^n = Σ_{t'=t}^{T_n} γ^{t'-t} r_{t'}^n is the cumulative reward obtained by π from s_t to the end of the episode. G is a sum of random variables, so its variance may be large, and a sample-based estimate of it may be inaccurate.

Objective: estimate the expected value of the cumulative reward directly instead of from single samples, to make the estimate stable. Estimating this expected value is exactly a value-based method.

By the definition of the Q function, E[G_t] = Q_π(s_t, a_t), so G can be replaced by a Q learned with Q-learning.

Baseline: it should depend on s, so use V_π(s). Note that E_a[Q_π(s, a)] = V_π(s), so Q - V takes both positive and negative values.

This gives G - b ≈ Q_π(s, a) - V_π(s), but it requires two networks to predict Q and V, doubling the risk of inaccurate prediction.

Since Q_π(s_t, a_t) = E[r_t + V_π(s_{t+1})], we assume the expectation can be dropped: the only random variable is r_t, whose variance is certainly smaller than that of G, so the assumption is acceptable. Then Q can be computed from V alone, with no separate Q network to estimate, at the cost of a slight extra variance: Q_π(s_t, a_t) ≈ r_t + V_π(s_{t+1}).


The resulting weight, r_t + V_π(s_{t+1}) - V_π(s_t), is the advantage function.
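The advantage estimate used here is a one-liner; the V values, reward, and γ below are invented to show the sign behavior.

```python
# A2C advantage: A(s_t, a_t) ~ r_t + gamma * V(s_{t+1}) - V(s_t),
# so only a single network V is needed.
def advantage(r, V_s, V_s_next, gamma=0.9):
    return r + gamma * V_s_next - V_s

print(advantage(r=1.0, V_s=2.0, V_s_next=2.0))   # positive: better than expected
print(advantage(r=0.0, V_s=2.0, V_s_next=1.0))   # negative: worse than expected
```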

A2C: π first interacts and collects experience; then estimate V(s) (by TD or MC); then update π using the resulting advantages.


  1. Shared shallow layers: the two networks to be learned (actor and critic) can share parameters.

2. Use the output entropy of π(·|s) as a regularizer, so the action distribution stays spread out, which encourages exploration.

A3C: A2C trains slowly, so use multiple workers that each compute gradients and jointly update the parameters of a shared Global Network. Each worker copies parameters from the global network, interacts with the environment, computes gradients, and sends them back to update the global parameters. Note that by the time a worker's gradient arrives, the global parameters may no longer be the ones it started from.

PDPG (Pathwise Derivative Policy Gradient): from a different perspective, the Critic no longer only evaluates a but now guides π, telling it which a is good. It is also a way to make Q-learning handle continuous actions: the original argmax_a over Q is hard to solve for continuous a, so an actor is used to solve it.

Train Q first, then fix Q and train the actor to maximize Q. The architecture is the same as a GAN: the actor is the generator, Q is the discriminator, and the whole thing is a conditional GAN.

The techniques in Q-learning are available.

The whole algorithm: standard Q-learning (with target network and replay buffer) interleaved with actor updates that maximize Q.

Furthermore, the optimization techniques from the GAN literature can also be carried over to this setting.

================================================================

Advanced technique

Sparse Reward

The agent may receive no reward most of the time, so during training all actions look equally good.

Reward Shaping: the environment has its own fixed reward, but we deliberately design extra rewards to guide the actor toward the desired behavior, e.g. when the true reward is too far in the future for the actor to foresee. The designed reward is not necessarily the environment's real reward, just the behavior we want, and designing it generally requires domain knowledge.

Curiosity: Intrinsic Curiosity module (ICM)

Encourage the actor to take risks: the harder the next s is to predict, the larger the intrinsic reward, which increases exploration. But a hard-to-predict s is not necessarily good; it may just be a random, irrelevant part of the environment. So a feature extractor is introduced to filter out the irrelevant variables, and the ICM predicts in that feature space.

The features learned this way are exactly those relevant to the actions.

Curriculum Learning: plan the machine's learning, training in sequence from simple to difficult (not only for RL). For example, to recognize digits, first learn to distinguish 0 from 1, then learn to distinguish 0-9. A "teacher" is needed to design the curriculum.

Reverse Curriculum Generation: given a goal state, sample states s1 close to it -> interact from each s1 to see whether the goal state can be reached, obtaining a reward -> discard the extreme states (a reward that is too large means the case is too simple and already learned, and vice versa) -> sample s2 farther out around the moderate-reward s1, and repeat.

Hierarchical RL: there are agents at different levels; the upper level is responsible for proposing goals, and the lower-level agents are responsible for execution. A large task is decomposed into small tasks one by one.

Note: if a low-level agent cannot reach the proposed goal, the high-level agent is penalized, preventing it from proposing goals that are too hard; if the agent ends up reaching a wrong goal, relabel that wrong goal as if it had been the intended one, so the experience is still useful!

================================================================

Imitation Learning

Apprenticeship learning: learning by demonstration; the whole task has no reward. An expert demonstrates how to solve the problem.

  • The machine can interact with the environment but receives no explicit reward
  • For some tasks reward is difficult to define (e.g. driving, where it is hard to assign a numeric reward to every situation)
  • Some manually defined rewards can lead to uncontrolled behavior

Method: Behavior Cloning; Inverse Reinforcement Learning (inverse optimal control)

Behavior Cloning: supervised learning; collect expert (s, a) pairs and learn the mapping s -> a.

Potential problems:

  1. Experts provide only limited samples (experts are proficient and never encounter some extreme situations, where the machine then cannot make decisions).

Dataset Aggregation: put an expert in the machine's environment while the machine makes the decisions (collecting the expert's behavior in extreme situations); the machine still acts on its own, with the expert only labeling what they would do, which may "sacrifice" an expert each round.

  2. The machine may simply copy the expert's behavior, whether or not it is relevant (e.g. learning idiosyncratic habits unrelated to the task); since the machine's capacity is limited, it may end up learning only the irrelevant behavior (supervised learning treats all errors equally).
  3. Training data and test data mismatch.

In supervised learning we want the training set and the test set to have the same distribution.

But in BC the training data are (s, a) ~ π̂ (the expert), where the expert's actions influence the distribution of s (a characteristic of RL); the test data are (s', a') ~ π* (the actor cloning the expert). When π̂ = π*, the distributions match; when they differ, s and s' may differ greatly.

Inverse Reinforcement Learning (inverse optimal control): there are an environment, an actor, and expert demonstrations, but no reward function. Infer the reward function from the expert's behavior; then use RL to find the optimal actor.

Benefit: the reward function may be simple yet still give rise to the expert's complex behavior.


  1. Have an expert interact with the environment to obtain expert demonstrations.
  2. Have an actor π interact with the environment to obtain actor data.
  3. Infer a reward function under the principle that the expert's trajectories score higher than the actor's.
  4. Train the actor with RL under this reward function to obtain new data.
  5. With the new actor data and the expert data, update the reward function under the same principle; iterate until the learned reward function makes the actor score as high as the expert.
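The loop above can be sketched in a toy form with the linear reward mentioned below, r = w · φ(s, a): whenever the actor scores at least as high as the expert, nudge w toward the expert's trajectory features and away from the actor's. The features and the perceptron-style update rule are invented simplifications (a real loop would re-train the actor with RL after every reward update).

```python
def score(w, features):
    """Total reward of a trajectory under linear reward w · phi."""
    return sum(wi * fi for wi, fi in zip(w, features))

expert_phi = [1.0, 0.0]     # made-up trajectory feature counts
actor_phi  = [0.0, 1.0]

w = [0.0, 0.0]
for _ in range(5):
    if score(w, expert_phi) <= score(w, actor_phi):   # principle from step 3
        w = [wi + (e - a) for wi, e, a in zip(w, expert_phi, actor_phi)]
    # (step 4 would now re-train the actor under reward w and refresh actor_phi)

print(score(w, expert_phi) > score(w, actor_phi))
```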

Reward function: a linear reward guarantees convergence. Alternatively use a neural network: input a trajectory and output its total reward R, or input (s, a) and output r, summing the r's to obtain the final R.

Note: If the actor is a generator and the reward function is a discriminator, the entire framework is GAN!

Advantage: does not need much training data.

Applications: learning to drive from different experts, capturing each expert's individual style; training robots from human demonstrations (note the machine observes s from a first-person perspective); chatbots (maximum-likelihood training is equivalent to behavior cloning, which is not enough; SeqGAN is the method corresponding to IRL).


Third-Person IL: use domain-adversarial training + IL so the machine can learn by watching a person's behavior from a third-person perspective (while acting from its own first-person perspective).