Model predictive control combined with reinforcement learning for automatic vehicles applied in intelligent transportation systems

ABSTRACT


INTRODUCTION
The automatic parking system is one of the modern driver support services in the intelligent transportation system. The role of this system is to assist the driver in parking the vehicle safely and quickly [1], [2]. It therefore reduces the skill demands placed on the driver and helps prevent human-caused accidents such as collisions. Modern driver assistance technologies typically perform three basic steps: object (obstacle) detection, decision-making, and control [3]-[5]. The automatic parking system consists of three parts: parking environment recognition, route planning [6]-[8], and path tracking [9]-[11]. The parking system controller automatically obtains information about parking positions and obstacles through various sensors, such as ultrasonic sensors, cameras, wheel speed sensors, and steering angle sensors [12]-[15]. These sensors measure the vehicle's distance from obstacles, real-time visual data, the current vehicle speed, and the steering angle. From this multi-sensor information, the controller then decides whether the autonomous vehicle should stop or continue [16], [17]. Path planning, control, and monitoring constantly interact in the automatic car parking system. In particular, the path tracking control algorithm is one of the critical technologies of the automated parking system. This control algorithm must ensure the accuracy of path tracking, driving comfort when the vehicle changes direction, and the correct position and orientation of the vehicle at the end of the parking maneuver. Consequently, this automated parking technology has attracted many researchers, who have proposed related control algorithms in both theory and experiment.


The work in [18] proposed an automatic parking path control method that considers time delay, solving the problem that the control model of the traditional automated parking system does not account for the vehicle control delay. Another study [19] proposed a semi-automatic parking assistance system based on the driver navigation area, which recognizes environment information in real time through sensors and optimizes both the parking space and the parking routes to avoid collisions. In another study [20], a model-free fuzzy controller supporting automatic parking was proposed to track the parking path. In addition, improved approaches [21], [22] combine fuzzy control with neural networks. Such a controller only needs to know the parking configuration; the vehicle then tracks the path and parks correctly. These algorithms can control and track the vehicle's automatic path and park on demand. However, they cannot coordinate the vehicle speed and steering wheel control with changes of the parking path during the vehicle's movement.
Existing control solutions are therefore still limited in the accuracy of path tracking. Thus, this article proposes a solution applying a model predictive controller (MPC) that uses a vehicle dynamics model to predict how the vehicle will react to a particular control action within the prediction horizon. This behavior is similar to the way a driver understands and anticipates the behavior of the vehicle. To compute the optimal control moves, the MPC controller must consider all input and output constraints on the system, such as the speed limit, the safe following distance, the physical limits of the vehicle, the maximum steering angle, and the obstacles the controller must avoid [23]-[26]. This paper presents a controller design that combines MPC, which makes the car follow the reference path in the parking lot, with a reinforcement learning (RL) method trained to perform the parking maneuver. The MPC controller moves the vehicle continuously along the reference path while the algorithm searches for an empty parking spot. Once the MPC control algorithm has found the spot, the RL controller performs the parking maneuver. This hybrid controller performs simultaneous obstacle detection and avoidance in tight parking spaces without human intervention. The system uses an adaptive model predictive controller that updates both the prediction model and the mixed input and output constraints at each control interval. The correctness of the theory is proven through MATLAB simulation.
The article is organized in five parts. The first part introduces the study of automatic parking vehicle control. In the next part, the mathematical model of the car is given. Based on this mathematical model, an MPC controller combined with RL for vehicle movement and obstacle avoidance is designed in section 3. The correctness of the control solution is demonstrated through MATLAB simulation in section 4. Finally, the paper draws conclusions about the main features of the automatic parking solution and future research directions.

MATHEMATICAL MODEL OF THE CAR
In Figure 1, the article employs a rectangular automobile model with dimensions of 5 meters in length and 2 meters in width. The vehicle avoids obstacles with the help of a lidar sensor, which measures how far the car is from any obstacle in its lane and in front of it. Obstacles may be stationary, like a large pit, or moving, like a slowly driving car. The most frequent driver behavior is to briefly switch lanes, pass the obstruction, and then return to the original lane.
In Figure 1, the car coordinate model has four state variables: $x$ and $y$ are the coordinates of the center of the car on the x- and y-axes; $v$ is the speed of the vehicle; $\theta$ is the heading angle of the car (zero when facing east, counterclockwise positive). Two input variables act on the car: $T$ is the throttle (positive when accelerating, negative when decelerating) and $\delta$ is the steering angle (zero when aligned with the car, counterclockwise positive); $L$ is the length of the vehicle. The paper uses a simple nonlinear model to describe the car dynamics as (1):

$$\dot{x} = v\cos(\theta), \qquad \dot{y} = v\sin(\theta), \qquad \dot{\theta} = \frac{v}{L}\tan(\delta), \qquad \dot{v} = 0.5\,T \qquad (1)$$

Taking the Jacobian of the nonlinear state model at the operating point $(\bar{\theta}, \bar{v}, \bar{\delta})$ gives the linear prediction model (2):

$$A = \begin{bmatrix} 0 & 0 & -\bar{v}\sin\bar{\theta} & \cos\bar{\theta} \\ 0 & 0 & \bar{v}\cos\bar{\theta} & \sin\bar{\theta} \\ 0 & 0 & 0 & \tan\bar{\delta}/L \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ \bar{v}\left(\tan^{2}\bar{\delta} + 1\right)/L & 0 \\ 0 & 0.5 \end{bmatrix} \qquad (2)$$
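For concreteness, a minimal Python sketch of model (1) and its Jacobian linearization (2) follows; the function names and the state ordering [x, y, theta, v] are illustrative choices, not from the paper.

```python
import numpy as np

CAR_LENGTH = 5.0  # vehicle length L in meters, as given in Figure 1

def car_dynamics(state, control, L=CAR_LENGTH):
    """Nonlinear kinematics (1): state = [x, y, theta, v], control = [delta, T]."""
    x, y, theta, v = state
    delta, T = control  # steering angle and throttle
    return np.array([
        v * np.cos(theta),        # x_dot
        v * np.sin(theta),        # y_dot
        (v / L) * np.tan(delta),  # theta_dot
        0.5 * T,                  # v_dot
    ])

def jacobians(state, control, L=CAR_LENGTH):
    """Jacobian linearization (2) about an operating point: A = df/dstate, B = df/dcontrol."""
    x, y, theta, v = state
    delta, T = control
    A = np.array([
        [0.0, 0.0, -v * np.sin(theta), np.cos(theta)],
        [0.0, 0.0,  v * np.cos(theta), np.sin(theta)],
        [0.0, 0.0,  0.0,               np.tan(delta) / L],
        [0.0, 0.0,  0.0,               0.0],
    ])
    B = np.array([
        [0.0, 0.0],
        [0.0, 0.0],
        [v * (np.tan(delta)**2 + 1.0) / L, 0.0],
        [0.0, 0.5],
    ])
    return A, B
```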

MPC CONTROLLER DESIGN AND RL-PPO REINFORCEMENT LEARNING
3.1. MPC model predictive controller
The model predictive controller uses the object model and the input and output noise models to predict and estimate the state. The model structure used in the MPC controller is shown in Figure 2. The controller calculates the optimal control input by minimizing a cost function that penalizes deviations from the desired state trajectory. The predicted state is then used to update the control input in real time, allowing the controller to track the desired course accurately.
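The cost function referred to above is not printed in the paper; in the standard MPC formulation it takes the following quadratic form, where the weight matrices $Q$ and $R$ and the horizons $N_p$ (prediction) and $N_c$ (control) are assumed notation:

$$J(k) = \sum_{i=1}^{N_p} \left\| y(k+i) - r(k+i) \right\|_Q^2 + \sum_{i=0}^{N_c-1} \left\| \Delta u(k+i) \right\|_R^2$$

Minimizing $J$ subject to the input and output constraints yields the optimal input sequence, of which only the first move is applied at each control interval.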
The controller's prediction model has the discrete state-space form (3):

$$x_c(k+1) = A\,x_c(k) + B_u u(k) + B_v v(k) + B_d d(k)$$
$$y(k) = C\,x_c(k) + D_v v(k) + D_d d(k) \qquad (3)$$

where $u$ and $y$ are the input and output variables of the object; $A$, $B$, $C$, $D$ are state-space matrices with constant zero delay; $S_i$ is the diagonal matrix of input scale factors and $S_o$ is the diagonal matrix of output scale factors; $x_c$ is the state vector, which includes all delay states; the inputs consist of the manipulated variables $u(k)$, the measured input disturbances $v(k)$, and the unmeasured input disturbances $d(k)$; $y$ is the vector of output variables. The MPC controller requires $D_u = 0$; that is, direct feedthrough from any manipulated variable to any output of the controlled object is not allowed. State model (3) does not include the input and output noise, so the car state model is rewritten as (4), obtained by augmenting (3) with the noise models defined next.
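The receding-horizon mechanism can be illustrated with the following minimal Python sketch, which discretizes the linearized model, stacks the predictions over the horizon, and solves the unconstrained quadratic tracking problem; it omits the constraints and adaptive updates of the paper's controller, and all names and weights are assumptions.

```python
import numpy as np

def discretize(A, B, Ts):
    """First-order (Euler) discretization of the continuous linearized model."""
    n = A.shape[0]
    return np.eye(n) + Ts * A, Ts * B

def mpc_step(Ad, Bd, x0, x_ref, N=10, q=1.0, r=0.1):
    """One receding-horizon step: min sum ||x_k - x_ref||^2_Q + ||u_k||^2_R, no constraints."""
    n, m = Bd.shape
    # Stacked prediction X = F x0 + G U over the horizon N
    F = np.vstack([np.linalg.matrix_power(Ad, k + 1) for k in range(N)])
    G = np.zeros((N * n, N * m))
    for i in range(N):
        for j in range(i + 1):
            G[i*n:(i+1)*n, j*m:(j+1)*m] = np.linalg.matrix_power(Ad, i - j) @ Bd
    Q = q * np.eye(N * n)
    R = r * np.eye(N * m)
    Xref = np.tile(x_ref, N)
    # Normal equations of the quadratic cost: (G'QG + R) U = G'Q (Xref - F x0)
    U = np.linalg.solve(G.T @ Q @ G + R, G.T @ Q @ (Xref - F @ x0))
    return U[:m]  # receding horizon: apply only the first control move
```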

Input noise model
The input noise model is determined by (5):

$$x_{id}(k+1) = A_{id}\,x_{id}(k) + B_{id}\,w_{id}(k)$$
$$d(k) = C_{id}\,x_{id}(k) + D_{id}\,w_{id}(k) \qquad (5)$$

where $A_{id}$, $B_{id}$, $C_{id}$ are constant state matrices; $x_{id}(k)$ is the state vector of the input noise model when $k \geq 0$; $d(k)$ is the vector of the unmeasured input noise; $w_{id}(k)$ is the input noise vector with zero mean value when $k \geq 1$.
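As a small runnable illustration (the paper does not specify the disturbance matrices), the snippet below simulates (5) with assumed integrator dynamics, which turns the zero-mean noise $w_{id}$ into a random-walk unmeasured disturbance $d$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed integrator disturbance model: A_id = 1, B_id = 1, C_id = 1, D_id = 0
A_id, B_id, C_id, D_id = 1.0, 1.0, 1.0, 0.0

x_id, d = 0.0, []
for k in range(100):
    w = rng.normal()                  # zero-mean white noise w_id(k)
    d.append(C_id * x_id + D_id * w)  # unmeasured disturbance d(k)
    x_id = A_id * x_id + B_id * w     # state update -> random walk

print(d[:5])
```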

Measured noise model
The measured noise model is determined by (7):

$$x_{n}(k+1) = A_{n}\,x_{n}(k) + B_{n}\,w_{n}(k)$$
$$y_{n}(k) = C_{n}\,x_{n}(k) + D_{n}\,w_{n}(k) \qquad (7)$$

where $A_{n}$, $B_{n}$, $C_{n}$ are constant state matrices; $x_{n}(k)$ is the state vector of the measured noise model when $k \geq 0$; $y_{n}(k)$ is the output noise vector; $w_{n}(k)$ is the noise vector with zero mean value when $k \geq 1$.

Reinforcement learning
The structural principle of the reinforcement learning strategy is depicted in Figure 3. RL is the area of machine learning that investigates how an agent in a given environment should decide which actions to take to maximize a cumulative reward over the long term. RL algorithms seek a policy mapping the states of the world to the actions the agent should take in each state. The RL algorithms used in this context are closely related to dynamic programming methods, since the environment is often represented as a finite set of states. Unlike supervised learning, RL lacks labeled input/output pairs and does not explicitly assess near-optimal behaviors as true or false. Additionally, the agent must pursue a balance between exploration (trying untried states) and exploitation (using known ones). In the RL paradigm, a set of numerical rewards is used to teach the agent to execute a sequence of actions across the environmental states. The reward evaluates how well the last series of steps achieved the task goal. The agent has two parts: a training algorithm and a policy. The reward function used here is (9):

$$r_t = 2e^{-\left(0.05\,x_e^2 + 0.04\,y_e^2\right)} + 0.5\,e^{-40\,\theta_e^2} - 0.05\,\delta^2 + 100\,f_t - 50\,g_t \qquad (9)$$

where $x_e$, $y_e$, and $\theta_e$ are the errors in the position and heading angle of the car with respect to the required pose; $\delta$ is the steering angle; $f_t$ (0 or 1) indicates whether the vehicle is parked at time $t$; $g_t$ (0 or 1) indicates whether the vehicle collides with an obstacle at time $t$.
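Reward (9) transcribes directly into code; the following Python function restates (9) with variable names matching the symbols defined above.

```python
import numpy as np

def parking_reward(x_e, y_e, theta_e, delta, parked, collided):
    """Reward (9): dense shaping toward the target pose plus terminal bonuses.

    x_e, y_e, theta_e: position and heading errors from the required pose
    delta: steering angle; parked, collided: 0/1 indicator flags
    """
    return (2.0 * np.exp(-(0.05 * x_e**2 + 0.04 * y_e**2))
            + 0.5 * np.exp(-40.0 * theta_e**2)
            - 0.05 * delta**2
            + 100.0 * parked
            - 50.0 * collided)
```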

RL-PPO agent design
The article proposes to design the RL agent based on proximal policy optimization (PPO). PPO is an online, model-free, policy gradient training method. The algorithm alternates between sampling data through environment interaction and optimizing a clipped surrogate objective function using stochastic gradient descent. The PPO RL agent is created by a neural network consisting of an input layer that receives information from the observer and an output layer. This neural network is trained empirically during agent training. The number of iteration steps is set to 200, and the number of training episodes is 150. The clip factor of 0.2 improves the stability of the training, and a discount factor of 0.997 favors long-term reward. The entropy loss factor is 0.01. The output advantage is computed using the generalized advantage estimation (GAE) method with a factor of 0.95. PPO training is conducted as follows: train up to 10000 episodes, each lasting up to 200 time steps. Training stops when the average result over 80 or more episodes reaches the target. The RL results are shown in Figure 5, and the automatic parking design results are in Figure 6. Figure 5 shows the average number of steps achieved through each episode over 80 random executions; the training time is 3551.2 seconds. Observing Figure 5, in the first 200 episodes the car covered fewer than 20 steps. From the 200th to the 900th episode, the vehicle's obstacle avoidance continuously improved and the step count increased. While the maximum step count is first reached around the 900th episode, this maximum is not always attained in subsequent episodes, although it becomes more and more likely as the episodes increase. The reasons for these results are as follows: in the first episodes, the car did not know how to avoid static and dynamic obstacles, so it collided very early. The episode terminates if a crash occurs, and the vehicle speed is constant at 2 m/s; therefore, a low step count in Figure 5 indicates that the vehicle does not yet respond well to the obstacle avoidance task. With more training, that is, over more episodes, the number of steps increases. This means that even though the vehicle moves continuously in an environment with static and dynamic obstacles, it consolidates the knowledge it has learned and makes increasingly accurate decisions, avoiding the obstacles. The step value saturates at 1000 with increasing probability, showing that the vehicle can operate well in a complex environment and achieve the maximum number of steps in future training runs. This result proves that the algorithm has been implemented successfully. Through the RL training, the vehicle was able to teach itself the skill of avoiding static and moving objects.
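The paper's agent is built with MATLAB tooling; purely as an illustration, an equivalent PPO configuration with the quoted hyperparameters might look as follows in Python with Stable-Baselines3, where the Gymnasium task is a stand-in because the parking environment itself is not publicly specified.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Stand-in environment: the paper's parking environment is built in MATLAB
# and is not publicly specified, so a standard Gymnasium task is used here
# only to make the configuration runnable.
env = gym.make("Pendulum-v1")

agent = PPO(
    "MlpPolicy", env,
    n_steps=200,       # experience horizon (200 iteration steps per update)
    batch_size=50,     # chosen to divide n_steps evenly (an assumption)
    clip_range=0.2,    # PPO clip factor quoted in the text
    gamma=0.997,       # discount factor
    gae_lambda=0.95,   # GAE advantage estimation factor
    ent_coef=0.01,     # entropy loss weight
)
# Short demo budget; the paper trains up to 10000 episodes of <= 200 steps.
agent.learn(total_timesteps=20_000)
```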
Figure 6 shows the advantage of the RL controller: the vehicle has moved along the path and reached the correct parking position. The elapsed time of the maneuver is 10.8 s. However, the final pose of the parked car still shows an error on the Y axis (the parked car is slightly slanted).
The responses of the x and y positions and of the heading angle of the vehicle are shown in Figure 7. In the simulation, the required pose is $(x, y, \theta) = (50.125, 4.9, -1.5709)$. The car reaches the target pose within the allowed error of +/-0.75 m (position) and +/-10 degrees (orientation), satisfying the requirement. The parking assist feature brings the ego vehicle to a stop after 10.8 seconds. The response of the steering angle is shown in Figure 8. From this simulation result, the steering angle shows that the controller reaches a steady state after about 4.2 seconds with a vehicle speed of 2 m/s, in line with the requirements.
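The acceptance test implied by these tolerances can be written as a short helper; the function name and pose representation below are illustrative.

```python
import numpy as np

def is_parked(pose, target, pos_tol=0.75, ang_tol_deg=10.0):
    """Check a terminal pose (x, y, theta) against the stated tolerances."""
    x, y, theta = pose
    xt, yt, thetat = target
    pos_ok = abs(x - xt) <= pos_tol and abs(y - yt) <= pos_tol
    ang_ok = abs(np.degrees(theta - thetat)) <= ang_tol_deg
    return pos_ok and ang_ok

# Example with the simulated target pose from the text:
print(is_parked((50.2, 4.7, -1.58), (50.125, 4.9, -1.5709)))  # True
```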

CONCLUSION
This paper presents a successful controller design that combines MPC model predictive control with the RL-PPO reinforcement learning method. This integrated controller makes the vehicle move while avoiding obstacles and park the car as required, with a fast computation time. The success of this research contributes to intelligent traffic systems, improving driver support services and supporting traffic system management and administration agencies. However, to strengthen the convincingness and reliability of this smart control solution, the work needs to be compared with other control methods, such as deep RL (deep Q-learning) and adaptive fuzzy trees, and the research results will be tested experimentally in the future. Such a comparison will provide a comprehensive understanding of the strengths and weaknesses of the proposed integrated controller. Additionally, experimental testing in real-world scenarios will offer valuable insights into its practical applicability and performance under diverse conditions. This holistic approach will enhance the robustness and relevance of the intelligent control solution, thereby contributing to the advancement of intelligent transportation systems and autonomous vehicle technology.

Figure 1. Coordinate model of the car

Figure 2. The MPC controller architecture

Figure 3. Structural principle of the reinforcement learning strategy

Figure 4. MATLAB simulation structure of automatic parking based on the MPC controller combined with RL-PPO

Figure 5. Training process of RL-PPO

Figure 6. Automatic parking result of the vehicle with the RL controller

Figure 7. Responses of the x and y positions and the heading angle of the vehicle

Figure 8. Response of the steering angle