Scientific Papers

Research on task offloading optimization strategies for vehicular networks based on game theory and deep reinforcement learning

1 Introduction

With the advent of the Internet of Things (IoT), many sensing devices have been deployed in networks, and the data generated by these devices and by related large-scale mobile applications are growing explosively [1]. Within the IoT, the Internet of Vehicles (IoV) is a research hotspot; it provides services to vehicles through onboard processors [2,3]. However, the volume of task data grows with the number of vehicles, and the emergence of various computing-intensive tasks poses a significant challenge to the onboard computing capability of the vehicles themselves [4]. Multi-access edge computing (MEC) is considered a feasible way to tackle this issue. By moving computation and data processing to the network edge, MEC reduces task processing latency and enables faster real-time decision-making, which is crucial for areas such as autonomous driving, traffic optimization, and intelligent traffic management. Moreover, MEC alleviates the burden on task vehicles, optimizes network load, and reduces energy consumption. Because it performs computation closer to the data sources, MEC has been widely studied as a paradigm for improving data processing efficiency [5].

Specifically, in the IoV system, tasks are offloaded to service nodes (SNs) with computing power and processed cooperatively to improve efficiency. The premise of task offloading is that a job can be split into multiple subtasks that are offloaded to SNs. Parked or moving vehicles, as idle resources, can provide computing and storage resources for task processing [6]. In addition to offloading tasks to service vehicles (SeVs), task vehicles (TaVs) can offload tasks to MEC servers, which are co-located with the base station (BS) and connected to the roadside units (RSUs) to provide services [7]. Task offloading in the IoV has been studied extensively in recent years [8], with task processing delay and energy consumption as the essential indicators. Minimizing overall delay and energy consumption while still completing the task is challenging [9]. When the amount of task data is large, the transmission delay is high, which increases the total task delay and energy consumption. Integrated radar sensing and communication is a feasible solution to this problem. By collecting data through radar sensing and sharing rather than through traditional data transmission, it reduces the energy that nodes consume while waiting and improves the response speed of the IoV system. Its advantage lies in optimizing overall system performance, including the perception of traffic data, communication quality, and the accuracy of vehicle positioning, thereby enhancing the efficiency and safety of the entire IoV system. Game theory and optimization techniques provide the technical support for it.

In this study, a game-theoretic approach is used to construct a game model, analyze the cooperative relationship between TaVs and SNs, and define the utility function for task offloading. This supports the development of an optimal task offloading decision strategy that covers task allocation and resource coordination. Optimization techniques are employed to allocate computing, storage, and communication resources so as to maximize system utility while minimizing task processing delay and energy consumption.

Several studies have addressed this issue. For example, in [10], a relatively practical IoV scenario was considered, and a matching game was used to model task allocation; the simulation results show that the input data transmission delay accounts for 73% of the total task processing time. In [11], the task is assigned to the MEC and the SeVs for processing, and the results show that when the task size is 80 Mb, the input data transfer delay accounts for 50%. The delay in uploading data can therefore significantly affect delay-sensitive applications. For this reason, some vehicles carry an integrated radar system that senses the surrounding environment, either for local processing or to help connected vehicles process task data and ensure safe driving [12]. Furthermore, RSUs use radar to sense environmental data and use these data as task input to reduce the transmission delay [13]. In summary, instruction transfer and environmental data sensing open new possibilities for task offloading: the transmission delay can be reduced by sending calculation instructions that operate on already-sensed environmental data [14].

Energy is a major concern worldwide, and the growing number of IoV devices will increase energy demand and energy costs. Reducing energy consumption has therefore become one of the issues that the IoV system needs to resolve [15,16]. To tackle this issue, Cesarano et al. designed a greedy heuristic algorithm to reduce the energy consumption of tasks [17]. Other scholars took the minimization of energy consumption and execution delay as objective functions and selected task offloading strategies accordingly [18]. In [19], the data transmission scheme of the IoV system adopts the deep Q-network (DQN) method to reduce transmission costs. Altogether, energy consumption is a key factor influencing the task offloading strategy, and future studies will focus on selecting task offloading strategies that keep both delay and energy consumption low.

Many studies have adopted heuristic algorithms to address the aforementioned issues. For example, the reliability of task offloading in IoV scenarios was considered in [20], where heuristic algorithms were used to optimize it. To address the poor robustness of traditional heuristic algorithms in the continuous state and action spaces of IoV scenarios [21], an offloading strategy-based method was studied that learns the optimal mapping from a continuous input state to a discrete output and handles continuous state and action spaces. Although these algorithms can solve the task offloading strategy issue in the IoV, they show poor robustness in ensuring the reliability of data transmission [22].

Deep reinforcement learning (DRL) algorithms have significant advantages over heuristic algorithms. First, they learn and optimize decision strategies automatically from large-scale data, eliminating the need to design complex rules by hand. Second, they can handle high-dimensional and complex state and action spaces, which makes them suitable for complex real-world problems. Third, they can generalize learned knowledge to unseen environments, enabling more intelligent and flexible decision-making. Given the potential and application value of policy-based DRL [23], this paper studies task offloading based on DRL. For delay-sensitive applications that take environmental data as input, we propose a task offloading decision mechanism (TODM) based on a cooperative game and DRL. Building on cooperative game theory, this paper considers the overall task processing delay (OTPD) and overall task energy consumption (OTEC) and constructs a joint optimization issue. We transform the joint optimization issue into a DRL issue and solve it with the proximal policy optimization (PPO) algorithm. The main contributions are summarized as follows:

1. Considering dynamic wireless edge computing networks, a framework for joint task offloading is designed. On this basis, according to the wireless transmission requirements of SNs, combined with the game theory and communication function, a cooperative game and DRL-based TODM is proposed. The joint optimization issue is derived to minimize the delay and energy consumption.

2. DRL is more robust than heuristic algorithms because it can make real-time online decisions. Therefore, the designed joint optimization issue is transformed into a reinforcement learning (RL) issue. This paper develops a PPO-based algorithm to solve it and theoretically analyzes the algorithm's complexity.

3. Finally, we design simulation experiments to evaluate the algorithm's performance. The results show that the algorithm converges better than the soft actor-critic (SAC) algorithm and selects reasonable task offloading strategies. The proposed algorithm reduces the task delay and energy consumption cost while improving the performance of the IoV system.

The remainder of this paper is arranged as follows: Section 2 presents related work. Section 3 presents the system model in detail, describes the task offloading mechanism TODM, and gives the problem formulation. Section 4 proposes a DRL-based task offloading algorithm to solve the formulated problem. Section 5 presents simulations that evaluate the solution. Finally, Section 6 summarizes this paper.

2 Related works

This section summarizes current research on the IoV, covering task offloading, radar sensing and communication, game theory, and DRL for IoV edge intelligent systems.

2.1 Task offloading of IoV

With the advent of the 6G era, the volume of task data has grown explosively. As IoV intelligent system applications develop, the requirements for task data computation have also risen. Because the cloud is relatively far from users, traditional cloud computing has relatively high latency, which has made latency a focus of task offloading research [24]. Researchers consider MEC an effective technique for addressing the delay issue: because MEC servers are closer to users than the cloud, they reduce the task processing delay and enhance the user experience [25].

Given its characteristics, MEC is expected to be widely used in future IoV systems. In [26], the architecture of the vehicle network was defined according to the properties of the MEC, which can enhance the scalability of the network. In [27], an SDN-enabled network architecture assisted by the MEC was proposed to provide low-latency and high-reliability communication. In [28], the optimal task offloading issue in MEC was studied and decomposed into two subproblems, task offloading and resource allocation, to minimize the delay. In [29], an edge server was considered, and the computing and physical resource problems were formulated as optimization issues. In [30], a new offloading method was proposed to minimize transmission delay while improving resource utilization. In [31], a task offloading scheme called fuzzy-task-offloading-and-resource-allocation (F-TORA), based on a Takagi–Sugeno fuzzy neural network (T-S FNN) and game theory, was designed. In [32], a UAV-assisted offloading strategy was proposed and experimentally verified to reduce the delay by 30%.

2.2 Radar sensing and communication in the IoV

The integrated design of radar and communication has great potential in cost-constrained scenarios. For example, by combining radar and communication functions, an IoV system can be designed to address high latency and energy consumption. A path estimation method has been proposed that realizes longitudinal and lateral vehicle following using only radar and vehicle-to-vehicle (V2V) communication [33]. An intelligent real-time dual-functional radar–communication (iRDRC) system for autonomous vehicles (AVs) was introduced in [34]. Obstacle detection is an important part of realizing intelligent vehicles; to avoid the problem of metal objects severely blocking millimeter waves, an active obstacle detection method based on a millimeter-wave radar base station was proposed [35]. The radar and communication integrated system (RCIS) can overcome the time-consuming data format conversion and complex data fusion across multiple sensors in autonomous driving vehicles (ADVs) [36]. In summary, the integrated radar and communication design is a promising direction for future autonomous driving technology development.

2.3 Game theory in IoV

Game theory provides a framework for analyzing strategic interactions among rational decision-makers, while optimization techniques seek the most favorable outcomes. A dependable content distribution framework has been proposed that combines big data-based vehicle trajectory prediction with coalition game-based resource allocation in cooperative vehicular networks [37]. In [38], an energy-efficient matching mechanism for resource allocation in device-to-device (D2D)-enabled cellular networks was proposed; it uses a game-theoretic approach to formulate the interaction among end users and adopts the Gale–Shapley algorithm to achieve stable D2D matching. A novel game-theoretic approach has also been proposed to encourage edge nodes to provide caching services cooperatively and reduce energy consumption [39]. In [40], a two-player Stackelberg game-based opportunistic computation offloading scheme was developed, which can significantly shorten the task completion delay. In conclusion, game theory has significant and extensive application prospects in the IoV.

2.4 DRL methods for IoV

Regarding resource optimization for the IoV, DRL has strong sensing and decision-making capabilities compared to traditional heuristic algorithms and can analyze the long-term impact of current resource allocation on the system. Many scholars have applied DRL techniques to the study of the IoV. For example, in [41], DRL technology was used to transfer vehicle tasks to the edge server when facing the challenge of task delay. In [42], a UAV was placed in the vehicle network to assist resource allocation, and the deep deterministic policy gradient (DDPG) method was used to reduce the task delay. In [43], an online computation offloading strategy based on DQN was proposed, which takes the discrete channel gain as input to minimize energy consumption and delay and realize computation offloading and resource allocation. In [44], a hybrid scheduling mechanism to reduce computation was proposed for vehicle-to-vehicle communication in a specific area. In [45], the author proposed a priority-sensitive task offloading and resource allocation scheme in an IoV network to validate the feasibility of distributed reinforcement learning for task offloading in future IoV networks. In [46], the author proposed a multi-agent deep reinforcement learning (MA-DRL) algorithm for optimizing the task offloading decision strategy, while improving the offloading rate of the tasks and ensuring that a higher number of offloaded tasks are completed.

Given the prevalence of DRL techniques in IoV systems, this paper considers two metrics, delay and energy consumption, and aims to select the optimal task offloading strategy to reduce delay and resource costs. Consequently, we propose a task offloading framework that uses a DRL-based algorithm to achieve optimal solutions in the network.

3 System description and problem formulation

In this section, Section 3.1 presents an edge computing network of the IoV. Section 3.2 presents an optimization issue.

3.1 System model

3.1.1 TODM mechanism based on the cooperative game

Because the large amount of task data in the IoV leads to significant overall task delay and energy consumption, we build an intelligent IoV system that exploits the sensing capabilities of SNs. To achieve a practical and distributed solution, we observe that the task assignment problem in MEC architectures can be formulated as a cooperative game. A cooperative game suits multi-node cooperation, in which multiple agents jointly formulate resource allocation strategies to minimize overall delay and energy consumption. First, this paper defines the participants of the game, i.e., the TaV, the SeVs, and the MEC. Second, it defines the task offloading strategies, which decompose a task into multiple subtasks assigned to different SNs, and takes the delay and energy consumption that the nodes incur in completing the task as the criteria for cooperative cost allocation. Finally, cooperative constraints are introduced to construct the cooperative game model.

The intelligent architecture network of the IoV is shown in Figure 1. A BS and the MEC servers are deployed at the same location to improve MEC computing power and save costs. RSUs are deployed along the road, and each RSU is equipped with storage resources and a radar for real-time sensing of ambient data. The storage of each RSU holds all sensed task data and is cleared periodically to remain usable. The RSUs are linked to the MEC via wired links. Each vehicle is equipped with computing and storage resources and a radar. The TaV is linked to the SeVs and RSUs via wireless transmission. Communication between nodes uses frequency division multiplexing (FDM) access, and the upload and feedback processes use time division duplexing (TDD). This paper assumes that the BS covers the entire IoV system, including all RSUs and vehicles. The coverage areas of adjacent RSUs are tangent to each other, and the TaV is always within range of its nearest RSU while processing the task. The task can be divided into several subtasks, each of which is independent and can be processed in parallel [47]. Considering the impact of delay and energy consumption on the offloading strategy, the MEC sends the offloading decision to the TaV, which then offloads the task. Two-way roads are more realistic, but because this line of study is still in its infancy, this paper considers only one-way lanes and ignores service from vehicles traveling in the opposite direction to the TaV.

FIGURE 1. Intelligent architecture network of IoV.

The delay and energy consumption are critical technical indicators in the TODM design, and this paper aims to minimize the OTPD and OTEC. The OTPD includes the task description delay, offloading decision delay, offloading decision transmission delay, task upload delay, task processing delay, and task feedback delay. In this paper, each variable is represented by 64 bits, i.e., a double-precision float. The task description is only a few kilobits in size, so its delay can be disregarded. The offloading decision comprises the task allocation, transmission bandwidth, transmission power, and transmission policy. Because the offloading decision is small compared with the task input data, the offloading decision delay and its transmission delay are also disregarded. The output of a completed task is smaller than its input, so the task feedback delay can be disregarded as well. Thus, the OTPD consists of the task upload and computation delays. Similarly, the OTEC consists of the task upload energy consumption and the task computation energy consumption. Note that if SNs are performing other tasks, waiting delays and the associated energy consumption may arise. This paper assumes that only one task needs to be processed and neglects the waiting delay and energy consumption; multi-tasking will be considered in future work.

Due to the different perspectives of sensing environmental data, the TaV coordinate in the calculation instruction is used for coordinate transformation (CdT) preprocessing to eliminate differences [48]. The TaV has two ways to transmit the task: conventional data transmission (DaT) and instruction transmission (InT) with cooperative environment awareness. Different transfer methods offer new options for task offloading. The delay and energy consumption constraints affect the task upload mode, further affecting the task offloading strategy. Therefore, the transmission strategy can be chosen adaptively based on the objective function, traffic size, propagation capability, transmission delay, energy consumption, etc. Compared to the traditional offloading mechanism, TODM can potentially reduce energy consumption and transmission delay caused by the inputs. However, this mechanism incurs an additional cost to the overall IoV system, which is ignored in this paper.

3.1.2 Task model

The task of the TaV is computationally intensive and delay-sensitive. The total task data are denoted by $S_{DaT}$ and can be divided into subtasks of arbitrary size. The task ratio is denoted as $x_n$, with $x_n \in [0,1]$ and $n \in \mathcal{N} = \{0, 1, 2, \ldots, n, \ldots, N+1\}$. $H$ denotes the task, and the $n$-th subtask is denoted as $h_n$, $n \in \mathcal{N}$, where $h_n \in H$. Some subtasks are computed locally, while others are sent to SNs by DaT or InT according to the task ratio. Task offloading to SNs can satisfy the delay and energy consumption constraints. $V_n$, $n \in \mathcal{N}$, denotes the SNs; the wireless bandwidth ratio is denoted as $b_n$, with $b_n \in [0,1]$; and the transmission power is denoted as $P_n$, with $P_n \in [0.5\,\mathrm{W}, 1.5\,\mathrm{W}]$. The TaV is denoted as $V_0$, and the SeVs are denoted as $V_n$. The computational resources of both the TaV and the SeVs are sufficient for all computing tasks. The RSUs and the MEC, which are connected by wire, are denoted by $V_{N+1}$. The three variables $x_n$, $b_n$, and $P_n$ together determine the task offloading strategy.
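As a minimal sketch of how the three decision variables might be kept inside their stated ranges, the snippet below maps unconstrained scores (for example, raw policy outputs) to task ratios that sum to 1, bandwidth ratios in [0, 1], and powers in [0.5 W, 1.5 W]. The function names and the softmax and logistic mappings are assumptions made for illustration; they are not taken from the paper. The bandwidth ratios here use the whole budget, so their sum equals 1, which assumes a non-strict bound.

```python
import math

def softmax(scores):
    """Numerically stable softmax: outputs lie in (0, 1) and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def squash(u):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def to_action(raw_x, raw_b, raw_p, p_min=0.5, p_max=1.5):
    """Map unconstrained scores to (x, b, P) for nodes 0..N+1.

    x: task ratios, each in [0, 1], summing to 1
    b: bandwidth ratios, each in [0, 1], summing to 1 (assumed non-strict bound)
    P: transmit powers in watts, each in [p_min, p_max]
    """
    x = softmax(raw_x)
    b = softmax(raw_b)
    p = [p_min + (p_max - p_min) * squash(u) for u in raw_p]
    return x, b, p
```

For example, `to_action([0.0, 1.0, 2.0], [0.0, 0.0, 0.0], [-10.0, 0.0, 10.0])` gives three task ratios that favor the last node, equal bandwidth shares, and powers spread across the allowed range.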

3.1.3 OTPD and OTEC of the TaV

When a subtask is computed locally on $V_0$, its OTPD is the local computation delay, and its OTEC is the local computation energy consumption, since no upload is required. The delay is denoted as $T_0^{comput}$ and the energy consumption as $E_0^{comput}$; they are given as follows [49]:


Here, $M$ (in cycles/bit) is the task computation intensity, i.e., the computing resources required to process 1 bit of input data. $F_0$ represents the CPU cycle frequency of $V_0$. $K_0$ is the effective switching capacitance, which depends on the chip structure of the vehicle. $f_0$ is the computing capacity of the vehicle itself. $C_0$ represents the number of CPU cycles required to process the subtask $h_0$.
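The quantities defined above can be combined in a small numerical sketch. It assumes the widely used CMOS model in which the delay is the required cycles divided by the CPU frequency and the energy is the capacitance times the frequency squared times the required cycles; the paper's exact expressions may differ. The sketch also uses a single frequency for both $F_0$ and $f_0$, which is an assumption, and the function and parameter names are hypothetical.

```python
def local_cost(s_dat_bits, x0, m_cycles_per_bit, f0_hz, k0):
    """Delay (s) and energy (J) of computing the local share x0 of the task.

    Assumed model (not confirmed by the source):
      cycles = M * S_DaT * x0
      delay  = cycles / f0
      energy = K0 * f0^2 * cycles
    """
    cycles = m_cycles_per_bit * s_dat_bits * x0   # C0, CPU cycles required
    delay = cycles / f0_hz
    energy = k0 * f0_hz ** 2 * cycles
    return delay, energy
```

With an 8 Mb task, a local share of 0.25, 100 cycles/bit, a 1 GHz CPU, and a capacitance of 1e-27, the call `local_cost(8e6, 0.25, 100, 1e9, 1e-27)` needs 2e8 cycles, so both the delay in seconds and the energy in joules come out near 0.2.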

3.1.4 OTPD and OTEC of the SeVs

When a subtask is offloaded to the SeVs for processing, data or calculation instructions are transmitted wirelessly to the SeVs. For DaT, it is essential to consider the upload delay. For InT, the transmission delay is not considered, but it is essential to consider the CdT delay.

Uploading delay model: The delays of the two upload modes, DaT and InT, are compared, and the mode with the smaller delay is selected. An energy consumption model is then built, and the energy consumption of each upload mode is calculated. $T_1^{upload}$ denotes the delay of uploading the task from $V_0$ to $V_n$. The uploading rate from $V_0$ to $V_n$ is given by


Here, $R_{V_0 V_n}^{DaT}(t)$ denotes the upload rate at time $t$. $B_{ToT}$ denotes the total bandwidth of the wireless transmission. $b_n$ denotes the transmission bandwidth ratio. $P_n$ denotes the transmission power. $\sigma_n^2$ denotes the noise power. $\tilde{h}_n$ denotes the channel fading coefficient from $V_0$ to $V_n$. $d_n(t)$ denotes the distance from $V_0$ to $V_n$ at time $t$. $d_n^{\alpha}(t)$ denotes the path loss from $V_0$ to $V_n$. $\alpha$ denotes the path loss index.
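The symbols listed above match the ingredients of a Shannon-capacity rate with distance-dependent path loss. The sketch below assumes that form, with the fading coefficient treated as a power gain and the path loss entering as the distance raised to the power of minus $\alpha$; both are assumptions, and the function name is hypothetical.

```python
import math

def upload_rate(b_tot_hz, b_n, p_n_w, h_n, d_n_m, alpha, noise_w):
    """Achievable upload rate in bit/s under an assumed Shannon model.

    rate = B_ToT * b_n * log2(1 + P_n * h_n * d_n^(-alpha) / sigma_n^2)
    """
    snr = p_n_w * h_n * d_n_m ** (-alpha) / noise_w
    return b_tot_hz * b_n * math.log2(1.0 + snr)
```

Under this model the rate grows linearly with the bandwidth share $b_n$ but only logarithmically with the power $P_n$, which is one reason bandwidth and power are optimized jointly rather than separately.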

During the task data upload, the vehicles' motion changes $d_n(t)$. We assume that the coordinate of $V_0$ is 0 and that of $V_n$ is $G_n$, where $G_n \neq 0$. The moving speeds of the vehicles are $v_0$ and $v_n$. The formula for calculating $d_n(t)$ is given by


The vehicles travel on an expressway, and the difference between their speeds does not exceed 30 km/h [50]. Taking $t = 10$ ms as an example, $|v_n - v_0|\,t \approx 0.083$ m. Such a change in relative position is small and does not affect the optimization results, so this paper ignores it. The calculation formula of $d_n(t)$ is then given by

For DaT, $S_{DaT} x_n$ represents the amount of task data allocated to $V_n$, and $B_{ToT} b_n$ represents the transmission bandwidth from $V_0$ to $V_n$. The upload delay $T_{1a}^{upload}$ is given by


For InT, the delay of CdT must be considered. Assume that $V_n$ stores the environmental data sensed by its radar and performs CdT immediately after receiving the calculation instruction. The delay of CdT depends on the amount of sensed data and on the computation intensity of CdT. $T_{1b}^{upload}$ denotes the resulting upload delay and is given by


where $M_{tra}$ represents the computation intensity of CdT and $F_n$ represents the CPU cycle frequency of $V_n$.
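The two upload delays for an SeV, together with the rule of keeping the faster mode, can be sketched as follows. The DaT delay is taken as the data volume divided by the upload rate, and the InT delay as the CdT cycles divided by the CPU frequency of the SeV. Treating the sensed data volume as equal to the subtask volume is an assumption, as are the function and parameter names.

```python
def sev_upload_delay(s_dat_bits, x_n, rate_bps, m_tra, f_n_hz):
    """Return (mode, delay in seconds) for uploading a subtask to an SeV.

    Assumed model (not confirmed by the source):
      DaT delay = S_DaT * x_n / rate
      InT delay = M_tra * S_DaT * x_n / F_n   (CdT only, no data transfer)
    """
    data_bits = s_dat_bits * x_n
    t_dat = data_bits / rate_bps
    t_int = m_tra * data_bits / f_n_hz
    if t_dat <= t_int:
        return "DaT", t_dat
    return "InT", t_int
```

With a 4 Mb subtask, a 4 Mbit/s link, 10 cycles/bit for CdT, and a 1 GHz CPU, DaT would take 1 s while InT takes about 0.04 s, so InT is selected. On a much faster link the comparison flips in favor of DaT.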

Under the TODM, whichever of DaT and InT has the lower delay is chosen as the upload method, so as to minimize the task upload delay. The upload delay from $V_0$ to $V_n$ is therefore $T_1^{upload} = \min\{T_{1a}^{upload}, T_{1b}^{upload}\}$.


Uploading energy consumption model: Given the selected upload mode, the upload energy consumption model is built, and the upload energy consumptions $E_{1a}^{upload}$ and $E_{1b}^{upload}$ are calculated. $E_1^{upload}$ denotes the upload energy consumption and is given by [51]


where $K_n$ is the effective switching capacitance, which depends on the chip structure of the vehicle; $f_n$ is the computing capacity of the vehicle itself; and $C_n^1$ represents the number of CPU cycles required to process the subtask $h_n$.

Computing delay model: After the task is uploaded, the SeVs $V_n$ start computing the subtasks in parallel, which yields the computation delay. $T_1^{comput}$ denotes the computing delay and is given by


Computing energy consumption model: The computing energy consumption model is designed according to the assigned task. $C_n^2$ denotes the number of CPU cycles required to process the subtask $h_n$. $E_1^{comput}$ denotes the computing energy consumption and is given by


3.1.5 OTPD and OTEC of MEC

When a subtask is offloaded to the MEC servers for processing, the upload delay includes both wireless and wired transmission delays. The upload energy consumption includes both wireless transmission energy consumption and wired transmission energy consumption.
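A rough sketch of how the wireless and wired parts combine for the MEC path is given below. It reuses the two upload modes from the SeV case (DaT and InT with CdT) and adds a wired RSU-to-MEC hop to both. The per-hop formulas, the use of the full subtask volume on the wire, and all names (`wireless_bps`, `wired_bps`, `f_mec_hz`) are assumptions for illustration.

```python
def mec_upload_delay(s_dat_bits, x_mec, wireless_bps, wired_bps, m_tra, f_mec_hz):
    """Return (mode, total upload delay in seconds) for the MEC path.

    Assumed model (not confirmed by the source):
      wired hop = S_DaT * x / R_wired          (common to both modes)
      DaT total = S_DaT * x / R_wireless + wired hop
      InT total = M_tra * S_DaT * x / F_mec + wired hop
    """
    data_bits = s_dat_bits * x_mec
    t_wired = data_bits / wired_bps
    t_dat = data_bits / wireless_bps + t_wired
    t_int = m_tra * data_bits / f_mec_hz + t_wired
    if t_dat <= t_int:
        return "DaT", t_dat
    return "InT", t_int
```

Because the wired term is added to both totals, it shifts the absolute delay but not the choice of mode: the selection still depends only on whether the wireless transfer or the CdT step is faster.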

Uploading delay model: $V_0$ transmits the subtask data to the RSUs wirelessly, and the RSUs forward the data to the MEC servers over wired links. The uploading rate from $V_0$ to $V_{N+1}$ is given by


where $R_{V_0 V_{N+1}}^{DaT}(t)$ denotes the upload rate at time $t$, $b_{N+1}$ the transmission bandwidth ratio, $P_{N+1}$ the transmission power, $\sigma_{N+1}^2$ the noise power, $\tilde{h}_{N+1}$ the channel fading coefficient from $V_0$ to $V_{N+1}$, $d_{N+1}(t)$ the distance from $V_0$ to $V_{N+1}$ at time $t$, and $d_{N+1}^{\alpha}(t)$ the path loss from $V_0$ to $V_{N+1}$. During the upload of the task data, the movement of the vehicle changes $d_{N+1}(t)$. We assume that the coordinate of $V_0$ is 0 and that of $V_{N+1}$ is $G_{N+1}$, where $G_{N+1} \neq 0$. The moving speed of the vehicle is $v_0$, and $d_{N+1}(t)$ is given by


where $D_{N+1}$ is the distance from the RSUs to the centerline and $H_{N+1}$ represents the height of the RSUs. Taking $t = 20$ ms as an example, when the speed of the vehicle is 120 km/h, $v_0 t \approx 0.67$ m. This change in position is negligible compared with distances of tens of meters and is ignored. The calculation formula of $d_{N+1}(t)$ is then given by


For DaT, $S_{DaT} x_{N+1}$ represents the amount of task data allocated to $V_{N+1}$, and $B_{ToT} b_{N+1}$ represents the transmission bandwidth from $V_0$ to $V_{N+1}$. The upload delay $T_{2a}^{upload}$ is given by


After the task data are uploaded to the RSUs, the RSUs transmit the data to the MEC over the wired link. $T_{2R}^{upload}$ denotes the wired upload delay, and $R_{wired}$ denotes the wired transmission speed. The wired upload delay $T_{2R}^{upload}$ is given by


Thus, let $T_{2aR}^{upload}$ be the total upload delay, which is equal to the sum of $T_{2a}^{upload}$ and $T_{2R}^{upload}$, i.e., $T_{2aR}^{upload} = T_{2a}^{upload} + T_{2R}^{upload}$.


For InT, this paper assumes that CdT is carried out immediately after the MEC servers receive the calculation instruction. Let $T_{2b}^{upload}$ be the CdT delay. The calculation formula of $T_{2b}^{upload}$ is given by


where $F_{N+1}$ denotes the CPU cycle frequency of the MEC. $T_{2bR}^{upload}$ denotes the total upload delay, which is equal to the sum of $T_{2b}^{upload}$ and $T_{2R}^{upload}$, i.e., $T_{2bR}^{upload} = T_{2b}^{upload} + T_{2R}^{upload}$.


Similarly, whichever of DaT and InT has the lower delay is chosen as the upload method, so the upload delay from $V_0$ to $V_{N+1}$ is $T_2^{upload} = \min\{T_{2aR}^{upload}, T_{2bR}^{upload}\}$.


Uploading energy consumption model: Given the selected upload mode, the upload energy consumption model is built, and the upload energy consumptions $E_{2a}^{upload}$ and $E_{2b}^{upload}$ are calculated. In either mode, data are transmitted from the RSUs to the MEC over the wired link; $E_{2R}^{upload}$ denotes the energy consumption of this wired transmission and is given by


where $P_{N+1}$ denotes the wired transmission power. $E_2^{upload}$ denotes the total upload energy consumption and is given by


where $K_{N+1}$ is the effective switching capacitance, which depends on the chip structure of the MEC; $f_{N+1}$ is the computing capacity of the server itself; and $C_{N+1}^1$ represents the number of CPU cycles required to process the subtask $h_{N+1}$.

Computing delay model: After the task is uploaded, the MEC servers start computing the subtasks in parallel, which yields the computation delay. $T_2^{comput}$ denotes the computing delay and is given by


Computing energy consumption model: The computing energy consumption model is created according to the assigned task. Let $C_{N+1}^2$ be the number of CPU cycles required to process the subtask $h_{N+1}$, and let $E_2^{comput}$ be the computing energy consumption; $E_2^{comput}$ is given by


3.2 Problem formulation

This paper addresses joint task offloading in the edge computing network of the IoV, that is, minimizing the task delay and energy consumption under limited system resources. The payoff function is the weighted sum of the task processing delay and energy consumption, where the weight balances their respective effects on the payoff. $T_{total}$ denotes the total delay, which is given by


$E_{total}$ denotes the total energy consumption, which is given by


The payoff function $SR_{total}$ is expressed as


The payoff function is transformed into the total objective function of the joint optimization issue. The optimization problem can be described as minimizing the delay and energy consumption under task allocation, transmission bandwidth allocation, and transmit power control constraints. Thus, the optimization issue can be formulated as


In problem P1, constraint C1 requires the task allocation ratios to sum to 1. C2 requires the sum of the wireless bandwidth allocation ratios to be less than 1. C3 and C4 give the value ranges of the task allocation ratio and the wireless bandwidth ratio, respectively. C5 bounds the uplink transmit power. C6 constrains the weight value.
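The constraints and the weighted payoff can be illustrated with a short sketch. The constraint checks follow the ranges stated in the text. The payoff form is an assumption: a single weight `omega` in [0, 1] applied to the delay and its complement applied to the energy, with the total delay taken as the maximum over nodes (subtasks run in parallel) and the total energy as the sum over nodes. The bandwidth check uses a non-strict bound, which is also an assumption. All names are hypothetical.

```python
def is_feasible(x, b, p, omega, p_min=0.5, p_max=1.5, tol=1e-9):
    """Check the constraints C1 to C6 for one candidate offloading decision."""
    return (
        abs(sum(x) - 1.0) <= tol                    # C1: task ratios sum to 1
        and sum(b) <= 1.0 + tol                     # C2: bandwidth budget (assumed non-strict)
        and all(0.0 <= v <= 1.0 for v in x)         # C3: task ratio range
        and all(0.0 <= v <= 1.0 for v in b)         # C4: bandwidth ratio range
        and all(p_min <= v <= p_max for v in p)     # C5: transmit power range in watts
        and 0.0 <= omega <= 1.0                     # C6: weight range (assumed)
    )

def payoff(delays, energies, omega):
    """Weighted cost of one decision under the assumed aggregation rules."""
    t_total = max(delays)      # assumption: parallel subtasks, slowest node dominates
    e_total = sum(energies)    # assumption: energy adds up across nodes
    return omega * t_total + (1.0 - omega) * e_total
```

For instance, per-node delays of 0.2, 0.5, and 0.3 s with energies of 1, 2, and 3 J and `omega = 0.5` give a cost of 0.5 × 0.5 + 0.5 × 6 = 3.25 under these assumptions.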

4 DRL-based algorithm for task offloading

Section 4.1 presents DRL techniques and the Markov decision process (MDP). Section 4.2 proposes the conversion of the optimization issues in the model into DRL issues. Section 4.3 proposes a PPO-based approach to address the task offloading issue.

4.1 DRL-based framework

4.1.1 DRL techniques

Deep learning (DL) has strong perception ability but lacks decision-making ability, whereas RL has decision-making ability but does not address perception. DRL integrates the perception ability of DL with the decision-making ability of RL and thus handles the perception and decision problems of complex systems. DRL is an end-to-end sensing and control method with strong generality. Its learning process can be described as follows: (i) at each moment, the agent interacts with the environment to obtain a high-dimensional observation and specific state features; (ii) the current state is mapped to an action through the policy, and the value function of each action is evaluated; (iii) the environment responds to the action and yields the next observation. The optimal policy is obtained by repeating this procedure.
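The three-step cycle described above can be sketched as a plain interaction loop. The environment and policy below are hypothetical toys chosen only to make the loop runnable: the state is a position on a line, the reward is the negative distance to the origin, and a fixed rule stands in for the learned policy. No learning update is shown.

```python
class ToyEnv:
    """Hypothetical 1-D environment: the state is a position, the goal is 0."""

    def __init__(self, start=5.0):
        self.state = start

    def observe(self):
        return self.state

    def step(self, action):
        self.state += action
        reward = -abs(self.state)   # closer to the origin is better
        return self.state, reward

def policy(observation):
    """Hypothetical fixed policy: move one unit toward the origin."""
    if observation > 0:
        return -1.0
    if observation < 0:
        return 1.0
    return 0.0

def run_episode(env, policy_fn, steps):
    """Collect (observation, action, reward) tuples for a fixed number of steps."""
    trajectory = []
    obs = env.observe()                       # (i) observe the environment
    for _ in range(steps):
        action = policy_fn(obs)               # (ii) map the state to an action
        next_obs, reward = env.step(action)   # (iii) environment feedback
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory
```

In a DRL agent, `policy` would be a neural network and the collected trajectory would feed a parameter update after each batch of interactions.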

4.1.2 Markov decision process

In the formal description of RL environments, almost all issues can be formulated as an MDP. In an MDP, a decision-maker periodically or continuously observes a stochastic dynamic system with the Markov property and makes decisions. An MDP comprises the environment state, action, reward, state transition probability matrix, and discount factor. Starting from a given state, the agent performs an action and reaches a new state according to the state transition probability matrix, and it receives a reward for the action it takes.
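As an illustration of these components, the sketch below samples a short trajectory from a toy two-state, two-action MDP and computes its discounted return. The transition probabilities, rewards, policy, and discount factor are made-up values chosen for the example and are not taken from the IoV model.

```python
import random

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def rollout(P, R, policy, s0, steps, rng):
    """Sample rewards from a finite MDP.

    P[a][s] is the next-state distribution for action a in state s,
    and R[s][a] is the reward for taking action a in state s.
    """
    s, rewards = s0, []
    for _ in range(steps):
        a = policy(s)
        rewards.append(R[s][a])
        s = rng.choices(range(len(P[a][s])), weights=P[a][s])[0]
    return rewards

# Toy MDP (illustrative numbers only).
P = [[[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
     [[0.5, 0.5], [0.0, 1.0]]]   # transitions under action 1
R = [[1.0, 0.0], [0.0, 2.0]]
rewards = rollout(P, R, policy=lambda s: s, s0=0, steps=5, rng=random.Random(0))
total = discounted_return(rewards, gamma=0.9)
```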

4.2 Problem transformation

The IoV scenario has continuous state and action spaces, which increases the complexity of the issue and makes it challenging to find the best task offloading strategy. Traditional optimization algorithms require many iterations to reach an approximate solution when solving such issues, which does not meet the requirements of time-varying systems. DRL algorithms, however, can meet real-time decision-making requirements. Therefore, this paper adopts a DRL algorithm to solve the aforementioned issues and obtains the optimal task offloading strategy through continuous interaction with the IoV environment.

Problem P1 is a complex issue whose variables, namely task allocation, transmission power, and transmission bandwidth, are all continuous real variables with strong coupling. Therefore, P1 is a non-convex joint optimization issue that cannot be solved directly through mathematical calculation. In light of this, this paper turns the optimization issue into a DRL issue and proposes adopting the DRL algorithm to solve the global optimization issue. Thus, the optimization issue (34) is established as follows:


where E() represents the mathematical expectation.

In the IoV system, the cars are moving, and the vehicle status, edge server status, wireless transmission channel status, and RSU status are changing. The system needs to make different decisions to minimize delay and energy consumption and to allocate resources reasonably. The transmission bandwidth and computing resources allocated by the IoV system to cars and RSUs are continuous values. Traditional DQN is mainly designed for discrete action spaces, and DDPG is mainly designed for continuous action spaces, whereas SAC and PPO can be applied to both discrete and continuous spaces. Therefore, this paper designs a PPO-based method to find the optimal task offloading strategy. Next, the paper describes the environment state, action space, and reward function of the Markov game.

4.2.1 Environment state

The environment state S(t) reflects the impact of the channel condition information and agent behavior on the environment [52]. The state information includes the state of the cars, BS, and RSUs. The state of the car consists of the vehicle coordinates, transmission bandwidth, transmission power, and task allocation. The state of BS and RSUs includes the task size, transmission power, and transmission bandwidth. Each agent observes that the environment state is

where Un(t), Nn(t), and Rn(t) denote the status of vehicles, BS, and RSUs, respectively.

4.2.2 Action space

Although the computational complexity of DRL is relatively low in large-scale network scenarios, the dimension of the state and action spaces grows as the number of agents increases. A high-dimensional space makes the system computation difficult and degrades the decision; the algorithm’s performance suffers from the curse of dimensionality due to the high-dimensional action and state spaces [53]. To avoid this high computational complexity, the agent takes actions according to the currently observed state, that is, it jointly optimizes the task allocation, transmission bandwidth allocation, uplink power control, and offloading decision. Hence, the action is

xn represents the task allocation policy.

bn represents the uplink transmission bandwidth.

Pn(t) represents the uplink transmission power.

The agent selects the offloading decision based on the present state. If the agent sets local computing for the task, the computing resources must meet the requirements. However, if the agent selects to calculate the task on the SeVs or MEC, the transmission bandwidth and computing resources must meet the needs.
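One common way to keep a continuous action within such requirements is to squash the raw policy outputs before applying them. The sketch below is an assumption for illustration, not the mapping used in this paper: it uses a softmax so that the task allocation ratios sum to 1, a softmax for the bandwidth shares (which uses the full bandwidth), and a scaled sigmoid so that each uplink power stays within a hypothetical limit `p_max`.

```python
import math

def softmax(z):
    """Map real-valued scores to non-negative ratios that sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sigmoid(v):
    """Logistic function with the input clamped to avoid overflow."""
    v = max(-60.0, min(60.0, v))
    return 1.0 / (1.0 + math.exp(-v))

def to_action(raw_x, raw_b, raw_p, p_max):
    """Convert raw policy outputs into (x, b, p) within their value ranges."""
    x = softmax(raw_x)                        # task allocation ratios
    b = softmax(raw_b)                        # bandwidth allocation ratios
    p = [p_max * sigmoid(v) for v in raw_p]   # uplink powers in (0, p_max)
    return x, b, p

x, b, p = to_action([0.0, 1.0, 2.0], [0.5, 0.5], [0.0], p_max=2.0)
```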

4.2.3 Rewards

The design of the state space, action space, and reward function strictly follows constraints C1–C6 in order to optimize the task offloading strategies. The sum of the reward functions of nodes in all states is constant, and there is a competitive and cooperative relationship between nodes. Therefore, the paper sets the reward value as the opposite of the objective function. In optimization issue P2, the agent maximizes its payoff through action selection, which in turn affects the system’s state. The reward function is

4.3 PPO-based algorithm framework

This paper proposes a PPO-based task offloading and resource allocation (PPOTR) algorithm to obtain stable performance in the actual changing network. The agent chooses the action to interact with the environment according to the policy, thus affecting the environment state and updating the environment parameters. Next, according to the new policy, the agent chooses actions to interact with the environment. Let rt(θ) denote the action probability ratio of new and old strategies.


When rt(θ) > 1, it indicates that the current strategy is more inclined to select the sampling action. Otherwise, it is not. The PPO algorithm improves the original policy gradient (PG) algorithm, and the formula of the new objective function is given by


4.3.1 Training algorithm

PPO uses a new objective function to control the change in the strategy in each iteration, which is uncommon in other algorithms. The objective function is


where θ denotes the policy parameters, Et denotes the empirical expectation over time steps, rt denotes the probability ratio under the new and old strategies, and At represents the estimated advantage. ɛ is the clipping hyperparameter, usually set to 0.1 or 0.2. clip (rt(θ), 1 − ɛ, 1 + ɛ) is given by


4.3.2 Replay buffer

Unlike the static data used in DL, the data in DRL are generated during learning through interaction with the environment. At each time step, the agent observes the current environment state and saves the transition data_t = (s_t, a_t, r_t, s_{t+1}), that is, the state, action, reward, and next environment state, to the replay buffer [54]. In particular, in our model, the data of the training network are aggregated after 1,000 time steps. This makes it possible to follow how the data change during training and to avoid correlation in the sequence of observed states, which reduces the update variance. Moreover, the data of each experiment can be reused in other weight updates to improve the efficiency of data use.
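A buffer of this kind can be sketched in a few lines. The class below is a hypothetical illustration rather than the authors' implementation: it stores transitions and hands them over as one batch once a capacity is reached, which is one possible reading of aggregating the data after 1,000 time steps.

```python
from collections import namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class RolloutBuffer:
    """Store transitions and release them as one batch when full."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.data = []

    def add(self, state, action, reward, next_state):
        self.data.append(Transition(state, action, reward, next_state))

    def is_full(self):
        return len(self.data) >= self.capacity

    def drain(self):
        """Return all stored transitions and empty the buffer."""
        batch, self.data = self.data, []
        return batch

# Small capacity so that the example is easy to follow.
buf = RolloutBuffer(capacity=3)
for t in range(3):
    buf.add(state=t, action=0, reward=-1.0, next_state=t + 1)
full_before = buf.is_full()
batch = buf.drain()
```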

4.3.3 Algorithm steps

In the aforementioned architecture, the agent is the car, MEC is the policy decision center, and the SeVs and RSUs are the intermediaries of perception information. The algorithm’s input is the environment state information, and the output is the optimal offloading policy and target value. Algorithm 1 presents the pseudo-code.

Algorithm 1. PPO-based algorithm for task offloading and resource allocation.

Input: initial policy parameters θ0 and initial value function parameters ϕ0

1: for k = 0, 1, 2, … do

2:  Collect a set of trajectories Dk = {τi} by running policy πk = π(θk) in the environment.

3:  Compute rewards-to-go Rt.

4:  Compute advantage estimates At (using any method of advantage estimation) based on the current value function Vϕk.

5:  Update the policy by maximizing the PPO-clip objective, normally via stochastic gradient ascent with Adam:

$$\theta_{k+1} = \arg\max_{\theta} \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \min\left( \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} A^{\pi_{\theta_k}}(s_t, a_t),\; g\left(\epsilon, A^{\pi_{\theta_k}}(s_t, a_t)\right) \right)$$

6:  Fit the value function by regression on the mean-squared error, normally via some gradient descent algorithm:

$$\phi_{k+1} = \arg\min_{\phi} \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left( V_{\phi}(s_t) - R_t \right)^2$$

7: end for

The steps of the proposed PPOTR algorithm are as follows. First, the initial policy parameters θ0 and the initial value function parameters ϕ0 are provided as input. Second, the iteration starts, and a set of trajectories Dk = {τi} is collected by running policy πk = π(θk) in the environment (line 2 of Algorithm 1). Then, in lines 3 and 4, the rewards-to-go Rt are computed, and the advantage estimates At are calculated with an advantage estimation method based on the current value function Vϕk. Next, in lines 5 and 6, the policy parameters θk+1 are updated through the PPO-clip objective function Lclip(θ), and the value function parameters ϕk+1 are fitted through mean-squared-error regression. Finally, the iteration ends.
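The quantities used in one iteration can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PPOTR implementation: the probability ratio is computed from log-probabilities as exp(log π_new − log π_old), the advantage is taken as the reward-to-go minus the value estimate (only one of many possible estimators), and the function names and the default discount factor are hypothetical.

```python
import math

def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards from each time step onward."""
    out, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

def simple_advantages(returns, values):
    """Advantage estimate as return minus value baseline (an assumed choice)."""
    return [r - v for r, v in zip(returns, values)]

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the samples."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def value_loss(values, returns):
    """Mean-squared error between value estimates and rewards-to-go."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)

rtg = rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
adv = simple_advantages(rtg, [1.0, 1.0, 1.0])
```

With a ratio of 1.5, ɛ = 0.2, and a positive advantage, the clipped term caps the contribution at 1.2 times the advantage; with a negative advantage, the unclipped term is the smaller one and is kept.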

Complexity analysis: The algorithm’s main computational costs include the interaction with the environment and the evaluation of actions under the old and new strategies. In the process of interacting with the environment, the agent determines the action for the input state according to the policy. Furthermore, the agent calculates the probability ratio of the action under the new and old policies through propagation in the actor network and the critic network. The time complexity of the training process interacting with the environment is given by [55]


The time complexity of policy updates is given by


where x and y are the numbers of fully connected layers of the actor network and the critic network, respectively. ΩxA represents the number of units in the x-th layer of the actor network, and ΩyC represents the number of neurons in the y-th layer of the critic network. Then, the total time complexity of Algorithm 1 is given by


The space complexity of the algorithm [56] is given by


where N is the space complexity of the experience replay buffer in the algorithm.
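For fully connected networks, the cost of one forward pass is commonly estimated from the products of adjacent layer widths. The sketch below illustrates this estimate; the layer sizes are hypothetical and do not correspond to the networks used in this paper.

```python
def dense_costs(layer_sizes):
    """Return (multiply-accumulates per forward pass, parameters incl. biases)."""
    macs = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    params = macs + sum(layer_sizes[1:])
    return macs, params

# Hypothetical actor and critic shapes: input, two hidden layers, output.
actor_macs, actor_params = dense_costs([8, 64, 64, 3])
critic_macs, critic_params = dense_costs([8, 64, 64, 1])
```

Under this estimate, the time cost of a step scales with the sum of the two multiply-accumulate counts, and the memory cost scales with the parameter counts plus the size of the stored experience.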

5 Simulation results

In this section, a series of simulation experiments is conducted to verify the performance of the proposed algorithm. The simulation results of different algorithms under the same network settings are given to compare their characteristics. This paper analyzes the convergence curves of delay, energy consumption, and objective function under the offloading strategy and shows that the algorithm is reasonable. Four benchmark schemes are adopted for comparison in terms of convergence, and the effectiveness and efficiency of the algorithm are verified through the analysis of energy consumption and delay. The four schemes are as follows:

• Actor-critic (AC) algorithm based on SAC [57]: Under the same environmental settings, this paper uses the AC algorithm with a soft update mechanism to solve the issue.

• Local computing policies for all the tasks (AllLocal): All the tasks are performed locally, and the local computing resources meet the requirements of task calculation. Calculate the corresponding delay, energy consumption, and objective function value.

• All-edge server-only execution policy (AllEdge): Offload all the tasks to the edge server, and the edge computing resources and transmission bandwidth meet the task’s requirements.

• Random offloading policy (Random): Offload the task randomly and allocate resources randomly.

5.1 Simulation setup

This paper evaluates the performance of the proposed algorithm through multiple simulation experiments. We assume that four RSUs are set at the roadside, the coverage diameter of each RSU is 160 m, and the coverage areas are tangent to one another. For the sake of driving safety, we assume that there is one TaV and ten SeVs within the coverage of the RSUs, and that the TaV and SeVs always remain within this coverage. The input data size SDaT is assumed to be 25 Mb (one frame with a resolution of 1920 × 1080 and 12 bits per pixel), and the total transmission bandwidth BToT is 100 MHz [58].
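The 25 Mb figure can be checked with a short calculation, assuming that 1 Mb denotes 10^6 bits (the function name is illustrative):

```python
def frame_bits(width, height, bits_per_pixel):
    """Size of one uncompressed frame in bits."""
    return width * height * bits_per_pixel

bits = frame_bits(1920, 1080, 12)
megabits = bits / 1e6
print(bits, megabits)  # 24883200 bits, roughly 24.9 Mb
```

The result is slightly below 25 Mb, so the stated input size is a rounded value.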

The calculation strength is 2,640 cycles/bit [59]. The CPU frequency of each car is randomly selected within the range of 0.3 × 10¹² to 0.6 × 10¹² cycles/s, and the CPU frequency of the MEC servers is randomly selected within the range of 1 × 10¹² to 2 × 10¹² cycles/s [60]. The effective switching capacitance of the vehicle and MEC is 10⁻²⁷. The transmission rate Rwired for RSU wired transmission to the MEC is 100 Gb/s [50]. The calculation capacity of cars and MEC is set to 1.4 Gr/s and 2.8 Gr/s, respectively Song et al. [51].
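To give a sense of scale for these parameters, the sketch below estimates the pure computation delay of a whole task as data size times calculation strength divided by CPU frequency. This is a simplifying assumption for illustration: it ignores transmission, queuing, and task splitting, and it takes 25 Mb as 25 × 10^6 bits.

```python
def compute_delay(data_bits, cycles_per_bit, cpu_hz):
    """Computation delay in seconds for processing data_bits on one CPU."""
    return data_bits * cycles_per_bit / cpu_hz

task_bits = 25e6          # 25 Mb task
intensity = 2640          # cycles per bit

slow_car = compute_delay(task_bits, intensity, 0.3e12)   # about 0.22 s
fast_car = compute_delay(task_bits, intensity, 0.6e12)   # about 0.11 s
slow_mec = compute_delay(task_bits, intensity, 1e12)     # about 0.066 s
```

Under these assumptions, the whole task needs 6.6 × 10^10 CPU cycles, so the delay halves whenever the CPU frequency doubles.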

This paper considers the effect of small-scale fading on transmission performance; the channel fading coefficients |h̃n|² and |h̃N+1|² are set to 1 [61]. The simulation experiments are run with PyTorch 1.11.0 and Python 3.10 on Windows 10. Other system parameters used in the simulations are shown in Table 1.

TABLE 1. Parameter setting.

5.2 Results

The learning curve of the PPOTR algorithm is shown in Figure 2, including the training losses of the actor and critic networks. For the actor network, if the training loss does not tend to 0, many regions of the action space have not been explored, and differences remain between the new and old action spaces. For the critic network, if the training loss does not tend to 0, the critic network cannot yet accurately predict the value of the state space. The simulations show that, starting from the 50th training set, the loss fluctuates within a small range, and the gradient of the loss value decreases gradually. This indicates that the algorithm begins to converge and can quickly learn the optimal strategy.

FIGURE 2. Training curve of PPOTR.

Figures 3A, B show the convergence curves of the delay, energy consumption, and objective function of the PPOTR algorithm when solving the aforementioned issues. Figure 3A shows the unsmoothed curve of the training process, illustrating the total delay sum_OTPD in seconds, the total energy consumption sum_OTEC in joules, and the reward value. Figure 3B is the smoothed version of the curve in Figure 3A; the smooth curve is obtained by averaging the value at each training step with the previous 999 values. The simulations show that, although the curve fluctuates, it flattens over the course of training and the algorithm converges.
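The smoothing described here amounts to a trailing moving average with a window of 1,000 values. A minimal sketch follows; how the first 999 points are handled is not stated, so averaging over the values available so far is an assumption.

```python
def trailing_mean(values, window=1000):
    """Average each value with up to window - 1 preceding values."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out

# Small window so that the result is easy to verify by hand.
smoothed = trailing_mean([1, 2, 3, 4], window=2)
```

For the input above, `smoothed` is `[1.0, 1.5, 2.5, 3.5]`.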

FIGURE 3. Convergence curve of PPOTR. (A) Unsmoothed convergence curve of PPOTR. (B) Smoothed convergence curve of PPOTR.

The learning curve of the SAC algorithm is shown in Figure 4, including the training losses of the actor and critic networks. The simulation results show that the loss function of the SAC algorithm converges quickly. Therefore, the algorithm can quickly learn the optimal strategy.

FIGURE 4. Training curve of SAC.

Figures 5A, B show the convergence curves of the delay, energy consumption, and objective function of the SAC algorithm when solving the aforementioned issues. Figure 5A shows the unsmoothed curve of the training process, illustrating the total delay sum_SAC_OTPD in seconds, the total energy consumption sum_SAC_OTEC in joules, and the reward value. Figure 5B is the smoothed version of the curve in Figure 5A. The simulations show that the SAC algorithm converges quickly and can solve the aforementioned issues.

FIGURE 5. Convergence curve of SAC. (A) Unsmoothed convergence curve of SAC. (B) Smoothed convergence curve of SAC.

The aforementioned two algorithms can solve the issue in this paper. The SAC algorithm has a fast convergence rate because it scales the state characteristics before inputting data parameters into the model. There is no difference in orders of magnitude between variables, which is conducive to optimizing the initial model. However, the convergence effect of the PPOTR algorithm is better because the algorithm is trained based on dynamic fitting data parameters. In this paper, the simulation is set to train once every 2,048 steps, so the convergence speed of the PPOTR algorithm is slow, but the convergence effect is good.

5.3 Performance comparison

In this part, this paper compares the PPOTR algorithm with the four benchmark algorithms in terms of delay, energy consumption, and reward value to verify the proposed algorithm’s performance.

Figure 6 shows the total delay of the different policies under the same task data. Each scheme converges to the optimal value as the number of training iterations increases. Under the same computing task, the SAC algorithm converges faster, but the PPOTR algorithm converges to a better value. When the task volume increases to 25 Mb, the proposed algorithm reduces the time cost by approximately 17.33%, 32.74%, 56.63%, and 32.63% compared with the SAC algorithm, local computing, edge execution, and random offloading, respectively. This shows that the proposed algorithm achieves better performance in terms of task processing delay.

FIGURE 6. Total delay under different policies.

Figure 7 illustrates the total energy consumption of the different policies under the same task data. When the task data volume is 25 Mb, the energy consumption cost of edge execution is about 2.3 times that of the PPOTR algorithm. The simulations show that the PPOTR algorithm reduces energy consumption by approximately 25.79%, 77.53%, and 63.31% compared with SAC, AllLocal, and Random, respectively. The proposed algorithm achieves the lowest energy consumption cost.
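The ratio and percentage figures can be related through a simple conversion. Assuming that a saving is measured as the relative reduction with respect to the baseline (an interpretation, since the formula is not stated), a baseline that costs 2.3 times as much as PPOTR corresponds to a saving of roughly 56.5%:

```python
def saving_percent(baseline, proposed):
    """Relative reduction of the proposed cost with respect to the baseline."""
    return 100.0 * (baseline - proposed) / baseline

# Costs expressed in units of the PPOTR cost (illustrative normalization).
edge_saving = round(saving_percent(2.3, 1.0), 2)
print(edge_saving)  # 56.52
```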

FIGURE 7. Total energy consumption under different policies.

Finally, this paper normalizes the delay and energy consumption and converts the objective function value into the reward value in DRL, as shown in Figure 8. The reward value is composed of delay and energy consumption, which are highly coupled and interact with each other. Under the constraint conditions, the reward value is minimized to obtain the best task offloading strategy. The reward values of the proposed algorithm improved by 15%, 28%, 30%, and 44% compared with the four baseline algorithms. This numerical comparison provides evidence for the reliability of the proposed algorithm and approach. Compared with the four benchmark algorithms, the proposed algorithm is superior in terms of delay, energy consumption, and reward value. Therefore, this scheme can minimize the energy consumption cost under a tolerable delay.

FIGURE 8. Reward under different policies.

6 Conclusion

This paper investigates a joint optimization strategy for task offloading in the IoV edge computing network. In the IoV scenario, while considering the timeliness of task data and resource constraints, we constructed a model based on cooperative games and transformed it into a joint optimization issue. This paper models the optimization issue as a Markov game based on intelligent edge, game theory, communication, and DRL. The reward function is devised as the sum of delay and energy consumption. We adopted the PPO-based algorithm to solve the previously mentioned issue. Finally, the performance of the algorithm is verified using the simulation experiments. The numerical results show that, compared with SAC and other baseline schemes, this scheme can achieve stable convergence in the system environment and obtain the optimal reward value. This scheme minimizes the system cost and meets the development needs of the future IoV.

Data availability statement

The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

Author contributions

LW: investigation, methodology, validation, writing–original draft, and writing–review and editing. WZ: writing–review and editing. HX: methodology, writing–original draft, and writing–review and editing. LL: writing–original draft. LC: writing–review and editing. XZ: writing–review and editing.


Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. Support was provided by the project "Network data security monitoring technology and cross-border control for smart cars" (2022YFB3104900) and by the Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB (BK22BF002).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

