Table 9 Summary of RL-based decentralized, partially decentralized, and centralized routing models

From: A comprehensive survey on machine learning for networking: evolution, applications and research opportunities

Columns: Ref. · Technique (selection) · Application (network) · Dataset · Features^a · Action set · Evaluation (Settings^a / Improvement^b)

AdaR [461]
Technique: Partially decentralized LSPI (ε-greedy)
Application: Unicast routing (WSN)
Dataset: Simulations · 400 sensors · 20 data sources · 1 sink
Features: State: \(\mathcal{N}_{i}\); Reward: function of node load, residual energy, hop cost to sink, and link reliability
Action set: Next-hop nodes to destination
Settings: S = #nodes · A = #neighbors
Improvement: Compared to Q-learning: faster convergence (by 40 episodes) · less sensitive to initial parameters

FROMS [151]
Technique: Q-learning (variant of ε-greedy)
Application: Multicast routing (WSN)
Dataset: OMNeT++ Mobility Framework with 50 random topologies · 50 nodes · 5 sources · 45 sinks
Features: State: (\(\mathcal{N}^{k}_{i}\), \(D_{k}\)); Reward: function of hop cost
Action set: \(\{a_{1}, \ldots, a_{m}\}\), where \(a_{k} = (\mathcal{N}^{k}_{j}, D_{k})\) and \(\mathcal{N}^{k}_{j}\) is the next hop along the path to sink \(D_{k}\)
Settings: S = #nodes · A = #neighbors
Improvement: Compared to directed diffusion: up to 5× higher delivery rate · ≈20% lower overhead

Q-PR [24]
Technique: Variant of Q-learning (ε-greedy)
Application: Localization-aware routing to achieve a trade-off between packet delivery rate, ETX, and network lifetime (WSN)
Dataset: Simulations · 50 different topologies · 100 nodes
Features: State: \(\mathcal{N}_{i}\); Reward: for any neighbor \(\mathcal{N}_{j}\) and destination d, a function of · distance(\(\mathcal{N}_{i}\), \(\mathcal{N}_{j}\)) · distance(\(\mathcal{N}_{j}\), d) · energy at \(\mathcal{N}_{j}\) · ETX · \(\mathcal{N}_{j}\)'s neighbors
Action set: Next-hop nodes to destination
Settings: S = #nodes · A = #neighbors
Improvement: Delivery rate: 25% more than GPSR · Network lifetime: 3× more than GPSR, 4× more than EFE

Xia et al. [482]
Technique: DRQ-learning (greedy)
Application: Spectrum-aware routing (CRN)
Dataset: OMNeT++ simulations · stationary multi-hop CRN · 10 nodes · 2 PUs
Features: State: \(\mathcal{N}_{i}\); Reward: number of available channels between the current node and the next-hop node
Action set: Next-hop nodes to destination
Settings: S = #nodes · A = #neighbors
Improvement: Compared to Q-routing: 50% faster at lower activity level · Compared to Q-routing and SP-routing: lower converged end-to-end delay

QELAR [197]
Technique: Model-based Q-learning (greedy)
Application: Distributed energy-efficient routing (underwater WSN)
Dataset: Simulations (ns-2) · 250 sensors in a 500³ m³ space · 100 m transmission range · fixed source/sink · 1 m/s maximum speed for intermediate nodes
Features: State: \(\mathcal{N}_{i}\); Reward: function of the residual energy of the node receiving the packet and the energy distribution among its neighbor nodes
Action set: Next-hop nodes to destination ∪ packet withdrawal
Settings: S = #nodes · A = 1 + #neighbors
Improvement: Compared to Q-learning: faster convergence (40 episodes less) · less sensitive to initial parameters

Lin et al. [277]
Technique: n-step TD (greedy)
Application: Delay-sensitive application routing (multi-hop wireless ad hoc networks)
Dataset: Simulations · 2 users transmitting video sequences to the same destination node · 3–4-hop wireless network
Features: State: current channel states and queue sizes at the nodes in each hop; Reward: goodput at destination
Action set: Next-hop nodes to destination
Settings: \(S = n_{q}^{N} \times n_{c}^{H}\) · \(A = (N_{h}^{2})^{H-1} \times N_{h}\), where N = #nodes, \(N_{h}\) = #nodes at hop h, H = #hops, \(n_{q}\) = #queue states, \(n_{c}\) = #channel states
Improvement: Complexity ≈2×10^8 for the 3-hop network · With 95% less information exchange: ∼10% higher PSNR · slightly slower convergence (+1–2 s)

d-AdaptOR [59]
Technique: Q-learning with adaptive learning rate (ε-greedy)
Application: Opportunistic routing (multi-hop wireless ad hoc networks)
Dataset: Simulations on QualNet with 36 randomly placed wireless nodes in a 150 m × 150 m area
Features: State: \(\mathcal{N}_{i}\); Reward: · fixed negative transmission cost if the receiver is not the destination · fixed positive reward if the receiver is the destination · 0 if the packet is withdrawn
Action set: Next-hop nodes to destination ∪ packet withdrawal
Settings: S = #nodes · A = 1 + #neighbors
Improvement: After convergence (≈300 s): · ETX comparable to a topology-aware routing algorithm · >30% improvement over greedy-SR, greedy ExOR, and SRCR with a single flow · improvement decreases with the number of flows

QAR [276]
Technique: Centralized SARSA (ε-greedy)
Application: QoS-aware adaptive routing (SDN)
Dataset: Sprint GIP network trace-driven simulations [418] · 25 switches, 53 links
Features: State: \(\mathcal{N}_{i}\); Reward: function of delay, loss, and throughput
Action set: Next-hop nodes to destination
Settings: S = #nodes · A = #neighbors
Improvement: Compared to Q-learning with QoS-awareness: faster convergence (20 episodes less)

^a \(\mathcal{N}_{i}\): node i; \(D_{k}\): sink k; S: number of state variables; A: number of possible actions per state; #: number of
^b Average values. Results vary according to experimental settings.
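
Most of the models summarized above share the same tabular formulation: the state is the current node, the action is the choice of next-hop neighbor, and the reward combines delivery with per-hop costs. The sketch below illustrates that pattern with ε-greedy Q-learning; the topology, reward values, and hyperparameters are illustrative assumptions and are not taken from any of the referenced papers.

```python
import random

# Illustrative sketch of tabular epsilon-greedy Q-learning for next-hop
# selection (state = current node, action = next-hop neighbor), in the
# spirit of the models in Table 9. All values below are assumptions.

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration

# Hypothetical topology: node -> list of neighbors.
NEIGHBORS = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],        # destination (sink)
}
DESTINATION = "D"

# Q[(node, next_hop)]: estimated value of forwarding via next_hop.
Q = {(n, a): 0.0 for n, acts in NEIGHBORS.items() for a in acts}

def choose_next_hop(node):
    """Epsilon-greedy selection over the node's neighbors."""
    if random.random() < EPSILON:
        return random.choice(NEIGHBORS[node])
    return max(NEIGHBORS[node], key=lambda a: Q[(node, a)])

def reward(next_hop):
    """Illustrative reward: bonus on delivery, small cost per hop."""
    return 10.0 if next_hop == DESTINATION else -1.0

def route_episode(source, max_hops=20):
    """Forward one packet from source, updating Q along the path."""
    node, hops = source, 0
    while node != DESTINATION and hops < max_hops:
        nxt = choose_next_hop(node)
        best_next = max(Q[(nxt, a)] for a in NEIGHBORS[nxt])
        # Standard Q-learning update toward reward + discounted best value.
        Q[(node, nxt)] += ALPHA * (reward(nxt) + GAMMA * best_next - Q[(node, nxt)])
        node, hops = nxt, hops + 1

for _ in range(500):
    route_episode("A")

print(max(NEIGHBORS["A"], key=lambda a: Q[("A", a)]))  # learned next hop from A
```

In a decentralized deployment each node would hold only its own rows of the Q-table and learn from feedback carried by forwarded packets, whereas a centralized variant such as the SARSA-based QAR controller would maintain the full table; the sketch keeps a single global table purely for brevity.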