
Table 9 Summary of RL-based decentralized, partially decentralized, and centralized routing models

From: A comprehensive survey on machine learning for networking: evolution, applications and research opportunities

AdaR [461]
Technique (selection): Partially decentralized LSPI (ε-greedy)
Application (network): Unicast routing (WSN)
Dataset: Simulations · 400 sensors · 20 data sources · 1 sink
Features [a]: State: \(\mathcal{N}_{i}\). Reward: function of node load, residual energy, hop cost to sink, and link reliability.
Action set: Next-hop nodes to destination
Evaluation settings [a]: S = #nodes; A = #neighbors
Evaluation improvement [b]: Compared to Q-learning: faster convergence (by 40 episodes); less sensitive to initial parameters

FROMS [151]
Technique (selection): Q-learning (variant of ε-greedy)
Application (network): Multicast routing (WSN)
Dataset: OMNeT++ Mobility Framework with 50 random topologies · 50 nodes · 5 sources · 45 sinks
Features [a]: State: (\(\mathcal{N}^{k}_{i}\), \(D_{k}\)). Reward: function of hop cost.
Action set: \(\{a_{1}, \dots, a_{m}\}\), where \(a_{k} = (\mathcal{N}^{k}_{j}, D_{k})\) and \(\mathcal{N}^{k}_{j}\) is the next hop along the path to sink \(D_{k}\)
Evaluation settings [a]: S = #nodes; A = #neighbors
Evaluation improvement [b]: Compared to directed diffusion: up to 5× higher delivery rate; ≈20% lower overhead

Q-PR [24]
Technique (selection): Variant of Q-learning (ε-greedy)
Application (network): Localization-aware routing to achieve a trade-off between packet delivery rate, ETX, and network lifetime (WSN)
Dataset: Simulations · 50 different topologies · 100 nodes
Features [a]: State: \(\mathcal{N}_{i}\). Reward: function of distance(\(\mathcal{N}_{i}\), \(\mathcal{N}_{j}\)), distance(\(\mathcal{N}_{j}\), d), energy at \(\mathcal{N}_{j}\), ETX, and \(\mathcal{N}_{j}\)'s neighbors, for any neighbor \(\mathcal{N}_{j}\) and destination d.
Action set: Next-hop nodes to destination
Evaluation settings [a]: S = #nodes; A = #neighbors
Evaluation improvement [b]: Delivery rate: 25% more than GPSR. Network lifetime: 3× more than GPSR; 4× more than EFE.

Xia et al. [482]
Technique (selection): DRQ-learning (greedy)
Application (network): Spectrum-aware routing (CRN)
Dataset: OMNeT++ simulations · stationary multi-hop CRN · 10 nodes · 2 PUs
Features [a]: State: \(\mathcal{N}_{i}\). Reward: number of available channels between the current node and the next-hop node.
Action set: Next-hop nodes to destination
Evaluation settings [a]: S = #nodes; A = #neighbors
Evaluation improvement [b]: Compared to Q-routing: 50% faster at lower activity level. Compared to Q-routing and SP-routing: lower converged end-to-end delay.

QELAR [197]
Technique (selection): Model-based Q-learning (greedy)
Application (network): Distributed energy-efficient routing (underwater WSN)
Dataset: Simulations (ns-2) · 250 sensors in a \(500^{3}\,\mathrm{m}^{3}\) space · 100 m transmission range · fixed source/sink · 1 m/s maximum speed for intermediate nodes
Features [a]: State: \(\mathcal{N}_{i}\). Reward: function of the residual energy of the node receiving the packet and the energy distribution among its neighbor nodes.
Action set: Next-hop nodes to destination; packet withdrawal
Evaluation settings [a]: S = #nodes; A = 1 + #neighbors
Evaluation improvement [b]: Compared to Q-learning: faster convergence (40 episodes less); less sensitive to initial parameters

Lin et al. [277]
Technique (selection): n-step TD (greedy)
Application (network): Delay-sensitive application routing (multi-hop wireless ad hoc networks)
Dataset: Simulations · 2 users transmitting video sequences to the same destination node · 3–4-hop wireless network
Features [a]: State: current channel states and queue sizes at the nodes in each hop. Reward: goodput at the destination.
Action set: Next-hop nodes to destination
Evaluation settings [a]: \(S = n_{q}^{N} \times n_{c}^{H}\); \(A = (N_{h}^{2})^{H-1} \times N_{h}\), where N = #nodes, \(N_{h}\) = #nodes at hop h, H = #hops, \(n_{q}\) = #queue states, \(n_{c}\) = #channel states. Complexity \(\approx 2 \times 10^{8}\) for the 3-hop network.
Evaluation improvement [b]: With 95% fewer information exchanges: 10% higher PSNR; slightly slower convergence (+12 s).

d-AdaptOR [59]
Technique (selection): Q-learning with adaptive learning rate (ε-greedy)
Application (network): Opportunistic routing (multi-hop wireless ad hoc networks)
Dataset: Simulations on QualNet with 36 randomly placed wireless nodes in a 150 m × 150 m area
Features [a]: State: \(\mathcal{N}_{i}\). Reward: fixed negative transmission cost if the receiver is not the destination; fixed positive reward if the receiver is the destination; 0 if the packet is withdrawn.
Action set: Next-hop nodes to destination; packet withdrawal
Evaluation settings [a]: S = #nodes; A = 1 + #neighbors
Evaluation improvement [b]: After convergence (≈300 s): ETX comparable to a topology-aware routing algorithm; >30% improvement over greedy-SR, greedy ExOR, and SRCR with a single flow; improvement decreases with the number of flows.

QAR [276]
Technique (selection): Centralized SARSA (ε-greedy)
Application (network): QoS-aware adaptive routing (SDN)
Dataset: Sprint GIP network trace-driven simulations [418] · 25 switches, 53 links
Features [a]: State: \(\mathcal{N}_{i}\). Reward: function of delay, loss, and throughput.
Action set: Next-hop nodes to destination
Evaluation settings [a]: S = #nodes; A = #neighbors
Evaluation improvement [b]: Compared to Q-learning with QoS-awareness: faster convergence (20 episodes less)

[a] \(\mathcal{N}_{i}\): node i; \(D_{k}\): sink k; S: number of state variables; A: number of possible actions per state; #: number of
[b] Average values. Results vary according to experimental settings.
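
Most of the decentralized entries above share the same tabular RL structure: the state is the current node (paired with a destination), the action set is the node's next-hop neighbors, and Q-values are updated from per-hop feedback. The Python sketch below illustrates that common pattern with a generic ε-greedy, tabular Q-learning agent; the topology, reward values, and parameters are illustrative assumptions and are not taken from any of the surveyed papers.

```python
import random
from collections import defaultdict


class QRoutingAgent:
    """Minimal tabular Q-learning agent for next-hop selection at one node.

    State  : (current node, destination), as in most entries of Table 9.
    Action : forwarding the packet to one of the node's neighbors.
    Reward : supplied by the caller (e.g. hop cost, residual energy, delay).
    """

    def __init__(self, neighbors, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.neighbors = list(neighbors)  # candidate next hops (action set)
        self.alpha = alpha                # learning rate
        self.gamma = gamma                # discount factor
        self.epsilon = epsilon            # exploration probability
        # q[destination][next_hop] -> estimated value of forwarding that way
        self.q = defaultdict(lambda: {n: 0.0 for n in self.neighbors})

    def choose_next_hop(self, destination):
        """ε-greedy selection over the neighbor set."""
        if random.random() < self.epsilon:
            return random.choice(self.neighbors)
        q_dest = self.q[destination]
        return max(q_dest, key=q_dest.get)

    def update(self, destination, next_hop, reward, next_hop_best_q):
        """One-step Q-learning update using feedback from the chosen neighbor.

        next_hop_best_q is the neighbor's own best Q-value toward the
        destination (0 if the neighbor is the destination itself).
        """
        old = self.q[destination][next_hop]
        target = reward + self.gamma * next_hop_best_q
        self.q[destination][next_hop] = old + self.alpha * (target - old)


# Toy usage on a hypothetical 4-node line topology A - B - C - D.
if __name__ == "__main__":
    topology = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
    agents = {node: QRoutingAgent(nbrs) for node, nbrs in topology.items()}

    for _ in range(1000):
        node, dest = "A", "D"
        while node != dest:
            agent = agents[node]
            nxt = agent.choose_next_hop(dest)
            reward = 10.0 if nxt == dest else -1.0   # assumed per-hop cost
            downstream = agents[nxt].q[dest]
            best_q = 0.0 if nxt == dest else max(downstream.values())
            agent.update(dest, nxt, reward, best_q)
            node = nxt

    # Q-values learned at node B for destination D; C should dominate A.
    print({n: round(agents["B"].q["D"][n], 2) for n in topology["B"]})
```

The per-row differences in the table map onto this skeleton: FROMS and Q-PR change the reward function, QELAR and d-AdaptOR extend the action set with packet withdrawal (hence A = 1 + #neighbors), and QAR replaces the distributed agents with a single centralized SARSA learner.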