AdaR [461]

Partially decentralized LSPI (ε-greedy)

Unicast routing (WSN)

Simulations · 400 sensors · 20 data sources · 1 sink

State: \(\mathcal {N}_{i}\) Reward: function of · node load · residual energy · hop cost to sink · link reliability

Next-hop nodes to destination

· S = #nodes · A = #neighbors

Compared to Q-learning: · faster convergence (by 40 episodes) · less sensitive to initial parameters

FROMS [151]

Q-learning (variant of ε-greedy)

Multicast routing (WSN)

OMNeT++ Mobility Framework with 50 random topologies · 50 nodes · 5 sources · 45 sinks

State: \((\mathcal {N}^{k}_{i}, D_{k})\) Reward: function of hop cost

\(\{a_{1} \cdots a_{m}\}\), where \(a_{k} = (\mathcal {N}^{k}_{j}, D_{k})\) and \(\mathcal {N}^{k}_{j}\) = next hop along the path to sink \(D_{k}\)

· S = #nodes · A = #neighbors

Compared to directed diffusion: · up to 5× higher delivery rate · ≈20% lower overhead

QPR [24]

Variant of Q-learning (ε-greedy)

Localization-aware routing to achieve a trade-off between packet delivery rate, ETX, and network lifetime (WSN)

Simulations · 50 different topologies · 100 nodes

State: \(\mathcal {N}_{i}\) Reward: function of · distance(\(\mathcal {N}_{i}\), \(\mathcal {N}_{j}\)) · distance(\(\mathcal {N}_{j}\), d) · energy at \(\mathcal {N}_{j}\) · ETX · \(\mathcal {N}_{j}\)’s neighbors, for any neighbor \(\mathcal {N}_{j}\) and destination d

Next-hop nodes to destination

· S = #nodes · A = #neighbors

Delivery rate: · 25% higher than GPSR Network lifetime: · 3× longer than GPSR · 4× longer than EFE
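
The three WSN schemes above share one tabular pattern: the state is the current node, the action is the choice of next-hop neighbor, and values are learned under an ε-greedy policy over S = #nodes states and A = #neighbors actions. A minimal sketch of that pattern, using a hypothetical three-node line topology and a simple hop-cost reward (the topology, rewards, and parameters are illustrative assumptions, not taken from AdaR, FROMS, or QPR):

```python
import random
from collections import defaultdict

# Toy line topology (illustrative): packets originate at "A", sink is "C".
NEIGHBORS = {"A": ["B"], "B": ["A", "C"], "C": []}
SINK = "C"

Q = defaultdict(float)  # Q[(node, next_hop)]; S = #nodes, A = #neighbors

def epsilon_greedy(node, eps, rng):
    """Explore a random neighbor with probability eps, else exploit."""
    nbs = NEIGHBORS[node]
    if rng.random() < eps:
        return rng.choice(nbs)
    return max(nbs, key=lambda nb: Q[(node, nb)])

def q_update(node, nxt, reward, alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update for the (node, next-hop) pair."""
    best_next = max((Q[(nxt, nb)] for nb in NEIGHBORS[nxt]), default=0.0)
    Q[(node, nxt)] += alpha * (reward + gamma * best_next - Q[(node, nxt)])

def route_one_packet(rng, eps=0.2):
    """Forward one packet hop by hop, learning as it travels."""
    node = "A"
    while node != SINK:
        nxt = epsilon_greedy(node, eps, rng)
        reward = 0.0 if nxt == SINK else -1.0  # hop cost until delivery
        q_update(node, nxt, reward)
        node = nxt

rng = random.Random(0)
for _ in range(200):
    route_one_packet(rng)
```

After a couple of hundred packets, node B's value for forwarding toward the sink dominates its value for backtracking to A, so the greedy policy settles on the shortest path.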

Ref.

Technique (selection)

Application (network)

Dataset

Features^{a}

Action set

Evaluation: Settings^{a} · Improvement^{b}

Xia et al. [482]

DRQ-learning (greedy)

Spectrum-aware routing (CRN)

OMNeT++ simulations · stationary multi-hop CRN · 10 nodes · 2 PUs

State: \(\mathcal {N}_{i}\) Reward: #available channels between current node and next-hop node

Next-hop nodes to destination

· S = #nodes · A = #neighbors

Compared to Q-routing: · 50% faster at lower activity level. Compared to Q-routing and SP-routing: · lower converged end-to-end delay

QELAR [197]

Model-based Q-learning (greedy)

Distributed energy-efficient routing (underwater WSN)

Simulations (ns-2) · 250 sensors in a 500^{3} m^{3} space · 100 m transmission range · fixed source/sink · 1 m/s maximum speed for intermediate nodes

State: \(\mathcal {N}_{i}\) Reward: function of the residual energy of the node receiving the packet and the energy distribution among its neighbor nodes

Next-hop nodes to destination ∪ packet withdrawal

· S = #nodes · A = 1 + #neighbors

Compared to Q-learning: · faster convergence (40 fewer episodes) · less sensitive to initial parameters
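
QELAR's action set is the node's neighbors plus an explicit packet-withdrawal action (hence A = 1 + #neighbors), and its reward weighs the receiving node's residual energy against the energy distribution in its neighborhood. A hedged sketch of these two ingredients; the weights and the exact form of the balance term are illustrative assumptions, not QELAR's precise formula:

```python
WITHDRAW = "withdraw"

def action_set(neighbors):
    """Action set: forward to any neighbor, or withdraw the packet
    (A = 1 + #neighbors)."""
    return list(neighbors) + [WITHDRAW]

def energy_reward(residual, neighbor_residuals, w_res=0.7, w_bal=0.3):
    """Illustrative energy-aware reward: favor receivers with high
    residual energy that also sit in an energy-balanced neighborhood."""
    mean_nb = sum(neighbor_residuals) / len(neighbor_residuals)
    balance_penalty = abs(residual - mean_nb)  # unbalanced regions score worse
    return w_res * residual - w_bal * balance_penalty
```

Coupling the reward to the neighborhood's energy distribution, rather than to the receiver alone, is what spreads forwarding load and extends network lifetime.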

Lin et al. [277]

n-step TD (greedy)

Delay-sensitive application routing (multi-hop wireless ad hoc networks)

Simulations · 2 users transmitting video sequences to the same destination node · 3∼4-hop wireless network

State: current channel states and queue sizes at the nodes in each hop Reward: goodput at destination

Next-hop nodes to destination

· \(S=n_{q}^{N}\times n_{c}^{H}\) · \(A=(N_{h}^{2})^{H-1}\times N_{h}\), where \(N\) = #nodes, \(N_{h}\) = #nodes at hop \(h\), \(H\) = #hops, \(n_{q}\) = #queue states, \(n_{c}\) = #channel states

Complexity ≈ 2×10^{8} for the 3-hop network. With 95% less information exchange: · ∼10% higher PSNR · slightly slower convergence (+1∼2 s)

dAdaptOR [59]

Q-learning with adaptive learning rate (ε-greedy)

Opportunistic routing (multi-hop wireless ad hoc networks)

Simulations on QualNet with 36 randomly placed wireless nodes in a 150 m × 150 m area

State: \(\mathcal {N}_{i}\) Reward: · fixed negative transmission cost if the receiver is not the destination · fixed positive reward if the receiver is the destination · 0 if the packet is withdrawn

Next-hop nodes to destination ∪ packet withdrawal

· S = #nodes · A = 1 + #neighbors

After convergence (≈300 s): · ETX comparable to a topology-aware routing algorithm · >30% improvement over greedy SR, greedy ExOR, and SRCR with a single flow · improvement decreases with the number of flows
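
dAdaptOR's distinguishing element is a learning rate that adapts per state–action pair rather than staying fixed. One common realization, sketched here as an assumption (the paper's exact schedule may differ), is a count-based step size α = 1/visits(s, a), which satisfies the usual stochastic-approximation decay conditions:

```python
from collections import defaultdict

class AdaptiveAlphaQ:
    """Tabular Q-learning whose step size decays per (state, action)
    visit count -- an illustrative stand-in for dAdaptOR's schedule."""

    def __init__(self, gamma=0.9):
        self.Q = defaultdict(float)
        self.visits = defaultdict(int)
        self.gamma = gamma

    def update(self, s, a, reward, next_values):
        """next_values: Q-values of the actions available in the next
        state (empty when the packet is delivered or withdrawn)."""
        self.visits[(s, a)] += 1
        alpha = 1.0 / self.visits[(s, a)]  # decays as 1/n per pair
        target = reward + self.gamma * max(next_values, default=0.0)
        self.Q[(s, a)] += alpha * (target - self.Q[(s, a)])
```

With a 1/n step size, each Q-entry tracks the running average of its targets, which is what makes the scheme robust to how the table is initialized.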

QAR [276]

Centralized SARSA (ε-greedy)

QoS-aware adaptive routing (SDN)

Sprint GIP network trace-driven simulations [418] · 25 switches, 53 links

State: \(\mathcal {N}_{i}\) Reward: function of delay, loss, throughput

Next-hop nodes to destination

· S = #nodes · A = #neighbors

Compared to Q-learning with QoS-awareness: · faster convergence (20 fewer episodes)
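
QAR uses SARSA rather than Q-learning: the update bootstraps on the next-hop action the policy actually takes, not on the greedy maximum. The difference is one term; a minimal sketch (generic SARSA notation, not QAR-specific symbols):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update:
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
    Q-learning would use max_a' Q(s',a') in place of Q(s',a')."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return Q[(s, a)]

# Usage: one update after forwarding from switch "n1" via "n2",
# where the policy has already chosen "n3" as the following hop.
Q = defaultdict(float)
sarsa_update(Q, "n1", "n2", 1.0, "n2", "n3", alpha=0.5)
```

Because the target follows the ε-greedy behavior policy itself, SARSA's value estimates account for exploration traffic, which suits a centralized SDN controller that both learns and installs the routes.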
