AdaR [461]

Partially decentralized LSPI (ε-greedy)

Unicast routing (WSN)

Simulations · 400 sensors · 20 data sources · 1 sink

State: \(\mathcal {N}_{i}\) Reward: function of · node load · residual energy · hop cost to sink · link reliability

Next-hop nodes to destination

· S = #nodes · A = #neighbors

Compared to Q-learning: · faster convergence (by 40 episodes) · less sensitive to initial parameters

FROMS [151]

Q-learning (variant of ε-greedy)

Multicast routing (WSN)

OMNeT++ Mobility Framework with 50 random topologies · 50 nodes · 5 sources · 45 sinks

State: \((\mathcal {N}^{k}_{i}, D_{k})\) Reward: function of hop cost

\(\{a_{1} \cdots a_{m}\}\), where \(a_{k} = (\mathcal {N}^{k}_{j}, D_{k})\) and \(\mathcal {N}^{k}_{j}\) = next hop along the path to sink \(D_{k}\)

· S = #nodes · A = #neighbors

Compared to directed diffusion: · up to 5× higher delivery rate · ≈20% lower overhead

QPR [24]

Variant of Q-learning (ε-greedy)

Localization-aware routing to achieve a trade-off between packet delivery rate, ETX, and network lifetime (WSN)

Simulations · 50 different topologies · 100 nodes

State: \(\mathcal {N}_{i}\) Reward: function of · distance(\(\mathcal {N}_{i}\), \(\mathcal {N}_{j}\)) · distance(\(\mathcal {N}_{j}\), d) · energy at \(\mathcal {N}_{j}\) · ETX · \(\mathcal {N}_{j}\)’s neighbors, for any neighbor \(\mathcal {N}_{j}\) and destination d

Next-hop nodes to destination

· S = #nodes · A = #neighbors

Delivery rate: · 25% higher than GPSR Network lifetime: · 3× longer than GPSR · 4× longer than EFE
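
The three WSN schemes above share one tabular pattern: the state is the current node, the action is the choice of next-hop neighbor, and values are learned under an ε-greedy policy over S = #nodes states and A = #neighbors actions. A minimal sketch of that pattern, using a hypothetical three-node line topology and a simple hop-cost reward (the topology, rewards, and parameters are illustrative assumptions, not taken from AdaR, FROMS, or QPR):

```python
import random
from collections import defaultdict

# Toy line topology (illustrative): packets originate at "A", sink is "C".
NEIGHBORS = {"A": ["B"], "B": ["A", "C"], "C": []}
SINK = "C"

Q = defaultdict(float)  # Q[(node, next_hop)]; S = #nodes, A = #neighbors

def epsilon_greedy(node, eps, rng):
    """Explore a random neighbor with probability eps, else exploit."""
    nbs = NEIGHBORS[node]
    if rng.random() < eps:
        return rng.choice(nbs)
    return max(nbs, key=lambda nb: Q[(node, nb)])

def q_update(node, nxt, reward, alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update for the (node, next-hop) pair."""
    best_next = max((Q[(nxt, nb)] for nb in NEIGHBORS[nxt]), default=0.0)
    Q[(node, nxt)] += alpha * (reward + gamma * best_next - Q[(node, nxt)])

def route_one_packet(rng, eps=0.2):
    """Forward one packet hop by hop, learning as it travels."""
    node = "A"
    while node != SINK:
        nxt = epsilon_greedy(node, eps, rng)
        reward = 0.0 if nxt == SINK else -1.0  # hop cost until delivery
        q_update(node, nxt, reward)
        node = nxt

rng = random.Random(0)
for _ in range(200):
    route_one_packet(rng)
```

After a couple of hundred packets, node B's value for forwarding toward the sink dominates its value for backtracking to A, so the greedy policy settles on the shortest path.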

Ref.

Technique (selection)

Application (network)

Dataset

Features^{a}

Action set

Evaluation: Settings^{a} · Improvement^{b}

Xia et al. [482]

DRQ-learning (greedy)

Spectrum-aware routing (CRN)

OMNeT++ simulations · stationary multi-hop CRN · 10 nodes · 2 PUs

State: \(\mathcal {N}_{i}\) Reward: #available channels between current node and next-hop node

Next-hop nodes to destination

· S = #nodes · A = #neighbors

Compared to Q-routing: · 50% faster at lower activity level. Compared to Q-routing and SP-routing: · lower converged end-to-end delay

QELAR [197]

Model-based Q-learning (greedy)

Distributed energy-efficient routing (underwater WSN)

Simulations (ns-2) · 250 sensors in a 500^{3} m^{3} space · 100 m transmission range · fixed source/sink · 1 m/s maximum speed for intermediate nodes

State: \(\mathcal {N}_{i}\) Reward: function of the residual energy of the node receiving the packet and the energy distribution among its neighbor nodes

Next-hop nodes to destination ∪ packet withdrawal

· S = #nodes · A = 1 + #neighbors

Compared to Q-learning: · faster convergence (40 fewer episodes) · less sensitive to initial parameters
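
QELAR's action set is the node's neighbors plus an explicit packet-withdrawal action (hence A = 1 + #neighbors), and its reward weighs the receiving node's residual energy against the energy distribution in its neighborhood. A hedged sketch of these two ingredients; the weights and the exact form of the balance term are illustrative assumptions, not QELAR's precise formula:

```python
WITHDRAW = "withdraw"

def action_set(neighbors):
    """Action set: forward to any neighbor, or withdraw the packet
    (A = 1 + #neighbors)."""
    return list(neighbors) + [WITHDRAW]

def energy_reward(residual, neighbor_residuals, w_res=0.7, w_bal=0.3):
    """Illustrative energy-aware reward: favor receivers with high
    residual energy that also sit in an energy-balanced neighborhood."""
    mean_nb = sum(neighbor_residuals) / len(neighbor_residuals)
    balance_penalty = abs(residual - mean_nb)  # unbalanced regions score worse
    return w_res * residual - w_bal * balance_penalty
```

Coupling the reward to the neighborhood's energy distribution, rather than to the receiver alone, is what spreads forwarding load and extends network lifetime.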

Lin et al. [277]

n-step TD (greedy)

Delay-sensitive application routing (multi-hop wireless ad hoc networks)

Simulations · 2 users transmitting video sequences to the same destination node · 3∼4-hop wireless network

State: current channel states and queue sizes at the nodes in each hop Reward: goodput at destination

Next-hop nodes to destination

· \(S=n_{q}^{N}\times n_{c}^{H}\) · \(A=(N_{h}^{2})^{H-1}\times N_{h}\), where \(N\) = #nodes, \(N_{h}\) = #nodes at hop \(h\), \(H\) = #hops, \(n_{q}\) = #queue states, \(n_{c}\) = #channel states

Complexity ≈ 2×10^{8} for the 3-hop network. With 95% less information exchange: · ∼10% higher PSNR · slightly slower convergence (+1∼2 s)

dAdaptOR [59]

Q-learning with adaptive learning rate (ε-greedy)

Opportunistic routing (multi-hop wireless ad hoc networks)

Simulations on QualNet with 36 randomly placed wireless nodes in a 150 m × 150 m area

State: \(\mathcal {N}_{i}\) Reward: · fixed negative transmission cost if the receiver is not the destination · fixed positive reward if the receiver is the destination · 0 if the packet is withdrawn

Next-hop nodes to destination ∪ packet withdrawal

· S = #nodes · A = 1 + #neighbors

After convergence (≈300 s): · ETX comparable to a topology-aware routing algorithm · >30% improvement over greedy SR, greedy ExOR, and SRCR with a single flow · improvement decreases with the number of flows
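
dAdaptOR's distinguishing element is a learning rate that adapts per state–action pair rather than staying fixed. One common realization, sketched here as an assumption (the paper's exact schedule may differ), is a count-based step size α = 1/visits(s, a), which satisfies the usual stochastic-approximation decay conditions:

```python
from collections import defaultdict

class AdaptiveAlphaQ:
    """Tabular Q-learning whose step size decays per (state, action)
    visit count -- an illustrative stand-in for dAdaptOR's schedule."""

    def __init__(self, gamma=0.9):
        self.Q = defaultdict(float)
        self.visits = defaultdict(int)
        self.gamma = gamma

    def update(self, s, a, reward, next_values):
        """next_values: Q-values of the actions available in the next
        state (empty when the packet is delivered or withdrawn)."""
        self.visits[(s, a)] += 1
        alpha = 1.0 / self.visits[(s, a)]  # decays as 1/n per pair
        target = reward + self.gamma * max(next_values, default=0.0)
        self.Q[(s, a)] += alpha * (target - self.Q[(s, a)])
```

With a 1/n step size, each Q-entry tracks the running average of its targets, which is what makes the scheme robust to how the table is initialized.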

QAR [276]

Centralized SARSA (ε-greedy)

QoS-aware adaptive routing (SDN)

Sprint GIP network trace-driven simulations [418] · 25 switches, 53 links

State: \(\mathcal {N}_{i}\) Reward: function of delay, loss, throughput

Next-hop nodes to destination

· S = #nodes · A = #neighbors

Compared to Q-learning with QoS-awareness: · faster convergence (20 fewer episodes)
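
QAR uses SARSA rather than Q-learning: the update bootstraps on the next-hop action the policy actually takes, not on the greedy maximum. The difference is one term; a minimal sketch (generic SARSA notation, not QAR-specific symbols):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update:
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
    Q-learning would use max_a' Q(s',a') in place of Q(s',a')."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return Q[(s, a)]

# Usage: one update after forwarding from switch "n1" via "n2",
# where the policy has already chosen "n3" as the following hop.
Q = defaultdict(float)
sarsa_update(Q, "n1", "n2", 1.0, "n2", "n3", alpha=0.5)
```

Because the target follows the ε-greedy behavior policy itself, SARSA's value estimates account for exploration traffic, which suits a centralized SDN controller that both learns and installs the routes.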
