Fehu.Buffer

Experience replay buffer for trajectory storage.
Buffers accumulate trajectories for batch training or off-policy learning.
Experience collection buffers for reinforcement learning algorithms.
This module provides two buffer types for storing agent-environment interactions: replay buffers for off-policy algorithms and rollout buffers for on-policy algorithms. Both support efficient batch sampling and storage management.
Replay buffers store transitions with complete state information, supporting off-policy algorithms like DQN, SAC, and TD3. They maintain a fixed-capacity circular buffer that overwrites oldest experiences when full.
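The overwrite-oldest behaviour can be illustrated with a minimal circular-buffer sketch. This is not Fehu's actual internal representation; the `ring` type and its functions are hypothetical names used only to show the write-index arithmetic:

```ocaml
(* Minimal sketch of a fixed-capacity circular buffer: once full, each
   new element overwrites the oldest one. Illustrative only. *)
type 'a ring = {
  data : 'a option array;  (* fixed-size backing store *)
  mutable pos : int;       (* next write index *)
  mutable size : int;      (* number of stored elements *)
}

let create capacity = { data = Array.make capacity None; pos = 0; size = 0 }

let add ring x =
  ring.data.(ring.pos) <- Some x;
  ring.pos <- (ring.pos + 1) mod Array.length ring.data;
  ring.size <- min (ring.size + 1) (Array.length ring.data)
```

With `capacity = 3`, adding a fourth element wraps `pos` back to 0 and replaces the first element, while `size` stays pinned at the capacity.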
Rollout buffers store sequential steps with optional value estimates and log probabilities, supporting on-policy algorithms like PPO and A2C. They compute advantages using Generalized Advantage Estimation (GAE) before returning batches.
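GAE combines one-step TD errors into an exponentially weighted advantage estimate via a single backward pass. The following standalone `gae` function is a sketch of that recursion under common conventions (episode boundaries zero out the bootstrap), not Fehu's implementation:

```ocaml
(* GAE recursion, computed backwards over a rollout:
     delta_t = r_t + gamma * v_{t+1} * (1 - done_t) - v_t
     a_t     = delta_t + gamma * lambda * (1 - done_t) * a_{t+1}
   [last_value] bootstraps the final step. Illustrative only. *)
let gae ~gamma ~gae_lambda ~rewards ~values ~dones ~last_value =
  let n = Array.length rewards in
  let advantages = Array.make n 0.0 in
  let next_adv = ref 0.0 in
  for t = n - 1 downto 0 do
    let next_v = if t = n - 1 then last_value else values.(t + 1) in
    let mask = if dones.(t) then 0.0 else 1.0 in
    let delta = rewards.(t) +. (gamma *. next_v *. mask) -. values.(t) in
    next_adv := delta +. (gamma *. gae_lambda *. mask *. !next_adv);
    advantages.(t) <- !next_adv
  done;
  advantages
```

Setting `gae_lambda = 0.0` recovers one-step TD advantages; `gae_lambda = 1.0` recovers Monte Carlo returns minus the value baseline.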
Create a replay buffer and add transitions:

```ocaml
let buffer = Buffer.Replay.create ~capacity:10000 in
let transition =
  { observation; action; reward; next_observation; terminated; truncated }
in
Buffer.Replay.add buffer transition
```

Sample a batch for training:

```ocaml
let batch = Buffer.Replay.sample buffer ~rng ~batch_size:32 in
Array.iter (fun t -> (* train on transition *) ignore t) batch
```

Use rollout buffers for on-policy data:
```ocaml
let buffer = Buffer.Rollout.create ~capacity:2048 in
Buffer.Rollout.add buffer
  { observation; action; reward; terminated; truncated; value; log_prob };
Buffer.Rollout.compute_advantages buffer ~last_value ~last_done ~gamma:0.99
  ~gae_lambda:0.95;
let steps, advantages, returns = Buffer.Rollout.get buffer
```

```ocaml
type ('obs, 'act) transition = {
  observation : 'obs;  (** Current state observation *)
  action : 'act;  (** Action taken in current state *)
  reward : float;  (** Immediate reward received *)
  next_observation : 'obs;  (** Resulting next state observation *)
  terminated : bool;  (** Whether episode ended naturally *)
  truncated : bool;  (** Whether episode was artificially truncated *)
}
```

Basic transition for off-policy algorithms.
Represents a complete state transition containing both the current and next observations. Used by replay buffers for algorithms that learn from arbitrary past experiences.
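The `terminated`/`truncated` distinction matters for value bootstrapping: a naturally terminal state has no future return, while a truncated episode does. A hypothetical one-step TD target (DQN-style) makes this concrete; `td_target` is an illustrative helper, not part of Fehu's API:

```ocaml
(* One-step TD target: bootstrap from the next state's value unless the
   episode ended naturally. Under truncation, bootstrapping proceeds as
   usual. Illustrative sketch only. *)
let td_target ~gamma ~reward ~terminated ~next_value =
  if terminated then reward
  else reward +. (gamma *. next_value)
```

Treating truncation as termination would wrongly zero out the future return at artificial episode cutoffs, biasing the learned values.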
```ocaml
type ('obs, 'act) step = {
  observation : 'obs;  (** State observation at this step *)
  action : 'act;  (** Action taken at this step *)
  reward : float;  (** Immediate reward received *)
  terminated : bool;  (** Whether episode ended at this step *)
  truncated : bool;  (** Whether the episode was truncated at this step *)
  value : float option;  (** Value estimate V(s) from critic, if available *)
  log_prob : float option;  (** Log probability log π(a|s) from policy, if available *)
}
```

Rollout step for on-policy algorithms.
Represents a single timestep with optional policy information. Unlike transitions, steps do not store next observations since on-policy data is processed sequentially. Value estimates and log probabilities support policy gradient methods.
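Sequential storage is why steps can omit `next_observation`: quantities such as discounted returns fall out of one backward pass over the ordered steps. The `returns_to_go` function below is a sketch under that assumption, not a Fehu API:

```ocaml
(* Discounted returns-to-go from sequentially stored steps: a single
   backward pass, resetting the accumulator at episode boundaries.
   Illustrative sketch only. *)
let returns_to_go ~gamma rewards dones =
  let n = Array.length rewards in
  let out = Array.make n 0.0 in
  let acc = ref 0.0 in
  for t = n - 1 downto 0 do
    if dones.(t) then acc := 0.0;
    acc := rewards.(t) +. (gamma *. !acc);
    out.(t) <- !acc
  done;
  out
```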