Fehu.Buffer

Experience replay buffer for trajectory storage.
Buffers accumulate trajectories for batch training or off-policy learning.
Experience collection buffers for reinforcement learning algorithms.
This module provides two buffer types for storing agent-environment interactions: replay buffers for off-policy algorithms and rollout buffers for on-policy algorithms. Both support efficient batch sampling and storage management.
Replay buffers store transitions with complete state information, supporting off-policy algorithms like DQN, SAC, and TD3. They maintain a fixed-capacity circular buffer that overwrites oldest experiences when full.
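The overwrite-oldest behaviour can be illustrated with a minimal circular-buffer sketch. This is not Fehu's actual internal representation; the `ring` type and its functions are hypothetical names used only to show the write-index arithmetic:

```ocaml
(* Minimal sketch of a fixed-capacity circular buffer: once full, each
   new element overwrites the oldest one. Illustrative only. *)
type 'a ring = {
  data : 'a option array;  (* fixed-size backing store *)
  mutable pos : int;       (* next write index *)
  mutable size : int;      (* number of stored elements *)
}

let create capacity = { data = Array.make capacity None; pos = 0; size = 0 }

let add ring x =
  ring.data.(ring.pos) <- Some x;
  ring.pos <- (ring.pos + 1) mod Array.length ring.data;
  ring.size <- min (ring.size + 1) (Array.length ring.data)
```

With `capacity = 3`, adding a fourth element wraps `pos` back to 0 and replaces the first element, while `size` stays pinned at the capacity.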
Rollout buffers store sequential steps with optional value estimates and log probabilities, supporting on-policy algorithms like PPO and A2C. They compute advantages using Generalized Advantage Estimation (GAE) before returning batches.
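GAE combines one-step TD errors into an exponentially weighted advantage estimate via a single backward pass. The following standalone `gae` function is a sketch of that recursion under common conventions (episode boundaries zero out the bootstrap), not Fehu's implementation:

```ocaml
(* GAE recursion, computed backwards over a rollout:
     delta_t = r_t + gamma * v_{t+1} * (1 - done_t) - v_t
     a_t     = delta_t + gamma * lambda * (1 - done_t) * a_{t+1}
   [last_value] bootstraps the final step. Illustrative only. *)
let gae ~gamma ~gae_lambda ~rewards ~values ~dones ~last_value =
  let n = Array.length rewards in
  let advantages = Array.make n 0.0 in
  let next_adv = ref 0.0 in
  for t = n - 1 downto 0 do
    let next_v = if t = n - 1 then last_value else values.(t + 1) in
    let mask = if dones.(t) then 0.0 else 1.0 in
    let delta = rewards.(t) +. (gamma *. next_v *. mask) -. values.(t) in
    next_adv := delta +. (gamma *. gae_lambda *. mask *. !next_adv);
    advantages.(t) <- !next_adv
  done;
  advantages
```

Setting `gae_lambda = 0.0` recovers one-step TD advantages; `gae_lambda = 1.0` recovers Monte Carlo returns minus the value baseline.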
Create a replay buffer and add transitions:

```ocaml
let buffer = Buffer.Replay.create ~capacity:10000 in
let transition =
  { observation; action; reward; next_observation; terminated; truncated }
in
Buffer.Replay.add buffer transition
```

Sample a batch for training:

```ocaml
let batch = Buffer.Replay.sample buffer ~rng ~batch_size:32 in
Array.iter (fun t -> (* train on transition *) ignore t) batch
```

Use rollout buffers for on-policy data:
```ocaml
let buffer = Buffer.Rollout.create ~capacity:2048 in
Buffer.Rollout.add buffer
  { observation; action; reward; terminated; truncated; value; log_prob };
Buffer.Rollout.compute_advantages buffer ~last_value ~last_done ~gamma:0.99
  ~gae_lambda:0.95;
let steps, advantages, returns = Buffer.Rollout.get buffer
```

```ocaml
type ('obs, 'act) transition = {
  observation : 'obs;  (** Current state observation *)
  action : 'act;  (** Action taken in current state *)
  reward : float;  (** Immediate reward received *)
  next_observation : 'obs;  (** Resulting next state observation *)
  terminated : bool;  (** Whether episode ended naturally *)
  truncated : bool;  (** Whether episode was artificially truncated *)
}
```

Basic transition for off-policy algorithms.
Represents a complete state transition containing both the current and next observations. Used by replay buffers for algorithms that learn from arbitrary past experiences.
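The `terminated`/`truncated` distinction matters for value bootstrapping: a naturally terminal state has no future return, while a truncated episode does. A hypothetical one-step TD target (DQN-style) makes this concrete; `td_target` is an illustrative helper, not part of Fehu's API:

```ocaml
(* One-step TD target: bootstrap from the next state's value unless the
   episode ended naturally. Under truncation, bootstrapping proceeds as
   usual. Illustrative sketch only. *)
let td_target ~gamma ~reward ~terminated ~next_value =
  if terminated then reward
  else reward +. (gamma *. next_value)
```

Treating truncation as termination would wrongly zero out the future return at artificial episode cutoffs, biasing the learned values.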
```ocaml
type ('obs, 'act) step = {
  observation : 'obs;  (** State observation at this step *)
  action : 'act;  (** Action taken at this step *)
  reward : float;  (** Immediate reward received *)
  terminated : bool;  (** Whether episode ended at this step *)
  truncated : bool;  (** Whether the episode was truncated at this step *)
  value : float option;  (** Value estimate V(s) from critic, if available *)
  log_prob : float option;  (** Log probability log π(a|s) from policy, if available *)
}
```

Rollout step for on-policy algorithms.
Represents a single timestep with optional policy information. Unlike transitions, steps do not store next observations since on-policy data is processed sequentially. Value estimates and log probabilities support policy gradient methods.
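Sequential storage is why steps can omit `next_observation`: quantities such as discounted returns fall out of one backward pass over the ordered steps. The `returns_to_go` function below is a sketch under that assumption, not a Fehu API:

```ocaml
(* Discounted returns-to-go from sequentially stored steps: a single
   backward pass, resetting the accumulator at episode boundaries.
   Illustrative sketch only. *)
let returns_to_go ~gamma rewards dones =
  let n = Array.length rewards in
  let out = Array.make n 0.0 in
  let acc = ref 0.0 in
  for t = n - 1 downto 0 do
    if dones.(t) then acc := 0.0;
    acc := rewards.(t) +. (gamma *. !acc);
    out.(t) <- !acc
  done;
  out
```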