Nx_datasets
Dataset loading and generation utilities for Nx.
This module provides functions to load common machine learning datasets and generate synthetic datasets for testing and experimentation. Real datasets are downloaded and cached in the platform-specific cache directory.
Functions to load classic machine learning datasets as Nx tensors.
load_mnist () loads the MNIST handwritten digits dataset.
Returns training and test sets with images as uint8 tensors of shape [|samples; 28; 28; 1|] and labels as uint8 tensors of shape [|samples; 1|]. The training set has 60,000 samples and the test set has 10,000 samples.
Loading MNIST and checking shapes:
let (x_train, y_train), (x_test, y_test) = Nx_datasets.load_mnist () in
Nx.shape x_train = [| 60000; 28; 28; 1 |]
&& Nx.shape y_train = [| 60000; 1 |]
&& Nx.shape x_test = [| 10000; 28; 28; 1 |]
&& Nx.shape y_test = [| 10000; 1 |]
load_fashion_mnist () loads the Fashion-MNIST clothing dataset.
Returns the same format as MNIST: images as uint8 tensors of shape [|samples; 28; 28; 1|] and labels as uint8 tensors of shape [|samples; 1|].
load_cifar10 () loads the CIFAR-10 color image dataset.
Returns training and test sets with images as uint8 tensors of shape [|samples; 32; 32; 3|] and labels as uint8 tensors of shape [|samples; 1|]. The training set has 50,000 samples and the test set has 10,000 samples.
load_iris () loads the Iris flower classification dataset.
Returns features as a float64 tensor of shape [|150; 4|] and labels as an int32 tensor of shape [|150; 1|]. Features are sepal length/width and petal length/width. Labels are 0 (setosa), 1 (versicolor), 2 (virginica).
load_breast_cancer () loads the Breast Cancer Wisconsin dataset.
Returns features as a float64 tensor of shape [|569; 30|] and labels as an int32 tensor of shape [|569; 1|]. Labels are 0 (malignant) or 1 (benign).
load_diabetes () loads the diabetes regression dataset.
Returns features as a float64 tensor of shape [|442; 10|] and targets as a float64 tensor of shape [|442; 1|]. The target is a quantitative measure of disease progression one year after baseline.
load_california_housing () loads the California housing prices dataset.
Returns features as a float64 tensor of shape [|20640; 8|] and targets as a float64 tensor of shape [|20640; 1|]. The target is the median house value in hundreds of thousands of dollars.
load_airline_passengers () loads monthly airline passenger counts.
Returns an int32 tensor of shape [|144|] containing monthly passenger totals from 1949 to 1960.
Functions to generate synthetic datasets with controlled properties for algorithm development and testing.
val make_blobs :
?n_samples:int ->
?n_features:int ->
?centers:[ `N of int | `Array of Nx.float32_t ] ->
?cluster_std:float ->
?center_box:(float * float) ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_blobs ?n_samples ?n_features ?centers ?cluster_std ?center_box ?shuffle ?random_state () generates isotropic Gaussian blobs.
Creates clusters of points with each cluster drawn from a normal distribution. Returns features and integer labels.
Generating 3 well-separated 2D clusters:
let x, y = Nx_datasets.make_blobs ~centers:(`N 3) ~cluster_std:0.5 () in
Nx.shape x = [| 100; 2 |] && Nx.shape y = [| 100 |]
val make_classification :
?n_samples:int ->
?n_features:int ->
?n_informative:int ->
?n_redundant:int ->
?n_repeated:int ->
?n_classes:int ->
?n_clusters_per_class:int ->
?weights:float list ->
?flip_y:float ->
?class_sep:float ->
?hypercube:bool ->
?shift:float ->
?scale:float ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_classification ?n_samples ?n_features ?n_informative ... generates a random n-class classification problem.
Creates a dataset with controllable characteristics including informative, redundant, and useless features. Useful for testing feature selection.
Creating a binary classification dataset:
let x, y =
Nx_datasets.make_classification ~n_features:10 ~n_informative:3
~n_redundant:1 ()
in
Nx.shape x = [| 100; 10 |] && Nx.shape y = [| 100 |]
val make_gaussian_quantiles :
?mean:float array ->
?cov:float ->
?n_samples:int ->
?n_features:int ->
?n_classes:int ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_gaussian_quantiles ?mean ?cov ?n_samples ... generates an isotropic Gaussian divided by quantiles.
Divides a single Gaussian cluster into near-equal-size classes separated by concentric hyperspheres. Useful for testing algorithms that assume Gaussian distributions.
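The quantile split can be pictured as ranking samples by their distance from the mean and cutting that ranking into equal parts. The following is a minimal sketch in plain OCaml, assuming this is how the shells are formed; quantile_labels is a hypothetical helper for illustration and is not part of Nx_datasets.

```ocaml
(* Hypothetical helper, not part of Nx_datasets: assigns each sample to one
   of [n_classes] concentric shells by ranking its squared distance from
   [mean]. *)
let quantile_labels ~(mean : float array) ~(n_classes : int)
    (points : float array array) : int array =
  let n = Array.length points in
  (* Squared Euclidean distance of one sample from the mean. *)
  let sq_dist p =
    let acc = ref 0.0 in
    Array.iteri
      (fun i v ->
        let d = v -. mean.(i) in
        acc := !acc +. (d *. d))
      p;
    !acc
  in
  let dists = Array.map sq_dist points in
  (* Sort sample indices by distance from the mean, nearest first. *)
  let order = Array.init n (fun i -> i) in
  Array.sort (fun a b -> compare dists.(a) dists.(b)) order;
  (* The sample with rank r goes to class r * n_classes / n. *)
  let labels = Array.make n 0 in
  Array.iteri (fun rank idx -> labels.(idx) <- rank * n_classes / n) order;
  labels

let () =
  let points =
    [| [| 3.0; 0.0 |]; [| 0.0; 1.0 |]; [| 2.0; 0.0 |]; [| 0.0; 0.5 |] |]
  in
  let labels = quantile_labels ~mean:[| 0.0; 0.0 |] ~n_classes:2 points in
  Array.iter (Printf.printf "%d ") labels;
  print_newline ()
```

With four samples and two classes, the two samples nearest the mean land in class 0 and the two farthest in class 1, so every class boundary is a sphere around the mean.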
val make_hastie_10_2 :
?n_samples:int ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_hastie_10_2 ?n_samples ?random_state () generates the Hastie et al. 2009 binary problem.
Generates 10-dimensional dataset where y = 1 if sum(x_i^2) > 9.34 else 0. Standard benchmark for binary classification.
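The labeling rule is simple enough to restate in a few lines of plain OCaml. This is a sketch of the documented threshold only; hastie_label is a hypothetical name, not a function of Nx_datasets. If the features are standard Gaussian (an assumption, as in the original benchmark), 9.34 is close to the median of a chi-squared distribution with 10 degrees of freedom, so the two classes come out roughly balanced.

```ocaml
(* Hypothetical helper, not part of Nx_datasets: applies the documented rule
   y = 1 if sum(x_i^2) > 9.34 else 0 to a single sample. *)
let hastie_label (x : float array) : int =
  let sum_sq = Array.fold_left (fun acc v -> acc +. (v *. v)) 0.0 x in
  if sum_sq > 9.34 then 1 else 0

let () =
  (* Ten features of 1.0 give a squared norm of 10.0, above the threshold. *)
  let far = Array.make 10 1.0 in
  (* Ten features of 0.5 give a squared norm of 2.5, below the threshold. *)
  let near = Array.make 10 0.5 in
  Printf.printf "far -> %d, near -> %d\n" (hastie_label far) (hastie_label near)
```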
val make_circles :
?n_samples:int ->
?shuffle:bool ->
?noise:float ->
?random_state:int ->
?factor:float ->
unit ->
Nx.float32_t * Nx.int32_t
make_circles ?n_samples ?shuffle ?noise ?random_state ?factor () generates concentric circles.
Creates a large circle containing a smaller circle in 2D. Tests algorithms' ability to learn non-linear boundaries.
Creating noisy concentric circles:
let x, y = Nx_datasets.make_circles ~noise:0.1 ~factor:0.5 () in
Nx.shape x = [| 100; 2 |]
&& Array.for_all (fun v -> v = 0l || v = 1l) (Nx.to_array y)
val make_moons :
?n_samples:int ->
?shuffle:bool ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_moons ?n_samples ?shuffle ?noise ?random_state () generates two interleaving half circles.
Creates two half-moon shapes. Tests algorithms' ability to handle non-convex clusters.
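To see why the two moons are not linearly separable, it can help to write out the noise-free geometry. The sketch below uses plain OCaml and follows scikit-learn's convention for the offsets of the second half circle; that convention, the label order, and the helper name moons are assumptions and are not taken from the Nx_datasets source.

```ocaml
let pi = 4.0 *. atan 1.0

(* Hypothetical helper: noise-free points on two interleaving half circles.
   The offsets (1.0, 0.5) follow scikit-learn's convention and are an
   assumption about Nx_datasets. *)
let moons (n_per_moon : int) : (float * float) array * int array =
  (* Angles are spread evenly over [0, pi]. *)
  let angle i =
    pi *. float_of_int i /. float_of_int (max 1 (n_per_moon - 1))
  in
  (* Upper moon: half of the unit circle centered at the origin. *)
  let upper = Array.init n_per_moon (fun i -> (cos (angle i), sin (angle i))) in
  (* Lower moon: the same arc flipped and shifted so the two interleave. *)
  let lower =
    Array.init n_per_moon (fun i ->
        (1.0 -. cos (angle i), 0.5 -. sin (angle i)))
  in
  ( Array.append upper lower,
    Array.append (Array.make n_per_moon 0) (Array.make n_per_moon 1) )

let () =
  let points, labels = moons 3 in
  Array.iteri
    (fun i (x, y) -> Printf.printf "%d: (%.2f, %.2f)\n" labels.(i) x y)
    points
```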
val make_multilabel_classification :
?n_samples:int ->
?n_features:int ->
?n_classes:int ->
?n_labels:int ->
?length:int ->
?allow_unlabeled:bool ->
?sparse:bool ->
?return_indicator:bool ->
?return_distributions:bool ->
?random_state:int ->
unit ->
Nx.float32_t * [ `Float of Nx.float32_t | `Int of Nx.int32_t ]
make_multilabel_classification ?n_samples ?n_features ... generates a random multilabel problem.
Creates samples with multiple labels per instance. Models bag-of-words with multiple topics per document.
Returns (X, Y) where the type of Y depends on return_indicator:
- [|n_samples; n_labels|] containing label indices
- [|n_samples; n_classes|] containing binary indicators
val make_regression :
?n_samples:int ->
?n_features:int ->
?n_informative:int ->
?n_targets:int ->
?bias:float ->
?effective_rank:int option ->
?tail_strength:float ->
?noise:float ->
?shuffle:bool ->
?coef:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t * Nx.float32_t option
make_regression ?n_samples ?n_features ... generates a random regression problem.
Creates linear combination of random features with optional noise and low-rank structure.
Creating multi-output regression:
let x, y, coef =
Nx_datasets.make_regression ~n_features:20 ~n_informative:5 ~n_targets:2
~coef:true ()
in
Nx.shape x = [| 100; 20 |]
&& Nx.shape y = [| 100; 2 |]
&& match coef with Some c -> Nx.shape c = [| 20; 2 |] | None -> false
make_sparse_uncorrelated ?n_samples ?n_features ?random_state () generates a sparse uncorrelated design.
Only first 4 features affect target: y = x0 + 2*x1 - 2*x2 - 1.5*x3
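The target formula can be checked by hand with a small plain-OCaml function. This is a sketch of the documented formula only; sparse_uncorrelated_target is a hypothetical name, and any noise the generator may add is not modeled here.

```ocaml
(* Hypothetical helper, not part of Nx_datasets: the documented target
   y = x0 + 2*x1 - 2*x2 - 1.5*x3. Features past the first four are ignored. *)
let sparse_uncorrelated_target (x : float array) : float =
  if Array.length x < 4 then
    invalid_arg "sparse_uncorrelated_target: need at least 4 features";
  x.(0) +. (2.0 *. x.(1)) -. (2.0 *. x.(2)) -. (1.5 *. x.(3))

let () =
  (* The fifth feature is present but has no effect on the result. *)
  let y = sparse_uncorrelated_target [| 1.0; 1.0; 1.0; 1.0; 99.0 |] in
  Printf.printf "y = %.1f\n" y
```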
val make_friedman1 :
?n_samples:int ->
?n_features:int ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t
make_friedman1 ?n_samples ?n_features ?noise ?random_state () generates the Friedman #1 problem.
Features are uniformly distributed on [0, 1]. Output: y = 10 * sin(pi * x0 * x1) + 20 * (x2 - 0.5)^2 + 10 * x3 + 5 * x4 + noise
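The Friedman #1 output without the noise term can be written directly from the formula. The following plain-OCaml sketch is for checking values by hand; friedman1 is a hypothetical helper name, not a function of Nx_datasets.

```ocaml
let pi = 4.0 *. atan 1.0

(* Hypothetical helper, not part of Nx_datasets: noise-free Friedman #1
   output for one sample. Only the first five features contribute. *)
let friedman1 (x : float array) : float =
  if Array.length x < 5 then invalid_arg "friedman1: need at least 5 features";
  (10.0 *. sin (pi *. x.(0) *. x.(1)))
  +. (20.0 *. ((x.(2) -. 0.5) ** 2.0))
  +. (10.0 *. x.(3))
  +. (5.0 *. x.(4))

let () =
  (* sin(pi * 0.5 * 1.0) = 1 and (0.5 - 0.5)^2 = 0, so y is about
     10 + 0 + 10 + 5. *)
  Printf.printf "y = %.3f\n" (friedman1 [| 0.5; 1.0; 0.5; 1.0; 1.0 |])
```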
val make_friedman2 :
?n_samples:int ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t
make_friedman2 ?n_samples ?noise ?random_state () generates the Friedman #2 problem.
Four features with ranges: x0 in [0, 100], x1 in [40, 560], x2 in [0, 1], x3 in [1, 11]. Output: y = sqrt(x0^2 + (x1 * x2 - 1/(x1 * x3))^2) + noise
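A noise-free version of the Friedman #2 output, written in plain OCaml as a sketch of the formula; friedman2 is a hypothetical helper name, not part of Nx_datasets.

```ocaml
(* Hypothetical helper, not part of Nx_datasets: noise-free Friedman #2
   output, y = sqrt(x0^2 + (x1 * x2 - 1/(x1 * x3))^2). *)
let friedman2 ~(x0 : float) ~(x1 : float) ~(x2 : float) ~(x3 : float) : float =
  let inner = (x1 *. x2) -. (1.0 /. (x1 *. x3)) in
  sqrt ((x0 *. x0) +. (inner *. inner))

let () =
  (* With x0 = 0 the result reduces to |x1 * x2 - 1/(x1 * x3)|. *)
  Printf.printf "y = %.4f\n" (friedman2 ~x0:0.0 ~x1:100.0 ~x2:0.5 ~x3:2.0)
```

Because x1 * x3 is large over the stated ranges, the 1/(x1 * x3) term is small and the output is dominated by x0 and the product x1 * x2.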
val make_friedman3 :
?n_samples:int ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t
make_friedman3 ?n_samples ?noise ?random_state () generates the Friedman #3 problem.
Four features with the same ranges as Friedman #2. Output: y = arctan((x1 * x2 - 1/(x1 * x3)) / x0) + noise
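The Friedman #3 output without noise, as a plain-OCaml sketch of the formula; friedman3 is a hypothetical helper name, not part of Nx_datasets.

```ocaml
(* Hypothetical helper, not part of Nx_datasets: noise-free Friedman #3
   output, y = arctan((x1 * x2 - 1/(x1 * x3)) / x0). *)
let friedman3 ~(x0 : float) ~(x1 : float) ~(x2 : float) ~(x3 : float) : float =
  let numerator = (x1 *. x2) -. (1.0 /. (x1 *. x3)) in
  atan (numerator /. x0)

let () =
  (* The numerator is about 49.995 here, so the ratio is close to 1. *)
  Printf.printf "y = %.4f\n" (friedman3 ~x0:49.995 ~x1:100.0 ~x2:0.5 ~x3:2.0)
```

Since arctan is bounded, the noise-free target always lies strictly between -pi/2 and pi/2, and it approaches pi/2 as x0 gets close to 0 with a positive numerator.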
val make_s_curve :
?n_samples:int ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t
make_s_curve ?n_samples ?noise ?random_state () generates the S-curve dataset.
Creates a 3D S-shaped manifold. Returns points and their position along the curve.
Returns (X, t) where X has shape [|n_samples; 3|] and t has shape [|n_samples|].
val make_swiss_roll :
?n_samples:int ->
?noise:float ->
?random_state:int ->
?hole:bool ->
unit ->
Nx.float32_t * Nx.float32_t
make_swiss_roll ?n_samples ?noise ?random_state ?hole () generates the swiss roll dataset.
Creates a 3D swiss roll manifold. Returns points and their position along the roll.
Returns (X, t) where X has shape [|n_samples; 3|] and t has shape [|n_samples|].
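The relationship between a point and its position t along the roll can be sketched in plain OCaml. The parametrization below (t = 1.5 * pi * (1 + 2u), height 21 * v) is scikit-learn's convention and is only an assumption about Nx_datasets; swiss_roll_point is a hypothetical helper name.

```ocaml
let pi = 4.0 *. atan 1.0

(* Hypothetical helper: maps two values u, v in [0, 1] to a noise-free point
   on the roll and its position t along the spiral. The constants follow
   scikit-learn and are an assumption about Nx_datasets. *)
let swiss_roll_point ~(u : float) ~(v : float) : (float * float * float) * float =
  let t = 1.5 *. pi *. (1.0 +. (2.0 *. u)) in
  ((t *. cos t, 21.0 *. v, t *. sin t), t)

let () =
  let (x, y, z), t = swiss_roll_point ~u:0.25 ~v:0.5 in
  Printf.printf "t = %.3f -> (%.3f, %.3f, %.3f)\n" t x y z
```

Under this parametrization the distance of a point from the roll's axis equals t, which is why t is a natural one-dimensional coordinate for manifold-learning tests.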
val make_low_rank_matrix :
?n_samples:int ->
?n_features:int ->
?effective_rank:int ->
?tail_strength:float ->
?random_state:int ->
unit ->
Nx.float32_t
make_low_rank_matrix ?n_samples ?n_features ?effective_rank ... generates a mostly low-rank matrix.
Creates matrix with bell-shaped singular value profile.
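One way to picture a bell-shaped singular value profile is as a Gaussian-like bump plus a slowly decaying tail. The sketch below uses the profile from scikit-learn's generator of the same name; whether Nx_datasets uses exactly this formula is an assumption, and singular_value is a hypothetical helper name.

```ocaml
(* Hypothetical helper: the i-th singular value of a bell-shaped profile,
   following scikit-learn's formula (an assumption about Nx_datasets).
   [effective_rank] sets the width of the bell and [tail_strength] the
   weight of the slowly decaying tail. *)
let singular_value ~(effective_rank : int) ~(tail_strength : float) (i : int) :
    float =
  let r = float_of_int i /. float_of_int effective_rank in
  let bell = (1.0 -. tail_strength) *. exp (-.(r *. r)) in
  let tail = tail_strength *. exp (-0.1 *. r) in
  bell +. tail

let () =
  (* Print the first few values of the profile. *)
  for i = 0 to 4 do
    Printf.printf "s_%d = %.4f\n" i
      (singular_value ~effective_rank:2 ~tail_strength:0.5 i)
  done
```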
val make_sparse_coded_signal :
n_samples:int ->
n_components:int ->
n_features:int ->
n_nonzero_coefs:int ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t * Nx.float32_t
make_sparse_coded_signal ~n_samples ~n_components ~n_features ~n_nonzero_coefs ?random_state () generates a sparse signal.
Creates signal Y = D * X where D is dictionary and X is sparse code.
Returns (Y, D, X) where:
- Y: [|n_features; n_samples|] (encoded signal)
- D: [|n_features; n_components|] (dictionary)
- X: [|n_components; n_samples|] (sparse codes)
make_spd_matrix ?n_dim ?random_state () generates a symmetric positive-definite matrix.
Creates random SPD matrix using A^T * A + epsilon * I.
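The construction works because x^T (A^T * A + epsilon * I) x = |A x|^2 + epsilon * |x|^2, which is strictly positive for any nonzero x when epsilon > 0. A minimal sketch on plain OCaml arrays follows; spd_of is a hypothetical helper for illustration, not the Nx_datasets implementation.

```ocaml
(* Hypothetical helper, not part of Nx_datasets: computes
   A^T * A + epsilon * I for a matrix stored as an array of rows. *)
let spd_of ~(epsilon : float) (a : float array array) : float array array =
  let rows = Array.length a in
  let n = if rows = 0 then 0 else Array.length a.(0) in
  Array.init n (fun i ->
      Array.init n (fun j ->
          (* Entry (i, j) of A^T * A is the dot product of columns i and j. *)
          let dot = ref 0.0 in
          for k = 0 to rows - 1 do
            dot := !dot +. (a.(k).(i) *. a.(k).(j))
          done;
          (* Add epsilon on the diagonal only. *)
          if i = j then !dot +. epsilon else !dot))

let () =
  let m = spd_of ~epsilon:0.5 [| [| 1.0; 2.0 |]; [| 3.0; 4.0 |] |] in
  Array.iter
    (fun row ->
      Array.iter (Printf.printf "%6.1f ") row;
      print_newline ())
    m
```

The result is symmetric by construction, since entry (i, j) and entry (j, i) are the same dot product.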
val make_sparse_spd_matrix :
?n_dim:int ->
?alpha:float ->
?norm_diag:bool ->
?smallest_coef:float ->
?largest_coef:float ->
?random_state:int ->
unit ->
Nx.float32_t
make_sparse_spd_matrix ?n_dim ?alpha ... generates a sparse symmetric positive-definite matrix.
Creates sparse SPD matrix with controllable sparsity.
val make_biclusters :
?shape:(int * int) ->
?n_clusters:int ->
?noise:float ->
?minval:int ->
?maxval:int ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t * Nx.int32_t
make_biclusters ?shape ?n_clusters ... generates a constant block diagonal structure.
Creates matrix with block diagonal biclusters.
Returns (X, row_labels, col_labels) indicating bicluster membership
val make_checkerboard :
?shape:(int * int) ->
?n_clusters:(int * int) ->
?noise:float ->
?minval:int ->
?maxval:int ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t * Nx.int32_t
make_checkerboard ?shape ?n_clusters ... generates a checkerboard structure.
Creates matrix with checkerboard pattern of high/low values.
Returns (X, row_labels, col_labels) indicating cluster membership