Running the UT-LVCE algorithms

The paper proposes two algorithms for estimating the interventional equivalence class, and a full causal discovery procedure:

  • Algorithm 1: Estimating the equivalence class of the best-scoring DAG from a set of candidate DAGs.

  • Algorithm 2: Improving on the initial candidate set of DAGs and estimating the equivalence class of the best-scoring one.

  • UT-LVCE as a Causal Discovery procedure, using the output of GES on the pooled data as an initial candidate set.

These are provided through the functions utlvce.equivalence_class() and utlvce.equivalence_class_w_ges(), documented below.


We now provide details and examples on how to run the algorithms. To this end, we first generate some synthetic data using the utlvce.generators module:

import utlvce
import utlvce.generators as generators
import utlvce.utils as utils

# Generate a random model and sample some data
model = generators.random_graph_model(
    20, 2.1, {1, 2, 3}, 2, 5, 0.5, 0.6, 3, 6, 0.2, 0.3, 0.2, 1, 0.7, 0.8)
data = model.sample(1000, random_state=42)

# For algorithms 1 and 2, we use the Markov equivalence class of the
# data-generating graph as the candidate set
candidate_dags = utils.mec(model.A)

Algorithm 1

Algorithm 1 can be run by setting prune_edges=False when calling utlvce.equivalence_class() with the data and a candidate set of DAGs.

# Algorithm 1: Estimate the equivalence class of the best-scoring
# candidate
estimated_icpdag, estimated_I, estimated_model = utlvce.equivalence_class(
    candidate_dags, data, prune_edges=False)

Algorithm 2

Algorithm 2, in turn, can be run by setting prune_edges=True when calling utlvce.equivalence_class().

# Algorithm 2: Improve on the initial candidate set and estimate the
# equivalence class of the best-scoring graph
estimated_icpdag, estimated_I, estimated_model = utlvce.equivalence_class(
    candidate_dags, data, prune_edges=True)

UT-LVCE as a Causal Discovery Procedure

For the causal discovery procedure, the utlvce.equivalence_class_w_ges() function is provided. This will estimate the equivalence class of the data-generating model, using the output of GES on the pooled data as an initial set of candidate DAGs.

Note that this call is equivalent to passing the output DAGs of GES to utlvce.equivalence_class() with prune_edges=True. Any other causal discovery procedure can be used to generate the initial set of candidate DAGs, which can then be used together with UT-LVCE in the same way.

# GES + UT-LVCE: Estimate the equivalence class of the data-generating
# model, using GES to obtain the initial set of candidate graphs
estimated_icpdag, estimated_I, estimated_model = utlvce.equivalence_class_w_ges(data)
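
As noted above, the initial candidates can come from any discovery procedure. The standalone sketch below (our illustration; the DAGs and variable count are made up, and the PC algorithm is only named as an example) builds a small candidate set by hand in the expected adjacency format, where A[i, j] != 0 implies the edge i -> j:

```python
import numpy as np

# Two hypothetical candidate DAGs over 3 variables, e.g. produced by
# another discovery procedure such as the PC algorithm.
# Convention: A[i, j] != 0 implies the edge i -> j.
A1 = np.array([[0, 1, 1],
               [0, 0, 1],
               [0, 0, 0]])   # edges 0 -> 1, 0 -> 2, 1 -> 2
A2 = np.array([[0, 0, 1],
               [1, 0, 1],
               [0, 0, 0]])   # edges 1 -> 0, 0 -> 2, 1 -> 2
candidate_dags = [A1, A2]

# Passing them to utlvce.equivalence_class with prune_edges=True is
# equivalent to what utlvce.equivalence_class_w_ges does with the
# output of GES:
#   estimated_icpdag, estimated_I, estimated_model = utlvce.equivalence_class(
#       candidate_dags, data, prune_edges=True)
```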

Full function specification

Both functions can take additional parameters to control, among other things, the behaviour of the alternating optimization procedure used to compute the likelihood of the DAGs. The complete specification follows.

utlvce.equivalence_class(candidate_dags, data, prune_edges=False, nums_latent=[1, 2, 3], folds=[0.5, 0.25, 0.25], psi_max=None, psi_fixed=False, max_iter=1000, threshold_dist_B=0.001, threshold_fluctuation=1e-05, max_fluctuations=5, threshold_score=1e-05, learning_rate=1, B_solver='grad', random_state=42, verbose=0)

Estimate the equivalence class of the best scoring graph from an initial set of candidate DAGs.

Parameters
  • candidate_dags (list of numpy.ndarray) – A list of the candidate DAG adjacencies, where for each adjacency A, A[i, j] != 0 implies i -> j.

  • data (list of numpy.ndarray) – A list with the sample from each environment, where each sample is an array with columns corresponding to variables and rows to observations.

  • prune_edges (bool, default=False) – Whether to also consider the pruned versions of the candidate graphs.

  • nums_latent (list of ints, default=[1,2,3]) – The candidates for the number of hidden variables, from which one is selected via cross-validation.

  • folds (list of float, default=[0.5, 0.25, 0.25]) – The size of the different splits of the data, where the first corresponds to the training data (used to fit the models), the second is used to select the number of latent variables and best DAG, and the third is used to select the intervention targets.

  • psi_max (float, default = None) – The maximum allowed change in variance between environments for the hidden variables. If None, psis are unconstrained.

  • psi_fixed (bool, default = False) – If True, impose the additional constraint that the hidden variables all have the same variance.

  • max_iter (int, default=1000) – The maximum number of iterations allowed for the alternating minimization procedure.

  • threshold_dist_B (float, default=1e-3) – If the change in B between successive iterations of the alternating optimization procedure is lower than this threshold, stop the procedure and return the estimate.

  • threshold_fluctuation (float, default=1e-5) – For the alternating optimization routine, if the score worsens by more than this value, consider the iteration a fluctuation.

  • max_fluctuations (int, default=5) – For the alternating optimization routine, the maximum number of fluctuations (see above) allowed before stopping the subroutine.

  • threshold_score (float, default=1e-5) – For the gradient descent subroutines, if the change in score between successive iterations is below this threshold, stop.

  • learning_rate (float, default=1) – The initial learning rate (the factor by which the gradient is multiplied) for the gradient descent subroutines.

  • B_solver ({'grad', 'adaptive', 'cvx'}, default='grad') – Sets the solver for the connectivity matrix B, where the options are (ordered by decreasing speed and increasing stability) grad, adaptive and cvx.

  • random_state (int, default=42) – To set the random state for reproducibility when randomly splitting the data. Successive calls with the same random state will have the same result.

  • verbose (int, default = 0) – Whether debug and execution traces should be printed. 0 corresponds to no traces; higher values correspond to higher verbosity.

Returns

  • estimated_icpdag (numpy.ndarray) – The I-CPDAG representing the estimated equivalence class.

  • estimated_I (set of ints) – The estimated set of intervention targets.

  • estimated_model (utlvce.model.Model) – The estimated model. Its underlying DAG adjacency can be accessed by estimated_model.A. The fitted parameters and assumption deviation metrics can be seen by calling print(estimated_model) (see utlvce.model module).

Raises
  • ValueError : – If (1) the given data is not valid, i.e. the samples have different numbers of variables or a sample has a single observation, (2) the given B_solver is not valid, or (3) one of the candidate adjacencies does not correspond to a DAG.

  • TypeError : – If the given data is not a list of numpy.ndarray.
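
To make the input requirements concrete, the following standalone sketch (plain numpy, not utlvce code; the sample sizes and variable count are arbitrary) builds a valid data argument for two environments and checks the conditions that would otherwise raise a ValueError, namely that every sample has the same number of columns (variables) and more than one row (observation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Samples from two environments over the same p = 5 variables;
# rows are observations, columns are variables.
data = [rng.normal(size=(1000, 5)), rng.normal(size=(800, 5))]

# Conditions on the `data` argument:
n_vars = {sample.shape[1] for sample in data}
assert len(n_vars) == 1                             # same number of variables
assert all(sample.shape[0] > 1 for sample in data)  # more than one observation
```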

utlvce.equivalence_class_w_ges(data, nums_latent=[1, 2, 3], folds=[0.5, 0.25, 0.25], psi_max=None, psi_fixed=False, max_iter=1000, threshold_dist_B=0.001, threshold_fluctuation=1e-05, max_fluctuations=5, threshold_score=1e-05, learning_rate=1, B_solver='grad', random_state=42, verbose=0)

Estimate the equivalence class of the data-generating model, using the output of GES on the pooled data as an initial set of candidate DAGs.

Parameters
  • data (list of numpy.ndarray) – A list with the sample from each environment, where each sample is an array with columns corresponding to variables and rows to observations.

  • nums_latent (list of ints, default=[1,2,3]) – The candidates for the number of hidden variables, from which one is selected via cross-validation.

  • folds (list of float, default=[0.5, 0.25, 0.25]) – The size of the different splits of the data, where the first corresponds to the training data (used to fit the models), the second is used to select the number of latent variables and best DAG, and the third is used to select the intervention targets.

  • psi_max (float, default = None) – The maximum allowed change in variance between environments for the hidden variables. If None, psis are unconstrained.

  • psi_fixed (bool, default = False) – If True, impose the additional constraint that the hidden variables all have the same variance.

  • max_iter (int, default=1000) – The maximum number of iterations allowed for the alternating minimization procedure.

  • threshold_dist_B (float, default=1e-3) – If the change in B between successive iterations of the alternating optimization procedure is lower than this threshold, stop the procedure and return the estimate.

  • threshold_fluctuation (float, default=1e-5) – For the alternating optimization routine, if the score worsens by more than this value, consider the iteration a fluctuation.

  • max_fluctuations (int, default=5) – For the alternating optimization routine, the maximum number of fluctuations (see above) allowed before stopping the subroutine.

  • threshold_score (float, default=1e-5) – For the gradient descent subroutines, if the change in score between successive iterations is below this threshold, stop.

  • learning_rate (float, default=1) – The initial learning rate (the factor by which the gradient is multiplied) for the gradient descent subroutines.

  • B_solver ({'grad', 'adaptive', 'cvx'}, default='grad') – Sets the solver for the connectivity matrix B, where the options are (ordered by decreasing speed and increasing stability) grad, adaptive and cvx.

  • random_state (int, default=42) – To set the random state for reproducibility when randomly splitting the data. Successive calls with the same random state will have the same result.

  • verbose (int, default = 0) – Whether debug and execution traces should be printed. 0 corresponds to no traces; higher values correspond to higher verbosity.

Returns

  • estimated_icpdag (numpy.ndarray) – The I-CPDAG representing the estimated equivalence class.

  • estimated_I (set of ints) – The estimated set of intervention targets.

  • estimated_model (utlvce.model.Model) – The estimated model. Its underlying DAG adjacency can be accessed by estimated_model.A. The fitted parameters and assumption deviation metrics can be seen by calling print(estimated_model) (see utlvce.model module).

Raises
  • ValueError : – If (1) the given data is not valid, i.e. the samples have different numbers of variables or a sample has a single observation, or (2) the given B_solver is not valid.

  • TypeError : – If the given data is not a list of numpy.ndarray.
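
As a closing illustration of the folds parameter, the sketch below shows how proportions like the default [0.5, 0.25, 0.25] would partition a sample into training, model-selection, and target-selection splits. This is our own illustration of the proportions only; the library performs the splitting internally and its exact scheme may differ.

```python
import numpy as np

folds = [0.5, 0.25, 0.25]  # train / model selection / target selection
n = 1000
rng = np.random.default_rng(42)
sample = rng.normal(size=(n, 5))

# Shuffle the rows, then cut at the cumulative fold boundaries.
permuted = rng.permutation(sample)
cuts = np.cumsum([int(f * n) for f in folds])[:-1]  # [500, 750]
train, selection, targets = np.split(permuted, cuts)
```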