Usage

Installation

To use Cost Model Queries, the package must first be installed into the active Python environment:

(base) $ pip install cost_model_queries

Code structure

The Excel files containing the cost models must be placed in the Cost Models folder in the project root.
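Before sampling, it can help to confirm that the workbooks are where the scripts expect them. The following is a minimal sketch using only the standard library; the helper name find_cost_models is hypothetical and not part of the package, and it assumes the cost models are saved with an .xlsx extension.

```python
from pathlib import Path


def find_cost_models(project_root="."):
    """Return the .xlsx workbooks in <project_root>/Cost Models, sorted by name.

    Hypothetical helper for illustration; not part of cost_model_queries.
    """
    models_dir = Path(project_root) / "Cost Models"
    if not models_dir.is_dir():
        raise FileNotFoundError(f"Expected a 'Cost Models' folder at {models_dir}")
    return sorted(models_dir.glob("*.xlsx"))
```

Calling find_cost_models() from the project root returns the workbook paths, or raises FileNotFoundError if the folder is missing.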

Configuration

The file config.csv sets the parameter names, sampling ranges, and cell locations in the Excel-based cost models. To adjust sampling ranges and parameter names, a new config file can be created and placed in the project root. The config file is a CSV with the following columns:

cost_type : production or deployment model.

sheet : the sheet the parameter is found on in the Excel-based cost model.

factor_names : a shortened name for the parameter.

cell_row : the row the parameter sits on in the model.

cell_col : the column the parameter sits in within the model.

range_lower : the lower limit of the parameter’s sampling range.

range_upper : the upper limit of the parameter’s sampling range.

is_cat : TRUE if the parameter is categorical, FALSE if not.

The config file is loaded by sampling.sampling_functions.problem_spec() and used to sample predictors and calculate costs for samples.
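To make the column layout concrete, the sketch below writes a two-row config with the columns listed above and runs a few basic consistency checks on it. All values (sheet name, cell positions, ranges) are placeholders, and the helpers write_config and check_config are hypothetical; they are not part of the package, and the format of cell_col in a real config may differ.

```python
import csv
import io

CONFIG_COLUMNS = [
    "cost_type", "sheet", "factor_names", "cell_row",
    "cell_col", "range_lower", "range_upper", "is_cat",
]

# Placeholder rows: sheet names, cell positions and ranges are made up
example_rows = [
    {"cost_type": "deployment", "sheet": "Dashboard", "factor_names": "num_devices",
     "cell_row": 5, "cell_col": 4, "range_lower": 1000, "range_upper": 100000,
     "is_cat": "FALSE"},
    {"cost_type": "deployment", "sheet": "Dashboard", "factor_names": "port",
     "cell_row": 8, "cell_col": 4, "range_lower": 1, "range_upper": 4,
     "is_cat": "TRUE"},
]


def write_config(rows, handle):
    """Write config rows to an open text handle in the expected column order."""
    writer = csv.DictWriter(handle, fieldnames=CONFIG_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def check_config(handle):
    """Return a list of human-readable problems found in a config file."""
    reader = csv.DictReader(handle)
    missing = set(CONFIG_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["cost_type"] not in ("production", "deployment"):
            problems.append(f"line {line_no}: unknown cost_type {row['cost_type']!r}")
        if float(row["range_lower"]) >= float(row["range_upper"]):
            problems.append(f"line {line_no}: range_lower must be below range_upper")
        if row["is_cat"] not in ("TRUE", "FALSE"):
            problems.append(f"line {line_no}: is_cat must be TRUE or FALSE")
    return problems


buffer = io.StringIO(newline="")
write_config(example_rows, buffer)
buffer.seek(0)
problems = check_config(buffer)
print(problems or "config looks consistent")
```

To write a real file, pass open("config.csv", "w", newline="") as the handle instead of the in-memory buffer.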

Sampling

Parameters are sampled using the SALib package, with sampling ranges defined in config.csv. The input samples are then adjusted to the correct types using sampling.sampling_functions.convert_factor_types(). In the example below, the cost model is sampled by sampling.sampling_functions.sample_production_cost(), and the resulting samples are saved as a CSV file, here production_cost_samples.csv. An example for the deployment costs is included in sample_deployment_cost_model.py.

import pandas as pd
import os

from cost_model_queries.sampling.sampling_functions import (
    problem_spec,
    convert_factor_types,
    sample_production_cost,
)

# Filename for saved samples
samples_save_fn = "production_cost_samples.csv"

# Path to cost model (workbook in the Cost Models folder of the working directory)
wb_file_path = os.path.join(
    os.getcwd(), "Cost Models", "3.7.0 CA Production Model.xlsx"
)

# Base sample size for Sobol sampling (should be a power of 2)
N = 2**5

# Generate problem spec, factor names and list of categorical factors to create factor sample
sp, factor_specs = problem_spec("production")
# Sample factors using Sobol sampling
sp.sample_sobol(N, calc_second_order=True)

factors_df = pd.DataFrame(data=sp.samples, columns=factor_specs.factor_names)

# Convert categorical factors to categories
factors_df = convert_factor_types(factors_df, factor_specs.is_cat)

# Sample cost using factors sampled
factors_df = sample_production_cost(wb_file_path, factors_df, factor_specs)
factors_df.to_csv(samples_save_fn, index=False)  # Save to CSV
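The base sample size N is not the number of rows that end up in the sample. SALib's Sobol sampling produces N * (2D + 2) rows for D factors when calc_second_order=True, and N * (D + 2) rows otherwise. The short sketch below works this out; the function name sobol_sample_count and the factor count of 10 are assumptions for illustration, since the real count depends on the config file.

```python
def sobol_sample_count(n_base, n_factors, calc_second_order=True):
    """Number of rows produced by SALib's Sobol sampling scheme."""
    multiplier = 2 * n_factors + 2 if calc_second_order else n_factors + 2
    return n_base * multiplier


n_base = 2**5   # as in the example above
n_factors = 10  # hypothetical number of rows in config.csv for one cost type

print(sobol_sample_count(n_base, n_factors))                           # 704
print(sobol_sample_count(n_base, n_factors, calc_second_order=False))  # 384
```

Since every sampled row is evaluated against the cost model, this count is a useful guide when choosing N.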

Sensitivity analysis

Sensitivity analysis can be carried out on the collected samples, again using the SALib package. In the example below, the files production_cost_samples.csv and deployment_cost_samples.csv were generated using the sampling scripts described above. The function sampling.sampling_functions.cost_sensitivity_analysis() generates a series of figures, including bar plots and heatmaps of the PAWN and Sobol sensitivity analysis results, and saves them in the figures folder.

from cost_model_queries.sampling.sampling_functions import cost_sensitivity_analysis

samples_fn = "production_cost_samples.csv"
# Run SA for production model sample and save figures to figures folder
cost_sensitivity_analysis(samples_fn, "production")

samples_fn = "deployment_cost_samples.csv"
# Run SA for deployment model sample and save figures to figures folder
cost_sensitivity_analysis(samples_fn, "deployment")
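As a rough intuition for what a first-order Sobol index measures, the sketch below estimates Var(E[Y | X_i]) / Var(Y) on synthetic data by binning each input. It is a conceptual illustration only and not the estimator SALib uses; the function name first_order_index and the toy model y = 4*x1 + x2 are assumptions made for the example.

```python
import random
import statistics


def first_order_index(x, y, n_bins=10):
    """Estimate Var(E[Y | X]) / Var(Y) by grouping y into equal-width bins of x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for xi, yi in zip(x, y):
        idx = min(int((xi - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins[idx].append(yi)
    mean_y = statistics.fmean(y)
    # Variance of the per-bin means around the overall mean, weighted by bin size
    between = sum(
        len(b) * (statistics.fmean(b) - mean_y) ** 2 for b in bins if b
    ) / len(y)
    return between / statistics.pvariance(y)


random.seed(0)
x1 = [random.random() for _ in range(5000)]
x2 = [random.random() for _ in range(5000)]
y = [4 * a + b for a, b in zip(x1, x2)]  # toy model: x1 drives most of the variance

s1 = first_order_index(x1, y)
s2 = first_order_index(x2, y)
print(f"S1 ~ {s1:.2f}, S2 ~ {s2:.2f}")
```

For this toy model the exact indices are 16/17 for x1 and 1/17 for x2, so the estimate for x1 should be much larger than the one for x2.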

Develop Regression Models

Regression models for the sampled cost data can be developed and tested with the included packages statsmodels and scikit-learn. To explore potential models, predictors can be plotted against cost using plotting.data_plotting.plot_predictors(). Functions for checking the assumptions of linear regression are included in plotting.LM_diagnostics, covering Q-Q plots, scale-location plots and residual plots. The example below fits linear regression models to samples from the deployment cost model and checks their assumptions. An example for the production cost is included in test_regression_models_production_cost.py.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import cost_model_queries.plotting.LM_diagnostics as lmd

from cost_model_queries.plotting.data_plotting import plot_predictors, plot_predicted_vs_actual

samples_fn = "deployment_cost_samples.csv"
samples_df = pd.read_csv(samples_fn)

# Predictors only: drop the two response columns. drop() returns a new
# DataFrame, so the assignment below does not touch samples_df.
init_x = samples_df.drop(columns=["Cost", "setupCost"])

init_x["port"] = init_x["port"].astype("category")

# General review of potential relationships/correlations
ax, fig = plot_predictors(init_x, samples_df.Cost)
fig.show()

### Model for Cost ###
formula = "Cost ~ 0 + np.log(num_devices) + port + DAJ_a_r + deck_space + np.log(distance_from_port) + secs_per_dev + bins_per_tender + proportion + np.log(cape_ferg_price)"
# The Cost column of x holds log(Cost), so the model is fitted on the log scale
x = pd.concat([np.log(samples_df.Cost), init_x], axis=1)
ols_model = smf.ols(formula=formula, data=x)
res = ols_model.fit()
print(res.summary())

# Calculate diagnostics
cls = lmd.LinearRegDiagnostic(res)
# Remove outliers
remove_inds = cls.get_influence_ids(n_i=30)
fill_vec = np.repeat(False, x.shape[0])
fill_vec[remove_inds] = True
x = x.drop(x[fill_vec].index)
ols_model = smf.ols(formula=formula, data=x)
res = ols_model.fit()
print(res.summary())

# Plot diagnostics
cls = lmd.LinearRegDiagnostic(res)
cls.residual_plot()
cls.qq_plot()
cls.scale_location_plot()

# Plot pred against actual
pred_var = res.get_prediction().summary_frame()["mean"]
ax, fig = plot_predicted_vs_actual(np.exp(x.Cost), np.exp(pred_var))
fig.show()

### Model for setupCost ###
# General review of potential relationships/correlations
ax, fig = plot_predictors(init_x, samples_df.setupCost)
fig.show()

formula = "setupCost ~ 0 + np.log(num_devices) + DAJ_a_r + DAJ_c_s + deck_space + np.log(distance_from_port) + secs_per_dev + bins_per_tender + proportion"
# The setupCost column of x holds log(setupCost), so the model is fitted on the log scale
x = pd.concat([np.log(samples_df.setupCost), init_x], axis=1)
ols_model = smf.ols(formula=formula, data=x)
res = ols_model.fit()
print(res.summary())

# Calculate diagnostics
cls = lmd.LinearRegDiagnostic(res)
# Remove outliers
remove_inds = cls.get_influence_ids(n_i=30)
fill_vec = np.repeat(False, x.shape[0])
fill_vec[remove_inds] = True
x = x.drop(x[fill_vec].index)
ols_model = smf.ols(formula=formula, data=x)
res = ols_model.fit()
print(res.summary())

# Plot diagnostics
cls = lmd.LinearRegDiagnostic(res)
cls.residual_plot()
cls.qq_plot()
cls.scale_location_plot()

# Plot pred against actual
pred_var = res.get_prediction().summary_frame()["mean"]
ax, fig = plot_predicted_vs_actual(np.exp(x.setupCost), np.exp(pred_var))
fig.show()