Usage

Installation

To use Cost Model Queries, the environment must first be set up (an environment.yml file is provided) and the package installed with pip:

(base) $ pip install cost_model_queries

Code structure

The Excel files containing the cost models must be placed in the Cost Models folder in the project root.

Configuration

The file config.csv sets the parameter names, sampling ranges, and cell locations in the Excel-based cost models. To adjust sampling ranges and parameter names, a new config file can be created and placed in the project root. The config file is a CSV with the following columns:

  • cost_type : production or deployment model.

  • sheet : the sheet the parameter is found on in the Excel-based cost model.

  • factor_names : a shortened name for the parameter.

  • cell_row : the row the parameter sits on in the model.

  • cell_col : the column the parameter sits in within the model.

  • range_lower : the lower limit of the parameter’s sampling range.

  • range_upper : the upper limit of the parameter’s sampling range.

  • is_cat : TRUE if the parameter is categorical, FALSE if not.

The config file is loaded by sampling.sampling_functions.problem_spec() and used to sample predictors and calculate costs for samples.
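As a rough illustration of the column layout described above, the sketch below parses a small config table using only the standard library. The rows, sheet names, cell positions and ranges are hypothetical and are not taken from the real cost models. It also assumes that cell_row and cell_col are integer indices, which the description above does not confirm. This is not how problem_spec() reads the file, only a way to see the expected shape of the data.

```python
import csv
import io

# Hypothetical config rows for illustration only; none of these values
# come from the real cost models.
EXAMPLE_CONFIG = """cost_type,sheet,factor_names,cell_row,cell_col,range_lower,range_upper,is_cat
production,Inputs,num_devices,5,3,1000,100000,FALSE
deployment,Logistics,port,12,4,1,4,TRUE
"""


def read_config(text):
    """Parse config CSV text into a list of dicts with typed values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "cost_type": row["cost_type"],
                "sheet": row["sheet"],
                "factor_names": row["factor_names"],
                # Assumption: cell positions are stored as integers
                "cell_row": int(row["cell_row"]),
                "cell_col": int(row["cell_col"]),
                "range_lower": float(row["range_lower"]),
                "range_upper": float(row["range_upper"]),
                # TRUE/FALSE strings become Python booleans
                "is_cat": row["is_cat"].strip().upper() == "TRUE",
            }
        )
    return rows


config = read_config(EXAMPLE_CONFIG)
production = [r for r in config if r["cost_type"] == "production"]
print(production[0]["factor_names"], production[0]["is_cat"])
```

Filtering on cost_type, as in the last lines, mirrors the idea that a single config file can describe both the production and deployment models.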

Sampling

Parameter sampling is done using the SALib package, with parameter sampling ranges defined in config.csv. The input samples are then adjusted to the correct types using sampling.sampling_functions.convert_factor_types(). In the example below, cost model sampling is carried out by the function sampling.sampling_functions.sample_production_cost(), and the resulting samples are saved as a CSV file, here named production_cost_samples.csv. An example for the deployment costs is included in sample_deployment_cost_model.py.

import os

import pandas as pd

from cost_model_queries.sampling.sampling_functions import (
    problem_spec,
    convert_factor_types,
    sample_production_cost,
)

# Filename for saved samples
samples_save_fn = "production_cost_samples.csv"

# Path to the cost model workbook in the Cost Models folder of the project root
wb_file_path = os.path.join(
    os.path.abspath(os.getcwd()), "Cost Models", "3.7.0 CA Production Model.xlsx"
)

# Base sample size (SALib recommends a power of 2 for Sobol sampling)
N = 2**5

# Generate problem spec, factor names and list of categorical factors to create factor sample
sp, factor_specs = problem_spec("production")
# Sample factors using Sobol sampling
sp.sample_sobol(N, calc_second_order=True)

# Collect the sampled factor values in a DataFrame with one column per factor
factors_df = pd.DataFrame(data=sp.samples, columns=factor_specs.factor_names)

# Convert categorical factors to categories
factors_df = convert_factor_types(factors_df, factor_specs.is_cat)

# Sample cost using factors sampled
factors_df = sample_production_cost(wb_file_path, factors_df, factor_specs)
factors_df.to_csv(samples_save_fn, index=False)  # Save to CSV
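Because each sampled row requires an evaluation of the Excel-based cost model, it can help to estimate the sample size before running the script. SALib's Sobol sampler produces N * (2D + 2) rows when second-order indices are requested and N * (D + 2) otherwise, where D is the number of factors (or factor groups). The sketch below computes this; the helper name and the factor count of 10 are hypothetical, and the real D depends on the parameters listed in config.csv.

```python
def sobol_sample_rows(n_base, n_factors, calc_second_order=True):
    """Number of rows generated by a Sobol (Saltelli-type) sampling scheme.

    N * (2D + 2) with second-order indices, N * (D + 2) without.
    """
    multiplier = 2 * n_factors + 2 if calc_second_order else n_factors + 2
    return n_base * multiplier


N = 2**5
n_factors = 10  # hypothetical number of factors, for illustration only

print(sobol_sample_rows(N, n_factors))                           # 704
print(sobol_sample_rows(N, n_factors, calc_second_order=False))  # 384
```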

Sensitivity analysis

Sensitivity analysis can be carried out on the collected samples, again using the SALib package. In the example below, the files production_cost_samples.csv and deployment_cost_samples.csv were generated using the sampling scripts described above. The function sampling.sampling_functions.cost_sensitivity_analysis() generates a series of figures, including bar plots and heatmaps of the PAWN and Sobol sensitivity analysis results, and saves them in the figures folder.

from cost_model_queries.sampling.sampling_functions import cost_sensitivity_analysis

samples_fn = "production_cost_samples.csv"
# Run SA for production model sample and save figures to figures folder
cost_sensitivity_analysis(samples_fn, "production")

samples_fn = "deployment_cost_samples.csv"
# Run SA for deployment model sample and save figures to figures folder
cost_sensitivity_analysis(samples_fn, "deployment")
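To give a feel for what a first-order Sobol index measures, the sketch below works through a case with a known answer: a linear model y = a1*x1 + a2*x2 with independent, uniformly distributed inputs. There, the first-order index of each input is its share of the output variance, a_i^2 * Var(x_i) / Var(y). This is a toy illustration with hypothetical function names and coefficients, not the estimator used by SALib or by cost_sensitivity_analysis().

```python
def uniform_variance(lower, upper):
    """Variance of a uniform distribution on [lower, upper]."""
    return (upper - lower) ** 2 / 12.0


def linear_first_order_indices(coeffs, ranges):
    """Analytic first-order Sobol indices for y = sum(a_i * x_i).

    Assumes independent uniform inputs, so each input's contribution to
    Var(y) is a_i**2 * Var(x_i) and the indices sum to 1.
    """
    contributions = [
        c**2 * uniform_variance(lo, hi) for c, (lo, hi) in zip(coeffs, ranges)
    ]
    total = sum(contributions)
    return [c / total for c in contributions]


# Hypothetical model: y = 2*x1 + 1*x2, both inputs sampled on [0, 1]
indices = linear_first_order_indices([2.0, 1.0], [(0.0, 1.0), (0.0, 1.0)])
print(indices)  # approximately 0.8 and 0.2
```

For a purely additive model the first-order indices account for all of the variance. In a model with interactions they would sum to less than 1, which is what the second-order indices help to explain.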

Develop Regression Models

Several packages are included for developing and testing regression models for the sampled cost data. Models are available from the included packages statsmodels and scikit-learn. To explore potential models, predictors can be plotted against cost using plotting.data_plotting.plot_predictors(). A series of functions for testing the assumptions of linear regression is also included in plotting.LM_diagnostics, including Q-Q plots, scale-location plots and residual plots. The example below shows the process of fitting linear regression models to samples from the deployment cost model and checking assumptions. An example for the production cost is included in test_regression_models_production_cost.py.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import cost_model_queries.plotting.LM_diagnostics as lmd

from cost_model_queries.plotting.data_plotting import plot_predictors, plot_predicted_vs_actual

samples_fn = "deployment_cost_samples.csv"
samples_df = pd.read_csv(samples_fn)

# Predictors: every column except the two cost outputs
# (copied so the port column can be modified without a pandas warning)
init_x = samples_df[
    samples_df.columns[
        (samples_df.columns != "Cost")
        & (samples_df.columns != "setupCost")
    ]
].copy()

# Treat port as a categorical predictor
init_x["port"] = init_x["port"].astype("category")

# General review of potential relationships/correlations
ax, fig = plot_predictors(init_x, samples_df.Cost)
fig.show()

### Model for Cost ###
formula = "Cost ~ 0 + np.log(num_devices) + port + DAJ_a_r + deck_space + np.log(distance_from_port) + secs_per_dev + bins_per_tender + proportion + np.log(cape_ferg_price)"
# The Cost column in x holds log(Cost), so the model is fitted on the log scale
x = pd.concat([np.log(samples_df.Cost), init_x], axis=1)
ols_model = smf.ols(formula=formula, data=x)
res = ols_model.fit()
print(res.summary())

# Calculate diagnostics
cls = lmd.LinearRegDiagnostic(res)
# Remove outliers (the most influential points) and refit
remove_inds = cls.get_influence_ids(n_i=30)
fill_vec = np.repeat(False, x.shape[0])
fill_vec[remove_inds] = True
x = x.drop(x[fill_vec].index)
ols_model = smf.ols(formula=formula, data=x)
res = ols_model.fit()
print(res.summary())

# Plot diagnostics
cls = lmd.LinearRegDiagnostic(res)
cls.residual_plot()
cls.qq_plot()
cls.scale_location_plot()

# Plot pred against actual (both transformed back from the log scale)
pred_var = res.get_prediction().summary_frame()["mean"]
ax, fig = plot_predicted_vs_actual(np.exp(x.Cost), np.exp(pred_var))
fig.show()

### Model for setupCost ###
# General review of potential relationships/correlations
ax, fig = plot_predictors(init_x, samples_df.setupCost)
fig.show()

formula = "setupCost ~ 0 + np.log(num_devices) + DAJ_a_r + DAJ_c_s + deck_space + np.log(distance_from_port) + secs_per_dev + bins_per_tender + proportion"
# The setupCost column in x holds log(setupCost)
x = pd.concat([np.log(samples_df.setupCost), init_x], axis=1)
ols_model = smf.ols(formula=formula, data=x)
res = ols_model.fit()
print(res.summary())

# Calculate diagnostics
cls = lmd.LinearRegDiagnostic(res)
# Remove outliers (the most influential points) and refit
remove_inds = cls.get_influence_ids(n_i=30)
fill_vec = np.repeat(False, x.shape[0])
fill_vec[remove_inds] = True
x = x.drop(x[fill_vec].index)
ols_model = smf.ols(formula=formula, data=x)
res = ols_model.fit()
print(res.summary())

# Plot diagnostics
cls = lmd.LinearRegDiagnostic(res)
cls.residual_plot()
cls.qq_plot()
cls.scale_location_plot()

# Plot pred against actual (both transformed back from the log scale)
pred_var = res.get_prediction().summary_frame()["mean"]
ax, fig = plot_predicted_vs_actual(np.exp(x.setupCost), np.exp(pred_var))
fig.show()
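The models above are fitted to the logarithm of cost, which is why predictions are passed through np.exp() before being compared with actual costs. The sketch below shows the same idea on a tiny, hypothetical dataset using only the standard library: a log-log fit by ordinary least squares, followed by back-transformation to the original scale. The data and the power-law form are invented for illustration, and unlike the formulas above this sketch includes an intercept term.

```python
import math

# Hypothetical data that follows cost = 3 * num_devices ** 0.5 exactly
num_devices = [100.0, 400.0, 1600.0, 6400.0]
cost = [3.0 * n**0.5 for n in num_devices]

# Work on the log scale: log(cost) = a + b * log(num_devices)
log_x = [math.log(v) for v in num_devices]
log_y = [math.log(v) for v in cost]

# Ordinary least squares for a single predictor
mean_x = sum(log_x) / len(log_x)
mean_y = sum(log_y) / len(log_y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(log_x, log_y))
sxx = sum((xi - mean_x) ** 2 for xi in log_x)
b = sxy / sxx
a = mean_y - b * mean_x

# Predictions are on the log scale; exponentiate to recover costs
pred_log = [a + b * xi for xi in log_x]
pred_cost = [math.exp(p) for p in pred_log]

print(round(b, 6), round(math.exp(a), 6))
```

With noisy data, exponentiating the predicted mean of log(cost) gives an estimate of the median cost rather than the mean, which is worth keeping in mind when reading the predicted-versus-actual plots.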