Published on: March 22, 2023

At Alchemy, our machine learning (ML) models are trained on input and output variables to recognize patterns and make predictions. Alchemy AI (Figure 1.1) uses two types of ML algorithms, depending on the type of output variable:

- Regression machine learning algorithms - for training models with continuous numerical output variables.
- Classification machine learning algorithms - for training models with predefined (categorical) output variables.

Alchemy’s AutoML module consists of 13 different algorithms for regression and 10 different algorithms for classification, which are trained in parallel to significantly shorten training time.

A dataset consists of input and output variables.

Input variables are independent variables whose values are measured and entered by the user. In Alchemy, input variables are:

- Materials (e.g. resin A, solvent B)
- Process parameters (e.g. temperature, mixing speed)

Output variables are variables which depend on input variables. In Alchemy, output variables are:

- Numerical properties (continuous numerical values, e.g. Viscosity, Drying time)
- Predefined numerical properties (numerical categorical values)
- Predefined alphanumeric properties (text or numerical categorical values, e.g. pass/fail)

A dataset for training AI in Alchemy is surfaced through the Scan and Score or Alchemy AI functionality, which provides information about:

- Number of matching trials with respect to the requirements added in the material constraints table on the Lab Book Overview page
- Number of relevant trials with respect to the requirements added in the test table on the Lab Book Overview page
- Whether it is possible to train ML models based on the available dataset

The Show More Details button displays the number of available trials for each property separately.

Training AI in Alchemy consists of hyperparameter tuning and model evaluation through repeated k-fold cross validation.

Hyperparameter tuning is the process of searching for the hyperparameters that produce the best-performing model for each ML algorithm.

In Alchemy, the performance of ML models is evaluated through repeated k-fold cross validation.

In k-fold cross validation, the dataset is split into *k* sets; each time, one set is held back and models are trained on the remaining sets. The held-back sets are used for performance estimation, so a total of *k* models are fit and evaluated, with performance estimated as the mean over the held-back sets. This process is repeated *l* times with different splits, where *l* depends on the size of the dataset. In total, *l* ✕ *k* models are fitted during repeated k-fold cross validation to estimate the performance of the ML models.
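Alchemy's implementation is not public, but the split logic described above can be sketched in a few lines of pure Python (function and parameter names here are illustrative, not Alchemy's):

```python
import random

def repeated_kfold_indices(n, k, l, seed=0):
    """Yield (train, test) index pairs for l repeats of k-fold CV.

    n: number of trials; k: folds per repeat; l: number of repeats.
    Each repeat reshuffles the data, so l * k models are fitted in total.
    """
    rng = random.Random(seed)
    for _ in range(l):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test

splits = list(repeated_kfold_indices(n=10, k=5, l=3))
print(len(splits))  # l * k = 15 train/test splits
```

Each of the 15 splits trains on 4 folds and evaluates on the held-back fold; averaging the held-back scores gives the performance estimate.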

The process of model training is shown in Figure 3.1.

In Alchemy, selection of the best model is made automatically for each target property. Automatic selection of the best model consists of the following steps:

1. For each algorithm, get the best-performing model among all models trained with different hyperparameters
2. From the models selected in step 1 (one model for each algorithm for regression (13) and/or one model for each algorithm for classification (10)), get the best model overall
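The two selection steps can be sketched as nested minimizations over a table of cross-validation errors. The algorithm names, hyperparameter labels, and error values below are hypothetical placeholders:

```python
# Hypothetical CV results: {algorithm: {hyperparam_id: mean_cv_error}}
# (lower error = better, e.g. MAE). All names and numbers are illustrative.
cv_results = {
    "random_forest":  {"hp_a": 0.42, "hp_b": 0.38},
    "linear_model":   {"hp_a": 0.55, "hp_b": 0.51},
    "gradient_boost": {"hp_a": 0.36, "hp_b": 0.40},
}

# Step 1: best hyperparameter setting per algorithm
best_per_algorithm = {
    algo: min(hps.items(), key=lambda kv: kv[1])
    for algo, hps in cv_results.items()
}

# Step 2: best model across algorithms
best_algo, (best_hp, best_err) = min(
    best_per_algorithm.items(), key=lambda kv: kv[1][1]
)
print(best_algo, best_hp, best_err)  # gradient_boost hp_a 0.36
```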

For automatically choosing the best model (Figure 3.2 and 3.3), different performance metrics are used.

**Regression Models**: a combined performance metric which relies on mean absolute error (MAE) and root mean square error (RMSE)

**Classification Models**:

- If the predefined output values are balanced throughout the dataset, the system will choose accuracy for finding the best models
- If the predefined values are imbalanced throughout the dataset, the system will choose average precision for finding the best models
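A minimal sketch of this balanced-vs-imbalanced decision, assuming a simple frequency threshold (the `balance_tol` cutoff is an illustrative assumption, not Alchemy's actual rule):

```python
from collections import Counter

def choose_classification_metric(labels, balance_tol=0.2):
    """Pick 'accuracy' for balanced label sets, 'average_precision' otherwise.

    balance_tol is an illustrative threshold: if any class frequency deviates
    from a perfectly even split by more than balance_tol, call it imbalanced.
    """
    counts = Counter(labels)
    even = 1 / len(counts)
    freqs = [c / len(labels) for c in counts.values()]
    balanced = all(abs(f - even) <= balance_tol for f in freqs)
    return "accuracy" if balanced else "average_precision"

print(choose_classification_metric(["pass"] * 5 + ["fail"] * 5))  # accuracy
print(choose_classification_metric(["pass"] * 9 + ["fail"] * 1))  # average_precision
```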

It is important that we track performance metrics to validate the models we generate in terms of the accuracy of predicted values.

First, a couple definitions:

**Performance metric**: a quantitative measure of how closely predicted values match actual values.

**Actual Value**: the test result achieved for the property of a certain trial based on actual testing.

**Predicted Value**: the test result which is predicted from machine learning models for the property of a certain trial.

Performance metrics for regression models available in Alchemy are:

**1. Model accuracy [%]**:

- metric based on scatter index

$$M A=100 \%-\left(\frac{R M S E}{\bar{y}} \times 100 \%\right)$$

${MA}$ - model accuracy

${RMSE}$ - root mean squared error

$\bar{y}$ - average of all actual values

- the metric ranges from 0 to 100%, where a higher value indicates better accuracy of the model
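The model accuracy formula above translates directly into code; this is a pure-Python sketch of the scatter-index-based metric as defined:

```python
import math

def model_accuracy(actual, predicted):
    """Model accuracy [%] = 100% - (RMSE / mean(actual)) * 100%."""
    n = len(actual)
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)
    y_bar = sum(actual) / n
    return 100.0 - (rmse / y_bar) * 100.0

print(model_accuracy([10.0, 20.0, 30.0], [10.0, 20.0, 30.0]))  # 100.0
```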

**2. R² (coefficient of determination)**:

- measure of the proportion of variance in the dependent variable that is predictable from the independent variables

$$R^2=1-\frac{\sum_{i=1}^N\left(y_i-\hat{y_i}\right)^2}{\sum_{i=1}^N\left(y_i-\bar{y}\right)^2}$$

${N}$ - number of trials

$y_i$ - actual value

$\widehat{y_i}$ - predicted value

$\bar{y}$ - average of all actual values

- the metric ranges from -∞ to 1, where a higher value indicates a better model (a value of 1 indicates a perfect model)
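The R² formula above, computed term by term from the residual and total sums of squares:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 (perfect model)
```

A model that always predicts the mean of the actual values scores exactly 0, which is why values below 0 indicate a model worse than that baseline.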

**3. MAE (mean absolute error)**:

- measure of the average magnitude of the differences between predicted and actual values for the target property

$$M A E=\frac{1}{N} \sum_{i=1}^N\left|y_i-\widehat{y_i}\right|$$

${N}$ - number of trials

$y_i$ - actual value

$\widehat{y_i}$ - predicted value

- the metric ranges from 0 to +∞, where a lower value indicates a better model (a value of 0 indicates a perfect model)
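The MAE formula as code, averaging the absolute prediction errors:

```python
def mean_absolute_error(actual, predicted):
    """MAE = average of |actual - predicted| over all trials."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

print(mean_absolute_error([10.0, 20.0], [12.0, 17.0]))  # 2.5
```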

**4. RMSE (root mean squared error)**:

- measure of the square root of the average of the squared differences between predicted and actual values for the target property

$$R M S E=\sqrt{\frac{1}{N} \sum_{i=1}^N\left(y_i-\widehat{y}_i\right)^2}$$

${N}$ - number of trials

$y_i$ - actual value

$\widehat{y_i}$ - predicted value

- the metric ranges from 0 to +∞, where a lower value indicates a better model (a value of 0 indicates a perfect model)
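The RMSE formula as code; squaring before averaging makes this metric penalize large individual errors more heavily than MAE does:

```python
import math

def root_mean_squared_error(actual, predicted):
    """RMSE = sqrt(average of squared prediction errors)."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)

print(root_mean_squared_error([10.0, 20.0], [13.0, 17.0]))  # 3.0
```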

Performance metrics for classification models available in Alchemy are:

**1. Accuracy:**

- ratio of correctly predicted instances to the total number of instances in the dataset

$$Accuracy =\frac{\text { Number of correct predictions }}{\text { Total number of predictions }}$$

- the metric ranges from 0 to 1, where a higher value indicates a better model (a value of 1 indicates a perfect model)
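Accuracy as code, counting matches between predicted and actual class labels (the pass/fail labels are just example categories):

```python
def accuracy(actual, predicted):
    """Fraction of predictions matching the actual class labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

print(accuracy(["pass", "fail", "pass", "pass"],
               ["pass", "fail", "fail", "pass"]))  # 0.75
```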

**2. Average Precision**:

- area under the precision-recall curve which quantifies the model's ability to make accurate positive predictions

$$\text{Average Precision}=\sum_{k=1}^N \operatorname{Precision}(k)\, \Delta \operatorname{Recall}(k)$$

${N}$ - number of trials

$Precision(k)$ - precision at a cutoff of k

$\Delta Recall(k)$ - change in recall between cutoff k-1 and cutoff k

- the metric ranges from 0 to 1, where a higher value indicates a better model (a value of 1 indicates a perfect model)
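The summation above can be evaluated directly by ranking predictions by score and taking a cutoff after each one; this sketch assumes binary labels (1 = positive class):

```python
def average_precision(labels, scores):
    """AP = sum over ranked cutoffs k of Precision(k) * (Recall(k) - Recall(k-1)).

    labels: 1 for the positive class, 0 otherwise; scores: model confidence.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    total_pos = sum(labels)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        precision = tp / k
        recall = tp / total_pos
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

print(average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0
```

A perfect ranking (all positives scored above all negatives) gives AP = 1; any misordering lowers the precision terms and pulls the sum below 1.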

**3. F1 Score**:

- Harmonic mean of:

- Precision: accuracy of positive predictions, which is the ratio of true positive predictions to the total number of positive predictions made by the model and

$$Precision_{classI}=\frac{TP_{classI}}{TP_{classI}+FP_{classI}}$$

${Precision_{classI}}$ - precision for one class; there are as many classes as there are predefined values

${TP_{classI}}$ - true positives for class I, the number of trials correctly predicted as class I (the predicted class I matched the actual class I)

${FP_{classI}}$ - false positives for class I, the number of trials incorrectly predicted to belong to class I (the predicted class I did not match the actual class)

- Recall: ratio of true positive predictions to the total number of actual positive instances in the dataset

$$Recall_{classI}=\frac{TP_{classI}}{TP_{classI}+FN_{classI}}$$

${FN_{classI}}$ - false negatives for class I, the number of trials incorrectly predicted to belong to another class (the predicted class did not match the actual class I)

$$F1score_{class~I}=\frac{2\times Precision_{class~I}\times Recall_{class~I}}{Precision_{class~I}+Recall_{class~I}}$$

- the metric ranges from 0 to 1, where a higher value indicates a well-balanced performance, demonstrating that the model can concurrently attain high precision and high recall (value of 1 indicates perfect model which accurately predicts each class)
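The per-class precision, recall, and F1 definitions above can be computed from TP/FP/FN counts like this (the pass/fail labels are just example predefined values):

```python
def f1_per_class(actual, predicted, cls):
    """Precision, recall, and F1 for one class, from TP/FP/FN counts."""
    tp = sum(1 for a, p in zip(actual, predicted) if p == cls and a == cls)
    fp = sum(1 for a, p in zip(actual, predicted) if p == cls and a != cls)
    fn = sum(1 for a, p in zip(actual, predicted) if p != cls and a == cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

actual    = ["pass", "pass", "fail", "fail"]
predicted = ["pass", "fail", "fail", "fail"]
print(f1_per_class(actual, predicted, "pass"))  # precision 1.0, recall 0.5, F1 ≈ 0.667
```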

**4. ROC AUC Score (area under the receiver operating characteristic curve)**:

- evaluation of a model's ability to discriminate between positive and negative instances
- the metric ranges from 0 to 1, where a higher value indicates better discrimination between positive and negative instances
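ROC AUC has an equivalent rank-based reading: the probability that a randomly chosen positive instance is scored above a randomly chosen negative one. This sketch uses that Mann-Whitney formulation rather than integrating the curve explicitly:

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive is scored above a
    random negative (ties count as half) -- the Mann-Whitney formulation."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0 (perfect separation)
```

A score of 0.5 corresponds to random guessing, which is why values well above 0.5 are what indicate useful discrimination.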

At Alchemy, we strive to achieve three goals when models are trained:

- Recommend trials with the optimal test results for all target properties
- Predict target property test result for trials with input variables defined by the user
- Get insights into the importance of input variables

All predicted property values have associated confidence intervals, which show how much deviation can be expected from the predicted property value for a certain trial.