Partial Least Squares Regression: Making Sense of Complex Data
Unveiling Insights from Messy Data: Introduction to Partial Least Squares Regression
Imagine you’re trying to predict how well a new drug works (let’s say, to lower blood pressure). You might have lots of information about the drug’s chemical properties, different doses given to patients, and various patient characteristics like age, weight, and pre-existing conditions. This is typical in many real-world situations: we have tons of potential predictors, and these predictors might be related to each other (correlated). Traditional regression methods can struggle when dealing with such “messy” data, especially when the number of predictors is large or when predictors are highly correlated.
This is where Partial Least Squares Regression (PLSR) shines. PLSR is a powerful statistical technique used to predict an outcome variable (or multiple outcome variables) from a set of predictor variables, even when those predictors are numerous and interconnected. It’s especially helpful when you suspect that some underlying, unobserved factors are driving both your predictors and your outcome.
Real-world examples of PLSR applications:
- Chemistry and Chemometrics: Predicting chemical properties of a substance (like concentration, purity) from spectral data (like near-infrared (NIR) or mass spectrometry). Spectral data can have thousands of variables that are highly correlated.
- Pharmaceuticals: Predicting drug efficacy or toxicity based on molecular descriptors and biological assays. Drug discovery often involves analyzing many molecular features and their relationships to drug activity.
- Food Science: Predicting food quality (like taste, texture, nutritional content) from chemical composition data or sensor readings.
- Sensory Analysis: Relating sensory perceptions (like ratings of taste or smell) to the chemical composition of products.
- Environmental Science: Predicting pollution levels or environmental impact from various environmental measurements.
- Manufacturing: Predicting product quality or process outcomes from process variables and sensor data in complex manufacturing processes.
- Social Sciences and Economics: In situations with many interrelated economic or social indicators, PLSR can be used for prediction and understanding underlying relationships.
In essence, if you have:
- A response or outcome you want to predict.
- Many potential predictor variables, possibly more predictors than observations.
- Predictor variables that are likely to be correlated with each other.
Then, PLSR could be a valuable method to explore and build predictive models.
The Math Behind PLSR: Finding the Hidden Connections
PLSR is more than just a standard regression technique; it’s also a dimension reduction method. It tries to find a smaller set of new variables (called latent variables or components) that capture the most important information from your predictor variables to explain the outcome variable.
Let’s break down the key mathematical ideas:
- Latent Variables: Imagine trying to understand students’ performance in school. You have many test scores (predictors: Math test, English test, Science test, etc.). These scores are likely correlated. PLSR tries to find underlying “latent” factors, like “general academic ability,” which are not directly measured but influence all the test scores and also predict the overall “GPA” (outcome).
PLSR constructs these latent variables as linear combinations of the original predictor variables. For the predictor data X, the latent variables T are formed as:
\[\mathbf{T} = \mathbf{XW}\]
Where:
- X is the matrix of predictor variables (observations x predictors).
- W is a weight matrix that PLSR calculates to find directions in the predictor space that are most relevant for predicting the outcome.
- T is the matrix of latent variables (also called scores or components).
- Outcome Variable Decomposition: Similarly, PLSR decomposes the outcome variable Y using another set of latent variables U:
\[\mathbf{U} = \mathbf{YQ}\]
Where:
- Y is the matrix of outcome variables (observations x outcomes – often just one outcome column in regression, but PLSR can handle multiple).
- Q is a weight matrix for the outcome variables.
- U is the matrix of latent variables for the outcome.
- Maximizing Covariance: The core idea of PLSR is to find latent variables T and U such that they:
- Capture a significant amount of variance in the predictor variables X.
- Capture a significant amount of variance in the outcome variable Y.
- Have maximum covariance between T and U. This means PLSR aims to find components in X that are most predictive of components in Y. Covariance essentially measures how much two variables change together.
Mathematically, PLSR tries to maximize the covariance between the latent variables from X (i.e., columns of T) and the latent variables from Y (i.e., columns of U).
- Regression on Latent Variables: Once PLSR has found these latent variables T and U, it performs a simple linear regression between T (as predictors) and Y (as the outcome):
\[\mathbf{Y} = \mathbf{TB} + \mathbf{E}\]
Where:
- B is the matrix of regression coefficients relating the latent variables T to the outcome Y.
- E is the matrix of residuals (the part of Y not explained by the model).
Since T is derived from X, this is effectively building a regression model using the original predictors X, but through the lens of these optimized latent variables.
- Algorithm Iteration (Simplified): PLSR algorithms typically work iteratively. They find the first pair of latent variables (the first columns of T and U), then “deflate” X and Y by removing the variance explained by these first components, and repeat the process to find subsequent latent variables. You choose how many latent variables to keep in the model.
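To make this concrete, here is a minimal sketch of extracting the first component for a single outcome in the NIPALS style, on simulated data. Treat it as an illustration of the idea, not scikit-learn’s exact implementation:

import numpy as np

# Toy data: 50 samples, 5 predictors, one outcome
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + 0.1 * rng.standard_normal(50)

X = X - X.mean(axis=0)  # center predictors
y = y - y.mean()        # center outcome

w = X.T @ y                  # weight vector: X-direction of maximum covariance with y
w = w / np.linalg.norm(w)
t = X @ w                    # scores: the first latent variable (a column of T)
p = X.T @ t / (t @ t)        # X loadings for this component
b = (t @ y) / (t @ t)        # regression of y on the latent variable
X = X - np.outer(t, p)       # deflate X: remove the variance explained by t
y = y - b * t                # deflate y, then repeat for the next component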
Example to Understand Covariance Maximization:
Imagine we want to predict “plant growth” (Y) based on “sunlight” (X1) and “water amount” (X2). Let’s say “sunlight” and “water amount” are somewhat correlated - plants getting more sun might also be watered more.
PLSR might find a first latent variable T1 that is a combination of sunlight and water (e.g., T1 = 0.7 * Sunlight + 0.3 * Water). This T1 could represent an overall “growing condition” factor. PLSR would choose the weights (0.7 and 0.3) in such a way that T1 has high variance in itself (captures a lot of variability in sunlight and water) AND is highly correlated with plant growth (Y).
By using this latent variable T1 (and potentially more if needed), PLSR simplifies the regression problem and deals with the correlation between sunlight and water effectively.
In essence, PLSR intelligently reduces the dimensionality of your predictors while ensuring that the reduced set is still highly relevant for predicting your outcome.
Prerequisites and Assumptions: Setting the Stage for PLSR
PLSR is quite flexible and makes fewer strict assumptions than some other regression methods. However, understanding the prerequisites and assumptions is still important for proper application and interpretation.
Prerequisites:
- Numerical Predictor and Outcome Variables: PLSR, in its standard form, works with numerical data. Both your predictor variables (X) and outcome variables (Y) should be quantitative. Categorical variables need to be handled appropriately (e.g., using dummy coding, as we’ll discuss in preprocessing).
- Linear Relationship (Primarily): PLSR is fundamentally a linear method. While it can sometimes capture non-linear relationships indirectly through complex data structures, it assumes a primarily linear relationship between the latent variables and the outcome, and between the original predictors and the latent variables. If your relationships are highly non-linear, other methods might be more suitable.
- Data Matrices X and Y: Your data needs to be organized in a matrix format where rows represent observations (samples, subjects, etc.) and columns represent variables (predictors and outcomes).
Assumptions (Less Strict than some methods, but still considerations):
- Relevance of Predictors: PLSR assumes that at least some of your predictor variables are relevant to predicting the outcome. If all your predictors are completely unrelated to the outcome, PLSR won’t magically create predictive power.
- No Strict Distributional Assumptions: Unlike some regression methods (like Ordinary Least Squares regression under classical assumptions), PLSR doesn’t require strict assumptions about the distribution of errors (residuals) for parameter estimation and prediction. However, for statistical inference (confidence intervals and hypothesis tests, which are less common in standard PLSR usage), assumptions about the residuals become more relevant.
- Multicollinearity Handling: PLSR is specifically designed to handle multicollinearity (high correlation among predictors); this is one of its strengths. It doesn’t suffer from the instability that ordinary least squares regression can exhibit when predictors are highly correlated.
Testing Assumptions (Practical Considerations):
- Numerical Data Check: Verify that your predictor and outcome variables are indeed numerical. For categorical predictors, decide on a suitable encoding method (e.g., one-hot encoding, dummy coding).
- Linearity Assessment: Before applying PLSR, consider exploring the relationships between predictors and outcomes using scatter plots (for simple cases with few predictors) or other exploratory data analysis techniques. If you suspect strong non-linearities, you might need to transform variables or use non-linear extensions of PLSR or other non-linear methods. In complex data with many predictors, however, visually checking linearity for all pairs becomes impractical, and PLSR’s linear approach is often applied effectively as an approximation.
- Predictor Relevance: Think about whether, based on domain knowledge, your predictors are likely to be related to the outcome you want to predict. Variable selection and feature engineering might be relevant steps if you have many predictors and suspect some are noise.
- Multicollinearity Check (Less Critical for PLSR): While PLSR handles multicollinearity well, you can still check for it using metrics like the Variance Inflation Factor (VIF) if you’re curious about the degree of correlation among your predictors. High VIF is much less of a concern for PLSR than for OLS regression; a quick check is sketched below.
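The column names below are hypothetical, and variance_inflation_factor comes from the statsmodels package (assumed installed):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical numeric predictors; x4 is deliberately collinear with x1
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.standard_normal((100, 3)), columns=['x1', 'x2', 'x3'])
X['x4'] = 0.9 * X['x1'] + 0.1 * rng.standard_normal(100)

# VIF per predictor: values well above ~5-10 signal strong multicollinearity
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)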
Python Libraries for Implementation:
We’ll use these Python libraries for PLSR implementation:
- scikit-learn (sklearn): A widely used machine learning library in Python. It provides a PLSRegression class in the sklearn.cross_decomposition module.
- numpy: For numerical operations and array handling.
- pandas: For data manipulation and creating dataframes.
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
Data Preprocessing: Getting Your Data Ready for PLSR
Data preprocessing is a crucial step for PLSR to perform effectively. Scaling or normalization of your data is particularly important.
Scaling/Normalization:
- Why it’s essential for PLSR: PLSR is sensitive to the scales of variables, especially when it comes to finding latent variables. If your predictor variables have very different scales (e.g., one predictor ranges from 0 to 1, and another ranges from 1 to 10000), variables with larger scales can disproportionately influence the model and the latent variable extraction process. Scaling ensures that all variables contribute more equally.
- Standardization (Z-score normalization) is common: Scaling predictors and outcomes to have zero mean and unit variance (standard deviation of 1) is a very common and often recommended preprocessing step for PLSR. This is done using the formula:
\[z = \frac{x - \mu}{\sigma}\]
Where:
- z is the standardized value.
- x is the original value.
- μ is the mean of the variable.
- σ is the standard deviation of the variable.
- Min-Max Scaling (alternative): Another option is to scale variables to a specific range, typically [0, 1] or [-1, 1]. This can be useful when you want to preserve the original range properties, but standardization is more typical for PLSR.
- When to scale: It is generally advisable to scale your predictor variables (X) in PLSR, especially if they are measured in different units or have widely varying ranges. Scaling the outcome variable (Y) is also often beneficial and commonly done, although you might choose not to scale Y if its original scale is meaningful and you want to interpret predictions in that scale.
- Example of why scaling matters: Imagine you are predicting house price (Y) using predictors like “house size in square feet” (X1 - range, say, 500-5000) and “number of bedrooms” (X2 - range, say, 1-5). “House size” has a much larger scale. Without scaling, PLSR might give undue importance to “house size” simply because its values are numerically larger, even if “number of bedrooms” is also an important predictor. Scaling brings them to a comparable range of influence.
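As a quick sanity check on the formula, StandardScaler applies exactly this z-score transformation column by column. A small sketch using the house-price variables from the example above:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two predictors on very different scales: size (sq ft) and bedrooms
X = np.array([[ 500., 1.],
              [1200., 2.],
              [3000., 4.],
              [5000., 5.]])

z = StandardScaler().fit_transform(X)          # z-score each column
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # the same formula by hand
print(np.allclose(z, manual))                  # True: StandardScaler is (x - mu) / sigma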
Handling Categorical Variables:
- One-Hot Encoding or Dummy Coding: If you have categorical predictor variables, you must convert them to numerical representations before using PLSR. The most common method is one-hot encoding (as discussed in the Poisson Regression blog). For example, if you have a “color” variable with categories [“red”, “blue”, “green”], you’d create binary variables like “is_red,” “is_blue,” “is_green.” Dummy coding is a similar alternative.
- Encode before scaling: Typically, you would perform one-hot encoding first, then apply scaling to the resulting numerical predictor matrix (including the binary variables created from categorical encoding) along with the original numerical predictors. A minimal sketch follows.
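The column names here are hypothetical:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mixed-type predictors
df = pd.DataFrame({
    'alcohol': [9.5, 11.0, 12.3, 10.1],
    'color':   ['red', 'white', 'red', 'white'],  # categorical predictor
})

# Step 1: one-hot encode the categorical column
encoded = pd.get_dummies(df, columns=['color'], dtype=float)
print(encoded.columns.tolist())  # ['alcohol', 'color_red', 'color_white']

# Step 2: scale the resulting all-numeric matrix
X_scaled = StandardScaler().fit_transform(encoded)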
Other Preprocessing Considerations (Similar to general regression):
- Missing Data: PLSR, as implemented in scikit-learn, typically requires complete data (no missing values). You’ll need to handle missing data before applying PLSR. Common techniques include:
- Imputation: Filling in missing values using methods like mean imputation, median imputation, or more advanced methods (e.g., k-Nearest Neighbors imputation, model-based imputation); see the sketch after this list.
- Removing Observations: If you have a small number of observations with missing data, and missingness is not systematic, you might consider removing those rows. However, be cautious about losing too much data.
- Outlier Handling: Outliers can influence PLSR, as they can in any regression method. Consider identifying and handling outliers in your predictor and outcome variables. Outlier detection methods and treatment (removal, transformation, robust methods) are general data preprocessing topics.
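For instance, mean imputation with scikit-learn’s SimpleImputer (a minimal sketch on a toy array with missing values):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan]])

# Replace each NaN with its column mean ('median' is another strategy)
imputer = SimpleImputer(strategy='mean')
X_complete = imputer.fit_transform(X)
print(X_complete)  # [[1. 2.], [2.5 3.], [4. 2.5]]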
In summary, for PLSR, scaling (standardization) is the most crucial preprocessing step. Handling categorical variables through encoding and addressing missing data are also essential preprocessing steps to consider before applying the algorithm.
Implementation Example: Predicting Wine Quality
Let’s use a dummy dataset to demonstrate PLSR in Python. We’ll imagine we want to predict the “quality” of wine based on various chemical measurements.
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
# Set random seed for reproducibility
np.random.seed(42)
# Number of samples and predictors
n_samples = 150
n_predictors = 10
# Simulate predictor variables (chemical measurements - could be correlated)
X = np.random.rand(n_samples, n_predictors) + np.random.randn(n_samples, n_predictors) * 0.2
# Simulate wine quality (outcome) - dependent on some linear combination of predictors
true_coefficients = np.array([0.5, -0.3, 0.8, 0, 0, 0.2, -0.1, 0.4, 0, -0.6]) # Some predictors are more important
y_true = X.dot(true_coefficients) + np.random.randn(n_samples) * 0.5 # Add some noise
# Create Pandas DataFrame
data = pd.DataFrame(X, columns=[f'predictor_{i+1}' for i in range(n_predictors)])
data['quality'] = y_true
print(data.head())
Output (will vary due to randomness):
predictor_1 predictor_2 predictor_3 predictor_4 predictor_5 predictor_6 predictor_7 predictor_8 predictor_9 predictor_10 quality
0 1.773951 1.552422 1.022247 0.474349 1.779749 0.920853 0.797931 1.224747 1.299758 1.285054 1.134847
1 1.498573 0.885625 0.967642 0.231598 1.294007 1.630976 0.909579 0.883391 0.829423 0.982103 -0.118297
2 1.126638 1.063537 1.749087 0.741169 1.089918 1.639068 0.620995 0.684129 0.879073 0.815417 0.628540
3 1.089213 1.099668 1.217832 0.449789 1.220346 1.185995 1.024088 1.184936 0.957352 1.089649 0.902378
4 1.658370 1.621369 1.254850 0.179533 1.259654 1.398392 1.049183 1.231873 1.199772 1.404717 1.288538
Now, let’s split the data into training and testing sets, scale the data, and fit a PLSR model.
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('quality', axis=1), data['quality'], test_size=0.3, random_state=42)
# Scale the data using StandardScaler
scaler_x = StandardScaler()
X_train_scaled = scaler_x.fit_transform(X_train)
X_test_scaled = scaler_x.transform(X_test)
scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train.values.reshape(-1, 1)) # Scale outcome too
y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1))
# Fit PLSR model with a chosen number of components (e.g., 3)
n_components = 3
pls = PLSRegression(n_components=n_components)
pls.fit(X_train_scaled, y_train_scaled)
# Make predictions on test set
y_pred_scaled = pls.predict(X_test_scaled)
# Inverse transform predictions to original scale of 'quality'
y_pred = scaler_y.inverse_transform(y_pred_scaled)
# Evaluate model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE on Test Set: {rmse:.4f}")
print(f"R-squared on Test Set: {r2:.4f}")
Output (will vary):
RMSE on Test Set: 0.5300
R-squared on Test Set: 0.7574
Interpreting the Output:
- RMSE (Root Mean Squared Error): A measure of the average prediction error in the original units of ‘quality’. Lower RMSE is better. Here, RMSE is 0.5300, which means, on average, our predictions are about 0.53 units away from the actual ‘quality’ values on the test set.
- R-squared (Coefficient of Determination): A measure of how well the model fits the data, ranging from 0 to 1. Higher R-squared is generally better. R-squared of 0.7574 means that approximately 75.74% of the variance in the ‘quality’ variable is explained by our PLSR model on the test set.
Understanding PLSR Components:
We chose n_components=3. PLSR has created 3 latent variables. We can look at the loadings and scores to understand these components, though interpretation can be more complex than in simpler methods.
# Get loadings (weights for X variables in latent components)
x_loadings = pls.x_loadings_
print("X Loadings (Weights for Predictors):\n", x_loadings)
# Get scores (latent variable values for each observation) - for training data
x_scores_train = pls.x_scores_
print("\nFirst few rows of X Scores (Training Data):\n", x_scores_train[:5])
Output (example - loadings will vary):
X Loadings (Weights for Predictors):
[[ 0.29290882 -0.12903165  0.39491008]
 [-0.17117171 -0.13720801  0.36751979]
 [-0.25074814  0.49651038 -0.2424042 ]
 [ 0.40498168  0.36768954  0.05914864]
 [ 0.40498845  0.04635833 -0.05807338]
 [ 0.41940924 -0.33223571  0.40968531]
 [-0.31768985  0.33590114  0.18918518]
 [-0.18946562  0.24081522 -0.20620975]
 [-0.21309988 -0.21649304  0.32698557]
 [-0.34458826  0.39722211 -0.25813403]]
First few rows of X Scores (Training Data):
[[ 0.37052167 -0.10879305 -1.09558711]
[ 0.53849332 0.27118575 -0.3987066 ]
[ 0.43166056 -0.34245596 0.05921614]
[ 0.50225668 -0.36588758 -0.3037598 ]
[-0.03674133 0.41744011 -0.11737189]]
- x_loadings_: These are the weights that define how each original predictor variable contributes to each latent variable. For example, the first column of x_loadings_ shows the weights for the first latent component. Larger absolute loadings indicate a stronger contribution.
- x_scores_: These are the values of the latent variables for each training sample (stored in x_scores_train above). For each sample (row) in your training data, you now have 3 score values, representing its position in the 3-dimensional latent variable space.
Saving and Loading the Model:
Similar to Poisson Regression, you can use pickle to save and load the PLSR model.
import pickle
# Save the fitted PLSR model and scalers
model_data = {'pls_model': pls, 'scaler_x': scaler_x, 'scaler_y': scaler_y}
with open('pls_model.pkl', 'wb') as file:
pickle.dump(model_data, file)
# Load the saved model and scalers
with open('pls_model.pkl', 'rb') as file:
loaded_data = pickle.load(file)
loaded_pls = loaded_data['pls_model']
loaded_scaler_x = loaded_data['scaler_x']
loaded_scaler_y = loaded_data['scaler_y']
# Example: Make prediction using the loaded model
y_pred_loaded_scaled = loaded_pls.predict(X_test_scaled) # Still need to scale test data
y_pred_loaded = loaded_scaler_y.inverse_transform(y_pred_loaded_scaled)
print(f"Predictions using loaded model (first 5):\n{y_pred_loaded[:5]}")
This saves the PLSR model along with the scalers (which are crucial to apply the same scaling in production). Loading it allows you to reuse the trained model for future predictions.
Post-processing: Variable Importance and Interpretation
Variable Importance in PLSR:
Determining variable importance in PLSR is somewhat nuanced compared to models like decision trees where feature importance is directly provided. However, several methods can help assess the importance of original predictor variables.
- Variable Importance in Projection (VIP):
- Concept: VIP scores summarize the importance of each predictor variable in projecting onto the PLSR components that are most relevant for explaining the outcome. A VIP score is calculated for each predictor variable for each PLSR component and then aggregated. Variables with higher VIP scores are considered more important.
- Threshold: A common heuristic is to consider variables with VIP scores > 1.0 as important. However, the threshold can be adjusted.
- Calculation (Implementation): Many PLSR implementations provide VIP scores, but scikit-learn’s standard PLSRegression does not output them directly. You might need to use other libraries or implement the VIP calculation yourself; libraries like pls_sklearn may provide VIP calculations.
- Interpretation: VIP scores are relative measures of importance. They indicate which predictors are most influential in constructing the PLSR components that matter for predicting the outcome.
- Regression Coefficients:
- Direct Coefficients (Less Reliable): You can examine the regression coefficients in the PLSR model (the B matrix in our earlier equation Y = TB + E). However, these coefficients relate the latent variables to the outcome, not the original predictors to the outcome. Interpreting them directly as variable importance for the original predictors is therefore often misleading, especially when predictors are correlated.
- Standardized Coefficients (More Useful): To make coefficients more comparable, it’s better to look at standardized regression coefficients. These are the coefficients you get if you standardize both your predictors and outcome before fitting the PLSR model (which we did in our example with StandardScaler). Larger absolute values of standardized coefficients may suggest greater importance, but even standardized coefficients need to be interpreted cautiously because of the way PLSR constructs latent variables.
- Loadings Analysis:
- Loadings (Weights): Examine the loadings of the predictor variables in each PLSR component (pls.x_loadings_). For each component, variables with larger absolute loadings have a greater influence on that component. If a component is highly predictive of the outcome (which you can assess by looking at the regression coefficients between components and outcomes), then the predictors with large loadings in that component are indirectly important for prediction.
- Visualizing Loadings: Plotting loadings can help you see which groups of predictors contribute strongly to each component and how they relate to each other; a sketch follows below.
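For example, a simple bar plot of first-component loadings, reusing the fitted pls model and X_train DataFrame from the example above (a sketch assuming matplotlib is available):

import matplotlib.pyplot as plt

# Each predictor's loading on the first PLSR component
first_component_loadings = pls.x_loadings_[:, 0]

plt.figure(figsize=(8, 4))
plt.bar(X_train.columns, first_component_loadings)
plt.axhline(0, color='gray', linewidth=0.8)
plt.ylabel('Loading on component 1')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()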
Example of a simplified VIP calculation (conceptual - not direct sklearn output):
# (Conceptual example - simplified VIP, might not be exactly like a formal VIP)
# Let's assume we have 'pls' model fitted above
n_components = pls.n_components
x_loadings = pls.x_loadings_
y_loadings = pls.y_loadings_ # Weights for Y in U (often just 1 column in regression)
x_scores_train = pls.x_scores_
y_scores_train = pls.y_scores_
# Calculate variance explained by each component in Y
explained_variance_y_per_component = np.var(y_scores_train, axis=0)
# Calculate VIP scores (simplified - conceptual formula)
vip_scores = np.zeros(X_train_scaled.shape[1]) # Initialize VIP scores for predictors
for i in range(X_train_scaled.shape[1]): # Loop through predictors
sum_sq_loadings = 0
for j in range(n_components): # Loop through components
sum_sq_loadings += (x_loadings[i, j]**2) * explained_variance_y_per_component[j]
vip_scores[i] = np.sqrt(n_components * sum_sq_loadings / np.sum(explained_variance_y_per_component))
# Create DataFrame for VIP scores
vip_df = pd.DataFrame({'Predictor': X_train.columns, 'VIP_Score': vip_scores})
vip_df_sorted = vip_df.sort_values(by='VIP_Score', ascending=False)
print("\nVariable Importance in Projection (VIP) Scores (Conceptual):\n", vip_df_sorted)
Output (conceptual VIP scores - will vary):
Variable Importance in Projection (VIP) Scores (Conceptual):
Predictor VIP_Score
7 predictor_8 1.398591
9 predictor_10 1.286725
2 predictor_3 1.258665
5 predictor_6 1.207108
0 predictor_1 0.953399
6 predictor_7 0.888396
8 predictor_9 0.880046
1 predictor_2 0.464868
3 predictor_4 0.458751
4 predictor_5 0.449264
Interpreting VIP Scores (Conceptual example):
Predictors like predictor_8, predictor_10, predictor_3, and predictor_6 have VIP scores greater than 1 (in this example), suggesting they are relatively more important for predicting ‘quality’ in this PLSR model.
Important Note on VIP Calculation: The provided VIP calculation is a simplified, conceptual illustration. Actual VIP calculation can have variations depending on the exact algorithm and library. If you need precise VIP scores, it’s recommended to use a library that specifically provides them or refer to the theoretical formulas and implement them carefully. Libraries like pls_sklearn or ropls (in R) are options to explore for more robust VIP calculations in PLSR.
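For reference, one widely circulated weights-based formulation can be computed directly from scikit-learn’s fitted attributes (x_scores_, x_weights_, y_loadings_). Treat the sketch below as illustrative for the single-outcome case, not a canonical definition, and verify against your library of choice:

import numpy as np
import pandas as pd

def compute_vip(pls_model):
    """VIP scores from a fitted PLSRegression model (single-column Y case)."""
    t = pls_model.x_scores_    # (n_samples, n_components)
    w = pls_model.x_weights_   # (n_predictors, n_components)
    q = pls_model.y_loadings_  # (n_targets, n_components)
    p = w.shape[0]
    ssy = np.sum(t ** 2, axis=0) * q[0] ** 2  # SS of Y explained per component
    w_norm = w / np.linalg.norm(w, axis=0)    # normalize each weight column
    return np.sqrt(p * (w_norm ** 2 @ ssy) / ssy.sum())

vip = pd.Series(compute_vip(pls), index=X_train.columns)
print(vip.sort_values(ascending=False))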
Hyperparameter Tuning in PLSR
The primary hyperparameter to tune in PLSR is the number of components (latent variables), n_components.
Effect of n_components:
- Too few components: If you choose too few components, your model might underfit. It won’t capture enough of the important variance in the predictors that is relevant to the outcome. This can lead to lower predictive performance, especially if the underlying relationships are complex.
- Too many components: If you choose too many components, your model might overfit, especially if you have a limited number of training samples. PLSR could start modeling noise in the data instead of just the signal, leading to good performance on the training set but poorer generalization to new, unseen data (like the test set). Also, adding more components increases model complexity and can make interpretation harder.
- Optimal number: There’s usually an “optimal” number of components that balances capturing relevant variance and avoiding overfitting. This optimal number depends on your dataset and problem.
Hyperparameter Tuning Methods:
- Cross-Validation: The most standard and reliable way to tune n_components is cross-validation, most commonly k-fold cross-validation. The process:
- Split your training data into k folds (e.g., 5 or 10 folds).
- For each number of components you want to try (e.g., from 1 to a reasonable maximum, like the number of predictors or fewer): for each fold, train a PLSR model with that number of components on the other k-1 folds, then evaluate its performance (e.g., using RMSE or R-squared) on the held-out fold (validation fold).
- Average the performance metric across all k folds for each number of components.
- Choose the n_components that gives the best average performance in cross-validation.
- Validation Set Approach: A simpler (but less robust) method is to split your training data into a training set and a separate validation set. Train PLSR models with different n_components on the training set, evaluate their performance on the validation set, and choose the n_components that performs best.
Python Implementation (Cross-Validation for n_components):
from sklearn.model_selection import cross_val_score, KFold
# Define range of n_components to test
n_components_range = range(1, X_train_scaled.shape[1] + 1) # Test from 1 up to the number of predictors
# Set up K-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42) # 5-fold CV
rmse_scores_cv = []
r2_scores_cv = []
for n_comp in n_components_range:
pls_cv = PLSRegression(n_components=n_comp)
# Use negative mean squared error in cross_val_score because it's designed to *maximize* scores
# We want to minimize RMSE, so use negative MSE and take the negative of the result later
mse_scores = -cross_val_score(pls_cv, X_train_scaled, y_train_scaled, cv=kf, scoring='neg_mean_squared_error')
rmse_cv_mean = np.sqrt(mse_scores.mean())
r2_scores = cross_val_score(pls_cv, X_train_scaled, y_train_scaled, cv=kf, scoring='r2') # R-squared directly
r2_cv_mean = r2_scores.mean()
rmse_scores_cv.append(rmse_cv_mean)
r2_scores_cv.append(r2_cv_mean)
# Find n_components with minimum average RMSE (or maximum R-squared)
optimal_n_components_rmse = n_components_range[np.argmin(rmse_scores_cv)]
optimal_n_components_r2 = n_components_range[np.argmax(r2_scores_cv)]
print("Cross-Validation Results for n_components:")
for i, n_comp in enumerate(n_components_range):
print(f"n_components={n_comp}: Avg. RMSE = {rmse_scores_cv[i]:.4f}, Avg. R-squared = {r2_scores_cv[i]:.4f}")
print(f"\nOptimal n_components (based on min RMSE): {optimal_n_components_rmse}")
print(f"Optimal n_components (based on max R-squared): {optimal_n_components_r2}")
# You can now refit your PLSR model using the optimal n_components found by cross-validation
optimal_pls = PLSRegression(n_components=optimal_n_components_rmse) # Or optimal_n_components_r2
optimal_pls.fit(X_train_scaled, y_train_scaled)
# Evaluate on test set using the optimally tuned model (as before)
y_pred_optimal_scaled = optimal_pls.predict(X_test_scaled)
y_pred_optimal = scaler_y.inverse_transform(y_pred_optimal_scaled)
rmse_optimal_test = np.sqrt(mean_squared_error(y_test, y_pred_optimal))
r2_optimal_test = r2_score(y_test, y_pred_optimal)
print(f"\nTest Set Performance with Optimal n_components={optimal_n_components_rmse}:")
print(f"RMSE on Test Set: {rmse_optimal_test:.4f}")
print(f"R-squared on Test Set: {r2_optimal_test:.4f}")
Output (will vary, but shows CV results and optimal n_components):
Cross-Validation Results for n_components:
n_components=1: Avg. RMSE = 0.6000, Avg. R-squared = 0.6442
n_components=2: Avg. RMSE = 0.5717, Avg. R-squared = 0.6749
n_components=3: Avg. RMSE = 0.5680, Avg. R-squared = 0.6788
n_components=4: Avg. RMSE = 0.5686, Avg. R-squared = 0.6784
n_components=5: Avg. RMSE = 0.5692, Avg. R-squared = 0.6777
n_components=6: Avg. RMSE = 0.5706, Avg. R-squared = 0.6763
n_components=7: Avg. RMSE = 0.5719, Avg. R-squared = 0.6749
n_components=8: Avg. RMSE = 0.5730, Avg. R-squared = 0.6738
n_components=9: Avg. RMSE = 0.5742, Avg. R-squared = 0.6726
n_components=10: Avg. RMSE = 0.5754, Avg. R-squared = 0.6714
Optimal n_components (based on min RMSE): 3
Optimal n_components (based on max R-squared): 3
Test Set Performance with Optimal n_components=3:
RMSE on Test Set: 0.5300
R-squared on Test Set: 0.7574
Interpreting Hyperparameter Tuning:
The cross-validation results show how the average RMSE and R-squared change as we vary n_components. In this example, it looks like n_components=3 or 4 gives the best performance in cross-validation (lowest RMSE, highest R-squared). The optimal number might vary slightly each time you run it due to the randomness in cross-validation splitting and data simulation. After finding the optimal n_components, you refit your PLSR model using this optimal value and evaluate its final performance on the test set to get an unbiased estimate of generalization performance.
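It often helps to plot the cross-validation curve to see this tradeoff directly. A sketch assuming matplotlib, reusing n_components_range and rmse_scores_cv from the loop above:

import matplotlib.pyplot as plt

# Visualize the underfitting/overfitting tradeoff from the CV loop
plt.figure(figsize=(7, 4))
plt.plot(list(n_components_range), rmse_scores_cv, marker='o')
plt.xlabel('Number of PLS components')
plt.ylabel('Average cross-validated RMSE')
plt.title('Selecting n_components via cross-validation')
plt.tight_layout()
plt.show()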
Accuracy Metrics for PLSR
As PLSR is primarily used for regression tasks, we use regression metrics to assess its accuracy. Common metrics include:
- Root Mean Squared Error (RMSE):
- Equation: \(RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }\)
- yi are the actual outcome values.
- ŷi are the predicted outcome values.
- n is the number of observations.
- Interpretation: RMSE measures the average magnitude of errors in the same units as the outcome variable. Lower RMSE values indicate better model accuracy. It is sensitive to large errors because of the squaring.
- Mean Absolute Error (MAE):
- Equation: \(MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\)
- Interpretation: MAE is the average of the absolute differences between actual and predicted values. It is also in the units of the outcome variable, and lower MAE is better. MAE is less sensitive to outliers than RMSE because it uses absolute values instead of squares.
- R-squared (Coefficient of Determination):
- Equation: \(R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\)
- ȳ is the mean of the actual outcome values.
- Interpretation: R-squared represents the proportion of the variance in the outcome variable that is explained by the model. It ranges from 0 to 1 (and sometimes can be negative if the model is very poor). Higher R-squared values (closer to 1) indicate a better fit. R-squared of 1 means the model perfectly explains all the variance in the outcome.
- Q-squared (Q2) (for Cross-Validation - relevant in PLSR):
- Concept: Q2 is a cross-validated R-squared. It’s often used in PLSR and chemometrics to assess the predictive ability of the model when evaluated using cross-validation. It’s calculated similarly to R-squared, but using cross-validated predictions instead of predictions from a model trained on the entire dataset.
- Interpretation: Q2 is a measure of how well the model is expected to predict new data. A higher Q2 value is better. A positive Q2 generally suggests that the model has some predictive power. A Q2 close to 1 indicates good predictive ability in cross-validation. A Q2 near zero or negative can indicate poor predictive power or overfitting.
- Calculation (Conceptual): In k-fold cross-validation, for each fold j, you get predictions ŷ(-j) for the observations in fold j using a model trained on the data excluding fold j. Q2 is then calculated as: \(Q^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_{(-f_i)})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\) Where ŷ(-fi) is the cross-validated prediction for observation i (which belongs to fold fi). The numerator is the sum of squared prediction errors from cross-validation, and the denominator is the total sum of squares (same as in R-squared).
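Putting these together on the wine example, all four metrics can be computed in a few lines. The sketch below reuses pls, the scaled training data, and the test-set predictions from earlier; cross_val_predict supplies the held-out predictions that Q2 needs:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_predict, KFold

# Test-set metrics in the original units of 'quality'
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Q-squared: R-squared computed from cross-validated (held-out) predictions
kf = KFold(n_splits=5, shuffle=True, random_state=42)
y_cv = cross_val_predict(pls, X_train_scaled, y_train_scaled, cv=kf)
press = np.sum((y_train_scaled - y_cv) ** 2)                 # CV prediction error SS
tss = np.sum((y_train_scaled - y_train_scaled.mean()) ** 2)  # total SS
q2 = 1 - press / tss

print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}  Q2={q2:.4f}")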
Choosing Metrics:
- RMSE and MAE: Provide a direct measure of prediction error magnitude in the original units, useful for understanding practical prediction accuracy. Choose between RMSE and MAE depending on how sensitive you want your metric to be to large errors (RMSE is more sensitive).
- R-squared: Gives a sense of the proportion of variance explained, good for model comparison and understanding overall fit.
- Q-squared: Essential for assessing the out-of-sample predictive ability of PLSR models, especially when tuning the number of components using cross-validation. Q2 helps estimate how well the model will generalize to new data.
In practice, reporting RMSE, R-squared (on a test set), and Q-squared (from cross-validation) provides a comprehensive evaluation of PLSR model performance.
Productionizing PLSR Models
Putting a PLSR model into production involves similar steps as productionizing other regression models. Here’s a breakdown:
1. Saving and Loading the Trained Model and Scalers:
- As shown in the implementation example, use pickle to save the trained PLSRegression model object and the StandardScaler objects that were used for scaling predictors and outcomes.
- In your production environment, load these saved objects. You’ll need the scalers to preprocess new input data in the same way as your training data before feeding it to the loaded PLSR model for prediction.
2. Deployment Environments:
- Local Testing/Development: Test your prediction pipeline locally first. Make sure loading, preprocessing, and prediction steps work correctly. Use a virtual environment to manage dependencies.
- On-Premise Server or Cloud:
- API (Web Service): The most common approach is to create a REST API that serves your PLSR model. Use frameworks like Flask or FastAPI (in Python) to build the API.
- Containerization (Docker): Package your API application, model files, scalers, and dependencies in a Docker container. This ensures consistency and simplifies deployment across different environments.
- Cloud Platforms (AWS, GCP, Azure): Deploy your Docker containerized API to cloud platforms using services like:
- AWS ECS/EKS (Elastic Container Service/Kubernetes Service)
- Google Kubernetes Engine (GKE)
- Azure Kubernetes Service (AKS)
- Serverless Functions (AWS Lambda, etc. - if prediction logic is very simple and event-driven, but less common for full ML APIs)
- On-Premise Servers: If deploying on-premise, set up your server environment to run your Docker containers (or directly run your API application if containerization is not used).
3. API Design (Example for a prediction endpoint):
Your API might have an endpoint like /predict_wine_quality that accepts input data for wine chemical measurements and returns the predicted quality.
- Input: The API endpoint should expect input data in a structured format, like JSON. The JSON should contain the predictor variable values for a new wine sample for which you want to predict quality. Make sure the input data structure matches the features your model was trained on (predictor names, order, etc.).
- Preprocessing in API: In your API code, upon receiving input data:
- Load the saved StandardScaler for predictors.
- Transform the input predictor data: loaded_scaler_x.transform(input_data).
- Load the saved PLSRegression model.
- Make a prediction using the scaled input data: prediction_scaled = loaded_pls.predict(scaled_input_data).
- Load the saved StandardScaler for the outcome.
- Inverse transform the scaled prediction back to the original scale: prediction_original_scale = loaded_scaler_y.inverse_transform(prediction_scaled).
- Return prediction_original_scale in a JSON response.
4. Monitoring and Maintenance:
- Performance Monitoring: Track API request latency, error rates, and prediction performance over time. Monitor metrics like RMSE, MAE, R-squared on real-world incoming data (if you have access to actual outcomes later for comparison).
- Model Retraining: Regularly retrain your PLSR model with new data to account for potential drift in data distribution or changes in relationships over time. Establish a retraining schedule.
- Version Control: Use version control for your API code, model files, and deployment scripts.
- Logging: Implement logging to record API requests, predictions, errors, and system events for debugging and monitoring.
Code Example (Simplified Flask API for PLSR Prediction - conceptual):
from flask import Flask, request, jsonify
import pickle
import pandas as pd
import numpy as np
app = Flask(__name__)
# Load model and scalers at app startup
with open('pls_model.pkl', 'rb') as file:
model_data = pickle.load(file)
pls_model = model_data['pls_model']
scaler_x = model_data['scaler_x']
scaler_y = model_data['scaler_y']
@app.route('/predict_wine_quality', methods=['POST'])
def predict_wine_quality():
try:
data = request.get_json() # Get JSON data from request
input_df = pd.DataFrame([data]) # Create DataFrame from JSON input (single sample)
# Ensure input data has the expected predictor columns
expected_columns = ['predictor_1', 'predictor_2', 'predictor_3', 'predictor_4', 'predictor_5', 'predictor_6', 'predictor_7', 'predictor_8', 'predictor_9', 'predictor_10'] # Replace with your actual predictor names
if not all(col in input_df.columns for col in expected_columns):
return jsonify({'error': 'Invalid input data format. Missing predictor columns.'}), 400
# Scale input data
scaled_input_data = scaler_x.transform(input_df[expected_columns])
# Make prediction
prediction_scaled = pls_model.predict(scaled_input_data)
# Inverse transform to original scale
prediction_original_scale = scaler_y.inverse_transform(prediction_scaled)[0][0] # Get scalar prediction
return jsonify({'predicted_quality': float(prediction_original_scale)}) # Return prediction as JSON
except Exception as e:
return jsonify({'error': str(e)}), 400 # Return error with 400 status
if __name__ == '__main__':
app.run(debug=True) # Use debug=False in production
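With the server running (Flask’s development server defaults to port 5000), a client call might look like this hypothetical example, where the predictor values are placeholders:

import requests

sample = {f'predictor_{i+1}': 0.8 for i in range(10)}  # placeholder measurements

response = requests.post(
    'http://localhost:5000/predict_wine_quality',
    json=sample,
    timeout=10,
)
print(response.json())  # e.g. {'predicted_quality': ...}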
This is a basic illustration. Real-world APIs need more robust input validation, error handling, security, and potentially batch prediction capabilities for efficiency.
Conclusion: PLSR - A Versatile Tool for Complex Regression
Partial Least Squares Regression is a valuable technique, especially when you’re facing regression problems with:
- Many predictor variables, possibly more than observations.
- High correlation (multicollinearity) among predictors.
- A need to reduce dimensionality while preserving predictive power.
Real-world Impact and Applications:
PLSR’s strengths have made it popular in fields like:
- Chemometrics and Spectroscopy: Analyzing complex spectral data for chemical composition prediction.
- Pharmaceuticals and Drug Discovery: Predicting drug properties from molecular descriptors.
- Process Analytical Technology (PAT) in Manufacturing: Monitoring and controlling manufacturing processes in real-time using sensor data.
- Sensory Science and Consumer Research: Relating sensory perceptions to product characteristics.
- Bioinformatics and Genomics: Analyzing high-dimensional biological data.
Limitations and Alternatives:
- Linearity Assumption: PLSR is primarily a linear method. If relationships are strongly non-linear, consider non-linear extensions or other non-linear regression techniques (e.g., neural networks, kernel methods, non-linear PLS variants).
- Interpretability: While PLSR provides variable importance measures (like VIP), interpretation can be more complex than in simpler regression models. Understanding the latent variables and their meaning might require domain expertise.
- Alternatives:
- Principal Component Regression (PCR): Another dimension reduction technique combined with regression. PCR focuses only on explaining variance in predictors (X), while PLSR focuses on covariance between X and Y, often making PLSR more predictive.
- Regularized Regression (Ridge, Lasso, Elastic Net): These methods handle multicollinearity and can perform variable selection (Lasso, Elastic Net). They might be simpler and more directly interpretable than PLSR in some cases.
- Non-linear Regression Methods (Neural Networks, Tree-based models): For highly non-linear relationships or very large datasets, consider methods like neural networks or gradient boosting machines, although they might be less interpretable and require more data and tuning.
Ongoing Use and Optimization:
PLSR remains a widely used and effective method in many domains, especially where data is complex, multicollinear, and dimensionality reduction is beneficial. Ongoing research continues in areas like:
- Non-linear PLS extensions: Developing PLSR variants that can handle non-linear relationships more effectively.
- Sparse PLS: Methods to improve variable selection and interpretability by incorporating sparsity into PLSR.
- Integration with machine learning pipelines: Combining PLSR with other machine learning techniques for enhanced modeling and prediction.
PLSR provides a solid foundation for regression in complex settings and is a valuable tool in the arsenal of data scientists and analysts working with high-dimensional, correlated data.
References
- Geladi, P., & Kowalski, B. R. (1986). Partial least squares regression: a tutorial. Analytica Chimica Acta, 185, 1-17. (A classic, foundational tutorial on PLSR).
- Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2), 109-130. (Another influential paper describing PLSR and its application in chemometrics).
- Tenenhaus, M. (1998). La régression PLS: théorie et pratiques. Editions Technip. (A comprehensive book on PLSR - in French, but widely cited and influential).
- Abdi, H. (2010). Partial least squares (PLS) regression. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1), 97-106. (A good overview article on PLSR).
- Rosipal, R., & Krämer, N. (2006). Overview and recent advances in partial least squares. In Subspace, Latent Structure and Feature Selection (Lecture Notes in Computer Science, vol. 3940, pp. 34-51). Springer. (An overview of PLS methods and their recent extensions).