Partial Least Squares Regression (PLSR): Predicting with Many, Messy Variables
1. Introduction: Making Predictions When Your Data is Complicated
Imagine you’re trying to bake a cake. You have a recipe (your model) and many ingredients (your data - features) like flour, sugar, eggs, and milk. But what if some of these ingredients are very similar? For example, you might have different types of flour that are highly correlated in their properties. And what if you have more ingredients than you really need for a good cake?
This is where Partial Least Squares Regression (PLSR) comes in handy in the world of data analysis. PLSR is a powerful statistical method used for regression, which means predicting a continuous outcome (like the quality of a cake, or sales figures, or chemical concentration) based on a set of input variables (like ingredient amounts, customer demographics, or spectral measurements).
What makes PLSR special?
PLSR is particularly useful when you have situations where:
- Many Predictor Variables: You have a large number of input features (columns in your data).
- Predictor Variables are Correlated: These input features are not independent of each other; they are related or correlated. This is often called multicollinearity, and it can cause problems for traditional regression methods.
- You Want to Predict Multiple Outcomes (though we will focus on single outcome for simplicity): PLSR can actually handle predicting multiple output variables at the same time, but we’ll focus on the case of predicting a single outcome for easier understanding.
Real-world examples where PLSR is used:
-
Chemistry and Spectroscopy: Imagine analyzing the quality of food, fuel, or pharmaceuticals using spectral data. Spectral data (like near-infrared (NIR) or Raman spectra) consists of measurements at many wavelengths (many features), and these measurements are often highly correlated. PLSR is widely used to predict properties like protein content, octane number, or drug concentration from spectral data. Think about quickly testing the quality of milk by shining a light through it and using PLSR to analyze the light pattern to predict fat content – that’s the power of PLSR!
-
Sensory Analysis and Food Science: In food and beverage industry, sensory panels evaluate products based on taste, smell, texture, etc. These sensory attributes (features) are often correlated. PLSR can be used to relate these sensory scores to chemical compositions or processing parameters to understand what drives consumer preference.
-
Chemometrics and Process Analytical Technology (PAT): In manufacturing processes (especially in pharmaceuticals or chemicals), we monitor various process variables (temperature, pressure, flow rates, etc.) to control product quality. These variables are often interconnected. PLSR is used to build models that predict critical quality attributes based on real-time process data.
-
Social Sciences and Economics: While maybe less common than in chemistry, PLSR can also be used in social sciences or economics when dealing with datasets with many potentially correlated indicators to predict outcomes like customer satisfaction, economic indices, or market trends.
In short, PLSR is your go-to method when you need to build a regression model and your input data is high-dimensional and contains correlated variables – a situation that is very common in many scientific and industrial applications. It’s about making reliable predictions even when your data isn’t “perfectly” structured for standard regression methods.
2. The Mathematics Behind PLSR: Finding the Right Components
Let’s peek under the hood at the math behind PLSR. Don’t worry, we will explain it step-by-step in a simple way!
The core idea of PLSR is to find a set of new variables, called components or latent variables, that are:
- Linear Combinations of Original Predictor Variables: These components are created by combining the original input features in a smart way.
- Good at Predicting the Outcome Variable(s): These components should be strongly related to the variable we want to predict.
- Uncorrelated with Each Other (as much as possible): This helps to overcome the problem of multicollinearity in the original features.
Imagine it like this: Instead of using all your correlated ingredients directly in your cake recipe, PLSR helps you create “ingredient blends” (components) that capture the most important information for cake quality. These blends are designed to be less redundant (uncorrelated) and more predictive.
Mathematical Steps in PLSR (Simplified):
Let’s say we have our input data matrix X (features) and output data vector y (target variable).
Step 1: Extract the First Component (t1 and u1)
PLSR starts by finding the first pair of components – one for the input variables (t1) and one for the output variable (u1).
-
Find a weight vector (w_1) for X: PLSR finds a direction in the space of X variables that has a high covariance (a measure of how much two variables change together) with y. This direction is represented by a weight vector (w_1). This (w_1) is chosen to maximize the covariance between (Xw_1) (which is our first X component t1) and y.
Mathematically, this can be represented (simplified) as finding (w_1) that maximizes:
\[\text{Covariance}(Xw_1, y)\] -
Calculate the first X component (score) t1: Once (w_1) is found, we calculate the first X component (t1) for each data point by projecting the original X data onto the direction (w_1):
\[t_1 = Xw_1\]Example: Think of (w_1) as a set of coefficients. For each data point (row in X), you calculate (t_1) by taking a weighted sum of its features using (w_1) as weights.
-
Find a weight vector (c_1) for y (for single y, it’s often just scaled y): Similarly, PLSR finds a weight vector (c_1) (or, in the case of single output y, it’s often simply related to y itself after scaling) that helps relate y to the component t1. For simplicity, in many implementations with a single output, the first Y component u1 is often just y itself (or a scaled version), and then we focus on finding components of X that best predict this y (or u1).
-
Calculate the first Y component (score) u1 (often just scaled y in single output case):
\[u_1 = yc_1 \approx y \text{ (often approximately y for single output)}\]
Step 2: Regression with Components
Now, we use the X component t1 to predict the Y component u1 (or effectively, y) using simple linear regression.
-
Regress u1 (or y) on t1: We find a regression coefficient (b_1) that relates u1 (or y) to t1:
\[u_1 \approx b_1 t_1 \text{ or } y \approx b_1 t_1 \text{ (in simplified single output case)}\]This is just like standard linear regression, but now we’re regressing on the component t1 instead of the original features.
Step 3: Deflation - Removing the Effect of the First Component
After finding the first components and performing regression, we need to “remove” the effect of these components from our original X and y data. This is called deflation.
-
Deflate X: We remove the part of X that is explained by t1:
\[X_{deflated} = X - t_1 p_1^T\]where (p_1) is a “loading” vector for X, which essentially describes how much each original X variable contributes to the component t1. (p_1^T) is the transpose of (p_1), and (t_1 p_1^T) is an outer product resulting in a matrix that’s subtracted from X.
-
Deflate y (optional, but common in PLSR): Similarly, we can deflate y to remove the part explained by t1 (or u1):
\[y_{deflated} = y - t_1 q_1^T\]where (q_1) is a “loading” for y, showing how much of y is captured by u1 (or t1). (q_1^T) is the transpose of (q_1), and (t_1 q_1^T) is a vector that’s subtracted from y.
Step 4: Repeat to Find More Components
We repeat steps 1-3 using the deflated matrices (X_{deflated}) and (y_{deflated}) (from the previous step) to find the second components (t2, u2), and then the third components (t3, u3), and so on. In each step:
- We find new weight vectors ((w_2, w_3, …)) using the deflated data from the previous step.
- We calculate new components ((t_2 = X_{deflated} w_2), (t_3 = X_{deflated, step2} w_3), …).
- We perform regression of u components (or y) on t components to get regression coefficients ((b_2, b_3, …)).
- We deflate (X_{deflated}) and (y_{deflated}) again to prepare for the next component extraction.
Step 5: Determine the Number of Components
We continue extracting components until we have extracted a certain number of components. The number of components to use is a hyperparameter that needs to be chosen (we will discuss this later). We might use cross-validation to determine the optimal number of components that gives good prediction performance without overfitting.
Step 6: Prediction with Selected Components
To make predictions for new data, we:
- Transform the new X data using the weight vectors ((w_1, w_2, …, w_k)) from the first (k) components to get the components for the new data ((t_{1,new}, t_{2,new}, …, t_{k,new})).
-
Use the regression coefficients ((b_1, b_2, …, b_k)) obtained during PLSR training and the components of the new data to predict the output:
\[\hat{y}_{new} = b_1 t_{1,new} + b_2 t_{2,new} + ... + b_k t_{k,new}\]
Key Idea: PLSR effectively performs dimensionality reduction by creating components that are most relevant for predicting the output variable and also address multicollinearity in the input features. It’s a balance between Principal Component Analysis (PCA) – which focuses on explaining variance in X – and standard regression – which focuses on predicting y from X. PLSR is a supervised method, meaning it uses both X and y to find the components, unlike PCA which is unsupervised and only uses X.
3. Prerequisites and Preprocessing: Getting Ready for PLSR
Before you dive into using PLSR, let’s discuss the prerequisites and any essential preprocessing steps.
Prerequisites and Assumptions:
- Linear Relationship (Assumption): PLSR, like standard linear regression, assumes that there is a linear relationship between the predictor variables and the outcome variable. While PLSR is more robust to multicollinearity and high dimensionality than ordinary least squares regression, the underlying relationship it models is still linear.
- Continuous Outcome Variable (for Regression): PLSR is primarily designed for regression tasks, where the variable you are trying to predict (y) is continuous (numerical values that can take any value within a range, like temperature, concentration, sales, etc.). For categorical outcomes (classification), other methods like classification algorithms or variations of PLSR for classification are used.
- Adequate Sample Size: While PLSR can handle high dimensionality (many features), having a reasonable number of samples (data points) is still important for building a reliable model. A general guideline (though it’s not a strict rule) is to have more samples than predictor variables, or at least a sufficient number to represent the underlying relationships in your data. For complex datasets with many features and noise, more samples are generally better.
- Data Quality: As with any statistical method, the quality of your data is crucial. Outliers, errors in data measurement, and missing values can affect PLSR model performance. Address data quality issues through preprocessing steps.
Testing Assumptions (Informally):
- Linearity Check (Scatter Plots, Residual Plots):
- Scatter Plots: Create scatter plots of your outcome variable (y) against individual predictor variables in X. Look for roughly linear trends. If you see strong non-linear patterns (curves, U-shapes, etc.), linear PLSR might not be the best choice without feature transformations or considering non-linear extensions of PLSR.
- Residual Plots (after fitting a PLSR model): After building a PLSR model, examine residual plots (plots of predicted values vs. residuals - the differences between actual and predicted values). For a good linear model, residuals should be randomly scattered around zero, without obvious patterns. Patterns in residual plots can indicate violations of linearity assumptions.
- Sample Size Rule of Thumb (Guideline, not strict test): Compare the number of samples to the number of predictor variables. If you have far fewer samples than features, be cautious about overfitting. Cross-validation becomes particularly important to assess model generalization in such cases. If possible, try to increase sample size if it’s very small relative to the number of features.
- Data Quality Checks:
- Descriptive Statistics: Calculate basic statistics for your variables (means, standard deviations, min, max, quantiles). Look for unusual ranges, unexpected values, or potential data entry errors.
- Outlier Detection: Use methods like box plots, scatter plots, or statistical outlier detection techniques to identify potential outliers in your data. Decide how to handle them (removal, transformation, or robust methods).
- Missing Value Analysis: Check for missing values. Understand the patterns of missingness. Decide on an appropriate strategy for handling missing values (imputation or removal of data if necessary).
Python Libraries for PLSR:
The main Python library you’ll need for PLSR is scikit-learn (sklearn). It provides a PLSRegression class in the sklearn.cross_decomposition module.
# Python Library for PLSR
import sklearn
from sklearn.cross_decomposition import PLSRegression
print("scikit-learn version:", sklearn.__version__)
import sklearn.cross_decomposition # To confirm PLSRegression is accessible
Ensure scikit-learn is installed in your Python environment. Install it using pip if needed:
pip install scikit-learn
4. Data Preprocessing: Scaling is Key
Data preprocessing is often essential for PLSR, and scaling is particularly important.
Why Scaling is Important for PLSR:
PLSR is sensitive to the scales of your features and the target variable. If variables have very different ranges, those with larger ranges can disproportionately influence the component extraction and regression process.
Example (Similar to previous blogs): Consider predicting house price (target) using features like “living area (in sq ft)” (range: 500-5000) and “number of bedrooms” (range: 1-5). If you don’t scale, the “living area” feature, with its larger numerical range, will have a much greater influence on distances, covariances, and component extraction in PLSR, just because its values are numerically larger, not necessarily because it is intrinsically more important for predicting house price than “number of bedrooms.”
Types of Scaling (Same as in t-SNE and Sammon Mapping Blogs - StandardScaler and MinMaxScaler):
-
Standardization (Z-score scaling): Transforms each feature to have mean 0 and standard deviation 1. \(x'_{i} = \frac{x_{i} - \mu}{\sigma}\)
-
Mean Centering (often used in PLSR): Subtracts the mean from each feature (results in mean 0, but standard deviation is not necessarily 1). In some PLSR implementations, mean centering is implicitly done. For scikit-learn’s
PLSRegression, you should apply scaling explicitly if desired. \(x'_{i} = x_{i} - \mu\) -
Min-Max Scaling: Scales features to a specific range, typically 0 to 1. \(x'_{i} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}\)
Which Scaling to Use for PLSR:
- StandardScaler (Standardization): Generally, StandardScaler is a good default choice for PLSR. It puts all features on a comparable scale (mean 0, standard deviation 1), which often leads to more stable and well-performing PLSR models.
- Mean Centering: Mean centering is also frequently used in PLSR, and in some contexts (e.g., chemometrics), it might be preferred. StandardScaler is often a robust general choice.
- Min-Max Scaling: Min-Max scaling might be less common for PLSR than StandardScaler, but it could be considered if you want features to be in a specific bounded range (e.g., 0 to 1).
When can scaling be ignored?
It’s generally not recommended to skip scaling for PLSR unless you have a very specific reason and understand the potential consequences.
- Features Already on Similar Scales (Rare): If you are absolutely certain that all your features are already measured in very similar units and have comparable ranges, and you are using a PLSR implementation that inherently performs mean centering, you might consider skipping explicit scaling. However, even in such cases, scaling is often still beneficial and safer practice.
- Decision Trees and Tree-Based Models (Example where scaling is less critical - as you correctly mentioned): Decision trees and tree-based ensemble methods are indeed less sensitive to feature scaling than methods like PLSR, linear regression, or distance-based algorithms. For tree-based models, scaling is often not strictly required, but for PLSR, it is generally recommended.
Preprocessing Example in Python (using scikit-learn scalers - same code as before):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
# Dummy data (example - replace with your actual data)
data = np.array([[10, 1000],
[20, 20000],
[15, 15000],
[5, 5000]])
# StandardScaler
scaler_standard = StandardScaler()
scaled_data_standard = scaler_standard.fit_transform(data)
print("Standardized data:\n", scaled_data_standard)
# MinMaxScaler
scaler_minmax = MinMaxScaler()
scaled_data_minmax = scaler_minmax.fit_transform(data)
print("\nMin-Max scaled data:\n", scaled_data_minmax)
Handling Categorical Features:
- Encoding: PLSR, like linear regression, typically works with numerical features. Categorical features need to be encoded into numerical form (e.g., one-hot encoding).
- One-Hot Encoding: Convert categorical variables into binary indicator variables (dummy variables). Use
pd.get_dummiesin pandas orOneHotEncoderin scikit-learn.
Handling Missing Values:
- Imputation or Removal: PLSR requires complete data. Missing values must be handled before applying PLSR. Common approaches are:
- Imputation: Fill missing values with estimates (mean, median, or more advanced imputation methods).
SimpleImputerin scikit-learn can be used. - Removal: Remove rows (samples) or columns (features) with missing values if the amount of missing data is small enough and removal is acceptable for your analysis. Be cautious about removing too much data.
- Imputation: Fill missing values with estimates (mean, median, or more advanced imputation methods).
Best Practice: For PLSR, always scale your numerical features (StandardScaler is a good default). Handle categorical features using appropriate encoding (like one-hot encoding) and address missing values before applying PLSR. Scaling the target variable (y) is sometimes also done but might be less critical than scaling features.
5. Implementation Example: PLSR on Dummy Data
Let’s implement PLSR using scikit-learn on some dummy regression data. We’ll see how to fit the model, make predictions, and interpret some outputs.
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
# 1. Generate dummy regression data
np.random.seed(42)
X = np.random.rand(100, 10) # 100 samples, 10 features (some correlated)
y = 2*X[:, 0] + 0.5*X[:, 1] - 1.5*X[:, 3] + 0.8*X[:, 0]*X[:, 2] + np.random.randn(100) # y depends on features 0, 1, 3 and interaction
feature_names = [f'feature_{i+1}' for i in range(10)] # Feature names like 'feature_1', 'feature_2', ...
X_df = pd.DataFrame(X, columns=feature_names)
y_df = pd.Series(y, name='target')
# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3, random_state=42)
# 3. Scale features and target variable using StandardScaler
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
y_scaler = StandardScaler() # Scale target variable too (optional but often helpful)
y_train_scaled = y_scaler.fit_transform(y_train.values.reshape(-1, 1)) # reshape for StandardScaler
y_test_scaled = y_scaler.transform(y_test.values.reshape(-1, 1))
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=feature_names) # Keep column names
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=feature_names) # Keep column names
y_train_scaled_series = pd.Series(y_train_scaled.flatten()) # Convert back to Series if needed for later steps
y_test_scaled_series = pd.Series(y_test_scaled.flatten())
# 4. Train PLSR model
n_components = 3 # Choose number of components (hyperparameter - will tune later)
plsr = PLSRegression(n_components=n_components)
plsr.fit(X_train_scaled_df, y_train_scaled_series)
# 5. Make predictions on test set
y_pred_scaled_test = plsr.predict(X_test_scaled_df)
# 6. Inverse transform predictions to original scale (if target was scaled)
y_pred_test = y_scaler.inverse_transform(y_pred_scaled_test)
# 7. Evaluate model performance
r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
print("PLSR Model Performance on Test Set:")
print(f"R-squared (Test): {r2_test:.4f}")
print(f"RMSE (Test): {rmse_test:.4f}")
# 8. Access PLSR model components and loadings
print("\nPLSR Model Components and Loadings:")
print("X weights (w_weights):\n", plsr.x_weights_)
print("\nY weights (y_weights):\n", plsr.y_weights_)
print("\nX loadings (x_loadings_):\n", plsr.x_loadings_)
print("\nY loadings (y_loadings_):\n", plsr.y_loadings_)
print("\nX scores (x_scores_ - components t):\n", plsr.x_scores_[:5]) # Show first 5 scores
print("\nY scores (y_scores_ - components u):\n", plsr.y_scores_[:5]) # Show first 5 scores
print("\nRegression coefficients (coef_ - coefficients in original variable space):\n", plsr.coef_)
# 9. Save and Load the PLSR model and scalers (for later use)
import joblib # or pickle
# Save PLSR model
joblib.dump(plsr, 'plsr_model.joblib')
print("\nPLSR model saved to plsr_model.joblib")
# Save scalers
joblib.dump(scaler_X, 'scaler_X.joblib')
joblib.dump(y_scaler, 'y_scaler.joblib')
print("Scalers saved to scaler_X.joblib and y_scaler.joblib")
# Load PLSR model and scalers
loaded_plsr = joblib.load('plsr_model.joblib')
loaded_scaler_X = joblib.load('scaler_X.joblib')
loaded_y_scaler = joblib.load('y_scaler.joblib')
print("\nPLSR model and scalers loaded.")
# Example: Make prediction using loaded model
new_data_point = X_test_scaled_df.iloc[[0]] # Take first point from scaled test set
prediction_scaled_loaded = loaded_plsr.predict(new_data_point)
prediction_loaded = loaded_y_scaler.inverse_transform(prediction_scaled_loaded)
print("\nPrediction using loaded model (for first test sample):", prediction_loaded)
Explanation of the Code and Output:
- Generate Dummy Data: We create dummy regression data with 10 features and a target variable
ythat depends on a few features and also includes an interaction term (non-linearity). This is to demonstrate PLSR on a more complex dataset than purely linear. - Train-Test Split: Split into training and testing sets to evaluate model generalization.
- Scale Features and Target: We use
StandardScalerto scale both the feature matrixXand the target variabley. Scalingyis often helpful in PLSR. - Train PLSR Model:
PLSRegression(n_components=n_components): We create aPLSRegressionobject, specifying the number of components (n_components=3). We’ll discuss choosingn_componentslater.plsr.fit(X_train_scaled_df, y_train_scaled_series): We fit the PLSR model using the scaled training data.
- Make Predictions:
plsr.predict(X_test_scaled_df)makes predictions on the scaled test data, resulting in scaled predictionsy_pred_scaled_test. - Inverse Transform Predictions: We use
y_scaler.inverse_transformto transform the scaled predictions back to the original scale of the target variable, gettingy_pred_test. - Evaluate Performance:
r2_score(y_test, y_pred_test): Calculate R-squared on the test set.np.sqrt(mean_squared_error(y_test, y_pred_test)): Calculate Root Mean Squared Error (RMSE) on the test set.- These metrics give an idea of how well the PLSR model is predicting on unseen data.
- Access PLSR Components and Loadings:
plsr.x_weights_,plsr.y_weights_,plsr.x_loadings_,plsr.y_loadings_,plsr.x_scores_,plsr.y_scores_,plsr.coef_: These are attributes of the trainedPLSRegressionobject that provide information about the model: weights, loadings, component scores, and regression coefficients in the original feature space. We print these to understand the model.x_scores_are the t components, andy_scores_are the u components (or approximations of y components for single output PLSR).coef_are the regression coefficients expressed in terms of the original features (not components).
- Save and Load Model and Scalers: We use
joblib.dumpto save the trained PLSR model and theStandardScalerobjects to files. We then usejoblib.loadto load them back. This is essential for deploying and reusing your trained model without retraining.
Interpreting the Output:
When you run this code, you’ll see output like this (values will vary slightly due to random data generation):
PLSR Model Performance on Test Set:
R-squared (Test): 0.81...
RMSE (Test): 0.99...
PLSR Model Components and Loadings:
X weights (w_weights):
[[ 0.2... -0.0... 0.3... 0.1... -0.0... 0.3... 0.3... 0.2... 0.3...
0.3...]
[-0.1... 0.3... 0.0... 0.4... 0.5... 0.0... -0.2... -0.2... -0.3...
0.4...]
[-0.4... 0.1... -0.2... -0.1... 0.0... 0.0... 0.4... -0.2... -0.2...
-0.1...]]
Y weights (y_weights):
[[0.9...]
[-0.3...]
[0.0...]]
X loadings (x_loadings_):
[[ 0.6... -0.1... 0.7... 0.5... -0.1... 0.8... 0.8... 0.7... 0.7...
0.8...]
[ 0.0... 0.6... 0.2... 0.5... 0.8... 0.0... -0.3... -0.3... -0.4...
0.7...]
[-0.7... 0.3... -0.3... -0.2... 0.0... 0.0... 0.7... -0.3... -0.3...
-0.1...]]
Y loadings (y_loadings_):
[[0.9...]
[0.9...]
[0.0...]]
X scores (x_scores_ - components t):
[[-0.4... -0.1... -0.0...]
[ 0.4... -0.0... -0.1...]
[-0.5... 0.4... -0.2...]
[ 0.1... 0.2... 0.0...]
[-0.1... -0.1... -0.0...]...]
Y scores (y_scores_ - components u):
[[-0.4...]
[ 0.4...]
[-0.6...]
[ 0.1...]
[-0.2...]...]
Regression coefficients (coef_ - coefficients in original variable space):
[[ 1.2... 0.2... 0.5... -0.4... -0.0... -0.0... -0.1... 0.0... 0.0...
-0.0...]]
PLSR model saved to plsr_model.joblib
Scalers saved to scaler_X.joblib and y_scaler.joblib
PLSR model and scalers loaded.
Prediction using loaded model (for first test sample): [[0.3...]]
- Performance Metrics (R-squared, RMSE):
R-squared (Test): 0.81...andRMSE (Test): 0.99...indicate the performance of the PLSR model on unseen test data. R-squared of around 0.81 suggests the model explains about 81% of the variance in the test set target variable. RMSE of around 0.99 is in the original units of the target variable (after inverse scaling) and is the average magnitude of the prediction error. These values tell you how well the model predicts, relative to the scale of your target variable. x_weights_,y_weights_,x_loadings_,y_loadings_: These are the weight and loading vectors and matrices from the PLSR algorithm. They are used internally by PLSR to compute components and perform regression. Examining these directly might be less intuitive for general users, but they are crucial for understanding the inner workings of PLSR.x_scores_(t-components),y_scores_(u-components): These are the component scores for the training data.x_scores_(often denoted as T or t components) are the projections of your X data onto the component directions. They are the lower-dimensional representation of your X data in the PLSR component space.y_scores_(often denoted as U or u components) are similarly related to y. You can analyze these component scores to understand the reduced-dimension representation of your data.coef_(Regression Coefficients in Original Space): These are the regression coefficients when you express the PLSR model in terms of the original predictor variables (before component transformation). You can examine these coefficients to get an idea of the relative importance and direction of effect of the original features on the target variable. However, be cautious in interpreting these coefficients directly as “feature importance” because PLSR components are combinations of original features, and the coefficients in original space are derived from the component-based model.- Saving and Loading: The output confirms that the model and scalers have been saved and loaded successfully, and a prediction made with the loaded model matches expectations.
6. Post-Processing: Exploring Feature Importance and Model Insights
After building a PLSR model, you might want to explore feature importance and gain more insights into the model. While PLSR doesn’t directly give you a straightforward “feature importance” ranking like some tree-based models, there are ways to infer feature importance and interpret the model.
Methods for Post-Processing and Feature Importance in PLSR:
- Regression Coefficients in Original Space (
plsr.coef_):- Magnitude of Coefficients: As mentioned earlier, the
plsr.coef_attribute gives you regression coefficients expressed in terms of the original, scaled predictor variables. Features with larger absolute coefficient values have a greater linear influence on the predicted outcome within the context of the PLSR model. - Sign of Coefficients: The sign (+ or -) indicates the direction of the relationship. A positive coefficient means an increase in the feature (when other features are held constant in the model) leads to an increase in the predicted outcome, and vice versa for negative coefficients.
- Caution: Be careful about directly interpreting these coefficients as direct “feature importance” rankings. PLSR components are combinations of features, and these coefficients are derived from the component-based model. They reflect linear relationships within the PLSR model, but not necessarily “importance” in a causal or independent sense. Correlated features can share importance, and the coefficients can be influenced by multicollinearity even though PLSR is designed to handle it.
- Magnitude of Coefficients: As mentioned earlier, the
- Variable Importance in Projection (VIP) Scores:
- VIP scores are a more widely used measure of feature importance in PLSR. They quantify the contribution of each predictor variable to the PLSR model, considering its role in both explaining the predictor variables (X-variance) and predicting the outcome variable (Y-variance).
- Calculation of VIP (simplified concept): VIP for a feature is essentially a weighted sum of the squared loadings of that feature across all PLSR components, weighted by the amount of Y-variance explained by each component. Features with higher VIP scores are considered more important in the PLSR model.
- No direct attribute in scikit-learn for VIP:
sklearn.cross_decomposition.PLSRegressiondoesn’t directly compute VIP scores. You might need to implement the VIP calculation yourself or use specialized PLSR packages (some chemometrics-focused packages might offer VIP calculation). You can find code examples online to calculate VIP using NumPy based on the PLSR model attributes from scikit-learn. - Interpretation of VIP Scores: VIP scores are typically normalized to have an average value of 1. Features with VIP scores significantly greater than 1 (e.g., > 1.5 or > 2, depending on context and dataset) are often considered important. Features with very low VIP scores (e.g., close to 0) are less influential in the PLSR model.
- Component Analysis (Loadings and Weights):
- Examine
plsr.x_loadings_andplsr.x_weights_: These attributes can give insights into how the original features contribute to each PLSR component. - Loadings:
x_loadings_(andy_loadings_) represent the correlations between the original variables and the PLSR components. They show how strongly each original feature is “loaded” onto each component. Larger loadings (in absolute value) indicate a stronger relationship. - Weights:
x_weights_(andy_weights_) are used to calculate the component scores. They are the weights applied to the original variables to create the components.
- Examine
- Feature Selection based on Importance:
- Use VIP scores (or coefficients, but VIP is preferred): If you want to perform feature selection after PLSR, you could use VIP scores to rank features. Select a subset of features with the highest VIP scores. Retrain a PLSR model (or even other regression models) using only these selected features. See if you can achieve comparable or even better performance with a reduced feature set.
- Iterative Feature Elimination (using PLSR): You could devise a backward feature elimination approach. Start with all features, build a PLSR model, evaluate feature importance (e.g., using VIP), remove the least important feature, and repeat. This can be computationally more intensive but explores feature subsets iteratively.
No AB Testing or Hypothesis Testing Directly on PLSR Output (like in visualization methods):
PLSR is a regression technique, and its performance is evaluated using regression metrics (R-squared, RMSE, etc.), not through AB testing or hypothesis testing in the same way you might do post-processing for visualization methods like t-SNE or for experimental designs. However, you would use hypothesis testing and statistical inference in the validation stage of your PLSR modeling (e.g., when comparing the performance of different models or when statistically evaluating the significance of model parameters).
Example: Printing Regression Coefficients and Conceptual VIP Calculation (Conceptual VIP code - needs full VIP implementation for real use):
# ... (Code from previous implementation example up to training PLSR model) ...
# 1. Print Regression Coefficients in Original Variable Space
print("\nRegression Coefficients (coef_ - in original variable space):\n", plsr.coef_)
# 2. Conceptual VIP Calculation (Simplified - Illustrative, not full VIP implementation)
# Note: Full VIP calculation is more involved, this is just to demonstrate the idea.
# For real VIP calculation, search for proper VIP formulas and implementations.
X_loadings = plsr.x_loadings_
Y_loadings = plsr.y_loadings_
X_scores = plsr.x_scores_
n_components_used = plsr.n_components
VIP = np.zeros(X_loadings.shape[0]) # Initialize VIP scores (for features)
# In a full VIP implementation, you'd iterate through components,
# calculate variance explained by each component in Y, and sum weighted loadings squared.
# This is a highly simplified illustration:
for feature_index in range(X_loadings.shape[0]): # Iterate through features
vip_score_feature = 0 # Initialize VIP score for this feature
for component_index in range(n_components_used): # Iterate through components
loading_sq = X_loadings[feature_index, component_index]**2 # Squared loading of feature on component
# In full VIP, you'd weight this by Y-variance explained by component
vip_score_feature += loading_sq # Simplified - just sum of squared loadings (not full VIP)
VIP[feature_index] = vip_score_feature
print("\nConceptual VIP Scores (Simplified - not full VIP):\n", VIP)
Interpreting Output (from example above):
- Regression Coefficients (
coef_): Examine thecoef_output to see the magnitude and sign of coefficients for each feature. In the example output, you can see thatfeature_1has a relatively large positive coefficient,feature_4has a negative coefficient, etc. This gives a sense of the linear influence of each feature in the PLSR model. - Conceptual VIP Scores (Simplified): The simplified VIP scores (in the conceptual example code – not a full VIP implementation) provide a very rough indication of feature “importance.” Features with higher scores in this simplified calculation might be considered more influential in the model. For accurate VIP scores, use a proper VIP calculation method as described in PLSR literature or specialized packages.
In summary: Post-processing for PLSR involves exploring regression coefficients, calculating VIP scores (the preferred method for feature importance), and analyzing loadings and weights to understand feature contributions and model insights. These methods help you go beyond just prediction performance and gain a deeper understanding of the relationships captured by the PLSR model.
7. Hyperparameters of PLSR: Tuning for Best Performance
The main hyperparameter you need to tune in PLSR is the number of components (n_components). Choosing the right number of components is crucial for model performance and avoiding overfitting or underfitting.
Key Hyperparameter: n_components (Number of Components)
- Effect:
n_componentscontrols the complexity of the PLSR model.- Small
n_components: A small number of components (e.g., 1, 2, 3) leads to a simpler model that captures only the strongest relationships in the data. It can help in dimensionality reduction and handling multicollinearity. However, if too few components are used, the model might underfit, failing to capture important variance in the data, leading to lower prediction accuracy. - Large
n_components: A large number of components (closer to the number of original features) results in a more complex model that can capture more variance in the data. Using too many components can lead to overfitting, where the model fits the training data very well but performs poorly on new, unseen data (poor generalization). With enough components, PLSR can essentially become very similar to ordinary least squares regression, potentially losing its benefits in handling multicollinearity and dimensionality reduction. - Optimal
n_components: The goal is to find an optimaln_componentsthat balances model complexity and prediction performance, achieving good generalization.
- Small
-
Hyperparameter Tuning for
n_components(using Cross-Validation):- Cross-Validation (Essential): Use cross-validation (e.g., k-fold cross-validation) to estimate the performance of PLSR models with different numbers of components. Cross-validation helps you assess how well the model generalizes to unseen data and avoid overfitting during hyperparameter tuning.
- Grid Search or Range of Values: Try a range of
n_componentsvalues (e.g., from 1 up to a reasonable number, maybe up to the number of original features or a smaller number based on dataset size). - Evaluation Metric: Choose a suitable evaluation metric for regression (e.g., R-squared, RMSE, Mean Squared Error) to evaluate performance in cross-validation.
- Plot Performance vs.
n_components: Plot the cross-validated performance metric (e.g., average R-squared across CV folds) against the number of components. Look for the “elbow” in the plot. Performance typically increases as you add components up to a point, and then might plateau or decrease (due to overfitting). The optimaln_componentsis often around the point where performance starts to level off. -
Code Example (Hyperparameter Tuning for
n_components):import matplotlib.pyplot as plt from sklearn.model_selection import cross_val_score # ... (Code from previous implementation example up to scaling data) ... n_components_range = range(1, 11) # Try components from 1 to 10 cv_scores = [] for n_comp in n_components_range: plsr = PLSRegression(n_components=n_comp) # Use cross_val_score to get cross-validated R-squared scores scores = cross_val_score(plsr, X_train_scaled_df, y_train_scaled_series, cv=5, scoring='r2') # 5-fold CV, R-squared avg_r2 = np.mean(scores) # Average R-squared across folds cv_scores.append(avg_r2) # Plot CV R-squared vs. Number of Components plt.figure(figsize=(8, 6)) plt.plot(n_components_range, cv_scores, marker='o') plt.xlabel('Number of PLSR Components') plt.ylabel('Cross-Validated R-squared Score') plt.title('PLSR Performance vs. Number of Components (Cross-Validation)') plt.grid(True) plt.show() # Find optimal n_components (e.g., choose the one with highest CV R-squared) optimal_n_components = n_components_range[np.argmax(cv_scores)] print(f"Optimal number of components based on CV R-squared: {optimal_n_components}") # Train final PLSR model with optimal n_components and evaluate on test set (as in previous example) final_plsr = PLSRegression(n_components=optimal_n_components) final_plsr.fit(X_train_scaled_df, y_train_scaled_series) y_pred_test_final = y_scaler.inverse_transform(final_plsr.predict(X_test_scaled_df)) r2_test_final = r2_score(y_test, y_pred_test_final) rmse_test_final = np.sqrt(mean_squared_error(y_test, y_pred_test_final)) print("\nPerformance of final PLSR model (optimal components) on Test Set:") print(f"R-squared (Test): {r2_test_final:.4f}") print(f"RMSE (Test): {rmse_test_final:.4f}")Run this code to generate a plot of cross-validated R-squared against
n_components. Examine the plot to choose the best number of components and find the reported optimaln_componentsand final test set performance.
Hyperparameter Tuning Process Summary:
- Choose a range of
n_componentsvalues to test. - Use cross-validation (e.g., k-fold CV).
- For each
n_componentsvalue, train a PLSR model and evaluate its cross-validated performance using an appropriate regression metric (like R-squared). - Plot the CV performance metric against
n_components. - Select the
n_componentsvalue that gives the best trade-off between performance and model complexity (look for the “elbow” in the performance plot, or choose then_componentsthat maximizes the CV metric). - Train a final PLSR model using the optimal
n_componentson the entire training data and evaluate its performance on a held-out test set.
Other Potential “Hyperparameters” (Less Commonly Tuned in basic PLSR, but worth knowing):
scale=TrueinPLSRegression(Scikit-learn): Insklearn.cross_decomposition.PLSRegression, thescale=Trueparameter controls whether to scale bothXandYdata to unit variance before applying PLSR. By default,scale=True. If you have already scaled your data externally usingStandardScaler, you can setscale=Falseto prevent double scaling. However, it’s generally recommended to leavescale=Trueor applyStandardScalerexternally, as scaling is important for PLSR.- Centering Options (Implicit in Scaling, Sometimes Explicit in other PLSR implementations): Mean centering of
Xandyis a common preprocessing step in PLSR. StandardScaler performs both centering and scaling to unit variance. Some specialized PLSR packages might offer more fine-grained control over centering and scaling options.
Note: For most standard PLSR applications, tuning the number of components (n_components) is the primary hyperparameter tuning task.
8. Accuracy Metrics: Evaluating PLSR Models
To assess the performance of PLSR models, we use standard regression metrics. Here are the most common ones:
Regression Metrics for PLSR Evaluation:
-
R-squared (Coefficient of Determination): (Explained in detail in Subset Selection Blog) Measures the proportion of variance in the target variable explained by the model. Ranges from 0 to 1 (or sometimes negative if the model is very poor). Higher R-squared is better.
\[R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\] -
Adjusted R-squared: (Explained in Subset Selection Blog) Penalizes the addition of irrelevant features (or in PLSR’s case, components). Useful for comparing models with different numbers of components. Adjusted R-squared is always less than or equal to R-squared.
\(Adjusted\ R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\) (Here, (p) could be considered the number of components used in PLSR, though the direct interpretation is slightly different from number of features in linear regression).
-
Mean Squared Error (MSE): (Explained in Subset Selection Blog) Average squared difference between predicted and actual values. Lower MSE is better.
\[MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] -
Root Mean Squared Error (RMSE): (Explained in Subset Selection Blog) Square root of MSE. In the same units as the target variable, often more interpretable than MSE. Lower RMSE is better.
\[RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\] -
Mean Absolute Error (MAE): (Explained in Subset Selection Blog) Average absolute difference between predicted and actual values. Less sensitive to outliers than MSE or RMSE. Lower MAE is better.
\[MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\]
Choosing the Right Metric:
- R-squared and Adjusted R-squared: Good for understanding the proportion of variance explained. R-squared is widely used, but adjusted R-squared is better for comparing models with different complexity (number of components in PLSR).
- RMSE: Provides an error metric in the original units of your target variable, often easily interpretable. Sensitive to outliers due to squaring errors.
- MAE: Less sensitive to outliers than RMSE. Might be preferred if you are concerned about outliers having a disproportionate influence on the error metric.
Evaluating PLSR Performance in Practice:
- Split Data: Split your data into training and testing sets.
- Preprocess Data: Scale your features and target variable (usually StandardScaler).
- Tune
n_componentsusing Cross-Validation: Use cross-validation on the training set to find the optimal number of components that maximizes your chosen evaluation metric (e.g., R-squared) or minimizes error (e.g., RMSE). - Train Final Model: Train a PLSR model with the optimal
n_componentson the entire training set. - Evaluate on Test Set: Evaluate the performance of the final trained model on the held-out test set using the chosen metric(s) (R-squared, RMSE, MAE). The test set performance gives a realistic estimate of how well your PLSR model will generalize to new, unseen data.
Example (Calculating R-squared and RMSE in Python - same code as in Subset Selection Blog):
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# Assume y_true and y_pred are your actual and predicted values (numpy arrays or lists)
y_true = np.array([25, 30, 35, 40, 45])
y_pred = np.array([26, 29, 36, 41, 44])
r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print("R-squared:", r2)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
9. Productionizing PLSR Models
“Productionizing” a PLSR model for real-world use involves deploying it to make predictions on new, incoming data. Here are typical steps:
Productionizing Steps for PLSR:
- Offline Training and Model Saving:
- Train PLSR Model: Train your PLSR model using your training data, including preprocessing (scaling, encoding, missing value handling). Determine the optimal
n_componentsusing cross-validation. - Save the Trained Model: Save the trained
PLSRegressionmodel object to a file (e.g., usingjoblib.dumpin Python). - Save Preprocessing Objects: Crucially, save any preprocessing objects used, such as
StandardScaler(or other scalers, encoders, imputers) that were fitted on your training data. You’ll need these to preprocess new data in exactly the same way.
- Train PLSR Model: Train your PLSR model using your training data, including preprocessing (scaling, encoding, missing value handling). Determine the optimal
- Production Environment Setup:
- Choose Deployment Environment: Select where your PLSR model will be deployed (cloud, on-premise servers, local machines, edge devices).
- Software Stack: Ensure the necessary software (Python environment with scikit-learn and any other libraries) is set up in your production environment.
- Data Ingestion and Preprocessing in Production:
- Data Ingestion: Set up a process to receive new data that you want to make predictions on (e.g., from APIs, databases, files, sensors).
- Preprocessing (Crucial): Apply exactly the same preprocessing steps to the new data as you used during training. This is critical for model consistency.
- Load Preprocessing Objects: Load the saved
StandardScaler(and other preprocessing objects). - Transform New Data: Use the loaded scalers (and other preprocessors) to transform the new input data. Use the
transform()method (notfit_transform()) on the pre-fitted objects to ensure consistency with training.
- Load Preprocessing Objects: Load the saved
- Model Loading and Prediction:
- Load Trained PLSR Model: Load the saved
PLSRegressionmodel object into your production environment (usingjoblib.load). - Make Predictions: Use the
predict()method of the loaded PLSR model to make predictions on the preprocessed new data. - Inverse Transform Predictions (if target variable was scaled): If you scaled your target variable (
y) during training, inverse transform the scaled predictions back to the original scale using the saved target variable scaler.
- Load Trained PLSR Model: Load the saved
- Output and Integration:
- Output Predictions: Output the predictions in the desired format (e.g., return values from an API, write to a database, display on a dashboard, send to other systems).
- Integrate into Workflow: Integrate the PLSR prediction process into your broader application or workflow.
Code Snippet: Example Production Prediction Function (Python - Conceptual):
import joblib
import pandas as pd
import numpy as np
# --- Assume these files were saved during training ---
MODEL_FILE = 'plsr_model.joblib'
X_SCALER_FILE = 'scaler_X.joblib'
Y_SCALER_FILE = 'y_scaler.joblib'
# Load trained model and scalers (do this once at application startup, not for every prediction)
loaded_plsr = joblib.load(MODEL_FILE)
loaded_scaler_X = joblib.load(X_SCALER_FILE)
loaded_y_scaler = joblib.load(Y_SCALER_FILE)
def make_plsr_prediction(raw_data_point_dict):
"""Makes a PLSR prediction for a new data point."""
# 1. Convert raw input (dict) to DataFrame (or numpy array) - assuming feature_names are consistent
input_df = pd.DataFrame([raw_data_point_dict]) # Assume keys in dict are feature names
# 2. Preprocess input data using *loaded* scaler (same scaler fitted on training data)
input_scaled = loaded_scaler_X.transform(input_df)
input_scaled_df = pd.DataFrame(input_scaled, columns=input_df.columns) # Keep column names
# 3. Make prediction using loaded PLSR model
prediction_scaled = loaded_plsr.predict(input_scaled_df)
# 4. Inverse transform prediction to original scale (if target was scaled)
prediction_original_scale = loaded_y_scaler.inverse_transform(prediction_scaled)
return prediction_original_scale[0][0] # Return single prediction value
# Example usage in production
new_data = {'feature_1': 0.5, 'feature_2': 0.1, 'feature_3': 0.8, 'feature_4': 0.3,
'feature_5': 0.6, 'feature_6': 0.2, 'feature_7': 0.9, 'feature_8': 0.4,
'feature_9': 0.7, 'feature_10': 0.1} # New data as dictionary
predicted_value = make_plsr_prediction(new_data)
print("PLSR Prediction:", predicted_value)
Deployment Environments:
- Cloud Platforms (AWS, Google Cloud, Azure): Cloud services are well-suited for production deployment due to scalability, reliability, and managed infrastructure. Use cloud compute instances, serverless functions, container services, and ML platform services.
- On-Premise Servers: Deploy on your organization’s servers if required by security or compliance.
- Edge Devices: For some applications, deploy PLSR models on edge devices (e.g., sensors, embedded systems) for real-time analysis and reduced latency.
Key Production Considerations:
- Preprocessing Consistency: Ensure absolute consistency in preprocessing steps between training and production. Use the same preprocessing objects (scalers, encoders) fitted on training data.
- Performance and Latency: PLSR prediction is typically fast. Ensure your entire pipeline (data ingestion, preprocessing, prediction, output) meets performance and latency requirements for your application.
- Monitoring: Monitor model performance in production over time. Data drift can occur (data distributions changing over time), which can degrade model accuracy. Regularly retrain or update your PLSR model with fresh data as needed.
- Error Handling: Implement robust error handling in your production pipeline to deal with unexpected input data formats, missing data, or system issues.
- Version Control: Manage versions of your models, preprocessing code, and deployment configurations.
10. Conclusion: PLSR – A Robust Tool for Regression with Complex Data
Partial Least Squares Regression (PLSR) is a powerful and versatile regression technique, particularly valuable when dealing with high-dimensional datasets with multicollinearity. It’s widely applied across various fields for prediction and data analysis.
Real-World Problem Solving with PLSR:
- Spectroscopic Data Analysis (Chemometrics, PAT): Dominant technique for building calibration models to predict chemical or physical properties from spectral data (NIR, Raman, etc.) in industries like food, pharmaceuticals, chemicals, agriculture, and environmental monitoring.
- Sensory and Consumer Science: Relating sensory attributes of products to chemical composition or consumer preferences in food and beverage, cosmetics, and consumer goods industries.
- Process Modeling and Control: Building models to predict product quality based on real-time process variables in manufacturing, enabling process monitoring and control.
- Bioinformatics and Genomics: Analyzing gene expression data or other high-dimensional biological data to predict disease outcomes, drug responses, or biological traits.
- Environmental Modeling: Predicting environmental variables (pollution levels, water quality, etc.) based on sensor data and environmental indicators.
Where PLSR is Still Being Used:
PLSR remains a highly relevant and widely used technique, especially in:
- Chemometrics and related fields: It’s a cornerstone method for quantitative analysis in spectroscopy and process analytics.
- Applications with high-dimensional, correlated data: When you encounter datasets with many potentially collinear predictor variables and need to build a regression model, PLSR is a strong candidate.
- Situations where interpretability is also important: While not as interpretable as simpler linear regression in terms of individual feature effects, PLSR provides insights through component loadings, weights, and VIP scores, helping to understand feature contributions in a reduced-dimensional space.
Optimized and Newer Algorithms:
While PLSR is robust, some related and newer methods are used in similar contexts:
- Regularized Regression Methods (Ridge Regression, Lasso, Elastic Net): These methods, discussed in the Subset Selection blog, also handle multicollinearity and perform feature selection (Lasso, Elastic Net) or coefficient shrinkage (Ridge). They are often computationally efficient and widely used as alternatives to PLSR in some regression tasks.
- Principal Component Regression (PCR): PCR first performs PCA on the predictor variables X and then uses the principal components as predictors in a linear regression model. PCR is simpler than PLSR but doesn’t directly consider the outcome variable y during component extraction, which can sometimes make PLSR more predictive.
- Kernel PLS and Non-Linear PLSR Extensions: For datasets with non-linear relationships, kernel PLSR or other non-linear extensions of PLSR can be considered to capture non-linearities.
- Deep Learning and Neural Networks (for very large datasets, potentially): For very large datasets and complex non-linear relationships, deep learning models can be powerful regression tools, but they are often less interpretable and require more data and computational resources than PLSR, which is effective for many medium-sized, high-dimensional datasets.
Choosing Between PLSR and Alternatives:
- For Spectroscopic and Chemometric Data: PLSR is often the method of choice and a well-established standard.
- For Handling Multicollinearity and Dimensionality Reduction in Regression: PLSR, Ridge Regression, Lasso, and Elastic Net are all viable options. PLSR and regularized regression methods offer different trade-offs in terms of feature selection, coefficient shrinkage, and computational properties.
- For Non-linear Relationships: Consider non-linear extensions of PLSR or other non-linear regression models if linearity assumptions are strongly violated.
Final Thought: PLSR is a powerful and enduring technique for regression, especially valuable in scenarios with complex, high-dimensional, and correlated predictor data. Its ability to handle multicollinearity and provide a reduced-dimensional representation of the data while maintaining predictive power makes it a key tool in the data scientist’s and analyst’s toolkit. For many applications, particularly in chemometrics and process analysis, PLSR remains the gold standard.
11. References and Resources
Here are some references and resources to learn more about Partial Least Squares Regression (PLSR):
-
“Chemometrics: A Practical Guide to Multivariate Data Analysis in the Chemical and Life Sciences” by Kim H. Esbensen: (Book Link - Search Online) - A comprehensive textbook on chemometrics, with detailed chapters on PLSR and its applications in chemistry, spectroscopy, and related fields. This book is considered a classic in chemometrics.
-
“Methods in Molecular Biology, vol 544: Partial Least Squares Regression (PLS-Regression)” Edited by Gaston Sanchez: (Book Link - Search Online) - A book in the Methods in Molecular Biology series specifically dedicated to PLSR. It covers various aspects of PLSR, from theory to applications, with a focus on biological and omics data.
- scikit-learn Documentation for PLSRegression:
- scikit-learn PLSRegression - The official documentation for the
PLSRegressionclass in scikit-learn. Provides details on parameters, usage, and examples in Python.
- scikit-learn PLSRegression - The official documentation for the
-
“Principles of Multivariate Data Analysis: A User’s Perspective” by W. Krzanowski: (Book Link - Search Online) - A broader textbook on multivariate data analysis, with chapters covering regression methods including PLSR, from a user-oriented perspective.
- Online Resources and Tutorials: Search online for tutorials, blog posts, and videos on “Partial Least Squares Regression Python” or “PLSR tutorial”. Websites like Towards Data Science, Cross Validated (Stack Exchange), and YouTube have numerous resources explaining PLSR with code examples.
These references should provide a good starting point for deepening your understanding of PLSR, its mathematical foundations, applications, and implementation in Python. Experiment with PLSR on your own datasets and explore its capabilities for regression and data analysis!