Multivariate Adaptive Regression Splines (MARS): Flexible Modeling with Splines
Introduction: Bending the Line to Fit Complex Data
Imagine you are trying to model how crop yield changes with rainfall. You might notice that initially, more rain helps a lot, but after a certain point, too much rain can actually harm the crops. A simple straight line (like in linear regression) might not accurately capture this ‘sweet spot’ relationship.
Multivariate Adaptive Regression Splines (MARS) is a clever algorithm that addresses this limitation. It’s like bending and piecing together straight line segments to create a flexible curve that better fits complex data relationships. Think of it as using LEGO bricks of straight lines to build a curve that follows the twists and turns of your data, instead of being restricted to just one straight brick.
MARS is particularly useful when:
- Relationships are Non-linear: The connection between your input features and the target variable isn’t a simple straight line. It might curve, bend, or have different slopes in different regions of the data.
- Interactions between Features Exist: The effect of one feature on the target might depend on the value of another feature. MARS can automatically capture these interactions.
- You want an interpretable model: While flexible, MARS models are still more interpretable than “black box” models like neural networks. You can often understand how each feature and their interactions affect the prediction.
Real-world examples where MARS is useful:
- Sales Prediction with Promotions: Predicting sales based on advertising spend and promotional offers. The effect of advertising might be different depending on whether a promotion is currently running. MARS can capture these varying effects and interactions.
- Financial Modeling: Predicting stock prices or asset returns. Financial markets are complex and often non-linear. MARS can model these complex patterns better than simple linear models.
- Environmental Modeling: Predicting pollution levels based on weather conditions and industrial activity. The relationship might be non-linear and influenced by interactions between weather variables and pollution sources.
- Healthcare and Medical Prediction: Predicting patient outcomes based on various clinical measurements and patient characteristics. The relationships might be complex and non-linear, and interactions between factors can be crucial.
- Engineering and Manufacturing: Optimizing manufacturing processes or predicting product quality based on machine settings and raw material properties. The relationship between settings and quality might be non-linear with critical thresholds.
In essence, MARS helps us move beyond the limitations of simple linear models by allowing for flexibility to model curves and interactions, while still retaining a degree of interpretability.
The Mathematics of MARS: Building Curves with Hinges
Let’s delve into the mathematical ideas behind MARS. We’ll break down the process into simpler steps to make it easier to understand.
1. Basis Functions: The Building Blocks
MARS builds its flexible curves using special functions called basis functions. These are the “LEGO bricks” we mentioned earlier. The main types of basis functions used in MARS are called hinge functions.
A hinge function looks like this:
\text{Hinge function} = \max(0, x - c) \text{ or } \max(0, c - x)
Let’s break this down:
- x: This is our input feature (e.g., rainfall, advertising spend).
- c: This is a “knot” or “hinge point”. It’s a specific value of ‘x’ where the relationship might change its slope.
- max(0, …): This is a “maximum” function. It means “take the larger of 0 and the value inside the parentheses”.
There are two types of hinge functions, used in pairs:
- Type 1: max(0, x - c) - This function is zero when x is less than or equal to 'c', and increases linearly when x is greater than 'c'. It's like a line that's "hinged" at point 'c' and only rises to the right of it.
- Type 2: max(0, c - x) - This function is zero when x is greater than or equal to 'c', and increases linearly when x is less than 'c'. It's like a line that's "hinged" at point 'c' and only rises to the left of it.
Example to Understand Hinge Functions:
Let’s say our feature ‘x’ is temperature, and we choose a knot ‘c’ = 20 degrees Celsius.
- Hinge function 1: max(0, temperature - 20)
  - If temperature is 15°C (less than 20), max(0, 15 - 20) = max(0, -5) = 0. The function's output is 0.
  - If temperature is 25°C (greater than 20), max(0, 25 - 20) = max(0, 5) = 5. The function's output is 5.
  - This hinge function only becomes active (non-zero) when the temperature is above 20°C.
- Hinge function 2: max(0, 20 - temperature)
  - If temperature is 15°C (less than 20), max(0, 20 - 15) = max(0, 5) = 5. The function's output is 5.
  - If temperature is 25°C (greater than 20), max(0, 20 - 25) = max(0, -5) = 0. The function's output is 0.
  - This hinge function only becomes active (non-zero) when the temperature is below 20°C.
By using pairs of these hinge functions around different knot points, MARS can create different linear segments and piece them together.
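A quick way to see this numerically is to compute a hinge pair directly. The short sketch below uses plain NumPy (no MARS library needed) to reproduce the temperature example with a knot at 20:

import numpy as np

temperatures = np.array([10, 15, 20, 25, 30])  # input feature x
knot = 20                                      # hinge point c

hinge_right = np.maximum(0, temperatures - knot)  # max(0, x - c): active above the knot
hinge_left = np.maximum(0, knot - temperatures)   # max(0, c - x): active below the knot

print(hinge_right)  # [0 0 0 5 10]
print(hinge_left)   # [10 5 0 0 0]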
2. Building the MARS Model: Sum of Basis Functions
A MARS model is essentially a sum of these basis functions (hinge functions) and a constant term. For a single input feature ‘x’, a MARS model might look like this:
\hat{y} = \beta_0 + \beta_1 \times \text{bf}_1(x) + \beta_2 \times \text{bf}_2(x) + \beta_3 \times \text{bf}_3(x) + ...
Where:
- ŷ (y-hat) is the predicted value of ‘y’.
- β0 (beta_0) is the intercept term (like ‘b’ in linear regression).
- β1, β2, β3, … (beta_1, beta_2, beta_3, …) are the coefficients for each basis function. These are learned by the model.
- bf1(x), bf2(x), bf3(x), … are basis functions, which are hinge functions (or sometimes just ‘x’ itself, for linear parts).
Example of MARS model with hinge functions:
Imagine we want to model crop yield (y) based on rainfall (x). MARS might create a model like:
\hat{y} = 10 + 0.5 \times \max(0, \text{rainfall} - 30) - 0.2 \times \max(0, 60 - \text{rainfall}) + 0.1 \times \text{rainfall}
- Knot points: Here, we have knots at 30mm and 60mm of rainfall.
- Interpretation:
- 0.5 × max(0, rainfall - 30): for every mm of rainfall above 30mm, predicted yield increases by 0.5 units.
- -0.2 × max(0, 60 - rainfall): this term is active whenever rainfall is below 60mm; for every mm that rainfall falls short of 60mm, predicted yield drops by 0.2 units. The negative coefficient combined with the (60 - rainfall) form means low rainfall is penalized.
- 0.1 × rainfall: a purely linear component; yield increases by 0.1 units for every mm of rainfall across the whole range.
- 10: the base yield (intercept).
This model combines linear and hinged segments to capture a non-linear relationship between rainfall and yield.
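To make the arithmetic concrete, the sketch below evaluates this illustrative model at a few rainfall levels. The coefficients (10, 0.5, -0.2, 0.1) and knots (30mm, 60mm) are the made-up numbers from the example above, not fitted values.

import numpy as np

def predicted_yield(rainfall):
    # Example MARS-style model from the text: intercept + two hinge terms + a linear term
    return (10
            + 0.5 * np.maximum(0, rainfall - 30)
            - 0.2 * np.maximum(0, 60 - rainfall)
            + 0.1 * rainfall)

for mm in [20, 40, 60, 80]:
    print(f"rainfall = {mm} mm -> predicted yield = {predicted_yield(mm):.1f}")
# rainfall = 20 -> 10 + 0   - 0.2*40 + 2 = 4.0
# rainfall = 40 -> 10 + 5   - 0.2*20 + 4 = 15.0
# rainfall = 60 -> 10 + 15  - 0      + 6 = 31.0
# rainfall = 80 -> 10 + 25  - 0      + 8 = 43.0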
3. Forward and Backward Stepwise Selection: Finding the Best Basis Functions
The MARS algorithm uses a two-phase approach to automatically build the model and select the best basis functions:
- Forward Pass (Building Phase):
- Start with a simple model (just the intercept β0).
- Iteratively add basis functions (hinge function pairs and potentially linear terms ‘xᵢ’) to the model.
- At each step, choose the pair of basis functions that most improves the model’s fit to the training data (usually by minimizing the Mean Squared Error, MSE).
- Continue adding pairs until a maximum number of basis functions is reached or further additions don’t significantly improve the fit.
- Backward Pass (Pruning Phase):
- Start with the complex model built in the forward pass (which might be too complex and overfit).
- Iteratively remove the least effective basis function from the model.
- Effectiveness is usually judged by how much removing a basis function increases the model’s error (e.g., using a Generalized Cross-Validation, GCV, criterion, which estimates the model’s performance on unseen data).
- Continue removing basis functions until the GCV score starts to increase (meaning removing more functions would hurt the model’s predictive power).
Key Aspects of Forward and Backward Passes:
- Knot Placement: Knot points ‘c’ for hinge functions are chosen based on the input feature values in the training data. MARS tries knots at different data points to find the best locations for hinges.
- Interactions: MARS can also create basis functions that represent interactions between features. For example, if you have features x₁ and x₂, MARS can create basis functions like bf(x₁) × bf(x₂) to model how the combined effect of x₁ and x₂ on 'y' might differ from their individual effects.
- Generalized Cross-Validation (GCV): GCV is used to guide the backward pruning step and helps prevent overfitting. It estimates the model's out-of-sample error and helps select a model complexity that generalizes well.
In summary, MARS uses hinge functions as building blocks and a stepwise forward-backward approach to automatically construct a flexible regression model that can capture non-linearities and feature interactions in your data.
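The following is a heavily simplified, illustrative sketch of the forward-pass idea for a single feature: for each candidate knot taken from the observed data values, fit a least-squares model on the corresponding hinge pair and keep the knot with the lowest training error. Real MARS implementations repeat this search across all features and all terms already in the model, consider interaction candidates, and then prune with GCV; none of that is shown here.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 200)
y = np.where(x < 40, 2.0 * x, 80 + 0.2 * (x - 40)) + rng.normal(0, 5, 200)  # true kink at x = 40

best_knot, best_mse = None, np.inf
for c in np.unique(x):  # candidate knots at observed data values
    # Design matrix: intercept + hinge pair around candidate knot c
    X = np.column_stack([np.ones_like(x),
                         np.maximum(0, x - c),
                         np.maximum(0, c - x)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit of the coefficients
    mse = np.mean((y - X @ beta) ** 2)
    if mse < best_mse:
        best_knot, best_mse = c, mse

print(f"Best knot found: {best_knot:.1f} (true kink at 40), training MSE: {best_mse:.2f}")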
Prerequisites and Preprocessing for MARS
Before applying MARS, there are certain prerequisites and preprocessing considerations to keep in mind.
Assumptions of MARS:
MARS is more flexible than linear regression and makes fewer strict assumptions, but it’s still helpful to understand its underlying assumptions and when it works best.
- Data should be primarily additive and linear within segments: MARS models relationships as piecewise linear segments connected at knots. While it can capture non-linear curves, it is fundamentally built from linear segments. If the true underlying relationship is drastically non-linear even within localized regions, MARS may need many basis functions to approximate it well, making the model less interpretable and more prone to overfitting.
- Features are relevant: MARS is good at selecting knots and building basis functions for relevant features, but it assumes the input features you provide are at least somewhat related to the target variable. If you include purely random noise features, MARS might still try to fit them, adding unnecessary complexity. Feature selection or domain knowledge to pre-select relevant features can still be beneficial.
- Smoothness (implied): While MARS is piecewise linear, the resulting function is continuous. It implicitly assumes a degree of smoothness in the underlying relationship, even if it is not perfectly smooth. Extremely discontinuous or highly erratic relationships are harder for MARS to model effectively.
Testing Assumptions (Less Formal for MARS, More about Data Understanding):
For MARS, formal assumption tests like those used for linear regression are less directly applicable or commonly performed. Instead, focus on data understanding and visualization:
- Scatter Plots and Pair Plots: Examine scatter plots of each feature against the target variable and pair plots of features against each other. Look for patterns:
- Non-linear patterns: If you see curves, bends, or changing slopes in scatter plots of features vs. target, MARS might be a good choice to capture this non-linearity.
- Interactions: Pair plots can give hints about potential interactions between features. If the relationship between one feature and the target changes depending on the value of another feature, MARS might be able to model this.
- Residual Analysis (After Initial MARS Model): After fitting a MARS model, analyze the residuals (actual y - predicted ŷ):
- Residual Plots vs. Predicted Values or Features: Plot residuals against predicted values or against each input feature. Ideally, residuals should be randomly scattered around zero, showing no systematic patterns. Patterns in residuals might indicate that the MARS model hasn't fully captured the relationships, and you might need to adjust hyperparameters or consider other modeling approaches.
- Histogram/Q-Q Plot of Residuals: Check the distribution of residuals for approximate normality (though normality is less critical for prediction itself, it can be relevant for statistical inference if you were to perform it with MARS, which is less common in typical MARS applications).
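As a sketch of such a residual check, assuming y_test holds the true targets and y_pred the predictions of a fitted MARS model on a held-out test set (hypothetical names here; the implementation section later produces equivalent variables):

import matplotlib.pyplot as plt

residuals = y_test - y_pred  # actual minus predicted

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(y_pred, residuals, alpha=0.6)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted value')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs. predictions (look for patterns)')

axes[1].hist(residuals, bins=30)
axes[1].set_xlabel('Residual')
axes[1].set_title('Residual distribution')

plt.tight_layout()
plt.show()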
Python Libraries Required for Implementation:
- numpy: for numerical computations, especially array operations.
- pandas: for data manipulation and analysis, working with DataFrames.
- py-earth: a Python implementation of MARS (the Earth estimator), installable as the sklearn-contrib-py-earth package.
- scikit-learn (sklearn): for data splitting (train_test_split), preprocessing (StandardScaler, etc.), and evaluation metrics (mean_squared_error, r2_score).
- matplotlib and seaborn: for data visualization.
Note on Library Choice: While scikit-learn is the most widely used general-purpose machine learning library in Python, it does not natively include MARS. You'll need a dedicated implementation such as py-earth (a scikit-learn-contrib project whose Earth estimator follows the familiar fit/predict API) for MARS modeling.
Data Preprocessing for MARS: Scaling - Often Not Critical, But Can Be Helpful
Data Scaling (Standardization or Normalization) is generally not as critical for MARS as it is for algorithms like Gradient Descent Regression, Lasso, or Ridge Regression. MARS is somewhat less sensitive to feature scaling than those methods, but scaling can still be beneficial in certain situations and is often considered good practice.
Why Scaling Can Be Helpful for MARS:
- Improved Numerical Stability (Potentially): If you have features with vastly different scales, and especially if some features have very large values, scaling (like standardization to mean 0 and standard deviation 1) can sometimes improve the numerical stability of the MARS algorithm, particularly during the basis function selection and model fitting process. This is less of a major concern for MARS compared to optimization-based methods like Gradient Descent, but still a possible benefit.
- Fairer Feature Importance (Potentially, but interpret coefficients carefully): While MARS is not as directly coefficient-based as linear regression, if you are interested in trying to compare the “importance” or influence of different features based on MARS model outputs, scaling features to a common scale can make the relative contributions of features slightly more comparable (although MARS feature importance is often assessed through other methods, like basis function usage, rather than solely by coefficient magnitudes).
When Scaling Can Be Ignored (or is Less Critical) for MARS:
- MARS is Designed to Handle Different Scales to Some Extent: MARS’s basis function approach is inherently somewhat less sensitive to feature scales compared to methods where scaling is essential. MARS can create hinge functions and linear segments tailored to the range and distribution of each feature individually. It’s not as heavily reliant on gradient-based optimization that is directly affected by scale.
- If Features Have Naturally Meaningful Scales: If your features are already measured in units that are meaningful and interpretable in your domain, and you don’t need to compare coefficient magnitudes across features for feature importance, you might choose to skip scaling. For example, if all your features are already percentages, ratios, or counts on a roughly similar scale, scaling might be less crucial.
Types of Scaling (Standardization is generally recommended if scaling is done):
- Standardization (Z-score scaling): transforms features to have a mean of 0 and a standard deviation of 1. If you choose to scale for MARS, standardization is a reasonable, general-purpose choice.
  x' = (x - μ) / σ
- Normalization (Min-Max scaling): scales features to a specific range, typically [0, 1]. Less commonly used for MARS than standardization, but can be considered depending on your data and requirements.
  x' = (x - min(x)) / (max(x) - min(x))
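As a quick numerical illustration of the two formulas on a toy feature (hypothetical square-footage values, plain NumPy):

import numpy as np

x = np.array([800., 1200., 2000., 3000.])        # e.g. square footage values

standardized = (x - x.mean()) / x.std()           # z-score: mean 0, std 1
normalized = (x - x.min()) / (x.max() - x.min())  # min-max: rescaled to [0, 1]

print(standardized)  # approximately [-1.13, -0.65, 0.30, 1.49]
print(normalized)    # approximately [0.0, 0.18, 0.55, 1.0]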
In Summary:
- Data scaling is generally not strictly required for MARS to work correctly and produce a valid model. MARS is more robust to feature scales than some other algorithms.
- Scaling (especially standardization) can be helpful and is often recommended as good practice for MARS preprocessing because it can potentially:
- Improve numerical stability, especially with features on very different scales.
- Make feature contributions slightly more comparable if you are trying to interpret feature importance (though interpret cautiously).
- If you are unsure, it’s often safer to apply standardization before using MARS. It’s less likely to hurt performance and might provide some benefits.
- If you are working with features that already have naturally meaningful and comparable scales, and computational stability isn’t a concern, you might consider skipping scaling.
Recommendation: For most MARS applications, applying Standardization (using StandardScaler in scikit-learn) to your numerical features is a reasonable and recommended preprocessing step. It’s a generally safe and often beneficial practice, even though it’s not as essential for MARS as for some other algorithms.
Implementation Example: House Price Prediction with MARS
Let’s implement MARS Regression in Python for house price prediction, using dummy data with a non-linear relationship and feature interaction. We’ll use the py-earth library (its Earth estimator implements MARS).
1. Dummy Data Creation (with non-linearity and interaction):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
# Set seed for reproducibility
np.random.seed(42)
# Generate dummy data with non-linear relationship and interaction
n_samples = 200
square_footage = np.random.randint(800, 3000, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
location_quality = np.random.randint(1, 11, n_samples) # Scale 1-10
# Create price with non-linear effects and interaction (interaction between sqft and location)
price = 100000 + \
        200 * square_footage + \
        (square_footage**2) / 1000 + \
        5000 * bedrooms + \
        50000 * location_quality + \
        (square_footage * location_quality) / 50 - \
        np.random.normal(0, 40000, n_samples)
# Create Pandas DataFrame
data = pd.DataFrame({
    'SquareFootage': square_footage,
    'Bedrooms': bedrooms,
    'LocationQuality': location_quality,
    'Price': price
})
# Split data into training and testing sets
X = data[['SquareFootage', 'Bedrooms', 'LocationQuality']] # Features
y = data['Price'] # Target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
feature_names = X.columns # Store feature names
print("Training data shape:", X_train.shape, y_train.shape)
print("Testing data shape:", X_test.shape, y_test.shape)
print("\nFirst 5 rows of training data:\n", X_train.head())
This code generates dummy data for house prices with three features. The ‘Price’ is created with non-linear components (squared term of ‘SquareFootage’) and an interaction term between ‘SquareFootage’ and ‘LocationQuality’, to demonstrate MARS’s capability to handle such relationships.
2. Data Scaling (Standardization - Optional, but good practice):
# Scale features using StandardScaler (optional, but often recommended)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=feature_names, index=X_train.index) # for easier viewing
print("\nScaled Training Data (first 5 rows):\n", X_train_scaled_df.head())
We apply Standardization to the features using StandardScaler. As discussed, this is optional but often recommended preprocessing for MARS.
3. MARS Model Training (using py-earth):
from pyearth import Earth  # py-earth's MARS estimator (install via: pip install sklearn-contrib-py-earth)
# Train MARS model
mars_model = Earth()  # Instantiate the Earth (MARS) estimator with default hyperparameters
mars_model.fit(X_train_scaled, y_train)  # Fit MARS model on scaled training data and target
print("\nMARS Model Trained.")
We import the Earth class from py-earth and train a MARS model on the scaled training data, keeping the default hyperparameters for now.
4. Make Predictions and Evaluate:
# Make predictions on test set
y_pred_test_mars = mars_model.predict(X_test_scaled)
# Evaluate performance
mse_mars = mean_squared_error(y_test, y_pred_test_mars)
r2_mars = r2_score(y_test, y_pred_test_mars)
print(f"\nMARS Regression - Test Set MSE: {mse_mars:.2f}")
print(f"MARS Regression - Test Set R-squared: {r2_mars:.4f}")
# Inspect basis functions and model summary
print("\nMARS Model Summary (Basis Functions and Coefficients):")
print(mars_model.summary())  # py-earth's Earth model exposes a summary() listing basis functions and coefficients
We make predictions on the test set using the trained MARS model and evaluate its performance using MSE and R-squared. We also print a summary of the model: py-earth's summary() lists the selected basis functions, whether each was pruned, and their coefficients.
Understanding Output - MARS Model Summary:
When you run the code, pay attention to the “MARS Model Summary” output. It will typically show:
- Basis Functions: A list of the basis functions that MARS has selected to build the model. These will be in the form of hinge functions max(0, feature - knot) or max(0, knot - feature), linear terms (just feature names), and possibly interaction terms (products of basis functions).
- Coefficients: The coefficients (β values) associated with each basis function and the intercept. These are the weights MARS has learned for each basis function to minimize the error and fit the data.
By examining the basis functions and coefficients, you can gain insights into:
- Non-linear effects: The presence of hinge functions indicates that MARS has identified non-linear segments in the relationship between features and the target. The knots in hinge functions show where these segments “hinge” or change slope.
- Feature Interactions: If you see basis functions that are products of two or more features, it means MARS has detected and modeled interactions between those features.
- Relative Importance (Indirect): While not directly a feature importance ranking, you can get some sense of feature influence by looking at how many basis functions involve each feature and the magnitudes of their coefficients. Features involved in more basis functions and with larger coefficients are likely to be more influential in the MARS model.
Saving and Loading the MARS Model and Scaler:
import joblib
# Save the trained MARS model and scaler
joblib.dump(mars_model, 'mars_regression_model.joblib')
joblib.dump(scaler, 'scaler_mars.joblib')
print("\nMARS Regression model and scaler saved to 'mars_regression_model.joblib' and 'scaler_mars.joblib'")
# To load them later:
loaded_mars_model = joblib.load('mars_regression_model.joblib')
loaded_scaler = joblib.load('scaler_mars.joblib')
# Now you can use loaded_mars_model for prediction on new data after preprocessing with loaded_scaler.
We save the trained MARS model and the scaler using joblib for later reuse.
Post-Processing: Interpreting Basis Functions and Feature Importance
Interpreting Basis Functions and Coefficients:
MARS models, while more flexible than simple linear models, still offer a degree of interpretability through their basis functions and coefficients. Understanding these components is key to post-processing and gaining insights.
- Basis Functions as Segments: Each basis function (hinge function, linear term, interaction term) represents a segment or component of the overall non-linear relationship learned by MARS.
- Hinge Functions: max(0, feature - knot) and max(0, knot - feature) define linear segments that activate above or below a specific knot value. The coefficients associated with hinge functions tell you the slope of the relationship within that segment.
- Linear Terms (Original Features 'xᵢ'): If MARS includes a feature 'xᵢ' directly as a basis function, it indicates a linear component of the relationship with the target variable.
- Interaction Terms (Products of Basis Functions): Basis functions that are products of two or more simpler basis functions represent interactions between features. For instance, bf(x₁) × bf(x₂) captures how the combined effect of x₁ and x₂ might differ from their individual linear effects.
- Coefficient Interpretation (β values): The coefficients (β values) associated with each basis function quantify the contribution of that specific segment or interaction to the predicted value.
- Magnitude: The magnitude of a coefficient reflects the strength of the basis function’s influence on the prediction. Larger magnitudes (absolute values) indicate a stronger effect of that particular segment or interaction.
- Sign: The sign (+ or -) of a coefficient indicates the direction of the effect. A positive sign means an increase in the basis function’s value leads to an increase in the predicted target, and vice versa for a negative sign.
Example Interpretation (from Model Summary):
Suppose your MARS model summary includes basis functions and coefficients like this (simplified example):
Basis Function Coefficient
--------------------------------------- ------------
(Intercept) 200000
bf1 = max(0, SquareFootage - 1500) 200
bf2 = max(0, 2500 - SquareFootage) -100
bf3 = max(0, LocationQuality - 7) 30000
Interaction_bf1_bf3 = bf1 * bf3 50
- Intercept (200000): Base price, if all basis functions are zero.
- bf1 (max(0, SquareFootage - 1500)) with coefficient 200: For every sq ft above 1500 sq ft, price increases by $200.
- bf2 (max(0, 2500 - SquareFootage)) with coefficient -100: for every sq ft by which the house falls below 2500 sq ft, price drops by $100. Note that between 1500 and 2500 sq ft both bf1 and bf2 are non-zero, so the net slope in that range combines the two terms.
- bf3 (max(0, LocationQuality - 7)) with coefficient 30000: For every unit of LocationQuality above 7, price increases by $30000.
- Interaction_bf1_bf3 (bf1 * bf3) with coefficient 50: An interaction effect. When both ‘SquareFootage’ is above 1500 and ‘LocationQuality’ is above 7, there’s an additional increase in price of $50 for every unit of (bf1 * bf3), capturing a synergistic effect.
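To see how these pieces combine into a single prediction, the sketch below plugs a hypothetical house (2000 sq ft, LocationQuality 9) into this simplified example model. The knots and coefficients are the illustrative numbers from the table above, not output from a fitted model.

# Illustrative only: knots and coefficients come from the simplified example table above.
def example_mars_price(square_footage, location_quality):
    bf1 = max(0, square_footage - 1500)
    bf2 = max(0, 2500 - square_footage)
    bf3 = max(0, location_quality - 7)
    interaction = bf1 * bf3
    return 200000 + 200 * bf1 - 100 * bf2 + 30000 * bf3 + 50 * interaction

# For 2000 sq ft, LocationQuality 9:
# bf1 = 500, bf2 = 500, bf3 = 2, interaction = 1000
# 200000 + 100000 - 50000 + 60000 + 50000 = 360000
print(example_mars_price(2000, 9))  # 360000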
Feature Importance (Indirectly from Basis Function Usage):
MARS doesn’t directly provide a single “feature importance” score for each original feature in the same way as some tree-based models. However, you can infer feature importance indirectly by analyzing the basis functions:
- Frequency of Feature Appearance: Count how many basis functions involve each original feature. Features that appear in more basis functions (especially in main effect basis functions, not just interactions) are likely to be more influential in the MARS model.
- Coefficient Magnitudes of Basis Functions Involving a Feature: Look at the magnitudes of the coefficients associated with basis functions that include a particular feature. Larger coefficient magnitudes for basis functions related to a feature suggest a stronger overall influence of that feature.
- GCV Importance (if the MARS library provides it): Some MARS implementations provide a measure of “GCV Importance” for each feature, which estimates how much the Generalized Cross-Validation score improves when basis functions involving that feature are added to the model. This can be a more direct measure of feature importance from a predictive perspective.
Example: Post-processing to analyze Basis Functions (Conceptual):
You would need to access the basis function representation from your MARS library (for example, py-earth exposes the fitted basis via the model's basis_ attribute and a textual summary via summary()) and then analyze it programmatically.
# (Conceptual code - library-dependent; get_features_from_basis_function is a hypothetical
#  helper you would write for your library's basis function objects)
basis_functions_summary = mars_model.basis_  # e.g. py-earth stores fitted basis functions in basis_

feature_basis_function_counts = {}  # Count how many basis functions each feature appears in
for bf in basis_functions_summary:  # Iterate through basis functions
    features_involved = get_features_from_basis_function(bf)  # Hypothetical: extract original feature names from the basis function
    for feature in features_involved:
        feature_basis_function_counts[feature] = feature_basis_function_counts.get(feature, 0) + 1

print("\nFeature Basis Function Counts (Rough Importance Indicator):")
for feature, count in feature_basis_function_counts.items():
    print(f"{feature}: {count}")

# ... (Similarly, analyze coefficients, or GCV-based importance if available from your library) ...
Limitations of Feature Importance Interpretation in MARS:
- Basis Function Complexity: Interpreting feature importance in MARS is less straightforward than in linear regression (with single coefficients per feature) or tree-based models (with explicit feature importance scores). MARS uses basis functions that are combinations of features, and importance is distributed across these functions.
- Correlation vs. Causation: Feature importance inferred from MARS, like any statistical model, reflects correlation, not necessarily causation. It indicates which features are important for prediction within the MARS model, but not necessarily causal relationships in the real world.
- Model-Specific Importance: Feature importance is model-dependent. Importance from a MARS model reflects importance within the MARS model’s structure and assumptions. Different models (e.g., different hyperparameter settings for MARS, or different types of models altogether) might rank feature importance differently.
Despite these limitations, analyzing basis functions and coefficients in MARS provides valuable insights into the non-linear relationships and feature interactions captured by the model.
Hyperparameter Tuning in MARS
MARS models have hyperparameters that control their complexity and flexibility. Tuning these hyperparameters is crucial to prevent overfitting and optimize model performance. Key hyperparameters in MARS typically include:
1. max_terms (Maximum Number of Basis Functions):
- Effect: Controls the maximum number of basis functions (hinge functions, linear terms, interaction terms) that MARS is allowed to include in the final model.
  - Small max_terms: Limits model complexity. Fewer basis functions are used. The model might be simpler, more interpretable, and less prone to overfitting (lower variance), but it might underfit if the true relationship is complex (higher bias).
  - Large max_terms: Allows a more complex model. More basis functions can be added, letting MARS capture more intricate non-linearities and interactions. The model can fit the training data very well, but it is more prone to overfitting (higher variance) if max_terms is too large, especially with limited data.
- Tuning: max_terms is a primary hyperparameter to tune. You need to find a balance between model complexity and generalization performance. Tune max_terms using cross-validation (or a validation-set approach): try a range of values (e.g., from a small number like 5 or 10 up to a larger number, depending on your dataset size and complexity), evaluate performance (e.g., MSE, RMSE, R-squared) for each value, and choose the one with the best validation performance.
2. degree (Maximum Interaction Degree):
- Effect: Controls the maximum degree of interaction allowed in the model (py-earth exposes this parameter as max_degree).
  - degree=1: No interaction terms are allowed. The model only includes main-effect basis functions (hinge functions and linear terms of individual features); this is equivalent to an additive spline model.
  - degree=2: Allows two-way interactions (basis functions that are products of up to two basis functions involving different features), capturing pairwise interactions.
  - degree=3 (and higher): Allows higher-order interactions (three-way, four-way, etc.). The model becomes more complex and can capture more intricate interactions, but the risk of overfitting grows, especially with limited data, and higher degrees are computationally more expensive.
- Tuning: degree is another important hyperparameter. Typically, degree=1 or degree=2 are common choices. Start with degree=1 and consider increasing to 2 or 3 if you suspect important interactions and have enough data to support a more complex model without overfitting. Tune degree in combination with max_terms using cross-validation to find the optimal combination.
3. penalty (Penalty for Model Complexity - Pruning):
- Effect: Controls the severity of the complexity penalty used in the GCV criterion during the backward pruning phase. A higher penalty encourages more aggressive pruning and leads to simpler models with fewer basis functions.
  - Higher penalty: Stronger pruning. MARS tends to remove more basis functions, resulting in a simpler model with fewer terms. This can help prevent overfitting, but may lead to underfitting if the penalty is too high.
  - Lower penalty: Weaker pruning. MARS retains more basis functions, giving a more complex model with more flexibility to fit the training data, but a higher risk of overfitting.
- Tuning: penalty can also be tuned using cross-validation. Try a range of penalty values and see how they affect performance; the exact interpretation and default are library-dependent (py-earth defaults to penalty=3.0, and Friedman's paper suggests a value around 2 for purely additive models and around 3 when interactions are allowed).
Hyperparameter Tuning Implementation (manual grid search with cross-validation):
Because py-earth's Earth estimator follows the scikit-learn API, you can usually plug it straight into GridSearchCV. The loop below performs the equivalent search manually so each step is explicit; adapt the parameter names and ranges if your MARS library's interface differs.
# (Manual grid-search tuning with 5-fold cross-validation; parameter names follow py-earth)
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
from pyearth import Earth

# Define hyperparameter grid to search
param_grid = {
    'max_terms': [10, 20, 30, 40],  # Example values
    'max_degree': [1, 2],           # Example values (maximum interaction degree)
    'penalty': [1, 2, 3]            # Example values (GCV complexity penalty)
}

best_mse = float('inf')
best_params = None
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # 5-fold CV

for max_terms in param_grid['max_terms']:
    for max_degree in param_grid['max_degree']:
        for penalty in param_grid['penalty']:
            current_params = {'max_terms': max_terms, 'max_degree': max_degree, 'penalty': penalty}
            fold_mse_scores = []
            for train_index, val_index in kf.split(X_train_scaled):  # X_train_scaled, y_train from earlier
                X_train_fold, X_val_fold = X_train_scaled[train_index], X_train_scaled[val_index]
                y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
                mars_model_tune = Earth(**current_params)  # Instantiate MARS with current hyperparameters
                mars_model_tune.fit(X_train_fold, y_train_fold)
                y_pred_val_fold = mars_model_tune.predict(X_val_fold)
                fold_mse_scores.append(mean_squared_error(y_val_fold, y_pred_val_fold))
            avg_mse = np.mean(fold_mse_scores)  # Average validation MSE across folds
            print(f"Params: {current_params}, Avg. Validation MSE: {avg_mse:.2f}")
            if avg_mse < best_mse:
                best_mse = avg_mse
                best_params = current_params

print(f"\nBest Hyperparameters found by Cross-Validation: {best_params}, Best Validation MSE: {best_mse:.2f}")

# Train final model with best hyperparameters on the entire training set
best_mars_model = Earth(**best_params)
best_mars_model.fit(X_train_scaled, y_train)

# Evaluate best model on test set
y_pred_test_best_mars = best_mars_model.predict(X_test_scaled)
mse_test_best_mars = mean_squared_error(y_test, y_pred_test_best_mars)
r2_test_best_mars = r2_score(y_test, y_pred_test_best_mars)
print(f"\nBest MARS Model - Test Set MSE: {mse_test_best_mars:.2f}, R-squared: {r2_test_best_mars:.4f}")
This code performs a manual grid search with 5-fold cross-validation over max_terms, max_degree, and penalty. Adapt the parameter names, ranges, and performance metric to your MARS library if its interface differs.
Accuracy Metrics for MARS Regression
The accuracy metrics used to evaluate MARS Regression are the standard regression metrics, same as for Linear Regression, Lasso Regression, and Gradient Descent Regression.
Common Regression Accuracy Metrics (Reiterated):
- Mean Squared Error (MSE): Average of squared errors. Lower is better. Sensitive to outliers.
  MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
- Root Mean Squared Error (RMSE): Square root of MSE. Lower is better. Same units as the target; sensitive to outliers.
  RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
- Mean Absolute Error (MAE): Average of absolute errors. Lower is better. More robust to outliers.
  MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
- R-squared (R²): Coefficient of Determination; the fraction of variance in the target explained by the model. Higher (closer to 1) is better. Unitless.
  R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
Choosing Metrics for MARS Evaluation:
- MSE/RMSE: Common for evaluating MARS, especially when you want to penalize larger errors and are interested in overall prediction accuracy. RMSE is often preferred for its interpretability in original units.
- MAE: Use if you want a metric that is less sensitive to outliers, focusing on average error magnitude.
- R-squared: Useful for understanding how much of the variance in the target variable is captured by the MARS model.
When comparing MARS to other regression models, evaluate performance using one or more of these metrics on a held-out test set or through cross-validation to get a reliable estimate of generalization performance.
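For example, reusing y_test and y_pred_test_mars from the implementation section above, all four metrics can be computed with scikit-learn (MAE via mean_absolute_error):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, y_pred_test_mars)
rmse = np.sqrt(mse)                                   # same units as the target (dollars here)
mae = mean_absolute_error(y_test, y_pred_test_mars)   # less sensitive to outliers
r2 = r2_score(y_test, y_pred_test_mars)               # fraction of variance explained

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}, MAE: {mae:.2f}, R²: {r2:.4f}")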
Model Productionizing MARS Regression
Productionizing a MARS Regression model involves deploying it to make predictions in a real-world application. Steps are similar to productionizing other regression models:
1. Local Testing and Deployment (Python Script/Application):
- Python Script: Deploy as a Python script for local use or batch predictions.
  - Load Model and Scaler: Load the saved mars_regression_model.joblib and scaler_mars.joblib files.
  - Load New Data: Load the new data you want predictions for.
  - Preprocess New Data: Apply scaling using the loaded scaler.
  - Make Predictions: Use the loaded MARS model to predict.
  - Output Results: Output the predictions.
Code Snippet (Conceptual Local Deployment):
import joblib
import pandas as pd
import numpy as np

# Load trained MARS model and scaler
loaded_mars_model = joblib.load('mars_regression_model.joblib')
loaded_scaler = joblib.load('scaler_mars.joblib')

def predict_house_price_mars(input_data_df):  # Input data as DataFrame
    scaled_input_data = loaded_scaler.transform(input_data_df)  # Scale using loaded scaler
    predicted_prices = loaded_mars_model.predict(scaled_input_data)  # Predict using loaded MARS model
    return predicted_prices

# Example usage with new house data
new_house_data = pd.DataFrame({
    'SquareFootage': [3200, 1600],
    'Bedrooms': [4, 2],
    'LocationQuality': [9, 6]
})
predicted_prices_new = predict_house_price_mars(new_house_data)
for i in range(len(new_house_data)):
    sqft = new_house_data['SquareFootage'].iloc[i]
    predicted_price = predicted_prices_new[i]
    print(f"Predicted price for {sqft} sq ft house: ${predicted_price:,.2f}")
# ... (Further actions) ...
- Application Integration: Embed prediction logic into a larger application.
2. On-Premise and Cloud Deployment (API for Real-time Predictions):
For real-time prediction needs, deploy MARS as an API:
- API Framework (Flask, FastAPI): Create a web API using Python frameworks.
- API Endpoint: Define an endpoint (e.g., /predict_house_price_mars) to receive input data.
- Prediction Logic in API Endpoint (a minimal Flask sketch follows after this list):
- Load MARS model and scaler.
- Preprocess API request data using the loaded scaler.
- Make predictions with the loaded MARS model.
- Return predictions in the API response (JSON).
- Server Deployment: Deploy the API app on servers (on-premise or cloud).
- Cloud ML Platforms: Use cloud ML services (AWS SageMaker, Azure ML, Google AI Platform) to deploy and manage the MARS model API.
- Serverless Functions: For event-driven predictions or lightweight APIs, consider serverless functions.
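A minimal sketch of such a prediction endpoint with Flask might look like the following. The route name, JSON fields, and file names reuse the ones from earlier in this article; input validation, error handling, and authentication are omitted for brevity.

import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the artifacts once at startup, not on every request
mars_model = joblib.load('mars_regression_model.joblib')
scaler = joblib.load('scaler_mars.joblib')

@app.route('/predict_house_price_mars', methods=['POST'])
def predict_house_price_mars():
    payload = request.get_json()  # e.g. {"SquareFootage": 2000, "Bedrooms": 3, "LocationQuality": 8}
    input_df = pd.DataFrame([payload], columns=['SquareFootage', 'Bedrooms', 'LocationQuality'])
    scaled = scaler.transform(input_df)            # same preprocessing as at training time
    prediction = mars_model.predict(scaled)
    return jsonify({'predicted_price': float(prediction[0])})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)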
Productionization Considerations Specific to MARS:
- Basis Function Interpretability: Leverage MARS’s basis functions for interpretability in production. You might log and analyze basis function contributions to understand predictions.
- Computational Cost: MARS can be more computationally expensive than simple linear regression, especially for very large datasets or high max_terms and interaction-degree settings. Consider performance testing and optimization if latency is critical.
- Model Updates and Retraining: Plan for model retraining as data changes. Monitor model performance in production and retrain periodically.
Conclusion: MARS - Flexible Regression for Non-Linear Worlds
Multivariate Adaptive Regression Splines (MARS) offer a powerful and flexible approach to regression modeling, particularly when dealing with non-linear relationships and feature interactions. MARS bridges the gap between simple linear models and complex “black box” methods, providing a degree of interpretability along with non-linear modeling capability.
Real-World Problem Solving with MARS:
- Capturing Non-linearities: MARS excels at modeling datasets where the relationship between features and the target variable is not simply linear. It uses piecewise linear segments to approximate curves and complex shapes, going beyond the limitations of standard linear regression.
- Handling Feature Interactions: MARS can automatically discover and model interactions between features, capturing synergistic or modulating effects where the influence of one feature depends on the level of another.
- Interpretability (Relative to Complex Models): While more complex than linear regression, MARS models are still more interpretable than many “black box” models like neural networks. The basis functions and coefficients provide insights into how the model is making predictions and which segments and interactions are important.
- Feature Selection and Variable Importance: MARS, through its basis function selection process, implicitly performs a form of variable selection, identifying relevant features and interactions. While not as explicit as Lasso, it offers a way to understand which features are more influential in the non-linear model.
Limitations and Alternatives:
- Piecewise Linear Approximation: MARS models relationships as piecewise linear. If the true underlying relationships are extremely smooth and non-linear in a way that’s not well approximated by piecewise linear segments, other methods might be more appropriate.
- Interpretability Trade-off: While more interpretable than some complex models, MARS is still less straightforward to interpret than simple linear regression. Understanding the combined effects of multiple basis functions can require careful analysis.
- Computational Cost (Potentially Higher than Linear Regression): Training and using MARS models can be more computationally intensive than standard linear regression, especially with large datasets and higher hyperparameter settings.
Optimized and Newer Algorithms/Techniques:
- Generalized Additive Models (GAMs): GAMs provide a broader framework for modeling non-linear relationships while retaining additivity and interpretability. MARS can be seen as a type of GAM, and other GAM techniques (e.g., smoothing splines) offer alternatives for non-linear regression.
- Tree-Based Models (Decision Trees, Gradient Boosting Machines): Tree-based models are excellent for capturing non-linearities and interactions and are often highly competitive with MARS in terms of predictive performance, and can offer different types of interpretability (e.g., feature importance from tree ensembles).
- Neural Networks: For very complex non-linear regression tasks with large datasets, neural networks offer powerful capabilities, although at the cost of interpretability.
MARS is a valuable tool for bridging the gap between linear simplicity and non-linear complexity. It’s a good choice when you need to model non-linear relationships and interactions while still seeking a degree of model interpretability.
References
- “Multivariate Adaptive Regression Splines” by Jerome H. Friedman (1991): The original research paper introducing the MARS algorithm. https://www.jstor.org/stable/2290255
- “The Elements of Statistical Learning” by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009): A comprehensive textbook with a chapter on MARS and related non-linear regression techniques. https://web.stanford.edu/~hastie/ElemStatLearn/
- “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson (2013): A practical guide to predictive modeling, including a chapter on flexible regression models, discussing MARS and other methods.
- py-earth Python Library Documentation: https://contrib.scikit-learn.org/py-earth/ (the MARS implementation used in the examples above).