Quantile Regression: Beyond the Average Prediction
Looking Beyond the Average: Introduction to Quantile Regression
Imagine you’re trying to predict house prices in your city. Traditional methods, like Ordinary Least Squares (OLS) regression, focus on predicting the average house price based on features like size, location, and number of bedrooms. While this is useful, what if you’re interested in more than just the average?
What if you want to know:
- The price of a house in the lower end of the market (say, the 25th percentile)? Maybe you’re a first-time homebuyer looking for affordable options.
- The price of a house in the higher end of the market (say, the 75th or 90th percentile)? Perhaps you’re interested in luxury homes and their pricing.
Traditional regression, focusing on the average, may not give you a complete picture of the price distribution. This is where Quantile Regression comes to the rescue.
Quantile Regression is a statistical method that allows us to model and predict conditional quantiles of the outcome variable. Instead of focusing solely on the mean, it helps us understand how predictor variables affect different parts of the outcome distribution.
Real-world examples where Quantile Regression is powerful:
- Wage Analysis: Understanding factors that influence not just average wages, but also wages at different levels (e.g., lower, median, upper percentiles). For example, how does education impact the lowest 10% of earners compared to the highest 10%?
- Healthcare Costs: Predicting healthcare expenditures at different quantiles. This is important because healthcare costs are often highly skewed, and understanding the factors driving very high costs (e.g., the 90th percentile) is crucial for policy and risk management.
- Educational Testing: Analyzing student test scores not just on average, but at different performance levels. Quantile Regression can help identify factors that disproportionately affect low-performing or high-performing students.
- Environmental Studies: Predicting pollution levels at different quantiles. For example, understanding factors leading to the highest pollution episodes is often more critical than just average pollution levels.
- Finance: Modeling Value at Risk (VaR), which is a quantile of potential losses in financial investments.
- Retail Demand Forecasting: Predicting demand quantiles for inventory management, especially for products with variable demand patterns.
In essence, if you need to understand how predictors affect not just the typical outcome but also the spread and shape of the outcome distribution, Quantile Regression is a valuable tool. It’s particularly useful when the relationship between predictors and outcome varies across different parts of the outcome’s range.
The Math Behind Quantiles: Stepping Away from Averages
Traditional Ordinary Least Squares (OLS) regression aims to minimize the sum of squared errors. This minimization process naturally leads to estimating the conditional mean of the outcome variable. In contrast, Quantile Regression is based on minimizing a different loss function, called the pinball loss function, which allows it to estimate conditional quantiles.
Let’s understand the core mathematical idea:
Quantiles and the τ-th Quantile:
A quantile divides a dataset into portions. The τ-th quantile (where τ is between 0 and 1, often expressed as a percentage, e.g., τ=0.5 for the median, τ=0.25 for the 25th percentile) is the value below which a proportion τ of the data falls.
For example, the 0.5-th quantile (median) divides the data into two halves. 50% of the data points are below the median, and 50% are above. The 0.25-th quantile (25th percentile) is the value below which 25% of the data lies.
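As a quick, purely illustrative example (the numbers are made up), NumPy can compute empirical quantiles of a sample directly:
import numpy as np

prices = np.array([120, 150, 180, 200, 240, 300, 450])  # hypothetical house prices (in $1000s)

print(np.quantile(prices, 0.25))  # 25th percentile: roughly 25% of prices fall below this value
print(np.quantile(prices, 0.50))  # median
print(np.quantile(prices, 0.90))  # 90th percentile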
The Pinball Loss Function:
The heart of Quantile Regression is the pinball loss function, also known as the tilted absolute value loss. For a given quantile level τ (0 < τ < 1), the pinball loss for a single observation is defined as:
\[L_{\tau}(y_i, \hat{y}_i(\tau)) = \begin{cases} \tau |y_i - \hat{y}_i(\tau)| & \text{if } y_i \ge \hat{y}_i(\tau) \\ (1 - \tau) |y_i - \hat{y}_i(\tau)| & \text{if } y_i < \hat{y}_i(\tau) \end{cases}\]
Let’s break this down:
- yi: The actual observed value of the outcome variable for observation i.
- ŷi(τ): The predicted τ-th quantile value for observation i. We aim to find this prediction.
- τ: The quantile level we are interested in (e.g., 0.5, 0.25, 0.90).
- |yi - ŷi(τ)|: The absolute difference between the actual value and the predicted quantile value.
How the Pinball Loss Works:
The pinball loss penalizes underestimation and overestimation differently, depending on the quantile level τ.
- For τ = 0.5 (Median Regression): The loss is symmetric. It is essentially proportional to the absolute error |yi - ŷi(0.5)|, regardless of whether we underestimate or overestimate. Minimizing this loss leads to median regression.
- For τ < 0.5 (Lower Quantiles, e.g., 0.25): The loss function puts a higher weight on overestimation errors (when yi < ŷi(τ)) than on underestimation errors (when yi > ŷi(τ)). Specifically, the weight for underestimation is τ, and for overestimation is (1 - τ). For τ=0.25, underestimation is penalized with weight 0.25, and overestimation with weight 0.75. This asymmetry pulls the regression line downward, so it estimates lower quantiles.
- For τ > 0.5 (Upper Quantiles, e.g., 0.75): The loss function puts a higher weight on underestimation errors (when yi > ŷi(τ)) than on overestimation errors (when yi < ŷi(τ)). For τ=0.75, overestimation is penalized with weight 0.25, and underestimation with weight 0.75. This asymmetry pushes the regression line upward, so it estimates upper quantiles.
Optimization in Quantile Regression:
Quantile Regression aims to find the coefficients β(τ) for a linear model such that it minimizes the sum of pinball losses over all observations:
\[\min_{\beta(\tau)} \sum_{i=1}^{n} L_{\tau}(y_i, \mathbf{x}_i^T \beta(\tau)) = \min_{\beta(\tau)} \sum_{i=1}^{n} \begin{cases} \tau |y_i - \mathbf{x}_i^T \beta(\tau)| & \text{if } y_i \ge \mathbf{x}_i^T \beta(\tau) \\ (1 - \tau) |y_i - \mathbf{x}_i^T \beta(\tau)| & \text{if } y_i < \mathbf{x}_i^T \beta(\tau) \end{cases}\]
Where:
- β(τ): The regression coefficients we are trying to estimate for the τ-th quantile. Notice that these coefficients are quantile-specific – they will be different for different values of τ.
- xi: The vector of predictor variables for observation i.
- xiTβ(τ): The linear predictor, which serves as our estimate of the τ-th conditional quantile, ŷi(τ).
This minimization problem is typically solved using linear programming methods because the pinball loss function, while convex, is not differentiable everywhere (due to the absolute value). However, for practical purposes, software libraries handle the optimization process automatically.
Example: Visualizing Pinball Loss
Imagine we want to predict house price (Y) based on size (X), and we’re focusing on the 0.25 quantile (lower quartile). If our model underpredicts the price for a house (predicted price lower than actual), the pinball loss is |Actual - Predicted| * 0.25. If our model overpredicts, the loss is |Actual - Predicted| * 0.75. The model adjusts its predictions to minimize the total pinball loss across all houses, penalizing overpredictions more heavily, which pulls the fitted line down toward the 25th percentile of prices.
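To make the asymmetry concrete, here is a small, hypothetical calculation of the pinball loss at τ = 0.25 for one underpredicted and one overpredicted house (prices in arbitrary units; the helper function is just for illustration):
def pinball_loss_single(y_true, y_pred, tau):
    # tau-weighted penalty for underprediction, (1 - tau)-weighted penalty for overprediction
    error = y_true - y_pred
    return tau * error if error >= 0 else (1 - tau) * (-error)

tau = 0.25
print(pinball_loss_single(300, 280, tau))  # underprediction by 20 -> 0.25 * 20 = 5.0
print(pinball_loss_single(300, 320, tau))  # overprediction by 20 -> 0.75 * 20 = 15.0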
By changing τ, we can shift the focus and estimate regression lines for different quantiles, revealing how the relationship between predictors and the outcome varies across the distribution.
Prerequisites and Assumptions: Flexibility with Fewer Restrictions
Quantile Regression is known for being more flexible and making fewer restrictive assumptions compared to traditional OLS regression. This is one of its key strengths, particularly when dealing with real-world data that often deviates from ideal conditions.
Prerequisites:
- Numerical Outcome Variable (Continuous or Discrete): Quantile Regression is primarily designed for numerical outcome variables. While it can be applied to discrete outcomes, it’s most naturally suited for continuous or ordered discrete responses.
- Predictor Variables (Numerical and/or Categorical): Predictor variables can be a mix of numerical and categorical types. Categorical predictors need to be handled appropriately (e.g., using one-hot encoding, as discussed in previous blogs).
Assumptions (Less Strict than OLS):
- Linearity: Like OLS, Quantile Regression, in its basic form, assumes a linear relationship between the predictor variables and the conditional quantile of the outcome. That is, for a given quantile level τ, we assume that the τ-th conditional quantile is a linear function of the predictors: Qτ(y | x) = xTβ(τ). However, this linearity is about the quantile function, not necessarily the conditional mean.
- Independence of Observations: Similar to most regression methods, it’s generally assumed that the observations are independent of each other. Dependencies between observations (e.g., time series data, panel data) might require specialized Quantile Regression techniques that account for these dependencies.
- No Strict Distributional Assumptions on Errors: This is a major advantage of Quantile Regression over OLS. OLS regression, for valid statistical inference, often relies on assumptions like normally distributed errors with constant variance (homoscedasticity). Quantile Regression does not require assumptions about the distribution of the error term. It is robust to heteroscedasticity (non-constant variance of errors) and non-normality. This makes it well-suited for data where these classical assumptions are violated, which is common in many real-world datasets.
Testing Assumptions (Less Critical but Still Consider):
- Linearity Assessment: Similar to other linear models, you can examine scatter plots of outcome vs. predictors to get a visual sense of linearity, especially for simple models with few predictors. For more complex models, residual plots can be explored (though residual analysis in Quantile Regression is less straightforward than in OLS). You can also consider adding polynomial terms or transformations to predictors if non-linearity is suspected.
- Independence Check: Consider the data collection process. Are there reasons to suspect dependence between observations? For example, if you’re analyzing data collected over time, you might need to check for autocorrelation.
- Robustness Check (Heteroscedasticity, Non-normality): You can visually assess heteroscedasticity in OLS regression by examining residual plots (e.g., residuals vs. fitted values). Formal tests for heteroscedasticity (like Breusch-Pagan or White’s test) and normality (like Shapiro-Wilk or Kolmogorov-Smirnov) exist for OLS residuals. If these tests and plots suggest violations of OLS assumptions, Quantile Regression becomes an even more attractive alternative. However, formal “assumption tests” for Quantile Regression itself, related to error distributions, are less common because the method is designed to be distribution-free. A small sketch of such a check follows this list.
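As a minimal sketch of the heteroscedasticity check mentioned above, you could fit an OLS model and run a Breusch-Pagan test on its residuals with statsmodels; the DataFrame df and its columns y and x here are hypothetical placeholders:
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit an ordinary OLS model on hypothetical data with outcome 'y' and predictor 'x'
ols_fit = smf.ols("y ~ x", data=df).fit()

# Breusch-Pagan test on the OLS residuals; a small p-value suggests heteroscedasticity,
# a setting in which Quantile Regression is particularly attractive
bp_stat, bp_pvalue, _, _ = het_breuschpagan(ols_fit.resid, ols_fit.model.exog)
print(f"Breusch-Pagan statistic = {bp_stat:.3f}, p-value = {bp_pvalue:.4f}")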
Python Libraries for Implementation:
We will primarily use these Python libraries for Quantile Regression:
- statsmodels: A powerful library for statistical modeling, including Quantile Regression. It provides the QuantReg class within statsmodels.regression.quantile_regression.
- numpy: For numerical operations.
- pandas: For data manipulation and DataFrames.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
Data Preprocessing: Minimal Fuss, Maximum Insight
Data preprocessing for Quantile Regression is generally less demanding than for some other machine learning algorithms, particularly those sensitive to scale or distance.
Normalization/Scaling:
- Generally Not Required: Quantile Regression, like other linear regression methods, is not strictly dependent on feature scaling for the algorithm to work. The optimization process will find appropriate coefficients regardless of the scale of the predictors.
- Can be helpful for interpretation and, sometimes, for optimization:
- Interpretation: If your predictor variables have vastly different scales, the magnitudes of the coefficients in Quantile Regression will also be on different scales, making direct comparison of coefficients less meaningful in terms of relative importance. Scaling predictors to have roughly comparable ranges (e.g., standardization to mean=0, std. dev=1) can make coefficient magnitudes more directly comparable in terms of their influence on the conditional quantile, though interpretation should still be cautious.
- Optimization: In some cases, especially with very large datasets or highly complex models, scaling might improve the numerical stability and convergence speed of the optimization algorithms used to fit Quantile Regression models. However, this is usually a less critical concern than for algorithms where scaling is fundamental.
- When to consider scaling: If you have predictors with dramatically different ranges and you want to compare coefficient magnitudes more directly or if you encounter optimization issues with unscaled data, you can consider standardization or min-max scaling. But, for basic Quantile Regression, it’s often not a mandatory step for the method to function correctly.
Handling Categorical Variables:
- Essential: One-Hot Encoding or Dummy Coding: Like most regression methods, Quantile Regression requires numerical input. If you have categorical predictor variables, you must convert them into numerical representations. One-hot encoding remains the most common and recommended approach for categorical variables in regression, including Quantile Regression. As discussed earlier, it creates binary (0/1) variables for each category. Dummy coding is a similar alternative.
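As a minimal, hypothetical sketch of one-hot encoding with pandas (the column 'day' and its values are made up for illustration):
import pandas as pd

# Hypothetical data with one categorical predictor
df = pd.DataFrame({'day': ['Mon', 'Sat', 'Sun', 'Mon'], 'bill': [20.5, 48.0, 35.2, 18.9]})

# One-hot encode 'day'; drop_first=True keeps one category as the baseline,
# which avoids perfect collinearity with the intercept in a regression model
encoded = pd.get_dummies(df, columns=['day'], drop_first=True)
print(encoded)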
Other Preprocessing Notes (Similar to general regression):
- Missing Data: How to handle missing data depends on the amount and patterns of missingness. Options include:
- Imputation: Filling in missing values (e.g., mean, median, more advanced methods). Be cautious when imputing, as it can introduce bias if not done carefully.
- Removing Observations: If missingness is limited and not systematic, you might remove rows with missing values. But this can reduce sample size.
- Note: statsmodels might handle NaN values in input data in some contexts, but it’s usually better to address missing data explicitly beforehand.
- Outlier Handling: Quantile Regression is known to be more robust to outliers in the outcome variable (Y) than OLS regression because it minimizes absolute deviations (in the pinball loss) rather than squared deviations. Squared loss in OLS gives outliers disproportionate influence. However, extreme outliers in predictor variables (X) could still potentially influence the model. Consider whether outliers in X are genuine data points or errors and handle them accordingly (correction, removal, or use robust methods if outliers are a concern).
In summary: For Quantile Regression, handling categorical variables (one-hot encoding) is essential. Scaling is generally optional but can be considered for interpretation and in some optimization scenarios. Its robustness to outliers in the outcome and lack of strict distributional assumptions make it a relatively forgiving method in terms of data preprocessing.
Implementation Example: Predicting Restaurant Tips
Let’s create some dummy data to illustrate Quantile Regression in Python. We’ll imagine we are predicting the amount of restaurant tips (as a percentage of the bill) based on the size of the party and the day of the week.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(42)
# Number of restaurant bills
n_bills = 150
# Simulate party size (number of people)
party_size = np.random.randint(1, 7, n_bills) # Party sizes from 1 to 6
# Simulate day of the week (0=Weekday, 1=Weekend)
day_type = np.random.binomial(1, 0.3, n_bills) # 30% chance of weekend
# Base tip percentage (for small party on weekday)
base_tip_percent = 15
# Effect of party size (decreasing tip percentage for larger parties - example)
party_size_effect = -0.8 # Percentage point decrease per additional person
# Effect of weekend (increasing tip percentage on weekends)
weekend_effect = 3 # Percentage point increase on weekends
# Simulate tip percentage with some random variation (heteroscedasticity example)
tip_percentage = []
for i in range(n_bills):
    tip = base_tip_percent + party_size_effect * party_size[i]
    if day_type[i] == 1:
        tip += weekend_effect
    # Add heteroscedastic noise (variance increases with party size)
    noise_std_dev = 1 + 0.2 * party_size[i]
    tip += np.random.normal(0, noise_std_dev)
    tip_percentage.append(max(0, tip))  # Ensure tip percentage is not negative
# Create Pandas DataFrame
data = pd.DataFrame({
'tip_percent': tip_percentage,
'party_size': party_size,
'is_weekend': day_type
})
# Convert 'is_weekend' to categorical for formula-based one-hot encoding
data['is_weekend'] = pd.Categorical(data['is_weekend'])
print(data.head())
Output (will vary due to randomness):
tip_percent party_size is_weekend
0 11.998632 6 0
1 11.495777 5 0
2 14.878582 3 0
3 10.735358 5 0
4 13.514730 4 0
Now, let’s fit Quantile Regression models at different quantile levels (e.g., 0.25, 0.5, 0.75) using statsmodels.formula.api.
# Quantile levels to estimate
quantiles = [0.25, 0.5, 0.75]
model_results = {}
for q in quantiles:
    # Fit Quantile Regression model using formula
    mod = smf.quantreg("tip_percent ~ party_size + is_weekend", data)
    res = mod.fit(q=q)  # 'q' parameter specifies the quantile
    model_results[q] = res
    print(f"\nQuantile = {q:.2f} Model Summary:")
    print(res.summary())
Output (Model Summaries - important parts explained below):
Quantile = 0.25 Model Summary:
QuantReg Results
==============================================================================
Dep. Variable: tip_percent No. Observations: 150
Model: QuantReg Df Residuals: 147
Method: Least Squares Df Model: 2
Date: Fri, 27 Oct 2023 Pseudo R-squ.: 0.2014
Time: 15:43:23 Bandwidth: 1.7045
Quantile: 0.25
------------------------------------------------------------------------------
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 13.5511 0.409 33.117 0.000 12.742 14.360
is_weekend[T.1] 2.4812 0.565 4.395 0.000 1.365 3.597
party_size -0.7872 0.095 -8.257 0.000 -0.976 -0.599
==============================================================================
Quantile = 0.50 Model Summary:
QuantReg Results
==============================================================================
Dep. Variable: tip_percent No. Observations: 150
Model: QuantReg Df Residuals: 147
Method: Least Squares Df Model: 2
Date: Fri, 27 Oct 2023 Pseudo R-squ.: 0.2015
Time: 15:43:23 Bandwidth: 1.7367
Quantile: 0.50
------------------------------------------------------------------------------
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 15.2298 0.417 36.533 0.000 14.406 16.054
is_weekend[T.1] 2.9983 0.576 5.206 0.000 1.860 4.137
party_size -0.8349 0.097 -8.611 0.000 -1.027 -0.643
==============================================================================
Quantile = 0.75 Model Summary:
QuantReg Results
==============================================================================
Dep. Variable: tip_percent No. Observations: 150
Model: QuantReg Df Residuals: 147
Method: Least Squares Df Model: 2
Date: Fri, 27 Oct 2023 Pseudo R-squ.: 0.2018
Time: 15:43:23 Bandwidth: 1.7367
Quantile: 0.75
------------------------------------------------------------------------------
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 17.0988 0.417 40.987 0.000 16.275 17.923
is_weekend[T.1] 3.5493 0.576 6.162 0.000 2.411 4.688
party_size -0.8626 0.097 -8.915 0.000 -1.054 -0.672
==============================================================================
Interpreting the Output (Key Differences across Quantiles):
- coef (Coefficients): Notice how the coefficients change across quantiles.
- Intercept: Because party_size enters the model numerically, the intercept is the predicted tip percentage when party_size is 0 and the day is a weekday (is_weekend[T.1] equal to 0). A party of 0 people is outside the observed data, so read the intercept as the model’s weekday baseline rather than a literal prediction.
- For the 25th percentile (q=0.25), the intercept is 13.55.
- For the 50th percentile (q=0.50, the median), the intercept is 15.23.
- For the 75th percentile (q=0.75), the intercept is 17.10.
- Interpretation: As we move to higher quantiles, the intercept increases. This suggests that at higher tip percentiles, the baseline weekday tip percentage tends to be higher.
- is_weekend[T.1]: Effect of weekend (compared to weekday) on tip percentage.
- For q=0.25, Weekend effect: 2.48 percentage points.
- For q=0.50, Weekend effect: 3.00 percentage points.
- For q=0.75, Weekend effect: 3.55 percentage points.
- Interpretation: The weekend effect (increase in tip percentage on weekends) appears to be larger at higher tip percentiles. Weekends have a more pronounced positive impact on higher tips than on lower tips.
- party_size: Effect of increasing party size on tip percentage.
- For q=0.25, Party size effect: -0.79 percentage point decrease per person.
- For q=0.50, Party size effect: -0.83 percentage point decrease per person.
- For q=0.75, Party size effect: -0.86 percentage point decrease per person.
- Interpretation: The negative effect of party size (larger parties tend to tip less as a percentage) is quite consistent across quantiles, and even slightly stronger (more negative coefficient) at higher quantiles.
- std err (Standard Error), t (t-statistic), P>|t| (p-value), [0.025 0.975] (Confidence Interval): Interpret these similarly to standard regression output, but they are now specific to the estimated quantile. Small p-values and confidence intervals that exclude zero indicate a statistically significant effect of that predictor at the given quantile level.
- Pseudo R-squ. (Pseudo R-squared): For Quantile Regression, this is a measure of goodness of fit, analogous to R-squared in OLS, but based on comparing the model to a constant (intercept-only) quantile model using the pinball loss. Higher pseudo R-squared is better, but values are generally lower than typical R-squared values in OLS; it is mainly useful for relative comparison of models. A short sketch of accessing these quantities in code follows this list.
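If you prefer to work with these quantities programmatically rather than reading the summary table, the fitted results objects expose them as attributes; this is a sketch that follows standard statsmodels results conventions and uses the median model fitted above:
res_median = model_results[0.5]

print(res_median.params)      # coefficient estimates
print(res_median.pvalues)     # p-values for each coefficient
print(res_median.conf_int())  # 95% confidence intervals
print(res_median.prsquared)   # pseudo R-squared, as reported in the summary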
Visualizing Quantile Regression Results:
It’s helpful to visualize how the regression lines change across quantiles.
# Generate x values for plotting (party sizes from 1 to 6)
x_plot = np.linspace(data['party_size'].min(), data['party_size'].max(), 100)
weekend_0 = pd.DataFrame({'party_size': x_plot, 'is_weekend': pd.Categorical([0]*100)}) # Weekday
weekend_1 = pd.DataFrame({'party_size': x_plot, 'is_weekend': pd.Categorical([1]*100)}) # Weekend
fig, ax = plt.subplots(figsize=(8, 6))
# Plot scatter of actual data
ax.scatter(data['party_size'], data['tip_percent'], alpha=0.5, label='Data Points')
# Plot Quantile Regression lines
for q, res in model_results.items():
    y_pred_weekday = res.predict(weekend_0)
    y_pred_weekend = res.predict(weekend_1)
    ax.plot(x_plot, y_pred_weekday, linestyle='--', label=f'Quantile {q:.2f} (Weekday)')
    ax.plot(x_plot, y_pred_weekend, linestyle='-', label=f'Quantile {q:.2f} (Weekend)')
ax.set_xlabel('Party Size')
ax.set_ylabel('Tip Percentage')
ax.set_title('Quantile Regression of Tip Percentage vs. Party Size')
ax.legend()
plt.show()
Running this code produces a scatter plot of the data with the 0.25, 0.50, and 0.75 quantile regression lines for weekdays (dashed) and weekends (solid). Visualizing the lines side by side is one of the easiest ways to see how the fitted relationship shifts across quantiles.
Saving and Loading the Model:
For statsmodels Quantile Regression results objects, you can also use pickle to save and load the fitted model.
import pickle
# Save the model results (for quantile 0.5, as an example)
median_model_result = model_results[0.5]
with open('quantile_regression_model_median.pkl', 'wb') as file:
    pickle.dump(median_model_result, file)
# Load the saved model result
with open('quantile_regression_model_median.pkl', 'rb') as file:
    loaded_model_result = pickle.load(file)
print("\nSummary of loaded median Quantile Regression model:")
print(loaded_model_result.summary()) # Verify it's the same model
This saves the fitted QuantRegResults object, allowing you to load it later and make predictions or examine the results without refitting.
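As a quick usage sketch, the loaded results object can make predictions for new bills; the party sizes and weekend flags below are made-up inputs:
# Hypothetical new bills to score with the loaded median model
new_data = pd.DataFrame({
    'party_size': [2, 5],
    'is_weekend': pd.Categorical([1, 0])
})

predicted_median_tips = loaded_model_result.predict(new_data)
print(predicted_median_tips)  # predicted median tip percentage for each new bill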
Post-processing: Comparing Quantile Effects and Hypothesis Testing
Comparing Effects Across Quantiles:
The primary form of “post-processing” in Quantile Regression is to compare how the coefficients vary across different quantile levels. This is where Quantile Regression really provides unique insights beyond mean regression.
- Coefficient Trajectories: Examine how the estimated coefficients for each predictor variable change as you move from lower quantiles to higher quantiles.
- Increasing Coefficient with Quantile: If a coefficient becomes more positive (or less negative) as you increase the quantile level, it suggests that the predictor’s positive effect on the outcome is stronger at higher levels of the outcome distribution.
- Decreasing Coefficient with Quantile: If a coefficient becomes less positive (or more negative) as you increase the quantile level, it suggests the predictor’s positive effect is weaker at higher outcome levels (or negative effect is stronger).
- Constant Coefficient: If a coefficient stays relatively constant across quantiles, it implies that the predictor’s effect is consistent across the entire outcome distribution.
- Visualization of Coefficients: Plotting the estimated coefficients (and their confidence intervals) against the quantile levels (τ) can be very informative. This visual trajectory reveals how the predictor effects differ across the distribution. In our restaurant tip example, we qualitatively observed that the ‘weekend’ effect became larger at higher tip percentiles; a coefficient plot would make this trend visually clear and quantify the change.
Example Code to Extract and Compare Coefficients:
quantile_levels = sorted(model_results.keys())
coefficients = {}
for var_name in model_results[quantile_levels[0]].params.index: # Get variable names from any result
    coefficients[var_name] = []
    for q in quantile_levels:
        coefficients[var_name].append(model_results[q].params[var_name])
coefficient_df = pd.DataFrame(coefficients, index=quantile_levels)
print("\nCoefficients across Quantiles:\n", coefficient_df)
# Example of plotting coefficients (for 'party_size')
plt.figure(figsize=(7, 5))
plt.plot(quantile_levels, coefficient_df['party_size'], marker='o')
plt.xlabel('Quantile Level (τ)')
plt.ylabel('Coefficient for Party Size')
plt.title('Quantile Regression: Party Size Coefficient vs. Quantile')
plt.grid(True)
plt.show()
# Example of plotting coefficients (for 'is_weekend[T.1]')
plt.figure(figsize=(7, 5))
plt.plot(quantile_levels, coefficient_df['is_weekend[T.1]'], marker='o')
plt.xlabel('Quantile Level (τ)')
plt.ylabel('Coefficient for Weekend')
plt.title('Quantile Regression: Weekend Coefficient vs. Quantile')
plt.grid(True)
plt.show()
Running this code prints the coefficient table and produces one coefficient-versus-quantile plot for each predictor.
Hypothesis Testing in Quantile Regression:
- Coefficient Significance Tests: The statsmodels Quantile Regression summary output already provides t-statistics and p-values for testing whether each coefficient is significantly different from zero at each quantile level. P-values are based on asymptotic approximations.
- Testing for Equality of Coefficients Across Quantiles: You might want to formally test whether the coefficient for a predictor differs across two quantiles (e.g., is the ‘party_size’ effect significantly different at the 0.25 quantile compared to the 0.75 quantile?). Standard statsmodels output does not provide these tests directly. You would typically need more advanced procedures, such as bootstrapping or specialized hypothesis tests for quantile regression coefficients (which might require other software or libraries). For many practical purposes, visually comparing coefficient trajectories and examining confidence intervals at different quantiles is often sufficient to judge whether effects differ across quantiles. A rough bootstrap sketch follows this list.
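As a rough illustration of the bootstrap idea mentioned above (a sketch, not a formal test), the following resamples the tip data, refits the model at τ = 0.25 and τ = 0.75, and builds a percentile interval for the difference between the two party_size coefficients; an interval that excludes zero suggests the effect differs between these quantiles:
n_boot = 200  # modest number of resamples, purely for illustration
coef_diffs = []
for _ in range(n_boot):
    boot = data.sample(n=len(data), replace=True)
    res_q25 = smf.quantreg("tip_percent ~ party_size + is_weekend", boot).fit(q=0.25)
    res_q75 = smf.quantreg("tip_percent ~ party_size + is_weekend", boot).fit(q=0.75)
    coef_diffs.append(res_q75.params['party_size'] - res_q25.params['party_size'])

lower, upper = np.percentile(coef_diffs, [2.5, 97.5])
print(f"Bootstrap 95% interval for the party_size coefficient difference: [{lower:.3f}, {upper:.3f}]")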
In summary, post-processing in Quantile Regression focuses heavily on interpreting and comparing the quantile-specific effects. Analyzing how coefficients change across quantiles and visualizing these trajectories is the primary way to extract deeper insights from Quantile Regression models.
Hyperparameter Tuning: The Quantile Level τ
The main “hyperparameter” in Quantile Regression is the quantile level τ (tau). It’s not a hyperparameter in the traditional sense of being tuned for optimization performance like in machine learning algorithms (e.g., regularization parameters). Instead, choosing different values of τ is fundamental to exploring different parts of the conditional distribution.
Effect of τ:
- τ controls which quantile is being modeled: As we’ve seen throughout the blog, τ determines which part of the conditional distribution we are focusing on.
- τ = 0.5: Median regression (predicts the conditional median).
- τ < 0.5: Lower quantiles (e.g., τ=0.25 for the 25th percentile - lower quartile).
- τ > 0.5: Upper quantiles (e.g., τ=0.75 for the 75th percentile - upper quartile, τ=0.90 for the 90th percentile).
- Different τ values reveal different relationships: By fitting Quantile Regression models for a range of τ values (e.g., τ = 0.1, 0.2, 0.3, …, 0.9), you can map out how the effect of predictors changes across the entire conditional distribution of the outcome. This is the key strength of Quantile Regression.
- No “tuning” for optimal prediction in the same way: You don’t “tune” τ to optimize a prediction accuracy metric in the way you might tune hyperparameters like regularization strength in ridge regression or depth in a decision tree. Instead, you choose τ values based on your research question or problem objective.
Choosing τ Values:
- Based on Research Question: Select τ values that are relevant to your specific question.
- If interested in typical outcomes, focus around τ=0.5 (median regression) and perhaps nearby quantiles (e.g., 0.4, 0.6).
- If concerned with low outcomes (e.g., risk management, understanding low performers), examine lower quantiles (e.g., 0.1, 0.25).
- If concerned with high outcomes (e.g., understanding high achievers, upper end of the market), examine upper quantiles (e.g., 0.75, 0.90).
- For a comprehensive picture of the entire distribution, choose a set of quantiles covering a range from close to 0 to close to 1 (e.g., τ = 0.1, 0.2, …, 0.9).
- No Cross-Validation for τ selection (typically): You don’t typically use cross-validation to select the “best” τ value in the same way you would for hyperparameters that control model complexity. Cross-validation is more about estimating out-of-sample predictive performance for a given model (with a fixed τ).
Implementation Example (Fitting Models for a Range of τ values - already shown):
The implementation example already demonstrates fitting Quantile Regression models for τ = 0.25, 0.5, and 0.75. You can easily extend this to fit models for a finer grid of τ values to get a more detailed view of coefficient trajectories.
# Example - fitting for a wider range of quantiles
quantiles_range = np.arange(0.1, 0.91, 0.1) # Quantiles from 0.1 to 0.9 in steps of 0.1
model_results_extended = {}
for q in quantiles_range:
    mod = smf.quantreg("tip_percent ~ party_size + is_weekend", data)
    res = mod.fit(q=q)
    model_results_extended[q] = res
# You can then analyze and plot coefficients from model_results_extended
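For example, the party_size coefficient from model_results_extended can be collected and plotted across the finer grid in the same way as before:
# Collect the 'party_size' coefficient at each quantile level in the finer grid
party_size_coefs = [model_results_extended[q].params['party_size'] for q in quantiles_range]

plt.figure(figsize=(7, 5))
plt.plot(quantiles_range, party_size_coefs, marker='o')
plt.xlabel('Quantile Level (τ)')
plt.ylabel('Coefficient for Party Size')
plt.title('Party Size Coefficient Across a Finer Grid of Quantiles')
plt.grid(True)
plt.show()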
In essence, “hyperparameter tuning” in Quantile Regression is about thoughtfully selecting the quantile levels τ that are relevant to your research questions, not about automated optimization for prediction accuracy across τ values. The power lies in examining the quantile-specific effects, which requires choosing a meaningful set of quantiles to investigate.
Accuracy Metrics for Quantile Regression
“Accuracy” in Quantile Regression is evaluated differently than in standard mean regression (OLS) or classification. Since we are predicting quantiles, we need metrics that assess how well our predicted quantiles align with the actual quantiles in the data.
- Pinball Loss (or Quantile Loss):
- Equation (Average Pinball Loss):
\(\text{Mean Pinball Loss}_{\tau} = \frac{1}{n} \sum_{i=1}^{n} L_{\tau}(y_i, \hat{y}_i(\tau)) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases}
\tau (y_i - \hat{y}_i(\tau)) & \text{if } y_i \ge \hat{y}_i(\tau) \\
(1 - \tau) (\hat{y}_i(\tau) - y_i) & \text{if } y_i < \hat{y}_i(\tau)
\end{cases}\)
- yi are the actual outcome values, ŷi(τ) are the predicted τ-th quantile values, τ is the quantile level, and n is the number of observations.
- Interpretation: Pinball loss is the natural loss function that is minimized in Quantile Regression. Lower pinball loss values indicate better quantile prediction accuracy at the specific quantile level τ. The mean pinball loss is the average loss over all observations for a given quantile level.
- Python Calculation: You can implement the pinball loss function and calculate it for your predictions.
def pinball_loss(y_true, y_pred, tau):
    # Asymmetric penalty: tau-weighted underprediction, (1 - tau)-weighted overprediction
    loss = np.maximum(tau * (y_true - y_pred), (1 - tau) * (y_pred - y_true))
    return np.mean(loss)

# Example calculation for the quantile 0.5 model on test data
# (assuming you have test data 'X_test', 'y_test' and the fitted 'model_results')
y_pred_median = model_results[0.5].predict(X_test)  # Get predictions for quantile 0.5
tau_value = 0.5
median_pinball = pinball_loss(y_test, y_pred_median, tau_value)
print(f"Pinball Loss at Quantile {tau_value:.2f}: {median_pinball:.4f}")
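If scikit-learn is installed, recent versions also provide a built-in pinball loss (mean_pinball_loss in sklearn.metrics, where the alpha argument plays the role of τ); it should match the manual function above up to floating-point differences:
from sklearn.metrics import mean_pinball_loss

# Reuses y_test and y_pred_median from the snippet above
median_pinball_sklearn = mean_pinball_loss(y_test, y_pred_median, alpha=0.5)
print(f"Pinball Loss (scikit-learn) at Quantile 0.50: {median_pinball_sklearn:.4f}")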
- Check for Quantile Coverage (Calibration):
- Concept: For a correctly calibrated Quantile Regression model at quantile level τ, approximately a proportion τ of the actual outcomes should fall below the predicted τ-th quantile.
- Empirical Check:
- For each observation i, get the predicted τ-th quantile ŷi(τ).
- Count how many actual outcome values yi are less than or equal to their corresponding predicted quantile ŷi(τ).
- Calculate the proportion of such observations.
- This proportion should be close to τ if the model is well-calibrated for quantile τ.
- Python Implementation (Example):
def quantile_coverage(y_true, y_pred_quantile, tau):
    # Fraction of actual outcomes that fall at or below the predicted tau-th quantile
    below_quantile = (y_true <= y_pred_quantile).sum()
    coverage_ratio = below_quantile / len(y_true)
    return coverage_ratio

# Example for the quantile 0.25 model on test data
y_pred_q25 = model_results[0.25].predict(X_test)
tau_q25 = 0.25
coverage_q25 = quantile_coverage(y_test, y_pred_q25, tau_q25)
print(f"Coverage at Quantile {tau_q25:.2f}: Empirical Coverage Ratio = {coverage_q25:.4f}, Target Coverage = {tau_q25:.2f}")
# Repeat for other quantiles (0.5, 0.75, etc.)
- Interpretation: Ideally, for τ=0.25, you’d expect coverage to be around 0.25, for τ=0.5 coverage around 0.5, and so on. Deviations from the target coverage ratio might suggest miscalibration of the quantile predictions. However, in practice, with limited sample sizes and model approximations, perfect calibration is rarely achieved, and some deviation is expected.
- Comparison to OLS metrics (RMSE, MAE, R-squared): While RMSE and R-squared are designed for mean regression, and MAE is more related to median regression, you can calculate and compare these metrics for Quantile Regression models (especially median regression, τ=0.5) against OLS regression for a descriptive comparison. However, remember that Quantile Regression is not optimized for minimizing squared error (as OLS is) or absolute error across all observations (though median regression minimizes total absolute deviation). Therefore, don’t expect Quantile Regression to necessarily outperform OLS in terms of RMSE or MAE in all cases, especially when the focus is on mean prediction accuracy.
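For a descriptive comparison of that kind, one sketch is to fit OLS on the same formula and compare in-sample mean absolute error against the median Quantile Regression model from earlier; this is illustrative only, not a claim about which method should win:
# Fit OLS on the same formula used for the quantile models
ols_res = smf.ols("tip_percent ~ party_size + is_weekend", data).fit()

# In-sample mean absolute error for OLS vs. the median (q=0.5) quantile model
mae_ols = np.mean(np.abs(data['tip_percent'] - ols_res.predict(data)))
mae_qr_median = np.mean(np.abs(data['tip_percent'] - model_results[0.5].predict(data)))

print(f"MAE (OLS): {mae_ols:.3f}")
print(f"MAE (median Quantile Regression): {mae_qr_median:.3f}")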
Choosing Metrics:
- Pinball Loss: The most fundamental and appropriate metric for evaluating Quantile Regression accuracy for a given quantile level τ. Use it to compare different Quantile Regression models at the same τ or to track pinball loss across different τ levels.
- Quantile Coverage: Useful for assessing the calibration of your quantile predictions. Check if the empirical coverage ratios are close to the target quantile levels.
- RMSE, MAE, R-squared (for comparison): Can be used for descriptive comparison against OLS or as general regression error measures, but understand their limitations in the context of Quantile Regression.
In practice, report pinball loss and quantile coverage ratios for the quantiles you are interested in to evaluate the performance of your Quantile Regression models.
Productionizing Quantile Regression Models
Productionizing Quantile Regression models follows similar steps as for other regression models, but with some specific considerations.
1. Saving and Loading Trained Models:
- Use pickle to save the fitted statsmodels QuantRegResults objects (one for each quantile level you want to deploy).
- In your production environment, load these saved model results.
2. Deployment Environments:
- Local Testing/Development: Test your prediction and deployment pipeline locally. Use a virtual environment.
- On-Premise Server or Cloud:
- API (Web Service): Build a REST API using frameworks like Flask or FastAPI (Python) to expose your Quantile Regression models. You might have different API endpoints for predicting different quantiles (or one endpoint that returns predictions for multiple quantiles).
- Containerization (Docker): Package your API application, model files, and dependencies in Docker containers for consistent deployment.
- Cloud Platforms (AWS, GCP, Azure): Deploy the Dockerized API to cloud container services (ECS/EKS, GKE, AKS) or to serverless functions (e.g., AWS Lambda) for simpler prediction tasks.
- On-Premise Servers: Set up server environment to run containers or your API application directly.
3. API Design (Example for Quantile Predictions):
Your API could have endpoints like /predict_quantile_25, /predict_quantile_50, /predict_quantile_75 to predict at specific quantiles, or a single endpoint that takes the desired quantile level as a parameter.
- Input: The API should accept structured input data (e.g., JSON) containing predictor variable values for new data points. The input format should match the features used in training.
- Prediction Logic in API: When an API request is received:
- Load the saved QuantRegResults object for the desired quantile (or load multiple models if serving multiple quantiles).
- Preprocess the input data (e.g., one-hot encode categorical features; scaling is generally optional for Quantile Regression, as discussed).
- Use the loaded model’s predict() method to get the quantile predictions.
- Return the prediction(s) in a JSON response.
4. Example Code (Simplified Flask API - Conceptual for Median Quantile):
from flask import Flask, request, jsonify
import pickle
import pandas as pd
app = Flask(__name__)
# Load median Quantile Regression model at app startup
with open('quantile_regression_model_median.pkl', 'rb') as file:
    median_model = pickle.load(file)
@app.route('/predict_median_tip_percent', methods=['POST'])
def predict_median_tip_percent():
    try:
        data = request.get_json()  # Get JSON input
        input_df = pd.DataFrame([data])  # Create DataFrame
        # Ensure input data has expected predictor columns
        expected_columns = ['party_size', 'is_weekend']  # Replace with your actual predictor names
        if not all(col in input_df.columns for col in expected_columns):
            return jsonify({'error': 'Invalid input format. Missing predictor columns.'}), 400
        input_df['is_weekend'] = pd.Categorical(input_df['is_weekend'])  # Ensure categorical type
        # Make prediction using loaded median model
        median_prediction = median_model.predict(input_df)[0]  # Get single prediction
        return jsonify({'predicted_median_tip_percent': float(median_prediction)})  # Return as JSON
    except Exception as e:
        return jsonify({'error': str(e)}), 400
if __name__ == '__main__':
    app.run(debug=True)  # debug=False for production
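Once the API is running locally, a small client script can exercise it; the URL, port, and field values below are hypothetical and should match your deployment:
import requests

# Hypothetical request to the locally running Flask API
payload = {'party_size': 2, 'is_weekend': 1}
response = requests.post('http://127.0.0.1:5000/predict_median_tip_percent', json=payload)

print(response.status_code)
print(response.json())  # e.g., {'predicted_median_tip_percent': ...}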
5. Monitoring and Maintenance:
- Performance Monitoring: Track API request performance, error rates. Monitor pinball loss and quantile coverage (if you have access to actual outcomes over time to evaluate against predictions).
- Model Retraining: Retrain Quantile Regression models periodically with new data to maintain accuracy and adapt to data changes.
- Version Control: Use version control for API code and model files.
- Logging: Implement logging for API requests, predictions, errors, and system events.
Key Production Considerations for Quantile Regression:
- Serving Multiple Quantiles: You might need to deploy and manage multiple models (one for each quantile you want to predict).
- Real-time vs. Batch Prediction: Decide if you need real-time API predictions or batch prediction capabilities based on your application needs.
- Interpretation and Communication: When using Quantile Regression in production, clearly communicate to users that you are providing quantile predictions, not just mean predictions. Explain what quantiles mean in the context of your application.
Conclusion: Quantile Regression - A Deeper Look into Data
Quantile Regression offers a powerful alternative to traditional mean regression, especially when understanding the full distribution of the outcome and how predictor effects vary across different parts of that distribution is crucial.
Real-world problem solving:
- Risk Management: Quantile Regression is essential in finance for Value at Risk (VaR) modeling, in insurance for understanding extreme loss quantiles, and in various fields for analyzing and predicting tail events (extreme outcomes).
- Inequality Analysis: In economics and social sciences, it’s used to study wage inequality, income inequality, and disparities in various outcomes, by examining quantile-specific effects of education, policy interventions, etc.
- Robust Prediction in Heteroscedastic Data: When data exhibits heteroscedasticity (non-constant error variance), Quantile Regression can provide more robust and reliable predictions compared to OLS, as it doesn’t rely on homoscedasticity assumptions.
- Beyond Average Relationships: In many areas, the “average” relationship doesn’t tell the whole story. Quantile Regression allows us to uncover how relationships change at different levels of the outcome, providing a more nuanced and complete understanding.
Limitations and When to Use Alternatives:
- Less Focus on Mean Prediction: If your primary goal is solely to predict the conditional mean accurately and your data reasonably meets OLS assumptions, OLS regression might be simpler and sufficient.
- Interpretability (Slightly More Complex): Interpreting Quantile Regression requires focusing on quantile-specific effects and coefficient trajectories, which might be slightly more complex than interpreting single coefficients in OLS.
- Computational Cost (Potentially Higher for Large Datasets): Optimization in Quantile Regression (linear programming) can be computationally more intensive than OLS, especially for very large datasets. However, for typical datasets, this is generally not a major limitation with modern software.
- Alternatives:
- OLS Regression: If mean prediction is the only goal, and OLS assumptions are reasonably met.
- Robust Regression (e.g., M-estimation, Huber Regression): If robustness to outliers in predictors is a primary concern.
- Generalized Additive Models for Location, Scale and Shape (GAMLSS): A more flexible framework to model the entire distribution of the outcome, including not just location (like mean/quantile) but also scale and shape parameters.
- Machine Learning Methods for Quantile Prediction (Tree-based, Neural Networks): Methods like quantile regression forests, quantile gradient boosting, and neural networks trained with pinball loss are increasingly used for flexible non-linear quantile prediction, especially in large-scale machine learning applications.
Ongoing Use and Developments:
Quantile Regression remains a vibrant area of research and application. It’s widely used in econometrics, statistics, and increasingly in data science and machine learning, especially in areas requiring robust prediction, risk analysis, and a deeper understanding of distributional effects. Developments include:
- Non-parametric and Semi-parametric Quantile Regression: Methods that relax linearity assumptions.
- Quantile Regression for Time Series and Panel Data: Techniques to handle dependencies in time series and panel datasets.
- Machine Learning Integration: Combining Quantile Regression with machine learning algorithms for improved flexibility and scalability.
Quantile Regression offers a valuable and insightful approach to regression analysis, providing a richer understanding of data relationships beyond just average effects. Its robustness, flexibility, and focus on distributional aspects make it an indispensable tool for many data analysts and researchers.
References
- Koenker, R. (2005). Quantile regression. Cambridge university press. (The definitive, comprehensive book on Quantile Regression).
- Koenker, R., & Bassett Jr, G. (1978). Regression quantiles. Econometrica, 46(1), 33-50. (The seminal paper introducing Quantile Regression).
- Yu, K., & Moyeed, R. A. (2001). Bayesian quantile regression. Statistics & Probability Letters, 54(4), 437-447. (On Bayesian approaches to Quantile Regression).
- Davino, C., Furno, M., & Vistocco, D. (2013). Quantile regression: theory and applications. John Wiley & Sons. (A more applied textbook on Quantile Regression).
- Statsmodels Documentation for Quantile Regression: https://www.statsmodels.org/stable/quantreg.html (Official documentation for the statsmodels Python library’s Quantile Regression functionality).
- “An Introduction to Statistical Learning” by James, Witten, Hastie, Tibshirani: https://www.statlearning.com/ (While not focused solely on Quantile Regression, it provides context on regression and related concepts. Freely available online).
- “Applied Econometrics with R” by Kleiber and Zeileis: (A good resource with code examples, including Quantile Regression, in R. Concepts are transferable to Python).