Strategic Insights into US Airline Market: Fare Prediction and Market Landscape

Association:

iSchool, University of Toronto

Duration:

1 month

Data Analytics

Machine Learning

Python

For the final project of INF1290-Data Analytics: Introduction, Methods, and Practical Approaches, we analyzed fare pricing patterns, market competition, and market segmentation in the US airline industry. The motivation behind this analysis stems from the significant impact that fare pricing and market competition have on the profitability and sustainability of airlines.

After necessary data cleaning and preparation, we conducted EDA including a correlation study, time-series analysis for fare trends, and market analytics for leading locations and top players with the lowest fares.

import matplotlib.pyplot as plt
import seaborn as sns

# Create correlation matrix
correlation_matrix = continous_df.corr()
plt.figure(figsize=(20, 10))

# Generate heatmap
heatmap = sns.heatmap(correlation_matrix, annot=True, linewidths=0.5)
plt.show()

# Plotting the average fare, lowest fare, and largest carrier fare by year
plt.figure(figsize=(12, 8))

# Grouping the data by year and calculating the average fares
average_fare_by_year = df.groupby('Year')[['fare', 'fare_low', 'fare_lg']].mean().reset_index()

# Creating line plots for average fare, lowest fare, and largest carrier fare
sns.lineplot(x='Year', y='fare', data=average_fare_by_year, marker='o', label='Average Fare')
sns.lineplot(x='Year', y='fare_low', data=average_fare_by_year, marker='o', label='Lowest Fare')
sns.lineplot(x='Year', y='fare_lg', data=average_fare_by_year, marker='o', label='Largest Carrier Fare')

# displaying min and max points for average fare
min_fare_year = average_fare_by_year.loc[average_fare_by_year['fare'].idxmin()]
max_fare_year = average_fare_by_year.loc[average_fare_by_year['fare'].idxmax()]

plt.scatter(min_fare_year['Year'], min_fare_year['fare'], color='red', s=100, zorder=5)
plt.scatter(max_fare_year['Year'], max_fare_year['fare'], color='green', s=100, zorder=5)
plt.text(min_fare_year['Year'], min_fare_year['fare'], f"Min: {min_fare_year['fare']:.2f}", color='red', ha='left', va='bottom')
plt.text(max_fare_year['Year'], max_fare_year['fare'], f"Max: {max_fare_year['fare']:.2f}", color='green', ha='right', va='bottom')

# Adding titles and labels
plt.title('Average Fare, Lowest Fare, and Largest Carrier Fare by Year')
plt.xlabel('Year')
plt.ylabel('Fare')
plt.legend(title='Fare Type')

# Displaying the plot
plt.show()

# Calculate the annual average fare for each year
annual_avg_fare = df.groupby('Year')['fare'].mean().reset_index()

# Rename the column to indicate this is the annual average fare
annual_avg_fare.rename(columns={'fare': 'annual_avg_fare'}, inplace=True)

# Calculate the year-over-year fare inflation rate
annual_avg_fare['fare_inflation_rate'] = annual_avg_fare[
    'annual_avg_fare'].pct_change() * 100

# Display the results
annual_avg_fare[['Year', 'annual_avg_fare', 'fare_inflation_rate']]

# Plotting
fig, ax = plt.subplots(figsize=(12, 6))

# Use different colors for positive and negative inflation rates
colors = ['green' if x >= 0 else 'red' for x in annual_avg_fare[
    'fare_inflation_rate']]

annual_avg_fare.plot(
    x='Year',
    y='fare_inflation_rate',
    kind='bar',
    ax=ax,
    color=colors,
    legend=False
)

# Add titles and labels
ax.set_title('Year-over-Year Fare Inflation Rate')
ax.set_xlabel('')
ax.set_ylabel('Fare Inflation Rate (%)')

# Display the plot
plt.show()

# Plotting market share for each airline with distinct styles
sns.lineplot(x='Year', y='large_ms', data=df_southwestern,
             label='Southwest Airlines Market Share',
             linewidth=.5, marker='o', color='blue')
sns.lineplot(x='Year', y='lf_ms', data=df_american,
             label='American Airlines Market Share',
             linestyle='--',
             linewidth=.5,
             marker='s',
             color='green')
sns.lineplot(x='Year', y='lf_ms', data=df_delta,
             label='Delta Air Lines Market Share',
             linestyle='-.',
             linewidth=.5,
             marker='d',
             color='orange')

# Adding a title and labels
plt.title('Market Share of Carriers with Lowest Fares Over the Years',
          fontsize=18, fontweight='bold')
plt.xlabel('', fontsize=14)
plt.ylabel('Market Share (%)', fontsize=14)

# Enhancing the legend
plt.legend(title='Airlines',
           title_fontsize='13',
           fontsize='12',
           loc='upper left')

# Customizing the grid for better readability
plt.grid(True, linestyle='--', linewidth=0.5)

# Adjusting tick parameters
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Display the plot
plt.show()

Based on our preliminary findings, which identified Southwest Airlines as a prime candidate for further study, we focused our subsequent analysis on recommending pricing strategies for Southwest Airlines and gaining insights into its passengers' needs. Our analysis addresses two research questions:

[R1]: What factors influence fare prices for Southwest Airlines, and how can they inform its pricing strategy?

[R2]: What are the characteristics of the market segments that Southwest Airlines aims to attract?

To address our first research question [R1], we built three machine learning models (Linear Regression, Decision Tree Regressor, and Random Forest Regressor) to forecast fares for Southwest Airlines' flights and compared their predictive performance. We trained all three models on all relevant continuous variables and assessed their accuracy using the R-squared metric. The Random Forest Regressor achieved the highest accuracy in forecasting Southwest Airlines' fares (R^2 = 0.9982).
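The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the helper `compare_models`, the 80/20 train/test split, the hyperparameters, and the synthetic `demo` data are all assumptions made for the example, so the printed scores will not match the reported R^2.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_models(X, y, random_state=42):
    """Fit three regressors and return their R-squared on a held-out split."""
    # Assumption: an 80/20 split; the write-up does not state the split used
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree Regressor": DecisionTreeRegressor(random_state=random_state),
        "Random Forest Regressor": RandomForestRegressor(
            n_estimators=100, random_state=random_state
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data (NOT the project's dataset), reusing its column names
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "nsmiles": rng.uniform(200, 2500, n),
    "passengers": rng.uniform(10, 1000, n),
    "Year": rng.integers(1996, 2024, n),
})
demo["fare"] = 60 + 0.08 * demo["nsmiles"] + rng.normal(0, 10, n)

scores = compare_models(demo[["nsmiles", "passengers", "Year"]], demo["fare"])
for name, score in scores.items():
    print(f"{name}: R^2 = {score:.4f}")
```

Scoring on a held-out split rather than on the training rows keeps the comparison fair to the tree-based models, which can fit training data almost perfectly.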

The three most important predictive features were the flight distance (nsmiles), the fare per mile (fare_per_mile), and the year of the flight (Year). Because flight distance (nsmiles) is a significant factor in predicting fares, our subsequent clustering analysis centres on this feature to extract insights about Southwest's market segments.
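The write-up does not say how the features were ranked. One common approach, sketched here as an assumption, is to read the impurity-based `feature_importances_` of a fitted Random Forest. The helper `top_features` and the synthetic data are hypothetical and only show the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def top_features(model, feature_names, k=3):
    """Return the k features with the largest impurity-based importance."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)


# Synthetic stand-in data (NOT the project's dataset): fare depends on distance
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "nsmiles": rng.uniform(200, 2500, n),
    "passengers": rng.uniform(10, 1000, n),
    "Year": rng.integers(1996, 2024, n).astype(float),
})
y = 60 + 0.08 * X["nsmiles"] + rng.normal(0, 10, n)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
ranking = top_features(forest, X.columns, k=3)
print(ranking)
```

On this toy data the distance column dominates by construction; on real data the ranking depends on which variables are included in the model.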

To tackle our second research question [R2], we used a K-Means clustering model to identify groups of Southwest flights with similar characteristics. The model partitioned the data into distinct clusters based on two key features: the number of passengers (passengers) and the distance traveled (nsmiles). This clustering supports a more detailed view of customer segments and could inform how Southwest Airlines tailors its marketing strategies to the characteristics of each cluster, for example by offering promotional deals for regional flights (Cluster 0) or loyalty programs for frequent long-haul travelers (Cluster 2).
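A minimal sketch of this clustering step is shown below. The helper `cluster_routes`, the choice of three clusters, the standardization step, and the synthetic `routes` data are assumptions made for illustration; the project's actual settings are not stated in this write-up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_routes(df, n_clusters=3, random_state=42):
    """Assign each row to a K-Means cluster using passengers and nsmiles."""
    features = df[["passengers", "nsmiles"]]
    # Assumption: standardize so neither feature dominates the distance metric
    scaled = StandardScaler().fit_transform(features)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    out = df.copy()
    out["cluster"] = kmeans.fit_predict(scaled)
    return out


# Synthetic stand-in data (NOT the project's dataset): three route profiles
rng = np.random.default_rng(0)
routes = pd.DataFrame({
    "passengers": np.concatenate([
        rng.normal(800, 50, 50),   # busy short-haul
        rng.normal(300, 50, 50),   # medium-haul
        rng.normal(150, 30, 50),   # thin long-haul
    ]),
    "nsmiles": np.concatenate([
        rng.normal(300, 40, 50),
        rng.normal(1000, 80, 50),
        rng.normal(2200, 100, 50),
    ]),
})

clustered = cluster_routes(routes)
summary = clustered.groupby("cluster")[["passengers", "nsmiles"]].mean()
print(summary)
```

K-Means assigns cluster numbers arbitrarily, so which label corresponds to regional or long-haul flights has to be read from the per-cluster means rather than assumed from the label itself.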

Let's Talk


© 2025. All rights Reserved.
