Rotman Datathon 2025: Cost of Living, Economic Development, and Supply Chain

Association:

Rotman School of Management

Duration:

7 days

Skills:

Data Science, Statistics, Python

2. Business Objectives

The participants' task is to build a data-driven model that examines the relationships between rising living costs, economic development, and supply chain performance. The end goal is to produce actionable insights that address the following questions:

2.1. Correlational Insights:

What patterns emerge between cost-of-living increases and supply chain disruptions? Are certain regions or industries more susceptible?

2.2. Supply Chain Resilience Modelling:

Which economic indicators most strongly impact supply chain costs? How can these insights guide predictions of supply chain challenges?

2.3. Strategic Recommendations:

What adaptations can businesses make to their supply chains to buffer against cost increases?

3. End Goal

Provide a set of recommendations for businesses and policymakers to help ensure that supply chains remain stable and cost-effective even as living costs and economic pressures rise.

4. Analytical Task

4.1. Data Analysis and Exploration:

Investigate the correlation between rising living costs and supply chain variables, identifying key regions or demographics most affected.

4.2. Predictive Modelling:

Create a model to forecast potential shifts in supply chain costs based on economic and living cost trends.

4.3. Insightful Recommendations:

Propose actionable strategies for businesses to adapt their supply chain management practices to withstand economic challenges.

5. Deliverables

5.1. Data Plan (included in the report):

Outline the methodology for merging and structuring data from multiple sources, identifying transformations for supply chain analysis.

5.2. Correlation Analysis and Predictive Model (included in the report):

Detailed documentation of findings from the correlation analysis and model outputs, highlighting supply chain implications.

5.3. Presentation:

A business-oriented summary of insights, focusing on actionable supply chain recommendations designed for stakeholders and policymakers.

Prioritize simple, interpretable models (e.g., regression analysis, time series forecasting) that effectively illustrate the relationships between economic factors and supply chain metrics.
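For illustration, a minimal sketch of such an interpretable baseline, assuming statsmodels is available and a tidy DataFrame df whose columns match the World Bank indicator names used later in this write-up:

import statsmodels.api as sm

# Simple, interpretable baseline: OLS of trade openness on a few economic drivers
cols = ['Trade', 'GDP per capita', 'Logistics performance index: Overall',
        'Cost to export, border compliance']
data = df[cols].dropna()

X = sm.add_constant(data[cols[1:]])  # predictors plus an intercept
y = data['Trade']                    # outcome: trade (% of GDP)

baseline = sm.OLS(y, X).fit()
print(baseline.summary())            # coefficients read directly as marginal effects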

Data Overview

The provided dataset contains 3,255 country-year rows covering 279 indicators for 217 countries over 15 years (2009-2023). The main challenges are limited resources and data availability: the timeline is short, and many important indicators have significant missing values.

Execution Plan

The study focuses on key metrics including logistics performance, trade costs, GDP per capita, and consumption patterns. The methodology includes, but is not limited to, selective data sampling based on relevance and completeness, strategic handling of missing data, in-depth descriptive and diagnostic analysis, segmentation, feature engineering, and predictive model building.
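As a concrete example of the selective-sampling step, the sketch below keeps only indicators above an assumed completeness threshold (the 60% cut-off is illustrative, not the exact value used):

# Keep indicators that are reasonably complete (threshold is illustrative)
completeness = df.notna().mean()                 # share of non-missing values per indicator
keep = completeness[completeness >= 0.60].index
df_selected = df[keep]
print(f"Kept {len(keep)} of {df.shape[1]} indicators")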

A snippet of the work is provided below. For the full view of the project, please access the Drive folder which contains the provided materials, Python file, report paper, and presentation slides.

Findings

Key findings include a strong positive correlation (0.75) between logistics performance and GDP per capita, significant regional disparities in logistics infrastructure and capabilities, and trade costs that remain relatively stable despite economic volatility. Regionally, Europe & Central Asia leads in trade (1.99 normalized) and GDP per capita (1.49), while Sub-Saharan Africa and South Asia face challenges with low logistics performance, high poverty rates, and low GDP per capita. North America demonstrates balanced, positive performance across metrics.

Recommendation

Businesses are advised to account for regional variations when planning their supply chain networks and to build redundancy through multi-regional sourcing. Policymakers, meanwhile, should focus on improving logistics infrastructure and policy, developing regional economic and logistics connectivity, and balancing economic growth with domestic market development.

Deliverables

The deliverables include a Python file containing the code and programming output, a 22-page report, and a summary presentation for project stakeholders.

Code Snippet

Impute Missing Data
# Imports used throughout the snippets below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from tabulate import tabulate
from linearmodels.panel import PanelOLS, RandomEffects

# Create a new binary column to indicate whether the value is missing
df['Logistics Missing'] = df['Logistics performance index: Overall'].isnull().astype(int)

# Group by country (index level 0) and count missing values
missing_by_country = df.groupby(level=0)['Logistics Missing'].sum()

# Identify countries with more than 9 missing values
countries_to_exclude = missing_by_country[missing_by_country > 9].index
print(countries_to_exclude)

# Filter the original DataFrame
df = df.loc[~df.index.get_level_values(0).isin(countries_to_exclude)]

# Drop the helper 'Logistics Missing' column
df = df.drop(columns=['Logistics Missing'])

# Impute remaining gaps within each country group (forward fill, then backfill)
df = df.groupby('Country Name', group_keys=False).apply(
    lambda group: group.ffill().bfill()
)
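A quick sanity check (not part of the original snippet): countries with no observations at all for an indicator still carry NaNs after the forward/backward fill, so it is worth quantifying what remains.

# Count missing LPI values that survive the country-wise ffill/bfill
remaining = df['Logistics performance index: Overall'].isnull().sum()
print(f"Remaining missing LPI values after imputation: {remaining}")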

The working DataFrame is indexed by country and year, with one column per indicator.
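A quick structural check of that frame (a sketch; the exact shape depends on how many countries survive the filtering above):

print(df.index.names)           # likely ['Country Name', 'Year'] or similar
print(df.shape)                 # (country-year rows, indicator columns)
print(df.columns[:5].tolist())  # first few indicator names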

Exploratory Data Analysis (EDA)

# Time Series Analysis

# Cost analysis
def plot_metrics_over_time(df):
    # Reset index to get country and year as columns
    df_reset = df.reset_index()
    df_reset.columns = ['Country', 'Year'] + list(df_reset.columns[2:])

    # Calculate mean values for each year
    yearly_means = df_reset.groupby('Year').agg({
        'Consumer price index': 'mean',
        'Cost to export, border compliance': 'mean',
        'Cost to import, border compliance': 'mean'
    }).reset_index()

    # Create the plot
    plt.figure(figsize=(12, 6))

    # Create multiple y-axes for different scales
    ax1 = plt.gca()
    ax2 = ax1.twinx()

    # Plot each metric
    line1 = ax1.plot(yearly_means['Year'],
                     yearly_means['Consumer price index'],
                     'b-',
                     label='Consumer Price Index',
                     linewidth=2)

    line2 = ax2.plot(yearly_means['Year'],
                     yearly_means['Cost to export, border compliance'],
                     'r-',
                     label='Export Costs',
                     linewidth=2)

    line3 = ax2.plot(yearly_means['Year'],
                     yearly_means['Cost to import, border compliance'],
                     'g-',
                     label='Import Costs',
                     linewidth=2)

    # Customize the plot
    ax1.set_xlabel('Year', fontsize=12)
    ax1.set_ylabel('Consumer Price Index', color='b', fontsize=12)
    ax2.set_ylabel('Costs (USD)', color='r', fontsize=12)

    # Combine legends from both axes
    lines = line1 + line2 + line3
    labels = [l.get_label() for l in lines]
    ax1.legend(lines, labels, loc='upper left', fontsize = 'x-small')

    # Add title
    plt.title('Trends in CPI, Export Costs, and Import Costs (2009-2023)',
             fontsize=14,
             pad=20)

    # Add grid for better readability
    ax1.grid(True, alpha=0.3)

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45)

    # Adjust layout to prevent label cutoff
    plt.tight_layout()

    # Show the plot
    plt.show()

    # Print some summary statistics
    print("\nSummary Statistics:")
    print("\nAverage values by year:")
    print(yearly_means.round(2).to_string(index=False))

    # Calculate overall change
    first_year = yearly_means.iloc[0]
    last_year = yearly_means.iloc[-1]

    print("\nOverall Changes (2009-2023):")
    print(f"CPI Change: {((last_year['Consumer price index'] / first_year['Consumer price index']) - 1) * 100:.1f}%")
    print(f"Export Costs Change: {((last_year['Cost to export, border compliance'] / first_year['Cost to export, border compliance']) - 1) * 100:.1f}%")
    print(f"Import Costs Change: {((last_year['Cost to import, border compliance'] / first_year['Cost to import, border compliance']) - 1) * 100:.1f}%")

# Execute function
plot_metrics_over_time(df)

# Correlation Matrix

# Correlation matrix
def analyze_correlation_economic(df):

    # Reset index to make country and year accessible columns
    df_reset = df.reset_index()
    df_reset.columns = ['Country Name', 'Year'] + list(df_reset.columns[2:])

    # Create relevant metric correlations
    economic_supply_metrics = [
        'Consumer price index',
        'Households and NPISHs final consumption expenditure',
        'Inflation, consumer prices',
        'GDP per capita',
        'Gini index',
        'GNI',
        'Foreign direct investment, net inflows',
        'General government final consumption expenditure',
        'Labor force participation rate, total',
        'Population density',
        'Air transport, freight',
        'Cost to export, border compliance',
        'Cost to import, border compliance',
        'Exports of goods and services',
        'Imports of goods and services',
        'Logistics performance index: Overall',
        'Trade'
    ]

    # Calculate correlation matrix
    correlation_matrix = df_reset[economic_supply_metrics].corr()

    # Visualization: correlation heatmap
    plt.figure(figsize=(16, 12))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
    plt.title('Correlation between Economic and Supply Chain Metrics')
    plt.tight_layout()
    plt.show()

# Execute function
analyze_correlation_economic(df)
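The heatmap is dense, so a complementary sketch: rank the strongest pairwise correlations directly. It recomputes the matrix on a subset of the same metrics, since the function above does not return it.

metrics = ['Consumer price index', 'GDP per capita',
           'Logistics performance index: Overall',
           'Cost to export, border compliance',
           'Cost to import, border compliance', 'Trade']
corr = df[metrics].corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack().sort_values(key=abs, ascending=False)
print(pairs.head(10))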
# Correlation analysis for LPI
def analyze_logistics_relationships(df):
    # Reset index to get country and year as columns
    df_reset = df.reset_index()
    df_reset.columns = ['Country Name', 'Year'] + list(df_reset.columns[2:])

    # Create subplots
    fig, axes = plt.subplots(3, 2, figsize=(15, 12))
    fig.suptitle('Key Relationships with Logistics Performance Index', fontsize=16, y=1.02)

    # Plot 1: LPI vs Air Transport
    sns.scatterplot(data=df_reset,
                   x='Logistics performance index: Overall',
                   y='Air transport, freight',
                   ax=axes[0,0])
    axes[0,0].set_title('LPI vs Air Transport\nCorrelation: 0.43')

    # Plot 2: LPI vs Export Costs
    sns.scatterplot(data=df_reset,
                   x='Logistics performance index: Overall',
                   y='Cost to export, border compliance',
                   ax=axes[0,1])
    axes[0,1].set_title('LPI vs Export Costs\nCorrelation: -0.44')

    # Plot 3: LPI vs Import Costs
    sns.scatterplot(data=df_reset,
                   x='Logistics performance index: Overall',
                   y='Cost to import, border compliance',
                   ax=axes[1,0])
    axes[1,0].set_title('LPI vs Import Costs\nCorrelation: -0.48')

    # Plot 4: LPI vs Trade
    sns.scatterplot(data=df_reset,
                   x='Logistics performance index: Overall',
                   y='Trade',
                   ax=axes[1,1])
    axes[1,1].set_title('LPI vs Trade\nCorrelation: 0.29')

    # Plot 5: LPI vs GDP per capita
    sns.scatterplot(data=df_reset,
                   x='Logistics performance index: Overall',
                   y='GDP per capita',
                   ax=axes[2,0])
    axes[2,0].set_title('LPI vs GDP per capita\nCorrelation: 0.75')

    # Plot 6: LPI vs Households Consumption Expenditure
    sns.scatterplot(data=df_reset,
                   x='Logistics performance index: Overall',
                   y='Households and NPISHs final consumption expenditure',
                   ax=axes[2,1])
    axes[2,1].set_title('LPI vs Household spending\nCorrelation: -0.46')

    # Adjust layout
    plt.tight_layout()

    # Add trend lines to each plot (drop missing values before fitting)
    for ax in axes.flat:
        pair = df_reset[[ax.get_xlabel(), ax.get_ylabel()]].dropna()
        z = np.polyfit(pair.iloc[:, 0], pair.iloc[:, 1], 1)
        p = np.poly1d(z)
        ax.plot(pair.iloc[:, 0], p(pair.iloc[:, 0]), "r--", alpha=0.8)

    plt.show()

    # Calculate additional statistics
    print("\nDetailed Statistics for Each Relationship:")

    relationships = [
        ('Air transport, freight', 0.43),
        ('Cost to export, border compliance', -0.44),
        ('Cost to import, border compliance', -0.48),
        ('Trade', 0.29),
        ('GDP per capita', 0.75),
        ('Households and NPISHs final consumption expenditure', -0.46)
    ]

    for var, corr in relationships:
        data = df_reset[[var, 'Logistics performance index: Overall']].dropna()
        slope, intercept, r_value, p_value, std_err = stats.linregress(
            data['Logistics performance index: Overall'],
            data[var]
        )

        print(f"\n{var}:")
        print(f"Correlation: {corr:.3f}")
        print(f"R-squared: {r_value**2:.3f}")
        print(f"P-value: {p_value:.3e}")

# Run the analysis
analyze_logistics_relationships(df)

# Segmentation

def gdp_segmentation(df):
    # Reset index to get country and year as columns
    df_reset = df.reset_index()
    df_reset.columns = ['Country Name', 'Year'] + list(df_reset.columns[2:])

    # Segment countries by GDP per capita
    df_reset['GDP_Category'] = pd.qcut(df_reset['GDP per capita'],
                                     q=3,
                                     labels=['Low GDP', 'Medium GDP', 'High GDP'])

    # Analysis by GDP segment
    gdp_segments = []

    for gdp_cat in df_reset['GDP_Category'].unique():
        segment_data = df_reset[df_reset['GDP_Category'] == gdp_cat]

        # Calculate correlations for this segment
        # Handle potential missing values
        mask = ~(pd.isna(segment_data['Consumer price index']) | pd.isna(segment_data['Trade']))
        if mask.sum() > 1:  # Need at least 2 points for correlation
            corr = np.corrcoef(
                segment_data.loc[mask, 'Consumer price index'],
                segment_data.loc[mask, 'Trade']
            )[0,1]
        else:
            corr = np.nan

        # Calculate mean values with handling for missing values
        mean_values = {
            'Mean_CPI': segment_data['Consumer price index'].mean(),
            'Mean_FDI': segment_data['Foreign direct investment, net inflows'].mean(),
            'Mean_Air_Transport': segment_data['Air transport, freight'].mean(),
            'Mean_Export_Cost': segment_data['Cost to export, border compliance'].mean(),
            'Mean_Import_Cost': segment_data['Cost to import, border compliance'].mean(),
            'Mean_Export': segment_data['Exports of goods and services'].mean(),
            'Mean_Import': segment_data['Imports of goods and services'].mean(),
            'Mean_LPI': segment_data['Logistics performance index: Overall'].mean(),
            'Mean_Trade': segment_data['Trade'].mean()
        }

        gdp_segments.append({
            'GDP_Category': gdp_cat,
            'CPI_Trade_Correlation': corr,
            **mean_values  # Include all mean values
        })

    return {'gdp_segments': gdp_segments}

def print_insights(results):
    print("\nInsights by GDP Category:")
    for segment in results['gdp_segments']:
        print(f"\n{segment['GDP_Category']}:")
        print(f"CPI-Trade Correlation: {segment['CPI_Trade_Correlation']:.3f}")
        print(f"Average CPI: {segment['Mean_CPI']:.2f}")
        print(f"Average FDI: {segment['Mean_FDI']:,.2f}% of GDP")
        print(f"Average Air Transport: {segment['Mean_Air_Transport']:,.2f} million ton-km")
        print(f"Average Export Cost: ${segment['Mean_Export_Cost']:.2f}")
        print(f"Average Import Cost: ${segment['Mean_Import_Cost']:.2f}")
        print(f"Average Exports: {segment['Mean_Export']:,.2f}% of GDP")
        print(f"Average Imports: {segment['Mean_Import']:,.2f}% of GDP")
        print(f"Average Logistics Performance Index: {segment['Mean_LPI']:.2f}")
        print(f"Average Trade: {segment['Mean_Trade']:,.2f}% of GDP")

# Execute function
results = gdp_segmentation(df)
print_insights(results)
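The same insights can also be viewed as a single table (a small convenience sketch, since gdp_segments is a list of dictionaries):

segments_df = pd.DataFrame(results['gdp_segments']).set_index('GDP_Category')
print(segments_df.round(2))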
# Volatility analysis     
def volatility_analysis(df):
    # Reset index to get country and year as columns
    df_reset = df.reset_index()
    df_reset.columns = ['Country Name', 'Year'] + list(df_reset.columns[2:])
    
    df_reset['CPI_Volatility'] = df_reset.groupby('Country Name')['Consumer price index'].transform(lambda x: x.std())
    df_reset['Household_Expenditure_Volatility'] = df_reset.groupby('Country Name')['Households and NPISHs Final consumption expenditure, PPP'].transform(lambda x: x.std())
    
    # Find countries with highest volatility
    top_cpi = df_reset.groupby('Country Name').agg({
        'CPI_Volatility': 'first',
        'Household_Expenditure_Volatility': 'first'
    }).sort_values('CPI_Volatility', ascending=False).head(5)
    
    top_expenditure = df_reset.groupby('Country Name').agg({
        'CPI_Volatility': 'first',
        'Household_Expenditure_Volatility': 'first'
    }).sort_values('Household_Expenditure_Volatility', ascending=False).head(5)
    
    # Get the list of volatile countries
    volatile_cpi_countries = top_cpi.index.tolist()
    volatile_exp_countries = top_expenditure.index.tolist()
    
    # Filter data for volatile countries
    cpi_volatile_data = df_reset[df_reset['Country Name'].isin(volatile_cpi_countries)]
    exp_volatile_data = df_reset[df_reset['Country Name'].isin(volatile_exp_countries)]
    
    # Create visualizations
    sns.set_style("whitegrid")
    
    # 1. Air Freight Trends
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 12))
    
    # Plot for CPI volatile countries
    for country in volatile_cpi_countries:
        country_data = cpi_volatile_data[cpi_volatile_data['Country Name'] == country]
        ax1.plot(country_data['Year'], country_data['Air transport, freight'], 
                marker='o', label=country)
    
    ax1.set_title('Air Freight Trends - Countries with High CPI Volatility')
    ax1.set_xlabel('Year')
    ax1.set_ylabel('Air Transport Freight')
    ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax1.grid(True)
    
    # Plot for Expenditure volatile countries
    for country in volatile_exp_countries:
        country_data = exp_volatile_data[exp_volatile_data['Country Name'] == country]
        ax2.plot(country_data['Year'], country_data['Air transport, freight'], 
                marker='o', label=country)
    
    ax2.set_title('Air Freight Trends - Countries with High Expenditure Volatility')
    ax2.set_xlabel('Year')
    ax2.set_ylabel('Air Transport Freight')
    ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # 2. Trade Trends
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 12))
    
    # Plot for CPI volatile countries
    for country in volatile_cpi_countries:
        country_data = cpi_volatile_data[cpi_volatile_data['Country Name'] == country]
        ax1.plot(country_data['Year'], country_data['Trade'], 
                marker='o', label=country)
    
    ax1.set_title('Trade Trends - Countries with High CPI Volatility')
    ax1.set_xlabel('Year')
    ax1.set_ylabel('Trade (% of GDP)')
    ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax1.grid(True)
    
    # Plot for Expenditure volatile countries
    for country in volatile_exp_countries:
        country_data = exp_volatile_data[exp_volatile_data['Country Name'] == country]
        ax2.plot(country_data['Year'], country_data['Trade'], 
                marker='o', label=country)
    
    ax2.set_title('Trade Trends - Countries with High Expenditure Volatility')
    ax2.set_xlabel('Year')
    ax2.set_ylabel('Trade (% of GDP)')
    ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # Print volatility rankings
    print("\nTop 5 Countries with Highest CPI Volatility:")
    print(top_cpi)
    
    print("\nTop 5 Countries with Highest Household Expenditure Volatility:")
    print(top_expenditure)
    
    # Calculate summary statistics for volatile countries
    print("\nSummary Statistics for Volatile Countries:")
    
    print("\nCPI Volatile Countries - Average Trade and Air Freight:")
    print(cpi_volatile_data.groupby('Country Name').agg({
        'Trade': 'mean',
        'Air transport, freight': 'mean'
    }).round(2))
    
    print("\nExpenditure Volatile Countries - Average Trade and Air Freight:")
    print(exp_volatile_data.groupby('Country Name').agg({
        'Trade': 'mean',
        'Air transport, freight': 'mean'
    }).round(2))
    
    return df_reset

volatile_data = volatility_analysis(df)
def regional_patterns(df):
    # Reset index to get country and year as columns
    df_reset = df.reset_index()
    df_reset.columns = ['Country Name', 'Year'] + list(df_reset.columns[2:])

    # Define regions and their countries
    region_mapping = {
        'Europe & Central Asia': ['Luxembourg', 'Norway', 'Switzerland', 'Ireland', 'Turkiye', 'Russian Federation'],
        'Middle East & North Africa': ['Qatar', 'Syrian Arab Republic', 'United Arab Emirates', 'Iraq'],
        'Sub-Saharan Africa': ['Congo, Dem. Rep.', 'Madagascar', 'Liberia', 'Guinea-Bissau', 'Sudan', 'Angola'],
        'South Asia': ['Afghanistan', 'India'],
        'East Asia & Pacific': ['China', 'Vietnam', 'Australia', 'Indonesia'],
        'North America': ['United States', 'Canada'],
        'Latin America & Caribbean': ['Venezuela, RB', 'Chile', 'Costa Rica']
    }

    # Create reverse mapping from country to region
    country_to_region = {country: region for region, countries in region_mapping.items()
                        for country in countries}

    # Filter for only specified countries and assign regions
    df_reset = df_reset[df_reset['Country Name'].isin(country_to_region.keys())]
    df_reset['Region'] = df_reset['Country Name'].map(country_to_region)

    # Calculate regional patterns
    regional_patterns = df_reset.groupby('Region').agg({
        'Consumer price index': 'mean',
        'GDP per capita': 'mean',
        'Gini index': 'mean',
        'GNI per capita, PPP': 'mean',
        'Foreign direct investment, net inflows': 'mean',
        'General government final consumption expenditure': 'mean',
        'Employment to population ratio, 15+, total': 'mean',
        'Poverty headcount ratio at national poverty lines': 'mean',
        'Air transport, freight': 'mean',
        'Cost to export, border compliance': 'mean',
        'Cost to import, border compliance': 'mean',
        'Logistics performance index: Overall': 'mean',
        'Trade': 'mean'
    }).round(2)

    # Create visualizations
    create_regional_visualizations(df_reset)

    return regional_patterns

def create_regional_visualizations(df):
    sns.set_style("whitegrid")

    # 1.1 GDP per capita by Region (Box Plot)
    plt.figure(figsize=(12, 6))
    sns.boxplot(x='Region', y='GDP per capita', data=df)
    plt.xticks(rotation=45, ha='right')
    plt.title('GDP per Capita Distribution by Region')
    plt.tight_layout()
    plt.show()

    # 1.2 Household spending by Region (Box Plot)
    plt.figure(figsize=(12, 6))
    sns.boxplot(x='Region', y='Households and NPISHs Final consumption expenditure, PPP', data=df)
    plt.xticks(rotation=45, ha='right')
    plt.title('Consumption Expenditure Distribution by Region')
    plt.tight_layout()
    plt.show()

    # 2. Development Indicators Heatmap
    indicators = ['GDP per capita', 'Gini index', 'Trade',
                 'Employment to population ratio, 15+, total',
                 'Logistics performance index: Overall']

    pivot_data = df.pivot_table(
        index='Region',
        values=indicators,
        aggfunc='mean'
    )

    # Normalize the data for better visualization
    normalized_data = (pivot_data - pivot_data.mean()) / pivot_data.std()

    plt.figure(figsize=(12, 8))
    sns.heatmap(normalized_data, annot=True, cmap='RdYlBu', center=0, fmt='.2f')
    plt.title('Regional Development Indicators (Normalized Values)')
    plt.tight_layout()
    plt.show()

    # 3. Trade and Economic Performance
    plt.figure(figsize=(12, 6))
    sns.scatterplot(data=df, x='Trade', y='GDP per capita', hue='Region',
                   alpha=0.6, size='Logistics performance index: Overall',
                   sizes=(50, 400))
    plt.title('Trade vs GDP per Capita by Region')
    plt.xlabel('Trade')
    plt.ylabel('GDP per Capita')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()

def print_regional_insights(regional_patterns):
    print("\nRegional Economic Insights:")

    # Create a summary table with key metrics
    summary_table = pd.DataFrame({
        'GDP per capita': regional_patterns['GDP per capita'].map('${:,.0f}'.format),
        'Trade (% of GDP)': regional_patterns['Trade'].map('{:.1f}%'.format),
        'Employment Ratio (%)': regional_patterns['Employment to population ratio, 15+, total'].map('{:.1f}%'.format),
        'Logistics Score': regional_patterns['Logistics performance index: Overall'].map('{:.2f}'.format),
        'Gini Index': regional_patterns['Gini index'].map('{:.1f}'.format)
    })

    print("\nRegional Economic Summary:")
    print(tabulate(summary_table, headers='keys', tablefmt='pretty'))

    # Detailed insights by region
    print("\nDetailed Regional Insights:")
    for region in regional_patterns.index:
        print(f"\n{region}:")
        print(f"GDP per capita: ${regional_patterns.loc[region, 'GDP per capita']:,.2f}")
        print(f"Gini Index: {regional_patterns.loc[region, 'Gini index']:.1f}")
        print(f"GNI per capita (PPP): ${regional_patterns.loc[region, 'GNI per capita, PPP']:,.2f}")
        print(f"FDI (% of GDP): {regional_patterns.loc[region, 'Foreign direct investment, net inflows']:.1f}%")
        print(f"Consumption expenditure: ${regional_patterns.loc[region, 'General government final consumption expenditure']:,.2f}")
        print(f"Employment ratio: {regional_patterns.loc[region, 'Employment to population ratio, 15+, total']:.1f}%")
        print(f"Poverty headcount ratio: {regional_patterns.loc[region, 'Poverty headcount ratio at national poverty lines']:.1f}%")
        print(f"Air transport: {regional_patterns.loc[region, 'Air transport, freight']:,.2f} million ton-km")
        print(f"Trade (% of GDP): {regional_patterns.loc[region, 'Trade']:.1f}%")
        print(f"Logistics performance: {regional_patterns.loc[region, 'Logistics performance index: Overall']:.2f}")

patterns = regional_patterns(df)
print_regional_insights(patterns)
Feature Engineering
def calculate_wa_cost(df):
    """
    Calculate weighted average for cost of living:
    1. CPI changes
    2. Household consumption expenditure

    Calculate weighted average for cost to trade:
    1. Cost to export
    2. Cost to import
    """
    # Reset index
    df_reset = df.reset_index()
    df_reset.columns = ['Country Name', 'Year'] + list(df_reset.columns[2:])

    # Calculate year-over-year changes for each component
    df_reset['CPI_Change'] = df_reset.groupby('Country Name')['Consumer price index'].pct_change() * 100
    df_reset['Consumption_Change'] = df_reset.groupby('Country Name')['Households and NPISHs final consumption expenditure'].pct_change() * 100
    df_reset['WA_CoL'] = (
        0.5 * df_reset['CPI_Change'] +  # Base CPI change
        0.5 * df_reset['Consumption_Change']  # Consumption patterns
    )

    # Calculate supply chain impact metrics (in percentage)
    df_reset['Export_Cost_Change'] = df_reset.groupby('Country Name')['Cost to export, border compliance'].pct_change() * 100
    df_reset['Import_Cost_Change'] = df_reset.groupby('Country Name')['Cost to import, border compliance'].pct_change() * 100
    df_reset['WA_CoT'] = (
        0.5 * df_reset['Export_Cost_Change'] +  # Export costs
        0.5 * df_reset['Import_Cost_Change']  # Import costs
    )

    # Create visualizations
    fig, axes = plt.subplots(2, 1, figsize=(15, 12))

    # Plot 1: Cost of Living Components Over Time
    yearly_components = df_reset.groupby('Year').agg({
        'CPI_Change': 'mean',
        'Consumption_Change': 'mean',
        'Inflation, consumer prices': 'mean',
        'WA_CoL': 'mean'
    })

    axes[0].plot(yearly_components.index, yearly_components['WA_CoL'], 'b-', label='Cost of Living', linewidth=3)
    axes[0].plot(yearly_components.index, yearly_components['CPI_Change'], 'r--', label='CPI Change', linewidth=2)
    axes[0].plot(yearly_components.index, yearly_components['Consumption_Change'], 'g--', label='Consumption Change', linewidth=2)
    axes[0].plot(yearly_components.index, yearly_components['Inflation, consumer prices'], 'y--', label='Inflation', linewidth=2)
    axes[0].set_title('Cost of Living and Its Components Over Time')
    axes[0].legend()
    axes[0].grid(True)
    axes[0].set_ylabel('Percentage Change (%)')

    # Plot 2: Cost of Living vs Supply Chain Costs
    yearly_metrics = df_reset.groupby('Year').agg({
        'WA_CoL': 'mean',
        'Export_Cost_Change': 'mean',
        'Import_Cost_Change': 'mean',
        'WA_CoT': 'mean'
    })

    axes[1].plot(yearly_metrics.index, yearly_metrics['WA_CoL'], 'b-', label='Cost of Living', linewidth=3)
    axes[1].plot(yearly_metrics.index, yearly_metrics['WA_CoT'], 'r-', label='Cost of Trade', linewidth=3)
    axes[1].plot(yearly_metrics.index, yearly_metrics['Export_Cost_Change'], 'r--', label='Export Cost Change', linewidth=2)
    axes[1].plot(yearly_metrics.index, yearly_metrics['Import_Cost_Change'], 'g--', label='Import Cost Change', linewidth=2)
    axes[1].set_title('Cost of Living vs Supply Chain Cost Changes')
    axes[1].legend()
    axes[1].grid(True)
    axes[1].set_ylabel('Percentage Change (%)')

    plt.tight_layout()
    plt.show()

    # Calculate correlations
    cost_correlations = df_reset[[
        'WA_CoL',
        'Air transport, freight',
        'Export_Cost_Change',
        'Import_Cost_Change',
        'Logistics performance index: Overall',
        'Trade',
        'WA_CoT'
    ]].corr()[['WA_CoL', 'WA_CoT']]

    # Group analysis by Cost of Living quartiles
    df_reset['CoL_Quartile'] = pd.qcut(df_reset['WA_CoL'].fillna(0), q=4, labels=['Low', 'Medium-Low', 'Medium-High', 'High'])

    quartile_analysis = df_reset.groupby('CoL_Quartile').agg({
        'Air transport, freight': 'mean',
        'Cost to export, border compliance': 'mean',
        'Cost to import, border compliance': 'mean',
        'Logistics performance index: Overall': 'mean',
        'WA_CoL': ['mean', 'count'],  # Include count to see sample size
        'WA_CoT': 'mean'
    }).round(2)

    print("\nCorrelations with Cost Measures:")
    print(cost_correlations.round(3))

    print("\nSupply Chain Metrics by Cost of Living Quartile:")
    print(quartile_analysis)

    # Additional statistics
    print("\nAverage Cost Measures by Year:")
    print(yearly_metrics[['WA_CoL', 'WA_CoT']].round(2))

    return df_reset

df_with_costs = calculate_wa_cost(df)
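In equation form, the engineered features computed above are, for country c in year t (with Δ% denoting year-over-year percentage change):

$$\mathrm{WA\_CoL}_{c,t} = 0.5\,\Delta\%\,\mathrm{CPI}_{c,t} + 0.5\,\Delta\%\,\mathrm{Consumption}_{c,t}$$
$$\mathrm{WA\_CoT}_{c,t} = 0.5\,\Delta\%\,\mathrm{ExportCost}_{c,t} + 0.5\,\Delta\%\,\mathrm{ImportCost}_{c,t}$$

Both components are weighted equally by design, matching the 0.5 coefficients applied to the pct_change() values in the code.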
Regression Model
def predict_trade_ability(df):
    """
    Create panel data model to predict trade ability (Trade % of GDP)
    """
    # Reset index and prepare panel data
    df_reset = df.reset_index()
    df_reset.columns = ['Country', 'Year'] + list(df_reset.columns[2:])
    
    # Prepare panel data
    df_panel = df_reset.set_index(['Country', 'Year'])
    
    # Dependent variable: Trade (% of GDP)
    Y = df_panel['Trade']
    
    # Independent variables selected for trade prediction
    X = df_panel[[
        'GDP per capita',  # Economic development
        'Air transport, freight',  # Transportation capacity
        'Cost to export, border compliance',  # Trade barriers
        'Cost to import, border compliance',  # Trade barriers
        'Logistics performance index: Overall',  # Infrastructure quality
        'Foreign direct investment, net inflows',  # International investment
        'Employment to population ratio, 15+, total',  # Labor market
        'Consumer price index'  # Domestic prices
    ]]
    
    # Handle missing values
    X = X.fillna(X.mean())
    Y = Y.fillna(Y.mean())
    
    # Fixed Effects Model
    model_fe = PanelOLS(Y, X, entity_effects=True)
    fe_results = model_fe.fit()
    
    # Random Effects Model (RandomEffects estimator from linearmodels)
    model_re = RandomEffects(Y, X)
    re_results = model_re.fit()
    
    # Hausman Test
    def hausman(fe_res, re_res):
        b_diff = fe_res.params - re_res.params
        cov_diff = fe_res.cov - re_res.cov
        stat = np.dot(b_diff.T, np.linalg.solve(cov_diff, b_diff))
        p_value = 1 - stats.chi2.cdf(stat, len(b_diff))
        return stat, p_value
    
    hausman_stat, hausman_p_value = hausman(fe_results, re_results)
    
    # Print results
    print("\nFixed Effects Model Results:")
    print(fe_results)
    
    print("\nRandom Effects Model Results:")
    print(re_results)
    
    print(f"\nHausman Test Statistic: {hausman_stat:.2f}, P-value: {hausman_p_value:.4f}")
    
    # Choose model based on Hausman test
    chosen_model = fe_results if hausman_p_value < 0.05 else re_results
    
    # Add predictions to the panel data
    df_panel['Predicted_Trade'] = chosen_model.fitted_values
    
    # Visualizations
    # 1. Actual vs Predicted
    plt.figure(figsize=(12, 6))
    plt.scatter(df_panel['Trade'], df_panel['Predicted_Trade'], alpha=0.5)
    plt.plot([df_panel['Trade'].min(), df_panel['Trade'].max()], 
             [df_panel['Trade'].min(), df_panel['Trade'].max()], 
             'r--', lw=2)
    plt.xlabel('Actual Trade (% of GDP)')
    plt.ylabel('Predicted Trade (% of GDP)')
    plt.title('Actual vs Predicted Trade Openness')
    plt.grid(True)
    plt.show()
    
    # 2. Feature Importance
    plt.figure(figsize=(12, 6))
    coefficients = pd.Series(chosen_model.params)
    coefficients = coefficients.sort_values(ascending=True)
    coefficients.plot(kind='bar')
    plt.title('Feature Importance in Trade Prediction')
    plt.xlabel('Features')
    plt.ylabel('Coefficient Value')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # 3. Trade by Logistics Performance
    plt.figure(figsize=(10, 6))
    df_panel['LPI_Quartile'] = pd.qcut(df_panel['Logistics performance index: Overall'], 
                                      4, labels=['Low', 'Medium-Low', 'Medium-High', 'High'])
    sns.boxplot(x='LPI_Quartile', y='Trade', data=df_panel.reset_index())
    plt.title('Trade Distribution by Logistics Performance')
    plt.xlabel('Logistics Performance Index Quartile')
    plt.ylabel('Trade (% of GDP)')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    return df_panel, chosen_model

panel_data, model = predict_trade_ability(df)
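The model choice above hinges on the Hausman statistic implemented in the hausman() helper, which in matrix form is

$$H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})^{\top}\left[\widehat{\operatorname{Var}}(\hat{\beta}_{FE}) - \widehat{\operatorname{Var}}(\hat{\beta}_{RE})\right]^{-1}(\hat{\beta}_{FE} - \hat{\beta}_{RE}).$$

Under the null hypothesis that the random effects estimator is consistent, H is asymptotically chi-squared with degrees of freedom equal to the number of coefficients; a small p-value favours the fixed effects specification, which is exactly the decision rule applied in the code.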

© 2025. All rights reserved.