Studying Canadian Internet Use: Cyber Security Incidents and Internet Activities

Association:

iSchool, University of Toronto

Duration:

1 month

Data Science

Statistics

R

For the final project of INF1344-Introduction to Statistics for Data Science, our group decided to conduct an exploratory study about the Canadian internet landscape regarding cyber security and internet usage patterns. Our study aims to answer two main questions: ‘How does susceptibility to cyber security incidents and internet usage patterns vary among different demographic groups in Canada?’ and ‘Is there any relationship between how frequently people use the internet and the likelihood of encountering an internet incident?’

Our research project uses data sets from the Canadian Internet Use Survey 2022, provided by Statistics Canada. The data cover 2018 to 2022 and describe many aspects of internet use among Canadians aged 15 years and older. The survey’s sample size is approximately 55,700 household members “...based on a stratified design employing probability sampling” (Statistics Canada, 2023). The independent variables of interest are age, gender, and level of education, which we use to examine incident and usage rates.

Descriptive Statistics

Summary statistics and visual inspection gave initial insights into overall trends in internet usage and cyber security incidents across demographic groups and time. Notably, the average percentage of cyber security incidents increased steadily across all age groups from 2018 to 2022, with younger adults (15 to 44 years old) consistently reporting the highest overall percentages. Meanwhile, activities such as email, instant messaging, and online banking consistently top the chart, although usage differs somewhat between demographic groups.

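# Packages used throughout: dplyr for the pipelines, ggplot2 for the plots
library(dplyr)
library(ggplot2)

# Heatmap of incident percentages by age group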
incident_heatmap <- incident_clean %>%
  group_by(age, incident) %>%
  summarise(average_percentage = mean(percentage, na.rm = TRUE), .groups = 'drop') %>%
  ggplot(aes(x = age, y = incident, fill = average_percentage)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Cyber Incidents by Age Group",
       x = "Age Group", 
       y = "Incident Type",
       fill = "Avg Percentage")
print(incident_heatmap)
# Grouped bar chart of average incident rate by age groups and years
ggplot(incident_summary, aes(x = factor(year), y = Mean_Percentage, fill = age)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Cyber Security Incidents by Age Group and Year",
    x = "Year",
    y = "Mean Percentage",
    fill = "Age Group"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 14))
# Trend analysis over years
trend_analysis <- incident_clean %>%
  group_by(year, category) %>%
  summarise(
    mean_percentage = mean(percentage, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = year, y = mean_percentage, color = category)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right") +
  labs(title = "Incident Trends Over Time",
       x = "Year",
       y = "Average Percentage")
# Time trend for top activities
ggplot(all_years, aes(x = year, y = avg_usage, color = activity)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text.x = element_text(angle = 0),
    plot.title = element_text(size = 12, face = "bold")
  ) +
  labs(
    title = "Top Internet Activities by Year (2018-2022)",
    x = "Year",
    y = "Average Usage Percentage",
    color = "Activity"
  ) +
  scale_y_continuous(limits = c(0, 100))
# Age Impact on Internet Activity
age_impact <- activity_clean %>%
  group_by(age_group, activity) %>%
  summarise(avg_usage = mean(percentage, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(age_group, avg_usage), y = avg_usage)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal() +
  labs(title = "Internet Usage by Age Group",
       x = "Age Group",
       y = "Average Usage Percentage")
# Education Level Impact on Internet Activity
education_impact <- activity_clean %>%
  group_by(education, activity) %>%
  summarise(avg_usage = mean(percentage, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(education, avg_usage), y = avg_usage)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal() +
  labs(title = "Internet Usage by Education Level",
       x = "Education Level",
       y = "Average Usage Percentage")
# Top 3 activities by gender
top_activity_by_gender <- activity_clean %>%
  group_by(gender, activity) %>%
  summarise(avg_percentage = mean(percentage, na.rm = TRUE)) %>%
  group_by(gender) %>%
  slice_max(order_by = avg_percentage, n = 3) %>%
  arrange(gender, desc(avg_percentage))

## # A tibble: 6 × 3
## # Groups:   gender [2]
##   gender  activity                                                avg_percentage
##   <chr>   <chr>                                                            <dbl>

Inferential Statistics

The inferential analysis (ANOVA, t-tests, ANCOVA, correlation analysis, and linear regression) provides the statistical foundation for the relationship between internet usage and cybersecurity incidents across the demographic groups described above.

  • To test whether the dependent variable (percentage) differs across age groups while controlling for incident type, we used Analysis of Covariance (ANCOVA). The F-value for the age variable is 15.856 and the p-value is 1.951e-08, far below the conventional threshold of 0.05, so there is strong statistical evidence to reject the null hypothesis of equal means. Average incident percentages therefore differ significantly across age groups.

  • To confirm the differences in internet usage between demographic groups, we ran hypothesis tests. The ANOVA tests show a highly significant effect of both age group (F(3, 1730) = 101.8, p < 0.001) and education level (F(2, 1731) = 45.57, p < 0.001) on internet activity usage, giving strong statistical evidence to reject the null hypothesis in both cases. This indicates that individuals of different age groups and with different education levels engage in internet activities at varying rates. By contrast, the t-test for gender differences in internet activity usage reveals no significant difference (t = 0.378, p = 0.705), so there is no statistical evidence to reject the null hypothesis. The mean usage percentages for men (48.78%) and women (48.24%) are nearly identical, suggesting no gender-based disparity in internet activity engagement.

  • Although the t-test shows no difference between the two genders, gender does show a statistically significant effect once the other variables (age_group, activity, education) are included in the model as covariates. This ANCOVA models the variance more completely and suggests that gender's effect is subtle but detectable once other factors are accounted for. Age group, education level, and type of internet activity are also statistically significant predictors of activity percentage.
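
The tests in the list above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the survey data: the data frame `toy`, its values, and the built-in effect sizes are all assumptions made for the example, and the column names only mirror those mentioned in the text.

```r
# Simulated stand-in for the cleaned activity table (hypothetical values).
set.seed(1344)
toy <- expand.grid(
  age_group = c("15 to 24", "25 to 44", "45 to 64", "65+"),
  gender    = c("Men", "Women"),
  activity  = c("Email", "Banking", "Streaming"),
  rep       = 1:5,
  stringsAsFactors = TRUE
)

# Usage falls with age and differs by activity; gender has no built-in effect.
age_shift      <- c(20, 15, 5, 0)[as.integer(toy$age_group)]
activity_shift <- c(30, 20, 0)[as.integer(toy$activity)]
toy$percentage <- 40 + age_shift + activity_shift + rnorm(nrow(toy), sd = 5)

# One-way ANOVA: does mean usage differ across age groups?
age_anova <- aov(percentage ~ age_group, data = toy)
print(summary(age_anova))

# Welch two-sample t-test: does mean usage differ between the two genders?
gender_t <- t.test(percentage ~ gender, data = toy)
print(gender_t)

# Adjusted model: the gender term is tested after age group and activity.
adjusted <- aov(percentage ~ age_group + activity + gender, data = toy)
print(summary(adjusted))
```

Comparing the gender p-value from `gender_t` with the gender row of `summary(adjusted)` shows how the same factor can be assessed with and without adjustment. In the adjusted model the residual variance shrinks because age group and activity absorb much of it, which is one reason a small effect can become detectable.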


To relate the two focal quantities of this study, internet incidents and internet activity, we calculated the Pearson correlation between average internet usage (avg_usage) and mean incident percentage (mean_occurrence). The coefficient of 0.729 indicates a fairly strong positive linear relationship: age group and year combinations with higher average usage also tend to report a higher proportion of incidents. A correlation coefficient is unitless and is not a slope. It implies that a one-standard-deviation increase in average usage is associated with roughly a 0.73-standard-deviation increase in mean incident percentage. Because the estimate rests on 12 aggregated observations, it describes an association rather than a causal effect. Even so, the finding offers useful insight into the relationship between internet activity and vulnerability to cyber incidents. For instance, while younger age groups tend to experience more frequent security incidents, this could be attributed to their significantly higher internet usage. Conversely, older age groups, despite being considered less tech-savvy, may experience fewer security incidents partly because they engage less with the internet overall. Limited internet usage reduces their exposure to potential threats such as phishing, malware, and fraud.
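
The difference between a correlation coefficient and a regression slope can be checked with a short sketch. The two vectors below are hypothetical values standing in for avg_usage and mean_occurrence. They are not the study's data.

```r
# Hypothetical values (assumptions for illustration only).
usage     <- c(40, 45, 50, 55, 60, 65, 70, 75)
incidents <- c(20, 24, 23, 27, 30, 29, 33, 36)

# Pearson correlation: unitless, bounded between -1 and 1.
r <- cor(usage, incidents)

# Regression slope: change in incidents per one-unit change in usage.
slope <- unname(coef(lm(incidents ~ usage))["usage"])

# The slope is the correlation rescaled by the ratio of standard deviations.
slope_from_r <- r * sd(incidents) / sd(usage)
print(c(r = r, slope = slope, slope_from_r = slope_from_r))
```

The slope and the correlation coincide only when both variables have the same standard deviation, which is why the coefficient is best read in standard-deviation units.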

# Merge the datasets on 'year' and 'age_group'
merged_data <- left_join(activity_summary, incident_summary, by = c("year", "age_group"))
print(merged_data)

## # A tibble: 12 × 4
## # Groups:   year [3]
##     year age_group         avg_usage mean_occurrence
##    <dbl> <chr>                 <dbl>           <dbl>

# Calculate correlation
cor(merged_data$avg_usage, merged_data$mean_occurrence)

## [1] 0.7293801
# Visualize the relationship between avg_usage and mean_occurrence
ggplot(merged_data, aes(x = avg_usage, y = mean_occurrence, color = as.factor(year))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Scatter Plot of Average Usage vs. Mean Incident Percentage",
    x = "Average Usage",
    y = "Mean Incident Percentage",
    color = "Year"
  ) +
  theme_minimal()

To study the relationship between incident occurrence and internet usage in more depth, we fitted a linear regression with an interaction term. The interaction term tests whether the relationship between one independent variable (average internet usage) and the dependent variable (mean incident percentage) changes depending on the value of another independent variable (age group). In other words, we suspect that the relationship between internet usage and incident occurrence varies by age group, so it is valuable to see how age and internet activity together influence the incident rate. Regarding the model fit metrics:

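  • Residual standard error of 0.9596 (on 4 degrees of freedom) means the fitted values are typically within about one percentage point of the observed mean_occurrence values, which indicates a close fit.

  • Multiple R-squared of 0.9798 shows that the model explains ~98% of the variance in mean_occurrence. Adjusted R-squared of 0.9444 penalizes the number of predictors and is therefore lower, as expected. With 8 coefficients estimated from 12 observations, part of this high fit reflects the flexibility of the model, so overfitting is a concern.

  • F-statistic of 27.67 (on 7 and 4 degrees of freedom) with a p-value of 0.0031 indicates that the overall model is statistically significant.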
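
As a quick check on these metrics, the adjusted R-squared can be recomputed from the reported R-squared, the number of observations, and the number of predictors. The sketch below uses only the figures printed in the model summary. The variable names are ours.

```r
# Figures taken from the reported model summary.
r_squared <- 0.9798
n <- 12  # observations in merged_data
p <- 7   # estimated coefficients, excluding the intercept

# Residual degrees of freedom: observations minus all estimated coefficients.
residual_df <- n - p - 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / residual_df
print(residual_df)
print(adj_r_squared)
```

The result agrees with the reported value of 0.9444 up to rounding of the inputs, and the residual degrees of freedom match the 4 shown in the summary.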

model <- lm(mean_occurrence ~ avg_usage * age_group, data = merged_data)
summary(model)

## 
## Call:
## lm(formula = mean_occurrence ~ avg_usage * age_group, data = merged_data)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
##  0.77717  0.02421 -0.21935  0.27779 -1.42821 -0.03732  0.31945 -0.54977 
##        9       10       11       12 
##  0.65104  0.01312 -0.10011  0.27197 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                          -100.6721    24.8974  -4.043   0.0156 * 
## avg_usage                               1.8286     0.3971   4.604   0.0100 **
## age_group25 to 44 years                60.3610    26.6345   2.266   0.0861 . 
## age_group45 to 64 years                77.3999    25.6334   3.019   0.0392 * 
## age_group65 years and over             75.2339    26.0000   2.894   0.0444 * 
## avg_usage:age_group25 to 44 years      -0.8331     0.4319  -1.929   0.1260   
## avg_usage:age_group45 to 64 years      -1.0037     0.4205  -2.387   0.0754 . 
## avg_usage:age_group65 years and over   -0.7496     0.4629  -1.619   0.1807   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9596 on 4 degrees of freedom
## Multiple R-squared:  0.9798, Adjusted R-squared:  0.9444 
## F-statistic: 27.67 on 7 and 4 DF,  p-value: 0.003116

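Overall, this model supports a significant positive relationship between internet usage and incident occurrence for the baseline group (15 to 24 years old): the avg_usage coefficient is 1.83 (p = 0.010). The age-group main effects (significant for 45 to 64 years and 65 years and over, marginal for 25 to 44 years) are differences in intercept relative to the baseline, that is, differences in the predicted incident rate when average usage is zero. This is an extrapolation (the intercept itself is -100.67, an impossible percentage), so these coefficients should not be read as older groups experiencing higher incident rates overall. The interaction terms (avg_usage:age_group) measure how the slope of average usage in each age group differs from the baseline slope. All three estimates are negative, which points to a flatter slope in the older groups, but none is significant at the 0.05 level (p = 0.126, 0.075, and 0.181), so the evidence that the relationship varies by age group is suggestive rather than conclusive. With only 12 observations and 4 residual degrees of freedom, all of these estimates should be interpreted with caution.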
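
To make the interaction terms concrete, the usage slope implied for each age group is the baseline slope plus that group's interaction coefficient. The sketch below copies the estimates from the coefficient table. The object names are ours, and the results are point estimates that ignore the standard errors.

```r
# Estimates copied from the coefficient table of the fitted model.
base_slope <- 1.8286  # avg_usage: slope for the baseline group (15 to 24 years)
interaction <- c(
  "25 to 44 years"    = -0.8331,
  "45 to 64 years"    = -1.0037,
  "65 years and over" = -0.7496
)

# Group-specific slope = baseline slope + interaction coefficient.
group_slopes <- c("15 to 24 years" = base_slope, base_slope + interaction)
print(round(group_slopes, 2))
```

Every implied slope stays positive, so the point estimates suggest that incidents rise with usage in each age group, only less steeply outside the baseline group.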

Conclusion

Digital spaces have become an integral part of modern life, making cybersecurity a critical concern for individuals and organizations alike. This study highlights a strong association between average internet usage and the mean percentage of cybersecurity incidents, emphasizing the interconnectedness of online activity and vulnerability to digital threats.

Our analysis reveals that while cybersecurity issues affect individuals across all age groups, those aged 25–44 are particularly susceptible. Additionally, the findings indicate an increasing risk for older adults as they engage more actively in digital environments over time. Our results support a key theme in the literature review: security incidents affect everyone, but at varied rates and through different means. Beyond age, other behavioral and socioeconomic factors play a crucial role in influencing the likelihood of encountering security incidents.

These insights underline the necessity of developing targeted awareness and prevention strategies. Tailored approaches should address the specific online behaviors and risks faced by different demographic groups, ensuring that interventions are both relevant and effective.

© 2025. All rights Reserved.