Published on 2025-06-26T05:27:11Z
What is Pearson Correlation? Examples & Applications in Analytics
Pearson Correlation is a statistical measure that quantifies the linear relationship between two continuous variables. In analytics, it’s commonly used to understand how two metrics, such as pageviews and session duration, move together. The coefficient (r) ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.
This metric helps analysts identify feature associations, select variables for predictive models, and interpret A/B test outcomes. Interpreting r reliably assumes the variables are linearly related, approximately normally distributed (this matters most for significance tests on r), and free of extreme outliers. While powerful, Pearson correlation can be misleading for nonlinear relationships or skewed distributions, so it’s important to validate these assumptions before drawing conclusions.
Pearson correlation
Measure of linear relationship between two metrics, ranging from -1 to +1, indicating strength and direction in analytics data.
Why Pearson Correlation Matters in Analytics
Understanding how metrics relate is key to data-driven decision making. Pearson correlation provides a simple yet powerful way to assess linear relationships between continuous variables in web and product analytics.
-
Measuring linear relationships
Pearson correlation quantifies the strength and direction of a linear association between two metrics, for example, session duration and pageviews per user.
-
Typical use cases
Use cases include evaluating how changes in one metric (like average session duration) associate with another (like bounce rate) or identifying correlated features for predictive modeling.
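As a rough illustration of screening metrics for association, the sketch below computes r for every pair in a small set of daily metrics. The metric names, the values, and the helper pearson_r are all made up for this example; real inputs would come from your own analytics export.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r for two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical daily metrics; values are invented for illustration only
metrics = {
    "avg_session_duration": [120, 150, 90, 200, 170],
    "bounce_rate": [0.55, 0.48, 0.62, 0.35, 0.41],
    "pageviews_per_user": [2.1, 2.6, 1.8, 3.4, 3.0],
}

# One coefficient per unordered pair of metrics
pairs = {
    (a, b): pearson_r(metrics[a], metrics[b])
    for a, b in combinations(metrics, 2)
}

# List the strongest associations (by absolute value) first
for (a, b), r in sorted(pairs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Pairs with a large absolute r are candidates for a closer look. Two strongly correlated features often carry overlapping information in a predictive model.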
Calculating Pearson Correlation
Learn the formula behind Pearson correlation, the steps to compute it manually, and how to leverage code or spreadsheet tools for practical calculation.
-
Pearson formula
The coefficient r is calculated as the covariance of X and Y divided by the product of their standard deviations: r = cov(X, Y) / (σX · σY).
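The manual calculation can be sketched in plain Python as follows. This is a minimal illustration of the formula rather than a reference implementation; the function name pearson_r and the sample values are assumptions made up for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r = cov(X, Y) / (sigma_X * sigma_Y), computed step by step."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    # Step 1: means of each variable
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Step 2: sum of cross-deviations (numerator) and squared deviations.
    # The 1/n (or 1/(n-1)) factors of covariance and variance cancel in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    # Step 3: divide by the product of the two spreads
    return sxy / sqrt(sxx * syy)


# Made-up illustrative values
pageviews = [1, 2, 3, 4, 5]
session_duration = [2, 4, 5, 4, 5]
r = pearson_r(pageviews, session_duration)
print(f"Pearson r: {r:.3f}")  # prints "Pearson r: 0.775"
```

For these values the deviations give a cross-product sum of 6 and squared-deviation sums of 10 and 6, so r = 6 / sqrt(60), or about 0.775.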
-
Code example in Python
Use Python’s pandas library to compute correlation on exported analytics data.
- Python example:
import pandas as pd

# Assume df has 'pageviews' and 'session_duration' columns
r = df['pageviews'].corr(df['session_duration'])  # method='pearson' is the default
print('Pearson r:', r)
Implementing with GA4 and PlainSignal
Extract data from popular analytics platforms like GA4 and PlainSignal to calculate Pearson correlation in your preferred environment.
-
GA4 via BigQuery
Export GA4 data to BigQuery and execute SQL to compute correlation between metrics.
- BigQuery SQL:
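-- Assumes event_count and user_engagement_time exist as columns in the queried tables.
-- The raw GA4 export stores one row per event, so these values usually need to be
-- aggregated (for example per user or per day) before CORR is applied.
SELECT
  CORR(event_count, user_engagement_time) AS pearson_r
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250601';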
-
PlainSignal cookie-free data
Embed PlainSignal’s script to collect pageview and engagement metrics, then calculate correlation externally.
- Embedding PlainSignal script:
<link rel="preconnect" href="//eu.plainsignal.com/" crossorigin />
<script defer data-do="yourwebsitedomain.com" data-id="0GQV1xmtzQQ" data-api="//eu.plainsignal.com" src="//cdn.plainsignal.com/PlainSignal-min.js"></script>
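Once the data is collected, the external calculation might look like the sketch below. It assumes the metrics have been exported as CSV with one row per day. The column names (pageviews, avg_engagement_seconds), the values, and the helper pearson_r are hypothetical and do not reflect any documented PlainSignal export format.

```python
import csv
import io
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r for two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Stand-in for an exported file; column names and values are hypothetical.
# With a real file, use open("your_export.csv", newline="") instead.
export = io.StringIO(
    "date,pageviews,avg_engagement_seconds\n"
    "2025-06-01,120,35\n"
    "2025-06-02,180,48\n"
    "2025-06-03,150,40\n"
    "2025-06-04,210,52\n"
    "2025-06-05,90,30\n"
)

rows = list(csv.DictReader(export))
# CSV values arrive as strings, so convert before doing arithmetic
pageviews = [float(row["pageviews"]) for row in rows]
engagement = [float(row["avg_engagement_seconds"]) for row in rows]

r = pearson_r(pageviews, engagement)
print(f"Pearson r over {len(rows)} days: {r:.3f}")
```

Five rows are far too few for a reliable estimate; the sample only shows the mechanics of loading two columns and passing them to the calculation.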
Limitations and Best Practices
Pearson correlation is sensitive to its assumptions and data quality. Follow best practices to ensure accurate interpretation.
-
Assumptions to check
Data should be linearly related, approximately normally distributed, and free from significant outliers.
- Linearity:
Inspect scatter plots for a linear pattern.
- Normality:
Assess distributions using histograms or normality tests like Shapiro-Wilk.
- Outliers:
Identify and handle outliers, as they can heavily skew the correlation coefficient.
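To see why the outlier check matters, the sketch below compares r on a small sample before and after adding a single extreme point. The data and the helper pearson_r are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r for two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up sample with only a weak linear trend
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 2, 5, 3]
r_clean = pearson_r(x, y)

# The same sample plus one extreme observation far from the rest
r_outlier = pearson_r(x + [50], y + [60])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

In this made-up sample, one added point moves r from roughly 0.38 to above 0.99, which is why a scatter plot is worth inspecting before trusting the coefficient.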
-
Interpreting values
Values close to +1 or -1 indicate strong linear relationships; values near 0 suggest weak or no linear association.
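A value near 0 rules out a linear association only, not a relationship in general. The sketch below, using made-up data and a hypothetical helper pearson_r, contrasts a perfectly linear relationship with a perfectly quadratic one.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r for two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [-3, -2, -1, 0, 1, 2, 3]

# Perfect linear dependence: y = 2x + 1
r_linear = pearson_r(x, [2 * v + 1 for v in x])

# Perfect but nonlinear dependence: y = x^2, symmetric around x = 0
r_quadratic = pearson_r(x, [v ** 2 for v in x])

print(f"linear:    r = {r_linear:.2f}")
print(f"quadratic: r = {r_quadratic:.2f}")
```

The quadratic case yields r = 0 even though y is fully determined by x, because the positive and negative halves of the curve cancel out in the covariance.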