Python Package Installation Analysis May 2024

Project Overview

I’ve recently started a project to track Python package installations. The goal is a monthly report or newsletter covering trending packages and broader patterns in how packages are installed. Over time, this should give a clearer picture of the Python package landscape.

Data Source

For this project, I’m using the Google BigQuery public dataset. Read more about the dataset here. The current workflow is to run a query on Google Cloud, download the result as a CSV, and analyze it in Python in this GitHub repository.

May 2024: EDA Findings

Find the analysis notebook here

Since May was the first month I queried the data, there was no earlier month to compare against for month-over-month (MoM) trends. Instead, I started with an exploratory data analysis (EDA) of the dataset, looking at the distribution of package installs and at the most-installed packages.

The Dataset

Below is a sample of the dataset. It has two columns:

  • Project: the name of the package
  • Installs: the number of times the package was downloaded in May

Project     Installs
boto3       1,388,601,787
botocore      645,035,046
urllib3       533,921,148
requests      485,817,094
wheel         474,966,050
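
As a minimal sketch of the "download as CSV, analyze in Python" step, the two columns can be read with the standard library alone. The lowercase header names `project` and `installs`, the helper `load_installs`, and the inline sample are assumptions for illustration; the actual export and notebook may differ.

```python
import csv
import io


def load_installs(csv_file):
    """Read (project, installs) rows from an open CSV file into a dict."""
    installs = {}
    for row in csv.DictReader(csv_file):
        # Drop thousands separators in case the export kept them ("1,388,601,787").
        installs[row["project"]] = int(row["installs"].replace(",", ""))
    return installs


# Inline sample standing in for the downloaded file (assumed header names),
# using three rows from the table above.
sample_csv = io.StringIO(
    "project,installs\n"
    "boto3,1388601787\n"
    "botocore,645035046\n"
    "urllib3,533921148\n"
)
installs = load_installs(sample_csv)
print(installs["boto3"])
```

With a real file, the same function would be called as `load_installs(open(path, newline=""))`, where `path` points at the downloaded CSV.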

Summarizing the Data

I’ve split the projects into buckets based on the number of installs.

Bin                     projects_count  projects_installs  projects_installs_pct
(0, 1]                          11,043             11,043                  0.00%
(1, 10]                          9,507             38,229                  0.00%
(10, 100]                      241,675         11,623,831                  0.03%
(100, 1000]                    212,489         67,274,154                  0.17%
(1000, 10000]                   49,432        144,526,470                  0.36%
(10000, 100000]                 12,661        401,882,649                  1.01%
(100000, 1000000]                4,492      1,399,147,950                  3.52%
(1000000, 1388601787]            2,058     37,677,185,601                 94.90%

The table groups projects into bins by their number of installs in May. The columns are:

  • Bin: the range of installs per project, written as an interval that excludes its lower bound and includes its upper bound, such as (0, 1], (1, 10], or (10, 100].
  • projects_count: the number of projects in the bin. For instance, 11,043 projects fall in the (0, 1] range and 9,507 in the (1, 10] range.
  • projects_installs: the total number of installs across all projects in the bin. For example, projects in the (0, 1] range account for 11,043 installs, and those in the (1, 10] range for 38,229.
  • projects_installs_pct: the bin’s share of all installs. The (0, 1] and (1, 10] ranges each round to 0.00% of the total.
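
The bucketing described above can be sketched as follows. This is an illustration using only the standard library, not the code from the analysis notebook: the function name `summarize`, the constant `EDGES`, and the toy data are hypothetical, and the sketch assumes every project has at least one install.

```python
from bisect import bisect_left

# Upper edges of the right-closed bins: (0, 1], (1, 10], (10, 100], ...
EDGES = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]


def summarize(installs_by_project):
    """Return one row per bin: (label, projects_count, projects_installs, pct)."""
    counts = list(installs_by_project.values())
    uppers = list(EDGES)
    if max(counts) > EDGES[-1]:
        # The last bin is closed at the largest install count in the data.
        uppers.append(max(counts))
    lowers = [0] + uppers[:-1]

    projects_count = [0] * len(uppers)
    projects_installs = [0] * len(uppers)
    for n in counts:
        # bisect_left finds the first upper edge >= n, i.e. the bin (lower, upper].
        i = bisect_left(uppers, n)
        projects_count[i] += 1
        projects_installs[i] += n

    total = sum(counts)
    return [
        (f"({low}, {high}]", projects_count[i], projects_installs[i],
         100 * projects_installs[i] / total)
        for i, (low, high) in enumerate(zip(lowers, uppers))
    ]


# Toy data (hypothetical project names), not the real dataset.
toy = {"a": 1, "b": 7, "c": 10, "d": 250, "e": 2_000_000}
rows = summarize(toy)
for label, n_projects, n_installs, pct in rows:
    print(f"{label:<22} {n_projects:>3} {n_installs:>10,} {pct:6.2f}%")
```

Using `bisect_left` rather than `bisect_right` is what makes each bin include its upper bound, so a project with exactly 10 installs lands in (1, 10] and not in (10, 100].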

Concentration of Installations

Most projects fall into the lower ranges: about 87% of them have 1,000 installs or fewer. Total installs, however, are concentrated at the top. The 2,058 projects with more than 1,000,000 installs, roughly 0.4% of all projects, account for 94.90% of total installs.
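
The concentration figures can be rechecked directly from the summary table. The short sketch below copies the published bin values; the variable names are only illustrative.

```python
# (projects_count, projects_installs) per bin, copied from the summary table.
BINS = {
    "(0, 1]": (11_043, 11_043),
    "(1, 10]": (9_507, 38_229),
    "(10, 100]": (241_675, 11_623_831),
    "(100, 1000]": (212_489, 67_274_154),
    "(1000, 10000]": (49_432, 144_526_470),
    "(10000, 100000]": (12_661, 401_882_649),
    "(100000, 1000000]": (4_492, 1_399_147_950),
    "(1000000, 1388601787]": (2_058, 37_677_185_601),
}

total_projects = sum(projects for projects, _ in BINS.values())
total_installs = sum(installs for _, installs in BINS.values())

# Share of projects and of installs held by the top bin.
top_projects, top_installs = BINS["(1000000, 1388601787]"]
project_share = 100 * top_projects / total_projects
install_share = 100 * top_installs / total_installs

print(f"{total_projects:,} projects, {total_installs:,} installs")
print(f"Top bin: {project_share:.2f}% of projects, {install_share:.2f}% of installs")
```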

Long Tail Distribution

The distribution follows a long-tail pattern: a few projects with extremely high install counts make up most of the total, while the vast majority of projects have relatively few installs.

Pareto Analysis

Pareto analysis applies the Pareto principle, often called the 80/20 rule or the principle of factor sparsity, which states that for many outcomes roughly 80% of the consequences come from 20% of the causes. It is commonly used to prioritize efforts. Applied to this data, it shows how small a share of projects accounts for most installs.

[Chart: Pareto Analysis]

This chart shows what share of PyPI projects accounts for 80% of total installs. The x-axis represents the percentage of PyPI projects, plotted on a logarithmic scale ranging from 0.001% to 100%. The y-axis shows the cumulative percentage of total installs, from 0% to 100%.

Key insight from the chart:

Installs are distributed very unevenly across projects. A small fraction of packages (about 0.1%, or 507 packages) accounts for 80% of total installs. A few highly popular packages dominate the PyPI ecosystem, while the remaining 99.9% of projects together account for only 20% of installs.
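
The 80% cutoff behind this kind of chart can be sketched as a cumulative sum over projects ranked by installs. This is a standard-library illustration on toy numbers, not the notebook's code; `projects_for_share` is a hypothetical helper name.

```python
from itertools import accumulate


def projects_for_share(install_counts, share=0.80):
    """Smallest number of top projects whose installs reach `share` of the total."""
    ranked = sorted(install_counts, reverse=True)
    target = share * sum(ranked)
    # Walk the running total of the ranked counts until it reaches the target.
    for k, running in enumerate(accumulate(ranked), start=1):
        if running >= target:
            return k
    return len(ranked)


# Toy install counts, not the real dataset.
toy_counts = [60, 25, 10, 3, 1, 1]
k = projects_for_share(toy_counts)
print(f"{k} of {len(toy_counts)} projects ({100 * k / len(toy_counts):.1f}%) reach 80% of installs")
```

Run on the full list of per-project install counts, the same function would return the number of packages behind 80% of installs, which can then be divided by the total number of projects to get the percentage plotted on the x-axis.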


Stay tuned for more insights and detailed reports on Python package trends and behaviors. Your feedback and suggestions are always welcome as we continue to explore the fascinating world of Python packages.