4. How to fit a power law distribution#

import networkx as nx
import seaborn as sns

import numpy as np
from operator import itemgetter

import matplotlib.pyplot as plt

%matplotlib inline

4.1. The Powerlaw package#

We use the Python toolbox powerlaw, which implements a method proposed by Aaron Clauset and collaborators in this paper. The paper explains why fitting a power law distribution by linear regression on the logarithms of the data is not correct. A sounder approach is based on a maximum likelihood estimator.
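To make the maximum likelihood idea concrete, here is a minimal sketch for the continuous case, where the estimator has the closed form \(\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{min})\). The helper names `sample_power_law` and `mle_alpha` are ours and are not part of the powerlaw package, and the data are synthetic.

```python
import numpy as np


def sample_power_law(alpha, xmin, size, rng):
    """Draw continuous power-law samples by inverse-transform sampling."""
    u = rng.random(size)
    return xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))


def mle_alpha(data, xmin):
    """Continuous maximum likelihood estimate of alpha for the tail x >= xmin."""
    tail = np.asarray(data, dtype=float)
    tail = tail[tail >= xmin]
    n = tail.size
    alpha = 1.0 + n / np.sum(np.log(tail / xmin))
    sigma = (alpha - 1.0) / np.sqrt(n)  # approximate standard error
    return alpha, sigma


rng = np.random.default_rng(0)  # fixed seed, synthetic data only
samples = sample_power_law(alpha=2.5, xmin=1.0, size=50_000, rng=rng)
alpha_hat, sigma_hat = mle_alpha(samples, xmin=1.0)
print(f"alpha = {alpha_hat:.3f} +/- {sigma_hat:.3f}")
```

The estimate should land close to the value used to generate the samples. Degrees are discrete, so this continuous formula is only an approximation for network data.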

The package can be installed with pip: pip install powerlaw. Full documentation is available here. Several examples and a detailed description of the library have been published in a paper in PLOS ONE.

As stated by Clauset, Shalizi and Newman:

In practice, we can rarely, if ever, be certain that an observed quantity is drawn from a power-law distribution. The most we can say is that our observations are consistent with the hypothesis that \(x\) is drawn from a distribution of the form of \(p(x) \propto x^{-\alpha}\). In some cases we may also be able to rule out some other competing hypotheses.

import powerlaw as pwl

4.2. Analysis of ca-AstroPh#

We analyze the network file ‘ca-AstroPh’ from the SNAP repository. This is a co-authorship network, so it is undirected.

filepath = "./../datasets/ca-AstroPh.txt"

G = nx.Graph()

with open(filepath, "r") as fh:
    for line in fh:
        # skip blank lines and the "#" header comments of SNAP files
        if not line.strip() or line.startswith("#"):
            continue
        origin, dest = map(int, line.split()[:2])
        G.add_edge(origin, dest)

print("The network has", len(G), "nodes and", len(G.edges()), "links.")

The network has 18772 nodes and 198110 links.

4.2.1. Degree distribution#

Let’s plot the degree distribution of the network.

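It’s important to keep in mind the difference between a probability density function and a probability mass function. The degree is a discrete variable, so its exact distribution is a probability mass function: the fraction of nodes with each degree. A binned histogram instead estimates a density, whose value in each bin depends on the bin width.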
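The difference can be checked on a toy example. In this sketch (the `values` list is made up for illustration), the mass function sums to one over the observed values, while the density integrates to one over the bins.

```python
from collections import Counter

import numpy as np

values = [1, 1, 1, 2, 2, 3, 5, 8]  # toy degree sequence, not real data

# Probability mass function: fraction of observations equal to each value
counts = Counter(values)
pmf = {k: c / len(values) for k, c in sorted(counts.items())}

# Probability density estimate: counts divided by (sample size * bin width)
density, edges = np.histogram(values, bins=4, density=True)

pmf_total = sum(pmf.values())
density_integral = float(np.sum(density * np.diff(edges)))
print(pmf_total, density_integral)
```

Both totals equal one, but the individual density values change with the number of bins, whereas the mass function does not.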

from collections import Counter

deg = dict(G.degree()).values()
deg_distri = Counter(deg)

We plot the degree probability mass function.

x = []
y = []
for i in sorted(deg_distri):
    x.append(i)
    y.append(deg_distri[i] / len(G))

plt.figure(figsize=(10, 7))
plt.plot(x, y)

plt.xlabel("degree $k$", fontsize=18)
plt.ylabel("$P(k)$", fontsize=18)

plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.yscale("log")
plt.xscale("log")
plt.axis([1, 1000, 0.00001, 0.2])
plt.show()

../_images/283e473fd843c6ca990b5800328757102e86001ed6ca156fa3e498a5c2a037cb.png

Using matplotlib’s hist() function with density=True, we can plot an estimate of the probability density, choosing the number of bins.

plt.figure(figsize=(10, 7))

plt.hist(deg, bins=90, density=True, log=True, histtype="stepfilled")

plt.plot(x, y, "o", color="black")
plt.xscale("log")
plt.yscale("log")
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)

Text(0, 0.5, '$P(k)$')

../_images/abaf1a36b129dfd5c15cbfc37841d7e16951627b7d42f25f6eb98c99eda01594.png

The powerlaw package provides direct access to the probability density function. The pdf function returns a tuple with the bin edges and the estimated density in each bin.

degree = list(deg)

pwl_distri = pwl.pdf(degree, bins=90)

pwl_distri

(array([  1.        ,   6.58888889,  12.17777778,  17.76666667,
         23.35555556,  28.94444444,  34.53333333,  40.12222222,
         45.71111111,  51.3       ,  56.88888889,  62.47777778,
         68.06666667,  73.65555556,  79.24444444,  84.83333333,
         90.42222222,  96.01111111, 101.6       , 107.18888889,
        112.77777778, 118.36666667, 123.95555556, 129.54444444,
        135.13333333, 140.72222222, 146.31111111, 151.9       ,
        157.48888889, 163.07777778, 168.66666667, 174.25555556,
        179.84444444, 185.43333333, 191.02222222, 196.61111111,
        202.2       , 207.78888889, 213.37777778, 218.96666667,
        224.55555556, 230.14444444, 235.73333333, 241.32222222,
        246.91111111, 252.5       , 258.08888889, 263.67777778,
        269.26666667, 274.85555556, 280.44444444, 286.03333333,
        291.62222222, 297.21111111, 302.8       , 308.38888889,
        313.97777778, 319.56666667, 325.15555556, 330.74444444,
        336.33333333, 341.92222222, 347.51111111, 353.1       ,
        358.68888889, 364.27777778, 369.86666667, 375.45555556,
        381.04444444, 386.63333333, 392.22222222, 397.81111111,
        403.4       , 408.98888889, 414.57777778, 420.16666667,
        425.75555556, 431.34444444, 436.93333333, 442.52222222,
        448.11111111, 453.7       , 459.28888889, 464.87777778,
        470.46666667, 476.05555556, 481.64444444, 487.23333333,
        492.82222222, 498.41111111, 504.        ]),
 array([7.30022168e-02, 2.97289351e-02, 1.38874827e-02, 1.29915161e-02,
        7.35836420e-03, 7.44414824e-03, 6.11926142e-03, 4.47983313e-03,
        4.63233808e-03, 2.83087327e-03, 2.41148464e-03, 2.14460096e-03,
        1.61083361e-03, 1.57270737e-03, 1.25816590e-03, 1.07706626e-03,
        9.91282224e-04, 4.76577992e-04, 6.38614509e-04, 5.43298911e-04,
        4.76577992e-04, 3.33604594e-04, 4.00325513e-04, 3.62199274e-04,
        2.76415235e-04, 2.66883676e-04, 1.90631197e-04, 2.09694317e-04,
        1.52504957e-04, 7.62524787e-05, 1.62036517e-04, 6.67209189e-05,
        5.71893591e-05, 3.81262394e-05, 8.57840386e-05, 2.85946795e-05,
        0.00000000e+00, 8.57840386e-05, 3.81262394e-05, 6.67209189e-05,
        1.90631197e-05, 2.85946795e-05, 9.53155984e-06, 0.00000000e+00,
        1.90631197e-05, 3.81262394e-05, 0.00000000e+00, 9.53155984e-06,
        3.81262394e-05, 0.00000000e+00, 9.53155984e-06, 0.00000000e+00,
        1.90631197e-05, 0.00000000e+00, 0.00000000e+00, 9.53155984e-06,
        0.00000000e+00, 1.90631197e-05, 1.90631197e-05, 1.90631197e-05,
        0.00000000e+00, 0.00000000e+00, 1.90631197e-05, 0.00000000e+00,
        9.53155984e-06, 9.53155984e-06, 0.00000000e+00, 0.00000000e+00,
        9.53155984e-06, 9.53155984e-06, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 1.90631197e-05, 0.00000000e+00,
        9.53155984e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 9.53155984e-06]))

plt.figure(figsize=(10, 7))
plt.yscale("log")
plt.xscale("log")

plt.plot(x, y, "ro")

pwl.plot_pdf(degree, color="black", linewidth=2)

plt.xticks(fontsize=20)
plt.yticks(fontsize=22)

plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)

Text(0, 0.5, '$P(k)$')

../_images/ae34a54a98ce9dcf13b748274158e8498a6b925e7090490632d3ec6834351426.png

4.2.2. Linear binning#

plt.figure(figsize=(10, 7))
plt.yscale("log")
plt.xscale("log")
plt.plot(x, y, "ro")

pwl.plot_pdf(degree, linear_bins=True, color="black", linewidth=2)

plt.xticks(fontsize=20)
plt.yticks(fontsize=22)

plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)

Text(0, 0.5, '$P(k)$')

../_images/455ed14c0043e0b4d5db61958a6080d23fdcb7a39b670e9db1e5b6b0ebead453.png
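The two plots above differ only in how the bins are chosen: plot_pdf uses logarithmically spaced bins unless linear_bins=True is passed. A minimal sketch of logarithmic binning with NumPy follows, assuming positive data; `log_binned_density` is a hypothetical helper of ours, not a powerlaw function, and the data are synthetic.

```python
import numpy as np


def log_binned_density(data, n_bins=20):
    """Estimate a density on logarithmically spaced bins (positive data only)."""
    data = np.asarray(data, dtype=float)
    data = data[data > 0]
    edges = np.logspace(np.log10(data.min()), np.log10(data.max()), n_bins + 1)
    # Pin the outer edges so rounding cannot push a point out of range
    edges[0], edges[-1] = data.min(), data.max()
    density, edges = np.histogram(data, bins=edges, density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers
    return centers, density, edges


rng = np.random.default_rng(1)  # fixed seed, synthetic heavy-tailed data
toy = rng.pareto(1.5, size=5000) + 1.0
centers, density, edges = log_binned_density(toy)
```

Because the bins widen with the value, the sparse tail is averaged over more observations, which reduces the noise and the empty bins seen with linear binning.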

4.3. Parameter estimation#

The powerlaw library estimates the exponent \(\alpha\) and the minimum value \(x_{min}\) above which the power-law scaling holds.

degree

fit_function = pwl.Fit(degree)

Calculating best minimal value for power law fit
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

fit_function

<powerlaw.Fit at 0x7fbfcdbabb38>

fit_function.power_law

/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

<powerlaw.Power_Law at 0x7fbfcdbab780>

fit_function.power_law.alpha

4.543577046506554

fit_function.power_law.sigma

0.1993417830511849

fit_function.power_law.xmin

123.0
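The selection of \(x_{min}\) can be sketched as a scan: for each candidate value, fit \(\alpha\) on the tail by maximum likelihood and keep the candidate with the smallest Kolmogorov-Smirnov distance between the empirical and the fitted cumulative distributions. The code below is a simplified continuous-data illustration with hypothetical helper names (`fit_alpha`, `ks_distance`, `scan_xmin`); it is not the package's implementation.

```python
import numpy as np


def fit_alpha(tail, xmin):
    """Continuous maximum likelihood estimate of alpha on the tail."""
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical and the power-law CDF."""
    tail = np.sort(tail)
    n = tail.size
    model_cdf = 1.0 - (tail / xmin) ** (1.0 - alpha)
    upper = np.arange(1, n + 1) / n - model_cdf
    lower = model_cdf - np.arange(0, n) / n
    return float(max(upper.max(), lower.max()))


def scan_xmin(data, candidates, min_tail=10):
    """Return (xmin, alpha, D) for the candidate with the smallest D."""
    data = np.asarray(data, dtype=float)
    best = None
    for xmin in candidates:
        tail = data[data >= xmin]
        if tail.size < min_tail:  # too few points for a meaningful fit
            continue
        alpha = fit_alpha(tail, xmin)
        d = ks_distance(tail, xmin, alpha)
        if best is None or d < best[2]:
            best = (xmin, alpha, d)
    return best


rng = np.random.default_rng(3)  # fixed seed, synthetic data with alpha = 2.5
synthetic = (1.0 - rng.random(20_000)) ** (-1.0 / 1.5)
candidates = [1.0, 1.5, 2.0, 3.0, 5.0]
best_xmin, best_alpha, best_d = scan_xmin(synthetic, candidates)
print(best_xmin, round(best_alpha, 3), round(best_d, 4))
```

A large \(x_{min}\) leaves few points in the tail, which is why the standard error sigma grows as \(x_{min}\) increases.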

We now fix the minimum value of the scaling range to \(x_{min}=10\).

fit_function_fixmin = pwl.Fit(degree, xmin=10)

/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

fit_function_fixmin.xmin

10.0

fit_function_fixmin.power_law.alpha

/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

1.9475409247436344

fit_function_fixmin.power_law.sigma

0.009808681018591505

We can compare the two fits by looking at their Kolmogorov-Smirnov distances. Smaller distances correspond to better fits.

fit_function.power_law.D

0.028347190083988588

fit_function_fixmin.power_law.D

0.12754011660628156

4.4. Visualize distributions and fit#

fig = plt.figure(figsize=(10, 7))

plt.plot(x, y, "ro")

fig = pwl.plot_pdf([x for x in degree if x > 123], color="r", linewidth=2, label="data")

fit_function.power_law.plot_pdf(
    ax=fig, color="b", linestyle="-", linewidth=1, label="fit"
)

fig.legend(fontsize=22)

plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)

Text(0, 0.5, '$P(k)$')

../_images/82cfbb26c24d00b6e461890174ab8b4856b8b19cbe95ecbd89aa109af0da3c90.png

fig = plt.figure(figsize=(10, 7))

plt.plot(x, y, "ro")

fig = pwl.plot_pdf([x for x in degree if x > 10], color="r", linewidth=2, label="Data")

fit_function_fixmin.power_law.plot_pdf(
    ax=fig, color="b", linestyle="-", linewidth=1, label="Fit"
)

fig.legend(fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("Degree", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)

Text(0, 0.5, '$P(k)$')

../_images/96e8e03ddeff024f955df599c3c9cedde9a8cf7f75587419c27df5b5dbe72082.png

4.5. Comparing Candidate Distributions#

We can compare the goodness of fit of several distributions. Distributions other than a power-law can provide a better fit to the data.

fit_function.supported_distributions

{'power_law': powerlaw.Power_Law,
 'lognormal': powerlaw.Lognormal,
 'exponential': powerlaw.Exponential,
 'truncated_power_law': powerlaw.Truncated_Power_Law,
 'stretched_exponential': powerlaw.Stretched_Exponential,
 'lognormal_positive': powerlaw.Lognormal_Positive}

R, p = fit_function.distribution_compare(
    "power_law", "exponential", normalized_ratio=True
)

/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

R, p

(2.4450468427863137, 0.014483332945893693)

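R is the log-likelihood ratio between the two candidate distributions; with normalized_ratio=True it is divided by its standard deviation. R is positive if the data are more likely under the first distribution and negative if they are more likely under the second. p is the significance of that sign: a small p means the direction of R can be trusted, while a large p means that neither distribution is preferred. In the comparison above, R is positive with p of about 0.014, so the power law is favored over the exponential.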
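As an illustration of how such a pair (R, p) can be obtained, the sketch below computes a normalized log-likelihood ratio from pointwise log-likelihoods, with the p-value taken from a normal approximation. The function name and the synthetic data are our assumptions; the two candidate models are a continuous power law and a shifted exponential, both fitted by maximum likelihood.

```python
import math

import numpy as np


def normalized_loglikelihood_ratio(loglik_first, loglik_second):
    """Normalized log-likelihood ratio and two-sided p-value."""
    diff = np.asarray(loglik_first) - np.asarray(loglik_second)
    n = diff.size
    ratio = diff.sum()
    normalized = ratio / (diff.std() * math.sqrt(n))
    p_value = math.erfc(abs(normalized) / math.sqrt(2.0))
    return float(normalized), p_value


rng = np.random.default_rng(2)  # fixed seed, synthetic power-law data
xmin = 1.0
x = xmin * (1.0 - rng.random(5000)) ** (-1.0 / 1.5)  # alpha = 2.5

# Pointwise log-likelihood under the fitted power law
alpha_hat = 1.0 + x.size / np.sum(np.log(x / xmin))
loglik_power_law = np.log((alpha_hat - 1.0) / xmin) - alpha_hat * np.log(x / xmin)

# Pointwise log-likelihood under the fitted shifted exponential
rate_hat = 1.0 / (x.mean() - xmin)
loglik_exponential = np.log(rate_hat) - rate_hat * (x - xmin)

R_toy, p_toy = normalized_loglikelihood_ratio(loglik_power_law, loglik_exponential)
print(R_toy, p_toy)
```

Since the samples were generated from a power law, R_toy should come out positive, which favors the first distribution.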

R2, p2 = fit_function.distribution_compare(
    "power_law", "lognormal_positive", normalized_ratio=True
)

/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

R2, p2

(0.26521294560765446, 0.7908454227344833)

R3, p3 = fit_function.distribution_compare(
    "power_law", "truncated_power_law", normalized_ratio=True
)

Assuming nested distributions
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

R3, p3

(-0.46891429705144816, 0.5781526627750692)

R4, p4 = fit_function.distribution_compare(
    "power_law", "stretched_exponential", normalized_ratio=True
)

/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

R4, p4

(0.43974656259335, 0.6601206746049643)
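The four comparisons above can be summarized with a small helper. This is only a reading aid: `interpret_comparison` is a hypothetical function of ours, the 0.05 threshold is an assumed convention, and the numbers are copied from the outputs above.

```python
def interpret_comparison(first, second, R, p, threshold=0.05):
    """Turn a (R, p) pair into a short verdict string."""
    if p >= threshold:
        return f"{first} vs {second}: inconclusive (p = {p:.3g})"
    favored = first if R > 0 else second
    return f"{first} vs {second}: {favored} favored (p = {p:.3g})"


# (first, second, R, p) as returned by distribution_compare above
results = [
    ("power_law", "exponential", 2.4450468427863137, 0.014483332945893693),
    ("power_law", "lognormal_positive", 0.26521294560765446, 0.7908454227344833),
    ("power_law", "truncated_power_law", -0.46891429705144816, 0.5781526627750692),
    ("power_law", "stretched_exponential", 0.43974656259335, 0.6601206746049643),
]

verdicts = [interpret_comparison(*row) for row in results]
for verdict in verdicts:
    print(verdict)
```

With the automatically selected \(x_{min}\), only the exponential is rejected in favor of the power law; the other three comparisons are inconclusive.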

We now repeat the comparison for the fit with \(x_{min}=10\).

Here, the truncated power law gives the best fit among the candidates we compare. Even an exponential fits the data better than a pure power law.

R, p = fit_function_fixmin.distribution_compare(
    "power_law", "exponential", normalized_ratio=True
)

/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

R, p

(-9.83422144673739, 8.018539959420145e-23)

fig = plt.figure(figsize=(10, 7))

fig = pwl.plot_pdf([x for x in degree if x > 10], color="r", linewidth=2, label="Data")

fit_function_fixmin.exponential.plot_pdf(
    ax=fig, color="black", linestyle="-", linewidth=2, label="Fit"
)

fig.legend(fontsize=22)

plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)

Text(0, 0.5, '$P(k)$')

../_images/9ed6c0bfd52316694dcf80da2a56bd144b032db5c7a8ad52858edb82a8e98576.png

R3, p3 = fit_function_fixmin.distribution_compare(
    "power_law", "truncated_power_law", normalized_ratio=True
)
R3, p3

Assuming nested distributions
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

(-28.906628686152743, 0.0)

fig = plt.figure(figsize=(10, 7))

fig = pwl.plot_pdf([x for x in degree if x > 10], color="r", linewidth=2, label="Data")

fit_function_fixmin.truncated_power_law.plot_pdf(
    ax=fig, color="black", linestyle="-", linewidth=2, label="Fit"
)

fig.legend(fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("Degree", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)

Text(0, 0.5, '$P(k)$')

../_images/f4167b8a8ea263bd63513252878c0ab8a1699e294162cab0cf71aa47867264bd.png