4. How to fit a power law distribution#
import networkx as nx
import seaborn as sns
import numpy as np
from operator import itemgetter
%pylab inline
Populating the interactive namespace from numpy and matplotlib
4.1. The Powerlaw package#
We use the Python toolbox powerlaw that implements a method proposed by Aaron Clauset and collaborators in this paper. The paper explains why fitting a power law distribution using a linear regression of logarthim is not correct. A more sound approach is based on a Maximum Likelihood Estimator.
The package can be installed using pip
as pip install powerlaw
.
Full documentation is available here.
Several examples and a detailed description of the library has been published in a paper on PLOS ONE
.
As stated by Clauset, Shalizi and Newman:
In practice, we can rarely, if ever, be certain that an observed quantity is drawn from a power-law distribution. The most we can say is that our observations are consistent with the hypothesis that \(x\) is drawn from a distribution of the form of \(p(x) \propto x^{-\alpha}\). In some cases we may also be able to rule out some other competing hypotheses.
import powerlaw as pwl
4.2. Analysis of ca-AstroPh#
We analyze the network file ‘ca-AstroPh’ from the SNAP repository. This is a co-authorhip network, thus, it is undirected.
filepath = "./../datasets/ca-AstroPh.txt"
G = nx.Graph()
fh = open(filepath, "r")
for line in fh.readlines():
s = line.strip().split()
if s[0] != "#":
origin = int(s[0])
dest = int(s[1])
G.add_edge(origin, dest)
fh.close()
print("The network has", len(G), "nodes and", len(G.edges()), "links.")
The network has 18772 nodes and 198110 links.
4.2.1. Degree distribution#
Let’s plot the degree distribution of the network.
It’s important to keep in mind the difference between probability density function and probablity mass function.
from collections import Counter
deg = dict(G.degree()).values()
deg_distri = Counter(deg)
We plot the degree probability mass function.
x = []
y = []
for i in sorted(deg_distri):
x.append(i)
y.append(deg_distri[i] / len(G))
plt.figure(figsize=(10, 7))
plt.plot(x, y)
plt.xlabel("degree $k$", fontsize=18)
plt.ylabel("$P(k)$", fontsize=18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.yscale("log")
plt.xscale("log")
plt.axis([1, 1000, 0.00001, 0.2])
plt.show()
Using the ‘hist()’ function of matplotlib we can plot the probability density distribution, choosing the number of bins.
plt.figure(figsize=(10, 7))
plt.hist(deg, bins=90, density=True, log=True, histtype="stepfilled")
plt.plot(x, y, "black", "o")
plt.xscale("log")
plt.yscale("log")
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)
Text(0, 0.5, '$P(k)$')
The powerlaw package provides direct access to the probability density function.
degree = list(deg)
pwl_distri = pwl.pdf(degree, bins=90)
pwl_distri
(array([ 1. , 6.58888889, 12.17777778, 17.76666667,
23.35555556, 28.94444444, 34.53333333, 40.12222222,
45.71111111, 51.3 , 56.88888889, 62.47777778,
68.06666667, 73.65555556, 79.24444444, 84.83333333,
90.42222222, 96.01111111, 101.6 , 107.18888889,
112.77777778, 118.36666667, 123.95555556, 129.54444444,
135.13333333, 140.72222222, 146.31111111, 151.9 ,
157.48888889, 163.07777778, 168.66666667, 174.25555556,
179.84444444, 185.43333333, 191.02222222, 196.61111111,
202.2 , 207.78888889, 213.37777778, 218.96666667,
224.55555556, 230.14444444, 235.73333333, 241.32222222,
246.91111111, 252.5 , 258.08888889, 263.67777778,
269.26666667, 274.85555556, 280.44444444, 286.03333333,
291.62222222, 297.21111111, 302.8 , 308.38888889,
313.97777778, 319.56666667, 325.15555556, 330.74444444,
336.33333333, 341.92222222, 347.51111111, 353.1 ,
358.68888889, 364.27777778, 369.86666667, 375.45555556,
381.04444444, 386.63333333, 392.22222222, 397.81111111,
403.4 , 408.98888889, 414.57777778, 420.16666667,
425.75555556, 431.34444444, 436.93333333, 442.52222222,
448.11111111, 453.7 , 459.28888889, 464.87777778,
470.46666667, 476.05555556, 481.64444444, 487.23333333,
492.82222222, 498.41111111, 504. ]),
array([7.30022168e-02, 2.97289351e-02, 1.38874827e-02, 1.29915161e-02,
7.35836420e-03, 7.44414824e-03, 6.11926142e-03, 4.47983313e-03,
4.63233808e-03, 2.83087327e-03, 2.41148464e-03, 2.14460096e-03,
1.61083361e-03, 1.57270737e-03, 1.25816590e-03, 1.07706626e-03,
9.91282224e-04, 4.76577992e-04, 6.38614509e-04, 5.43298911e-04,
4.76577992e-04, 3.33604594e-04, 4.00325513e-04, 3.62199274e-04,
2.76415235e-04, 2.66883676e-04, 1.90631197e-04, 2.09694317e-04,
1.52504957e-04, 7.62524787e-05, 1.62036517e-04, 6.67209189e-05,
5.71893591e-05, 3.81262394e-05, 8.57840386e-05, 2.85946795e-05,
0.00000000e+00, 8.57840386e-05, 3.81262394e-05, 6.67209189e-05,
1.90631197e-05, 2.85946795e-05, 9.53155984e-06, 0.00000000e+00,
1.90631197e-05, 3.81262394e-05, 0.00000000e+00, 9.53155984e-06,
3.81262394e-05, 0.00000000e+00, 9.53155984e-06, 0.00000000e+00,
1.90631197e-05, 0.00000000e+00, 0.00000000e+00, 9.53155984e-06,
0.00000000e+00, 1.90631197e-05, 1.90631197e-05, 1.90631197e-05,
0.00000000e+00, 0.00000000e+00, 1.90631197e-05, 0.00000000e+00,
9.53155984e-06, 9.53155984e-06, 0.00000000e+00, 0.00000000e+00,
9.53155984e-06, 9.53155984e-06, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 1.90631197e-05, 0.00000000e+00,
9.53155984e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 9.53155984e-06]))
plt.figure(figsize=(10, 7))
plt.yscale("log")
plt.xscale("log")
plt.plot(x, y, "ro")
pwl.plot_pdf(degree, color="black", linewidth=2)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)
Text(0, 0.5, '$P(k)$')
4.2.2. Linear binning#
plt.figure(figsize=(10, 7))
plt.yscale("log")
plt.xscale("log")
plt.plot(x, y, "ro")
pwl.plot_pdf(degree, linear_bins=True, color="black", linewidth=2)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)
Text(0, 0.5, '$P(k)$')
4.3. Parameter estimation#
The library powerlaw allows to estimate the exponent \(\alpha\) and the minimum value for the scaling \(x_{min}\)
degree
[75,
31,
43,
2,
90,
42,
118,
30,
113,
3,
28,
8,
158,
3,
13,
77,
36,
257,
12,
134,
55,
171,
28,
86,
15,
274,
47,
91,
22,
150,
2,
12,
14,
6,
52,
67,
108,
27,
142,
30,
62,
7,
86,
47,
61,
8,
49,
132,
41,
12,
5,
32,
20,
50,
56,
18,
44,
81,
28,
9,
61,
27,
22,
84,
123,
120,
82,
59,
10,
109,
21,
32,
30,
2,
28,
132,
6,
15,
72,
3,
33,
2,
20,
20,
13,
103,
77,
18,
10,
12,
12,
93,
19,
93,
10,
19,
60,
21,
9,
8,
48,
222,
40,
10,
12,
8,
103,
5,
18,
154,
27,
66,
220,
24,
9,
152,
323,
115,
34,
271,
53,
4,
12,
336,
30,
297,
1,
89,
9,
108,
25,
175,
162,
1,
16,
99,
75,
46,
19,
73,
25,
28,
99,
9,
68,
39,
52,
292,
138,
22,
10,
26,
118,
63,
30,
169,
16,
66,
49,
67,
1,
86,
19,
8,
333,
143,
26,
32,
13,
1,
13,
100,
66,
16,
43,
31,
30,
27,
2,
19,
28,
1,
8,
18,
29,
20,
9,
30,
8,
2,
132,
8,
6,
4,
11,
88,
33,
128,
15,
9,
24,
178,
21,
16,
55,
12,
30,
57,
10,
31,
156,
66,
76,
25,
113,
6,
105,
19,
3,
73,
27,
16,
128,
58,
24,
53,
25,
21,
88,
30,
3,
41,
39,
8,
63,
136,
104,
14,
91,
85,
97,
8,
109,
45,
41,
47,
14,
62,
111,
22,
38,
28,
3,
185,
57,
362,
115,
9,
174,
17,
58,
66,
50,
22,
52,
51,
121,
117,
117,
44,
54,
55,
118,
46,
27,
74,
65,
9,
31,
80,
13,
112,
100,
73,
369,
60,
19,
74,
188,
166,
99,
152,
8,
73,
64,
74,
9,
71,
427,
14,
18,
98,
88,
420,
42,
213,
163,
38,
31,
107,
64,
102,
173,
9,
139,
60,
82,
95,
52,
54,
123,
23,
70,
281,
32,
504,
28,
64,
126,
38,
48,
80,
149,
16,
45,
57,
175,
24,
140,
153,
85,
96,
85,
40,
43,
18,
56,
149,
126,
133,
83,
117,
26,
71,
35,
41,
58,
59,
155,
222,
144,
65,
92,
43,
98,
122,
35,
153,
24,
229,
11,
87,
44,
19,
120,
93,
151,
52,
41,
52,
125,
92,
153,
109,
88,
41,
23,
213,
61,
128,
141,
112,
108,
49,
95,
102,
150,
31,
235,
43,
7,
81,
108,
9,
135,
109,
120,
52,
92,
45,
75,
93,
108,
51,
85,
28,
39,
46,
28,
61,
80,
93,
85,
52,
65,
41,
8,
69,
9,
123,
14,
37,
17,
15,
74,
65,
39,
74,
18,
118,
60,
42,
54,
29,
96,
42,
15,
69,
17,
39,
129,
35,
6,
18,
87,
59,
29,
39,
65,
25,
75,
55,
110,
28,
86,
12,
33,
43,
60,
25,
24,
41,
15,
14,
30,
72,
99,
41,
15,
16,
20,
13,
149,
27,
18,
13,
21,
13,
38,
29,
52,
30,
32,
26,
42,
60,
6,
64,
31,
25,
44,
4,
68,
5,
81,
31,
24,
88,
52,
19,
24,
42,
1,
4,
1,
8,
3,
2,
12,
2,
15,
1,
2,
2,
7,
2,
10,
12,
2,
16,
74,
6,
1,
2,
34,
2,
29,
3,
38,
36,
12,
2,
5,
13,
34,
3,
61,
3,
96,
3,
7,
26,
38,
44,
49,
18,
18,
58,
18,
5,
18,
7,
3,
43,
44,
4,
3,
23,
23,
104,
41,
34,
6,
5,
22,
30,
23,
10,
63,
32,
57,
123,
52,
79,
61,
89,
56,
49,
107,
78,
81,
53,
40,
69,
47,
108,
21,
142,
71,
39,
119,
19,
51,
108,
67,
7,
13,
47,
64,
80,
55,
23,
40,
39,
79,
58,
133,
41,
39,
108,
38,
29,
95,
54,
25,
75,
43,
54,
75,
46,
79,
86,
22,
93,
25,
117,
60,
83,
147,
77,
7,
2,
7,
4,
22,
11,
4,
16,
9,
39,
5,
16,
30,
67,
25,
4,
13,
1,
4,
56,
5,
17,
7,
12,
114,
22,
21,
5,
3,
11,
22,
13,
16,
4,
2,
2,
12,
3,
2,
15,
2,
5,
13,
33,
5,
5,
38,
3,
3,
42,
3,
2,
3,
5,
8,
21,
4,
17,
11,
18,
49,
41,
5,
37,
11,
3,
5,
7,
4,
112,
6,
4,
4,
4,
3,
22,
5,
3,
3,
3,
11,
7,
5,
3,
19,
19,
99,
8,
39,
31,
11,
41,
10,
14,
29,
11,
45,
14,
27,
24,
25,
29,
9,
34,
3,
139,
16,
30,
15,
10,
60,
76,
2,
3,
10,
31,
41,
15,
12,
1,
46,
34,
112,
21,
21,
73,
7,
74,
15,
2,
7,
17,
82,
5,
5,
38,
80,
96,
22,
8,
8,
70,
5,
13,
17,
3,
64,
7,
18,
6,
20,
52,
32,
16,
7,
1,
28,
6,
22,
36,
151,
74,
52,
52,
53,
26,
36,
11,
85,
10,
26,
65,
18,
14,
40,
1,
20,
13,
8,
27,
44,
2,
77,
116,
120,
105,
157,
2,
26,
42,
29,
65,
8,
136,
153,
11,
57,
166,
23,
176,
124,
111,
85,
36,
109,
52,
141,
61,
269,
36,
189,
4,
149,
33,
61,
3,
214,
83,
144,
188,
55,
54,
47,
84,
11,
202,
88,
110,
30,
20,
10,
61,
45,
62,
51,
47,
350,
52,
84,
9,
23,
6,
224,
106,
125,
26,
13,
65,
258,
134,
122,
34,
56,
126,
126,
66,
22,
28,
73,
139,
174,
148,
20,
49,
83,
200,
165,
129,
55,
50,
19,
37,
91,
69,
81,
45,
110,
68,
224,
57,
73,
180,
153,
110,
63,
173,
124,
141,
38,
93,
91,
131,
61,
64,
86,
64,
90,
145,
173,
24,
91,
107,
91,
56,
30,
56,
80,
32,
76,
162,
23,
72,
56,
135,
163,
230,
39,
161,
19,
44,
58,
1,
175,
96,
137,
64,
385,
76,
124,
61,
195,
135,
122,
15,
69,
77,
3,
81,
142,
26,
36,
73,
162,
104,
71,
159,
80,
196,
152,
18,
211,
114,
20,
111,
102,
22,
119,
105,
105,
85,
...]
fit_function = pwl.Fit(degree)
Calculating best minimal value for power law fit
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
fit_function
<powerlaw.Fit at 0x7fbfcdbabb38>
fit_function.power_law
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
<powerlaw.Power_Law at 0x7fbfcdbab780>
fit_function.power_law.alpha
4.543577046506554
fit_function.power_law.sigma
0.1993417830511849
fit_function.power_law.xmin
123.0
We fix the minimum value for the scaling \(x_{min}=10\)
fit_function_fixmin = pwl.Fit(degree, xmin=10)
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
fit_function_fixmin.xmin
10.0
fit_function_fixmin.power_law.alpha
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
1.9475409247436344
fit_function_fixmin.power_law.sigma
0.009808681018591505
We can look at the values of the Kolgomorov-Sminorv distance of the two fits to compare them. Smaller distances correspond to better fits.
fit_function.power_law.D
0.028347190083988588
fit_function_fixmin.power_law.D
0.12754011660628156
4.4. Visualize distributions and fit#
fig = plt.figure(figsize=(10, 7))
plt.plot(x, y, "ro")
fig = pwl.plot_pdf([x for x in degree if x > 123], color="r", linewidth=2, label="data")
fit_function.power_law.plot_pdf(
ax=fig, color="b", linestyle="-", linewidth=1, label="fit"
)
fig.legend(fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)
Text(0, 0.5, '$P(k)$')
fig = plt.figure(figsize=(10, 7))
plt.plot(x, y, "ro")
fig = pwl.plot_pdf([x for x in degree if x > 10], color="r", linewidth=2, label="Data")
fit_function_fixmin.power_law.plot_pdf(
ax=fig, color="b", linestyle="-", linewidth=1, label="Fit"
)
fig.legend(fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("Degree", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)
Text(0, 0.5, '$P(k)$')
4.5. Comparing Candidate Distributions#
We can compare the goodness of fit of several distributions. Distributions other than a power-law can provide a better fit to the data.
fit_function.supported_distributions
{'power_law': powerlaw.Power_Law,
'lognormal': powerlaw.Lognormal,
'exponential': powerlaw.Exponential,
'truncated_power_law': powerlaw.Truncated_Power_Law,
'stretched_exponential': powerlaw.Stretched_Exponential,
'lognormal_positive': powerlaw.Lognormal_Positive}
R, p = fit_function.distribution_compare(
"power_law", "exponential", normalized_ratio=True
)
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
R, p
(2.4450468427863137, 0.014483332945893693)
R is the loglikelihood ratio between the two candidate distributions. This number will be positive if the data is more likely in the first distribution, and negative if the data is more likely in the second distribution. The significance value for that direction is p.
R2, p2 = fit_function.distribution_compare(
"power_law", "lognormal_positive", normalized_ratio=True
)
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
R2, p2
(0.26521294560765446, 0.7908454227344833)
R3, p3 = fit_function.distribution_compare(
"power_law", "truncated_power_law", normalized_ratio=True
)
Assuming nested distributions
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
R3, p3
(-0.46891429705144816, 0.5781526627750692)
R4, p4 = fit_function.distribution_compare(
"power_law", "stretched_exponential", normalized_ratio=True
)
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
R4, p4
(0.43974656259335, 0.6601206746049643)
Analyze the power law with \(x_{min}=10\).
Here, the truncated power law is the best fit that explains the data. Even an exponential is a better fit to the data.
R, p = fit_function_fixmin.distribution_compare(
"power_law", "exponential", normalized_ratio=True
)
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
R, p
(-9.83422144673739, 8.018539959420145e-23)
fig = plt.figure(figsize=(10, 7))
fig = pwl.plot_pdf([x for x in degree if x > 10], color="r", linewidth=2, label="Data")
fit_function_fixmin.exponential.plot_pdf(
ax=fig, color="black", linestyle="-", linewidth=2, label="Fit"
)
fig.legend(fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("degree $k$", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)
Text(0, 0.5, '$P(k)$')
R3, p3 = fit_function_fixmin.distribution_compare(
"power_law", "truncated_power_law", normalized_ratio=True
)
R3, p3
Assuming nested distributions
/Users/Michele/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
(-28.906628686152743, 0.0)
fig = plt.figure(figsize=(10, 7))
fig = pwl.plot_pdf([x for x in degree if x > 10], color="r", linewidth=2, label="Data")
fit_function_fixmin.truncated_power_law.plot_pdf(
ax=fig, color="black", linestyle="-", linewidth=2, label="Fit"
)
fig.legend(fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel("Degree", fontsize=22)
plt.ylabel("$P(k)$", fontsize=22)
Text(0, 0.5, '$P(k)$')