On the Lavalette’s Nonlinear Zipf’s Law

Version February 2002

by Acad. Prof. Dr. Ioan-Iovitz Popescu






  The ranking law, established by the French biophysicist Daniel Lavalette (1996), states that the impact factor q of a set of N scientific journals, ordered by descending ranking number n, obeys the general relationship

$$q(n) = c \left[ \frac{N n}{N - n + 1} \right]^{-b}$$

with only two fitting parameters, namely the exponent b and the scaling constant c = q(1). When plotted on a double logarithmic diagram of log(q) versus log(n), the corresponding line deviates from straightness in a smooth, characteristic fashion (Fig. 1), hence the alternative names we propose: curved Zipf line and nonlinear Zipf’s law. Indeed, while holding better promise for various applications and theoretical investigations, this law is barely more complex than the well-known rank-frequency Zipf’s law

$$q(n) = c \, n^{-b}$$

with the exponent b = 1 in the original expression of George Kingsley Zipf (1949). Two adjustable corrections have been subsequently introduced by Benoit Mandelbrot (1954), namely a slight correction added to the power 1 and a number d added to the rank n, the modified law becoming

$$q(n) = c \, (n + d)^{-b}$$


with three fitting parameters, b, c, and d. Thus, whereas the role of the independent variable in Zipf’s and Mandelbrot’s laws is played by the descending ranking number n, in Lavalette’s law it is played by the ratio n/(N-n+1) between the descending and the ascending ranking numbers. Generally, q(n) can be any quantity used to order a set of occurrences, such as the frequency of natural or randomly generated words, the size of cities and other settlements, income size, citation frequency and impact factors, the frequency of access to web sites, the size of oil and other mineral deposits, earthquake magnitude, galactic intensity, and so on.
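
The three ranking laws compared above can be put side by side in a few lines of Python. This is only a minimal sketch: the function names and the sample values of N, c, and b are illustrative assumptions, not taken from the data discussed here.

```python
def zipf(n, c, b=1.0):
    """Zipf's law: q(n) = c * n**(-b); b = 1 is the original form."""
    return c * n ** (-b)


def mandelbrot(n, c, b, d):
    """Mandelbrot's modification: q(n) = c * (n + d)**(-b)."""
    return c * (n + d) ** (-b)


def lavalette(n, c, b, N):
    """Lavalette's law: q(n) = c * [N*n / (N - n + 1)]**(-b), for 1 <= n <= N."""
    return c * (N * n / (N - n + 1)) ** (-b)


# Illustrative values only (assumed, not fitted to any dataset).
N, c, b = 100, 1.0, 0.5
for n in (1, 10, 50, 90, 100):
    print(n, zipf(n, c, b), lavalette(n, c, b, N))
```

At n = 1 the Lavalette function returns c, in agreement with c = q(1), while toward n = N it falls well below the Zipf line, which is the curvature seen on a double logarithmic plot.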

     Zipf’s law was originally stated for linguistics, with N having in this case the meaning of the total number of different (distinct) words, or the vocabulary, of the considered text. Defining further the text length L as the total number of running words, the ratio p(n) = q(n)/L represents the probability of finding the word with rank n. Obviously, the probabilities p(n) must sum to 1, inasmuch as the frequencies q(n) sum to L. It follows that the simple hyperbolic Zipf’s law q = c/n cannot hold true when the lexicon increases indefinitely, since summing over the distribution gives a non-convergent series. Therefore, faster converging probability distributions have to be used to model Zipf-like distributions in this limit, such as the Riemann zeta function ζ(b), defined by the series

$$\zeta(b) = \sum_{n=1}^{\infty} n^{-b}$$

converging for b > 1 and diverging for b ≤ 1.
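
The convergence argument can be illustrated numerically with partial sums of the series of terms n^(-b). The sketch below is only an illustration; the helper name partial_sum and the chosen numbers of terms are assumptions made for this example.

```python
def partial_sum(b, terms):
    """Sum of n**(-b) for n = 1 .. terms."""
    return sum(n ** (-b) for n in range(1, terms + 1))


# b = 1 (hyperbolic Zipf law): the sum keeps growing, roughly like log(terms).
# b = 2: the sum settles near pi**2 / 6 = 1.6449...
for terms in (10, 1000, 100000):
    print(terms, partial_sum(1.0, terms), partial_sum(2.0, terms))
```

For b = 1 the partial sums grow without bound, so the probabilities cannot be normalized for an indefinitely increasing lexicon, whereas for b = 2 they stabilize.
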
    Generally, Zipf curves roughly follow a straight line with slope -b when the (q, n) dataset is plotted on a double logarithmic graph, except for the words at the low end (with the highest ranks), where the actual data drop off quite steeply. Likewise, the most frequent words (with the lowest ranks) do not necessarily fall off as fast as expected from the original Zipf’s law, under which the quantity q under study is proportional to 1, 1/2, 1/3, 1/4, and so on. In fact, on a double logarithmic scale, the actual distributions exhibit a more or less pronounced downward curvature instead of an ideal Zipfian linear fall. In particular, Fig. 2 reveals only a slightly curved rank-frequency Zipf plot for the 917 distinct words occurring in the text of the USA Constitution. This kind of shape is characteristic of any text and has contributed to the illusion of a general linear Zipf’s law. However, the nonlinearity always manifests itself, even in linguistics, and the fitting appears to be best accomplished with a single Lavalette function with two free parameters, b and c. Moreover, a still better fit may possibly be achieved by treating N as a third, fine-tuning parameter, in search of an optimum vocabulary size for the considered text length. In other words, we suggest that the fitting success of a well-behaved sample would rather indicate that an optimal corpus, in shape and size, has been reached, in spite of the observation (Powers, 1998) that, as the sample size increases, the number of new words tends to increase faster than that of repeated ones.
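
One simple way to carry out such a two-parameter fit is sketched below. It relies on the fact that Lavalette's law becomes a straight line, log q = log c - b log[N n/(N - n + 1)], in the transformed variable, so ordinary least squares can be applied. This is an assumed procedure for illustration, not necessarily the one used for the figures; the function name fit_lavalette is hypothetical, and all q values must be positive.

```python
import math


def fit_lavalette(q, N=None):
    """Least-squares fit of log q against log[N*n/(N-n+1)].

    q: positive values sorted in descending order (q[0] has rank n = 1).
    N: total number of items; defaults to len(q). Treating N as a third
       parameter would mean repeating this fit for several trial values of N.
    Returns the pair (b, c).
    """
    if N is None:
        N = len(q)
    xs = [math.log(N * n / (N - n + 1)) for n in range(1, len(q) + 1)]
    ys = [math.log(v) for v in q]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic check: data generated from the law itself (assumed b = 0.7, c = 3).
N = 50
synthetic = [3.0 * (N * n / (N - n + 1)) ** (-0.7) for n in range(1, N + 1)]
print(fit_lavalette(synthetic))
```

A fit in logarithmic variables weights relative rather than absolute deviations, which matches the way the curves are judged on a double logarithmic plot.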

    Interest in the deviations, explanations and applications of Zipf’s law has recently been revived, with partial success, by D.M.W. Powers (1998), J. Laherrere (1996, 1998), S. Redner (1998), C. Tsallis (2000) and others. Thus, the main result of the studies devoted to citations of publications (Redner) or to citations of authors (Laherrere) was that the stretched exponential
 

$$q(n) \propto \exp\left[ -\left( n / n_0 \right)^{b} \right]$$


fits the data reasonably well for relatively small n-values, while the required asymptotic behavior is the inverse power law $q(n) = c \, n^{-b}$, which can by no means be provided by the exponential. Considerably better results have recently been obtained (Tsallis and de Albuquerque) with a single function of the power-law type, namely

$$q(n) \propto \left[ 1 + (b - 1)^{-1} \lambda n \right]^{-b}$$


 

with the exponent b = 2.89 close to the previous one. The essential shapes of the above competing distributions are presented schematically in Fig. 3.
   Here we are concerned with Lavalette’s extension (1996) of Zipf’s law, because it gives the best agreement with actual data using a single two-parameter (b and c) fitting function over the entire range of the frequency count. This is illustrated below for the present (version October 2001) collection of 7557 scientific journals ordered by their average impact factor, the latter playing the role of frequency. The quality of the Lavalette fitting is demonstrated in Fig. 4 for 4 random disjoint subsets, corresponding to journal titles having the initial letter A, B, C, or D. None of the competing functions shown in Fig. 3 is better suited to fit these data than the Lavalette nonlinear distribution. Any subset of journals ranked by impact factor has a similar shape, as shown in Fig. 5 for journal initials pertaining to each letter of the alphabet, and in Fig. 6 for journals pertaining to the main scientific fields. Finally, a graphical illustration for earthquakes is submitted to the reader’s consideration in Figs. 7a and 7b. It appears therefrom that the earthquake moment magnitude plays the role of a natural ranking number and should be kept on the abscissa for fitting purposes. As is already traditional in Zipfian matters, an impressive list of applications always remains open. The same is true for the explanation, modeling and meaning of this mysterious law, which has represented a permanent intellectual challenge from the very beginning (Zipf, 1949) up to the present day (Troll, 1998). The reader may convince himself of the continuing topicality of this subject by simply using the proper key words in a web search.
Thus, for instance, a Google search for Zipf’s Law provides about 3,310 results, to be compared with Benford’s Law (1,580 results), but also with Einstein’s Law (52,200 results) or Newton’s Law (86,800 results), to say nothing of the topic of fractals (256,000) or of whole scientific fields such as mathematics (6,300,000), physics (7,440,000), chemistry (5,820,000), biology (6,550,000), medicine (12,900,000), economy (13,100,000), and engineering (19,700,000).
Finally, we shall discuss the related Benford’s law, also called the first digit law, first digit phenomenon, or leading digit phenomenon, inasmuch as its theory and applications have become most advanced in the last decade (Hill, 1996, 1998; Knuth, 1996; Matthews, 1999; McMinn, 2000). Indeed, numerous statistical data are governed by the Benford distribution, i.e. by the specific shape of the frequency graph leading to the Benford probability

$$p(n) = \log_{10}\left( 1 + \frac{1}{n} \right)$$

for any first (leading, initial) digit n from 1 to 9. Surprisingly, for instance, the probability that the first digit is “1” is p(1) = 0.301, and not the value 1/9 = 0.111 that would be expected if all digits were equally likely. The next most frequent digit is “2”, with p(2) = 0.176, and so on down to the least probable digit “9”, with p(9) = 0.046. Here is a table of the initial digit percentages as predicted by Benford’s law and as collected from a list of the first 500 Fibonacci numbers (or, nowadays, produced with a Fibonacci Calculator available on the Internet):

Initial Digit Percentages resulting from the first 500 Fibonacci numbers

First Digit                   1        2        3        4       5       6       7       8       9
Fibonacci Percentage          30.2 %   17.6 %   12.6 %   9.4 %   8.0 %   6.6 %   5.8 %   5.4 %   4.4 %
Predicted by Benford's Law    30.1 %   17.6 %   12.5 %   9.7 %   7.9 %   6.7 %   5.8 %   5.1 %   4.6 %
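
A comparison of this kind can be reproduced with a short script. The sketch below is only an illustration: it assumes that the sequence starts with 1, 1 (the starting convention is not stated here), and the function names are hypothetical, so individual counts may differ slightly from the tabulated percentages.

```python
import math
from collections import Counter


def benford(d):
    """Benford probability of leading digit d (1..9): log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)


def fibonacci(count):
    """First `count` Fibonacci numbers, assuming the sequence starts 1, 1."""
    a, b = 1, 1
    numbers = []
    for _ in range(count):
        numbers.append(a)
        a, b = b, a + b
    return numbers


def first_digit_shares(numbers):
    """Fraction of the numbers starting with each digit 1..9."""
    counts = Counter(int(str(x)[0]) for x in numbers)
    return {d: counts[d] / len(numbers) for d in range(1, 10)}


shares = first_digit_shares(fibonacci(500))
for d in range(1, 10):
    print(d, f"{100 * shares[d]:.1f} %", f"{100 * benford(d):.1f} %")
```

Python integers have arbitrary precision, so the roughly 100-digit values near the end of the list are handled exactly.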

The analysis has been extended to the second digit and so on, up to the general case of the nth digit following the first non-zero digit. As expected, the corresponding probability quickly approaches 1/10 as we proceed to less significant digits. Generally, the digits of all the numbers making up the Fibonacci sequence tend to conform to Benford's law. For an introduction to this matter, see Benford's Law in Probability and Statistics at http://www.mathpages.com/home/iprobabi.htm and McMinn, Benford's Law & the Fibonacci Numbers (2000).
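
The approach to 1/10 can be checked from the general significant-digit form of Benford's law, under which a number begins with the digit string m with probability log10(1 + 1/m). This formula is standard but is not written out in the text above, and the function name nth_digit_prob is an assumption made for this sketch.

```python
import math


def nth_digit_prob(position, d):
    """Benford probability that the digit at `position` (1 = leading) equals d."""
    if position == 1:
        # The leading digit cannot be zero.
        return math.log10(1 + 1 / d) if d >= 1 else 0.0
    # Sum over every possible string k of the preceding (position - 1) digits.
    lo = 10 ** (position - 2)
    hi = 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(lo, hi))


for position in (1, 2, 3):
    print(position, [round(nth_digit_prob(position, d), 4) for d in range(10)])
```

For the second digit this gives about 0.120 for “0” and 0.085 for “9”, and for the third digit every value already lies within about 0.002 of 1/10.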

Fig. 8 and the attached Excel file on significant-digit percentages, deduced from the occurrences in the first 500 Fibonacci numbers, are intended to illustrate the extraordinary connection between Benford’s law and the Fibonacci numbers, and to demonstrate the excellent agreement between predicted and actual significant-digit percentages beyond the first digit as well, that is, past “9”. Some details are reproduced in the following table, from the most common first two digits, “10”, up to the least likely, “99”.

Significant-Digit Percentages resulting from the first 500 Fibonacci numbers

First and Second Digits   10      20      30      40      50      60      70      80      90      Sum
Fibonacci Percentage      4 %     2 %     1.4 %   1.0 %   0.8 %   0.8 %   0.8 %   0.4 %   0.4 %   11.6 %

First and Second Digits   19      29      39      49      59      69      79      89      99      Sum
Fibonacci Percentage      2.2 %   1.4 %   1.2 %   0.8 %   0.6 %   0.6 %   0.4 %   0.6 %   0.4 %   8.2 %

From this table one may deduce, for example, that the second digit is most likely to be “0” (11.6 %) and least likely to be “9” (8.2 %). Indeed, the percentage for the second digit being “0” is equal to the sum of the percentages of the first two digits being “10”, “20”, …, “90”, which adds up to 11.6 %; similar reasoning for the second digit being “9” gives 8.2 %. But the most significant new fact is the suitability of a single Lavalette function to match quite well the whole range of frequencies (probabilities, percentages) predicted by Benford’s law and obtained from the Fibonacci number statistics, as shown in Fig. 8 for the first and second digits. Consequently and surprisingly, three apparently different branches can be interlinked: the Fibonacci sequence and golden mean, Benford’s law and leading digit phenomena, and finally, Lavalette’s nonlinear Zipf’s law.
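
The two additions described in this paragraph can be verified directly from the tabulated values. The short sketch below simply copies the percentages from the table; the variable and function names are illustrative.

```python
# Percentages of first-two-digit occurrences, copied from the table.
ENDING_IN_0 = {10: 4.0, 20: 2.0, 30: 1.4, 40: 1.0, 50: 0.8,
               60: 0.8, 70: 0.8, 80: 0.4, 90: 0.4}
ENDING_IN_9 = {19: 2.2, 29: 1.4, 39: 1.2, 49: 0.8, 59: 0.6,
               69: 0.6, 79: 0.4, 89: 0.6, 99: 0.4}


def second_digit_percentage(pairs):
    """Add the percentages of all two-digit beginnings sharing a second digit."""
    return round(sum(pairs.values()), 1)


print(second_digit_percentage(ENDING_IN_0))  # 11.6
print(second_digit_percentage(ENDING_IN_9))  # 8.2
```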

 
 
REFERENCES




 
Fig. 1. Illustrating the curved Zipf line shape by the normalized Lavalette ranking function $q/c = [N n/(N - n + 1)]^{-b}$ in terms of the descending ranking number n for a (negative) slope b = 1/2 and various total numbers N of items of the considered set.

 
 
Fig. 2. A slightly curved rank-frequency Zipf plot of 917 distinct word occurrences in the text of the USA Constitution. The corresponding Excel list of words is attached.

 
 
 
Fig. 3. Illustrating essential shapes of competing ranking distributions

 
 
Fig. 4. Illustrating the Lavalette ranking law and the Lavalette fitting for 4 random subsets (of journals with title initial letter A, B, C, or D) excerpted from the present collection (7557 journals) and ranked by average journal impact factors (JIF).
The corresponding Excel list is attached.
 

 
Fig. 5. Illustrating the Lavalette ranking law for 26 random subsets (of journals with title initial letter belonging to various letters of the alphabet) excerpted from the present collection (7557 journals) and ranked by average journal impact factors (JIF). The Lavalette shape is obvious, and the pleasure of fitting is left to the reader, together with the attached Excel list.
 

 
Fig. 6. Illustrating the Lavalette ranking law for 12 natural subsets (of journals belonging to various scientific fields) excerpted from the present collection (7557 journals) and ranked by average journal impact factors (JIF). The Lavalette shape is obvious, and the pleasure of fitting is left to the reader, together with the attached Excel list.
 

 
Fig. 7a. Ranking Roumanian earthquake moment magnitudes in Mw = (2/3) log Mo - 10.7 [where Mo is the scalar moment of the best double couple in dyne-cm, according to the Hanks and Kanamori formula (1979), http://neic.usgs.gov/neis/phase_data/mag_formulas.html]. Data obtained by courtesy of the Roumanian National Seismic Network, The National Institute of Earth Physics (NIEP), Bucharest-Magurele, Roumania. The corresponding Excel list of the Roumanian earthquakes is attached.
 

 
Fig. 7b. An interchange of X-Y axes in the preceding figure reveals the earthquake moment magnitude (Mw) as a natural rank and the (cumulative) ranking number as a frequency of earthquakes stronger than Mw. In contrast to Fig. 7a, the data exhibit in this case a Lavalette shaping.

 
Fig. 8. Illustrating the Lavalette single function fitting of the significant-digit percentages and, consequently, of the corresponding Benford distribution. The Excel list of the data is attached.