CODES - Equidistant Letter
Sequences - Statistical Science Paper
Statistical Science
1994, Vol. 9, No. 3, 429-438 (abridged)
Equidistant Letter Sequences in the
Book of Genesis
Doron Witztum, Eliyahu Rips and Yoav Rosenberg
Abstract. It has been noted that when the Book of Genesis
is written as two-dimensional arrays, equidistant letter sequences
spelling words with related meanings often appear in close proximity.
Quantitative tools for measuring this phenomenon are developed.
Randomization analysis shows that the effect is significant at the
level of 0.00002.
Key words and phrases: Genesis, equidistant letter
sequences, cylindrical representations, statistical analysis.
1. INTRODUCTION
The phenomenon discussed in this paper was first discovered several
decades ago by Rabbi Weissmandel [7]. He found some
interesting patterns in the Hebrew Pentateuch (the Five Books of Moses),
consisting of words or phrases expressed in the form of equidistant
letter sequences (ELS's)--that is, by selecting sequences of equally
spaced letters in the text.
As impressive as these seemed, there was no rigorous way of determining
if these occurrences were not merely due to the enormous quantity of
combinations of words and expressions that can be constructed by
searching out arithmetic progressions in the text. The purpose of the
research reported here is to study the phenomenon systematically. The
goal is to clarify whether the phenomenon in question is a real one,
that is, whether it can or cannot be explained purely on the basis of
fortuitous combinations.
The approach we have taken in this research can be illustrated by the
following example. Suppose we have a text written in a foreign language
that we do not understand. We are asked whether the text is meaningful
(in that foreign language) or meaningless. Of course, it is very
difficult to decide between these possibilities, since we do not
understand the language. Suppose now that we are equipped with a very
partial dictionary, which enables us to recognize a small portion of the
words in the text: "hammer" here and "chair" there,
and maybe even "umbrella" elsewhere. Can we now decide between
the two possibilities?
Not yet. But suppose now that, aided with the partial dictionary, we can
recognize in the text a pair of conceptually related words, like
"hammer" and "anvil." We check if there is a
tendency of their appearances in the text to be in "close
proximity." If the text is meaningless, we do not expect to see
such a tendency, since there is no reason for it to occur. Next, we
widen our check; we may identify some other pairs of conceptually
related words: like "chair" and "table," or
"rain" and "umbrella." Thus we have a sample of such
pairs, and we check the tendency of each pair to appear in close
proximity in the text. If the text is meaningless, there is no reason to
expect such a tendency. However, a strong tendency of such pairs to
appear in close proximity indicates that the text might be meaningful.
Note that even in an absolutely meaningful text we do not expect that,
deterministically, every such pair will show such tendency. Note also,
that we did not decode the foreign language of the text yet: we do not recognize
its syntax and we cannot read the text.
This is our approach in the research described in the paper. To test
whether the ELS's in a given text may contain "hidden
information," we write the text in the form of two-dimensional
arrays, and define the distance between ELS's according to the ordinary
two-dimensional Euclidean metric. Then we check whether ELS's
representing conceptually related words tend to appear in "close
proximity."
Suppose we are given a text, such as Genesis (G). Define an
equidistant letter sequence (ELS) as a sequence of letters in the text
whose positions, not counting spaces, form an arithmetic progression;
that is, the letters are found at the positions
n, n+d, n+2d, ... , n+(k-1)d.
We call d the skip, n the start
and k the length of the ELS. These three parameters
uniquely identify the ELS, which is denoted (n,d,k).
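The definition above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, positions are assumed to be 1-based, and negative skips are allowed because the text later refers to |d|.

```python
def els_letters(text, n, d, k):
    """Return the k letters at positions n, n+d, ..., n+(k-1)d, or None if out of range.

    Positions are 1-based (an assumption) and do not count spaces.
    """
    letters = [c for c in text if not c.isspace()]
    positions = [n + i * d for i in range(k)]
    if any(p < 1 or p > len(letters) for p in positions):
        return None
    return "".join(letters[p - 1] for p in positions)


def find_els(text, word, max_skip):
    """List every ELS (n, d, k) with 1 <= |d| <= max_skip that spells `word`."""
    letters = [c for c in text if not c.isspace()]
    k = len(word)
    hits = []
    skips = list(range(1, max_skip + 1)) + list(range(-max_skip, 0))
    for d in skips:
        for n in range(1, len(letters) + 1):
            last = n + (k - 1) * d
            # Both the first and last position must lie inside the text.
            if 1 <= last <= len(letters) and all(
                letters[n - 1 + i * d] == word[i] for i in range(k)
            ):
                hits.append((n, d, k))
    return hits
```

For instance, `find_els("a xbxcx dx", "abcd", 3)` finds the single ELS with start 1, skip 2 and length 4.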
Let us write the text as a two-dimensional array--that is, on a single
large page--with rows of equal length, except perhaps for the last row.
Usually, then, an ELS appears as a set of points on a straight line. The
exceptional cases are those where the ELS "crosses" one of the
vertical edges of the array and reappears on the opposite edge. To
include these cases in our framework, we may think of the two vertical
edges of the array as pasted together, with the end of the first line
pasted to the beginning of the second, the end of the second to the
beginning of the third and so on. We thus get a cylinder on which the
text spirals down in one long line.
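As a rough illustration of this layout, the sketch below maps the letters of an ELS to (row, column) cells of an array with a given row length. The function name and the 0-based cell indices are assumptions made for the example; the paper's actual distance measure is defined in Section A.1 and is not reproduced here.

```python
def els_cells(n, d, k, row_length):
    """(row, column) cells, 0-based, of the ELS (n, d, k) for rows of `row_length` letters.

    Letter positions are 1-based, so position p sits at divmod(p - 1, row_length).
    """
    return [divmod(n + i * d - 1, row_length) for i in range(k)]


# When the row length equals the skip, the ELS is a vertical line.
print(els_cells(3, 10, 4, 10))   # [(0, 2), (1, 2), (2, 2), (3, 2)]

# A different skip can cross the right edge and reappear on the left.
print(els_cells(9, 3, 3, 10))    # [(0, 8), (1, 1), (1, 4)]
```

In the second call the ELS leaves column 8 of row 0 and continues at column 1 of row 1, which is the "crossing" that gluing the vertical edges into a cylinder is meant to absorb.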
It has been noted that when Genesis is written in this way, ELS's
spelling out words with related meanings often appear in close
proximity. In Figure 1 we see the example of 'patish-פטיש'
(hammer) and 'sadan-סדן'
(anvil); in Figure 2, 'Zidkiyahu-צדקיהו'
(Zedekia) and 'Matanya-מתניה'
(Matanya), which was the original name of King Zedekia (Kings II,
24:17). In Figure 3 we see yet another example
of 'hachanuka-החנוכה'
(the Chanuka) and 'chashmonaee-חשמונאי'
(Hasmonean), recalling that the Hasmoneans were the priestly family that
led the revolt against the Syrians whose successful conclusion the
Chanuka feast celebrates.
Fig. 2.
Fig. 3.
Indeed, ELS's for short words, like those for 'patish-פטיש'
(hammer) and 'sadan-סדן'
(anvil), may be expected on general probability grounds to appear close
to each other quite often, in any text. In Genesis, though, the
phenomenon persists when one confines attention to the more
"noteworthy" ELS's, that is, those in which the skip |d|
is minimal over the whole text or over large parts of it. Thus
for 'patish-פטיש'
(hammer), there is no ELS with a smaller skip than that of Figure 1 in
all of Genesis; for 'sadan-סדן'
(anvil), there is none in a section of text comprising 71% of G;
the other four words are minimal over the whole text of G. On
the face of it, it is not clear whether or not this can be attributed to
chance. Here we develop a method for testing the significance of the
phenomenon according to accepted statistical principles. After making
certain choices of words to compare and ways to measure proximity, we
perform a randomization test and obtain a very small p-value,
that is, we find the results highly statistically significant.
2. OUTLINE OF THE PROCEDURE
In this section we describe the test in outline. In the Appendix,
sufficient details are provided to enable the reader to repeat the
computations precisely, and so to verify their correctness. The authors
will provide, upon request, at cost, diskettes containing the program
used and the texts G, I, R, T, U,
V and W (see Section 3).
We test the significance of the phenomenon on samples of pairs of
related words (such as hammer-anvil and Zedekia-Matanya). To do this we
must do the following:
(i) define the notion of "distance" between any two
words, so as to lend meaning to the idea of words in "close
proximity";
(ii) define statistics that express how close, "on the
whole," the words making up the sample pairs are to each other
(some kind of average over the whole sample);
(iii) choose a sample of pairs of related words on which to run the
test;
(iv) determine whether the statistics defined in (ii) are
"unusually small" for the chosen sample.
Task (i) has several components. First, we must define the notion of
"distance" between two given ELS's in a given array; for this
we use a convenient variant of the ordinary Euclidean distance. Second,
there are many ways of writing a text as a two-dimensional array,
depending on the row length; we must select one or more of these arrays
and somehow amalgamate the results (of course, the selection and/or
amalgamation must be carried out according to clearly stated, systematic
rules). Third, a given word may occur many times as an ELS in a text;
here again, a selection and amalgamation process is called for. Fourth,
we must correct for factors such as word length and composition. All
this is done in detail in Sections A.1 and A.2 of the Appendix.
We stress that our definition of distance is not unique. Although there
are certain general principles (like minimizing the skip d),
some of the details can be carried out in other ways. We feel that
varying these details is unlikely to affect the results substantially.
Be that as it may, we chose one particular definition and have used
only it throughout; that is, the function c(w,w')
described in Section A.2 of the Appendix had been defined before any
sample was chosen, and it underwent no changes. [Similar remarks apply
to choices made in carrying out task (ii).]
Next, we have task (ii), measuring the overall proximity of pairs of
words in the sample as a whole. For this, we used two different
statistics, P1 and P2,
which are defined and motivated in the Appendix (Section A.5).
Intuitively, each measures overall proximity in a different way. In each
case, a small value of Pi indicates that the words
in the sample pairs are, on the whole, close to each other. No other
statistics were ever calculated for the first, second or indeed
any sample.
In task (iii), identifying an appropriate sample of word pairs, we
strove for uniformity and objectivity with regard to the choice of pairs
and to the relation between their elements. Accordingly, our sample was
built from a list of personalities (p) and the dates (Hebrew
day and month) (p') of their death or birth. The personalities
were taken from the Encyclopedia of Great Men in Israel [5].
At first, the criterion for inclusion of a personality in the sample was
simply that his entry contain at least three columns of text and that a
date of birth or death be specified. This yielded 34 personalities (the
first list--Table 1). In order to avoid any conceivable
appearance of having fitted the tests to the data, it was later decided
to use a fresh sample, without changing anything else. This was done by
considering all personalities whose entries contain between 1.5 and 3
columns of text in the Encyclopedia; it yielded 32
personalities (the second list--Table 2). The significance test
was carried out on the second sample only.
Note that personality-date pairs (p,p') are not word
pairs. The personalities each have several appellations, there are
variations in spelling and there are different ways of designating
dates. Thus each personality-date pair (p,p')
corresponds to several word pairs (w,w'). The precise
method used to generate a sample of word pairs from a list of
personalities is explained in the Appendix (Section A.3).
The measures of proximity of word pairs (w,w') result
in statistics P1 and P2.
As explained in the Appendix (Section A.5), we also used a variant of
this method, which generates a smaller sample of word pairs from the
same list of personalities. We denote the statistics P1
and P2, when applied to this smaller
sample, by P3 and P4.
Finally, we come to task (iv), the significance test itself. It is so
simple and straightforward that we describe it in full immediately.
The second list contains 32 personalities. For each of the 32!
permutations π of these personalities, we
define the statistic P1^π
obtained by permuting the personalities in accordance with π,
so that Personality i is matched with the dates of Personality π(i).
The 32! numbers P1^π
are ordered, with possible ties, according to the usual order of the
real numbers. If the phenomenon under study were due to chance, it would
be just as likely that P1 occupies any one of the
32! places in this order as any other. Similarly for P2,
P3 and P4. This is our
null hypothesis.
To calculate significance levels, we chose 999,999 random permutations π
of the 32 personalities; the precise way in which this was done is
explained in the Appendix (Section A.6). Each of
these permutations π determines a statistic P1^π;
together with P1, we have thus 1,000,000 numbers.
Define the rank order of P1 among these
1,000,000 numbers as the number of P1^π
not exceeding P1; if P1 is tied
with other P1^π,
half of these others are considered to "exceed" P1.
Let r1 be the rank order of P1,
divided by 1,000,000; under the null hypothesis, r1
is the probability that P1 would rank as low as it
does. Define r2, r3
and r4 similarly (using the same
999,999 permutations in each case).
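The rank-order computation can be sketched as follows. This is a minimal illustration, not the authors' program: `statistic` is a hypothetical stand-in for any of the Pi, Python's `random` module replaces the generator described in Section A.6, and the tie rule reflects one reading of the text (the observed value counts itself, plus every strictly smaller permuted value, plus half of the tied ones).

```python
import random


def rank_order(observed, permuted):
    """Rank of `observed` among itself and the permuted statistics."""
    below = sum(1 for v in permuted if v < observed)
    ties = sum(1 for v in permuted if v == observed)
    return below + 1 + ties / 2


def randomization_level(statistic, people, dates, n_perm, seed=0):
    """Return r = rank order / (n_perm + 1) for a hypothetical statistic."""
    rng = random.Random(seed)
    observed = statistic(people, dates)
    permuted = []
    for _ in range(n_perm):
        shuffled = list(dates)
        rng.shuffle(shuffled)  # match person i with the date of another person
        permuted.append(statistic(people, shuffled))
    return rank_order(observed, permuted) / (n_perm + 1)


# Toy statistic (an assumption for the demo): total mismatch between paired numbers.
def toy_statistic(people, dates):
    return sum(abs(p - q) for p, q in zip(people, dates))


r = randomization_level(toy_statistic, [1, 2, 3, 4, 5], [1, 2, 3, 4, 5], n_perm=999)
print(r)  # small, because the true pairing minimises the toy statistic
```

With three smaller permuted values and no ties, `rank_order` returns 4, which agrees with the way the entry 4 for P4 is explained in Section 3.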
After calculating the probabilities r1
through r4, we must make an
overall decision to accept or reject the research hypothesis. In doing
this, we should avoid selecting favorable evidence only. For example,
suppose that r3 = 0.01, the other ri
being higher. There is then the temptation to consider r3
only, and so to reject the null hypothesis at the level of 0.01. But
this would be a mistake; with enough sufficiently diverse statistics, it
is quite likely that just by chance, some one of them will be low. The
correct question is, "Under the null hypothesis, what is the
probability that at least one of the four ri
would be less than or equal to 0.01?" Thus denoting the event
"ri <= 0.01"
by Ei, we must find the probability not of E3,
but of "E1 or E2 or
E3 or E4." If the Ei
were mutually exclusive, this probability would be 0.04; overlaps only
decrease the total probability, so that it is in any case less than or
equal to 0.04. Thus we can reject the null hypothesis at the level of
0.04, but not 0.01.
More generally, for any given δ, the
probability that at least one of the four numbers ri
is less than or equal to δ is at most 4δ.
This is known as the Bonferroni inequality. Thus the overall
significance level (or p-value), using all four statistics, is r0
:= 4 min ri.
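A small simulation can make the bound concrete. The sketch below assumes, purely for illustration, that the four ri are independent and uniform on (0, 1) under the null hypothesis; the statistics in the paper are not independent, and the Bonferroni bound holds regardless. The function names are hypothetical.

```python
import random


def prob_any_below(delta, n_stats=4, trials=200_000, seed=1):
    """Estimate P(min r_i <= delta) for n_stats independent uniform r_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if min(rng.random() for _ in range(n_stats)) <= delta:
            hits += 1
    return hits / trials


def bonferroni_level(r_values):
    """Overall level r0 = (number of statistics) * min r_i, as defined in the text."""
    return len(r_values) * min(r_values)


estimate = prob_any_below(0.01)
print(estimate)                                     # close to 1 - 0.99**4, below 0.04
print(bonferroni_level([0.20, 0.35, 0.01, 0.50]))   # the text's example: 4 * 0.01
```

Under independence the exact probability is 1 - 0.99^4, roughly 0.0394, which sits just under the bound 4δ = 0.04.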
3. RESULTS AND CONCLUSIONS
In Table 3, we list the rank order of each of the four Pi
among the 1,000,000 corresponding Pi^π.
Thus the entry 4 for P4 means that for
precisely 3 out of the 999,999 random permutations π,
the statistic P4^π
was smaller than P4 (none was equal). It
follows that min ri =
0.000004 so r0 = 4 min ri
= 0.000016.
The same calculations, using the same 999,999 random
permutations, were performed for control texts. Our first control text, R,
was obtained by permuting the letters of G randomly (for
details, see Section A.6 of the Appendix). After an
earlier version of this paper was distributed, one of the readers, a
prominent scientist, suggested using Tolstoy's War
and Peace as a control text. So we used text T, consisting of the
initial segment of the Hebrew translation of Tolstoy's War and Peace
[6], of the same length as G. Then we were
asked by a referee to perform a control experiment on some early Hebrew
text. He also suggested randomizing the words in two ways: over
the whole text and within each verse. Accordingly, we checked texts I,
U and W: text I is the Book of Isaiah [2];
W was obtained by permuting the words of G randomly; U
was obtained from G by randomly permuting the words within each
verse. In addition, we also produced text V by permuting the
verses of G randomly. (For details, see Section A.6 of the
Appendix.) Table 3 gives the results of these calculations, too. In the
case of I, min ri
is approximately 0.900; in the case of R it is 0.365; in the
case of T it is 0.277; in the case of U it is 0.276;
in the case of V it is 0.212; and in the case of W it
is 0.516. So in five cases r0 = 4
min ri exceeds 1, and in
the remaining case r0 = 0.847;
that is, the result is totally nonsignificant, as one would expect for
control texts.
We conclude that the proximity of ELS's with related meanings in the
Book of Genesis is not due to chance.

TABLE 3
Rank order of Pi among one million Pi^π

Text         P1         P2         P3         P4
G           453          5        570          4
R       619,140    681,451    364,859    573,861
T       748,183    363,481    580,307    277,103
I       899,830    932,868    929,840    946,261
W       883,770    516,098    900,642    630,269
U       321,071    275,741    488,949    491,116
V       211,777    519,115    410,746    591,503
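The figures quoted in this section can be rechecked directly from Table 3. The short script below is only an arithmetic check; the dictionary layout and the function name are choices made for this example.

```python
# Rank orders from Table 3, in the column order P1, P2, P3, P4.
RANKS = {
    "G": [453, 5, 570, 4],
    "R": [619_140, 681_451, 364_859, 573_861],
    "T": [748_183, 363_481, 580_307, 277_103],
    "I": [899_830, 932_868, 929_840, 946_261],
    "W": [883_770, 516_098, 900_642, 630_269],
    "U": [321_071, 275_741, 488_949, 491_116],
    "V": [211_777, 519_115, 410_746, 591_503],
}


def summarize(ranks, total=1_000_000):
    """Map each text to (min r_i, r0), where r_i = rank / total and r0 = 4 * min r_i."""
    result = {}
    for text, row in ranks.items():
        min_r = min(row) / total
        result[text] = (min_r, 4 * min_r)
    return result


summary = summarize(RANKS)
for text, (min_r, r0) in summary.items():
    print(f"{text}: min r = {min_r:.6f}, r0 = {r0:.6f}")
```

For G this gives min ri = 0.000004 and r0 = 0.000016. Among the six control texts, r0 exceeds 1 in five cases, and V gives approximately 0.847.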
APPENDIX: DETAILS OF THE PROCEDURE
In this Appendix we describe the procedure in sufficient detail to
enable the reader to repeat the computations precisely. Some motivation
for the various definitions is also provided.
In Section A.1, a "raw" measure of distance between words is
defined. Section A.2 explains how we normalize this raw measure to
correct for factors like the length of a word and its composition (the
relative frequency of the letters occurring in it). Section A.3 provides
the list of personalities p with their dates p' and
explains how the sample of word pairs (w, w') is
constructed from this list. Section A.4 identifies the precise text of
Genesis that we used. In Section A.5, we define and motivate the four
summary statistics P1, P2, P3
and P4. Finally, Section A.6 provides the
details of the randomization.
Sections A.1 and A.3 are relatively technical; to gain an understanding
of the process, it is perhaps best to read the other parts first.
A.3 The Sample of Word Pairs
The reader is referred to Section 2, task (iii), for a general
description of the two samples. As mentioned there, the significance
test was carried out only for the second list, set forth in Table 2.
Note that the personalities each may have several appellations (names),
and there are different ways of designating dates. The sample of word
pairs (w, w') was constructed by taking each name of
each personality and pairing it with each designation of that
personality's date. Thus when the dates are permuted, the total number
of word pairs in the sample may (and usually will) vary.
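The construction just described amounts to a cross product of names and date designations. The sketch below uses placeholder strings rather than real appellations or dates, and the function and argument names are hypothetical; it also shows why the number of word pairs changes when the dates are permuted.

```python
from itertools import product


def word_pairs(names_by_person, dates_by_person, matching):
    """Pair every name of each person with every date designation of the matched person.

    `matching[p]` is the person whose dates are assigned to p
    (the identity mapping gives the unpermuted sample).
    """
    pairs = []
    for person, names in names_by_person.items():
        dates = dates_by_person[matching[person]]
        pairs.extend(product(names, dates))
    return pairs


# Placeholder data (not taken from the paper's tables).
names = {"A": ["name_a1", "name_a2"], "B": ["name_b1"]}
dates = {"A": ["date_a1", "date_a2", "date_a3"], "B": ["date_b1"]}

original = word_pairs(names, dates, {"A": "A", "B": "B"})
swapped = word_pairs(names, dates, {"A": "B", "B": "A"})
print(len(original), len(swapped))  # 7 5
```

With the identity matching there are 2·3 + 1·1 = 7 pairs; after swapping the dates there are 2·1 + 1·3 = 5.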
We have used the following rules with regard to Hebrew spelling:
1. For words in Hebrew, we always chose what is called the grammatical
orthography--"ktiv dikduki." See the entry "ktiv"
in Even-Shoshan's dictionary [1].
2. Names and designations taken from the Pentateuch are spelled as
in the original.
3. Yiddish is written using Hebrew letters; thus, there was no need
to transliterate Yiddish names.
4. In transliterating foreign names into Hebrew, the letter "alef-א"
is often used as a mater lectionis; for example, "Luzzatto"
may be written "לוצטו"
or "לוצאטו."
In such cases we used both forms.
In designating dates, we used three fixed variations of the format of
the Hebrew date. For example, for the 19th of Tishri, we used י'ט תשרי,
בי'ט תשרי and י'ט בתשרי.
The 15th and 16th of any Hebrew month can be denoted as י'ה
or ט'ו and
י'ו or ט'ז,
respectively. We used both alternatives.
The list of appellations for each personality was provided by Professor
S. Z. Havlin, of the Department of Bibliography and Librarianship at Bar
Ilan University, on the basis of a computer search of the
"Responsa" database at that university.
Our method of rank ordering of ELS's based on (x, y, z)-perturbations
requires that words have at least five letters to apply the
perturbations. In addition, we found that for words with more than eight
letters, the number of (x, y, z)-perturbed
ELS's which actually exist for such words was too small to satisfy our
criteria for applying the corrected distance. Thus the words in our list
are restricted in length to the range 5-8. The resulting sample consists
of 298 word pairs (see Table 2).
A.4 The Text
We used the standard, generally accepted text of Genesis known as the Textus
Receptus. One widely available edition is that of the Koren
Publishing Company in Jerusalem. The Koren text is precisely the same as
that used by us.
A.6 The Randomizations
The 999,999 random permutations of the 32 personalities were chosen in
accordance with Algorithm P of Knuth [4],
page 125. The pseudorandom generator required as input to this algorithm
was that provided by Turbo Pascal 5.0 of Borland International, Inc. This, in
turn, requires a seed consisting of 32 binary bits; that is, an integer
with 32 digits when written to the base 2. To generate this seed, each
of three prominent scientists was asked to provide such an integer, just
before the calculation was carried out. The first of the three tossed a
coin 32 times; the other two used the parities of the digits in widely
separated blocks in the decimal expansion of π.
The three resulting integers were added modulo 2^32. The
resulting seed was 01001 10000 10011 11100 00101 00111 11.
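The seed construction can be illustrated as below. The three individual contributions are not given in the paper, so the values passed to `combine_seeds` here are hypothetical; only the final seed string is taken from the text. All function names are assumptions for the example.

```python
def bits_to_int(bits):
    """Read a string of binary digits (spaces ignored) as an integer."""
    return int(bits.replace(" ", ""), 2)


def parity_bits(digits):
    """Turn a string of decimal digits into bits: 1 for an odd digit, 0 for an even one."""
    return "".join(str(int(ch) % 2) for ch in digits)


def combine_seeds(contributions):
    """Add the contributed integers modulo 2**32."""
    return sum(contributions) % 2**32


# The seed reported in the text, as a 32-bit integer.
seed = bits_to_int("01001 10000 10011 11100 00101 00111 11")
print(format(seed, "032b"))

# Hypothetical contributions, showing the wrap-around of the modular sum.
print(combine_seeds([2**32 - 1, 1, 5]))  # 5
print(parity_bits("14159"))              # 10111
```

Adding modulo 2^32 keeps the result within 32 bits, and the sum stays unpredictable as long as at least one contribution is.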
The control text R was constructed by permuting the 78,064
letters of G with a single random permutation, generated as in
the previous paragraph. In this case, the seed was picked arbitrarily to
be the decimal integer 10 (i.e., the binary integer 1010). The control
text W was constructed by permuting the words of G in
exactly the same way and with the same seed, while leaving the letters
within each word unpermuted. The control text V was constructed
by permuting the verses of G in the same way and with the same
seed, while leaving the letters within each verse unpermuted.
The control text U was constructed by permuting the words
within each verse of G in the same way and with the same seed,
while leaving unpermuted the letters within each word, as well as the
verses. More precisely, the Algorithm P of Knuth [4]
that we used requires n - 1 random numbers to produce a random
permutation of n items. The pseudorandom generator of Borland
that we used produces, for each seed, a long string of random numbers.
Using the binary seed 1010, we produced such a long string. The first
six numbers in this string were used to produce a random permutation of
the seven words constituting the first verse of Genesis. The next
13 numbers (i.e., the 7th through the 19th random numbers in the string
produced by Borland) were used to produce a random permutation of the 14
words constituting the second verse of Genesis, and so on.
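The way one stream of random numbers is consumed verse by verse can be sketched as follows. This is an illustration under stated assumptions: the shuffle follows the usual statement of Knuth's Algorithm P (for j from n down to 2, exchange item j with a uniformly chosen item among the first j), and Python's `random.Random` stands in for the Turbo Pascal generator, so the permutations produced will not match those of the paper.

```python
import random


def algorithm_p(items, uniforms):
    """Shuffle a copy of `items`, drawing exactly len(items) - 1 numbers from `uniforms`.

    `uniforms` is an iterator yielding floats in [0, 1).
    """
    x = list(items)
    for j in range(len(x), 1, -1):       # j = n, n-1, ..., 2
        k = int(j * next(uniforms))      # 0-based index among the first j items
        x[k], x[j - 1] = x[j - 1], x[k]
    return x


def permute_within_verses(verses, seed):
    """Permute the words inside each verse, using one shared stream of random numbers."""
    rng = random.Random(seed)            # stand-in generator (assumption)
    stream = iter(rng.random, None)      # endless iterator of uniform numbers
    return [algorithm_p(verse, stream) for verse in verses]


# Two placeholder "verses" of 7 and 14 words consume 6 + 13 = 19 random numbers.
verses = [[f"w{i}" for i in range(7)], [f"v{i}" for i in range(14)]]
shuffled = permute_within_verses(verses, seed=10)
print(shuffled[0])
```

Because the stream is shared, the second verse starts with the 7th random number, matching the description above.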
REFERENCES
[1] EVEN-SHOSHAN, A. (1989). A New Dictionary of
the Hebrew Language. Kiriath Sefer, Jerusalem.
[2] FCAT (1986). The Book of Isaiah, file ISAIAH.MT.
Facility for Computer Analysis of Texts (FCAT) and Tools for Septuagint
Studies (CATSS), Univ. Pennsylvania, Philadelphia. (April 1986.)
[3] FELLER, W. (1966). An Introduction to
Probability Theory and Its Applications 2. Wiley,
New York.
[4] KNUTH, D. E. (1969). The Art of Computer
Programming 2. Addison-Wesley, Reading, MA.
[5] MARGALIOTH, M., ed. (1961). Encyclopedia of
Great Men in Israel; a Bibliographical Dictionary of Jewish Sages and
Scholars from the 9th to the End of the 18th Century 1-4.
Joshua Chachik, Tel Aviv.
[6] TOLSTOY, L. N. (1953) War and Peace.
Hebrew translation by L. Goldberg, Sifriat Poalim, Merhavia.
[7] WEISSMANDEL, H. M. D. (1958). Torath Hemed.
Yeshivath Mt. Kisco, Mt. Kisco.