Program cross-entropy version 1.3.2
Initial revision 2002-08-29; Last revision 2003-08-22
1 Download
2 File readme
3 Usage and options summary
4 Description
5 Project revision history
6 License
1 Download
Sources: src/cross-entropy-1.3.2.tgz [37 Kb]
Win9x-EXE (MinGW cross-compiled): mingw/cross-entropy.zip [28 Kb]
2 File readme
cross-entropy --- calculation of relative entropy and R-measure
SUPPORTED ENVIRONMENTS
http://www.gnu.org GNU/Linux
http://www.mingw.org MinGW --- Minimalist GNU For Windows
COMPILATION
To compile the cross-entropy program, run (GNU) make in this
directory.
The executable is placed in the ./bin directory.
BRIEF INSTRUCTION
This program calculates the relative entropy of one file with respect
to another. It uses suffix arrays, so the maximum memory overhead is
four times the size of the text itself. It uses a Markov model of a
given order to estimate the relative entropy. Be aware that the larger
the order you request, the less precise the estimate.
You can read about estimates of relative entropy in the paper
Kukushkina O.V., Polikarpov A.A., Khmelev D.V., Using Literal and
Grammatical Statistics for Authorship Attribution (in Russian)
[Opredeleniye avtorstva teksta s ispol'zovaniyem bukvennoi i
grammaticheskoi informacii]//Problemy Peredachi Informatsii, 2001,
vol.37, no.2, pp.96-108. Translated in «Problems of Information
Transmission», 2001, vol.37, no.2, pp. 172-184.
http://www.math.toronto.edu/dkhmelev/PAPERS/published/gramcodes/gramcodeseng.html
This program is also able to compute the relative R-measure of one
text with respect to another. A more detailed description of the
R-measure (as well as examples of applying this program to Markov
chain computation) can be found in the following paper:
Khmelev D., Teahan W. A repetition based measure for
verification of text collections and for text categorization.
SIGIR'2003, July 28--August 1, 2003, Toronto, Canada.
http://www.math.toronto.edu/dkhmelev/PAPERS/published/2003/sigir/sig.html
License conditions are described in the file LICENSE.txt.
BUGS
Versions 1.2.3 and earlier printed R()=nan on large files (about 400
Kb). Please update cross-entropy to version >=1.3.0.
3 Usage and options summary
user@computer$ ./bin/cross-entropy --help
Usage: cross-entropy [OPTION]... FILE1 FILE2
-a, --array suffix array for FILE1 (default FILE1.ary)
-b, --barray suffix array for FILE2 (default FILE2.ary)
-c, --checkonly check the correctness of both
suffix arrays and exit
-s, --self-suffix-array Ignore FILE1.ary and FILE2.ary and calculate
suffix arrays itself to do computations
-o, --order <number> set the order to be equal to <number> (default 2)
-l, --low-order <number> calculate starting from the low order <number>
-t, --top-order <number> up to the top order <number>
-r, --repetition-index calculate R-index
-q, --quiet output entropies only
-h, --help display this help and exit
-m, --man display description
-v, --version display version and exit
4 Description
user@computer$ ./bin/cross-entropy --man
This program calculates the relative entropy of the given order(s) of
FILE2 w.r.t. FILE1. By default it expects that suffix arrays for both
files have already been constructed and are stored in FILE1.ary and
FILE2.ary, respectively. If either suffix array is missing,
cross-entropy constructs it on its own, which increases the
computation time significantly and the memory used by a factor of 5
(this behaviour may be forced with the --self-suffix-array flag).
Suffix arrays are expected in big-endian format, 4 bytes per integer,
as produced by the mksary program by Satoru Takabayashi.
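The on-disk layout described above can be illustrated with a short
Python sketch. It is not taken from the cross-entropy or mksary
sources. It assumes that a .ary file is nothing more than a sequence of
byte offsets into the text, one per suffix, in lexicographic order of
the suffixes, with no header; treating the integers as unsigned is also
an assumption (for files under 2 GB the bytes are the same either way).
The function names are hypothetical.

```python
import struct


def build_suffix_array(data):
    """Return the start offsets of all suffixes of `data`, sorted
    lexicographically. Naive construction, suitable for small tests only."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def pack_suffix_array(offsets):
    """Encode offsets as consecutive big-endian unsigned 32-bit integers
    (assumed layout: no header, 4 bytes per entry)."""
    return struct.pack(">%dI" % len(offsets), *offsets)


def unpack_suffix_array(raw):
    """Decode a byte string in the assumed layout back into offsets."""
    if len(raw) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    return list(struct.unpack(">%dI" % (len(raw) // 4), raw))


if __name__ == "__main__":
    text = b"banana"
    raw = pack_suffix_array(build_suffix_array(text))
    print(len(raw), unpack_suffix_array(raw))
```

For real data the arrays should still be produced by mksary; the sketch
is only meant to make the byte layout concrete.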
As a test example, one can calculate the relative entropy of a file
containing only the string 'aaaaabaaaaac' w.r.t. itself.
Its relative entropy is
-(8/11 ln(4/5) - 2/11 ln(10)) = 0.5809380541
The same quantity without normalization is
2 ln(10) - 8 ln(4/5) = 6.390318596 = 11*0.5809380541,
where ln() is the natural logarithm (base e=2.7182818284...).
Another example: let test1.txt contain 'aaaaabaaaaac' and test2.txt
contain 'aaaaab'. Then H_2(test2.txt|test1.txt) is 0.639032.
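The two worked examples above can be reproduced by direct counting. The
Python sketch below is an illustration, not the program's suffix-array
implementation. It assumes that "order 2" means a model over pairs of
adjacent characters (each character conditioned on the one before it),
which is the reading that matches the numbers in the examples: the
string 'aaaaabaaaaac' has 11 overlapping pairs, 'aa' occurs 8 times
with P(a|a)=8/10, and 'ab' and 'ac' occur once each with probability
1/10. The function name and the error raised for unseen n-grams are
hypothetical; how cross-entropy itself treats unseen n-grams is not
stated here.

```python
import math
from collections import Counter


def relative_entropy(model_text, test_text, order=2, normalize=True):
    """Estimate H_order(test_text | model_text) by plain n-gram counting.

    Assumption: an n-gram of length `order` is scored as
    -ln(count(n-gram) / count(its first order-1 characters)),
    with both counts taken from model_text.
    """
    ctx_len = order - 1
    positions = range(len(model_text) - order + 1)
    ngrams = Counter(model_text[i:i + order] for i in positions)
    contexts = Counter(model_text[i:i + ctx_len] for i in positions)

    n = len(test_text) - order + 1
    if n <= 0:
        raise ValueError("test text is shorter than the order")
    total = 0.0
    for i in range(n):
        gram = test_text[i:i + order]
        if ngrams[gram] == 0:
            # Unseen n-gram: the estimate is undefined in this sketch.
            raise ValueError("n-gram %r does not occur in the model text" % gram)
        total -= math.log(ngrams[gram] / contexts[gram[:ctx_len]])
    return total / n if normalize else total


if __name__ == "__main__":
    test1 = "aaaaabaaaaac"
    test2 = "aaaaab"
    print("%.6f" % relative_entropy(test1, test1))                   # 0.580938
    print("%.6f" % relative_entropy(test1, test1, normalize=False))  # 6.390319
    print("%.6f" % relative_entropy(test1, test2))                   # 0.639032
```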
Option -r computes the so-called R-index (see the description in the
papers by Khmelev & Teahan). It applies only when suffix arrays were
precomputed. Examples:
R(test1.txt|test2.txt)=0.679366
R(test2.txt|test1.txt)=1.000000
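The two values above can be reproduced with the brute-force Python
sketch below. The formula in it is an assumption inferred from these
two examples, not taken from the program source: for every suffix of
the text, take the length of its longest prefix that occurs somewhere
in the reference, sum these lengths, divide by l(l+1)/2 where l is the
length of the text, and take the square root. See the paper by Khmelev
and Teahan for the authoritative definition. The function name is
hypothetical, and the sketch uses substring search instead of suffix
arrays.

```python
import math


def r_index(text, reference):
    """Assumed R(text | reference): square root of the normalized sum of
    longest-match lengths over all suffixes of `text`."""
    length = len(text)
    if length == 0:
        raise ValueError("text must not be empty")
    total = 0
    for i in range(length):
        k = 0
        # Extend the match while the prefix of length k+1 occurs in reference.
        while i + k < length and text[i:i + k + 1] in reference:
            k += 1
        total += k
    return math.sqrt(2.0 * total / (length * (length + 1)))


if __name__ == "__main__":
    test1 = "aaaaabaaaaac"
    test2 = "aaaaab"
    print("R(test1|test2)=%.6f" % r_index(test1, test2))  # 0.679366
    print("R(test2|test1)=%.6f" % r_index(test2, test1))  # 1.000000
```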
5 Project revision history
Files of the project were modified on the following dates:
2002-08-29
2002-08-30
2002-09-13
2003-05-16
2003-08-22
6 License
cross-entropy - calculation of relative entropy and R-measure
Available at http://www.math.toronto.edu/dkhmelev/PROGS/tacu/
Author:
Dmitry V. Khmelev
dkhmelev((at))math.toronto.edu
[change ((at)) to @ in order to get proper address - antispam]
University of Toronto,
Department of Mathematics,
100 St George Street,
M5S 3G3 ON,
Canada
LICENSING TERMS
This program is granted free of charge for research and education
purposes. However you must obtain a license from the author to use it
for commercial purposes.
Scientific results produced using the software provided shall
acknowledge the use of cross-entropy. The proper references are:
D. Khmelev, Text Analysis and Conversion Utilities
http://www.math.toronto.edu/dkhmelev/PROGS/tacu/
Kukushkina O.V., Polikarpov A.A., Khmelev D.V., Using Literal and
Grammatical Statistics for Authorship Attribution (in Russian)
[Opredeleniye avtorstva teksta s ispol'zovaniyem bukvennoi i
grammaticheskoi informacii]//Problemy Peredachi Informatsii, 2001,
vol.37, no.2, pp.96-108. Translated in «Problems of Information
Transmission», 2001, vol.37, no.2, pp. 172-184.
http://www.math.toronto.edu/dkhmelev/PAPERS/published/gramcodes/gramcodeseng.html
Moreover, the author of cross-entropy shall be informed about the
publication.
The software must not be modified and distributed without prior
permission of the author.
By using cross-entropy you agree to the licensing terms.
NO WARRANTY
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF
THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO
LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY
OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED
OF THE POSSIBILITY OF SUCH DAMAGES.