Program cross-entropy version 1.3.2

Initial revision 2002-08-29; Last revision 2003-08-22

1  Download
2  File readme
3  Usage and options summary
4  Description
5  Project revision history
6  License

1  Download

Sources: src/cross-entropy-1.3.2.tgz [37 Kb ]

Win9x-EXE (minGW cross-compiled): mingw/cross-entropy.zip [28 Kb ]

2  File readme

cross-entropy --- calculation of relative entropy and R-measure


SUPPORTED ENVIRONMENTS

http://www.gnu.org    GNU/Linux 
http://www.mingw.org  MinGW --- Minimalist GNU For Windows


COMPILATION

To compile the cross-entropy program, run (GNU) make in this
directory.

The resulting executable is placed in the ./bin directory.


BRIEF INSTRUCTION

This program calculates the relative entropy of one file with respect
to another. It uses suffix arrays, so the maximal memory overhead is
four times the size of the text itself. It uses a Markov model of a
given order to estimate the relative entropy. Be aware that the larger
the order you request, the less precise the estimate.

You can read about estimates of relative entropy in the paper

Kukushkina O.V., Polikarpov A.A., Khmelev D.V., Using Literal and
Grammatical Statistics for Authorship Attribution (in Russian)
[Opredeleniye avtorstva teksta s ispol'zovaniyem bukvennoi i
grammaticheskoi informacii]//Problemy Peredachi Informatsii, 2001,
vol.37, no.2, pp.96-108. Translated in «Problems of Information
Transmission», 2001, vol.37, no.2, pp. 172-184.
http://www.math.toronto.edu/dkhmelev/PAPERS/published/gramcodes/gramcodeseng.html

This program is also able to compute the relative R-measure of one text
with respect to another. A more detailed description of the R-measure
(as well as examples of applying this program to Markov chain
computation) can be found in the following paper

Khmelev D., Teahan W. A repetition based measure for
verification of text collections and for text categorization.
SIGIR'2003, July 28--August 1, 2003, Toronto, Canada.
http://www.math.toronto.edu/dkhmelev/PAPERS/published/2003/sigir/sig.html

License conditions are described in file LICENSE.txt

BUGS

Versions 1.2.3 and earlier printed R()=nan on large files (about 400
Kb). Please update cross-entropy to version >=1.3.0.


3  Usage and options summary

user@computer$ ./bin/cross-entropy --help
Usage: cross-entropy [OPTION]... FILE1 FILE2
  -a, --array               suffix array for FILE1 (default FILE1.ary)
  -b, --barray              suffix array for FILE2 (default FILE2.ary)
  -c, --checkonly           check the correctness of both
                            suffix arrays and exit
  -s, --self-suffix-array   Ignore FILE1.ary and FILE2.ary and calculate
                            suffix arrays itself to do computations
  -o, --order <number>      set the order to be equal to <number> (default 2)
  -l, --low-order <number>  calculate starting from the low order <number>
  -t, --top-order <number>  up to the top order <number>
  -r, --repetition-index    calculate R-index
  -q, --quiet               output entropies only
  -h, --help                display this help and exit
  -m, --man                 display description
  -v, --version             display version and exit


4  Description

user@computer$ ./bin/cross-entropy --man
This program calculates the relative entropy of the given order(s) of
FILE2 w.r.t. FILE1. By default it expects that suffix arrays for both
files have been constructed and are kept in FILE1.ary and FILE2.ary,
respectively. If one of the suffix arrays is missing, cross-entropy
tries to construct it on its own, which increases the computation time
significantly and the memory used by a factor of 5 (this behaviour may
be forced with the --self-suffix-array flag). Suffix arrays are expected
in big-endian format, 4 bytes per integer, as produced by the mksary
program by Satoru Takabayashi.
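
The on-disk layout described above (big-endian, 4 bytes per integer) can
be illustrated with a short Python sketch. The function names are
hypothetical and are not part of cross-entropy or mksary. The sketch
assumes the file is a bare sequence of unsigned offsets with no header,
which the text above does not state explicitly, and it builds the array
naively, so it is suitable for small inputs only.

```python
import struct

def build_suffix_array(data):
    # Naive construction: sort suffix start offsets by the suffix bytes.
    # Quadratic or worse in time and memory; for illustration only.
    return sorted(range(len(data)), key=lambda i: data[i:])

def pack_suffix_array(sa):
    # Assumption: offsets are stored as unsigned 32-bit big-endian
    # integers, back to back, with no header.
    return struct.pack(">%dI" % len(sa), *sa)

def unpack_suffix_array(raw):
    if len(raw) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    return list(struct.unpack(">%dI" % (len(raw) // 4), raw))

def is_valid_suffix_array(data, sa):
    # A valid suffix array is a permutation of all offsets whose
    # suffixes appear in strictly increasing lexicographic order.
    if sorted(sa) != list(range(len(data))):
        return False
    return all(data[sa[i]:] < data[sa[i + 1]:] for i in range(len(sa) - 1))
```

For the 6-byte input b"banana" the packed array occupies 24 bytes, which
matches the "four times the size of the text" overhead mentioned in the
readme.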


As a test example, one can calculate the relative entropy of a file
containing only the string 'aaaaabaaaaac' w.r.t. itself.
Its relative entropy is
  -(8/11 ln(4/5) - 2/11 ln(10)) = 0.5809380541
The same quantity without normalization is
  2 ln(10) - 8 ln(4/5) = 6.390318596 = 11*0.5809380541,
where ln() is the natural logarithm (base e=2.7182818284...).

Another example: let test1.txt contain 'aaaaabaaaaac' and test2.txt
contain 'aaaaab'. Then H_2(test2.txt|test1.txt) is 0.639032.
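
Both figures can be reproduced with a few lines of Python under one
reading of "order": order k means the text is cut into overlapping
k-character substrings, and the last character of each is predicted
from the preceding k-1 characters using counts taken from the reference
text. For order 2, 'aaaaabaaaaac' contains 11 pairs: 'aa' 8 times and
'ab', 'ba', 'ac' once each, so p(a|a)=8/10, p(b|a)=p(c|a)=1/10 and
p(a|b)=1. The sketch below is a naive counting implementation of that
reading, not the suffix-array method the program uses, and the function
name is hypothetical.

```python
import math
from collections import Counter

def relative_entropy(text, model, order=2):
    """Average of -ln p(last char | previous order-1 chars) over all
    order-length substrings of `text`, with p estimated from `model`."""
    # Count every substring of length `order` in the reference text.
    grams = Counter(model[i:i + order]
                    for i in range(len(model) - order + 1))
    # Count how often each (order-1)-length context starts a substring.
    contexts = Counter()
    for gram, n in grams.items():
        contexts[gram[:-1]] += n

    count = len(text) - order + 1
    if count <= 0:
        raise ValueError("text is shorter than the requested order")
    total = 0.0
    for i in range(count):
        gram = text[i:i + order]
        if grams[gram] == 0:
            # How the real program treats unseen substrings is not
            # described here, so this sketch simply refuses them.
            raise ValueError("substring %r not seen in model" % gram)
        total -= math.log(grams[gram] / contexts[gram[:-1]])
    return total / count

print(relative_entropy("aaaaabaaaaac", "aaaaabaaaaac"))
print(relative_entropy("aaaaab", "aaaaabaaaaac"))
```

The two printed values agree with 0.5809380541 and 0.639032 to the
precision quoted above.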

Option -r computes the so-called R-index (see the description in
papers by Khmelev & Teahan). It applies only when the suffix arrays
have been precomputed. Examples:
R(test1.txt|test2.txt)=0.679366
R(test2.txt|test1.txt)=1.000000
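
The two R values above are consistent with the following reading of the
R-measure, given here as an assumption rather than as the program's
actual code: for a text T of length l, sum over every suffix of T the
length of its longest prefix that occurs somewhere in the reference
text, multiply by 2/(l(l+1)), and take the square root. The Python
sketch below is a naive implementation of that reading using plain
substring search instead of suffix arrays. The function names are
hypothetical.

```python
import math

def longest_prefix_in(suffix, source):
    """Length of the longest prefix of `suffix` occurring in `source`."""
    k = 0
    while k < len(suffix) and suffix[:k + 1] in source:
        k += 1
    return k

def r_measure(text, source):
    n = len(text)
    if n == 0:
        raise ValueError("text must not be empty")
    # Sum of longest-match lengths over all suffixes of `text`.
    total = sum(longest_prefix_in(text[i:], source) for i in range(n))
    # Normalise by the maximum possible sum n(n+1)/2, then take the root.
    return math.sqrt(2.0 * total / (n * (n + 1)))

print(r_measure("aaaaabaaaaac", "aaaaab"))
print(r_measure("aaaaab", "aaaaabaaaaac"))
```

For test1.txt against test2.txt the match lengths sum to 36, giving
sqrt(72/156), which rounds to 0.679366. Every suffix of test2.txt occurs
in full in test1.txt, so the second value is exactly 1.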


5  Project revision history

Files of the project were modified on the following dates:

2002-08-29

2002-08-30

2002-09-13

2003-05-16

2003-08-22

6  License

cross-entropy - calculation of relative entropy and R-measure

Available at http://www.math.toronto.edu/dkhmelev/PROGS/tacu/

Author:

Dmitry V. Khmelev dkhmelev((at))math.toronto.edu [change ((at)) to @ in order to get proper address - antispam]

University of Toronto, Department of Mathematics, 100 St George Street, M5S 3G3 ON, Canada

LICENSING TERMS

This program is granted free of charge for research and education purposes. However you must obtain a license from the author to use it for commercial purposes.

Scientific results produced using the software provided shall acknowledge the use of cross-entropy. The proper references are:

D. Khmelev, Text Analysis and Conversion Utilities http://www.math.toronto.edu/dkhmelev/PROGS/tacu/

Kukushkina O.V., Polikarpov A.A., Khmelev D.V., Using Literal and Grammatical Statistics for Authorship Attribution (in Russian) [Opredeleniye avtorstva teksta s ispol'zovaniyem bukvennoi i grammaticheskoi informacii]//Problemy Peredachi Informatsii, 2001, vol.37, no.2, pp.96-108. Translated in «Problems of Information Transmission», 2001, vol.37, no.2, pp. 172-184. http://www.math.toronto.edu/dkhmelev/PAPERS/published/gramcodes/gramcodeseng.html

Moreover, the author of cross-entropy shall be informed about the publication.

The software must not be modified and distributed without prior permission of the author.

By using cross-entropy you agree to the licensing terms.

NO WARRANTY

BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.


© 2002-2005 D.Khmelev