Programs Sep, 25 >> TACU >> [ duplicator | cross-entropy | generator | suffsort | trised | xcitata ]

Program duplicator version 1.0.9

Program duplicator version 1.0.9

Initial revision 2002-08-31; Last revision 2004-05-31

1  Download
2  File readme
3  Usage and options summary
4  Description
5  Project revision history
6  License

1  Download

Sources: src/duplicator-1.0.9.tgz [43 Kb ]

Win9x-EXE (minGW cross-compiled): mingw/duplicator.zip [38 Kb ]

2  File readme

duplicator --- locator of duplicated and plagiarized documents


SUPPORTED ENVIRONMENTS

http://www.gnu.org    GNU/Linux 
http://www.mingw.org  MinGW --- Minimalist GNU For Windows


COMPILATION

Enter make (or gmake) in the directory where sources reside


BRIEF INSTRUCTION

This program can be used for the location of duplicated and
plagiarized documents in a text collections. Before you use this
program you have to concantenate all documents of the text collection
into one file, separating documents by special "sentinel" symbol (byte),
which is not encountered in any of the documents of the coollection.
Usually symbol '\000' works fine, by default program uses line feed
'\n' symbol to separate documents, which can be overrided by option
--record-separator (this assumes that all records in the file have no
linefeed symbol, you can change it to '\015'; linefeed separators are
useful because they allow use of Unix commands 'head' and 'tail' for
location of the records). Suppose that you've put your collection with
'\000'-separated records into file COLLECTION.TXT.

Next, you have to construct suffix array

duplicator -m --record-separator=0 COLLECTION.TXT

This action can take quite a long time (hours, or even days,
especially if the length of COLLECTION.TXT times five is less than the
size of RAM). The behaviour (in particular, memory consumption) can be
adjusted using special switch --memory-for-index. The size of
resulting suffix array may exceed file size limit on your system (some
systems can not have more than 2Gb files). This can be dealt with
option --max-file-size. 

Using suffix array constructed you can use all other options for
search of duplicates and plagiarized documents. The most useful
commands are

duplicator -r --record-separator=0 COLLECTION.TXT

duplicator -l --record-separator=0 COLLECTION.TXT

The first command outputs R-measure for each document w.r.t the rest
of the collection into file COLLECTION.TXT.rrepeats.txt 

Second command outputs longest repeated substrings of each document in
the rest of the collection into file COLLECTION.TXT.lrepeats.txt 

More detailed description of R-measure can be found in the following
paper

Khmelev D., Teahan W. A repetition based measure for
verification of text collections and for text categorization.
SIGIR'2003, July 28--August 1, 2003, Toronto, Canada.

License conditions are described in file LICENSE.txt


3  Usage and options summary

user@computer$ ./duplicator --help
Usage: duplicator -[mc12rls] [LONGOPTIONS] FILE
  -m  --makesuffixarray     construct suffix array in FILE.s00, FILE.s01...
  -c  --checkorder          check the correctness of suffix array
  -s  --lcpstat             puts Lcp statistics to FILE.lcpstat.txt
  -l  --lrepeat             puts LCS info into FILE.lrepeats.txt
  -r  --rrepeat             puts approximate R repeats to FILE.rrepeats.txt
  -1  --countrepetitions    record stats to FILE.cmnsbstr.txt
  -2  --countr              puts R statistics to FILE.rstat.txt
  -h, --help                display this help and exit
  -d, --description         display complete description
  -v, --version             display version and exit
Long options: (each option requires a numerical argument)
  --record-separator=<num>  the code in 0..255 of record separator character
  --max-file-size=<num>     (in Mb's), by default 2000 (Mb),
  --memory-for-index=<num>  size of each piece in multi-stage sort (100 Mb)
  The following options manage the data representation in FILE.lrepeats.txt
  --lrepeat-level=<num>     output LCS longer than <num> only (default 50)
  --lrepeat-max-list-length (default 2000) the maximal number of
                            LCS per record displayed in FILE.lrepeats.txt
  --lrepeat-outstrlen=<num> sample LCS displayed are at most <num>(default 20)
  The following options manage the work of action -r or -1
  --rrepeat-max-list=<num>  (default 10) the pretenders list size
  --r-level=<num>           omit Lcps shorter than --r-level (default 1) 

4  Description

user@computer$ ./duplicator --description
<Usage information from the previous section is omitted>
 

  *** DESCRIPTION
  (input files: FILE)
  (general options: --record-separator 
                    --out-recnum-offset --zero-recnum-offset)

This program give some information on duplications and repetitions in
file FILE. Repetitions are calculated between "records", which should
be separated by record separator (RS), which '\n' is by default,
and can be modified with --record-separator <RS code> option. Duplicator
simply substitutes each <RS> character with zero '\000' character after
loading source FILE into memory, so any zero characters in FILE makes an
error unless <RS> is '\000'.

With options -l, -r, -1, -2, duplicator produces some output files,
containing infomation on records. The records by default are numbered
from 1. If you prefer, you can make offsets to be zero using
--zero-recnum-offset option. Other offsets (in a range from
-1000*1000*1000..1000*1000*1000) can be obtained set with
--out-recnum-offset option accepting integer argument in a form
--out-recnum-offset=<num>


   *** Option -m 
  (input files: FILE)
  (output files: FILE.s00, FILE.s01 etc)
  (temporary files: _tmppool%03d.bin)
  (sub-options: --max-file-size --memory-for-index)

In order to compute repetition characteristics, the duplicator needs to
construct suffix array. You should call it with option -m to do it:

  duplicator -m FILE

If you have enough memory (about size of FILE times 5), then the
program would construct the suffix array and put it into FILE.s00. If
your FILE is really large (more than 512 Mb), then it could happen
that the size of suffix array will exceed the limit for file size in
your system (which is usually 2Gb). If suffix array is larger that
2000 than it is splitted into several pieces
numbered, resp. FILE.s00, FILE.s01 etc. You can modify maximum file
size using option --max-file-size setting the maximal file size on
your system in Mb's.

More usual problem is the shortage of memory. If your RAM size is less
than the size of FILE, then upgrade your memory. If your RAM size is
slightly more than size of FILE (e.g. 1.5 times more), then duplicator
can make suffix array in pieces (assuming that you have
8*length(FILE) free space on your hard drive). The size of each piece
in Mb's is defined by option --memory-for-index (default 100Mb).
Program will construct temporary files of with names _tmppool%03d.bin
in working directory and will remove them after it has been finished
successfully.

Suffix array construction is a hard job, so the program will never
overwrite the existing file(s). Hence, if you wish to reconstruct
suffix array of if the program was interrupted, you should manually
delete all the files related. Notice also that only this option
requires a lot of memory, all other functions memory requirement are
much lighter: they simply load FILE into memory and than operate
scanning through suffix array consequently, loading only a small part
of it (at most the size of longest record elements)

  *** Option -c 
  (input files: FILE  FILE.s00 [and FILE.s01 etc if exist])
  (output: stdout)
  (sub-options: --display-suffix-sample)
outputs the report into stdout. It checks the number of suffixes,
information on mis-sorted suffixes. 

You can use option --display-suffix-sample to display a sample of
20 suffixes from the middle of the suffix array (more precisely,
last 20 suffixes of first 1/30 of suffix array)

Suffix array structure is quite simple. After the input file is loaded
into memory. it is logically preceded by <RS> character and is kept in
a single large array of characters S[0..n], where n is the length of
file (in fact, if the last record did not end with <RS> character, the
text is appended by another <RS> character). The main fact for us is
that the text itself is S[1..n]. A sub-array S[i..n] is called a suffix
number i. Suffix i is empty if S[i]=0 and non-empty otherwise. In
suffix array FILE.s00 only non-empty suffixes are kept. Each non-empty
suffix is a 4-byte positive number. This numbers are sorted in
lexicographical order of corresponding articles and form suffix
array. To be absolutely precise if two suffixes S[i_1..n] and S[i_2..n]
are the same as *zero-tailored*string*, than the order is determined by
by order of number i_1 and i_2, i.e., if i_1<i_2 then
S[i_1..n]<S[i_2..n]. This definition of order imply that there are no
"equal" suffixes at all.

The structure if FILE.s00 is as follows, where offsets and
sizes are given in bytes

offset  size  value  meaning
     0     4    T    a total number of non-empty suffixes
     4     4   i_1   S[i_1..n] --- the smallest suffix
     8     4   i_2   S[i_2..n]
 ....................
   4*T     4   i_T   S[i_T..N] --- the largest suffix

If 4*T is larger than --max-file-size, then for some k we have the
following structure of FILE.s00:

offset  size  value  meaning
     0     4    T    a total number of non-empty suffixes
     4     4   i_1   S[i_1..n] --- the smallest suffix
     8     4   i_2   S[i_2..n]
 ....................
   4*k     4   i_k   S[i_k..N] 
4*(k+1)    4    0    trailing zero means that the suffix array is 
                     continued in next FILE.s01

FILE.s01 structure:
offset  size  value    meaning
     0     4   i_{k+1} S[i_{k+1}..n]
     4     4   i_{k+2} S[i_{k+2}..n] 
 ....................

The trailing zero means that we should go for the next suffix array
file in FILE.s01, FILE.s02 etc.

  *** Option -l (--lrepeat) 
  (input files: FILE  FILE.s00 [and FILE.s01 etc if exist])
  (output: FILE.lrepeats.txt)
FILE. More precisely, for each record R it determines a list of other
records, which have common substring with the record R of length at
least --lrepeat-level (which is 50 by default).

The program keeps no more than --lrepeat-max-list-length records in a
list of CS records for each record R while calculation. The records
with longer CS have the priority and they replace those with smaller
CS. 

In output file  FILE.lrepeats.txt information on records is
displayed line-per-record, e.g.,
     1  1920 L=30 repsnum=0
     2   269 L=33 repsnum=0
     3   198 L=39 repsnum=0
     4  2143 L=2143 repsnum=1 (16,2143,2391,4533,"is not under pressur")
     5   270 L=270 repsnum=2 (12751,270,4535,4804,"reported the farmer ") (14502
,164,4623,4786," and have matured re")
     6   444 L=28 repsnum=0
...........................

where the first column contains the number of the record, the second
being the record length, the third being the length of longest
repetition of substring in this record. The rest of the line
represents the records which have at least --lrepeat-level common
substring with a given one. It begins with "repsnum=<num>", where
<num> is the number of these records, possibly 0. Each longest
repetition with other records is given by three scalars
"(rec,lcs,offs,loffs,str)" in brackets, separated by commas. Here
'rec' is the record number, 'lcs' being the longest common substring
between rec and record for this line, 'offs' being the number of byte,
starting the shared string, 'loffs' being the number of the last byte
of the shared string, and 'str' being first few characters of shared
string. The values of loffs and lcs are useful with 'head' and 'tail'
Unix commands, which allow to view the repeated string completely. In
order to do it, just type

  head -c`loffs` FILE |tail -c`lcs`

where `loffs` and `lcs` are numerical values from a quinter above. For
example, the example above is obtained for the file reuters.txt and
the command would be:

   head -c4533 reuters.txt|tail -c 2143|head -c70

producing the following output (cutted)
is not under pressure to act quickly on its proposed equity offering a

The string str is a standard C string with echoed special symbols and ".

The maximal length of str is --lrepeat-outstrlen (default 20).

A final remark: this mode is better used to detect that there are no
repeted strings in the corpora. The problem is that for each pair of
records only the repetition of maximal length is reported, while there
could be a lot of repetitions of the same length which are not
displayed.

  *** Option -r (--rrepeat)
  (input files: FILE  FILE.s00 [and FILE.s01 etc if exist])
  (output: FILE.rrepeats.txt)
  (sub-options: --rrepeat-max-list --r-level)
Finds out all approximate repeats for R statistics.

  *** Option -1 (--countrepetitions) 
  (input files: FILE  FILE.s00 [and FILE.s01 etc if exist])
  (output: FILE.cmnsbstr.txt)
Counts repetition statistics.

  *** Option -2 (--countr)
  (input files: FILE  FILE.s00 [and FILE.s01 etc if exist])
  (output: FILE.rstat.txt)
  (sub-options: --r-level)
Counts R statistics only

  *** Process management
Ctrl-C should be used to display the progress in lengthy subroutines.

For Unix users: To stop the program suspend it pressing Ctrl-Z and use
kill -9. You can also send a signal kill -2 to the program to check
current status.

For Win32 users: To stop the program close the console window.


5  Project revision history

Files of the project were modified on the following dates:

2002-08-31

2002-09-01

2002-09-10

2002-09-12

2002-09-25

2002-10-23

2002-10-25

2003-03-08

2003-05-12

2003-05-16

2004-05-31

6  License

duplicator - locator of duplicated and plagiarized documents

Available at http://www.math.toronto.edu/dkhmelev/PROGS/tacu/

Author:

Dmitry V. Khmelev dkhmelev((at))math.toronto.edu [change ((at)) to @ in order to get proper address - antispam]

University of Toronto, Department of Mathematics, 100 St George Street, M5S 3G3 ON, Canada

LICENSING TERMS

This program is granted free of charge for research and education purposes. However you must obtain a license from the author to use it for commercial purposes.

Scientific results produced using the software provided shall acknowledge the use of duplicator. The proper references are:

D. Khmelev, Text Analysis and Conversion Utilities http://www.math.toronto.edu/dkhmelev/PROGS/tacu/

Khmelev D., Teahan W. A repetition based measure for verification of text collections and for text categorization. SIGIR'2003, July 28-August 1, 2003, Toronto, Canada. http://www.math.toronto.edu/dkhmelev/PAPERS/published/2003/sigir/sig.html

Moreover shall the author of duplicator be informed about the publication.

The software must not be modified and distributed without prior permission of the author.

By using duplicator you agree to the licensing terms.

NO WARRANTY

BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM ÄS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

1  Download
2  File readme
3  Usage and options summary
4  Description
5  Project revision history
6  License

Programs Sep, 25 >> TACU >> [ duplicator | cross-entropy | generator | suffsort | trised | xcitata ]

- ???????@Mail.ru
© 2002-2005 D.Khmelev -