- Figure 1, p.4, the Parrennes examples. These examples are in
the Fabrice directory. Go to this directory and run
grep -c input_clause 〈file〉, where 〈file〉 ranges over
alice_full1.h1.p, alice_full2.h1.p, alice_full3.h1.p,
needham_preuve.h1.p and needham_saturation.h1.p. This gives you
the figures in the # of clauses column.
Now run tptp2dfg 〈file〉. Get the argument of the
functions directive, which lists all functions in the signature,
and count them. This gives you the figures in the #functions
column.
To compute the #literals/clause figure, run grep -c -- '--'
〈file〉 to count all negative literals, and
grep -c -- '++' 〈file〉 to count all positive literals;
then add the two numbers and divide by the # of clauses.
To compute the average size of clauses, run
cat 〈file〉 | grep -v input_clause | sed -e 's/--//' | sed -e 's/++//' | sed -e "s/[][().,]/ /g" | wc -w
and divide the result by the # of clauses (see the sketch after
this item).
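As a convenience, the counts above can be collected in one pass. The
following is a minimal sketch, run from the Fabrice directory, that
merely combines the pipelines above; the divisions are done with awk
and the output wording is our own.
for f in alice_full1.h1.p alice_full2.h1.p alice_full3.h1.p \
         needham_preuve.h1.p needham_saturation.h1.p
do
  clauses=`grep -c input_clause $f`           # "# of clauses" column
  neg=`grep -c -- '--' $f`                    # negative literals
  pos=`grep -c -- '++' $f`                    # positive literals
  words=`cat $f | grep -v input_clause | sed -e 's/--//' -e 's/++//' -e "s/[][().,]/ /g" | wc -w`
  echo $f $clauses $neg $pos $words | awk '{
    printf "%s: %d clauses, %.2f literals/clause, %.2f words/clause\n",
           $1, $2, ($3 + $4) / $2, $5 / $2 }'
done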
- Claim “Almost none of the problems from the TPTP library are in
the H1 class”, top of p.5: run the shell script
findh1.
- Figure 2, p.5. This table was built by launching the shell
script precision. Then count the number of lines starting with
*** to fill in the #defin. coarse row, and the number of lines
starting with +++ to fill in the #possib. coarse row (a sketch
follows this item). The Total # row is obtained by running
EXT=h1
for i in tptp/*
do
echo $i: `grep -l Killed $i/*.bench.$EXT | wc -l` killed, `grep -l NoMem $i/*.bench.$EXT | wc -l` nomem, `ls $i/*.h1.p | wc -l` files
done
and taking the last number on each line (just before “files”).
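For the two counting steps, here is a minimal sketch, assuming the
precision script writes its diagnostics to standard output (the
capture file name precision.out is our own):
./precision > precision.out 2>&1   # or however the script is normally launched
grep -c '^\*\*\*' precision.out    # number for the #defin. coarse row
grep -c '^+++' precision.out       # number for the #possib. coarse row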
- All TPTP benchmarks, running h1 (with all variants
mentioned in the paper, and in fact at least one more) or
SPASS, are run by launching the shell script
bench. Beware that this takes something like
eight hours to complete.
This produces the tables of Figures 3, 4, 7, 13, 16, 20, and 22
automatically, in files called times.h1.tex, times.h1-no-skel.tex, etc.
This also prepares some required files for other tables mentioned
below.
- Figure 5 is obtained by launching gnuplot v4.0 or higher,
then typing the commands
set term postscript eps noenhanced 20
set output "times.h1.eps"
load "times.h1.plot"
If you don't have a file called times.h1.plot, you probably
forgot to run bench: the times.h1.plot file is produced by
running statbench h1.
- Figure 6 is obtained, similarly, by launching gnuplot v4.0
or higher, then typing the commands
set term postscript eps noenhanced 20
set output "mem.h1.eps"
load "mem.h1.plot"
- Figure 7 is obtained by launching gnuplot v4.0 or higher,
then typing the commands
set term postscript eps noenhanced 20
set output "times.spass.eps"
load "times.spass.plot"
- Figure 8 is obtained by launching gnuplot v4.0 or higher,
then typing the commands
set term postscript eps noenhanced 20
set output "mem.spass.eps"
load "mem.spass.plot"
Before that, you need to change the set xrange line in mem.spass.plot to [1:],
otherwise some data points will fall outside the plotted range and be
missing from the diagram (see the one-liner below).
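A minimal sketch of that edit, assuming GNU sed for in-place editing
(otherwise edit the file by hand):
sed -i 's/^set xrange.*/set xrange [1:]/' mem.spass.plot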
- Figure 9 was obtained by copying each command line following the
prompt > in file bench.h1 into a
console and running it as shown. Do not run this file as though it
were a shell script; it is not one. The answers we got were then
copied into the bench.h1 file right after the corresponding
command.
The “generated” column of the figure was obtained by copying the
number at the end of the line “Total generated (derived+subsumed
forward+frozen):”. Similarly for the “subsumed” column and the
“Total subsumed (forward+backward+fully defined):” line, and for the
“automaton” column and the “Automata clauses:” line. Times were
obtained by reading the user time (just before “user” in lines such as
“358.26 user, 0.68 system, 744.06 elapsed – Max VSize = 268380KB,
Max RSS = 265588KB”). Memory usage was obtained by reading the
number just after “Max VSize = ” on the same line, and dividing by
1024 to get a result in MB (a sketch follows this item).
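A minimal sketch of that extraction, assuming the output of one
h1.opt run was saved to a file (the name out.log is hypothetical)
and that the relevant lines look exactly as quoted above:
grep 'Total generated' out.log | awk '{print $NF}'    # "generated" column
grep 'Total subsumed' out.log | awk '{print $NF}'     # "subsumed" column
grep 'Automata clauses:' out.log | awk '{print $NF}'  # "automaton" column
awk '/ user,/ {print $1}' out.log                     # user time, in seconds
awk -F'Max VSize = ' '/Max VSize/ {split($2, a, "KB"); printf "%.1f MB\n", a[1]/1024}' out.log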
- Figure 10 was obtained similarly, by taking three sets of measures of
the kind found in Figure 9 and then computing averages and standard
deviations (a sketch follows this item).
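A minimal sketch of that last step, with placeholder numbers (not
actual measurements) standing in for the three runs; it prints the
mean and the sample standard deviation:
echo "358.26 360.12 355.93" | awk '{
  n = NF
  sum = 0; for (i = 1; i <= n; i++) sum += $i
  mean = sum / n
  ss = 0; for (i = 1; i <= n; i++) ss += ($i - mean) ^ 2
  printf "mean = %.2f, stddev = %.2f\n", mean, sqrt(ss / (n - 1))
}'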
- p.7, claim that “SPASS did not terminate on any of the
Parrennes examples, even when allotted 60 minutes of computation
time.”. Claim that “SPASS did not terminate on the original,
unapproximated problems S either, again within a 60 minute time
limit.” On p.11, claim that “SPASS does terminate on the
original problems, when run with the
-RSSi=1
(enable sort
simplification), -FuncWeight=999999
(compare terms mainly by
function count and ignore predicates), and -Select=2
(select
all negative literals) options.”, and that “SPASS -RSSi=1
-FuncWeight=999999
-Select=2
still does not terminate
in less than 60 minutes on the (approximated) Parrennes
examples.”
These conclusions were reached by running SPASS 2.0 using the
commands listed in file bench.spass.
Do not run this shell script file directly. Rather, go to the right
directory, then copy each command and paste it into a console. The
results we obtained are indicated in comments after each command
line.
- p.19, claim that “all of the Parrennes and TPTP examples
whose type proxy approximation was found to be unsatisfiable were
indeed unsatisfiable.” This was observed by launching the shell
script precision-monadic-proxy. Then count the number of lines
starting with *** (definitely coarse; there should not be
any) and with +++ (possibly coarse; there should not be any
either); see the check below.
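A minimal sketch of that check, assuming as before that the script
writes its diagnostics to standard output (the capture file name is
our own); both counts should be 0:
./precision-monadic-proxy > precision-monadic-proxy.out 2>&1
grep -c '^\*\*\*' precision-monadic-proxy.out   # definitely coarse: expect 0
grep -c '^+++' precision-monadic-proxy.out      # possibly coarse: expect 0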
- Figure 14, p.25, was obtained just like Figure 9, except through
the commands found in
bench.h1-no-skel.
- Figure 17, p.29, was obtained just like Figure 9, except through
the commands found in
bench.h1-monadic-proxy.
Note that half of the commands in the latter file use the additional
flag -all. This is used to fill in Figure 18, p.30.
- Figure 19, p.33, was obtained just like Figure 9, except through
the commands found in
bench.h1-noda.
- Figure 21, p.39, was obtained by running the shell script
abbrvbench. Then launch gnuplot
v4.0 or higher, and type the commands
set term postscript eps noenhanced 20
set output "abbrv.h1.eps"
load "abbrv.h1.plot"
- Figure 23, p.42, was obtained by running the shell script
interbench. Then launch gnuplot
v4.0 or higher, and type the commands
set term postscript eps noenhanced 20
set output "inter.h1.eps"
load "inter.h1-no-alt.plot"
- p.42. To compare which problems succeeded/failed with or
without abbreviation of non-trivial blocks, after all benchmarks
have been run, type
FILES=`find tptp -name '*.bench.h1-no-alt' -exec grep -q NoMem {} \; -print`
FILES="`find tptp -name '*.bench.h1-no-alt' -exec grep -q Killed {} \; -print` $FILES"
grep Intersection $FILES
This gives you the list of problems which failed with h1
-no-alternation, together with the number of intersection
predicates generated for each. You may remove from this list those
problems that already failed without the -no-alternation
option, by typing
OFILES=`find tptp -name '*.bench.h1' -exec grep -q NoMem {} \; -print`
OFILES="`find tptp -name '*.bench.h1' -exec grep -q Killed {} \; -print` $OFILES"
and removing the files listed in $OFILES from those in $FILES (see
the sketch below).
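A minimal sketch of that removal step, assuming bash (for process
substitution); the sed step strips the differing .bench.* extensions
so that the two lists become comparable, and comm -23 keeps only the
problems present in $FILES but not in $OFILES:
comm -23 <(echo $FILES  | tr ' ' '\n' | sed -e 's/\.bench\..*$//' -e '/^$/d' | sort) \
         <(echo $OFILES | tr ' ' '\n' | sed -e 's/\.bench\..*$//' -e '/^$/d' | sort)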
- p.42, claim that “The
COM006-1
case is stranger: while a
contradiction was found, the 462 step refutation does not use any
of the 5 defining clauses produced.” This was obtained by
running
memtime -m 400000 -c 300 h1.opt -progress -no-alternation tptp/COM/COM006-1.h1.p
h1logstrip tptp/COM/COM006-1.h1.log.gz >c6.log
grep -c '\[' c6.log
The result we got was 462. Each time the abbreviation of
non-trivial blocks rule is used, some lines containing the words
inter-def: and inter-use: are output in
tptp/COM/COM006-1.h1.log.gz. Check with grep that there are no such
words in the stripped log file c6.log (see below), showing that no
clause using any intersection predicate was eventually useful in
deriving a contradiction.
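A minimal sketch of that grep check; the count should come out as 0:
grep -c -e 'inter-def:' -e 'inter-use:' c6.log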
- Figure 24, p.43, was obtained just like Figure 9, except through
the commands found in
bench.h1-no-alt. We added a
#intersection predicates column, obtained by reading the lines
starting with “Intersection predicates:” in the latter file.
- Figure 25, p.46, was obtained by looking again at file
bench.h1. This time, we fill the
“forward subsumption” column with the figure at line “Subsumed
clauses [forward] :”, the “backward subsumption” column with the
number at line “Subsumed clauses [backward]:”, the “full def.
subsumption, straight” column with the number at line “Subsumed
parent clauses because of fully defined symbols:”, and the “full
def. subsumption, backward” column with the number at line
“Backward subsumed clauses because of fully defined symbols:”
(a sketch for extracting these numbers follows).
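A minimal sketch of pulling those four counters out of bench.h1,
assuming the quoted lines appear verbatim there with the number as
the last field:
grep 'Subsumed clauses \[forward\]' bench.h1 | awk '{print $NF}'    # forward subsumption
grep 'Subsumed clauses \[backward\]' bench.h1 | awk '{print $NF}'   # backward subsumption
grep 'Subsumed parent clauses because of fully defined symbols:' bench.h1 | awk '{print $NF}'
grep 'Backward subsumed clauses because of fully defined symbols:' bench.h1 | awk '{print $NF}'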