Implemented the noncommutative product for solvable polynomials, together with relation tables. SolvableOrderedMapPolynomial extends OrderedMapPolynomial, RatSolvableOrderedMapPolynomial extends SolvableOrderedMapPolynomial, and the interface SolvablePolynomial extends OrderedPolynomial. Added more left multiplication methods, left reduction and left Groebner Bases, as well as a two-sided Groebner Base test and computation. The Pairlist class was changed to avoid the usage of criterion 4 when running on SolvablePolynomials; the criterion4() method itself also checks for SolvablePolynomials.
run / n | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
first | 3 | 13 | 32 | 92 | 128 | 188 | 274 | 420 | 683 | 1126 | 1795 | 2793 | 4380 | 6741 |
second | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 8 | 10 | 13 | 16 | 21 | 27 | 35 |
Timings in ms on an AMD XP 2800+. Computation of (Y^n) * (X^n) with respect to the relation Y * X = X * Y - H. In the first run the relation table is populated with the products Y^i * X^i, i = 2,...,n. In the second run these relations are reused, so the products take almost no computing time anymore.
run / n | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|
first | 28 | 94 | 303 | 1234 | 5185 | 24647 |
second | 1 | 12 | 107 | 782 | 4569 | 23897 |
The second example shows the computation of ( Xa + Xb + Xc + Ya + Yb + Yc + Ha + Hb )^n in U(sl_3). Since the relation table stores only products of two variables, the improvement is minimal (for low n).
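The caching effect can be pictured with a small sketch (class and method names below are illustrative only, not the actual JAS API): a product of two monomials is computed via the defining relation the first time and then stored in the relation table, so a later run finds it by lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a relation table for a solvable polynomial ring.
// Key: a pair of monomials (encoded here as a string like "Y^i*X^j"),
// value: the expanded right-hand side, computed once and then reused.
public class RelationTableSketch {

    private final Map<String, String> table = new HashMap<>();

    // Look up the rewrite of a product of two monomials; if it is not yet
    // in the table, compute it (placeholder) and cache it for reuse.
    public String rewrite(String leftMonomial, String rightMonomial) {
        String key = leftMonomial + "*" + rightMonomial;
        String rhs = table.get(key);
        if (rhs == null) {
            rhs = computeByRelations(leftMonomial, rightMonomial); // expensive first time
            table.put(key, rhs);                                   // cheap on the second run
        }
        return rhs;
    }

    // Placeholder: in a real implementation this applies the defining
    // relation, e.g. Y * X = X * Y - H, recursively to higher powers.
    private String computeByRelations(String left, String right) {
        return "expand(" + left + "," + right + ")";
    }
}
```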
# Threads / CPUs | # JVMs | time (sec) | #put | #remove | % total |
---|---|---|---|---|---|
1, seq | 1 | 160.2 | 70 | 327 | 13.5 |
1, par | 1 | 157.0 | 70 | 327 | 13.5 |
2, par | 1 | 82.2 | 72 | 329 | 12.7 |
1, dist | 1 | 177.2 | 77 | 334 | 11.4 |
2, dist | 2 | 92.2 | 90 | 347 | 8.6 |
4, dist | 2 | 56.2 | 112 | 369 | 5.9 |
8, dist | 2 | 58.9 | 255 | 516 | 1.5 |
4, dist | 4 | 51.2 | 117 | 374 | 5.5 |
6, dist | 4 | 43.7 | 129 | 386 | 4.6 |
8, dist | 4 | 62.9 | 259 | 519 | 1.5 |
Timings taken on a 16 CPU Intel Xeon SMP computer running
at 2.7 GHz and with 32 GB RAM.
JVM 1.4.2 started with AggressiveHeap and UseParallelGC.
#JVMs = number of distinct Java virtual machines.
#put = number of polynomials put to the pair list.
#remove = number of pairs removed from the pair list,
i.e. after application of the criteria,
but still including nulls.
% total = percentage of removed pairs relative to the total number of pairs generated:
#remove / ( #put * (#put-1) / 2 ) * 100.
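For example, in the first row 327 pairs were removed out of 70 * 69 / 2 = 2415 generated, giving 327 / 2415 * 100 ≈ 13.5 %.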
# Threads / CPUs | # JVMs | time (sec) | #put | #remove | % total |
---|---|---|---|---|---|
1, dist | 1 | 24726.2 | 140 | 781 | 8.0 |
2, dist | 2 | 12356.1 | 165 | 806 | 5.9 |
4, dist | 4 | 6859.3 | 218 | 859 | 3.6 |
8, dist | 4 | 7465.1 | 411 | 1054 | 1.2 |
8, dist | 8 | 6412.9 | 344 | 986 | 1.6 |
8, dist | 8 | 7173.3 | 399 | 1041 | 1.3 |
The overhead of the distributed variant is about 10% for Katsura 6 (G). "Distributed 1" means one distributed process is running for the reduction of S-polynomials. There is always a master process that handles polynomial input/output, the setup and management of the distributed workers, and the pair list. Communication between master and workers always goes via TCP/IP with object serialization, even when running on one computer.
New classes implement a distributed hash table to hold the polynomials in the distributed GB. The index of a polynomial in the Pairlist is used as hash key. Communication now uses GBTransportMess message types. Polynomials are transported only once to each reducer, since afterwards only the polynomial hash indexes are transported. The distributed list is asynchronous and updated late, so some duplicate H-polynomials (head terms) can be (and are) produced. A possible solution is a local put to the hash table with a dummy index. Timings are not dramatically better.
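The transport idea can be sketched as follows (names are illustrative, not the actual JAS classes): each polynomial is stored under its pair list index, so most messages only need to carry the small integer key.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a shared table of polynomials keyed by their
// pair list index, so reducers can be sent the index instead of the
// full (possibly large) polynomial object.
public class DistributedPolyTableSketch<P> {

    private final ConcurrentHashMap<Integer, P> polys = new ConcurrentHashMap<>();

    // Called when a polynomial is added to the pair list; the index
    // assigned by the pair list serves as the hash key.
    public void put(int pairlistIndex, P polynomial) {
        polys.put(pairlistIndex, polynomial);
    }

    // A reducer that receives only the index fetches the polynomial
    // locally (or requests a retransmission if it has not arrived yet).
    public P get(int pairlistIndex) {
        return polys.get(pairlistIndex);
    }
}
```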
Todo: check reduction algorithm to use later arriving polynomials.
-Xms200M -Xmx400M -XX:+AggressiveHeap -XX:+UseParallelGC. Memory settings must be adjusted with respect to your situation.
Separated the versions with Pair Sequence Respecting Order (PS) from the normal versions. The PS versions try to add reduced polynomials to the ideal base in the same order as the sequential version. The normal versions now run OK on the parallel computer with the right JVM options. Refactoring with Eclipse (organize imports, static methods).
# Threads / CPUs | Katsura 6 TO(G) load* | Katsura 6 TO(G) empty+ | Katsura 7 TO(G) load* | Katsura 7 TO(G) empty+ |
---|---|---|---|---|
seq | 184.5 | 153.3 | ||
1 | 181.5 | 4% / 159.7 | 28418.6 | |
2 | 118.6 | s2.02 / p2.11 / 75.6 | p2.06 / 13760.0 | |
4 | 76.8 | s3.79 / p3.95 / 40.4 | 6256.9 | p4.56 / 6225.1 |
8 | 43.2 | s7.19 / p7.49 / 21.3 | 3240.7 | p8.56/ 3318.8 |
10 | 42.5 | |||
12 | 40.5 | 2288.1 | p9.90 / 2868.4 | |
14 | 31.2 | |||
16 | 51.9 | s8.19 / p8.54 / 18.7 | 5376.4 | p12.59 / 2256.1 |
Timings taken on a 16 CPU Intel Xeon SMP computer running
at 2.7 GHz and with 32 GB RAM.
JVM 1.4.2 started with AggressiveHeap and UseParallelGC.
*) timing taken with other load on the CPUs.
+) timing taken with no other load on the CPUs.
Speedup: s = relative to sequential,
p = relative to parallel with one thread / CPU.
Scaling from 8 to 16 CPUs is bad, but this was also observed on
non-CA/GB examples (Java and C/FORTRAN).
N vars = N+1 |
TermOrder | Seconds | TermOrder | Seconds |
---|---|---|---|---|
7 | G | 32044.204 | L | |
6 | G | 112.641 | L | |
5 | G | 4.195 | L | |
4 | G | 0.431 | L | 11.650 |
3 | G | 0.153 | L | 0.310 |
2 | G | 0.031 | L | 0.032 |
Added putParallel(), removeParallel() and helper methods. Sequence numbers are generated, and reduced polynomials are put to the pair list only if the corresponding pair number arrives in the correct (sequential) order, as sketched below. The ordered list/queue pairsequence (a TreeMap/SortedMap) keeps track of the polynomials not yet put to the pairlist.
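A condensed sketch of this bookkeeping, under the assumptions just described (the real putParallel()/removeParallel() code differs in detail): out-of-order reductions are parked in a sorted map until all lower sequence numbers have arrived.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: deliver reduced polynomials to the pair list in the order of
// their sequence numbers, buffering out-of-order arrivals.
public class PairSequenceSketch<P> {

    private final SortedMap<Integer, P> pending = new TreeMap<>();
    private int next = 0; // next sequence number expected by the pair list

    public synchronized void putInSequence(int seqNo, P polynomial) {
        pending.put(seqNo, polynomial);
        // flush every polynomial whose turn has come
        while (!pending.isEmpty() && pending.firstKey() == next) {
            P head = pending.remove(pending.firstKey());
            deliverToPairlist(head);
            next++;
        }
    }

    private void deliverToPairlist(P polynomial) {
        // placeholder for the actual pair list put
    }
}
```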
New class ExecutableServer for the remote execution of objects implementing the RemoteExecutable interface.
New setup for the distributed computation of GBs: the GB master now sends the client code to some ExecutableServers, based on a machine file with host and port information about the distributed environment.
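A minimal sketch of the remote execution pattern (interface and method names here are assumptions, not necessarily the actual JAS signatures):

```java
import java.io.Serializable;

// Sketch: work is shipped as a serializable object and run remotely.
interface RemoteExecutableSketch extends Serializable, Runnable {
    // marker-style interface: the executable server deserializes the
    // received object and simply calls run()
}

// Example client job a GB master could send to an executable server.
class ReducerJobSketch implements RemoteExecutableSketch {
    private final String masterHost; // where to fetch pairs and report results
    private final int masterPort;

    ReducerJobSketch(String masterHost, int masterPort) {
        this.masterHost = masterHost;
        this.masterPort = masterPort;
    }

    public void run() {
        // connect back to masterHost:masterPort and reduce S-polynomials
    }
}
```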
Improved the PolynomialTokenizer so that it can read almost unedited old MAS GB input files: ** exponents and parentheses around polynomials are accepted. Lines starting with # are treated as comments. Comments (* *) and parentheses within polynomials are still not supported.
Implemented a common driver for GB computations, RunGB. Sequential, thread-parallel and distributed computation can be selected by command line parameters. The input is taken from a file. The number of threads, respectively the number of distributed clients, can be specified. For distributed execution the host and port information is taken from a machines file.
Usage: RunGB [seq|par|dist|cli] <file> #procs [machinefile]
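For illustration, an invocation like RunGB dist trinks.jas 4 machines (the file names here are hypothetical) would run the distributed version with 4 clients, taking host and port information from the machines file.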
Added methods putCount() and remCount() in OrderedPairlist to count the number of polynomials put to and removed from the pair data structure.
# Threads | 1 seq | 1 par | 2 par | 2 par | 3 par | 4 par | 5 par | 1 dist | 2 dist | 3 dist | 4 dist | 4 dist | 4 dist | 5 dist |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
# put | 22 | 22 | 43 | 26 | 28 | 28 | 28 | 22 | 25 | 28 | 37 | 40 | 33 | 27 |
# remove | 25 | 25 | 61 | 30 | 32 | 32 | 41 | 26 | 33 | 42 | 47 | 61 | 54 | 69 |
Timings @ 500 MHz on one CPU and one machine with log4j level INFO are:
ca. 2.5 - 3.5 seconds for sequential GB,
ca. 2.5 - 6 seconds for parallel GB,
ca. 5.5 - 9 seconds plus 5 seconds sync time for distributed GB.
Network shuffling of polynomials seems to account for 3 seconds in
this example.
Problem uncovered: the distributed version of GB needs an average of 5 seconds to sync with all clients (on one machine). This is way too much for execution times in the range of 2 to 8 seconds.
Redesign of DistributedList, now using a TreeMap to keep the list entries in proper sequence. As key a natural number is used, which is assigned by the server to successive add() requests.
The server now also holds a copy of the list itself, so the retransmission of list elements to late arriving clients is possible. This doubles the space required to store the polynomials, but removes the initial delay needed to sync all clients until they have received all list elements. With this retransmission the DistributedList synchronization delay during DistributedGB could be removed.
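A sketch of the server-side bookkeeping as described above (illustrative code, not the actual DistributedList implementation): the server assigns consecutive keys, keeps its own copy, and can replay the list to clients that join late.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: server side of a distributed list that keeps entries in
// arrival order and can retransmit them to late joining clients.
public class DistributedListServerSketch<E> {

    private final SortedMap<Integer, E> entries = new TreeMap<>();
    private int nextKey = 0;

    // Each add() gets the next natural number as key, fixing the order.
    public synchronized int add(E element) {
        int key = nextKey++;
        entries.put(key, element);
        broadcast(key, element); // send (key, element) to connected clients
        return key;
    }

    // A client that connects late asks for everything it has missed.
    public synchronized SortedMap<Integer, E> entriesFrom(int firstKey) {
        return new TreeMap<>(entries.tailMap(firstKey));
    }

    private void broadcast(int key, E element) {
        // placeholder for TCP/IP object serialization to all clients
    }
}
```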
However the problem of about 5 seconds delay in startup
of DistributedGB still remains. It is not visible where and why this
delay occurs.
Further improvements would be the removal of list elements or the clearing of the list. Next steps could be a distributed HashMap or TreeMap. An important improvement would be to keep serialized copies of the list elements (polynomials) at the server, to avoid serializing them many times during broadcast.
Todo: a launcher similar to mpirun, a daemon to run on a distributed system and work with the launcher, solvable polynomials and non-commutative GBs.
With one thread the time is 30.6 h. Besides the better CPU speed, JDK 1.4 together with the new polynomial implementation gives about a 5% improvement compared to the older timings with JDK 1.3.
Multiplication with ordered polynomials is about 8-10 times faster than multiplication with unordered polynomials. Multiplication with semi-ordered polynomials (LinkedHashMap) with order-preserving addition is also about 7-8 times slower than multiplication with ordered polynomials.
All implementations are based on Map interface and classes. The Map maps exponent vectors (from some monoid) to coefficients (from some domain). This is in sync with the mathematical definition of multivariate polynomials as mappings from some monoid to some domain. Term orders are represented by a TermOrder class which provides the desired Comparator classes for the SortedMap implementation.
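A minimal sketch of this Map-based view (simplified; the comparator shown is a hypothetical graded-lex order, and the real ExpVector and TermOrder classes are richer):

```java
import java.math.BigInteger;
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: a multivariate polynomial as a map from exponent vectors to
// coefficients, sorted by a comparator supplied by a term order.
public class MapPolynomialSketch {

    // Graded lexicographic comparison of exponent vectors (illustrative).
    static final Comparator<int[]> GRADED_LEX = (a, b) -> {
        int da = 0, db = 0;
        for (int e : a) da += e;
        for (int e : b) db += e;
        if (da != db) return Integer.compare(db, da); // higher total degree first
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) return Integer.compare(b[i], a[i]);
        }
        return 0;
    };

    // exponent vector -> coefficient, highest term first
    final SortedMap<int[], BigInteger> terms = new TreeMap<>(GRADED_LEX);

    public static void main(String[] args) {
        MapPolynomialSketch p = new MapPolynomialSketch();
        // 3*x^2*y + 5*y^3 in two variables (x, y)
        p.terms.put(new int[] { 2, 1 }, BigInteger.valueOf(3));
        p.terms.put(new int[] { 0, 3 }, BigInteger.valueOf(5));
        System.out.println(p.terms.size() + " terms, leading exponent has total degree 3");
    }
}
```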
However the merging add/subtract implementation is a factor of 2 slower than the TreeMap implementation.
The complexity for a+b is
2*(length(a)+length(b))
for access and merging of pre-sorted polynomials, and
2*length(a)+length(b)+length(b)*log2(length(a+b))
for TreeMap clone, access and insertion.
The merging multiplication implementation is a factor of 10 slower than the TreeMap implementation.
Polynomial size was ~100 terms and the product contained ~8000 terms.
The complexity for a*b is
lab = length(a)*length(b) coefficient multiplications for both implementations,
plus 2*length(a*b)*length(b) for merging the summands, respectively
plus length(a)*length(b)*log2(length(a*b)) for TreeMap insertion.
Since for sparse polynomials length(a*b) = lab, the TreeMap complexity
is asymptotically better in this case:
2*length(a)*length(b)*length(b) >= length(a)*length(b)*log2(length(a*b)).
For dense polynomials with length(a*b) ~ length(a) [+length(b)],
the LinkedHashMap complexity is asymptotically better:
2*length(a)*length(b) <= length(a)*length(b)*log2(length(a*b)).
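With the measured sizes above (length(a) = length(b) ≈ 100, length(a*b) ≈ 8000), the merging variant needs roughly 2*8000*100 = 1,600,000 merge steps, while the TreeMap variant needs about 100*100*log2(8000) ≈ 130,000 insertion steps, which is consistent with the observed factor of about 10.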
Parallel computations with the Rose example are at 18 h with 4 threads on 2 Pentium 4 @ 2.4 GHz hyperthreading CPUs. With one thread the time is 40 h.
Does this influence the validity of criterion 3? Access to the pairlist is synchronized. Pairs are marked as reduced as soon as they are taken from the list. But the algorithm terminates only after all reductions of pairs have terminated. So criterion 3 holds.
New implementation of the parallel version of GB. The removal of pairs now also happens in parallel, but the ordering of pair insertion is no longer preserved.
Timings are more stable and slightly better than those of the sequential GB.
Todo: unit tests, comments, ...
Improved parallel version of GB.
Todo: CVS, comments, polynomial implementation using LinkedList, parallel GB simplify
With the improved algorithm the running time of the parallel GB becomes more stable and is not slower than that of the sequential GB. However there is still no significant speedup.
# Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 16 |
---|---|---|---|---|---|---|---|---|---|
# Reductions | 25 | 25 | 27 | 25 | 25 | 25 | 25 | 25 | 25 |
# Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 16 |
---|---|---|---|---|---|---|---|---|---|
# Reductions | 22 | 24 | 30, 28, 24, 29 | 28 | 29 | 42 | 32 | 32 | 37 |
parallel results:
Trinks 7: mas 0.598 sec, jas 0.918 sec, jas par 0.955 sec
Trinks 6: mas 26.935 sec, jas 3.211 sec, jas par 3.656 sec
mas: including startup and gc time,
jas: excluding startup of jvm and including gc time,
jas par on single processor
timing on P-3@450
Implemented edu.jas.arith.BigInteger, which implements Coefficient; tested with IntPolynomial, which extends Polynomial.
Todo: alternative implementations, cleanup of RatPolynomial, parallel GB, conversion RatPolynomial <--> IntPolynomial
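A rough sketch of this pattern (interface and method names are assumptions for illustration, not the actual edu.jas.arith API): a coefficient interface implemented by a thin wrapper around java.math.BigInteger, usable by an integer polynomial class.

```java
import java.math.BigInteger;

// Sketch of a coefficient abstraction: the arithmetic needed by polynomials.
interface CoefficientSketch<C> {
    C add(C other);
    C multiply(C other);
    boolean isZero();
}

// Integer coefficients backed by java.math.BigInteger.
class BigIntegerCoefficient implements CoefficientSketch<BigIntegerCoefficient> {
    final BigInteger val;

    BigIntegerCoefficient(long v) { this.val = BigInteger.valueOf(v); }
    BigIntegerCoefficient(BigInteger v) { this.val = v; }

    public BigIntegerCoefficient add(BigIntegerCoefficient other) {
        return new BigIntegerCoefficient(val.add(other.val));
    }

    public BigIntegerCoefficient multiply(BigIntegerCoefficient other) {
        return new BigIntegerCoefficient(val.multiply(other.val));
    }

    public boolean isZero() { return val.signum() == 0; }
}
```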
second results (with new criterion 3 in jas):
Trinks 7: mas 0.598 sec, jas 1.159 sec
Trinks 6: mas 26.935 sec, jas 6.468 sec
mas: including startup and gc time,
jas: excluding startup of jvm and including gc time
third results (with new criterion 3 in jas and GBminimal):
Trinks 7: mas 0.598 sec, jas 0.918 sec
Trinks 6: mas 26.935 sec, jas 3.211 sec
mas: including startup and gc time,
jas: excluding startup of jvm and including gc time
timing on P-3@450
This makes for a factor of 8-9 improvement. All tests were run with output to files; startup of the JVM is approx. 1.0-1.2 sec; most time is spent in BigInteger:
java -Xrunhprof:cpu=times,format=a, CPU TIME (ms) BEGIN (total = 136), Thu Jan 2 18:33:53 2003:

rank | self | accum | count | trace | method |
---|---|---|---|---|---|
1 | 15,44% | 15,44% | 596610 | 21 | java.math.MutableBigInteger.rightShift |
2 | 13,24% | 28,68% | 582132 | 15 | java.math.MutableBigInteger.difference |
3 | 12,50% | 41,18% | 612760 | 19 | java.math.MutableBigInteger.getLowestSetBit |
4 | 9,56% | 50,74% | 2 | 9 | java.lang.Object.wait |
5 | 9,56% | 60,29% | 5271 | 22 | java.math.MutableBigInteger.binaryGCD |
6 | 6,62% | 66,91% | 612760 | 23 | java.math.BigInteger.trailingZeroCnt |
7 | 5,88% | 72,79% | 592152 | 18 | java.math.BigInteger.bitLen |
8 | 5,88% | 78,68% | 6018 | 20 | java.math.MutableBigInteger.binaryGCD |
9 | 5,15% | 83,82% | 578887 | 25 | java.math.MutableBigInteger.normalize |
10 | 4,41% | 88,24% | 550992 | 24 | java.math.MutableBigInteger.primitiveRightShift |
11 | 4,41% | 92,65% | 1 | 10 | java.lang.Object.wait |
12 | 3,68% | 96,32% | 582132 | 12 | java.math.MutableBigInteger.compare |
13 | 0,74% | 97,06% | 35965 | 13 | edu.jas.poly.ExpVector.EVILCP |
14 | 0,74% | 97,79% | 11612 | 14 | java.math.BigInteger.divide |
15 | 0,74% | 98,53% | 5866 | 11 | java.math.MutableBigInteger.divide |
16 | 0,74% | 99,26% | 9032 | 16 | java.math.MutableBigInteger.divide |
17 | 0,74% | 100,00% | 9032 | 17 | java.math.BigInteger.divide |

CPU TIME (ms) END
first results (without criterion 3 in jas):
Trinks 7: mas 0.598 sec, jas 1.373 sec
Trinks 6: mas 26.935 sec, jas 30.935 sec
mas: including startup and gc time,
jas: excluding startup of jvm and including gc time.
timing on P-3@450
Implemented and tested ExpVector based on Java's int arrays.
Implemented and tested RatPolynomial based on Java's TreeMap, with
static methods DIRPDF, DIRPWR (via toString), DIRPON, DIRPMC,
DIRPPR, DIRPSM, DIRRAS.
Consider replacing keySet/get by entrySet/getValue where appropriate.
Can there be a generic Polynomial class?
This will be the end of list processing for Jas.
DIPRNGB needs DIRPDF, DIRPWR, DIRPON, DIRPMC, DIRPPR, DIRPSM.
These need RNDIF, RNDWR, RNINT, RNSIGN, ISRNONE, RNONE, RNZERO,
RNINV (and DIRPRP), RNPROD, RNSUM.
Last modified: Sun Jul 24 13:27:30 CEST 2005
$Id: jas-log.html,v 1.32 2005/03/29 17:20:04 kredel Exp $