weasel.com: Writing high performance software

Multi Processing and Multi Threading
Prime numbers and Memory Usage
Malloc and Memory Usage

Why bother?

Q:Why care about multi-processing and multi-threading?
A:For a long time it has been cheaper to buy two slow processors than one processor that is twice as fast.

It was clear to me around 1990 when the Micro-Vax-III was available with two processors. It was very clear to me, in 1998 when I bought my dual Pentium Pro system. It has been clear ever since. It became clear to Intel, when they realized they simply couldn't keep on increasing clock rates. Their processors eventually dissipated 130 watts, which was causing overheating problems. It isn't very difficult to make a two processor system, and it isn't much more difficult to make a four processor system. Around 2002 Intel introduced hyper-threading, and later on multi-core processors. This wasn't a new idea from Intel, as AMD had multi-core processors before them. Others have sold multi-core processors, though as far as I know AMD was the first with more than one general purpose processor on a single die.

Writing software to take advantage of the extra processors is, in general not as simple. The PC computer review sites on the internet have only lately tested multi processing performance of systems, as their most important applications (computer games) rarely take advantage of more than one processor. Fortunately 'servers' like web and file servers run many processes at once, and really care about multi processing performance. There are many programs such as graphics rendering, mathematical modeling (finite element analysis, circuit design, hydrodynamic flow analysis, etc.) that need lots of CPU resources and are generally able to use multiple processors or threads.

My early multi-processing

I wrote my sim program in 1988. It was quite slow. What it does is compares each line in a file with every other line in the file, using an inexact string comparison measuring 'edit distance'. I have optimized it as much as possible to minimize the number of comparisons needed. Depending on how it was configured, it could take years to finish. I started running it continuously on my Micro-Vax-II. I noticed that it used quite a bit of memory, and even when run at low priority would slow down the interactive use of my computer. I decided to run it on all the VAXes at work. But it would not do to simply start it up everywhere, as people would get pissed off. I wrote a script that would only start it up during times when people weren't around. The times were roughly run between 10pm and 6am. In addition, the script would check that there was enough free memory available. If the amount of free memory was too low, it indicated that there was some kind of activity on the computer, and so I would stop sim on that machine. Additionally, some computers were faster than others, and some computers had multiple processors. All of this had to be taken into account in order to distribute the work so everything would finish up as close to the same time as possible (minimizing the overall execution time). So early on, I was writing distributed software. It wasn't something that I was particularly interested in, it was just that I wanted my results as fast as possible.

My personal computers and performance

Well things change, and after 10 years at that company, I left. The new company didn't have VAXes, nor did it have a decent network. So my distributed software days were put on hold. The good news is that computers were getting cheaper, faster, and had more memory. I bought my first computer, a Gateway Pentium 120. Unfortunately, it didn't have multiple cpu's. I fixed that with my second computer I bought, which had an Intel Providence motherboard and two Intel Pentium Pro 200 mhz cpu's.

Now, I could run two copies of my sim program, and I would get roughly twice the throughput of a single copy. Eventually I bought a faster computer, which was a Gateway 6400 dual Pentium III 933 mhz system. Again, I ran two copies of my sim program and the Pentium III's were sure fast. Finally I bought a Asus PC-DL with two Intel Xeon 2.4 ghz processors (which had hyper-threading).

Since I had 4 virtual processors, and sim was taking up a lot of virtual memory, I decided to make sim multi-threaded. It is really simple to run four copies of sim. You set the parameters (like where to start and stop) and start them up. The problem was that each process was using about 100 mbytes of virtual memory, for a total of about 400 mbytes of virtual memory. If I threaded sim, it would only use about 100 mbytes of memory, which was a big win.

It is generally not trivial to thread a program that isn't designed for multi-threading. Fortunately, I designed sim for multi-processing, and it wasn't a big step to make it multi-threaded. I had to add a mutex (lock) for the output generating function. I had to add a function to generate the next line to process (along with a mutex for it). After some trivial testing, my multi-threaded sim (sim_t) was working perfectly.

But all is not rosy. For a variety of silly reasons, I do most of my development on windows using the cygwin package which provides Unix/Linux/POSIX support. And for some reasons which I fully do not understand, POSIX threads (pthreads) don't work very well on windows/cygwin. With two threads, things are pretty good, but with more than two threads, things slow down pretty quickly.

Multi threading on my computers, sim_t 10k line dataset

Below are execution times for sim_t on a 10k line dataset with a variety of threads on several machines which I have access to running on Linux and windows. All compiling was done with gcc, and execution on Windows XP was done using the cygwin portability tools. The number of threads is on the left, and the rest is the output of the Unix 'time' command.

AMD Athlon 2000+, 512mb ram, running mandrake 2006 Linux.
gcc-4.01, -O3 -march=pentium3 -mtune=pentium3

Threads         Time
1               239.65user 0.04system 4:00.17elapsed 99%CPU
2               239.69user 0.03system 3:59.87elapsed 99%CPU
3               239.55user 0.03system 3:59.65elapsed 99%CPU
4               239.92user 0.03system 4:00.01elapsed 99%CPU
6               239.82user 0.02system 3:59.88elapsed 99%CPU
8               239.58user 0.04system 3:59.64elapsed 99%CPU

Dual 933 mhz Pentium III system, 1.1gb ecc ram, running mandrake 2006 Linux.
gcc-4.01, -O3 -march=pentium3 -mtune=pentium3

Threads         Time
1               465.22user 0.05system 7:45.25elapsed 100%CPU
2               466.57user 0.07system 3:53.98elapsed 199%CPU
3               466.17user 0.09system 3:53.80elapsed 199%CPU
4               466.04user 0.10system 3:54.04elapsed 199%CPU
6               467.77user 0.07system 3:54.57elapsed 199%CPU
8               466.03user 0.12system 3:53.72elapsed 199%CPU
16              466.43user 0.18system 3:53.91elapsed 199%CPU
32              466.17user 0.16system 3:53.78elapsed 199%CPU
64              466.13user 0.18system 3:53.76elapsed 199%CPU
128             466.98user 0.10system 3:54.29elapsed 199%CPU
256             467.21user 0.11system 3:55.40elapsed 198%CPU

Dual Xeon, running at 2.4 ghz, 1gb ecc ram, running mandrake 10.2 Linux
gcc-3.4.3 -O3 -march=pentium4 -mtune=pentium4

Threads         Time
1               430.25user 0.05system 7:10.25elapsed 100%CPU
2               428.44user 0.04system 3:34.39elapsed 199%CPU
3               362.87user 0.02system 2:01.19elapsed 299%CPU
4               337.69user 0.05system 1:24.89elapsed 397%CPU
6               337.95user 0.05system 1:24.77elapsed 398%CPU
8               337.67user 0.06system 1:24.71elapsed 398%CPU
16              338.25user 0.06system 1:24.85elapsed 398%CPU
32              343.27user 0.19system 1:26.18elapsed 398%CPU
64              352.51user 0.14system 1:28.45elapsed 398%CPU
128             344.84user 0.17system 1:26.54elapsed 398%CPU
256             347.48user 0.12system 1:32.18elapsed 377%CPU

Dual Xeon, running at 2.4 ghz, 1gb ecc ram, running Microsoft Windows XP pro, sp2
gcc-3.4.4 -O3 -march=pentium4 -mtune=pentium4

Threads         Time
1               427.85user 0.07system 7:08.04elapsed 99%CPU
2               508.56user 0.12system 4:15.01elapsed 199%CPU
3               552.87user 0.09system 3:05.25elapsed 298%CPU
4               658.95user 0.14system 2:45.78elapsed 397%CPU
6               660.81user 0.07system 2:46.25elapsed 397%CPU
8               662.15user 0.09system 2:46.60elapsed 397%CPU
16              666.53user 0.12system 2:47.76elapsed 397%CPU

Sun T-2000 (16 thread), running Solaris 10, 8gb ecc ram
gcc 3.4.3 -O3

Threads         Time
1               real    21:18.3 user    21:18.2 sys         0.1
2               real    10:40.6 user    21:18.3 sys         0.1
3               real     7:08.0 user    21:18.4 sys         0.1
4               real     5:22.5 user    21:21.7 sys         0.1
6               real     4:15.9 user    25:21.2 sys         0.1
8               real     3:32.2 user    27:58.3 sys         0.1
16              real     3:15.2 user    51:19.6 sys         0.6
32              real     3:15.2 user    51:19.8 sys         1.3
64              real     3:15.3 user    51:19.8 sys         1.7

8 way Opteron server (2.8 ghz) running Linux x86/64,
unknown ram, unknown compiler

Threads         Time
1               132 seconds
8                16.62 seconds

Sun Sparc V890 8cpu, 16 core running Solaris,
unknown ram, unknown compiler

Threads         Time
1               372 seconds
16               23.42 seconds

Multi threading on my computers, sim_t 50k line dataset

Below are execution times for sim_t on 50k line dataset with a variety of threads on several machines which I have access to running on Linux and windows. This version has a much faster subroutine computing edit distance, so the numbers can't be compared with the above numbers. 50k lines will take roughly 25 times longer to run than 10k lines. All compiling was done with gcc, and execution on Windows XP was done using the cygwin portability tools. The number of threads is on the left, and the rest is the output of the Unix 'time' command.

Dual 933 mhz Pentium III system, 1.1gb ecc ram, running mandrake 2006 Linux.
gcc-4.01, -O3 -march=pentium3 -mtune=pentium3

Threads         Time
1               2016.48user 0.14system 33:36.54elapsed 100%CPU
2               2045.02user 0.52system 17:05.37elapsed 199%CPU
3               2047.85user 0.53system 17:06.77elapsed 199%CPU
4               2044.14user 0.76system 17:04.98elapsed 199%CPU
6               2044.15user 0.55system 17:05.27elapsed 199%CPU
8               2043.19user 0.74system 17:04.57elapsed 199%CPU

Dual Xeon prestonia, running at 2.4 ghz, 1gb ecc ram, running mandrake 10.2 Linux
gcc-3.4.3 -O3 -march=pentium4 -mtune=pentium4

Threads         Time
1               1087.54user 2.52system 18:09.94elapsed 100%CPU
2               1086.83user 4.66system 9:07.04elapsed 199%CPU
3               1068.36user 4.27system 5:58.51elapsed 299%CPU
4               1081.34user 4.15system 4:32.54elapsed 398%CPU
6               1062.23user 3.97system 4:27.74elapsed 398%CPU
8               1062.39user 3.99system 4:27.75elapsed 398%CPU
16              1074.38user 4.06system 4:30.72elapsed 398%CPU

Dual Xeon prestonia, running at 2.4 ghz, 1gb ecc ram, running Microsoft Windows XP pro, sp2
gcc-3.4.4 -O3 -march=pentium4 -mtune=pentium4

Threads         Time
1 (old edit distance code)
                3295.60user 1.41system 55:08.80elapsed

(new edit distance code)
1               1094.46user 0.59system 18:22.97elapsed 99%CPU
2               1106.14user 0.42system 9:18.41elapsed 198%CPU
3               1498.45user 0.59system 8:25.50elapsed 296%CPU
4               1850.98user 0.45system 7:51.58elapsed 392%CPU
6               1855.53user 0.85system 7:53.06elapsed 392%CPU
8               1876.17user 0.60system 7:57.41elapsed 393%CPU
16              1855.54user 0.87system 8:00.55elapsed 386%CPU

Sun T-2000 (32 thread), running Solaris 10, 8+gb ecc ram
gcc 3.4.3 -O3(not sure)

Threads         Time
1               20399 seconds (without the -ipunct, shouldn't matter)
8               real 8m37.107s user 67m33.427s sys 0m0.453s
16              real 5m30.350s user 85m7.349s sys 0m0.598s
32              real 4m52.742s user 149m49.604s sys 0m2.139s
64              real 4m52.859s user 149m49.473s sys 0m5.477s
512             real 4m53.197s user 149m49.838s sys 0m10.289s

Multi Threaded Performance

First it is clear that windows/cygwin doesn't scale well using POSIX threads. For 2 threads, scaling is 84% which is fine. For 3 threads 77%, for 4 or more threads 65%. Some have suggested that the Intel hyper-threading isn't as good as 'true' multiprocessing, which is true, but doesn't account for this dismal performance. Some have suggested that the windows scheduler isn't hyper-threading aware, which may also be true, but still doesn't explain this performance. I don't know why windows/cygwin sucks so badly at POSIX threading.

Now, lets take a look at Linux performance using the same hardware. Scaling is 100% for 2 threads. For 3 or more threads, the program runs even faster. I don't know why. This is in sharp contrast to Windows XP on the same hardware slowing down when more threads are added. Note that though there are 4 'processors' (dual Xeon with hyper-threading) 4, 6, and 8 thread performance is virtually identical.

Now, lets look at Sun's Niagara highly threaded CPU. The Sun T-2000 has a cpu processor chip that has either 4, 6, or 8 cores. Each core has 4 hardware threads. Therefore, there are 16, 24, or 32 hardware supported threads. Sun was kind enough to loan me a 16 thread machine. I compiled my program and ran the program on the sun. I used the supplied version of gcc, which was compiled for posix threads (which is good as my program uses posix threads). With one thread, the program took about 21 minutes 18 seconds to run, which is roughly 2.75 times slower than a pentium 933. Sun says the cpu is clocked at 1.0 ghz, so clearly their MIPS/MHZ isn't nearly as high as a pentium III. With 2 threads scaling is 100%, 3 or 4 threads 98%, with 6 threads scaling is 82%, with 8 threads scaling is 74%, with 16 threads scaling is 40%. Clearly something is wrong with the way the Sun threading scales. The Sun engineer I am in contact with told me my code had too much floating point instructions. (Later he said he may have been mistaken). I removed all known fp code, which has helped scaling about 1 or 2 percent to the above numbers. There is only one floating point processor for the whole chip, which stalls all the threads when it is running. I have modified my code to minimize fp operations. I can't imagine it using more than 2 or 3% fp code. I am running the tests now.

The Sun engineer got back to me. He told me that on his machine, one thread uses 780 million instructions on a single pipeline, which is over 50% efficiency (that is efficiency using their hardware) which peaks at 1200 million instructions per second. He also mentioned that most commercial apps are only around 30%. With two threads on each of his 8 cores, he got each thread using about 600 million instructions, which is 'very close to 100% efficiency'. So adding more threads won't increase performance. I suppose it is good that I can use 100% of their chip, but it seems to happen for me at 8 threads on my test machine. I think that the threading on the Sun is pretty similar to the Intel hyperthreading, in that they are both designed to hide memory latency and to make up for code that doesn't keep the processor busy. I think this is good for typical code, but I would hope people would tune their code to make it run as fast as possible. If they do, they get much less benefit from the 4 threads, as my app has demonstrated.

Their elapsed time of 195 seconds with 16 threads doesn't impress compared to 84 seconds for a dual xeon 2.4 ghz. Since a dual xeon 2.4 ghz machine is quite a bit cheaper than the sun t-2000, it is hard to justify for my application.

Some have suggested that my code might not thread very well, and that is causing the thread slowdown on the T-2000. I decided to run the code on my pentium III and xeon with lots of threads to see if there are indeed thread problems. I ran the test up to 256 threads successfully. There is a gradual slowdown of about 8% on the xeon when going from 4 to 256 threads. This isn't surprising, because there is some minor thread contention for writing output. Also it is possible that the linux kernel is introducing some minor thread switching overhead. Nobody would realistically use more threads than processors, let alone so many threads with so few processors, and on such a small dataset. Still, 8% slowdown when each virtual processor has 64 threads on it is clearly trivial. For the dual pentium III, the time for 2 threads is 3:53.98, and for 256 threads the time is 3:44.40, a slowdown of .42 seconds. This is even more trivial a slowdown than the xeon, and clearly insignificant.

Now lets look at some more machines the Sun engineer tested. It is a 8 socket 8 Opteron server running at 2.8 ghz. With one thread, it takes 132 seconds. This is 3.2 times faster than one thread on the xeon with a clock speed increase of only 17%. Clearly the Opteron gets more done per cycle than the xeon. With 8+ threads, it takes 16.62 seconds, which scales by 99+%. Yousa!

Sun Sparc V890 system with 16 cores. 1 thread takes 372 seconds, which is about 50% slower than the AMD 2000+, and a bit faster than the Pentium III 933 and Xeon 2.4 ghz single thread performance. At 16 threads, it takes 23.42 seconds, which is about 4 times faster than the dual xeon, and a bit slower than the 8 core Opteron system. The scaling factor is 99% which is really impressive.

Single Threaded Performance

Now, lets take a look at the 933 mhz Pentium 3 system. This system is a few years old, and only has two processors. Still, it scales perfectly. In addition, the runtime is only about 8% slower than the 2.4 ghz Xeon, despite the Xeon having a clock which is 2.5 times higher for the 10k line dataset test. For the 50k line test, runtime is roughly 83% slower than the 2.4 ghz Xeon. Why is this? I suspect that for the 10k line dataset test the Pentium 3 system is using the cache effectively. For the 50k test I suspect the cache isn't doing much good. The memory system of the Pentium 3 is much slower than the Xeon, and this is causing the slowdown. When running from the cache the Pentium 3 is almost as fast as the Xeon because the Xeon's (which is really a multiprocessor capable Pentium 4) pipeline is long, and it doesn't produce a high instruction throughput per clock. Now, if you run two threads on a single 2.4 ghz Xeon, the performance is 2.2 times higher than the 933 mhz Pentium, which puts the 2.4 ghz Xeon back in the running. So Intel clearly needed hyper-threading and faster memory speed to stay competitive. Of course if your app isn't threaded, a 1.0 ghz Pentium III will likely be very competitive for small datasets. This is likely a reason why Intel has given up on their 'netburst' architecture and switched to an architecture which is loosely based on the Pentium 3.

Looking at single processor performance, and using the Pentium III 933 as a baseline, it is interesting to note that the Xeon 2.4 is 8% faster, while the AMD 2000+ is 94% faster with the 10k line input. This is a clear indication that the Pentium 4 doesn't perform well compared to the AMD Athlon. With 50k line input things change. The Xeon 2.4 is now 84% faster than the pentium III 933. I suspect this is due the superior memory system of the Xeon when dealing with larger datasets.

No substitute for superior algorithm

When I wrote my sim program, I used the fasted algorithm I knew about. It turns out that in 1985 Esko Ukkonen of Finland came up with a significant improvement to the edit distance algorithm I was using. He showed if you restricted the search to distances of 'k' or less, the code would take O(k*m) to run, rather than O(n*m) (where n and m are the length of the strings) to run. This is quite a big timesaver. Thanks to the wonders of the internet, I found many references to his paper, including some who detailed the algorithm sufficiently for me to implement. Here are the new results for 10k line dataset.

Dual Xeon prestonia, running at 2.4 ghz, 1gb ecc ram, running mandrake 10.2 Linux
gcc-3.4.3 -O3 -march=pentium4 -mtune=pentium4

Threads         Time
1               75.70user 0.02system 1:15.73elapsed 99%CPU
2               77.73user 0.01system 0:39.06elapsed 199%CPU
3               73.55user 0.03system 0:24.80elapsed 296%CPU
4               70.89user 0.01system 0:18.49elapsed 383%CPU
6               70.34user 0.06system 0:18.52elapsed 380%CPU
8               70.41user 0.02system 0:18.38elapsed 383%CPU
16              70.89user 0.05system 0:19.28elapsed 367%CPU

So with the improved algorithm, my dual xeon 2.4 ghz system is only 16% slower than the 8 processor Opteron 2.8 ghz system running the old, far slower code.

Multi threading on my computers, sim_t full input, 390k lines

Summary:

(old algorithm)
Dual Xeon 2.4 ghz prestonia,          Win XP,   2 threads  96,844 sec

(new algorithm)
Dual pentium III 933 mhz,             Linux,    2 threads  45,798 sec
Dual Xeon 2.4 ghz prestonia,          Linux,    4 threads  11,494 sec
Dual Xeon 2.88 ghz prestonia,         Linux,    4 threads  10,273 sec
Xeon 2.88 ghz prestonia,              Linux,    4 threads  10,369 sec
Sun T-2000,                           Solaris,  8 threads  20,387 sec
Sun T-2000,                           Solaris, 32 threads  10,600 sec
Intel Core 2 Duo T7400 2.16 ghz,      Win XP,   1 thread   63,022 sec
Intel Core 2 Duo T7400 2.16 ghz,      Win XP,   2 threads  31,763 sec
AMD Turion ML-34 1.8 ghz,             Win XP,   1 thread   71,812 sec
AMD Turion ML-34 1.8 ghz,             Linux 64, 1 thread   65,8077 sec
AMD Turion ML-34 1.8 ghz 64           Linux 64, 1 thread   66,184 sec

(newer algorithm, @5% faster)
AMD Turion 1.8 ghz,                   Win XP,        1 thread   39,632 sec
AMD Turion 1.8 ghz,                   Man 2007.1,    1 thread   48,730 sec
AMD Turion 1.8 ghz,                   Man 2007.1 64, 1 thread   35,455 sec
AMD Turion 1.8 ghz,                   Man 2008.1,    1 thread   51,118 sec
AMD Turion 1.8 ghz,                   Man 2008.1 64, 1 thread   32,574 sec

(thanks to johwah)
AMD X2 3800+,     2.0 ghz,           RH Linux 64, 2 threads, 13,263 sec
AMD X2 4400+,     2.2 ghz,           RH Linux 64, 2 threads, 12,131 sec
dual Opteron 250, 2.4 ghz,           RH Linux 64, 2 threads, 11,495 sec
AMD X2 5200+,     2.6 ghz,           RH Linux 64, 2 threads, 11,163 sec
AMD X2 5600+,     2.8 ghz,           RH Linux 32, 2 threads, 21,168 sec
Opteron 2346      2.3 ghz            RH Linux 64, 4 threads,  5,990 sec

Intel E6600,      2.4 ghz,           RH Linux 32, 2 threads, 18,378 sec
Intel Xeon 5160,  3.0 ghz,           RH Linux 64, 2 threads, 10,803 sec
Intel Q6600,      2.4 ghz,           RH Linux 64, 4 threads,  6,805 sec
dual  Xeon 5160,  3.0 ghz,           RH Linux 64, 4 threads,  4,912 sec
Intel i7-920,     2.66 ghz, fedora 11, 64, 8 threads, gcc 4.4.1 4,828 sec
AMD PhenomII 940  3.0 ghz,  Man 2009.0 64, 4 threads, gcc 4.3.2, 4,086 sec

Newer analysis

(newest algorithm, 6/28/2008)

Single Core CPU's	Clock	OS	Bits	Threads	Seconds	Compiler
Pentium III	733 mhz	Mageia 5	32	1	90,536	gcc 4.8.2
Pentium III-M mobile	1.2 ghz	Mageia 6	32	1	47,754	gcc 5.5.0
Celeron M 370 D mobile	1.6 ghz	Fedora 29	32	1	33,805	gcc 8.2.1
Pentium M Banias mobile	1.6 ghz	Fedora 28	32	1	32,657	gcc 8.3.1
AMD Turion ML-34 mobile	1.8 ghz	Mageia 7	64	1	24,826	gcc 8.3.1

Pi Processors	Clock	OS	Bits	Threads	Seconds	Compiler
Raspberry Pi 2B	900 mhz	Fedora 25	32	4	33,134	gcc 6.3.1
Raspberry Pi 3B	1.20 ghz	OpenSuse	64	4	27,053	gcc 6.2.1
Raspberry Pi 3B	1.20 ghz	Raspberry OS	32	4	26,008	gcc 8.3.01
Raspberry Pi 4B	1.50 ghz	Gentoo	64	4	14,587	gcc 9.2.0
Raspberry Pi 4B	1.50 ghz	Kali	64	4	13,992	gcc 9.3.0
Raspberry Pi 4B	1.50 ghz	Rasp. Buster	32	4	6,808	gcc 8.3.0
Raspberry Pi 4B	1.50 ghz	linux 6.1.0	64	4	5,742	gcc 12.2.0
Raspberry Pi 5B	2.40 ghz	linux 6.5.0	64	4	2,455	gcc 13.2.0

Multi Core CPU's	Clock	OS	Bits	Threads	Seconds	Compiler
Dual Xeon Prestonia	2.60 ghz	Man 2005LE	32	4	10,486	gcc 3.4.3
Dual Xeon Prestonia	2.60 ghz	Man 2007.0	32	4	12,330
Dual Xeon Prestonia	2.60 ghz	Man 2008.1	32	4	10,437	gcc 3.4.3
Dual Xeon Prestonia	2.60 ghz	Man 2008.1	32	4	13,836	gcc 4.2.3
Intel N3350 mobile	1.1 ghz	Mageia 7	64	2	11,219	gcc 8.4.0
Celeron J1800	2.41-2.58 ghz	Fedora 31	64	2	11,202	gcc 9.2.1
Dual Xeon Nocona	2.80 ghz	Man 2008.1	64	4	10,971	gcc 4.2.3
Dual Xeon Nocona	2.80 ghz	Man 2008.1	64	4	10,954	gcc 3.4.3
i3-4050U mobile	1.7 ghz	Mageia 7	64	4	7,257	gcc 8.4.0
Pentium 3825U mobile	1.9 ghz	Mageia 7	64	4	5,799	gcc 8.4.0
Intel i3-2100	3.10 ghz	Mageia 6	64	4	4,183	gcc 8.3.1
AMD FX-4100	3.6 ghz	Mageia 6	64	4	4,159	gcc 8.4.0
Celeron J5005	1.5-2.8 ghz	Cygwin	64	4	4,192	gcc 9.3.0
Celeron J4105	1.5-2.5 ghz	Cygwin	64	4	3,905	gcc 9.3.0
Intel i7-920	2.66 ghz	Fedora 11	64	4	3,397	gcc 4.4.1
AMD PhenomII 940	3.0 ghz	Man 2009.0	64	4	3,362	gcc 4.3.2
AMD PhenomII 940	3.0 ghz	Mageia 7	64	4	3,544	gcc 8.4.0
Xeon e3-1220v3	3.10 ghz	Mageia 8	64	4	2,676	gcc 10.4.0
AMD FX-6300	3.5 ghz	Mageia 8	64	6	2,639	gcc 10.4.0
Xeon L5640	2.27 ghz	Mageia 8	64	6*	2,560	gcc 10.4.0
Dual Xeon L5420	2.5 ghz	Mageia 3	64	8	2,249	gcc 4.7.2
Xeon e3-1270v3	3.50 ghz	Mageia 8	64	4*	2,247	gcc 10.4.0
Xeon L5640	2.27 ghz	Mageia 8	64	12	2,207	gcc 10.4.0
AMD FX-8320E	3.2 ghz	Mageia 6	64	8	2,152	gcc 5.4.0
i3-1125G4 mobile	2.0+ ghz	Mageia 8	64	8	2,122	gcc 10.3.0
i7-6820HQ mobile	2.7 ghz	Cygwin	64	8	2,064	gcc 9.3.0
Dual Xeon L5520	2.27 ghz	Mageia 3	64	16	1,994	gcc 4.7.2
Xeon e3-1270v3	3.50 ghz	Mageia 8	64	8	1,774	gcc 10.4.0
i7-1265U mobile	??? ghz	Cygwin	64	12	1,439.22	gcc 11.4.0
Dual Xeon L5640	2.27 ghz	Mageia 5	64	24	1,082	gcc 4.9.2
Xeon w2135	3.7 ghz	Debian kernel 5.15	64	12	820.5	gcc 10.2.1
Intel i9-12900	2.4-5.1 ghz	Cygwin	64	24	446.22	gcc 11.4.0
AMD Ryzen 3900x	3.8 ghz	Cygwin	64	24	432.7	gcc 9.3.0
AMD EPYC 7302	3.0 ghz	Cygwin	64	32	348.83	gcc 11.4.0
AMD EPYC 7B13	2.25+ ghz	Mageia 9	64	128	100.10	gcc 12.3

I wanted to see what hyperthreading does for the L5640 xeon. This is an older processor with 6 cores. With 6 threads it takes 2560 seconds, but with 12 threads it takes 2207 seconds. This is an improvement of 16%, which is pretty minor. On the other hand, adding hyperthreading is pretty minor in terms of silicon area. For the newer e3-1270 xeon with 4 threads it takes 2247 seconds, but with 8 threads it takes 1774 seconds. This is an improvement of 21%.

I had my Supermicro X8DTL motherboard which first has two L5520 4 core processors and then I ungraded them to two L5640 6 core processors for about 5 years. I decided it was time to upgrade. I bought a Supermicro H11SSL motherboard with an AMD EPYC 7302 processor. I thought it would be about twice as fast as the L5640s. I was wrong. It is 2.8 times as fast!. The EPYC has a 3.0ghz clock, but it also has 128mb of cache. It is a monster, and I am very happy that I got it. It also uses about the same amount of power as the dual Xeon setup, so I am saving power. AMD will crush Intel.

Well, for my personal computers the results show that Mandriva Linux 2008.1 does not perform well. After much thrashing around, I have found the problem is the gcc-4.2.3. When I compile with gcc-3.4.3, performance is fine. I will try different compiler switches with gcc-4.2.3 to try to regain the lost performance. Now that I have a 64 bit xeon, I was able to run 32 and 64 bit tests. The Nocona doesn't run very quickly in 32 bit mode, but is significantly faster in 64 bit mode. I have no idea why. There is nothing in the code that would run faster with 64 bits. Given the high clock rate of the Nocona, I expected even better results. After I figure out how to make gcc-4.2.3 run quickly with my prestonia xeon, I will try to apply those results to my nocona xeon.

Others were kind enough to run my latest code, and got some interesting results. Using newer processors, the code seems almost twice as fast using a 64 bit compiler. Also with 4 threads the Q6600 is pretty fast and the dual Xeon 5160 is very fast.

I think these new results highlight the importance of testing code on specific hardware. I assumed that the Nocona would perform as well as the Prestonia. After all, the architectures were pretty similar, and the Nocona has more cache and a higher clock rate. Also, I had no idea that the O.S. could slow a program by 50% (Mandriva 2008.1 vs Mandriva 2007.0)

Older analysis

So the dual xeon takes 11,494 seconds with 4 threads. The dual pentium III 933 takes 45,798 seconds with 2 threads. The T-2000 takes 20,387 seconds with 8 threads and 10,600 seconds with 32 threads. The T-2000 is 8% faster. However, this is a 32 hardware thread machine, which retails for roughly $14,595.00. So the price/performance ratio is heavily biased towards the xeon processor. I am sure that AMD processors would have an even higher price/performance ratio.

Lets see how the newer CPU's did. The Turion took 65,807 seconds (71,812 seconds on windows XP) with one thread. The Core 2 Duo took 63,022 seconds with one thread. With two threads the Core 2 Duo took 31,763 seconds. Both my Dual Xeon and the Core 2 Duo have two physical processors, yet with hyperthreading of my Dual Xeon, it is almost three times faster than the Core 2 Duo (with the Xeon having a clock rate 11% higher). I had planned on buying a dual quad core xeon system, but it looks like it would be less than twice as fast as my current Xeon, based on the Core 2 Duo times. The Turion was almost as fast as a single core of the Core 2 Duo. Other benchmarks I have read show the Core 2 Duo being monstrously faster than previous chips. Perhaps those benchmarks are taking advantage of the MMX type processing, which I suspect gcc does not. It is interesting that using linux and a newer version of gcc resulted in about a 9% speedup on the Turion. This is why it is very important to benchmark your application, rather than relying on others benchmarks to determine which hardware to buy.

As for the old algorithm, it took 96844 seconds with two threads on the Xeon. Assuming good scaling, it would take 48422 seconds with four threads. So the new algorithm is over 4 times faster than the old one. I have seen algorithms that are faster still, and I am working on incorporating them into my program.

Prime numbers and Memory Usage

I always was fascinated by prime numbers. Of course it is easy to write a program to find all prime numbers up to a given number. Here it is:

for (int i = 2; i < max_number; i++) {
    may_be_prime = true;
    j = 2;
    while (may_be_prime == true && j < square_root(i))
        if i/j is integral {
            may_be_prime = false;
        }
        j++;
    }
    if (may_be_prime == true) {
        output "i is prime"
    }
}

This will work fine, and will crank out lots of prime numbers. There are a number of ways to improve it. First you only have to compute the square_root(i) once per i. Second, you only have to try values of 'j' that are prime. That is, it is pointless to try a j value of 4.

This leads to the Sieve of Eratosthenes (275-195 BC), who was a very smart fellow. The idea is to make an array of numbers starting at 1 or 2. First cross out every number divisible by 2 (above 2). So cross out 4, 6, 8, 10, 12, etc. Next find the lowest number above 2 that isn't crossed out (which would be 3). Cross out every number divisible by 3 (above 3). So cross out 6, 9, 12, 15, etc. This will give you all prime numbers below the square of the number you last crossed out.

It is easy to write a program to do this, and I did so when I was in college. As one would expect, it ran quite fast, until I wanted to get lots and lots of prime numbers. Then it ran really slowly. This shows the difference between an algorithm and an implementation of the algorithm.

It is clear you need an array to cross out values. A little bit of thought makes you realize that you only care if the number is crossed out or not, therefore the array can be an array of boolean values. So to get all primes below 1,000,000, you will need an array of 1,000,000 booleans. Since I was writing the code in Pascal, I used an array of boolean. Now a compiler will almost always allocate the machines native word size for each element of such an array. This is because it is quick to access such an array.

The reason that the program started to run slowly was because the computer was running out of memory. Since this was a VAX/VMS system which had virtual memory, the program would continue, but would thrash. Thrashing is when memory accesses would cause lots of page faulting. When I ran out of memory that was physically allocated, it un-allocate the first page physically allocated, and allocate a new page. Since I was making linear passes through the array, there would be quite a bit of memory un-allocating and allocating. Since the new memory comes from the page file (which is on the hard drive) many memory accesses required reading/writing the hard drive, hence the term thrashing. What is worse, since this was a time-sharing machine, the computer was running out of memory for everyone, slowing down all users of the machine.

Clearly a solution had to be found. The first and easiest solution was to change the array declaration. When I declared the array to be a packed array of boolean, the compiler allocated only a single bit of memory for each boolean value. This was great, as it improved the number of primes I could generate with a given amount of memory by a factor of 32. The next solution was to simply not store numbers divisible by 2. None of them (other than 2) are prime, and so there is no need to deal with them. This sped up the execution, and what is better, cut the memory usage by a factor of 2. The downside was it was a bit harder to generate indexes into the array to mark elements. It isn't very difficult however. The next step is not to store numbers divisible by 3. This saves a lesser amount of memory, and makes generating indexes even harder, though not so tough. This allowed me to find all prime numbers less than 100,000,000 on a machine with very little memory.

Fast forward to today. Rather than having a VAX 11-780, I have a dual processor 2.4 ghz Xeon. Rather than a few mega-bytes of memory, I have a giga-byte of memory. I recoded the program in C, as Pascal compilers are not very common anymore (to the detriment of software engineering). The simple prime number generator will generate all the primes below 10 million in 48 seconds. Using a simple bit array it will generate all primes below 10 million is 0.405 seconds, and all primes below 4,000,000,000 in 270 seconds. Using a bit array that doesn't store even numbers, I can generate all primes below 4,000,000,000 in 128 seconds.

Malloc and Memory Usage

I became interested in fortune-cookie style quotes. I gathered up quite a bunch of them. I ended up with some that were similar to others. I wanted to be able to get rid of the similar ones, so I wrote a program called sim in 1988.

For sim, I read the fortune cookie file into memory, allocating memory for each line. If I modified the line, I made a copy of it, leaving the original alone so I could write it out if needed. I allocated memory for each modified line.

Well, it turns out that malloc has overhead. Malloc allocates extra bytes for its own uses. When I started running sim on a Microsoft Windows system, I realized that all malloc sizes were not equal. After a bit of experimenting and a bit of reading, I realized that Microsoft used a power of two allocator. That means that memory allocated size was always a power of 2, so if you allocate 80 bytes, Microsoft will allocate 128 bytes of memory.

Since memory was very expensive, I modified how I used malloc. I wrote my own heap manager. I would allocate 512 kbytes, and then hand out pieces of that memory when my heap manager was called. This saved me quite a bit of memory (which was the same as quite a bit of money). The only downside was I didn't have a way to free the memory when all pieces of this memory were freed. This was not ideal, but it was quite workable. I have since stopped using my heap managed malloc, because memory is cheap. If I had problems that required lots of small pieces of memory, I might start to use it again.

If you have comments or suggestions, Email me at turbo-www@weasel.com

Created with gnu emacs and template-toolkit, not some sissy HTML editor.

No Java or javascript needed to view my web pages. They both have significant security issues.

Multi Processing and Multi Threading

Why bother?

My early multi-processing

My personal computers and performance

Multi threading on my computers, sim_t 10k line dataset

Multi threading on my computers, sim_t 50k line dataset

Multi Threaded Performance

Single Threaded Performance

No substitute for superior algorithm

Multi threading on my computers, sim_t full input, 390k lines

Newer analysis Updated Sep-01-2023

Older analysis

Prime numbers and Memory Usage

Malloc and Memory Usage

Why bother?

My early multi-processing

My personal computers and performance

Multi threading on my computers, sim_t 10k line dataset

Multi threading on my computers, sim_t 50k line dataset

Multi Threaded Performance

Single Threaded Performance

No substitute for superior algorithm

Multi threading on my computers, sim_t full input, 390k lines

Newer analysis

Older analysis

Prime numbers and Memory Usage

Malloc and Memory Usage

Home