Building Python 2.7 on 64bit Linux, with Intel compilers

Python 2.7.3+ (2.7:15fd0b4496e0+, Sep 30 2012, 16:31:04)
[GCC Intel(R) C++ gcc 4.6 mode] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

For this, my first documented build, I will be using Intel’s Composer XE compiler suite to compile the latest development build of Python 2. This same Python build continues to be used with numpy, scipy, matplotlib, zope, and loads of other Python extensions.

preamble

I have developed with the Python programming language since I began programming. For those of you who aren’t familiar with it, it’s a clean-looking language that uses indentation to organise blocks of code. It was one of the first mainstream languages to use “white space” for structuring code and this, I find, makes it very easy to read and write.

why python?

When development time is limited, and matters more than a short run-time, Python is a great language of choice. It can do so much, whilst maintaining a clean and clear syntax. It also has great documentation, both online and in the interactive console, and there are loads of tutorials and examples out there. I started learning Python – and programming in general – by going through The Basics, on Alan Gauld’s web site. I was also at university, getting taught the language by some very clever geeks. There are definitely more recent tutorials out there, which have evolved with the language, but as with a lot of things, “The only source of knowledge is experience”. That’s one of Einstein’s, that. If you’re not sold – I’m no salesman, believe me – a great article was published back in 2000, on linuxjournal. It’s still relevant today, and makes a good case for a language like Python, in the face of languages that have been around for longer.

Python 2 is now considered a very stable and feature-complete language. The core development team are now focusing on Python 3, so there will be no more major changes to the latest Python 2 version (2.7.3). Probably the only changes will be minor bugfixes, or perhaps compatibility with a new hardware platform. If it isn’t broke, why fix it? Python 3, though, is being very actively developed.

why Intel’s compilers?

On a side note – blogs are meant for expressing opinion, yea? – I couldn’t really recommend Intel’s compilers over the GNU GCC compiler suite, purely because of the cost. Intel’s compilers are not cheap! They are independently reported to give some performance gains on Intel processors, especially for numerically intensive programs, which is what originally enticed me into buying them. I did manage to get them massively reduced from Polyhedron, whilst being a student, which won’t be for much longer. My opinion on them? I figure they’d be great for C, C++ and Fortran developers, who test and optimise their code whilst writing it. When compiling open source code, though, downloaded from github or wherever, the code has usually only been tested with the GNU Compiler Collection, doesn’t work out of the box (tar archive) without first tinkering with the compiler options, and it’s much harder to find any help online. On Linux systems, distributors do a cracking job of compiling open source software, and it’s often very hard to get any performance improvement over the pre-compiled versions from your default package manager. You can probably compile a newer version, but within three months’ time a system update will usually get you a newer, more stable version.

The Set Up

Hardware

A modern Intel processor. I’ve seen reports that the performance of Intel’s libraries on Celeron processors is particularly poor, as the optimised libraries weren’t tuned for that architecture. AMD chips are much less likely to see any benefit from using Intel’s compiler suite. Everything else you need, you should already have.

Software

I am currently running the x86_64 build of Debian Linux, with the KDE desktop and Ubuntu / Canonical software repositories. I’ve upgraded this system for the last few years, since about Linux 2.5, and am now on 3.2.0-32. For the following steps I used the programs icc, icpc, xild and xiar, from Intel’s Composer XE 2013 compiler suite. I first did this with Composer XE 2011 about 8 months ago, and there haven’t been any noticeable compiler option flag changes since. The only two libraries I explicitly linked against were libirc.so and libimf.so, which are installed by default into /opt/intel/composerxe/lib/intel64.

to the terminal…

I keep my source code in /usr/local/src. But let’s assume you want to keep yours in your personal Download directory. Open up the terminal.

Make the directory if you need to, and cd into it.

> mkdir -p Download/src 
> cd Download/src

I like to be on the cutting edge, which also helps if I’m to give anything back, so I download the Python source code with mercurial.

> hg clone http://hg.python.org/cpython -r 2.7 
> cd cpython

It’s probably a better idea to just download a stable version from python.org.

These commands could be used instead:-

> wget 'http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz'
> tar xzf Python-2.7.3.tgz
> cd Python-2.7.3

configure

Now it’s time to configure the build, specifying various options to use the Intel compiler and libraries, and optimise the binaries as much as possible. For more information on the options passed to configure (those preceded by "--"), enter ./configure --help.

The paths to the Intel directories are the defaults, so should be the same for most people. If you installed the compiler suite elsewhere, change the CPPFLAGS, LDFLAGS and --with-cxx-main options accordingly.

We want to enable a shared build, and some other features that aren’t enabled by default, but should work on my system in case I ever need them. The --enable-unicode=ucs4 option switches to a wider 4-byte internal representation for unicode characters, which is needed to cover code points beyond the Basic Multilingual Plane, including many of the characters used in Far Eastern languages. I’ve also had problems with essential extension modules in the past, as they only worked with the ucs4 unicode format. Anyway, here we go:-

> ./configure --with-libc=-lirc \
         --enable-shared --enable-unicode=ucs4 \
         --without-gcc --with-libm=-limf \
         --with-cxx-main=/opt/intel/composerxe/bin/icpc \
         --with-threads --enable-ipv6 --with-signal-module \
         CC=`which icc` CXX=`which icpc` \
         LD=`which xild` AR=`which xiar` \
         LIBS="-lpthread -limf -lirc" \
         CFLAGS="-fp-model strict -fp-model source -O3 -xHost -ipo -prec-div -prec-sqrt" \
         LDFLAGS="-L/opt/intel/composerxe/lib/intel64 -L/usr/lib/x86_64-linux-gnu -ipo" \
         CPPFLAGS="-I/opt/intel/composerxe/include -I/opt/intel/include/intel64/lp64 -I/usr/include/x86_64-linux-gnu" \
         CPP="/opt/intel/composerxe/bin/icc -E"

The CFLAGS are all quite specific to Intel’s `icc` compiler, and are designed to turn on all speed optimisations whilst keeping as much mathematical accuracy as possible. If you’re really keen for performance, you could add further CFLAGS here to turn on “profile guided optimisation”. I’ll leave that for another article, but basically it’s a two-step compile process: an intermediate build is run first, so that profile information is generated and saved to some directory; the second build then reads this profile information to find the parts of the code that run most often, and these are heavily optimised in the final binaries.
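As a rough sketch of what that two-step process looks like with icc – the `-prof-gen`, `-prof-use` and `-prof-dir` flags drive it, but check `man icc` for the details before relying on this, and fill in the other configure options from above:

```shell
# Pass 1: an instrumented build, which writes profile data into ./profdata
./configure CC=`which icc` \
        CFLAGS="-O3 -xHost -prof-gen -prof-dir=$PWD/profdata"   # ...plus the other options above
make

# Run a representative workload so the profile data gets populated
./python -m test.regrtest

# Pass 2: rebuild from clean, letting the compiler read the profile back in
make clean
./configure CC=`which icc` \
        CFLAGS="-O3 -xHost -prof-use -prof-dir=$PWD/profdata"   # ...plus the other options above
make
```

The workload you run between the two passes matters: the compiler only optimises the paths your workload actually exercised.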

If you want to go ahead and figure out the options for yourself, checkout icc’s man page, with the command:-

> man icc

(Search for an option by pressing forward slash (/), type your search expression and press return. Or just scroll with the arrow keys)
At this point, if you get an error message saying
“No manual entry for icc”
then your command line environment is not set up correctly. Specifically, your MANPATH environment variable isn’t right. See the official Intel documentation for detailed instructions on setting up PATH, LD_LIBRARY_PATH, and other environment variables so that they load automatically into your command line environment. If these weren’t set, you’ll need to do so before re-running the above `configure` command.

make

I have 12 visible CPU cores on my machine, so let’s use some of them.

> make -j10

That allows up to 10 separate compile commands to run in parallel whilst building Python. It still takes a couple of minutes…
At the end, I get the following message:-

Python build finished, but the necessary bits to build these modules were not found:
bsddb185           dl           imageop
sunaudiodev
To find the necessary bits, look in setup.py in detect_modules() for the module's name.

I don’t mind so much about those missing modules, so I’ll carry on to testing.

test

Unfortunately, the tests only run one after the other, and can’t currently take advantage of multiple processor cores. Still, it’s worth testing the build before deploying it.

> make test

This takes a long time – a good 10 to 20 minutes – and actually runs all the tests twice; once on the source code, and once on the compiled bytecode (the .pyc files).

Test results are in!

350 tests OK.
2 tests failed:
    test_cmath test_gdb
2 tests altered the execution environment:
    test_distutils test_subprocess
36 tests skipped:
    test_aepack test_al test_applesingle test_bsddb185 test_bsddb3
    test_cd test_cl test_codecmaps_cn test_codecmaps_hk
    test_codecmaps_jp test_codecmaps_kr test_codecmaps_tw test_curses
    test_dl test_gl test_imageop test_imgfile test_kqueue
    test_linuxaudiodev test_macos test_macostools test_msilib
    test_ossaudiodev test_scriptpackages test_smtpnet
    test_socketserver test_startfile test_sunaudiodev test_timeout
    test_tk test_ttk_guionly test_urllib2net test_urllibnet
    test_winreg test_winsound test_zipfile64
Those skips are all expected on linux2.
make: *** [test] Error 1

Two errors. Not bad! But a cmath error isn’t good. I’ll want to use that. Scroll up, and I see:-

test_cmath
test test_cmath failed -- Traceback (most recent call last):
  File "/usr/local/src/pysrc/cpython/Lib/test/test_cmath.py", line 352, in test_specific_values
    msg=error_message)
  File "/usr/local/src/pysrc/cpython/Lib/test/test_cmath.py", line 94, in rAssertAlmostEqual
    'got {!r}'.format(a, b))
AssertionError: acos0000: acos(complex(0.0, 0.0))
Expected: complex(1.5707963267948966, -0.0)
Received: complex(1.5707963267948966, 0.0)
Received value insufficiently close to expected value.

Not so bad after all. Minus nought should be equal to plus nought, for all I care. The test_gdb failure isn’t great either, but gdb still seems to work for me when debugging Python extension builds.
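For the curious, that’s IEEE 754 signed zero at work. Python treats -0.0 and 0.0 as equal everywhere that matters, and you need something like math.copysign to tell them apart at all – which is why this failure is mostly cosmetic:

```python
import math

# -0.0 and +0.0 compare equal, so ordinary code never notices the difference
print(-0.0 == 0.0)                # True

# copysign exposes the sign bit, which is what the cmath test was checking
print(math.copysign(1.0, -0.0))   # -1.0
print(math.copysign(1.0, 0.0))    # 1.0
```

The test suite is stricter than everyday code: it insists the imaginary part of acos(0+0j) carries a negative sign bit, and the icc-built libm evidently drops it.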

install

> sudo make install
Enter password: <enter your password>

done

This probably isn’t anything out of the ordinary, for anyone familiar with using GNU’s autotools.
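As a quick sanity check that the ucs4 option took effect – assuming the freshly installed interpreter is first on your PATH – sys.maxunicode reports the largest code point the build can hold: 1114111 for a ucs4 (“wide”) build, versus 65535 for a narrow one.

```python
import sys

# A ucs4 ("wide") build can represent the full range of unicode code points
print(sys.maxunicode)   # 1114111 on a wide build; 65535 on a narrow one
```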

Benchmark

How could I end an article without a good, honest benchmark? After all that effort, you’d wanna know that it was worth it… I wanna know that it was worth it!

The official benchmark suite is available through mercurial…

> hg clone http://hg.python.org/benchmarks/
> cd benchmarks

… Read the README. It says to run:- …

python perf.py -r -b default /control/python /test/python

In this case, I’ll test the system, GCC-built Python (at /usr/bin/python) against my custom, Intel-built Python at /usr/local/bin/python.

python perf.py -r -b default /usr/bin/python /usr/local/bin/python
### 2to3 ###
Min: 6.396400 -> 7.684480: 1.20x slower
Avg: 6.494806 -> 7.780486: 1.20x slower
Significant (t=-30.36)
Stddev: 0.06763 -> 0.06628: 1.0204x smaller
Timeline: http://tinyurl.com/9245ok5

### django ###
Min: 0.436923 -> 0.523592: 1.20x slower
Avg: 0.440950 -> 0.529066: 1.20x slower
Significant (t=-188.61)
Stddev: 0.00262 -> 0.00387: 1.4753x larger
Timeline: http://tinyurl.com/8pjkk54

### slowpickle ###
Min: 0.303753 -> 0.385738: 1.27x slower
Avg: 0.305808 -> 0.387953: 1.27x slower
Significant (t=-463.81)
Stddev: 0.00117 -> 0.00133: 1.1333x larger
Timeline: http://tinyurl.com/9zz9k7k

### slowspitfire ###
Min: 0.306511 -> 0.348436: 1.14x slower
Avg: 0.310810 -> 0.358866: 1.15x slower
Significant (t=-61.68)
Stddev: 0.00481 -> 0.00613: 1.2725x larger
Timeline: http://tinyurl.com/8rt3nxb

### slowunpickle ###
Min: 0.141021 -> 0.184594: 1.31x slower
Avg: 0.142123 -> 0.185807: 1.31x slower
Significant (t=-320.99)
Stddev: 0.00110 -> 0.00080: 1.3791x smaller
Timeline: http://tinyurl.com/8kut8eo

### spambayes ###
Min: 0.145338 -> 0.169606: 1.17x slower
Avg: 0.146368 -> 0.170417: 1.16x slower
Significant (t=-189.87)
Stddev: 0.00085 -> 0.00094: 1.1102x larger
Timeline: http://tinyurl.com/8d4jdeu

The following not significant results are hidden, use -v to show them:
nbody.

Well, that’s not nearly as good as I was hoping… Consistently ~15 to 30% slower! Well, I’ll give it another go another day. I have outperformed the default build with icc once before, but only by profiling the code first. I’m sure there are also ways to profile a build with GCC, and so make a fair test, but I haven’t done that before, and it wouldn’t be fair to compare a profiled build against the system build, which I don’t think was profiled…
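For reference, GCC’s equivalent of icc’s profiling flags is -fprofile-generate / -fprofile-use, so a fair GCC-vs-icc rematch would look roughly like this – untested by me, so treat it as a sketch:

```shell
# Pass 1: the instrumented build writes .gcda profile files as the workload runs
./configure CC=gcc CFLAGS="-O3 -fprofile-generate"
make && ./python -m test.regrtest

# Pass 2: rebuild from clean, using the recorded profiles
make clean
./configure CC=gcc CFLAGS="-O3 -fprofile-use"
make
```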
If you’ve any other suggestions as to better compile options, I’m all ears!
Well that was long. I’ll try and keep my blogs more concise in future (save me some time too!).
