Search found 11 matches

by hjmangalam
Thu Apr 10, 2014 10:31 am
Forum: Framework
Topic: sporadic crashing on Centos/6.4/64b
Replies: 1
Views: 2800

sporadic crashing on Centos/6.4/64b

I am a sysadmin, not an OpenSees user, trying to help a user discover why her OpenSees jobs are failing on our cluster.

She's submitting 105 OpenSees (version 2.4.1_r5363, also 2.4.3_r5695) jobs to our gridengine scheduler (sending jobs to several 64b CentOS6.4 hosts, each with 512GB RAM).

Her qsub script is here: <http://pastie.org/9062325>


A few to several of the 105 jobs failed very early in the run:

Here is the start of the output:
==========================================================
Start time is: Wed Apr 9 12:32:30 PDT 2014
hostname is: compute-7-9.local


OpenSees -- Open System For Earthquake Engineering Simulation
Pacific Earthquake Engineering Research Center -- 2.4.3 (rev 5634)

(c) Copyright 1999-2013 The Regents of the University of California
All Rights Reserved
(Copyright and Disclaimer @ http://www.berkeley.edu/OpenSees/copyright.html)


Five Longitudinal Stiffness from acute (max) corner to obtuse (min) corner:
1328.3368505926462
1328.3368505926462
1328.3368505926462
1328.3368505926462
1328.3368505926462
Finished creating gravity load superstructure...
CTestNormDispIncr::test() - iteration: 1 current Norm: 5.12118 (max: 1e-08, Norm deltaR: 44586.4)
CTestNormDispIncr::test() - iteration: 2 current Norm: 0.0981256 (max: 1e-08, Norm deltaR: 989.217)
CTestNormDispIncr::test() - iteration: 3 current Norm: 0.00235124 (max: 1e-08, Norm deltaR: 1.78051)
CTestNormDispIncr::test() - iteration: 4 current Norm: 6.60416e-06 (max: 1e-08, Norm deltaR: 5.25307e-05)
CTestNormDispIncr::test() - iteration: 5 current Norm: 1.03438e-09 (max: 1e-08, Norm deltaR: 4.44241e-05)

Ground Motion: dt= 0.005000, NumPts= 10064, TmaxAnalysis= 50.32
*** glibc detected *** OpenSees: free(): invalid next size (normal): 0x00000000025c6620 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x76166)[0x2b224f456166]
/lib64/libc.so.6(+0x78c93)[0x2b224f458c93]
OpenSees(sp_coletree+0x1a1)[0xf802c1]
==========================================================

(for the complete dump, see <http://pastie.org/9061984>)

All the jobs fail due to that kind of error:
*** glibc detected *** OpenSees: free(): invalid next size (normal): 0x00000000033d6000 ***

which implies that the heap is corrupted: free() is being handed a block whose neighbouring size field is garbage.

the input file (B22_28_0_45.tcl) contains this:
===========================================
$ cat B22_28_0_45.tcl
wipe
set GMskew 0
set iM 28
set GMinter 0
set skew 45
source BF1U_Analyzer_22.tcl
============================================
and the file referenced above (BF1U_Analyzer_22.tcl) can be found here:
<http://pastie.org/9062108>


The runs do not fail on a single host - the failures are spread among the 3 hosts that the jobs are running on.
# fails: hosts
7 : compute-7-2.local
23 : compute-7-3.local
1 : compute-7-9.local

i.e. 7 jobs failed on compute-7-2, etc. Also, the number of aborted runs per submission changes. On a second run, only 7 jobs aborted, and of those only 3 were in common with the first run.


So the failure is not exactly reproducible (the same inputs do not always cause a failure), but it is reproducible across machines and across runs (we always get a few runs that fail).
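
A tally like the one above can be produced mechanically. The following is only a sketch: it assumes one output file per job, named *.out (a hypothetical naming scheme), each containing the "hostname is:" line and, on failure, the "glibc detected" line shown earlier. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: tally crashed jobs per host, and compare failures between two runs.
# Assumption: one output file per job, named *.out (hypothetical naming).

# Print "<count> : <host>" for every host that had at least one crashed job.
count_failures_by_host() {
    dir=$1
    # grep -l lists only the files that contain the crash marker
    grep -l 'glibc detected' "$dir"/*.out 2>/dev/null |
    while IFS= read -r f; do
        # the host name is recorded near the top of each job's output
        sed -n 's/^hostname is: *//p' "$f" | head -n 1
    done | sort | uniq -c | awk '{ print $1, ":", $2 }'
}

# Print the sorted base names of the crashed job files in one directory.
failed_jobs() {
    grep -l 'glibc detected' "$1"/*.out 2>/dev/null | sed 's|.*/||' | sort
}

# Print the job files that crashed in BOTH run directories.
common_failures() {
    a=$(mktemp)
    b=$(mktemp)
    failed_jobs "$1" > "$a"
    failed_jobs "$2" > "$b"
    comm -12 "$a" "$b"      # keep only lines present in both sorted lists
    rm -f "$a" "$b"
}
```

Usage would be something like `count_failures_by_host run1` and `common_failures run1 run2`, where run1 and run2 are the (hypothetical) directories holding the output of two submissions.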

Before I try to debug further, would it be possible for the OpenSees devs to run the code thru valgrind or other memory debugger to try to find this explosive free()?

I can tar the entire dir up if it would be helpful for you to see all examples of the failures and successes.

I'll try to catch a crash inside of valgrind as well.
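
For reference, a memcheck run of a single input could look roughly like the sketch below. The wrapper function, the OPENSEES_BIN variable and the DRY_RUN switch are hypothetical names introduced here; the valgrind options themselves (--tool, --track-origins, --num-callers, --log-file with %p for the process ID) are standard ones.

```shell
#!/bin/sh
# Sketch: run one OpenSees input under valgrind's memcheck tool.
# OPENSEES_BIN and DRY_RUN are hypothetical names used only in this example.
OPENSEES_BIN=${OPENSEES_BIN:-OpenSees}

run_memcheck() {
    input=$1
    # Build the command line; %p in --log-file expands to the process ID,
    # so concurrent jobs write separate logs.
    set -- valgrind --tool=memcheck --track-origins=yes --num-callers=30 \
        --log-file=valgrind.%p.log "$OPENSEES_BIN" "$input"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"   # only show what would be executed
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 run_memcheck B22_28_0_45.tcl` prints the command without running it. Since the crash is sporadic, the same input may need to be run several times before memcheck reports anything.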

Thanks
Harry

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
by hjmangalam
Wed Jul 03, 2013 8:50 am
Forum: Feature Requests/Future Directions
Topic: How about a std configure / make system?
Replies: 1
Views: 8795

How about a std configure / make system?

After patching the Makefile.def a few times, and remembering that it took several days to (fail to) compile this previously, I'm going to politely, plaintively suggest that perhaps OpenSees should switch to a more standard environment flag-respecting build system, either cmake, or the GNU build chain. Many of the build files seem to have been written in the mid 00s and haven't been updated.

I really respect the contribution of open source developers and especially this system, but I can't rationalize spending a week building this thing.

The final straw was this:

In file included from commands.cpp:245:0:
/data/apps/sources/OpenSees/SRC/system_of_eqn/linearSOE/sparseGEN/SuperLU.h:48:23: fatal error: slu_ddefs.h: No such file or directory
compilation terminated.

OK - in the OTHER subdir, there are no fewer than 6 SuperLU subdirs:
AMD BLAS CSPARSE LAPACK SuperLU SuperLU_3.0 SuperLU_DIST_2.0 SuperLU_MT UMFPACK
ARPACK CBLAS ITPACK METIS SuperLU_4.1 SuperLU_DIST_2.5 Triangle tetgen1.4.3

the slu_ddefs.h file that's not being found is in:

$ find . -name slu_ddefs.h
./SuperLU_4.1/SRC/slu_ddefs.h

The variable that seems to select among these OTHER dirs is set in Makefile.def (line 84):
SUPERLUdir = $(HOME)/OpenSees/OTHER/SuperLU

but when that is changed or symlinked to SuperLU_4.1 it still fails.

If slu_ddefs.h is copied to SuperLU/SRC, it still fails.
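
One way to see which directory the compiler would need on its include path is to turn the find output into -I flags and compare them with the -I flags that actually appear on the failing compile line. This is only a sketch; the function name is hypothetical, and it does not explain why changing SUPERLUdir had no effect.

```shell
#!/bin/sh
# Sketch: print a -I flag for every directory under a tree that holds a header.
include_flags_for() {
    root=$1
    header=$2
    find "$root" -name "$header" -type f | sort |
    while IFS= read -r path; do
        # the directory containing the header is what -I must point at
        printf '%s\n' "-I$(dirname "$path")"
    done
}
```

Run from the OTHER dir as `include_flags_for . slu_ddefs.h`, this should print the SuperLU_4.1/SRC directory as a -I flag. If that flag is absent from the g++ command that compiles commands.cpp, the Makefile is not passing SUPERLUdir through to that compile step.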

For a system that seems to have ambitions of very widespread use, and has designed a pretty glitzy web site in support of it, it is among the most difficult packages to build that I've seen.

Now, I could be stupid, I could be incompetent, I could be illiterate, but I've managed to install hundreds of ill-behaved apps for Linux and while I'm sure that I could slog thru it, there are other more pressing demands on my time, so with great regret, I am telling the engineering group that wants OpenSees that they will not be getting it installed on our cluster unless there is a more reliable install process made available.
by hjmangalam
Wed Jul 03, 2013 8:24 am
Forum: Feature Requests/Future Directions
Topic: remove -fforce-mem from Makefile.def
Replies: 0
Views: 7653

remove -fforce-mem from Makefile.def

-fforce-mem has been obsolete in gcc for several years. It is either ignored or causes a build failure.

Makefile.def:178: -fforce-addr -fforce-mem -finline-functions \
                               ^^^^^^^^^^^
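
Until the flag is removed upstream, it can be stripped locally. A minimal sketch (the function name is hypothetical):

```shell
#!/bin/sh
# Sketch: print a makefile with the obsolete -fforce-mem flag removed.
strip_force_mem() {
    # Remove the flag together with the blanks in front of it, so the
    # neighbouring flags and any trailing line-continuation backslash survive.
    sed 's/[[:space:]]*-fforce-mem//g' "$1"
}
```

For example, `strip_force_mem Makefile.def > Makefile.def.new` writes a cleaned copy that can be diffed against the original before replacing it.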
by hjmangalam
Wed Jul 03, 2013 8:06 am
Forum: Documentation
Topic: How to dowload source for building on a Linux cluster
Replies: 3
Views: 5119

Re: How to dowload source for building on a Linux cluster

Well, after searching the site for 30m, I haven't been able to find it. Could you please document this? This is central to the process of getting the source code. You provide a Web interface to the source, but no svn repo link.

The usual way to download source is by tarball. If you don't want to supply a version tarball, fine, but then PLEASE supply the svn link, preferably the whole command so we sysadmins don't have to spend an hour working thru all the docs, BB postings, google searching, etc.

Ah, OK, found it (not in the download area, but in the developer area):
http://opensees.berkeley.edu/OpenSees/developer/svn.php

the command is:

svn co svn://opensees.berkeley.edu/usr/local/svn/OpenSees/trunk OpenSees

Sheesh. Would it be so hard to put that line on the download page?
by hjmangalam
Wed Jun 08, 2011 9:13 am
Forum: Parallel Processing
Topic: OpenSeesSP Build error
Replies: 26
Views: 27744

Re: OpenSeesSP Build error

Hi there,

This part:
===
/home/numubuntu/OpenSees/SRC/tcl/tkMain.o: In function `Tk_MainOpenSees(int, char**, int (*)(Tcl_Interp*), Tcl_Interp*)':
tkMain.cpp:(.text+0x468): undefined reference to `Tk_MainLoop'
tkMain.cpp:(.text+0x4f1): undefined reference to `TkpDisplayWarning'
tkMain.cpp:(.text+0x578): undefined reference to `TkpDisplayWarning'
/home/numubuntu/OpenSees/SRC/tcl/tkAppInit.o: In function `Tcl_AppInit':
tkAppInit.cpp:(.text+0x12): undefined reference to `Tk_Init'
tkAppInit.cpp:(.text+0x1c): undefined reference to `Tk_SafeInit'
tkAppInit.cpp:(.text+0x21): undefined reference to `Tk_Init'
===
is probably due to a missing / misreferenced 'tk.h'. The makefile is chock full of these problems. You may have to explicitly edit it into the dir-specific makefile or set your own environment variables (i.e. in Makefile.def, set "gcc = gcc -I/path/to/tk.h" or some similarly crude horror).

The following one:
===
/home/numubuntu/OpenSees/SRC/tcl/commands.o: In function `getNP(void*, Tcl_Interp*, int, char const**)':
commands.cpp:(.text+0xc03): undefined reference to `theMachineBroker'
/home/numubuntu/OpenSees/SRC/tcl/commands.o: In function `getPID(void*, Tcl_Interp*, int, char const**)':
commands.cpp:(.text+0xc43): undefined reference to `theMachineBroker'
/home/numubuntu/OpenSees/SRC/tcl/commands.o: In function `record(void*, Tcl_Interp*, int, char const**)':
===
is due to a misreferenced include file in the indicated dir: There are several refs to 'MachineBroker.h' which have not been set correctly, either due to overwriting CPPFLAGS or ignoring it or something - I wasn't willing to debug this. cd to /home/numubuntu/OpenSees/SRC/tcl and correct the following entries to point to the right path:

(The line numbers don't match the original src because I added the corrected lines to my copy, and the originals are commented out.) You can probably ignore the MPI_-prefixed includes unless you're trying to build the parallel version.
===
commands.cpp:345:// #include <MachineBroker.h>
commands.cpp:416:// #include <MachineBroker.h>
commands.cpp:428://#include <MachineBroker.h>
mpiMain.cpp:153:#include <MPI_MachineBroker.h>
mpiMainTest.cpp:34:#include <MPI_MachineBroker.h>
mpiParameterMain.cpp:154:#include <MPI_MachineBroker.h>
tclMain.cpp:87:#include <MachineBroker.h>
===
for example, the following lines are my corrected lines
===
commands.cpp:346:#include "/home/hmangala/build/OpenSees/SRC/actor/machineBroker/MachineBroker.h"
commands.cpp:417:#include "/home/hmangala/build/OpenSees/SRC/actor/machineBroker/MachineBroker.h"
commands.cpp:429:#include "/home/hmangala/build/OpenSees/SRC/actor/machineBroker/MachineBroker.h"
tclMain.cpp:88:#include "/home/hmangala/build/OpenSees/SRC/actor/machineBroker/MachineBroker.h"
===
This has to be done again and again and again ....
by hjmangalam
Wed Jun 08, 2011 7:03 am
Forum: Parallel Processing
Topic: OpenSeesSP Build error
Replies: 26
Views: 27744

Re: OpenSeesSP Build error

My apologies if you took this as a criticism of you - it certainly wasn't - it was just a puff of steam from an overheated sysadmin. This is not my area of expertise or interest - I'm just trying to help a user get the program compiled, but it's looking like that's not going to happen. I've compiled thousands of src packages over the course of 25 years of geekhood and this one wins as being the most troublesome.

Again, thanks for the help you've given.

hjm
by hjmangalam
Tue Jun 07, 2011 2:22 pm
Forum: Parallel Processing
Topic: OpenSeesSP Build error
Replies: 26
Views: 27744

Re: OpenSeesSP Build error

Thank you, hsafti,

I see you used tcl/tk 8.5 - I was using 8.4 (but have access to 8.5.5). I'll try that. I went in and edited some include lines to point them to the include files (in the distribution!) and got further, but am now failing with what looks like an internal compiler error(!)
=====
commands.cpp:6712: warning: deprecated conversion from string constant to ‘char*’
commands.cpp: In function ‘int sectionForce(void*, Tcl_Interp*, int, const char**)’:
commands.cpp:6661: error: unable to find a register to spill in class ‘AREG’
commands.cpp:6661: error: this is the insn:
(insn 151 154 147 12 commands.cpp:6631 (parallel [
(set (reg:DI 2 cx [140])
(const_int 0 [0x0]))
(set (reg:DI 3 bx [138])
(plus:DI (ashift:DI (reg:DI 2 cx [140])
(const_int 3 [0x3]))
(reg:DI 3 bx [138])))
(set (mem/s/c:BLK (reg:DI 3 bx [138]) [0 a+8 S72 A64])
(const_int 0 [0x0]))
(use (reg:DI 6 bp [139]))
(use (reg:DI 2 cx [140]))
]) 858 {*rep_stosdi_rex64} (expr_list:REG_DEAD (reg:DI 6 bp [139])
(expr_list:REG_UNUSED (reg:DI 2 cx [140])
(expr_list:REG_UNUSED (reg:DI 3 bx [138])
(nil)))))
commands.cpp:6661: confused by earlier errors, bailing out
=====

Is there a recommended or required gcc/g++ version? I'm using 4.4 (CentOS 5.5)

I do have some Ubuntu machines on which I can try to compile, but this is getting ridiculous. autoconf may be ugly, but it works.
by hjmangalam
Tue Jun 07, 2011 11:01 am
Forum: Parallel Processing
Topic: OpenSeesSP Build error
Replies: 26
Views: 27744

Re: OpenSeesSP Build error

Any hope of getting that Makefile.def?

I've been getting lots of replies regarding my posts but all of them are spam.

(hopeful) Thanks in advance
Harry
by hjmangalam
Fri Jun 03, 2011 2:58 pm
Forum: Parallel Processing
Topic: OpenSeesSP Build error
Replies: 26
Views: 27744

Re: OpenSeesSP Build error

Giving up on the parallel build due to build dead ends, but the serial version has hit the same wall.

The build still starts failing with our good friend 'theMachineBroker':
...
make[2]: Leaving directory `/home/hmangala/build/OpenSees/SRC/tcl'
/home/hmangala/build/OpenSees/SRC/tcl/commands.o: In function `getNP(void*, Tcl_Interp*, int, char const**)':
commands.cpp:(.text+0xc33): undefined reference to `theMachineBroker'
/home/hmangala/build/OpenSees/SRC/tcl/commands.o: In function `getPID(void*, Tcl_Interp*, int, char const**)':
commands.cpp:(.text+0xc73): undefined reference to `theMachineBroker'
/home/hmangala/build/OpenSees/SRC/tcl/commands.o: In function `record(void*, Tcl_Interp*, int, char const**)':
commands.cpp:(.text+0x14a5): undefined reference to `theDomain'
/home/hmangala/build/OpenSees/SRC/tcl/commands.o: In function `domainChange(void*, Tcl_Interp*, int, char const**)':
commands.cpp:(.text+0x14c5): undefined reference to `theDomain'
/home/hmangala/build/OpenSees/SRC/tcl/commands.o: In function `OpenSeesExit(void*, Tcl_Interp*, int, char const**)':
...

even tho in the Makefile.def:
PROGRAMMING_MODE = SEQUENTIAL
and in the [OpenSees/SRC/actor] dir
'make' completes successfully.

There are clearly missing directives that should point to the includes and object files that live in that dir tree.
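
To see at a glance which symbols are unresolved, the make output can be condensed into a count per symbol. This is a sketch with a hypothetical function name; it reads the captured make output, either as a file argument or on stdin. Since "undefined reference" is reported at link time, the resulting list shows which definitions (object files or libraries) are still missing from the link line.

```shell
#!/bin/sh
# Sketch: summarise "undefined reference" messages as "<count> <symbol>".
undefined_symbols() {
    # Pull out the name between the backtick and the closing quote,
    # then count how often each symbol is reported.
    sed -n "s/.*undefined reference to \`\([^']*\)'.*/\1/p" "$@" |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

For example, `make 2>&1 | undefined_symbols` would list theMachineBroker and theDomain with the number of places that reference each.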
by hjmangalam
Thu Jun 02, 2011 2:15 pm
Forum: Parallel Processing
Topic: OpenSeesSP Build error
Replies: 26
Views: 27744

Re: OpenSeesSP Build error

OK, some more info. Due to all the hardcoded paths and vars, I accidentally commented out some lines that should not have been commented out, to wit, the first 3 lines that start with $(FE) below:

PETSC_LIB = /apps/petsc/3.1-p8/lib/libpetsc.a \
$(FE)/system_of_eqn/linearSOE/petsc/PetscSOE.o \
$(FE)/system_of_eqn/linearSOE/petsc/PetscSolver.o \
$(FE)/system_of_eqn/linearSOE/petsc/PetscSparseSeqSolver.o \
$(HOME)/OpenSees/OTHER/LAPACK/dgebak.o \
$(HOME)/OpenSees/OTHER/LAPACK/dgebal.o \
$(HOME)/OpenSees/OTHER/LAPACK/dgeev.o \

However, the code in this dir:
$(FE)/system_of_eqn/linearSOE/petsc
does not appear to be compiling, hence the missing obj files that the Makefile requires:


make[2]: Nothing to be done for `tk'.
make[2]: Leaving directory `/home/hmangala/build/OpenSees/SRC/tcl'
g++: /home/hmangala/build/OpenSees/SRC/system_of_eqn/linearSOE/petsc/PetscSOE.o: No such file or directory
g++: /home/hmangala/build/OpenSees/SRC/system_of_eqn/linearSOE/petsc/PetscSolver.o: No such file or directory
g++: /home/hmangala/build/OpenSees/SRC/system_of_eqn/linearSOE/petsc/PetscSparseSeqSolver.o: No such file or directory
g++: /usr/lib/libtcl8.4.a: No such file or directory
g++: /usr/lib/libtk8.4.a: No such file or directory
make[1]: *** [tk] Error 1

in the Makefile.def, we have at least 2 vars: PETSCINC and PETSC_INC which seem to point to the same thing. As well as including a lot of hard-coded author's paths, what do these things actually intend to point to?

And if they're intended to point to the Petsc include dir, why do I still get:

...
/home/hmangala/build/OpenSees/SRC -I/home/hmangala/build/OpenSees/OTHER/SuperLU_3.0/SRC -I/home/hmangala/build/OpenSees/SRC/package -I/home/hmangala/build/OpenSees/SRC/../OTHER/AMD -c PetscSOE.cpp -o PetscSOE.o
In file included from PetscSOE.cpp:32:
PetscSOE.h:44:22: error: petscksp.h: No such file or directory
PetscSOE.cpp:34:22: error: petscvec.h: No such file or directory
PetscSOE.h:96: error: ‘Mat’ does not name a type
PetscSOE.h:97: error: ‘Vec’ does not name a type
PetscSOE.h:99: error: ‘PetscTruth’ does not name a type
PetscSolver.h:51: error: expected `)' before ‘method’
PetscSolver.h:52: error: expected `)' before ‘method’

...

Is one of those vars supposed to point to a particular FILE instead of the dir?
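
A quick sanity check on whatever directory PETSCINC / PETSC_INC ends up pointing to is to test for the headers that the compiler reports as missing. This is a sketch with a hypothetical function name:

```shell
#!/bin/sh
# Sketch: report which of the named headers are absent from an include dir.
missing_headers() {
    incdir=$1
    shift
    for h in "$@"; do
        # print only the headers that are NOT found in the directory
        [ -f "$incdir/$h" ] || printf '%s\n' "$h"
    done
}
```

For example, `missing_headers /apps/petsc/3.1-p8/include petscksp.h petscvec.h` prints nothing if both headers are there. The include path in that example is an assumption derived from the libpetsc.a location above, not something confirmed by the Makefile.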
by hjmangalam
Thu Jun 02, 2011 1:38 pm
Forum: Parallel Processing
Topic: OpenSeesSP Build error
Replies: 26
Views: 27744

Re: OpenSeesSP Build error

Same problem here. On both CentOS 5.5 and Ubuntu 10.04.

The arch is x86_64 for a Linux cluster, using openmpi 1.4.3
I built the PETSc new on CentOS, used the repo version for Ubuntu.
I tried tcl/tk 8.5.5 and also 8.4

I could send you my Makefile.def but I'd rather not clutter this venue. It's available here:
<http://moo.nac.uci.edu/~hjm/Makefile.def>
(The starting file was the MAKES/Makefile.def.LINUX_CLUSTER)
If you think I would have better luck with another one, please let me know.

the following is an extract of 'out', generated by:
make &> out
[code]
...
LIBRARIES BUILT ... NOW LINKING OpenSees PROGRAM
make[1]: Entering directory `/home/hmangala/build/OpenSees/SRC/tcl'
make[1]: Nothing to be done for `tcl'.
make[1]: Leaving directory `/home/hmangala/build/OpenSees/SRC/tcl'
make[1]: Entering directory `/home/hmangala/build/OpenSees/SRC/modelbuilder/tcl'
Makefile:30: warning: overriding commands for target `tcl'
Makefile:13: warning: ignoring old commands for target `tcl'
Makefile:41: warning: overriding commands for target `tk'
Makefile:21: warning: ignoring old commands for target `tk'
make[2]: Entering directory `/home/hmangala/build/OpenSees/SRC/tcl'
make[2]: Nothing to be done for `tk'.
make[2]: Leaving directory `/home/hmangala/build/OpenSees/SRC/tcl'
/home/hmangala/build/OpenSees/SRC/tcl/commands.o: In function `getNP(void*, Tcl_Interp*, int, char const**)':
commands.cpp:(.text+0xc33): undefined reference to `theMachineBroker'
/home/hmangala/build/OpenSees/SRC/tcl/commands.o: In function `getPID(void*, Tcl_Interp*, int, char const**)':
commands.cpp:(.text+0xc73): undefined reference to `theMachineBroker'
...
[/code]
So what is the missing link/package? Obviously, others have had this problem on similar machines.
OpenSees is the only project that Google returns when searching for 'themachinebroker', which implies this is not part of another package but specific to OpenSees.

The full content of the output is here:
<http://moo.nac.uci.edu/~hjm/OpenSees.makefile.out>

There are also a number of 'Unknown target' errors similar to:
[code]
make[3]: Entering directory `/home/hmangala/build/OpenSees/SRC/material/uniaxial'
Unknown target HardeningMaterial2.o, try: make help
Unknown target KinematicHardening.o, try: make help
Unknown target TclKinematicHardening.o, try: make help
Unknown target TclNewUnixialMaterial.o, try: make help
Unknown target PenaltyMaterial.o, try: make help
Unknown target WrappedMaterial.o, try: make help
Unknown target SecantMaterial.o, try: make help
Unknown target ConfinedConcrete02.o, try: make help
make[4]: Entering directory `/home/hmangala/build/OpenSees/SRC/material/uniaxial/fedeas'
Unknown target Steel2.o, try: make help
Unknown target Concr2.o, try: make help
make[4]: Leaving directory `/home/hmangala/build/OpenSees/SRC/material/uniaxial/fedeas'
[/code]

These don't seem to be quite as serious, but they are odd.