10. Debug
10.1. Common problems
10.1.1. DIVERGED_LINEAR_SOLVE
A typical traceback looks like this:
File "/home/yzz/firedrake/src/firedrake/firedrake/adjoint/solving.py", line 50, in wrapper
output = solve(*args, **kwargs)
File "/home/yzz/firedrake/src/firedrake/firedrake/solving.py", line 129, in solve
_solve_varproblem(*args, **kwargs)
File "/home/yzz/firedrake/src/firedrake/firedrake/solving.py", line 161, in _solve_varproblem
solver.solve()
File "/home/yzz/firedrake/src/firedrake/firedrake/adjoint/variational_solver.py", line 75, in wrapper
out = solve(self, **kwargs)
File "/home/yzz/firedrake/src/firedrake/firedrake/variational_solver.py", line 278, in solve
solving_utils.check_snes_convergence(self.snes)
File "/home/yzz/firedrake/src/firedrake/firedrake/solving_utils.py", line 139, in check_snes_convergence
raise ConvergenceError(r"""Nonlinear solve failed to converge after %d nonlinear iterations.
firedrake.exceptions.ConvergenceError: Nonlinear solve failed to converge after 0 nonlinear iterations.
Reason:
DIVERGED_LINEAR_SOLVE
Possible reasons for this:
Your equation is not closed, e.g. the boundary conditions are wrong or missing; check them carefully.
An error occurred in an external package.
The resulting linear system is singular (possibly).
…
We can add the flag -ksp_error_if_not_converged
to make PETSc print more information about the error. Below are examples of the error DIVERGED_LINEAR_SOLVE.
Example 1 (MUMPS error)
Reference:
MUMPS documentation: https://graal.ens-lyon.fr/MUMPS/index.php?page=doc
MATSOLVERMUMPS: https://petsc.org/main/manualpages/Mat/MATSOLVERMUMPS/
Below is an example of an error raised inside the package MUMPS; the meaning of the error message can be looked up in the MUMPS documentation.
[63]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[63]PETSC ERROR: Error in external library
[63]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: INFOG(1)=-9, INFO(2)=26
[63]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[63]PETSC ERROR: Petsc Development GIT revision: v3.4.2-38777-g979cc68729 GIT Date: 2022-06-22 21:18:23 +0100
[63]PETSC ERROR: ../hsolver/hsolver.py on a default named AMAs4 by z2yang Tue Nov 8 16:03:18 2022
[63]PETSC ERROR: Configure options PETSC_DIR=/home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc PETSC_ARCH=default --download-ptscotch --with-zlib --download-hwloc --with-c2html=0 --download-eigen="/home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/eigen-3.3.3.tgz " --download-mpich --download-hdf5 --with-fortran-bindings=0 --with-64-bit-indices --download-bison --with-cxx-dialect=C++11 --download-metis --download-openblas --download-openblas-make-options="'USE_THREAD=0 USE_LOCKING=1 USE_OPENMP=0'" --download-pastix --download-mumps --with-shared-libraries=1 --with-scalar-type=complex --download-cmake --download-scalapack --with-debugging=0 --download-netcdf --download-superlu_dist --download-suitesparse --download-pnetcdf
[63]PETSC ERROR: #1 MatFactorNumeric_MUMPS() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1664
[63]PETSC ERROR: #2 MatLUFactorNumeric() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/mat/interface/matrix.c:3177
[63]PETSC ERROR: #3 PCSetUp_LU() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/pc/impls/factor/lu/lu.c:135
[63]PETSC ERROR: #4 PCSetUp() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/pc/interface/precon.c:993
[63]PETSC ERROR: #5 KSPSetUp() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/ksp/interface/itfunc.c:407
[63]PETSC ERROR: #6 KSPSolve_Private() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/ksp/interface/itfunc.c:843
[63]PETSC ERROR: #7 KSPSolve() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/ksp/interface/itfunc.c:1078
The MUMPS documentation on INFOG(1) = -9 and ICNTL(14):
The main internal real/complex workarray S is too small. If INFO(2) is positive, then the number of entries that are missing in S at the moment when the error is raised is available in INFO(2). If INFO(2) is negative, then its absolute value should be multiplied by 1 million. If an error -9 occurs, the user should increase the value of ICNTL(14) before calling the factorization (JOB=2) again, except if LWK_USER is provided (in that case LWK_USER should be increased).

ICNTL(14) corresponds to the percentage increase in the estimated working space.
Phase: accessed by the host both during the analysis and the factorization phases.
Default value: between 20 and 35 (which corresponds to at most 35% increase) and depends on the number of MPI processes. It is set to 5% with SYM=1 and one MPI process.
Related parameters: ICNTL(23)
Remarks: when significant extra fill-in is caused by numerical pivoting, increasing ICNTL(14) may help.
We can set MUMPS parameters through the option -mat_mumps_icntl_<num>, such as -mat_mumps_icntl_14 40; see the manual page of MATSOLVERMUMPS and the MUMPS documentation for details.
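In Firedrake such options can also be passed per solve through solver_parameters. Below is a minimal sketch on a toy problem (the problem itself is only a placeholder; substitute your own forms):

from firedrake import *

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, 'CG', 1)
u, v = TrialFunction(V), TestFunction(V)
a = inner(grad(u), grad(v))*dx + inner(u, v)*dx
L = inner(Constant(1), v)*dx
sol = Function(V)

# Direct solve with MUMPS, requesting 40% extra workspace (ICNTL(14) = 40)
solve(a == L, sol, solver_parameters={
    "ksp_type": "preonly",
    "pc_type": "lu",
    "pc_factor_mat_solver_type": "mumps",
    "mat_mumps_icntl_14": 40,
})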
Tips
Add the flags -ksp_view, -ksp_monitor, -ksp_converged_reason, -ksp_error_if_not_converged.
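These flags can also be passed per solver; in Firedrake's parameters dict a flag that takes no value is given as None. A minimal sketch, reusing a, L and sol from the sketch above:

solve(a == L, sol, solver_parameters={
    "ksp_view": None,
    "ksp_monitor": None,
    "ksp_converged_reason": None,
    "ksp_error_if_not_converged": None,
})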
Example 2 (MUMPS)
petsc4py.PETSc.Error: error code 76
[0] SNESSolve() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/snes/interface/snes.c:4693
[0] SNESSolve_KSPONLY() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/snes/impls/ksponly/ksponly.c:48
[0] KSPSolve() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/ksp/interface/itfunc.c:1070
[0] KSPSolve_Private() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/ksp/interface/itfunc.c:824
[0] KSPSetUp() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/ksp/interface/itfunc.c:405
[0] PCSetUp() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/pc/interface/precon.c:994
[0] PCSetUp_LU() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/pc/impls/factor/lu/lu.c:120
[0] MatLUFactorNumeric() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/mat/interface/matrix.c:3215
[0] MatFactorNumeric_MUMPS() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1683
[0] Error in external library
[0] Error reported by MUMPS in numerical factorization phase: INFOG(1)=-10, INFO(2)=4464
INFOG(1)=-10: Numerically singular matrix. INFO(2) holds the number of eliminated pivots.
Adding the option -mat_mumps_icntl_24 1 may fix this error.
ICNTL(24) controls the detection of "null pivot rows".
Phase: accessed by the host during the factorization phase.
Possible variables/arrays involved: PIVNUL_LIST.
Possible values:
0: nothing done. A null pivot row will result in error INFO(1)=-10.
1: null pivot row detection.
Other values are treated as 0.
Default value: 0 (no null pivot row detection)
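As with ICNTL(14) above, this can go through Firedrake's solver parameters (a sketch; the problem setup a, L, sol is assumed as in the earlier example):

solve(a == L, sol, solver_parameters={
    "ksp_type": "preonly",
    "pc_type": "lu",
    "pc_factor_mat_solver_type": "mumps",
    "mat_mumps_icntl_24": 1,  # detect null pivot rows instead of failing with INFOG(1)=-10
})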
10.1.2. Currently no support for ReferenceGrad in CoefficientDerivative
When using a high-order mesh, one cannot differentiate expressions involving the facet normal of the cells. Differentiating the facet normal on a high-order mesh gives the following error:
ufl.log.UFLException: Currently no support for ReferenceGrad in CoefficientDerivative.
from firedrake import *

def get_s12(mesh, u, v):
    n = FacetNormal(mesh)
    s1 = dot(n, dot(n, grad(grad(grad(u)))))
    s2 = dot(n, dot(n, grad(grad(grad(v)))))
    return s1, s2

def get_s12_v2(mesh, u, v):
    n = FacetNormal(mesh)
    s1 = dot(n, grad(dot(n, grad(grad(u)))))
    s2 = dot(n, grad(dot(n, grad(grad(v)))))
    return s1, s2

def test_grad_n(high_order_mesh=True, fun=get_s12):
    p = 3
    N = 10
    mesh = RectangleMesh(N, N, 1, 1)
    if high_order_mesh:
        V = VectorFunctionSpace(mesh, 'CG', 2)
        coords = Function(V).interpolate(mesh.coordinates)
        mesh = Mesh(coords)
    V = FunctionSpace(mesh, 'CG', p)
    u, v = TrialFunction(V), TestFunction(V)
    s1, s2 = fun(mesh, u, v)
    a = inner(grad(u), grad(v))*dx + inner(s1('+'), s2('+'))*dS + inner(s1('-'), s2('-'))*dS
    L = inner(Constant(0), v)*dx
    sol = Function(V)
    prob = LinearVariationalProblem(a, L, sol)
    solver = LinearVariationalSolver(prob)
    solver.solve()

for f in [get_s12, get_s12_v2]:
    for hom in [False, True]:
        try:
            test_grad_n(high_order_mesh=hom, fun=f)
            print(f.__name__, f': High order mesh: {hom}, ', "TEST OK!")
        except Exception as e:
            print(f.__name__, f': High order mesh: {hom}, ', "TEST ERROR: ", e)
firedrake:WARNING OMP_NUM_THREADS is not set or is set to a value greater than 1, we suggest setting OMP_NUM_THREADS=1 to improve performance
get_s12 : High order mesh: False, TEST OK!
get_s12 : High order mesh: True, TEST OK!
get_s12_v2 : High order mesh: False, TEST OK!
get_s12_v2 : High order mesh: True, TEST OK!
10.1.3. PyErr_Occurred
python: src/petsc4py.PETSc.c:348918: __Pyx_PyCFunction_FastCall: Assertion `!PyErr_Occurred()' failed.
This may be caused by an error in the Python code that PETSc calls back into (a programming error, such as an undefined variable).
Tip
Adding the following code at the beginning of the program may produce more detailed information:
from firedrake.petsc import PETSc
PETSc.Sys.popErrorHandler()
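As an illustration, here is a minimal sketch of how a bug in a Python callback invoked by PETSc can surface this way; depending on the build it may appear as a Python exception or as a low-level assertion like the one above. The NameError below is deliberate:

from petsc4py import PETSc

def monitor(ksp, its, rnorm):
    print(undefined_variable)  # deliberate bug: NameError inside a PETSc callback

# A tiny 2x2 system, just to have something to solve
A = PETSc.Mat().createAIJ([2, 2])
A.setUp()
A.setValues([0, 1], [0, 1], [[2.0, 0.0], [0.0, 2.0]])
A.assemble()
b = A.createVecLeft()
b.set(1.0)
x = A.createVecRight()

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setMonitor(monitor)
ksp.solve(b, x)  # the NameError is raised from inside PETSc's monitor call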
10.1.4. PETSc Error Code
The PETSc error codes are documented at https://petsc.org/release/manualpages/Sys/PetscErrorCode/ and defined as follows:
{
PETSC_SUCCESS = 0,
PETSC_ERR_BOOLEAN_MACRO_FAILURE = 1, /* do not use */
PETSC_ERR_MIN_VALUE = 54, /* should always be one less then the smallest value */
PETSC_ERR_MEM = 55, /* unable to allocate requested memory */
PETSC_ERR_SUP = 56, /* no support for requested operation */
PETSC_ERR_SUP_SYS = 57, /* no support for requested operation on this computer system */
PETSC_ERR_ORDER = 58, /* operation done in wrong order */
PETSC_ERR_SIG = 59, /* signal received */
PETSC_ERR_FP = 72, /* floating point exception */
PETSC_ERR_COR = 74, /* corrupted PETSc object */
PETSC_ERR_LIB = 76, /* error in library called by PETSc */
PETSC_ERR_PLIB = 77, /* PETSc library generated inconsistent data */
PETSC_ERR_MEMC = 78, /* memory corruption */
PETSC_ERR_CONV_FAILED = 82, /* iterative method (KSP or SNES) failed */
PETSC_ERR_USER = 83, /* user has not provided needed function */
PETSC_ERR_SYS = 88, /* error in system call */
PETSC_ERR_POINTER = 70, /* pointer does not point to valid address */
PETSC_ERR_MPI_LIB_INCOMP = 87, /* MPI library at runtime is not compatible with MPI user compiled with */
PETSC_ERR_ARG_SIZ = 60, /* nonconforming object sizes used in operation */
PETSC_ERR_ARG_IDN = 61, /* two arguments not allowed to be the same */
PETSC_ERR_ARG_WRONG = 62, /* wrong argument (but object probably ok) */
PETSC_ERR_ARG_CORRUPT = 64, /* null or corrupted PETSc object as argument */
PETSC_ERR_ARG_OUTOFRANGE = 63, /* input argument, out of range */
PETSC_ERR_ARG_BADPTR = 68, /* invalid pointer argument */
PETSC_ERR_ARG_NOTSAMETYPE = 69, /* two args must be same object type */
PETSC_ERR_ARG_NOTSAMECOMM = 80, /* two args must be same communicators */
PETSC_ERR_ARG_WRONGSTATE = 73, /* object in argument is in wrong state, e.g. unassembled mat */
PETSC_ERR_ARG_TYPENOTSET = 89, /* the type of the object has not yet been set */
PETSC_ERR_ARG_INCOMP = 75, /* two arguments are incompatible */
PETSC_ERR_ARG_NULL = 85, /* argument is null that should not be */
PETSC_ERR_ARG_UNKNOWN_TYPE = 86, /* type name doesn't match any registered type */
PETSC_ERR_FILE_OPEN = 65, /* unable to open file */
PETSC_ERR_FILE_READ = 66, /* unable to read from file */
PETSC_ERR_FILE_WRITE = 67, /* unable to write to file */
PETSC_ERR_FILE_UNEXPECTED = 79, /* unexpected data in file */
PETSC_ERR_MAT_LU_ZRPVT = 71, /* detected a zero pivot during LU factorization */
PETSC_ERR_MAT_CH_ZRPVT = 81, /* detected a zero pivot during Cholesky factorization */
PETSC_ERR_INT_OVERFLOW = 84,
PETSC_ERR_FLOP_COUNT = 90,
PETSC_ERR_NOT_CONVERGED = 91, /* solver did not converge */
PETSC_ERR_MISSING_FACTOR = 92, /* MatGetFactor() failed */
PETSC_ERR_OPT_OVERWRITE = 93, /* attempted to over write options which should not be changed */
PETSC_ERR_WRONG_MPI_SIZE = 94, /* example/application run with number of MPI ranks it does not support */
PETSC_ERR_USER_INPUT = 95, /* missing or incorrect user input */
PETSC_ERR_GPU_RESOURCE = 96, /* unable to load a GPU resource, for example cuBLAS */
PETSC_ERR_GPU = 97, /* An error from a GPU call, this may be due to lack of resources on the GPU or a true error in the call */
PETSC_ERR_MPI = 98, /* general MPI error */
PETSC_ERR_RETURN = 99, /* PetscError() incorrectly returned an error code of 0 */
PETSC_ERR_MAX_VALUE = 100, /* this is always the one more than the largest error code */
/*
do not use, exist purely to make the enum bounds equal that of a regular int (so conversion
to int in main() is not undefined behavior)
*/
PETSC_ERR_MIN_SIGNED_BOUND_DO_NOT_USE = INT_MIN,
PETSC_ERR_MAX_SIGNED_BOUND_DO_NOT_USE = INT_MAX
} PETSC_ERROR_CODE_ENUM_NAME;
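When petsc4py raises PETSc.Error, the numeric code above is available on the exception object. A minimal sketch, reusing the unsupported-operation example from Section 10.3:

from petsc4py import PETSc

try:
    # Viewing a Vec whose type was never set is an unsupported operation
    PETSc.Vec().create(comm=PETSc.COMM_SELF).view()
except PETSc.Error as e:
    print(e.ierr)  # 56, i.e. PETSC_ERR_SUP: no support for requested operation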
10.2. Debugging Python code
When an exception is raised at runtime, locate the failing code and check whether the variables involved hold abnormal values. For example, in a Jupyter notebook the magic %debug opens a debugger in which the relevant variables can be inspected.
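Outside Jupyter, the standard library offers the same post-mortem workflow. A minimal sketch with a deliberate bug:

import pdb
import traceback

def buggy():
    x = [1, 2, 3]
    return x[10]  # deliberate IndexError

try:
    buggy()
except Exception:
    traceback.print_exc()
    pdb.post_mortem()  # inspect the failing frame: p x, l, where, up, quit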
10.3. Debugging C code (gdb)
Since firedrake relies on PETSc for mesh management and for solving linear systems, errors sometimes occur inside PETSc itself, for example when running the following code:
TODO: find a better example; this one does not really fit.
# filename: test.py
import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc
if PETSc.COMM_WORLD.rank == 0:
    PETSc.Vec().create(comm=PETSc.COMM_SELF).view()
The error message is:
$ python test.py
Vec Object: 1 MPI process
type not yet set
Traceback (most recent call last):
File "test.py", line 7, in <module>
PETSc.Vec().create(comm=PETSc.COMM_SELF).view()
File "PETSc/Vec.pyx", line 140, in petsc4py.PETSc.Vec.view
petsc4py.PETSc.Error: error code 56
[0] VecView() at /home/yzz/software/firedrake-mini-petsc/src/petsc/src/vec/vec/interface/vector.c:715
[0] No support for this operation for this object type
[0] No method view for Vec of type (null)
In such cases we can use a debugging tool such as gdb.
10.3.1. gdb command line usage
gdb [options] --args executable-file [inferior-arguments ...]
10.3.2. Options
-x file: read gdb commands from the given file
-ex COMMAND: execute the given gdb command
--args executable-file [inferior-arguments ...]: pass arguments to the executable
--pid <pid>: attach to a running process
10.3.3. gdb commands
bt: print the call stack (backtrace)
run: run the program
l: list source code
p: print the value of a variable
10.3.4. Example (debugging test.py)
$ gdb -ex run --args $(which python3) test.py
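Once the crash is reached, a typical session at the gdb prompt looks like this (a sketch; the exact frames depend on your build, and the variable name in p is hypothetical):

(gdb) bt        # C-level backtrace at the crash site
(gdb) py-bt     # Python-level backtrace (requires the python extension, see Section 10.3.6)
(gdb) p var     # print a C variable from the selected frame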
10.3.5. gdb control commands
The following commands record a gdb session to a file, which can then be attached when submitting an issue.
Echo each executed gdb command: set trace-commands on
(ref: https://sourceware.org/gdb/onlinedocs/gdb/Messages_002fWarnings.html)
Disable paging: set pagination off
(ref: https://sourceware.org/gdb/onlinedocs/gdb/Screen-Size.html)
Set the log file and enable logging: set logging file my.logs, set logging enable on
(ref: https://sourceware.org/gdb/download/onlinedocs/gdb/Logging-Output.html)
10.3.6. The python extension for gdb
On Ubuntu, after installing python3-dbg, the directory /usr/share/gdb/auto-load/usr/bin/ contains the following extensions:
$ ls /usr/share/gdb/auto-load/usr/bin/
python3.10-dbg-gdb.py python3.10dm-gdb.py python3.10-gdb.py python3.10m-gdb.py
They define commands for displaying the Python call stack, such as py-bt.
If gdb does not auto-load the extension at startup (it is unclear why this sometimes happens), it can be loaded manually:
source /usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py
or that line can be added to gdb's initialization file $HOME/.config/gdb/gdbinit, or to .gdbinit in the current directory:
source /usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py
10.3.7. Using commands from file
Create a file cmd.txt with the following content:
source /usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py
set trace-commands on
set pagination off
set logging file my.logs
set logging enable on
py-bt
exit
Run gdb with the -x option to execute the commands above automatically and exit:
gdb -x cmd.txt -p <pid>
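If the target pid is unknown, it can be found with standard tools, e.g. (assuming test.py is the running script):
$ pgrep -af test.py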
10.4. Debugging parallel programs
10.4.1. The PETSc option -start_in_debugger
We can use the PETSc option -start_in_debugger to start a debugger for each process:
mpiexec -n 3 $(which python) test.py -start_in_debugger
By default this opens one xterm window per process.
Tip
Customizing the xterm appearance.
Reference: http://www.futurile.net/2016/06/14/xterm-setup-and-truetype-font-configuration/
$ cat ~/.Xdefaults
xterm*faceName: Monospace
xterm*faceSize: 12
xterm*foreground: rgb:a8/a8/a8
xterm*background: rgb:00/00/00
10.4.2. The tool tmux-mpi
Reference:
Tips on the Firedrake wiki: firedrakeproject/firedrake
Alternatively, we can use the tool tmux-mpi.
10.4.2.1. Installing tmux-mpi
Install tmux:
sudo apt-get install tmux
Install dtach (a dependency of tmux-mpi):
First build dtach, then copy the binary to a directory on PATH, such as $HOME/bin.
git clone https://github.com/crigler/dtach
cd dtach
./configure
make
mkdir -p $HOME/bin
cp dtach $HOME/bin
export PATH=$PATH:$HOME/bin
Run which dtach to confirm the installation succeeded.
Install tmux-mpi with pip:
pip install --upgrade --no-cache-dir git+https://github.com/wrs20/tmux-mpi@master
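As with dtach, confirm the tool landed on PATH:
$ which tmux-mpi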
10.4.2.2. Debugging commands
Start the debuggers:
tmux-mpi 3 gdb -ex run --args $(which python) test.py
Attach to the corresponding pseudo terminals, one window per process (this is a tmux session with several windows):
tmux attach -t tmux-mpi
Then debug with the gdb commands from Section 10.3.3.
10.5. Compiling pyx files
10.5.1. pyx files in firedrake
After modifying dmcommon.pyx, recompile it in place:
python setup.py build_ext --inplace
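A sketch of the full sequence, assuming firedrake was installed from source into a virtual environment with the usual src/firedrake layout (as in the tracebacks above):
cd $VIRTUAL_ENV/src/firedrake
python setup.py build_ext --inplace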