Debug

10. Debug#

10.1. Common Problems#

10.1.1. DIVERGED_LINEAR_SOLVE#

The error looks like this:

File "/home/yzz/firedrake/src/firedrake/firedrake/adjoint/solving.py", line 50, in wrapper
    output = solve(*args, **kwargs)
  File "/home/yzz/firedrake/src/firedrake/firedrake/solving.py", line 129, in solve
    _solve_varproblem(*args, **kwargs)
  File "/home/yzz/firedrake/src/firedrake/firedrake/solving.py", line 161, in _solve_varproblem
    solver.solve()
  File "/home/yzz/firedrake/src/firedrake/firedrake/adjoint/variational_solver.py", line 75, in wrapper
    out = solve(self, **kwargs)
  File "/home/yzz/firedrake/src/firedrake/firedrake/variational_solver.py", line 278, in solve
    solving_utils.check_snes_convergence(self.snes)
  File "/home/yzz/firedrake/src/firedrake/firedrake/solving_utils.py", line 139, in check_snes_convergence
    raise ConvergenceError(r"""Nonlinear solve failed to converge after %d nonlinear iterations.
firedrake.exceptions.ConvergenceError: Nonlinear solve failed to converge after 0 nonlinear iterations.
Reason:
   DIVERGED_LINEAR_SOLVE

Possible reasons:

Your system of equations is not closed. You may have specified the wrong boundary conditions; check them carefully.
An error in an external package?
The resulting system is singular? (Maybe)
…

We can add the flag -ksp_error_if_not_converged to make PETSc print more information about the error. Below is an example of a DIVERGED_LINEAR_SOLVE error.

Example 1 (MUMPS error)

Reference:

Doc of MUMPS: https://graal.ens-lyon.fr/MUMPS/index.php?page=doc
MATSOLVERMUMPS: https://petsc.org/main/manualpages/Mat/MATSOLVERMUMPS/

Below is an example of an error that occurred inside the MUMPS package. The meaning of the error message can be looked up in the MUMPS documentation.

[63]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[63]PETSC ERROR: Error in external library
[63]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: INFOG(1)=-9, INFO(2)=26
[63]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[63]PETSC ERROR: Petsc Development GIT revision: v3.4.2-38777-g979cc68729  GIT Date: 2022-06-22 21:18:23 +0100
[63]PETSC ERROR: ../hsolver/hsolver.py on a default named AMAs4 by z2yang Tue Nov  8 16:03:18 2022
[63]PETSC ERROR: Configure options PETSC_DIR=/home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc PETSC_ARCH=default --download-ptscotch --with-zlib --download-hwloc --with-c2html=0 --download-eigen="/home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/eigen-3.3.3.tgz " --download-mpich --download-hdf5 --with-fortran-bindings=0 --with-64-bit-indices --download-bison --with-cxx-dialect=C++11 --download-metis --download-openblas --download-openblas-make-options="'USE_THREAD=0 USE_LOCKING=1 USE_OPENMP=0'" --download-pastix --download-mumps --with-shared-libraries=1 --with-scalar-type=complex --download-cmake --download-scalapack --with-debugging=0 --download-netcdf --download-superlu_dist --download-suitesparse --download-pnetcdf
[63]PETSC ERROR: #1 MatFactorNumeric_MUMPS() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1664
[63]PETSC ERROR: #2 MatLUFactorNumeric() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/mat/interface/matrix.c:3177
[63]PETSC ERROR: #3 PCSetUp_LU() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/pc/impls/factor/lu/lu.c:135
[63]PETSC ERROR: #4 PCSetUp() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/pc/interface/precon.c:993
[63]PETSC ERROR: #5 KSPSetUp() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/ksp/interface/itfunc.c:407
[63]PETSC ERROR: #6 KSPSolve_Private() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/ksp/interface/itfunc.c:843
[63]PETSC ERROR: #7 KSPSolve() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/ksp/interface/itfunc.c:1078

The MUMPS documentation on INFOG(1) = -9 and ICNTL(14):

The main internal real/complex workarray S is too small. If INFO(2) is positive, then the number
of entries that are missing in S at the moment when the error is raised is available in INFO(2).
If INFO(2) is negative, then its absolute value should be multiplied by 1 million. If an error –9
occurs, the user should increase the value of ICNTL(14) before calling the factorization (JOB=2)
again, except if LWK_USER is provided LWK_USER should be increased.

ICNTL(14) corresponds to the percentage increase in the estimated working space.
Phase: accessed by the host both during the analysis and the factorization phases.
Default value: between 20 and 35 (which corresponds to at most 35 % increase) and depends on
               the number of MPI processes. It is set to 5 % with SYM=1 and one MPI process.
Related parameters: ICNTL(23)
Remarks: When significant extra fill-in is caused by numerical pivoting, increasing ICNTL(14)
         may help

MUMPS parameters can be set through -mat_mumps_icntl_<num>, for example -mat_mumps_icntl_14 40. See the manual page on MATSOLVERMUMPS and the MUMPS documentation for details.
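As an illustration of how such an option might be passed from Python, the sketch below builds an options dictionary in the form Firedrake accepts as `solver_parameters`. The helper name `with_mumps_icntl` is hypothetical, and the value 40 is only the example value from the text, not a recommendation.

```python
def with_mumps_icntl(icntl):
    """Return direct-solver (LU via MUMPS) options plus the given ICNTL values.

    `icntl` maps an ICNTL index to its value, e.g. {14: 40}.
    This helper is hypothetical; only the option names come from PETSc.
    """
    params = {
        "ksp_type": "preonly",
        "pc_type": "lu",
        "pc_factor_mat_solver_type": "mumps",
    }
    for num, value in icntl.items():
        params[f"mat_mumps_icntl_{num}"] = value
    return params


params = with_mumps_icntl({14: 40})
print(params)
# Assumed usage in a Firedrake script:
#   solve(a == L, u, solver_parameters=params)
```

The same dictionary corresponds to the command-line form `-mat_mumps_icntl_14 40`.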

Tips

Add the flags ksp_view, ksp_monitor, ksp_converged_reason, and ksp_error_if_not_converged.
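A minimal sketch of how these diagnostic flags could be written down in Python. In a Firedrake `solver_parameters` dictionary, options that take no value are conventionally given the value `None`. The helper `to_command_line` is hypothetical and only shows the equivalent command-line form.

```python
# Diagnostic PETSc options as a solver_parameters-style dictionary.
# Flags without a value are stored as None.
debug_parameters = {
    "ksp_view": None,
    "ksp_monitor": None,
    "ksp_converged_reason": None,
    "ksp_error_if_not_converged": None,
}


def to_command_line(parameters):
    """Render an options dictionary as a list of PETSc command-line arguments."""
    args = []
    for name, value in parameters.items():
        args.append(f"-{name}")
        if value is not None:
            args.append(str(value))
    return args


print(" ".join(to_command_line(debug_parameters)))
# Assumed usage in a Firedrake script:
#   solve(a == L, u, solver_parameters=debug_parameters)
```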

Example 2 (MUMPS error)

petsc4py.PETSc.Error: error code 76
[0] SNESSolve() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/snes/interface/snes.c:4693
[0] SNESSolve_KSPONLY() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/snes/impls/ksponly/ksponly.c:48
[0] KSPSolve() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/ksp/interface/itfunc.c:1070
[0] KSPSolve_Private() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/ksp/interface/itfunc.c:824
[0] KSPSetUp() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/ksp/interface/itfunc.c:405
[0] PCSetUp() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/pc/interface/precon.c:994
[0] PCSetUp_LU() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/pc/impls/factor/lu/lu.c:120
[0] MatLUFactorNumeric() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/mat/interface/matrix.c:3215
[0] MatFactorNumeric_MUMPS() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1683
[0] Error in external library
[0] Error reported by MUMPS in numerical factorization phase: INFOG(1)=-10, INFO(2)=4464

INFOG(1)=-10: Numerically singular matrix. INFO(2) holds the number of eliminated pivots.

Adding the option -mat_mumps_icntl_24 1 may fix this error.

ICNTL(24) controls the detection of "null pivot rows".
    Phase: accessed by the host during the factorization phase
    Possible variables/arrays involved: PIVNUL_LIST
    Possible values :
        0: Nothing done. A null pivot row will result in error INFO(1)=-10.
        1: Null pivot row detection.
    Other values are treated as 0.
    Default value: 0 (no null pivot row detection)
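When many such messages have to be read, the two MUMPS codes discussed in the examples above can be looked up mechanically. The following is a small sketch, not part of PETSc or MUMPS: the function name and the lookup table are hypothetical, and the table covers only the two INFOG(1) values quoted in this section.

```python
import re

# Meanings paraphrased from the MUMPS documentation excerpts in this section.
INFOG1_MEANINGS = {
    -9: "main internal workarray S is too small; try increasing ICNTL(14)",
    -10: "numerically singular matrix; ICNTL(24)=1 may help",
}

MUMPS_ERROR = re.compile(r"INFOG\(1\)=(-?\d+), INFO\(2\)=(-?\d+)")


def explain_mumps_error(message):
    """Extract INFOG(1) and INFO(2) from a PETSc error line and describe them."""
    match = MUMPS_ERROR.search(message)
    if match is None:
        return None
    infog1 = int(match.group(1))
    info2 = int(match.group(2))
    meaning = INFOG1_MEANINGS.get(infog1, "see the MUMPS documentation")
    return infog1, info2, meaning


line = ("Error reported by MUMPS in numerical factorization phase: "
        "INFOG(1)=-10, INFO(2)=4464")
print(explain_mumps_error(line))
```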

10.1.2. Currently no support for ReferenceGrad in CoefficientDerivative#

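On a high-order mesh, the facet normal cannot be differentiated. Differentiating the facet normal on a high-order mesh produces the following error: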

ufl.log.UFLException: Currently no support for ReferenceGrad in CoefficientDerivative.

from firedrake import *

def get_s12(mesh, u, v):
    n = FacetNormal(mesh)
    s1 = dot(n, dot(n, grad(grad(grad(u)))))
    s2 = dot(n, dot(n, grad(grad(grad(v)))))
    return s1, s2

def get_s12_v2(mesh, u, v):
    n = FacetNormal(mesh)
    s1 = dot(n, grad(dot(n, grad(grad(u)))))
    s2 = dot(n, grad(dot(n, grad(grad(v)))))
    return s1, s2

def test_grad_n(high_order_mesh=True, fun=get_s12):
    p = 2
    N = 4

    mesh = RectangleMesh(N, N, 1, 1)
        
    if high_order_mesh:
        V = VectorFunctionSpace(mesh, 'CG', 2)
        coords = Function(V).interpolate(mesh.coordinates)
        mesh = Mesh(coords)

    V = FunctionSpace(mesh, 'CG', p)

    u, v = TrialFunction(V), TestFunction(V)

    s1, s2 = fun(mesh, u, v)
    a = inner(grad(u), grad(v))*dx + inner(s1('+'), s2('+'))*dS + inner(s1('-'), s2('-'))*dS
    L = inner(Constant(0), v)*dx

    sol = Function(V)
    prob = LinearVariationalProblem(a, L, sol)
    solver = LinearVariationalSolver(prob)

    solver.solve()

for f in [get_s12, get_s12_v2]:
    for hom in [False, True]:
        try:
            test_grad_n(high_order_mesh=hom, fun=f)
            print(f.__name__, f': High order mesh: {hom}, ', "TEST OK!")
        except Exception as e:
            print(f.__name__, f': High order mesh: {hom}, ', "TEST ERROR: ", e)

get_s12 : High order mesh: False,  TEST OK!

get_s12 : High order mesh: True,  TEST OK!

get_s12_v2 : High order mesh: False,  TEST OK!

get_s12_v2 : High order mesh: True,  TEST OK!

10.1.3. PyErr_Occurred#

python: src/petsc4py.PETSc.c:348918: __Pyx_PyCFunction_FastCall: Assertion `!PyErr_Occurred()' failed.

This may be caused by a programming error (such as an undefined variable) in Python code that is called by PETSc.

Tip

Adding the following code at the start of the program may give more detailed information:

from firedrake.petsc import PETSc
PETSc.Sys.popErrorHandler()

10.1.4. PETSc Error Code#

https://petsc.org/release/manualpages/Sys/PetscErrorCode/

 {
  PETSC_SUCCESS                   = 0,
  PETSC_ERR_BOOLEAN_MACRO_FAILURE = 1, /* do not use */

  PETSC_ERR_MIN_VALUE = 54, /* should always be one less then the smallest value */

  PETSC_ERR_MEM            = 55, /* unable to allocate requested memory */
  PETSC_ERR_SUP            = 56, /* no support for requested operation */
  PETSC_ERR_SUP_SYS        = 57, /* no support for requested operation on this computer system */
  PETSC_ERR_ORDER          = 58, /* operation done in wrong order */
  PETSC_ERR_SIG            = 59, /* signal received */
  PETSC_ERR_FP             = 72, /* floating point exception */
  PETSC_ERR_COR            = 74, /* corrupted PETSc object */
  PETSC_ERR_LIB            = 76, /* error in library called by PETSc */
  PETSC_ERR_PLIB           = 77, /* PETSc library generated inconsistent data */
  PETSC_ERR_MEMC           = 78, /* memory corruption */
  PETSC_ERR_CONV_FAILED    = 82, /* iterative method (KSP or SNES) failed */
  PETSC_ERR_USER           = 83, /* user has not provided needed function */
  PETSC_ERR_SYS            = 88, /* error in system call */
  PETSC_ERR_POINTER        = 70, /* pointer does not point to valid address */
  PETSC_ERR_MPI_LIB_INCOMP = 87, /* MPI library at runtime is not compatible with MPI user compiled with */

  PETSC_ERR_ARG_SIZ          = 60, /* nonconforming object sizes used in operation */
  PETSC_ERR_ARG_IDN          = 61, /* two arguments not allowed to be the same */
  PETSC_ERR_ARG_WRONG        = 62, /* wrong argument (but object probably ok) */
  PETSC_ERR_ARG_CORRUPT      = 64, /* null or corrupted PETSc object as argument */
  PETSC_ERR_ARG_OUTOFRANGE   = 63, /* input argument, out of range */
  PETSC_ERR_ARG_BADPTR       = 68, /* invalid pointer argument */
  PETSC_ERR_ARG_NOTSAMETYPE  = 69, /* two args must be same object type */
  PETSC_ERR_ARG_NOTSAMECOMM  = 80, /* two args must be same communicators */
  PETSC_ERR_ARG_WRONGSTATE   = 73, /* object in argument is in wrong state, e.g. unassembled mat */
  PETSC_ERR_ARG_TYPENOTSET   = 89, /* the type of the object has not yet been set */
  PETSC_ERR_ARG_INCOMP       = 75, /* two arguments are incompatible */
  PETSC_ERR_ARG_NULL         = 85, /* argument is null that should not be */
  PETSC_ERR_ARG_UNKNOWN_TYPE = 86, /* type name doesn't match any registered type */

  PETSC_ERR_FILE_OPEN       = 65, /* unable to open file */
  PETSC_ERR_FILE_READ       = 66, /* unable to read from file */
  PETSC_ERR_FILE_WRITE      = 67, /* unable to write to file */
  PETSC_ERR_FILE_UNEXPECTED = 79, /* unexpected data in file */

  PETSC_ERR_MAT_LU_ZRPVT = 71, /* detected a zero pivot during LU factorization */
  PETSC_ERR_MAT_CH_ZRPVT = 81, /* detected a zero pivot during Cholesky factorization */

  PETSC_ERR_INT_OVERFLOW   = 84,
  PETSC_ERR_FLOP_COUNT     = 90,
  PETSC_ERR_NOT_CONVERGED  = 91,  /* solver did not converge */
  PETSC_ERR_MISSING_FACTOR = 92,  /* MatGetFactor() failed */
  PETSC_ERR_OPT_OVERWRITE  = 93,  /* attempted to over write options which should not be changed */
  PETSC_ERR_WRONG_MPI_SIZE = 94,  /* example/application run with number of MPI ranks it does not support */
  PETSC_ERR_USER_INPUT     = 95,  /* missing or incorrect user input */
  PETSC_ERR_GPU_RESOURCE   = 96,  /* unable to load a GPU resource, for example cuBLAS */
  PETSC_ERR_GPU            = 97,  /* An error from a GPU call, this may be due to lack of resources on the GPU or a true error in the call */
  PETSC_ERR_MPI            = 98,  /* general MPI error */
  PETSC_ERR_RETURN         = 99,  /* PetscError() incorrectly returned an error code of 0 */
  PETSC_ERR_MAX_VALUE      = 100, /* this is always the one more than the largest error code */

  /*
    do not use, exist purely to make the enum bounds equal that of a regular int (so conversion
    to int in main() is not undefined behavior)
  */
  PETSC_ERR_MIN_SIGNED_BOUND_DO_NOT_USE = INT_MIN,
  PETSC_ERR_MAX_SIGNED_BOUND_DO_NOT_USE = INT_MAX
} PETSC_ERROR_CODE_ENUM_NAME;
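The numeric codes in petsc4py messages such as `error code 76` can be mapped back to the names in this enum. The sketch below is only an illustration: `error_name` is a hypothetical helper, and the table lists just the two codes that appear in the examples of this chapter.

```python
import re

# Subset of the enum above: only the codes seen in this chapter's examples.
PETSC_ERROR_NAMES = {
    56: "PETSC_ERR_SUP",   # no support for requested operation
    76: "PETSC_ERR_LIB",   # error in library called by PETSc
}


def error_name(message):
    """Return the enum name for the 'error code N' found in a petsc4py message."""
    match = re.search(r"error code (\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return PETSC_ERROR_NAMES.get(code, f"unknown code {code}")


print(error_name("petsc4py.PETSc.Error: error code 76"))  # PETSC_ERR_LIB
```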

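10.2. Debugging Python Code#

When an exception is raised at run time, locate the failing code and check whether the relevant variables hold abnormal values. For example, in a Jupyter notebook, %debug opens the debugger so that these variables can be inspected.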
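Outside a notebook, the same inspection can be done from the exception's traceback. The sketch below is a non-interactive illustration using only the standard library; `locals_at_failure` and `scaled_sqrt` are hypothetical names. The standard `pdb.post_mortem` call, shown in a comment, opens the interactive debugger on the same traceback.

```python
import math


def locals_at_failure(exc):
    """Return (function name, line number, locals) of the innermost Python frame."""
    tb = exc.__traceback__
    while tb.tb_next is not None:
        tb = tb.tb_next
    frame = tb.tb_frame
    return frame.f_code.co_name, tb.tb_lineno, dict(frame.f_locals)


def scaled_sqrt(x, scale):
    # Hypothetical failing function: the argument of sqrt becomes negative.
    shifted = x - scale
    return math.sqrt(shifted)


try:
    scaled_sqrt(1.0, 2.0)
except ValueError as exc:
    name, lineno, variables = locals_at_failure(exc)
    print(f"failed in {name}, locals: {variables}")
    # Interactive alternative:
    #   import pdb; pdb.post_mortem(exc.__traceback__)
```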

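10.3. Debugging C Code (gdb)#

Because firedrake relies on PETSc for mesh management and for solving linear systems, errors sometimes occur inside PETSc, for example when running the following code:

TODO: find a better example; this one does not work.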

# filename: test.py
import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc
if PETSc.COMM_WORLD.rank == 0:
    PETSc.Vec().create(comm=PETSc.COMM_SELF).view()

The error message is as follows:

$ python test.py
Vec Object: 1 MPI process
  type not yet set
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    PETSc.Vec().create(comm=PETSc.COMM_SELF).view()
  File "PETSc/Vec.pyx", line 140, in petsc4py.PETSc.Vec.view
petsc4py.PETSc.Error: error code 56
[0] VecView() at /home/yzz/software/firedrake-mini-petsc/src/petsc/src/vec/vec/interface/vector.c:715
[0] No support for this operation for this object type
[0] No method view for Vec of type (null)

In this case, a debugging tool such as gdb can be used.

10.3.1. gdb Command-Line Usage#

gdb [options] --args executable-file [inferior-arguments ...]

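10.3.2. Options#

-x file: read gdb commands from a file
-ex COMMAND: execute a gdb command
--args exe [exe-args]: pass arguments to exe
--pid <pid>: debug a running program

10.3.3. gdb Commands#

bt: show the function call stack
run: run the executable
l: list the source code
p: print a variable

10.3.4. Example (debugging `test.py`)#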

$ gdb  -ex run --args $(which python3) test.py
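When gdb is launched from a Python driver script rather than from the shell, the same command line can be assembled as an argument list. This is a sketch under the assumption that the interpreter running the driver is the one to debug, so `sys.executable` plays the role of `$(which python3)`; `gdb_run_argv` is a hypothetical helper.

```python
import shlex
import sys


def gdb_run_argv(script, *script_args, python=sys.executable):
    """Build the argument list for: gdb -ex run --args <python> <script> [args...]."""
    return ["gdb", "-ex", "run", "--args", python, script, *script_args]


argv = gdb_run_argv("test.py")
print(shlex.join(argv))
# To actually start gdb (interactive):
#   import subprocess; subprocess.run(argv)
```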

10.3.5. gdb Control Commands#

The following commands save a gdb session to a file, which is useful when submitting an issue.

Echo the gdb commands as they are executed: set trace-commands on
ref https://sourceware.org/gdb/onlinedocs/gdb/Messages_002fWarnings.html
Turn off pagination: set pagination off
ref https://sourceware.org/gdb/onlinedocs/gdb/Screen-Size.html
Set the log file and enable logging: set logging file my.logs, set logging enable on
ref https://sourceware.org/gdb/download/onlinedocs/gdb/Logging-Output.html

10.3.6. Python Extension for gdb#

After installing python3-dbg on Ubuntu, the directory /usr/share/gdb/auto-load/usr/bin/ contains the following extension scripts:

$ ls /usr/share/gdb/auto-load/usr/bin/
python3.10-dbg-gdb.py  python3.10dm-gdb.py  python3.10-gdb.py  python3.10m-gdb.py

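These scripts define a command for displaying the Python call stack: py-bt.

If gdb does not load the extension automatically at startup (it is not clear why this happens), load it manually: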

source /usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py

Alternatively, add the following line to the gdb initialization file $HOME/.config/gdb/gdbinit or to .gdbinit in the current directory:

source /usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py

10.3.7. Using commands from file#

Create a file cmd.txt with the following content:

source /usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py
set trace-commands on
set pagination off
set logging file my.logs
set logging enable on
py-bt
exit

Running gdb with the -x option executes the commands above automatically and then exits.

gdb -x cmd.txt -p <pid>
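If the command file has to be generated for several processes, it can be written from Python. The sketch below reproduces the commands listed above; the helper names are hypothetical, the pid 12345 is a placeholder, and a temporary directory is used so that nothing is left behind.

```python
import tempfile
from pathlib import Path


def write_gdb_script(path, plugin, logfile="my.logs"):
    """Write the gdb command file described above and return its lines."""
    commands = [
        f"source {plugin}",
        "set trace-commands on",
        "set pagination off",
        f"set logging file {logfile}",
        "set logging enable on",
        "py-bt",
        "exit",
    ]
    Path(path).write_text("\n".join(commands) + "\n")
    return commands


def gdb_attach_argv(script, pid):
    """Build the argument list for: gdb -x <script> -p <pid>."""
    return ["gdb", "-x", str(script), "-p", str(pid)]


plugin = "/usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py"
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "cmd.txt"
    write_gdb_script(script, plugin)
    content = script.read_text()
    argv = gdb_attach_argv(script, 12345)  # 12345 is a placeholder pid

print(content)
```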

10.4. Debugging Parallel Programs#

10.4.1. The `PETSc` option `-start_in_debugger`#

Reference:

PETSc's -start_in_debugger option can be used to start a debugger for each process, as follows:

mpiexec -n 3 $(which python) test.py  -start_in_debugger

By default, this opens multiple xterm windows.

Tip

Customizing the appearance of the xterm windows

Reference: http://www.futurile.net/2016/06/14/xterm-setup-and-truetype-font-configuration/

$ cat ~/.Xdefaults
xterm*faceName: Monospace
xterm*faceSize: 12
xterm*foreground: rgb:a8/a8/a8
xterm*background: rgb:00/00/00

10.4.2. The `tmux-mpi` Tool#

Reference:

firedrakeproject/firedrake
Tips of Firedrake Wiki: firedrakeproject/firedrake

Alternatively, the tmux-mpi tool can be used.

10.4.2.1. Installing tmux-mpi#

Install tmux
```
sudo apt-get install tmux
```
Install dtach (a dependency of tmux-mpi)

First build dtach, then copy the binary to a directory on PATH, such as $HOME/bin.
```
git clone https://github.com/crigler/dtach
cd dtach
./configure
make
mkdir -p $HOME/bin
cp dtach $HOME/bin
export PATH=$PATH:$HOME/bin
```
Run which dtach to confirm that the installation succeeded.

Install tmux-mpi

Install with pip:

pip install --upgrade --no-cache-dir git+https://github.com/wrs20/tmux-mpi@master

10.4.2.2. Debugging Commands#

Start the debugger

tmux-mpi 3 gdb -ex run --args $(which python) test.py

Attach to the corresponding pseudo-terminals, one window per process. (This is a single tmux session with multiple windows.)
```
tmux attach -t tmux-mpi
```
Debug using gdb commands

10.5. Compiling pyx Files#

10.5.1. pyx Files in firedrake#

Recompile dmcommon.pyx after making modifications to it:

python setup.py build_ext --inplace
