9. Debug#

9.1. Common Problems#

9.1.1. DIVERGED_LINEAR_SOLVE#

A typical traceback looks like this:

File "/home/yzz/firedrake/src/firedrake/firedrake/adjoint/solving.py", line 50, in wrapper
    output = solve(*args, **kwargs)
  File "/home/yzz/firedrake/src/firedrake/firedrake/solving.py", line 129, in solve
    _solve_varproblem(*args, **kwargs)
  File "/home/yzz/firedrake/src/firedrake/firedrake/solving.py", line 161, in _solve_varproblem
    solver.solve()
  File "/home/yzz/firedrake/src/firedrake/firedrake/adjoint/variational_solver.py", line 75, in wrapper
    out = solve(self, **kwargs)
  File "/home/yzz/firedrake/src/firedrake/firedrake/variational_solver.py", line 278, in solve
    solving_utils.check_snes_convergence(self.snes)
  File "/home/yzz/firedrake/src/firedrake/firedrake/solving_utils.py", line 139, in check_snes_convergence
    raise ConvergenceError(r"""Nonlinear solve failed to converge after %d nonlinear iterations.
firedrake.exceptions.ConvergenceError: Nonlinear solve failed to converge after 0 nonlinear iterations.
Reason:
   DIVERGED_LINEAR_SOLVE

Possible reasons:

  1. The equation is not closed, for example because the boundary conditions are wrong or missing. Check the boundary conditions carefully.

  2. An external package (such as the direct solver) may have failed.

  3. The resulting linear system may be singular.
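
To illustrate reasons 1 and 3, here is a minimal sketch in plain Python (not Firedrake code; the function names are made up for this illustration). It assembles a 1D finite-difference Laplacian with pure Neumann ends and shows that the constant vector lies in its null space, so the system is singular. Pinning one value, as a Dirichlet condition would, removes that null space.

```python
def neumann_laplacian(n):
    """Assemble a 1D finite-difference Laplacian with pure Neumann ends."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i > 0:
            A[i][i - 1] = -1.0
            A[i][i] += 1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
            A[i][i] += 1.0
    return A


def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


n = 5
ones = [1.0] * n

# Without a Dirichlet condition every row sums to zero, so A * ones = 0:
# the constants are in the null space and the matrix is singular.
A = neumann_laplacian(n)
residual = matvec(A, ones)
print("A * ones (pure Neumann):", residual)

# Replacing the first row by an identity row mimics a Dirichlet condition
# and removes the constant vector from the null space.
A[0] = [1.0] + [0.0] * (n - 1)
pinned = matvec(A, ones)
print("A * ones (one value pinned):", pinned)
```

A direct solver applied to the first matrix would hit a zero pivot, which is one way a missing boundary condition can surface as DIVERGED_LINEAR_SOLVE.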

Adding the flag -ksp_error_if_not_converged makes PETSc print more information about the error. The examples below show errors that lead to DIVERGED_LINEAR_SOLVE.

Example 1 (MUMPS error)

Reference:

  1. Doc of MUMPS: https://graal.ens-lyon.fr/MUMPS/index.php?page=doc

  2. MATSOLVERMUMPS: https://petsc.org/main/manualpages/Mat/MATSOLVERMUMPS/

Below is an error raised inside the MUMPS package. The meaning of the error codes can be looked up in the MUMPS documentation.

[63]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[63]PETSC ERROR: Error in external library
[63]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: INFOG(1)=-9, INFO(2)=26
[63]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[63]PETSC ERROR: Petsc Development GIT revision: v3.4.2-38777-g979cc68729  GIT Date: 2022-06-22 21:18:23 +0100
[63]PETSC ERROR: ../hsolver/hsolver.py on a default named AMAs4 by z2yang Tue Nov  8 16:03:18 2022
[63]PETSC ERROR: Configure options PETSC_DIR=/home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc PETSC_ARCH=default --download-ptscotch --with-zlib --download-hwloc --with-c2html=0 --download-eigen="/home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/eigen-3.3.3.tgz " --download-mpich --download-hdf5 --with-fortran-bindings=0 --with-64-bit-indices --download-bison --with-cxx-dialect=C++11 --download-metis --download-openblas --download-openblas-make-options="'USE_THREAD=0 USE_LOCKING=1 USE_OPENMP=0'" --download-pastix --download-mumps --with-shared-libraries=1 --with-scalar-type=complex --download-cmake --download-scalapack --with-debugging=0 --download-netcdf --download-superlu_dist --download-suitesparse --download-pnetcdf
[63]PETSC ERROR: #1 MatFactorNumeric_MUMPS() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1664
[63]PETSC ERROR: #2 MatLUFactorNumeric() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/mat/interface/matrix.c:3177
[63]PETSC ERROR: #3 PCSetUp_LU() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/pc/impls/factor/lu/lu.c:135
[63]PETSC ERROR: #4 PCSetUp() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/pc/interface/precon.c:993
[63]PETSC ERROR: #5 KSPSetUp() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/ksp/interface/itfunc.c:407
[63]PETSC ERROR: #6 KSPSolve_Private() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/ksp/interface/itfunc.c:843
[63]PETSC ERROR: #7 KSPSolve() at /home/z2yang/opt/firedrake-env/firedrake-complex-int64/src/petsc/src/ksp/ksp/interface/itfunc.c:1078

The MUMPS documentation on INFOG(1) = -9 and ICNTL(14):

The main internal real/complex workarray S is too small. If INFO(2) is positive, then the number
of entries that are missing in S at the moment when the error is raised is available in INFO(2).
If INFO(2) is negative, then its absolute value should be multiplied by 1 million. If an error –9
occurs, the user should increase the value of ICNTL(14) before calling the factorization (JOB=2)
again, except if LWK_USER is provided LWK_USER should be increased.
ICNTL(14) corresponds to the percentage increase in the estimated working space.
Phase: accessed by the host both during the analysis and the factorization phases.
Default value: between 20 and 35 (which corresponds to at most 35 % increase) and depends on
               the number of MPI processes. It is set to 5 % with SYM=1 and one MPI process.
Related parameters: ICNTL(23)
Remarks: When significant extra fill-in is caused by numerical pivoting, increasing ICNTL(14)
         may help

MUMPS parameters can be set through -mat_mumps_icntl_<num>, for example -mat_mumps_icntl_14 40. See the manual page of MATSOLVERMUMPS and the MUMPS documentation for details.
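
The rule for INFO(2) quoted from the MUMPS documentation can be sketched in a few lines of Python. The helper names below are hypothetical, and the regular expression only assumes the message format shown in the log above.

```python
import re

# Matches the codes in a message such as
# "... Error reported by MUMPS ...: INFOG(1)=-9, INFO(2)=26"
MUMPS_ERROR = re.compile(r"INFOG\(1\)=(-?\d+), INFO\(2\)=(-?\d+)")


def parse_mumps_error(message):
    """Return (INFOG(1), INFO(2)) as integers, or None if not found."""
    match = MUMPS_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def missing_workspace_entries(info2):
    """Interpret INFO(2) for INFOG(1)=-9.

    A positive value is the number of missing entries; for a negative
    value the absolute value is multiplied by one million.
    """
    return info2 if info2 >= 0 else -info2 * 1_000_000


line = ("[63]PETSC ERROR: Error reported by MUMPS in numerical "
        "factorization phase: INFOG(1)=-9, INFO(2)=26")
infog1, info2 = parse_mumps_error(line)
if infog1 == -9:
    print(f"Work array too small by {missing_workspace_entries(info2)} "
          "entries; consider a larger -mat_mumps_icntl_14")
```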

Tips

Add the flags ksp_view, ksp_monitor, ksp_converged_reason, and ksp_error_if_not_converged.
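
As a sketch, these flags can be kept in one dictionary and rendered as command-line options. This assumes the usual Firedrake convention of a solver_parameters dictionary whose keys omit the leading dash and whose valueless flags map to None; the helper to_petsc_flags is hypothetical and not part of Firedrake or PETSc.

```python
def to_petsc_flags(params):
    """Render an options dictionary as a list of PETSc command-line flags."""
    flags = []
    for name, value in params.items():
        flags.append(f"-{name}")
        if value is not None:  # flags without a value are stored as None
            flags.append(str(value))
    return flags


diagnostics = {
    "ksp_view": None,
    "ksp_monitor": None,
    "ksp_converged_reason": None,
    "ksp_error_if_not_converged": None,
    "mat_mumps_icntl_14": 40,
}

print(" ".join(to_petsc_flags(diagnostics)))
```

The same dictionary could be passed as solver_parameters to a solver, or the printed flags appended to the command line of the script.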

Example 2 (MUMPS error)

petsc4py.PETSc.Error: error code 76
[0] SNESSolve() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/snes/interface/snes.c:4693
[0] SNESSolve_KSPONLY() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/snes/impls/ksponly/ksponly.c:48
[0] KSPSolve() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/ksp/interface/itfunc.c:1070
[0] KSPSolve_Private() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/ksp/interface/itfunc.c:824
[0] KSPSetUp() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/ksp/interface/itfunc.c:405
[0] PCSetUp() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/pc/interface/precon.c:994
[0] PCSetUp_LU() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/ksp/pc/impls/factor/lu/lu.c:120
[0] MatLUFactorNumeric() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/mat/interface/matrix.c:3215
[0] MatFactorNumeric_MUMPS() at /home/yzz/firedrake/real-int32-mkl-debug/src/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1683
[0] Error in external library
[0] Error reported by MUMPS in numerical factorization phase: INFOG(1)=-10, INFO(2)=4464
INFOG(1)=-10: Numerically singular matrix. INFO(2) holds the number of eliminated pivots.

Adding the option -mat_mumps_icntl_24 1 may fix this error. The MUMPS documentation describes ICNTL(24) as follows:

ICNTL(24) controls the detection of "null pivot rows".
    Phase: accessed by the host during the factorization phase
    Possible variables/arrays involved: PIVNUL_LIST
    Possible values :
        0: Nothing done. A null pivot row will result in error INFO(1)=-10.
        1: Null pivot row detection.
    Other values are treated as 0.
    Default value: 0 (no null pivot row detection)
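
A minimal sketch of setting this option from Python, assuming the Firedrake convention of passing PETSc options through a solver_parameters dictionary (keys without the leading dash). The names a, L, and u in the commented call are placeholders that are not defined here.

```python
# Direct solve with MUMPS and null pivot row detection enabled.
solver_parameters = {
    "ksp_type": "preonly",
    "pc_type": "lu",
    "pc_factor_mat_solver_type": "mumps",
    "mat_mumps_icntl_24": 1,  # 1: detect null pivot rows instead of failing
}

# In a Firedrake script this would be used along the lines of:
#     solve(a == L, u, solver_parameters=solver_parameters)
print(solver_parameters)
```

Null pivot detection hides the symptom rather than the cause, so it is still worth checking whether the problem is missing a boundary condition or a null space specification.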

9.1.2. Currently no support for ReferenceGrad in CoefficientDerivative#

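On a high-order mesh, the facet normal cannot be differentiated. Taking the gradient of the facet normal on a high-order mesh raises the following error: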

ufl.log.UFLException: Currently no support for ReferenceGrad in CoefficientDerivative.

The following script reproduces the error:

from firedrake import *

def get_s12(mesh, u, v):
    n = FacetNormal(mesh)
    s1 = dot(n, dot(n, grad(grad(grad(u)))))
    s2 = dot(n, dot(n, grad(grad(grad(v)))))
    return s1, s2

def get_s12_v2(mesh, u, v):
    n = FacetNormal(mesh)
    s1 = dot(n, grad(dot(n, grad(grad(u)))))
    s2 = dot(n, grad(dot(n, grad(grad(v)))))
    return s1, s2

def test_grad_n(high_order_mesh=True, fun=get_s12):
    p = 3
    N = 10

    mesh = RectangleMesh(N, N, 1, 1)
        
    if high_order_mesh:
        V = VectorFunctionSpace(mesh, 'CG', 2)
        coords = Function(V).interpolate(mesh.coordinates)
        mesh = Mesh(coords)

    V = FunctionSpace(mesh, 'CG', p)

    u, v = TrialFunction(V), TestFunction(V)

    s1, s2 = fun(mesh, u, v)
    a = inner(grad(u), grad(v))*dx + inner(s1('+'), s2('+'))*dS + inner(s1('-'), s2('-'))*dS
    L = inner(Constant(0), v)*dx

    sol = Function(V)
    prob = LinearVariationalProblem(a, L, sol)
    solver = LinearVariationalSolver(prob)

    solver.solve()

for f in [get_s12, get_s12_v2]:
    for hom in [False, True]:
        try:
            test_grad_n(high_order_mesh=hom, fun=f)
            print(f.__name__, f': High order mesh: {hom}, ', "TEST OK!")
        except Exception as e:
            print(f.__name__, f': High order mesh: {hom}, ', "TEST ERROR: ", e)

Output:

get_s12 : High order mesh: False,  TEST OK!
get_s12 : High order mesh: True,  TEST OK!
get_s12_v2 : High order mesh: False,  TEST OK!
get_s12_v2 : High order mesh: True,  TEST ERROR:  Currently no support for ReferenceGrad in CoefficientDerivative.

9.1.3. PyErr_Occurred#

python: src/petsc4py.PETSc.c:348918: __Pyx_PyCFunction_FastCall: Assertion `!PyErr_Occurred()' failed.

This may be caused by Python code of yours that is called by PETSc and contains a programming error, such as an undefined variable.

Tip

Adding the following code at the start of the program may give more detailed information:

from firedrake.petsc import PETSc
PETSc.Sys.popErrorHandler()

9.1.4. PETSc Error Code#

https://petsc.org/release/manualpages/Sys/PetscErrorCode/

 {
  PETSC_SUCCESS                   = 0,
  PETSC_ERR_BOOLEAN_MACRO_FAILURE = 1, /* do not use */

  PETSC_ERR_MIN_VALUE = 54, /* should always be one less then the smallest value */

  PETSC_ERR_MEM            = 55, /* unable to allocate requested memory */
  PETSC_ERR_SUP            = 56, /* no support for requested operation */
  PETSC_ERR_SUP_SYS        = 57, /* no support for requested operation on this computer system */
  PETSC_ERR_ORDER          = 58, /* operation done in wrong order */
  PETSC_ERR_SIG            = 59, /* signal received */
  PETSC_ERR_FP             = 72, /* floating point exception */
  PETSC_ERR_COR            = 74, /* corrupted PETSc object */
  PETSC_ERR_LIB            = 76, /* error in library called by PETSc */
  PETSC_ERR_PLIB           = 77, /* PETSc library generated inconsistent data */
  PETSC_ERR_MEMC           = 78, /* memory corruption */
  PETSC_ERR_CONV_FAILED    = 82, /* iterative method (KSP or SNES) failed */
  PETSC_ERR_USER           = 83, /* user has not provided needed function */
  PETSC_ERR_SYS            = 88, /* error in system call */
  PETSC_ERR_POINTER        = 70, /* pointer does not point to valid address */
  PETSC_ERR_MPI_LIB_INCOMP = 87, /* MPI library at runtime is not compatible with MPI user compiled with */

  PETSC_ERR_ARG_SIZ          = 60, /* nonconforming object sizes used in operation */
  PETSC_ERR_ARG_IDN          = 61, /* two arguments not allowed to be the same */
  PETSC_ERR_ARG_WRONG        = 62, /* wrong argument (but object probably ok) */
  PETSC_ERR_ARG_CORRUPT      = 64, /* null or corrupted PETSc object as argument */
  PETSC_ERR_ARG_OUTOFRANGE   = 63, /* input argument, out of range */
  PETSC_ERR_ARG_BADPTR       = 68, /* invalid pointer argument */
  PETSC_ERR_ARG_NOTSAMETYPE  = 69, /* two args must be same object type */
  PETSC_ERR_ARG_NOTSAMECOMM  = 80, /* two args must be same communicators */
  PETSC_ERR_ARG_WRONGSTATE   = 73, /* object in argument is in wrong state, e.g. unassembled mat */
  PETSC_ERR_ARG_TYPENOTSET   = 89, /* the type of the object has not yet been set */
  PETSC_ERR_ARG_INCOMP       = 75, /* two arguments are incompatible */
  PETSC_ERR_ARG_NULL         = 85, /* argument is null that should not be */
  PETSC_ERR_ARG_UNKNOWN_TYPE = 86, /* type name doesn't match any registered type */

  PETSC_ERR_FILE_OPEN       = 65, /* unable to open file */
  PETSC_ERR_FILE_READ       = 66, /* unable to read from file */
  PETSC_ERR_FILE_WRITE      = 67, /* unable to write to file */
  PETSC_ERR_FILE_UNEXPECTED = 79, /* unexpected data in file */

  PETSC_ERR_MAT_LU_ZRPVT = 71, /* detected a zero pivot during LU factorization */
  PETSC_ERR_MAT_CH_ZRPVT = 81, /* detected a zero pivot during Cholesky factorization */

  PETSC_ERR_INT_OVERFLOW   = 84,
  PETSC_ERR_FLOP_COUNT     = 90,
  PETSC_ERR_NOT_CONVERGED  = 91,  /* solver did not converge */
  PETSC_ERR_MISSING_FACTOR = 92,  /* MatGetFactor() failed */
  PETSC_ERR_OPT_OVERWRITE  = 93,  /* attempted to over write options which should not be changed */
  PETSC_ERR_WRONG_MPI_SIZE = 94,  /* example/application run with number of MPI ranks it does not support */
  PETSC_ERR_USER_INPUT     = 95,  /* missing or incorrect user input */
  PETSC_ERR_GPU_RESOURCE   = 96,  /* unable to load a GPU resource, for example cuBLAS */
  PETSC_ERR_GPU            = 97,  /* An error from a GPU call, this may be due to lack of resources on the GPU or a true error in the call */
  PETSC_ERR_MPI            = 98,  /* general MPI error */
  PETSC_ERR_RETURN         = 99,  /* PetscError() incorrectly returned an error code of 0 */
  PETSC_ERR_MAX_VALUE      = 100, /* this is always the one more than the largest error code */

  /*
    do not use, exist purely to make the enum bounds equal that of a regular int (so conversion
    to int in main() is not undefined behavior)
  */
  PETSC_ERR_MIN_SIGNED_BOUND_DO_NOT_USE = INT_MIN,
  PETSC_ERR_MAX_SIGNED_BOUND_DO_NOT_USE = INT_MAX
} PETSC_ERROR_CODE_ENUM_NAME;
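
When a traceback only reports a number, such as "error code 76" or "error code 56" in the examples of this chapter, the enum above can be turned into a lookup table. The sketch below parses a short excerpt of the enum with a regular expression; the helper describe and the choice of excerpt are illustrative only.

```python
import re

# A few entries copied from the enum above.
ENUM_EXCERPT = """
  PETSC_ERR_SUP            = 56, /* no support for requested operation */
  PETSC_ERR_LIB            = 76, /* error in library called by PETSc */
  PETSC_ERR_CONV_FAILED    = 82, /* iterative method (KSP or SNES) failed */
  PETSC_ERR_NOT_CONVERGED  = 91,  /* solver did not converge */
"""

# Captures the name, the numeric value, and the trailing comment.
ENTRY = re.compile(r"(PETSC_ERR_\w+)\s*=\s*(\d+),\s*/\*\s*(.*?)\s*\*/")

ERROR_CODES = {
    int(code): (name, text)
    for name, code, text in ENTRY.findall(ENUM_EXCERPT)
}


def describe(code):
    """Return a readable description of a PETSc error code."""
    name, text = ERROR_CODES.get(code, ("unknown", "not in the excerpt"))
    return f"{code}: {name} ({text})"


print(describe(76))
print(describe(56))
```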

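9.2. Debugging Python Code#

When an exception is raised at run time, locate the failing code and check whether the variables involved hold abnormal values. For example, in a Jupyter notebook, %debug opens the debugger so that the variables can be inspected.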
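
Outside a notebook, the same inspection can be done with the standard library. The sketch below uses a made-up function normalise that fails with a division by zero, walks the traceback to the innermost frame, and reads the local variables there. For an interactive session, pdb.post_mortem(exc.__traceback__) could be called inside the except block instead.

```python
def innermost_frame(exc):
    """Return the frame in which the exception was actually raised."""
    tb = exc.__traceback__
    while tb.tb_next is not None:
        tb = tb.tb_next
    return tb.tb_frame


def normalise(values):
    """Scale the values by their sum (fails when the sum is zero)."""
    total = sum(values)
    result = []
    for value in values:
        result.append(value / total)
    return result


try:
    normalise([1.0, -1.0])
except ZeroDivisionError as exc:
    frame = innermost_frame(exc)
    failing_function = frame.f_code.co_name
    local_vars = dict(frame.f_locals)

print("failed in:", failing_function)
print("total =", local_vars["total"])
```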

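9.3. Debugging C Code (gdb)#

Because firedrake relies on PETSc for mesh management and for solving linear systems, errors sometimes originate inside PETSc, for example when running the following code:

TODO: find a better example; this one does not work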

# filename: test.py
import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc
if PETSc.COMM_WORLD.rank == 0:
    PETSc.Vec().create(comm=PETSc.COMM_SELF).view()

The error message is as follows:

$ python test.py
Vec Object: 1 MPI process
  type not yet set
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    PETSc.Vec().create(comm=PETSc.COMM_SELF).view()
  File "PETSc/Vec.pyx", line 140, in petsc4py.PETSc.Vec.view
petsc4py.PETSc.Error: error code 56
[0] VecView() at /home/yzz/software/firedrake-mini-petsc/src/petsc/src/vec/vec/interface/vector.c:715
[0] No support for this operation for this object type
[0] No method view for Vec of type (null)

In this case, a debugger such as gdb can be used.

9.3.1. gdb Command Line#

gdb [options] --args executable-file [inferior-arguments ...]

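9.3.2. Options#

  1. -x file: read gdb commands from a file

  2. -ex COMMAND: execute a gdb command

  3. --args exe [exe-args]: pass arguments to exe

  4. --pid <pid>: attach to a running process

9.3.3. gdb Commands#

  1. bt: show the call stack

  2. run: run the executable

  3. l: list the source code

  4. p: print a variable

9.3.4. Example (Debugging test.py)#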

$ gdb  -ex run --args $(which python3) test.py

9.3.5. gdb Control Commands#

The following commands save the gdb session to a file, which can be useful when submitting an issue.

  1. Echo the executed gdb commands: set trace-commands on
    ref https://sourceware.org/gdb/onlinedocs/gdb/Messages_002fWarnings.html

  2. Turn off pagination: set pagination off
    ref https://sourceware.org/gdb/onlinedocs/gdb/Screen-Size.html

  3. Set the log file and enable logging: set logging file my.logs, set logging enable on
    ref https://sourceware.org/gdb/download/onlinedocs/gdb/Logging-Output.html

9.3.6. Python Plugin for gdb#

After installing python3-dbg on Ubuntu, the directory /usr/share/gdb/auto-load/usr/bin/ contains the following plugins:

$ ls /usr/share/gdb/auto-load/usr/bin/
python3.10-dbg-gdb.py  python3.10dm-gdb.py  python3.10-gdb.py  python3.10m-gdb.py

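They define py-bt, a command that displays the Python call stack.

If gdb does not load the plugin automatically at startup (it is unclear why this happens), it can be loaded manually:

source /usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py

Alternatively, add this line to the gdb initialization file $HOME/.config/gdb/gdbinit or to .gdbinit in the current directory: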

source /usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py

9.3.7. Using commands from file#

Create a file cmd.txt with the following content:

source /usr/share/gdb/auto-load/usr/bin/python3.10m-gdb.py
set trace-commands on
set pagination off
set logging file my.logs
set logging enable on
py-bt
exit

Run gdb with the -x option to execute the commands above automatically and then exit.

gdb -x cmd.txt -p <pid>
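
When several processes need to be inspected, the command line can be assembled from Python. This is only a sketch: the helper names and the process id 12345 are placeholders, and the plugin source line is left out because its path depends on the system.

```python
import shlex

# gdb commands from the control-command list, plus py-bt and exit.
GDB_COMMANDS = [
    "set trace-commands on",
    "set pagination off",
    "set logging file my.logs",
    "set logging enable on",
    "py-bt",
    "exit",
]


def command_file_text(commands=GDB_COMMANDS):
    """Return the content of a gdb command file, one command per line."""
    return "\n".join(commands) + "\n"


def gdb_attach_argv(pid, command_file="cmd.txt"):
    """Build the argument list for 'gdb -x <file> -p <pid>'."""
    return ["gdb", "-x", str(command_file), "-p", str(pid)]


argv = gdb_attach_argv(12345)  # 12345 is a placeholder process id
print(shlex.join(argv))
print(command_file_text(), end="")

# To actually run it, write command_file_text() to cmd.txt and then call:
#     import subprocess
#     subprocess.run(argv, check=False)
```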

9.4. Debugging Parallel Programs#

9.4.1. The PETSc Option -start_in_debugger#

Reference:

  1. https://petsc.org/main/manualpages/Sys/PetscInitialize/

  2. https://petsc.org/main/manualpages/Sys/PetscSetDebugTerminal/

The PETSc option -start_in_debugger starts a debugger for each process:

mpiexec -n 3 $(which python) test.py  -start_in_debugger

By default, this opens multiple xterm windows.

Tip

Changing the appearance of the xterm windows

Reference: http://www.futurile.net/2016/06/14/xterm-setup-and-truetype-font-configuration/

$ cat ~/.Xdefaults
xterm*faceName: Monospace
xterm*faceSize: 12
xterm*foreground: rgb:a8/a8/a8
xterm*background: rgb:00/00/00

9.4.2. The tmux-mpi Tool#

Reference:

  1. firedrakeproject/firedrake

  2. Tips of Firedrake Wiki: firedrakeproject/firedrake

Alternatively, the tool tmux-mpi can be used.

9.4.2.1. Installing tmux-mpi#

  1. Install tmux

    sudo apt-get install tmux
    
  2. Install dtach (a dependency of tmux-mpi)

    Build dtach first, then copy the binary to a directory on PATH, such as $HOME/bin.

    git clone https://github.com/crigler/dtach
    cd dtach
    ./configure
    make
    mkdir -p $HOME/bin
    cp dtach $HOME/bin
    export PATH=$PATH:$HOME/bin
    

    Run which dtach to confirm that the installation succeeded.

  3. Install tmux-mpi

    Install with pip:

    pip install --upgrade --no-cache-dir git+https://github.com/wrs20/tmux-mpi@master
    

9.4.2.2. Debugging Commands#

  1. Start the debugger

    tmux-mpi 3 gdb -ex run --args $(which python) test.py
    
  2. Attach to the corresponding pseudo-terminal; each process has its own window. (This is a single tmux session with multiple windows.)

    tmux attach -t tmux-mpi
    
  3. Debug with gdb commands

9.5. Compiling pyx Files#

9.5.1. pyx Files in firedrake#

Recompile dmcommon.pyx after modifying it:

python setup.py build_ext --inplace