Summary of common GitLab problems

Looks like there is a lot to learn and a lot to do

Posted by 大狗 on October 11, 2021


Notes on problems I have run into and how I solved them.

1.1 Pushing an image fails

Pushing to the Aliyun registry fails. There are two likely causes:

  1. You are not logged in; log in first.
  2. The image name is too long. An image name should look like repo_name/namespace/image_name:tag; if image_name itself is very long, e.g. /xxx/yyy/zzz/ddd/aaa/eee/ccc, the push fails with "requested access to the resource is denied". This is a problem on the Aliyun side. (A re-tagging sketch follows the log below.)

The push refers to repository [registry.qcraftai.com/qcraft/sim_server]
...
21639b09744f: Waiting
78220a8ee18f: Waiting
c65cd6950943: Waiting
29c579ad1c5a: Waiting
767a7c7801b5: Waiting
a24b8da85c42: Waiting
denied: requested access to the resource is denied
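A minimal sketch of the usual remedies: log in, then re-tag to a shorter name and push again. The registry address, namespace, and image names below are placeholders, not the real ones from the log above.

# log in to the registry (cause 1)
docker login registry.example.com

# re-tag under a shorter name (cause 2) and push again
docker tag registry.example.com/ns/very/deep/path/image:tag \
           registry.example.com/ns/short-name:tag
docker push registry.example.com/ns/short-name:tag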

1.2 format.sh passes locally but CI still fails

The cause is that the dev docker is too old: the newer format.sh behaves differently, so simply update the dev docker. (A sketch of the fix follows the log.)

[ OK ] Congrats, commit author check pass
[ OK ] Done buildifier /builds/root/qcraft/offboard/dashboard/BUILD
[INFO] Done formatting /builds/root/qcraft/offboard/dashboard/health.proto
[ OK ] Done buildifier /builds/root/qcraft/offboard/dashboard/services/health/BUILD
[INFO] Done formatting /builds/root/qcraft/offboard/dashboard/services/health/health.cc
[INFO] Done formatting /builds/root/qcraft/offboard/dashboard/services/health/health.h
[INFO] Done formatting /builds/root/qcraft/offboard/dashboard/services/health/health_client.cc
[INFO] Done formatting /builds/root/qcraft/offboard/dashboard/sim_server_main.cc
[ OK ] Done formatting /builds/root/qcraft/production/k8s/offboard/dashboard/sim_server_v2/deploy.sh
[ERROR] Format issue found, please run "scripts/format.sh --git" before commit
 offboard/dashboard/services/health/health_client.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
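A rough sketch of the fix. The way the dev docker is refreshed here (a docker pull of a placeholder image name) is an assumption; the real update flow may be a script in the repo.

# refresh the dev docker image (placeholder name)
docker pull registry.example.com/dev/dev-docker:latest

# then re-run the check that the CI message asks for
scripts/format.sh --git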

1.3 Job stuck in pending

A job sits in pending forever; is the cluster just busy, so the job never gets picked up? What could cause this? There are several possibilities:

  • The master issues the upgrade command and the pod starts to come up: pulling the image, mounting devices, and so on. If the image registry is broken and the image can never be pulled, the job stays pending forever; for example, on the morning of 2021/11/11 the Aliyun registry had an outage and images could not be pulled. As a side note, the CI image is specified on the master side via CN_IMG/USA_IMG. (A diagnosis sketch follows the log below.)
Waiting for pod gitlab-runner/runner-u15fg-a7-project-4-concurrent-0tw4c2 to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod gitlab-runner/runner-u15fg-a7-project-4-concurrent-0tw4c2 to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod gitlab-runner/runner-u15fg-a7-project-4-concurrent-0tw4c2 to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod gitlab-runner/runner-u15fg-a7-project-4-concurrent-0tw4c2 to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
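A quick way to see why the runner pod is stuck, sketched below; the gitlab-runner namespace comes from the log above, the pod name is a placeholder.

# list runner pods and their phase
kubectl -n gitlab-runner get pods

# inspect the stuck pod; the Events section at the bottom usually names the cause
# (image pull failure, volume mount failure, unschedulable, ...)
kubectl -n gitlab-runner describe pod runner-xxxxx-project-4-concurrent-0abcde

# recent events in the namespace, newest last
kubectl -n gitlab-runner get events --sort-by=.metadata.creationTimestamp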

1.4 Errors caused by the code-protection software

This error is caused by the code-protection software. It seems to have a quirk: once too many entries are added to the whitelist, the whitelist stops taking effect, and gitlab-runner then fails to pull code with the error below. Today the whitelist entry was added but did not take effect, because the code-protection client on the gitlab-runner machine had died and never synced the configuration, so the code could not be pulled.

[580s] can not find refs/pipelines/90402, retry later ...
fatal: unable to access 'https://gitlab-cn.qcraftai.com/root/qcraft.git/': error:1408F10B:SSL routines:ssl3_get_record:wrong version number
[585s] can not find refs/pipelines/90402, retry later ...
fatal: unable to access 'https://gitlab-cn.qcraftai.com/root/qcraft.git/': error:1408F10B:SSL routines:ssl3_get_record:wrong version number
[590s] can not find refs/pipelines/90402, retry later ...
fatal: unable to access 'https://gitlab-cn.qcraftai.com/root/qcraft.git/': error:1408F10B:SSL routines:ssl3_get_record:wrong version number
[595s] can not find refs/pipelines/90402, retry later ...
fatal: unable to access 'https://gitlab-cn.qcraftai.com/root/qcraft.git/': error:1408F10B:SSL routines:ssl3_get_record:wrong version number
[600s] can not find refs/pipelines/90402, retry later ...
can not find refs/pipelines/90402, exit!
Cleaning up file based variables
00:01
ERROR: Job failed: command terminated with exit code 1
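The "wrong version number" SSL error usually means the TLS handshake is being answered by something that is not speaking TLS (for example an intercepting proxy replying in plain HTTP). Two generic checks, sketched here, can confirm whether the runner machine actually reaches the real GitLab endpoint:

# does the endpoint answer with a proper TLS handshake?
openssl s_client -connect gitlab-cn.qcraftai.com:443 -servername gitlab-cn.qcraftai.com </dev/null

# verbose HTTPS request; look at the handshake and any proxy-related output
curl -v https://gitlab-cn.qcraftai.com/ -o /dev/null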

1.5 A strange sim-server error

Deploying sim-server through GitLab failed today, and the place and message of the error had nothing to do with the real cause: a mistake in the config file.

           - name: GITLAB_ACCESS_TOKEN
              valueFrom:  # this line was duplicated by mistake
              valueFrom:
                secretKeyRef:
                  name: gitlab-access-token

The actual error is below:

...
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  3m7s                default-scheduler  Successfully assigned staging/sim-server-grey-6bc7997545-mx4t8 to cn-zhangjiakou.172.20.2.197
  Warning  Unhealthy  90s                 kubelet            Liveness probe failed: OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused "process_linux.go:101: executing setns process caused \"exit status 1\"": unknown
  Warning  Unhealthy  84s (x2 over 94s)   kubelet            Readiness probe failed:
  Normal   Started    67s (x3 over 99s)   kubelet            Started container sim-server
  Warning  Unhealthy  60s                 kubelet            Liveness probe failed:
  Warning  Unhealthy  59s                 kubelet            Readiness probe errored: rpc error: code = Unknown desc = container not running (b47a56e00cddda91ee7e446a77ece77d8c6955320dd11a031a3d8c71372ab3d4)
  Warning  BackOff    44s (x5 over 82s)   kubelet            Back-off restarting failed container
  Normal   Pulling    28s (x4 over 3m5s)  kubelet            Pulling image "registry.qcraftai.com/global/sim_server:c8ad37ec94ebfda0a08106f77210cb92ed67387d"
  Normal   Pulled     28s (x4 over 101s)  kubelet            Successfully pulled image "registry.qcraftai.com/global/sim_server:c8ad37ec94ebfda0a08106f77210cb92ed67387d"
  Normal   Created    27s (x4 over 100s)  kubelet            Created container sim-server

A related problem someone asked about: initializing tools fails with

[INFO] Start goofys mounting from /qcraftroaddata to /media/s3/run_data_2
2021/12/20 11:02:22.424581 main.FATAL Unable to mount file system, see syslog for details

The syslog shows:

Dec 20 11:18:28 yanguodong /usr/local/bin/goofys[7687]: main.ERROR Unable to setup backend: SharedConfigLoadError: failed to load config file, /home/qcraft/.aws/credentials#012caused by: INIParseError: invalid state with ASTKind {completed_stmt {0 NONE 0 []} false [{section_stmt {1 STRING 0 [78 111 110 101]} true []}]} and TokenType {4 NONE 0 [58]}
Dec 20 11:18:28 yanguodong /usr/local/bin/goofys[7687]: main.FATAL Mounting file system: Mount: initialization failed

The cause was an accidentally edited /home/qcraft/.aws/credentials: either restore the file, or redo the setup by deleting it and running aws configure again.
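For reference, a minimal sketch of regenerating the file and what a healthy credentials file looks like; the profile name and key values are placeholders:

# regenerate the file interactively
aws configure

# /home/qcraft/.aws/credentials should end up in plain INI form, e.g.:
# [default]
# aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
# aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx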

1.6 Installing and upgrading CUDA and the NVIDIA driver

Our CUDA runtime library is already 11.3, but newly created CentOS 7 machines ship without a driver, so the driver has to be installed by hand. Reference: https://blog.csdn.net/JimmyOrigin/article/details/112972883

Events:
  Type     Reason     Age    From               Message
  ----     ------     ----   ----               -------
  Normal   Scheduled  2m36s  default-scheduler  Successfully assigned gitlab-runner/runner-fcakjyg6-project-4-concurrent-0dgs2w to cn-gpu03016ack
  Normal   Pulled     2m29s  kubelet            Container image "registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-58ba2b95" already present on machine
  Normal   Created    2m29s  kubelet            Created container init-permissions
  Normal   Started    2m29s  kubelet            Started container init-permissions
  Normal   Pulling    2m28s  kubelet            Pulling image "registry.qcraftai.com/global/qcraft-ci:dev-libgit-20211214_0012"
  Normal   Pulled     91s    kubelet            Successfully pulled image "registry.qcraftai.com/global/qcraft-ci:dev-libgit-20211214_0012" in 56.658518838s
  Normal   Created    81s    kubelet            Created container build
  Warning  Failed     81s    kubelet            Error: failed to start container "build": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.3, please update your driver to a newer version, or use an earlier cuda container: unknown
  Normal   Pulled     81s    kubelet            Container image "registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-58ba2b95" already present on machine
  Normal   Created    81s    kubelet            Created container helper
  Normal   Started    81s    kubelet            Started container helper

Installing the GPU driver and CUDA on a CentOS 7 machine:

# Log in to the CentOS 7.9 system as root and remove the old NVIDIA driver
yum remove nvidia*
# A reboot is needed to unload the nvidia kernel modules that are currently loaded
shutdown -r now
# Most install guides say nouveau must be disabled first, but on a stock CentOS 7
# lsmod showed no nouveau module at all, so that step is skipped here
# Download the NVIDIA driver file

# Update dependencies and packages
yum update
yum groupinstall "Development Tools"
yum install kernel-devel epel-release
# Make sure the two kernel versions below match; if not, run: yum -y upgrade kernel kernel-devel
uname -r
rpm -q kernel-devel
# Run the driver installer: answer "no" to installing the 32-bit NVIDIA libraries,
# answer "yes" to updating the X configuration, and accept the defaults for the rest
chmod +x ./NVIDIA-Linux-x86_64-470.94.run
./NVIDIA-Linux-x86_64-470.94.run

# Check which CUDA version the driver supports
nvidia-smi
# Example output:
[root@cn-gpu03016ack ~]# nvidia-smi
Tue Dec 21 13:06:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.94       Driver Version: 470.94       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:08.0 Off |                  N/A |
| 15%   41C    P0    64W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
# Download the matching CUDA version
wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda_11.4.0_470.42.01_linux.run

# Run the installer. A few things to note: a license text is shown and you must type accept;
# in the CUDA installer menu, deselect the driver component (the driver is already installed), then choose Install
sh cuda_11.4.0_470.42.01_linux.run

# Add CUDA to the environment (persist it in ~/.bashrc)
echo 'export PATH=/usr/local/cuda-11.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.4/lib64' >> ~/.bashrc
source ~/.bashrc

# Verify CUDA is installed; this should report version 11.4
nvcc -V

# Step one removed nvidia-docker and related packages, so the yum repo has to be added back
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Refresh the yum cache
yum clean expire-cache
# Install nvidia-docker
yum install -y nvidia-docker2
# Restart docker
systemctl restart docker

The complete initialization script I ended up with is below:

#!/bin/bash

set -e

function add_current_script_to_rc_local() {
  script_path=$(realpath $0)
  echo "Adding boot shell $script_path"
  echo "$script_path" >> /etc/rc.d/rc.local
}

function remove_current_script_from_rc_local() {
  script_path=$(realpath $0)
  echo "Removing boot shell $script_path"
  sed -i "s#$script_path##g" /etc/rc.d/rc.local
}

function remove_nvidia_driver() {
  echo "Remove existing driver"
  echo "Remove existing nvidia driver container cli"
  yum remove -y nvidia*

  # Use default nvidia-uninstall bin to perform uninstall
  if [[ -x "/usr/bin/nvidia-uninstall" ]]; then
    echo "Remove existing nvidia driver using nvidia-uninstall"
    /usr/bin/nvidia-uninstall --silent
  fi
}

function create_nvidia_cuda_folder() {
  echo "Create folder /root/nvidia_cuda"
  if [[ ! -d "/root/nvidia_cuda" ]]; then
    mkdir "/root/nvidia_cuda"
  fi
}

function get_nvidia_driver_version() {
  echo $(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
}

function download_nvidia_11_4_0_470_42_01() {
  wget -O cuda_11.4.0_470.42.01_linux.run "https://qcraft-images.oss-cn-zhangjiakou-internal.aliyuncs.com/cuda_470_42_01/cuda_11.4.0_470.42.01_linux.run"
  cuda_md5=$(md5sum ./cuda_11.4.0_470.42.01_linux.run | awk '{print $1}')
  if [[ $cuda_md5 == "cbcc1bca492d449c53ab51c782ffb0a2" ]]; then
    echo "Download cuda successfull" >> /root/nvidia_cuda/install.log
  else
    echo "Download cuda failed" >> /root/nvidia_cuda/install.log
    exit 1
  fi
}

function install_driver_needed_files() {
  echo "Installing needed header files" >> /root/nvidia_cuda/install.log
  yum update -y
  yum groupinstall -y "Development Tools"
  yum install -y kernel-devel epel-release
}

function install_nvidia_cuda_driver() {
  driver_installed=$(get_nvidia_driver_version)
  if [[ $driver_installed == "470.42.01" ]]; then
    echo "Already install nvidia driver 470.42.01, Abort install" >> /root/nvidia_cuda/install.log
    exit 0
  else
    echo "Planning to install 470.42.01 cuda & driver" >> /root/nvidia_cuda/install.log
    install_driver_needed_files
    download_nvidia_11_4_0_470_42_01
    echo "Start to install 470.42.01 cuda & driver" >> /root/nvidia_cuda/install.log
    chmod +x ./cuda_11.4.0_470.42.01_linux.run
    ./cuda_11.4.0_470.42.01_linux.run --silent
    echo "Success installed 470.42.01 cuda & driver" >> /root/nvidia_cuda/install.log
  fi
}

function install_nvidia_docker_2() {
  echo "Installing nvidia-docker 2" >> /root/nvidia_cuda/install.log
  distribution=$(
    . /etc/os-release
    echo $ID$VERSION_ID
  ) && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
  yum clean expire-cache
  yum install -y nvidia-docker2
  echo "Installed nvidia-docker 2" >> /root/nvidia_cuda/install.log
}

function overwrite_daemon_json() {
  wget -O daemon.json "https://qcraft-images.oss-cn-zhangjiakou-internal.aliyuncs.com/cuda_470_42_01/daemon.json"
  json_md5=$(md5sum ./daemon.json | awk '{print $1}')
  if [[ $json_md5 == "8f8be065977394c8c75c0b4c23a2258d" ]]; then
    echo "Download daemon.json successfull" >> /root/nvidia_cuda/install.log
  else
    echo "Download daemon.json failed" >> /root/nvidia_cuda/install.log
    exit 1
  fi
  mv -f daemon.json /etc/docker/daemon.json
}

function modify_docker_sock_permission() {
  if [[ -f "/var/run/docker.sock" ]]; then
    echo "Modify docker.sock permission to 666"
    chmod 666 /var/run/docker.sock
  fi
}

if [ ! -d "/root/nvidia_cuda" ]; then
  remove_nvidia_driver
  create_nvidia_cuda_folder
  add_current_script_to_rc_local
  echo "Finish fisrt stage: remove current nvidia driver & add current stage to boot" >> /root/nvidia_cuda/install.log
  shutdown -r now
else
  remove_current_script_from_rc_local
  cd /root/nvidia_cuda
  install_nvidia_cuda_driver
  install_nvidia_docker_2
  overwrite_daemon_json
  echo "Finish second stage: install nvidia driver & cuda & nvidia docker 2 & daemon.json" >> /root/nvidia_cuda/install.log
fi

modify_docker_sock_permission

echo "Install nvidia driver & cuda success" >> /root/nvidia_cuda/install.log

Although the error here says driver name nasplugin.csi.alibabacloud.com not found, the real cause is that the daemonset that should run on every node never started, so the corresponding daemon pods have to be brought back up. This happened because in the previous step, while updating the driver, I removed every nvidia-related component and driver, which also removed nvidia-docker2, hence this error in the log:

Warning  FailedCreatePodSandBox  4m2s (x4 over 4m5s)    kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "csi-plugin-n8wgn": Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/238bd4e5cf01dd61498969daacae01454eb59624e9e11732d4bd8aa356fcbaec/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown

The error as it surfaced in the job:

Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    3m9s                  default-scheduler  Successfully assigned gitlab-runner/runner-cm89m7vp-project-4-concurrent-0drj7g to cn-gpu03016ack
  Warning  FailedMount  2m6s (x8 over 3m10s)  kubelet            MountVolume.MountDevice failed for volume "gitlab-runner-pvc-bazel-distdir" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name nasplugin.csi.alibabacloud.com not found in the list of registered CSI drivers
  Warning  FailedMount  2m6s (x8 over 3m10s)  kubelet            MountVolume.MountDevice failed for volume "gitlab-runner-pvc-bazel-repo-cache" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name nasplugin.csi.alibabacloud.com not found in the list of registered CSI drivers
  Warning  FailedMount  2m6s (x8 over 3m10s)  kubelet            MountVolume.MountDevice failed for volume "gitlab-runner-pvc-qcraft-maps-china" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name nasplugin.csi.alibabacloud.com not found in the list of registered CSI drivers
  Warning  FailedMount  67s                   kubelet            Unable to attach or mount volumes: unmounted volumes=[bazel-distdir qcraft-maps-china bazel-repo-cache], unattached volumes=[bazel-distdir qcraft-maps-china default-token-g2v9p docksock logs hosthostname aws repo bazel-repo-cache scripts]: timed out waiting for the condition

Even after installing the NVIDIA driver, CUDA, and nvidia-docker, you may still find that jobs do not run on the GPU; in that case /etc/docker/daemon.json needs to be changed to the content below.

One more caveat: even with the GPU in place, the tests are not necessarily rerun (note the --cache_test_results=no flag further below).

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "registry-mirror": [
        "https://registry.docker-cn.com"
    ],
    "exec-opts": ["native.cgroupdriver=systemd"],
    "live-restore": true,
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "50m",
        "max-file": "5"
    },
    "bip": "169.254.123.1/24",
    "registry-mirrors": ["https://pqbap4ya.mirror.aliyuncs.com"]
}
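After editing daemon.json, a quick way to check that the nvidia runtime is actually being used, sketched below; the CUDA image tag is only an example:

# reload docker so the new daemon.json takes effect
systemctl restart docker

# the default runtime should now be nvidia
docker info | grep -i runtime

# a container should see the GPU (image tag is an example)
docker run --rm nvidia/cuda:11.4.0-base-centos7 nvidia-smi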

The failing test output looks like:

  PASSED  ] 0 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] list_add_test.ListAdd
[  FAILED  ] list_add_test.ListAddHalf
 2 FAILED TESTS
================================================================================
FAIL: //onboard/nets/custom_ops:gen_coordinates_test (see /home/qcrafter/.cache/bazel/_bazel_qcrafter/b7b2ac012bd759fe3fc931a5a52099ce/execroot/com_qcraft/bazel-out/k8-opt/testlogs/onboard/nets/custom_ops/gen_coordinates_test/test.log)
INFO: From Testing //onboard/nets/custom_ops:gen_coordinates_test:
==================== Test output for //onboard/nets/custom_ops:gen_coordinates_test:
[NVBLAS] No Gpu available
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is NOT set : relying on default config filename 'nvblas.conf'
[NVBLAS] Cannot open default config file 'nvblas.conf'
[NVBLAS] Config parsed
[NVBLAS] CPU Blas library need to be provided

Here is another strange problem: a job fails because it cannot find the GPU, yet executing the same thing inside docker succeeds (see that job). Why? To investigate, I dropped bazel run -c opt //onboard/nets/custom_ops:multiply_value_test into that job, and the pattern showed up: bazel run succeeds while bazel test fails... There is a similar issue at https://github.com/bazelbuild/rules_nodejs/issues/2325. Why does this happen? Because whether the cpu_only_flag is used is specified explicitly; to be able to skip this job for now, first

bazel run -c opt //onboard/nets/custom_ops:multiply_value_test

bazel test --cache_test_results=no -c opt --config=nolint $CPU_ONLY_PARAM --test_tag_filters=hxn -- //...

These are the two commits involved.

# Result of running inside Docker
qcrafter@runner-yeb5xrzz-project-4-concurrent-0st2sc:/qcraft$ bazel run -c opt //onboard/nets/custom_ops:multiply_value_test
INFO: Invocation ID: 99fdcd7d-047b-4086-a572-722da639841f
INFO: Analyzed target //onboard/nets/custom_ops:multiply_value_test (9 packages loaded, 234 targets configured).
INFO: Found 1 target...
Target //onboard/nets/custom_ops:multiply_value_test up-to-date:
  bazel-bin/onboard/nets/custom_ops/multiply_value_test
INFO: Elapsed time: 6.002s, Critical Path: 0.36s
INFO: 196 processes: 65 remote cache hit, 130 internal, 1 processwrapper-sandbox.
INFO: Build completed successfully, 196 total actions
INFO: Build completed successfully, 196 total actions
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //onboard/nets/custom_ops:multiply_value_test
-----------------------------------------------------------------------------
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is NOT set : relying on default config filename 'nvblas.conf'
[NVBLAS] Cannot open default config file 'nvblas.conf'
[NVBLAS] Config parsed
[NVBLAS] CPU Blas library need to be provided
Running main() from gmock_main.cc
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from multiply_value_test
[ RUN      ] multiply_value_test.MultiplyValue
[       OK ] multiply_value_test.MultiplyValue (117 ms)
[ RUN      ] multiply_value_test.MultiplyValueHalf
[       OK ] multiply_value_test.MultiplyValueHalf (0 ms)
[----------] 2 tests from multiply_value_test (117 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (117 ms total)
[  PASSED  ] 2 tests.

# Job log from the CI node
FAIL: //onboard/nets/custom_ops:multiply_value_test (see /home/qcrafter/.cache/bazel/_bazel_qcrafter/b7b2ac012bd759fe3fc931a5a52099ce/execroot/com_qcraft/bazel-out/k8-opt/testlogs/onboard/nets/custom_ops/multiply_value_test/test.log)
INFO: From Testing //onboard/nets/custom_ops:multiply_value_test:
==================== Test output for //onboard/nets/custom_ops:multiply_value_test:
[NVBLAS] No Gpu available                # the real cause of this failure may be how the target was built
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is NOT set : relying on default config filename 'nvblas.conf'
[NVBLAS] Cannot open default config file 'nvblas.conf'
[NVBLAS] Config parsed
[NVBLAS] CPU Blas library need to be provided
Running main() from gmock_main.cc
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from multiply_value_test
[ RUN      ] multiply_value_test.MultiplyValue
onboard/nets/custom_ops/multiply_value_test.cc:55: Failure
The difference between result[i] and kOutputResult[i] is 7350900951613439, which exceeds kMaxDiffFloat, where
result[i] evaluates to 7350900951613440,
kOutputResult[i] evaluates to 1, and
kMaxDiffFloat evaluates to 9.9999999747524271e-07.
onboard/nets/custom_ops/multiply_value_test.cc:55: Failure
The difference between result[i] and kOutputResult[i] is 4, which exceeds kMaxDiffFloat, where
result[i] evaluates to 3.0677225980998895e-41,
kOutputResult[i] evaluates to 4, and
kMaxDiffFloat evaluates to 9.9999999747524271e-07.
onboard/nets/custom_ops/multiply_value_test.cc:55: Failure
The difference between result[i] and kOutputResult[i] is 5.9999999997455671, which exceeds kMaxDiffFloat, where
result[i] evaluates to 2.544326138664843e-10,
kOutputResult[i] evaluates to 6, and
kMaxDiffFloat evaluates to 9.9999999747524271e-07.
onboard/nets/custom_ops/multiply_value_test.cc:55: Failure
The difference between result[i] and kOutputResult[i] is 4, which exceeds kMaxDiffFloat, where
...
kOutputResultHalf[i] evaluates to 18, and
kMaxDiffHalf evaluates to 0.00039999998989515007.
[  FAILED  ] multiply_value_test.MultiplyValueHalf (0 ms)
[----------] 2 tests from multiply_value_test (1 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (1 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] multiply_value_test.MultiplyValue
[  FAILED  ] multiply_value_test.MultiplyValueHalf
 2 FAILED TESTS

1.7 Formatting an Aliyun machine's data disk as xfs

The data disk is /dev/vdb and is formatted as ext4 by default.

# Remove the relevant label from the machine so that jobs/pods/daemonsets stop being
# scheduled onto it (a kubectl sketch for this step follows the block)

# umount everything that is using vdb; even after this you may still be unable to
# mount or format the disk because some process is using it, so a reboot is needed
umount -l /dev/vdb
umount -l /var/lib/container
umount -l /var/lib/kubelet/
umount -l /var/lib/docker
# Comment out the entries that mount /dev/vdb; there are quite a few of them,
# otherwise they will be mounted again after the reboot
vim /etc/fstab
# Reboot
shutdown -r now
# Format the disk with an xfs filesystem
mkfs.xfs -f /dev/vdb
# Write the lines below into /etc/fstab so the mounts are set up correctly
/dev/vdb /var/lib/container xfs defaults 0 0
/var/lib/container/kubelet /var/lib/kubelet none defaults,bind 0 0
/var/lib/container/docker /var/lib/docker none defaults,bind 0 0
# Reboot

# At this point vdb is xfs and the machine can be registered with the Aliyun cluster again:
# go to the node list, select the machines and remove them in batch, then choose "add node
# manually"; this has to be done machine by machine, running the copied command on each one

# Re-apply the corresponding label to the machine
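A rough sketch of the label/cordon steps mentioned at the top and bottom of the block; the node name and label key are placeholders:

# stop new pods from being scheduled onto the node
kubectl cordon cn-node-example

# remove the label that makes CI schedule onto it (a trailing '-' removes a label)
kubectl label node cn-node-example ci-runner-

# ... reformat the disk as above, re-register the node, then restore the label
kubectl label node cn-node-example ci-runner=true
kubectl uncordon cn-node-example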

1.8 Estimating how many machines are needed

This usually comes down to how long the jobs actually run; the calculation itself is not hard (a worked example follows the list):

  • Measure the run time of the relevant jobs: count the jobs over the last few days and their run times, and compute an average duration t.
  • From the jobs' resource requirements, work out the current per-machine qps; roughly qps = (concurrency) / (run time).
  • Look at how job runs are distributed over the day (using the normal average run time). The 80/20 rule is a reasonable estimate here: 20% of the day carries 80% of the jobs, so all-qps = (job count * 0.8) / (total time span * 0.2).
  • Divide all-qps by qps to get the machine count, then leave about 30% headroom; that gives the total number of machines.
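A hedged worked example of the arithmetic, with entirely made-up numbers (none of these are measured values):

#!/bin/bash
# made-up inputs
jobs_per_day=5000   # jobs per day
t=0.5               # average job duration, hours
concurrency=6       # jobs one machine can run at once
day_hours=24        # time span the jobs are spread over

# per-machine throughput: concurrency / run time  -> 12 jobs/hour
qps=$(echo "$concurrency / $t" | bc -l)

# 80/20 rule: 80% of the jobs land in 20% of the day -> ~833 jobs/hour needed
all_qps=$(echo "$jobs_per_day * 0.8 / ($day_hours * 0.2)" | bc -l)

# machines = all_qps / qps, plus 30% headroom -> ~90 machines in this toy example
machines=$(echo "($all_qps / $qps) * 1.3" | bc -l)
printf "need roughly %.0f machines\n" "$machines"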

1.9 How to split jobs across runners

There are three main questions:

  1. How do you partition jobs into different runners (and how many runners) based on the resources the jobs use at run time?
  2. Assuming the runners are split correctly per question 1, how should the per-job resources of each runner be set?
  3. With the first two solved, how should each runner's concurrency be set?

The first two questions feel like one and the same: a clustering problem on a scatter plot, and once the classes are fixed the numbers follow. First collect historical data on the memory each pod needed and draw the scatter plot (a data-collection sketch is at the end of this section). Then filter by range; counted within one hour the ratio came out to min/medium/large = 2/1/3, and weighing a weighted average against the maximum, the memory requests were split into 0.5G, 7G, and 17G. As for the third question, the runner concurrency is really tied to the amount of resources available, and there are two ways to analyze it:

  • Count the number and kinds of jobs running concurrently over an afternoon and derive the runner concurrency from that.
  • Simply compute how many jobs can run at once from the machines' maximum capacity.

The catch is that the second method is purely greedy; combining the two is more reasonable. Looking at our jobs, they actually vary enormously. Weighting simply by the small:medium:large ratio and the run times gave a ratio of 15:30:60; after one busy afternoon the cluster turned out to hold roughly 15:30:70, which is essentially the limit.

Our resource utilization is still not great: on roughly four or five machines the requested resources sit at 90% while actual usage is only about 50%, leaving about 30% of slack.
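A rough sketch of collecting the raw memory data for the scatter plot mentioned above; the namespace and output file are assumptions, and this presumes metrics-server is installed so that kubectl top works:

# sample runner pod memory usage every minute and append to a CSV for later plotting
while true; do
  kubectl -n gitlab-runner top pods --no-headers \
    | awk -v ts="$(date +%s)" '{print ts "," $1 "," $3}' >> pod_memory.csv
  sleep 60
done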

1.10 A strange git problem

We have a huge map repository stuffed with git lfs files. Because the repository is so large, I only pulled the single branch I care about, i.e. cloned with the --single-branch option. In the last couple of days a strange problem appeared: git push fails with

Locking support detected on remote "origin". Consider enabling it with:
  $ git config lfs.https://xxxxxxx.git/info/lfs.locksverify true
ref HEAD:: missing object: 003d38e4b3d22ee0610d9ec800bca7ce3cf70ef9
Uploading LFS objects: 100% (125882/125882), 16 GB | 3.9 MB/s, done.
error: failed to push some refs to 'xxxxxxxxx.git'
git push [MAP-CI-DAILY/20221126_0252] failed.

I went through some related issues: https://github.com/git-lfs/git-lfs/issues/3587, https://stackoverflow.com/questions/70923109/git-lfs-missing-object-on-push-even-if-it-shouldnt-be

Running the command below fixed that one push, but the problem was back the next day; in short, it shows up whenever the repository is cloned with a single branch only.

 git repack -adf

What to do... it was a real headache. In the end I switched to a newer version of git, stopped pulling only a single branch and fetched all branches instead; after that, tracking again worked without problems.
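For reference, a sketch of turning an existing single-branch clone into a full one (standard git commands; the remote name origin is assumed):

# widen the fetch refspec so all branches are fetched, not just the single one
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*"
git fetch origin

# make sure the LFS objects for the fetched refs are present before pushing
git lfs fetch --all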

Closing

Sigh. Awkward.

狗头的赞赏码.jpg