Most commands in this article were run as root, after switching with sudo -i.
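1   Test projects
2   VideoPose3D
2.1   Training and inference on custom videos
For the setup procedure, see: Inference in the wild
2.1.1   Install ffmpeg
2.1.2   Install detectron2
For the setup procedure, see: Installation
    # Option 1: install directly from a pinned commit
    pip3.7 install 'git+https://github.com/facebookresearch/detectron2.git@d779ea63faa54fe42b9b4c280365eaafccb280d6'
    # Option 2: clone, check out the same commit, and install in editable mode
    # (git clone -b accepts only branch or tag names, not a commit hash)
    git clone https://github.com/facebookresearch/detectron2.git
    git -C detectron2 checkout d779ea63faa54fe42b9b4c280365eaafccb280d6
    python3.7 -m pip install -e detectron2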
2.1.3   Download the pretrained model
    cd checkpoint
    wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_detectron_coco.bin
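2.1.4   Infer 2D keypoints
Create two folders inside the inference folder: input_directory and output_directory. input_directory holds the custom videos to process; output_directory holds the 2D keypoint data file (extension .npz) that the program generates for each video.
Put the videos into VideoPose3D/inference/input_directory/ before running any of the commands below. If the videos are not in place first, the final run.py step fails with KeyError: output.mp4. Other symptoms of this mistake: the "Infer 2D keypoints" step prints no ffmpeg video-processing logs; the "Create a custom dataset" step does not report which xxx.npz files or how many frames it processed and prints only "saving" and "done"; and the data_2d_custom_myvideos.npz file generated in the data directory is only 798 bytes (under 1 KB), which makes it an invalid dataset.
Run the following to infer the 2D keypoints:
    cd inference
    python3.7 infer_video_d2.py \
        --cfg COCO-Keypoints/keypoint_rcnn_R_101_FPN_3x.yaml \
        --output-dir output_directory \
        --image-ext mp4 \
        input_directory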
Create a custom dataset: from the per-video 2D keypoint .npz files in VideoPose3D/inference/output_directory/, the program generates data_2d_custom_myvideos.npz in the data directory.
    cd data
    python3.7 prepare_data_2d_custom.py -i /home/nvidia/VideoPose3D/inference/output_directory/ -o myvideos
    cd ..
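The failure mode described above (an empty input_directory leading to a 798-byte dataset file) can be caught before run.py is invoked. The following is a minimal sketch, assuming the directory layout used in this section; the helper names and the 1 KB threshold are illustrative choices and are not part of VideoPose3D.

```python
from pathlib import Path


def find_videos(input_dir, ext="mp4"):
    """Return the sorted video files with the given extension in input_dir."""
    return sorted(Path(input_dir).glob(f"*.{ext}"))


def dataset_looks_empty(npz_path, min_bytes=1024):
    """True if the dataset file is missing or smaller than min_bytes.

    The 1 KB default is an assumption based on the 798-byte file
    observed when no video had been processed.
    """
    path = Path(npz_path)
    return (not path.is_file()) or path.stat().st_size < min_bytes


if __name__ == "__main__":
    root = Path("/home/nvidia/VideoPose3D")
    if not find_videos(root / "inference" / "input_directory"):
        print("No videos found: copy them into input_directory first")
    if dataset_looks_empty(root / "data" / "data_2d_custom_myvideos.npz"):
        print("Dataset is missing or suspiciously small: rerun the steps above")
```

Running it once before the inference step and once after prepare_data_2d_custom.py gives an early warning instead of a KeyError at the rendering step.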
Render the custom video and export the coordinates: output.mp4 is written to the VideoPose3D root directory, not to inference/output_directory/.
    python3.7 run.py -d custom -k myvideos -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_detectron_coco.bin --render --viz-subject test_video.mp4 --viz-action custom --viz-camera 0 --viz-video /home/nvidia/VideoPose3D/inference/input_directory/test_video.mp4 --viz-output output.mp4 --viz-size 6
2.2   Training and inference on Human3.6M videos
For the dataset setup steps, see: Dataset setup
Prepare the data files: create an h36m folder under the data directory and copy the 3D dataset archives into it. The directory layout is:
    data/
    └── h36m/
        ├── Poses_D3_Positions_S1.tgz
        ├── Poses_D3_Positions_S5.tgz
        ├── Poses_D3_Positions_S6.tgz
        ├── Poses_D3_Positions_S7.tgz
        ├── Poses_D3_Positions_S8.tgz
        ├── Poses_D3_Positions_S9.tgz
        └── Poses_D3_Positions_S11.tgz
Process the data:
    cd /home/nvidia/data/h36m/
    for file in *.tgz; do tar -xvzf "$file"; done
    cd /home/nvidia/data
    pip3.7 install cdflib
    python3.7 prepare_data_h36m.py --from-source-cdf /home/nvidia/data/h36m/
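The extraction loop above can also be done from Python, which avoids shell quoting issues and returns the list of archives that were processed. This is a sketch using only the standard library; the function name is hypothetical, and it should only be used on archives from a trusted source.

```python
import tarfile
from pathlib import Path


def extract_archives(directory, pattern="*.tgz"):
    """Extract every archive matching pattern into directory; return their names."""
    directory = Path(directory)
    extracted = []
    for archive in sorted(directory.glob(pattern)):
        # "r:gz" opens a gzip-compressed tar archive for reading
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=directory)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    print(extract_archives("/home/nvidia/data/h36m"))
```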
Download the pretrained models:
    mkdir checkpoint
    cd checkpoint
    wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_cpn.bin
    wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_humaneva15_detectron.bin
    cd ..
Test the Human3.6M model:
    python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin
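3   HRFAE
Dependencies required by the official repository (torch must be the CUDA build, because the source code calls CUDA device methods):
    Python 3.7
    Pytorch 1.1
    Numpy
    Opencv
    TensorboardX
    Tensorboard_logger
Dependency versions that others have tested successfully (see the post "Successfully running HRFAE facial age editing"):
    Python 3.7.13
    Pytorch 1.10.2
    Numpy 1.21.5
    Opencv 4.6.0
    Tensorboard 1.14.0
    TensorboardX
    Tensorboard-logger
3.1   Setup steps for the pretrained model
Install the dependencies:
    pip3.7 install TensorboardX Tensorboard_logger
Modify the test.py code.
Download the pretrained model:
    cd ./logs/001
    ./download.sh
Run the command to generate results:
    cd HRFAE
    python3.7 test.py --config 001 --target_age 40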
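3.2   Troubleshooting a process that gets Killed
Symptom: the terminal prints "Killed" and the process exits immediately.
Cause: the PyTorch program uses too much memory and triggers the system's OOM (out-of-memory) killer.
Solution: none found. I tried enlarging virtual memory to 13 GB, but the OOM killer still fired. Even with the OOM killer disabled for the PyTorch process, monitoring showed all 4 GB of physical memory in use, and once zram usage reached 3.6 GB the whole system froze. Insufficient GPU memory may also contribute (on Jetson boards the GPU shares physical memory with the CPU).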
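Memory-management commands:
    # show the current swappiness value
    sysctl -a | grep vm.swappiness
    # set swappiness to 10 (takes effect until the next reboot)
    echo 10 > /proc/sys/vm/swappiness
    # watch memory usage
    top
    # exempt process $PID from the OOM killer (-17 disables it for that process)
    echo -17 > /proc/$PID/oom_adj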
Check the logs to see whether the system killed the process:
    sudo dmesg | tail -7
    dmesg | egrep -i -B100 'killed process'
    egrep -i 'killed process' /var/log/messages
    egrep -i -r 'killed process' /var/log
    journalctl -xb | egrep -i 'killed process'
    # on Ubuntu, /var/log/messages may first need to be enabled in the rsyslog configuration
    nano /etc/rsyslog.d/50-default.conf
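The grep commands above can be wrapped in a small script that extracts the PID and process name from each kill record. This is a sketch under the assumption that the kernel log contains lines of the form "Killed process <pid> (<name>) ..."; the function name is hypothetical.

```python
import re

# Matches kernel log lines such as:
#   "Killed process 4321 (python3.7) total-vm:..."
KILLED_RE = re.compile(r"killed process (\d+) \(([^)]+)\)", re.IGNORECASE)


def find_oom_kills(log_text):
    """Return (pid, process_name) for every 'Killed process' record in log_text."""
    return [(int(pid), name) for pid, name in KILLED_RE.findall(log_text)]


if __name__ == "__main__":
    import shutil
    import subprocess

    # Only call dmesg when it is available; it may need root to print anything.
    if shutil.which("dmesg"):
        out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        for pid, name in find_oom_kills(out):
            print(f"OOM killer terminated {name} (pid {pid})")
```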
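3.3   Setup steps for a self-trained model
3.4   Environment setup on Windows x86
Install miniconda3 (download link).
Create and activate a new virtual environment:
    conda create -n hrfae python=3.7
    conda activate hrfae
First download the CUDA builds of the torch and torchvision wheels and install them with pip install (turning off any proxy first is recommended, to avoid wasting its traffic): torch-1.13.1+cu117-cp37-cp37m-win_amd64.whl and torchvision-0.14.1+cu117-cp37-cp37m-win_amd64.whl. Do not install torch with conda install torch, which tends to cause version-dependency problems, for example: urllib3 v2.0 only supports OpenSSL 1.1.1+, currently. This error appears because the OpenSSL version is too old; installing an older urllib3 resolves it: pip install urllib3==1.23 -i https://pypi.tuna.tsinghua.edu.cn/simple
Then install the remaining libraries one by one. If conda install cannot find a library, use pip install instead, for example for Tensorboard-logger.
4   Hardware
NVIDIA Jetson TX2 NX with 16 GB eMMC, plus an additional 120 GB SSD.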
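5   Flashing the system
5.1   Before flashing
A PC running Ubuntu 18.04 LTS. Another Jetson device cannot serve as the flashing host; it must be an Ubuntu PC. An Ubuntu host inside a virtual machine such as VirtualBox or VMware does not work either: the VM cannot handle low-level drivers such as USB, so flashing keeps reporting that the board has not entered recovery mode even though it actually has.
A micro-USB data cable.
The driver package (BSP) and the sample root filesystem that match your hardware (the latest version for the TX2 NX at the time of writing is R32.7.4).
5.2   Flashing steps
Enable ssh on the Ubuntu flashing host.
Use WinSCP to copy the driver package (BSP) and the sample root filesystem to the home directory /home on the flashing host.
Flashing commands:
    sudo apt-get install qemu-user-static
    sudo apt-get install python
    # extract the BSP, then the sample root filesystem into rootfs
    sudo tar -jxvf Jetson_Linux_R32.7.4_aarch64.tbz2
    cd Linux_for_Tegra/rootfs
    sudo tar -jxpf ../../Tegra_Linux_Sample-Root-Filesystem_R32.7.4_aarch64.tbz2
    cd ..
    sudo ./apply_binaries.sh
    # create the default user (run from Linux_for_Tegra/tools)
    cd tools
    sudo ./l4t_create_default_user.sh -u nvidia -p nvidia
    # flash (run from Linux_for_Tegra)
    cd ..
    sudo ./flash.sh jetson-xavier-nx-devkit-tx2-nx mmcblk0p1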
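5.3   Cannot reach the desktop after flashing
The Jetson fails to boot into the desktop with: Failed to start nvpmodel server. and Failed to start load kernel modules
Two ways to get a terminal without the desktop environment:
ssh to the virtual IP: the Jetson's USB port exposes a virtual IP, so you can connect remotely over ssh. Connect the host PC to the Jetson with a USB data cable (some USB cables carry only power and ground and cannot transfer data), then log in to the board at 192.168.55.1: ssh nvidia@192.168.55.1
Press Ctrl+Alt+F1 through F6 to switch to virtual consoles 1-6.
Failed to start nvpmodel server is usually caused by a broken desktop environment; reinstalling the desktop environment fixes it.
    sudo apt-get install nvidia-l4t-x11
    sudo reboot
Failed to start load kernel modules usually indicates a configuration problem; rerun the update and reconfiguration.
    sudo -i
    apt-get update
    dpkg --configure -a
    apt-get dist-upgrade
    apt-get -f install
    reboot
5.4   Installing Ubuntu 20.04 on older hardware such as the TX2 NX
Two ways to install Ubuntu 20.04 on older hardware such as the Jetson TX2 NX or Jetson Nano (officially, the newest supported release is Ubuntu 18.04):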
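6   Using the SSD as the board's system disk
The NVMe SSD serves only as the system disk (rootfs and user area). Boot still goes through the SD card or the built-in eMMC; for example, device tree (dtb) updates still happen on the SD card or eMMC.
Format the disk: search for "disk" in the application list and open Ubuntu's built-in Disks tool. Select the detected SSD, press Ctrl+F to quick-format it, click Format (do not overwrite existing data), click Format again (this dialog only confirms which devices are affected), and enter the password to authorize. Accept the default maximum partition size and click Next. Enter ssd as the partition name, leave the other options at their defaults (Type: ext4), and click Create. Click the triangle (▶) at the lower left of the partition to mount it (the status changes from Not Mounted to Mounted at /media/nvidia/ssd).
Download the source of the rootfs migration scripts: git clone https://github.com/jetsonhacks/rootOnNVMe.git
If cloning fails, download the archive from the web page, extract it, and upload it to the board's home directory with WinSCP (then set the file permissions to 0755, otherwise the scripts fail with a permission error).
    cd rootOnNVMe
    ./copy-rootfs-ssd.sh
    ./setup-service.sh
    sudo reboot
Run df -h: the partition mounted at / now shows the SSD's 120 GB capacity instead of the eMMC's 16 GB.
7   Changing the software sources on the Jetson TX2
Back up and edit the sources:
    sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
    sudo vim /etc/apt/sources.list
    sudo apt update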
Tsinghua mirror. Note that these entries use the xenial suite (Ubuntu 16.04), whereas the official entries further down use bionic (Ubuntu 18.04); the suite should match the release installed on the board.
    deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ xenial-updates main restricted universe multiverse
    deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ xenial-updates main restricted universe multiverse
    deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ xenial-security main restricted universe multiverse
    deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ xenial-security main restricted universe multiverse
    deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ xenial-backports main restricted universe multiverse
    deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ xenial-backports main restricted universe multiverse
    deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ xenial main universe restricted
    deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ xenial main universe restricted
If the default official sources have been deleted, they can be restored with the following working official entries:
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic main restricted
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic main restricted
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic-updates main restricted
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic-updates main restricted
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic universe
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic universe
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic-updates universe
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic-updates universe
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic multiverse
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic multiverse
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic-updates multiverse
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic-updates multiverse
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic-backports main restricted universe multiverse
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic-backports main restricted universe multiverse
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic-security main restricted
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic-security main restricted
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic-security universe
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic-security universe
    deb http://ports.ubuntu.com/ubuntu-ports/ bionic-security multiverse
    deb-src http://ports.ubuntu.com/ubuntu-ports/ bionic-security multiverse
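The mirror entries in this section name the xenial suite while the official entries name bionic. A suite that does not match the installed release can lead to dependency conflicts, so it may be worth checking sources.list against the release codename. The sketch below assumes the codename is available as VERSION_CODENAME in /etc/os-release; both function names are hypothetical.

```python
import re


def read_codename(os_release_text):
    """Extract VERSION_CODENAME from the contents of /etc/os-release."""
    for line in os_release_text.splitlines():
        if line.startswith("VERSION_CODENAME="):
            return line.split("=", 1)[1].strip().strip('"')
    return None


def mismatched_suites(sources_text, codename):
    """Return the deb/deb-src lines whose suite does not belong to codename.

    A suite matches when it equals the codename or starts with
    "<codename>-" (for example bionic-updates).
    """
    bad = []
    for line in sources_text.splitlines():
        # drop an optional "[arch=...]" block before splitting into fields
        fields = re.sub(r"\[[^\]]*\]", "", line).split()
        if len(fields) < 3 or fields[0] not in ("deb", "deb-src"):
            continue  # skip comments, blanks, and malformed lines
        suite = fields[2]
        if suite != codename and not suite.startswith(codename + "-"):
            bad.append(line.strip())
    return bad


if __name__ == "__main__":
    import os

    if os.path.exists("/etc/os-release") and os.path.exists("/etc/apt/sources.list"):
        with open("/etc/os-release") as f:
            codename = read_codename(f.read())
        with open("/etc/apt/sources.list") as f:
            sources = f.read()
        if codename:
            for entry in mismatched_suites(sources, codename):
                print("suite does not match", codename + ":", entry)
```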
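8   Configuring Python
The torchvision README lists the matching torch and torchvision versions and the Python versions they require.
    torch    torchvision    Python
    1.10     0.11           >=3.6, <=3.9
Since PyTorch v1.10.2 is needed, I chose Python v3.7 (many deep learning projects require at least v3.7) and avoided v3.8-v3.9, which might be too new and cause build errors.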
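The version constraint from the table can be checked programmatically before starting a long build. This is a minimal sketch; the constants restate the range given for torch 1.10 / torchvision 0.11, and the function name is hypothetical.

```python
import sys

# Supported Python range for torch 1.10 / torchvision 0.11 (from the table above).
MIN_PY = (3, 6)
MAX_PY = (3, 9)


def python_supported(version=None, lo=MIN_PY, hi=MAX_PY):
    """True if the (major, minor) part of version lies within [lo, hi]."""
    major_minor = tuple((version or sys.version_info)[:2])
    return lo <= major_minor <= hi


if __name__ == "__main__":
    print("interpreter within supported range:", python_supported())
```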
8.1   Python 3.6
The system ships with Python 3.6.9. Running pip3 fails with: pip3: command not found
Fix: sudo apt install python3-pip
8.2   Python 3.7
Python 3.7 has to be built manually. Download the Python 3.7.16 source archive.
    tar -xf Python-3.7.16.tar.xz
    # install the build dependencies
    sudo apt install build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libbz2-dev liblzma-dev
    sudo apt build-dep python3
    # configure from inside the extracted source directory; pick ONE prefix
    cd Python-3.7.16
    ./configure --prefix=/usr/local/python3.7
    # alternative prefix for a miniconda layout:
    # ./configure --prefix=/root/miniconda3/lib/python3.7
    make -j
    sudo make install
    # create the python3.7 and pip3.7 symlinks
    cd /usr/bin/
    sudo ln -s /usr/local/python3.7/bin/python3.7 python3.7
    sudo ln -s /usr/local/python3.7/bin/pip3.7 pip3.7
    pip3.7 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
    export PATH=$PATH:/usr/local/python3.7/bin
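After make install, it can help to confirm that the optional extension modules were actually compiled, since a missing -dev package only shows up later as an import error. The sketch below tries to import each module and reports the Ubuntu package to install; the module-to-package pairs are taken from the error messages listed in section 8.2.1, and the function name is hypothetical.

```python
import importlib

# Standard-library module -> Ubuntu package whose headers it needs at build time.
REQUIRED = {
    "ssl": "libssl-dev",
    "_ctypes": "libffi-dev",
    "_bz2": "libbz2-dev",
    "_lzma": "liblzma-dev",
}


def missing_build_deps(required=REQUIRED):
    """Return the apt packages whose corresponding module fails to import."""
    missing = []
    for module, package in required.items():
        try:
            importlib.import_module(module)
        except ImportError:
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = missing_build_deps()
    if missing:
        print("install these and rebuild Python:", " ".join(missing))
    else:
        print("all optional modules were built")
```

Run it with the freshly built interpreter (python3.7), not the system one, since each interpreter has its own set of compiled modules.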
8.2.1   Errors caused by missing system libraries
Missing libssl-dev: pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available
Missing libffi-dev: ModuleNotFoundError: No module named '_ctypes'
Missing libbz2-dev (needed by Detectron2): ModuleNotFoundError: No module named '_bz2'
Missing liblzma-dev (needed by Detectron2): ModuleNotFoundError: No module named '_lzma'
8.2.2   An alternative fix for the TLS/SSL error
If Python was built without libssl-dev and reports the TLS/SSL error, the problem can also be worked around by switching pip to a mirror, which must be an http (not https) source.
Create and edit the pip configuration file:
    mkdir ~/.pip
    cd ~/.pip
    touch pip.conf
    nano pip.conf
pip source settings:
    [global]
    index-url = http://pypi.tuna.tsinghua.edu.cn/simple
    [install]
    trusted-host = pypi.tuna.tsinghua.edu.cn
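8.2.3   Dependency version errors when installing system libraries
Error: libssl-dev : Depends: libssl1.0.0 (= 1.0.2g-1ubuntu4) but 1.0.2g-1ubuntu4.15 is to be installed
Cause: the OpenSSL package being installed depends on package X at version A, but another program already on the system depends on X at a different version B. The two versions conflict, so the installation fails.
Fix: sudo apt install libssl1.0.0=1.0.2g-1ubuntu4, that is, install the version named in the error message.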
8.3   PyPI Tsinghua mirror
One-off use: pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package
Set as default (requires pip >= 10.0.0): pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
Upgrading pip itself
From the official source: python -m pip install --upgrade pip
From the Tsinghua mirror: python -m pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade pip
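9   Fixing "Illegal instruction (core dumped)"
To prevent the "Illegal instruction (core dumped)" error, append the following line to the end of ~/.bashrc, save the file, and reboot:
    export OPENBLAS_CORETYPE=ARMV8
If the same error persists after exporting the variable, the cause may be the version of the installed package or the package itself.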
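Appending the export line by hand is easy to do twice by accident. The following sketch appends it only when it is not already present; the helper name is hypothetical, and the script's main block only reports the current value of the variable without touching any file.

```python
import os
from pathlib import Path

EXPORT_LINE = "export OPENBLAS_CORETYPE=ARMV8"


def ensure_line(path, line=EXPORT_LINE):
    """Append line to the file at path unless it is already present.

    Returns True if the file was modified.
    """
    path = Path(path)
    text = path.read_text() if path.exists() else ""
    if line in (existing.strip() for existing in text.splitlines()):
        return False
    # start on a fresh line if the file does not end with a newline
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as f:
        f.write(prefix + line + "\n")
    return True


if __name__ == "__main__":
    print("OPENBLAS_CORETYPE =", os.environ.get("OPENBLAS_CORETYPE"))
```

A call such as ensure_line(Path.home() / ".bashrc") would apply it to the current user's shell configuration.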
10   Installing jetson-stats to monitor the board
Switch pip to the Tsinghua mirror before installing jetson-stats: sudo -H pip3.7 install -U jetson-stats
Once jetson-stats is installed, you can install the nvidia-jetpack package and use jetson_release to view board information (jetson-stats must be installed first):
    sudo apt install nvidia-jetpack
    sudo jetson_release
10.1   The jetson_stats.service is not active
Symptom: jtop (jetson_stats.service) does not run, and systemctl shows the service state as activating:
    The jetson_stats.service is not active. Please run:
    sudo systemctl restart jetson_stats.service
Fix: switch pip to the Tsinghua mirror, then reinstall jetson-stats.
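11   Installing CUDA, cuDNN, and TensorRT on the board
The sample root filesystem does not include CUDA and related libraries, so CUDA, cuDNN, and TensorRT must be installed manually after flashing.
Open the sdk-manager download site and register an NVIDIA developer account (make sure it is on the developer.nvidia.com domain; searching for "nvidia register" may lead to partner.nvidia.com, i.e. nvonline).
Download the latest deb package; at the time of writing it is NVIDIA SDK Manager 1.9.3.deb.
Install sdk-manager: sudo apt install ./sdkmanager_1.9.3-10904_amd64.deb
Run sdk-manager: run sdkmanager in a terminal, or open the SDKManager icon from the application list.
Log in (a developer account opens a browser to log in; nvonline logs in directly inside the application). Checking "Stay logged in" is recommended, so that you do not have to log in again when reopening the application after a failed install.
sdk-manager step 1: wait for the program to detect the board. If it is not detected correctly, select the right model under Target Hardware manually (TX2 NX in my case). Leave the rest at the defaults; the additional SDK (DeepStream) does not need to be checked.
sdk-manager step 2: HOST COMPONENTS configures the flashing host's environment (this part can be installed through the Tsinghua mirror). TARGET COMPONENTS is the board's environment: do not check Jetson OS, since the system was already flashed in the earlier steps, and check all the other SDK Components. The target components cannot be installed through the Tsinghua mirror and fail with: SDK Manager received errors while using apt commands on your system. Switch back to the official sources; the file to edit is sources.list on the board, and editing sources.list on the host has no effect. Connect the board to the flashing host with the micro-USB cable and click Next. When prompted that the specified path does not exist, click Create; enter the host's password here to authorize. A window then asks for the board's account details: confirm the board model, use 192.168.55.1 (the USB virtual IP) for IPv4, enter the system username and password, leave the rest at the defaults, and click Install.
sdk-manager step 3: the installation runs.
sdk-manager step 4: click Exit.
Set the environment variables: after installing the libraries, sdk-manager appends the export statements to the regular user's .bashrc. It only appends them; you still need to run source ~/.bashrc yourself, in a terminal of the regular account. The root account's .bashrc must be configured manually:
    source ~/.bashrc
    sudo -i
    nano .bashrc
    # add these two lines to root's .bashrc
    export PATH=/usr/local/cuda-10.2/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
    source .bashrc
Commands to check the CUDA installation: jtop and nvcc -V (or nvcc --version).
Configure cuDNN: sdk-manager installs cuDNN but does not place its header and library files in the CUDA directory. This step is optional, but without it you must point to /usr/include and /usr/lib/aarch64-linux-gnu manually when building OpenCV with CUDA.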
    # copy the cuDNN headers and libraries into the CUDA directory
    cd /usr/include && sudo cp cudnn* /usr/local/cuda-10.2/include
    cd /usr/lib/aarch64-linux-gnu && sudo cp libcudnn* /usr/local/cuda-10.2/lib64
    sudo chmod 777 /usr/local/cuda-10.2/include/cudnn.h
    sudo chmod 777 /usr/local/cuda-10.2/lib64/libcudnn*
    # recreate the symlinks; ln -sf takes TARGET first, then LINK_NAME, so the lines whose
    # second argument is under /etc/alternatives overwrite those system entries
    cd /usr/local/cuda-10.2/lib64
    sudo ln -sf libcudnn.so /etc/alternatives/libcudnn_so
    sudo ln -sf libcudnn.so.8.2.1 libcudnn.so.8
    sudo ln -sf libcudnn_ops_train.so /etc/alternatives/libcudnn_ops_train_so
    sudo ln -sf libcudnn_ops_train.so.8.2.1 libcudnn_ops_train.so.8
    sudo ln -sf libcudnn_ops_infer.so /etc/alternatives/libcudnn_ops_infer_so
    sudo ln -sf libcudnn_ops_infer.so.8.2.1 libcudnn_ops_infer.so.8
    sudo ln -sf libcudnn_adv_train.so /etc/alternatives/libcudnn_adv_train_so
    sudo ln -sf libcudnn_adv_train.so.8.2.1 libcudnn_adv_train.so.8
    sudo ln -sf libcudnn_adv_infer.so /etc/alternatives/libcudnn_adv_infer_so
    sudo ln -sf libcudnn_adv_infer.so.8.2.1 libcudnn_adv_infer.so.8
    sudo ln -sf libcudnn_cnn_train.so /etc/alternatives/libcudnn_cnn_train_so
    sudo ln -sf libcudnn_cnn_train.so.8.2.1 libcudnn_cnn_train.so.8
    sudo ln -sf libcudnn_cnn_infer.so /etc/alternatives/libcudnn_cnn_infer_so
    sudo ln -sf libcudnn_cnn_infer.so.8.2.1 libcudnn_cnn_infer.so.8
    sudo ln -sf libcudnn_static.a /etc/alternatives/libcudnn_stlib
    # refresh the dynamic linker cache
    sudo ldconfig
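Because this step creates many symlinks by hand, a quick scan for links whose target does not exist can catch a mistyped name or a reversed argument order. A minimal sketch using only the standard library; the function name is hypothetical and the two directories are the ones touched by the commands above.

```python
import os


def dangling_symlinks(directory):
    """Return the names of symlinks in directory whose target does not exist."""
    broken = []
    for entry in sorted(os.listdir(directory)):
        full = os.path.join(directory, entry)
        # islink is true for any symlink; exists follows the link to its target
        if os.path.islink(full) and not os.path.exists(full):
            broken.append(entry)
    return broken


if __name__ == "__main__":
    for d in ("/usr/local/cuda-10.2/lib64", "/etc/alternatives"):
        if os.path.isdir(d):
            print(d, dangling_symlinks(d))
```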
Test cuDNN (optional):
    sudo cp -r /usr/src/cudnn_samples_v8/ ~/
    cd ~/cudnn_samples_v8/mnistCUDNN
    sudo chmod 777 ~/cudnn_samples_v8
    sudo make clean && sudo make
    ./mnistCUDNN
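12   Installing PyTorch and torchvision
12.1   Python 3.6
12.1.1   Installing PyTorch for Python 3.6
For Python 3.6, download the officially built wheel from the PyTorch for Jetson thread on the NVIDIA forum, then install it: pip3.6 install torch-1.10.0-cp36-cp36m-linux_aarch64.whl
12.1.2   Building torchvision for Python 3.6
Same as building torchvision for Python 3.7.
12.2   Python 3.7
12.2.1   Building PyTorch for Python 3.7
Note: the build is very slow. On my TX2 NX it took 23 hours, most of it spent compiling the caffe2 sources.
NVIDIA provides PyTorch wheels only for Python 3.6, and the official torch download site has no cu102 wheel for ARM (aarch64), only CPU builds. A CUDA build of PyTorch for Python 3.7 therefore has to be compiled from source.
For the build steps, see the Build from Source part of the Instructions section in the PyTorch for Jetson thread on the NVIDIA forum.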
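Set the board's power mode:
    # switch to power mode 0
    sudo nvpmodel -m 0
    # lock the clocks at their maximum
    sudo jetson_clocks
    # query the current power mode
    sudo nvpmodel -q verbose
    # show the current clock settings
    sudo jetson_clocks --show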
Download the PyTorch source (v1.10.2):
    cd /home/nvidia
    # Option 1: clone together with all submodules
    git clone --recursive --branch v1.10.2 http://github.com/pytorch/pytorch
    git submodule sync
    git submodule update --init --recursive --jobs 0
    # Option 2: clone first, then fetch the submodules separately
    git clone --branch v1.10.2 http://github.com/pytorch/pytorch
    cd pytorch
    git submodule update --init --recursive
    git submodule update --recursive
    # if a submodule cannot be fetched: disable SSL verification and redirect the unreachable repository
    git config --global http.sslVerify false
    git config --global https.sslVerify false
    git config --global url."https://20.205.243.166/cpp-pm/linux-syscall-support".insteadOf "https://chromium.googlesource.com/linux-syscall-support"
    # the same settings as they appear in ~/.gitconfig (nano ~/.gitconfig)
    [http]
        sslverify = false
    [https]
        sslverify = false
    [url "https://20.205.243.166/cpp-pm/linux-syscall-support"]
        insteadOf = https://chromium.googlesource.com/linux-syscall-support
Apply the patch by hand: following the patch file, find the corresponding lines in each affected file and make the same changes. If you prefer not to edit the files yourself, download my already-modified copies and upload them in place of the originals. The patch mainly fixes build errors (see the thread referenced for the build steps). The affected files are vec256_float_neon.h, CUDAContext.cpp, and KernelUtils.h.
    cd pytorch
    mv aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h.bak
    mv ../vec256_float_neon.h aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h
    mv aten/src/ATen/cuda/CUDAContext.cpp aten/src/ATen/cuda/CUDAContext.cpp.bak
    mv ../CUDAContext.cpp aten/src/ATen/cuda/CUDAContext.cpp
    mv aten/src/ATen/cuda/detail/KernelUtils.h aten/src/ATen/cuda/detail/KernelUtils.h.bak
    mv ../KernelUtils.h aten/src/ATen/cuda/detail/KernelUtils.h
Set the build environment variables (re-export them if you switch to a new terminal):
    export USE_NCCL=0
    export USE_DISTRIBUTED=0
    export USE_QNNPACK=0
    export USE_PYTORCH_QNNPACK=0
    export TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2"
    export PYTORCH_BUILD_VERSION=1.10.2
    export PYTORCH_BUILD_NUMBER=1
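The three values in TORCH_CUDA_ARCH_LIST are CUDA compute capabilities: 5.3 (Jetson Nano and TX1), 6.2 (TX2 family), and 7.2 (Xavier family). The sketch below builds the same set of variables as a dictionary so the architecture list can be derived from the target boards; the function names and the family keys are illustrative. Restricting the list to the board actually used (6.2 for the TX2 NX) is one option I have not timed.

```python
# CUDA compute capability per Jetson family.
JETSON_COMPUTE_CAPABILITY = {
    "nano": "5.3",
    "tx1": "5.3",
    "tx2": "6.2",
    "xavier": "7.2",
}


def arch_list(families):
    """Build a TORCH_CUDA_ARCH_LIST value for the given Jetson families."""
    caps = sorted({JETSON_COMPUTE_CAPABILITY[f] for f in families}, key=float)
    return ";".join(caps)


def build_env(families, version="1.10.2", build_number="1"):
    """Return the build variables from this section as a dict of strings."""
    return {
        "USE_NCCL": "0",
        "USE_DISTRIBUTED": "0",
        "USE_QNNPACK": "0",
        "USE_PYTORCH_QNNPACK": "0",
        "TORCH_CUDA_ARCH_LIST": arch_list(families),
        "PYTORCH_BUILD_VERSION": version,
        "PYTORCH_BUILD_NUMBER": build_number,
    }


if __name__ == "__main__":
    # print export statements that can be pasted into the build shell
    for key, value in build_env(["nano", "tx2", "xavier"]).items():
        print(f'export {key}="{value}"')
```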
Install the system libraries needed for the build: sudo apt install cmake libopenblas-dev libopenmpi-dev (I used the official Ubuntu sources here).
Build from source (backing up the pytorch folder first is recommended, since cloning it is not easy: cp -r pytorch pytorch.bak):
    cd pytorch
    pip3.7 install -r requirements.txt
    pip3.7 install scikit-build
    pip3.7 install ninja
    python3.7 setup.py bdist_wheel
    cd dist
    pip3.7 install torch-1.10.2-cp37-cp37m-linux_aarch64.whl
    # if the compiler is killed for lack of memory, enlarge zram or add a swap file
    find / -name "*zram*" 2>/dev/null
    nano /etc/systemd/nvzramconfig.sh
    # in nvzramconfig.sh, change
    #   mem=$(((totalmem / 2 / ${NRDEVICES}) * 1024))
    # to
    #   mem=$(((totalmem * 3 / ${NRDEVICES}) * 1024))
    zramctl
    # create and enable an 8 GB swap file
    sudo fallocate -l 8G /var/swapfile
    sudo chmod 600 /var/swapfile
    sudo mkswap /var/swapfile
    sudo swapon /var/swapfile
    sudo bash -c 'echo "/var/swapfile swap swap defaults 0 0" >> /etc/fstab'
    # grow the swap file by 30 GB
    sudo swapoff /var/swapfile
    sudo dd if=/dev/zero of=/var/swapfile bs=1M count=30720 oflag=append conv=notrunc
    # remove the swap file and its entry in /etc/fstab
    sudo rm /var/swapfile
    nano /etc/fstab
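The effect of the nvzramconfig.sh change can be worked out ahead of time. The sketch below mirrors the shell's integer arithmetic for both formulas, assuming totalmem is expressed in kB and NRDEVICES is the number of zram devices; the function name and the sample numbers are illustrative only.

```python
def zram_bytes_per_device(totalmem_kb, nr_devices, numerator=1, denominator=2):
    """Mirror mem=$(((totalmem * n / d / NRDEVICES) * 1024)) with integer division.

    numerator=1, denominator=2 gives the original formula (totalmem / 2);
    numerator=3, denominator=1 gives the modified one (totalmem * 3).
    """
    return (totalmem_kb * numerator // denominator // nr_devices) * 1024


if __name__ == "__main__":
    total_kb, devices = 4_000_000, 4  # illustrative values, not measured
    before = zram_bytes_per_device(total_kb, devices)
    after = zram_bytes_per_device(total_kb, devices, numerator=3, denominator=1)
    print("per-device zram, original formula:", before, "bytes")
    print("per-device zram, modified formula:", after, "bytes")
    print("total zram, modified formula:", after * devices, "bytes")
```

With the modified formula the total zram size is three times the physical memory, six times what the original formula allocates.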
Check that PyTorch installed correctly (run inside the python3.7 interpreter):
    python3.7
    import torch
    torch.cuda.is_available()
    torch.backends.cudnn.is_available()
    print(torch.version.cuda)
    print(torch.backends.cudnn.version())
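The interactive checks above can be collected into one script that also reports cleanly when torch is not importable at all. A minimal sketch; the function name and the dictionary keys are hypothetical, while the torch calls are the same ones used in the interactive session.

```python
def torch_report():
    """Collect the installation checks from this section into a dict."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cudnn_available": torch.backends.cudnn.is_available(),
        "cuda_version": torch.version.cuda,
        "cudnn_version": torch.backends.cudnn.version(),
    }


if __name__ == "__main__":
    for key, value in torch_report().items():
        print(f"{key}: {value}")
```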
12.2.2   Building torchvision for Python 3.7
As noted in section 8, the torchvision README lists the matching versions: torch 1.10 pairs with torchvision 0.11 and requires Python >=3.6, <=3.9.
For the build steps, see: Development installation
Download and build the torchvision source:
    git clone -b v0.11.3 https://20.205.243.166/pytorch/vision.git torchvision
    cd torchvision
    sudo apt install libpng-dev libjpeg-turbo8-dev
    export BUILD_VERSION=0.11.3
    python3.7 setup.py bdist_wheel
    cd dist
    pip3.7 install torchvision-0.11.3-cp37-cp37m-linux_aarch64.whl
Check the installation:
    import torchvision
    print(torchvision.__version__)
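13   Configuring OpenCV with CUDA
13.1   Python 3.6 environment
With Python 3.6 you can use the prebuilt OpenCV package provided in this post: "Official hidden resources: Jetson pseudo-overclocking and CUDA-enabled OpenCV".
Prebuilt deb files: OpenCV-4.5.0-aarch64.tar.gz
You can first uninstall the CPU-only OpenCV 4.1.1 that ships with the system, then install the deb packages.
To build OpenCV with CUDA from source yourself, see Install OpenCV on Jetson Nano.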
13.2   Python 3.7 environment
Uninstall the default OpenCV, which lacks CUDA support:
    sudo apt purge 'libopencv*'
    sudo apt autoremove
    sudo apt update
    # look for leftover OpenCV files
    find / -name "*opencv*" 2>/dev/null
Download the opencv and opencv_contrib sources:
    # both tag archives are named 4.1.1.zip, so save them under distinct names
    curl -s -L -o opencv-4.1.1.zip https://github.com/opencv/opencv/archive/refs/tags/4.1.1.zip
    curl -s -L -o opencv_contrib-4.1.1.zip https://github.com/opencv/opencv_contrib/archive/refs/tags/4.1.1.zip
    sudo apt install unzip
    unzip opencv-4.1.1.zip
    unzip opencv_contrib-4.1.1.zip
    cd opencv-4.1.1
    mkdir build
    cd build
Install the dependencies (see Install OpenCV on Jetson Nano):
    sudo sh -c "echo '/usr/local/cuda-10.2/lib64' >> /etc/ld.so.conf.d/nvidia-tegra.conf"
    sudo ldconfig
    sudo apt install build-essential cmake git unzip pkg-config zlib1g-dev
    sudo apt install libjpeg-dev libjpeg8-dev libjpeg-turbo8-dev
    sudo apt install libpng-dev libtiff-dev libglew-dev
    sudo apt install libavcodec-dev libavformat-dev libswscale-dev
    sudo apt install libgtk2.0-dev libgtk-3-dev libcanberra-gtk*
    sudo apt install python-dev python-numpy python-pip
    sudo apt install python3-dev python3-numpy python3-pip
    sudo apt install libxvidcore-dev libx264-dev libgtk-3-dev
    sudo apt install libtbb2 libtbb-dev libdc1394-22-dev libxine2-dev
    sudo apt install gstreamer1.0-tools libgstreamer-plugins-base1.0-dev
    sudo apt install libgstreamer-plugins-good1.0-dev
    sudo apt install libv4l-dev v4l-utils v4l2ucp qv4l2
    sudo apt install libtesseract-dev libxine2-dev libpostproc-dev
    sudo apt install libavresample-dev libvorbis-dev
    sudo apt install libfaac-dev libmp3lame-dev libtheora-dev
    sudo apt install libopencore-amrnb-dev libopencore-amrwb-dev
    sudo apt install libopenblas-dev libatlas-base-dev libblas-dev
    sudo apt install liblapack-dev liblapacke-dev libeigen3-dev gfortran
    sudo apt install libhdf5-dev libprotobuf-dev protobuf-compiler
    sudo apt install libgoogle-glog-dev libgflags-dev
Configure cmake
The cmake configuration must include -D CUDNN_VERSION="8.0", otherwise it fails with: Could NOT find CUDNN: Found unsuitable version "..", but required is at least "6" (found /usr/lib/aarch64-linux-gnu/libcudnn.so.8.2.1). The error occurs even when the absolute path to the library file is given explicitly. For the cuDNN version, see the section "Installing CUDA, cuDNN, and TensorRT on the board".
The downloaded source archives do not contain the sources of every module, and raw.githubusercontent.com cannot be reached directly (ping does not return an IP either). The site https://www.ipaddress.com/ gave the IP 185.199.110.133. Without mapping the domain to this IP in hosts, cmake will most likely fail with:
    =======================================================================
    Couldn't download files from the Internet.
    Please check the Internet access on this host.
    =======================================================================
Edit hosts and add the two mappings:
    sudo nano /etc/hosts
    185.199.110.133 raw.githubusercontent.com
    20.205.243.166 github.com
If the download error persists even after mapping the domains in hosts, remove the opencv_contrib-4.1.1 directory first.
    cmake \
        -D CMAKE_BUILD_TYPE=RELEASE \
        -D CMAKE_INSTALL_PREFIX=/usr/local \
        -D CMAKE_C_COMPILER=/usr/bin/gcc-7 \
        -D INSTALL_PYTHON_EXAMPLES=OFF \
        -D INSTALL_C_EXAMPLES=OFF \
        -D OPENCV_ENABLE_NONFREE=ON \
        -D BUILD_opencv_python3=ON \
        -D WITH_CUDA=ON \
        -D WITH_CUDNN=ON \
        -D CUDNN_VERSION="8.0" \
        -D WITH_TBB=ON \
        -D OPENCV_DNN_CUDA=ON \
        -D ENABLE_FAST_MATH=on \
        -D CUDA_FAST_MATH=on \
        -D CUDA_ARCH_BIN=6.2 \
        -D CUDA_ARCH_PTX="" \
        -D WITH_CUBLAS=on \
        -D OPENCV_GENERATE_PKGCONFIG=ON \
        -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-4.1.1/modules \
        -D PYTHON3_EXECUTABLE=/usr/local/python3.7/bin/python3.7m \
        -D PYTHON3_INCLUDE_DIR=/usr/local/python3.7/include/python3.7m \
        -D PYTHON3_LIBRARY=/usr/local/python3.7/lib/libpython3.7m.so \
        -D PYTHON3_NUMPY_INCLUDE_DIRS=/usr/local/python3.7/lib/python3.7/site-packages/numpy/core/include \
        -D PYTHON3_PACKAGES_PATH=/usr/local/python3.7/lib/python3.7/site-packages \
        -D PYTHON_DEFAULT_EXECUTABLE=/usr/local/python3.7/bin/python3.7m \
        -D CUDNN_LIBRARY=/usr/local/cuda-10.2/lib64/libcudnn.so.8.2.1 \
        -D CUDNN_INCLUDE_DIR=/usr/local/cuda-10.2/include \
        -D CUDA_CUDA_LIBRARY=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcuda.so \
        -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.2/ \
        -D OPENCV_PYTHON3_INSTALL_PATH=/usr/local/python3.7/lib/python3.7/site-packages \
        -D WITH_WEBP=OFF \
        -D WITH_OPENCL=OFF \
        -D ETHASHLCL=OFF \
        -D ENABLE_CXX11=ON \
        -D BUILD_EXAMPLES=OFF \
        -D WITH_OPENGL=ON \
        -D WITH_GSTREAMER=ON \
        -D WITH_V4L=ON \
        -D WITH_LIBV4L=ON \
        -D WITH_QT=OFF \
        -D BUILD_opencv_python2=OFF \
        -D WITH_FFMPEG=on \
        -D HAVE_opencv_python3=ON \
        -D EIGEN_INCLUDE_PATH=/usr/include/eigen3 \
        -D WITH_EIGEN=ON \
        -D ENABLE_NEON=ON \
        -D WITH_OPENMP=ON \
        -D BUILD_TIFF=ON \
        -D BUILD_TBB=ON \
        -D BUILD_TESTS=OFF \
        -D WITH_PROTOBUF=ON \
        ..
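If cmake reports that it cannot find the OpenBLAS header and library files, try the following:
    sudo apt install libopenblas-dev libopenblas-base
    sudo apt install liblapacke-dev
    # the referenced fix targets x86_64 hosts; on the Jetson (aarch64) the multiarch include directory is aarch64-linux-gnu
    sudo ln -s /usr/include/lapacke.h /usr/include/aarch64-linux-gnu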
Build and install:
    make -j1
    sudo make install
    sudo ldconfig
Test the OpenCV installation (run inside the python3.7 interpreter):
    python3.7
    import cv2
    print(cv2.cuda.getCudaEnabledDeviceCount())
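getCudaEnabledDeviceCount() returns 0 both when OpenCV was built without CUDA and when no device is visible, so inspecting the build summary helps tell the two cases apart. The sketch below assumes the summary returned by cv2.getBuildInformation() contains a line of the form "NVIDIA CUDA: YES (...)" for CUDA builds; the function name is hypothetical.

```python
import re


def cuda_in_build_info(build_info):
    """True if the build summary reports NVIDIA CUDA as enabled."""
    return re.search(r"^\s*NVIDIA CUDA:\s+YES", build_info, re.MULTILINE) is not None


if __name__ == "__main__":
    try:
        import cv2
    except ImportError:
        print("cv2 is not installed")
    else:
        print("built with CUDA:", cuda_in_build_info(cv2.getBuildInformation()))
        print("CUDA devices:", cv2.cuda.getCudaEnabledDeviceCount())
```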
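14   Configuring conda (optional, not recommended)
Not recommended. Only an old version installs successfully on the Jetson TX2 NX. In addition, after creating a virtual environment, once you conda install any package in the default base environment, every subsequent conda command fails with illegal instruction (core dump), while pip keeps working. In a newly created environment with another name it is the opposite: pip fails with illegal instruction (core dump), while conda install works.
anaconda download:
miniconda download: (link)
Archiconda download: (link)
On the TX2 NX only this version installs successfully: Miniconda3-py37_4.9.2-Linux-aarch64.sh. All later versions fail with illegal instruction (core dump).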
14.1   Basic conda operations
    # create an environment
    conda create -n xxx python=3.7
    # activate it
    conda activate xxx
    # leave it
    conda deactivate
    # delete it
    conda remove -n xxx --all
    # list all environments
    conda info --envs
14.2   Changing conda sources
See the Anaconda mirror help page.
    nano ~/.condarc
    channels:
      - defaults
    show_channel_urls: true
    default_channels:
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
    custom_channels:
      conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      deepmodeling: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/
After saving the file, clear the index cache:
    conda clean -i
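14.3   Errors when creating an environment from a yml file (this fix did not work in my own test)
Symptom: creating a conda environment from a yml file fails with Solving environment: failed and ResolvePackageNotFound
Cause: the detailed build of some packages depends on the machine, so the exported versions may not fit the current machine.
Fix: strip the build information from environment.yml by deleting everything from the second equals sign onward. Before and after:
    pytorch=1.1.0=py3.7_cuda10.0.130_cudnn7.5.1_0
    pytorch=1.1.0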
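Editing every dependency line by hand is tedious for a long environment.yml. The following sketch automates the edit described above; the function names are hypothetical, pip-style pins written with == are left untouched, and, as noted in the heading, this fix may not resolve the error on every machine.

```python
def strip_build(spec):
    """Drop the build string: 'name=version=build' -> 'name=version'."""
    return "=".join(spec.split("=")[:2])


def strip_builds_in_yml(text):
    """Apply strip_build to every '- name=version=build' dependency line."""
    out = []
    for line in text.splitlines():
        stripped = line.lstrip()
        is_conda_pin = (
            stripped.startswith("- ")
            and "==" not in stripped          # leave pip-style pins alone
            and stripped.count("=") >= 2
        )
        if is_conda_pin:
            indent = line[: len(line) - len(stripped)]
            out.append(indent + "- " + strip_build(stripped[2:].strip()))
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(strip_build("pytorch=1.1.0=py3.7_cuda10.0.130_cudnn7.5.1_0"))
```

When the original environment is still available, exporting it again with conda env export --no-builds produces a file without build strings in the first place.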