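【Functional module】
【Steps & symptoms】
1. Code that runs successfully on my local machine fails to run on ModelArts.
2. Error message: try to send request before open()
【Screenshots】
【Log information】(optional; upload log content or an attachment)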
- do nothing
- [modelarts service log]user: uid=1101(work) gid=1101(work) groups=1101(work),1000(hwhiaiuser)
- [modelarts service log]pwd: /home/work
- [modelarts service log]app_url: s3://sunzengguo-train/zlf/nst/
- [modelarts service log]boot_file: nst/train.py
- [modelarts service log]log_url: /tmp/log/zlf_nst.log
- [modelarts service log]command: nst/train.py --data_url=s3://sunzengguo-train/zlf/nst/img_data/ --train_url=s3://sunzengguo-train/zlf/nst/output/
- [modelarts service log]local_code_dir:
- [modelarts service log]training start at 2022-06-02-17:17:47
- [modelarts service log][modelarts_create_log] modelarts-pipe found
- [modelarts service log]modelarts-pipe: will create log file /tmp/log/zlf_nst.log
- [modelarts service log]handle inputs of training job
- info:root:using moxing-v2.0.0.rc2.4b57a67b-4b57a67b
- info:root:using obs-python-sdk-3.20.9.1
- [modelarts service log][info][2022/06/02 17:17:48]: env ma_inputs is not found, skip the inputs handler
- info:root:using moxing-v2.0.0.rc2.4b57a67b-4b57a67b
- info:root:using obs-python-sdk-3.20.9.1
- [modelarts service log]2022-06-02 17:17:49,416 - modelarts-downloader.py[line:623] - info: main: modelarts-downloader starting with namespace(dst='./', recursive=true, skip_creating_dir=false, src='s3://sunzengguo-train/zlf/nst/', trace=false, type='common', verbose=false)
- [modelarts service log][modelarts_logger] modelarts-pipe found
- [modelarts service log]modelarts-pipe: will create log file /tmp/log/zlf_nst.log
- /home/work/user-job-dir
- [modelarts service log]modelarts-pipe: will write log file /tmp/log/zlf_nst.log
- [modelarts service log]modelarts-pipe: param for max log length: 1073741824
- [modelarts service log]modelarts-pipe: param for whether exit on overflow: 0
- [modelarts service log]modelarts-pipe: total length: 24
- [modelarts service log][modelarts_logger] modelarts-pipe found
- [modelarts service log]modelarts-pipe: will create log file /tmp/log/zlf_nst.log
- [modelarts service log]modelarts-pipe: will write log file /tmp/log/zlf_nst.log
- [modelarts service log]modelarts-pipe: param for max log length: 1073741824
- [modelarts service log]modelarts-pipe: param for whether exit on overflow: 0
- info:root:using moxing-v2.0.0.rc2.4b57a67b-4b57a67b
- info:root:using obs-python-sdk-3.20.9.1
- [modelarts service log]2022-06-02 17:17:57,381 - info - background upload stdout log to s3://sunzengguo-train/zlf/logs/job386a315f-job-zlf-nst-0.log
- [modelarts service log]2022-06-02 17:17:57,390 - info - ascend driver: version=21.0.2.spc001
- [modelarts service log]2022-06-02 17:17:57,391 - info - you are advised to use ascend_device_id env instead of device_id, as the device_id env will be discarded in later versions
- [modelarts service log]2022-06-02 17:17:57,391 - info - particularly, ${ascend_device_id} == ${device_id}, it's the logical device id
- [modelarts service log]2022-06-02 17:17:57,391 - info - davinci training command
- [modelarts service log]2022-06-02 17:17:57,391 - info - ['/usr/bin/python', '/home/work/user-job-dir/nst/train.py', '--data_url=s3://sunzengguo-train/zlf/nst/img_data/', '--train_url=s3://sunzengguo-train/zlf/nst/output/']
- [modelarts service log]2022-06-02 17:17:57,391 - info - wait for rank table file ready
- [modelarts service log]2022-06-02 17:17:57,391 - info - rank table file (k8s generated) is ready for read
- [modelarts service log]2022-06-02 17:17:57,392 - info -
- {
- "status": "completed",
- "group_count": "1",
- "group_list": [
- {
- "group_name": "job-zlf-nst",
- "device_count": "1",
- "instance_count": "1",
Resource pool location: Northwest Yanta Supercomputing (西北雁塔超算); jobid: job386a315f
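For reference, the launch command in the log above passes `--data_url` and `--train_url` to `nst/train.py`. Below is a minimal sketch of how an entry script might accept those two options. The function name `parse_job_args` is hypothetical, and this is not the code from the attachment; it only illustrates the argument handling under the assumption that the platform may append extra options.

```python
import argparse


def parse_job_args(argv):
    """Parse the two options that appear in the job's launch command."""
    parser = argparse.ArgumentParser(description="training entry point (sketch)")
    # Option names are copied from the launch command shown in the log.
    parser.add_argument("--data_url", type=str, default="", help="input data location")
    parser.add_argument("--train_url", type=str, default="", help="output location")
    # parse_known_args returns unrecognized options instead of exiting,
    # so extra options appended by the platform do not abort the script.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown
```

In a script this would typically be called as `args, unknown = parse_job_args(sys.argv[1:])`, and printing `unknown` at startup shows whether the job received options the script did not expect.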
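What Ascend hardware are you running on locally? This is a training job, right? Could you share a bit more of the code? It looks like the initialization may not have been set up correctly.
Thanks for sharing.
Locally I ran it on CPU, where it runs fine. I have uploaded the training part of the code as a txt file.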
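Since the local run is on CPU while the job log reports an Ascend driver and advises using ascend_device_id instead of device_id, a script moved between the two environments may need to read the device id from the environment. Below is a minimal sketch of that lookup. It assumes the real variable names are the upper-case forms `ASCEND_DEVICE_ID` and `DEVICE_ID` (the log prints them in lower case), and `resolve_device_id` is a hypothetical helper, not part of any library. It is not a confirmed fix for the error above.

```python
import os


def resolve_device_id(env=None, default=0):
    """Return the logical device id, preferring ASCEND_DEVICE_ID over DEVICE_ID.

    Variable names are assumed to be the upper-case forms of the names
    printed in the job log. Falls back to `default` when neither variable
    holds a non-negative integer (for example on a CPU-only machine).
    """
    env = os.environ if env is None else env
    for name in ("ASCEND_DEVICE_ID", "DEVICE_ID"):
        value = env.get(name)
        if value is not None and value.strip().isdigit():
            return int(value)
    return default
```

Logging the resolved value at the very start of the training script makes it easy to compare what the local run and the ModelArts job actually see.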
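Which version are you using? As I recall, version 1.3 had a bug related to the num_parallel_workers setting that was fixed in later versions. I am not sure whether it is related to the problem you are seeing; if possible, try version 1.5 or 1.6.
I have tried both 1.3 and 1.6, and the same problem occurs with both.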