Foreword: this article discusses the possibility of batch downloading with Python purely from a technical angle, and is in no way an endorsement of bulk-downloading literature. Anyone who uses this method to batch-download papers does so at their own risk. On the consequences of excessive downloading, see: Zhihu - "How should we view the HIT international student's unauthorized bulk-download incident?"
Python source code
Environment
Windows 10, 64-bit
Python 3.6
Other notes
The Desktop folder was moved from drive C to drive F with Tencent PC Manager, so the desktop directory is "F:/personal/desktop/".
A folder named pdf was created on the desktop in advance, at "F:/personal/desktop/pdf/".
For these reasons the program is for reference only and will not run in every environment; please adapt it to your own setup.
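If your machine lays out its folders differently, the hard-coded download folder can be built portably instead of being spelled out. A minimal sketch, assuming the folder should live under the user's home directory (the `Desktop/pdf` location is an assumption; adjust to taste):

```python
import os

# Hypothetical portable replacement for the hard-coded "F:/personal/desktop/pdf/" folder.
pdf_dir = os.path.join(os.path.expanduser("~"), "Desktop", "pdf")
os.makedirs(pdf_dir, exist_ok=True)  # create it if missing; no error if it already exists
print(pdf_dir)
```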
```python
import urllib.request as rq
import sys
import re


def getHtml(url):
    with rq.urlopen(url) as page:
        html = page.read().decode('utf-8')
    print("Html content achieved.")
    return html


def getPdf(html):
    reg = r'nmat[0-9]{4}\.pdf'
    pdfre = re.compile(reg)
    list_pdf = re.findall(pdfre, html)
    print("Pdfs' namelist achieved.")
    return list_pdf


def urllist(url, list_pdf):
    list_html = list_pdf
    for i in range(0, len(list_pdf)):
        list_html[i] = 'http://www.nature.com/nmat/journal/' + url[35:42] + 'pdf/' + "".join(list_pdf[i])
    print("Urllist achieved.")
    return list_html


def downloadPdf(list_html):
    num = 1
    # Any entry works here; they all share the same volume/issue segment.
    url = list_html[0]
    path_toc = "F:\\personal\\desktop\\pdf\\toc.pdf"
    print(path_toc)
    link_toc = 'http://www.nature.com/nmat/journal/' + url[35:42] + 'pdf/' + 'toc.pdf'
    rq.urlretrieve(link_toc, path_toc)
    print(link_toc + " completed!")
    path_masthead = "F:\\personal\\desktop\\pdf\\masthead.pdf"
    print(path_masthead)
    link_masthead = 'http://www.nature.com/nmat/journal/' + url[35:42] + 'pdf/' + 'masthead.pdf'
    rq.urlretrieve(link_masthead, path_masthead)
    print(link_masthead + " completed!")
    for pdf_link in list_html:
        path = "F:\\personal\\desktop\\pdf\\" + str(num) + ".pdf"
        print(path)
        rq.urlretrieve(pdf_link, path)
        print(pdf_link + " completed!")
        num = num + 1


def nmatmain():
    args = sys.argv
    if len(args) == 1:
        print("Format: python NmatDownload.py [url]")
        print("Example: python NmatDownload.py http://www.nature.com/nmat/journal/v16/n4/index.html")
        return
    elif len(args) == 2:
        url = args[1]
    else:
        print("Too many arguments!")
        return
    html = getHtml(url)
    list_pdf = getPdf(html)
    list_html = urllist(url, list_pdf)
    downloadPdf(list_html)


if __name__ == '__main__':
    nmatmain()
```
Code walkthrough
Define (def) a function that takes a url and returns (return) the corresponding HTML text:
```python
import urllib.request as rq

def getHtml(url):
    with rq.urlopen(url) as page:
        html = page.read().decode('utf-8')
    print("Html content achieved.")
    return html
```
The key step, as it would be written in Python 2:

```python
import urllib2

page = urllib2.urlopen(url)
htmlcode = page.read()
html = htmlcode.decode('utf-8')
```
The next function takes the HTML text, extracts the target strings with a regular expression, and finally returns them as a list:
```python
import re

def getPdf(html):
    reg = r'nmat[0-9]{4}\.pdf'
    pdfre = re.compile(reg)
    list_pdf = re.findall(pdfre, html)
    print("Pdfs' namelist achieved.")
    return list_pdf
```
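To see what this extraction picks up, here is a quick check against a made-up fragment of an issue index page (the article numbers are invented for illustration):

```python
import re

def getPdf(html):
    reg = r'nmat[0-9]{4}\.pdf'
    pdfre = re.compile(reg)
    list_pdf = re.findall(pdfre, html)
    print("Pdfs' namelist achieved.")
    return list_pdf

# Hypothetical fragment of an issue index page (article numbers invented):
sample = ('<a href="/nmat/journal/v16/n4/pdf/nmat4834.pdf">PDF</a>'
          '<a href="/nmat/journal/v16/n4/pdf/nmat4851.pdf">PDF</a>')
print(getPdf(sample))  # ['nmat4834.pdf', 'nmat4851.pdf']
```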
String operations on this list then yield the pdf download addresses, which are collected into a new list:
```python
def urllist(url, list_pdf):
    list_html = list_pdf
    for i in range(0, len(list_pdf)):
        list_html[i] = 'http://www.nature.com/nmat/journal/' + url[35:42] + 'pdf/' + "".join(list_pdf[i])
    print("Urllist achieved.")
    return list_html
```
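Note that `url[35:42]` is a fixed-offset slice: it only works because the index URL has exactly the shape `http://www.nature.com/nmat/journal/v16/n4/index.html`, where characters 35-41 are the volume/issue segment (a two-digit volume and one-digit issue). A quick check of the slice and the resulting link (the pdf name is a hypothetical example):

```python
url = 'http://www.nature.com/nmat/journal/v16/n4/index.html'
segment = url[35:42]          # the volume/issue part of the path
print(segment)                # v16/n4/

list_pdf = ['nmat4834.pdf']   # hypothetical output of getPdf
list_html = ['http://www.nature.com/nmat/journal/' + segment + 'pdf/' + name
             for name in list_pdf]
print(list_html[0])           # http://www.nature.com/nmat/journal/v16/n4/pdf/nmat4834.pdf
```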
The key download step: iterate over the list and fetch each file with urllib.request.urlretrieve(link, download_path),
```python
def downloadPdf(list_html):
    for pdf_link in list_html:
        rq.urlretrieve(pdf_link, path)  # path is built per file; see the full listing
```
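Since the loop saves the files as 1.pdf, 2.pdf, and so on, the original article numbers are lost. If you would rather keep them, the filename can be recovered from the link itself; a sketch (the link shown is a hypothetical example, and the folder is this article's hard-coded path):

```python
import os
from urllib.parse import urlparse

pdf_link = 'http://www.nature.com/nmat/journal/v16/n4/pdf/nmat4834.pdf'  # hypothetical link
filename = os.path.basename(urlparse(pdf_link).path)  # last component of the URL path
print(filename)  # nmat4834.pdf
# path = os.path.join("F:\\personal\\desktop\\pdf", filename)  # then pass path to urlretrieve
```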
Finally, command-line arguments are handled so the url can be supplied when invoking the script:
```python
def nmatmain():
    args = sys.argv
    if len(args) == 1:
        print("Format: python NmatDownload.py [url]")
        print("Example: python NmatDownload.py http://www.nature.com/nmat/journal/v16/n4/index.html")
        return
    elif len(args) == 2:
        url = args[1]
    else:
        print("Too many arguments!")
        return
    html = getHtml(url)
    list_pdf = getPdf(html)
    list_html = urllist(url, list_pdf)
    downloadPdf(list_html)


if __name__ == '__main__':
    nmatmain()
```
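The branch logic can be exercised without touching the network by factoring the sys.argv check into its own helper; `parse_url` below is a name introduced here purely for illustration:

```python
def parse_url(args):
    # Mirrors the argument handling in nmatmain(): return the url, or None to abort.
    if len(args) == 1:
        print("Format: python NmatDownload.py [url]")
        print("Example: python NmatDownload.py http://www.nature.com/nmat/journal/v16/n4/index.html")
        return None
    elif len(args) == 2:
        return args[1]
    else:
        print("Too many arguments!")
        return None

# Simulated command lines:
print(parse_url(['NmatDownload.py']))  # usage lines are printed; returns None
print(parse_url(['NmatDownload.py', 'http://www.nature.com/nmat/journal/v16/n4/index.html']))
```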
For example:
```
F:\nmat>python NmatDownload.py
Format: python NmatDownload.py [url]
Example: python NmatDownload.py http://www.nature.com/nmat/journal/v16/n4/index.html
```
My knowledge is limited; if you spot errors or omissions in this article, please do point them out. Email: mozheyang@outlook.com