Scraping dynamic web pages with selenium and phantomjs
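    # Selenium 3 API: webdriver.PhantomJS and find_element_by_* are not
    # available in current Selenium 4 releases.
    from selenium import webdriver

    url = "http://www.dangniao.com/mh/22996/392953.html"
    # The argument is the path to the local PhantomJS executable.
    browser = webdriver.PhantomJS("/usr/local/mysoft/phantomjs-2.1.1/bin/phantomjs")
    browser.get(url)
    # Click the element with class "zsxiaye" so the page's JavaScript runs.
    browser.find_element_by_class_name("zsxiaye").click()
    # Read the URL of the image rendered inside #wdwailian.
    src = browser.find_element_by_css_selector("#wdwailian img").get_attribute("src")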
Commonly used PhantomJS configuration options
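    import random

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # USER_AGENTS (a list of user-agent strings) and phantomjs_driver_path
    # (the path to the PhantomJS executable) must be defined elsewhere.
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # Pick a random user agent for this browser instance.
    dcap["phantomjs.page.settings.userAgent"] = random.choice(USER_AGENTS)
    # Skip image downloads to speed up page loads.
    dcap["phantomjs.page.settings.loadImages"] = False
    # Route traffic through a local SOCKS5 proxy.
    service_args = ['--proxy=127.0.0.1:9999', '--proxy-type=socks5']

    driver = webdriver.PhantomJS(phantomjs_driver_path,
                                 desired_capabilities=dcap,
                                 service_args=service_args)
    driver.implicitly_wait(5)         # seconds to wait when locating elements
    driver.set_page_load_timeout(10)  # seconds before a page load times out
    driver.set_script_timeout(10)     # seconds before an async script times out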
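When several crawlers share the same settings, it can help to build the options in one place. The following is a minimal sketch, not part of the original post: `build_phantomjs_options` and its parameters are hypothetical names, and only the two capability keys and the `--proxy` / `--proxy-type` flags are taken from the snippet above.

```python
import random


def build_phantomjs_options(user_agents, proxy=None, proxy_type="socks5",
                            load_images=False):
    """Return (capabilities, service_args) for webdriver.PhantomJS.

    Hypothetical helper: it only assembles plain Python values and
    does not import or start selenium itself.
    """
    capabilities = {
        # A random user agent per browser instance.
        "phantomjs.page.settings.userAgent": random.choice(user_agents),
        # False skips image downloads.
        "phantomjs.page.settings.loadImages": load_images,
    }
    service_args = []
    if proxy:
        # Only add proxy flags when a proxy address was given.
        service_args += ["--proxy=" + proxy, "--proxy-type=" + proxy_type]
    return capabilities, service_args


caps, args = build_phantomjs_options(
    ["Mozilla/5.0 (X11; Linux x86_64)"], proxy="127.0.0.1:9999")
```

To use the result, copy `DesiredCapabilities.PHANTOMJS` into a dict, call `update(caps)` on it, and pass it together with `args` as `desired_capabilities` and `service_args`.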
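Concurrency problems with PhantomJS
PhantomJS itself still has many bugs when used from multiple threads, so running multiple processes is recommended instead. For multiprocessing, the multiprocessing library is a good choice: it is concise and efficient.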
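    from multiprocessing import Pool

    # get is the worker function that scrapes one URL, and url_list is the
    # list of URLs to scrape; both must be defined elsewhere.
    pool = Pool(8)                      # 8 worker processes
    data_list = pool.map(get, url_list)
    pool.close()                        # no more tasks will be submitted
    pool.join()                         # wait for the workers to exit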
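For a version of the pool pattern that can be run as-is, here is a small sketch. The `get` worker below is a stand-in that only parses the URL; a real worker would presumably start its own PhantomJS instance, load the page, and quit the driver before returning. The `crawl` wrapper and its `workers` parameter are hypothetical names.

```python
from multiprocessing import Pool
from urllib.parse import urlparse


def get(url):
    """Stand-in worker: returns the host name instead of scraping.

    It must be a top-level function so that it can be pickled and
    sent to the worker processes.
    """
    return urlparse(url).netloc


def crawl(url_list, workers=4):
    """Run get() over url_list in a pool of worker processes."""
    # map() blocks until every result is back, in input order;
    # leaving the with-block then shuts the pool down.
    with Pool(workers) as pool:
        return pool.map(get, url_list)
```

Call `crawl(url_list)` from under an `if __name__ == "__main__":` guard, since on platforms that spawn rather than fork new processes the main module is imported again in each worker.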
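PhantomJS processes that do not exit on their own
After the main program exits, selenium does not guarantee that the PhantomJS process has also exited successfully, so it is best to shut the PhantomJS process down explicitly.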
    # Excerpt from a method of the crawler class.
    try:
        self.driver.get(url)
        self.wait_()
        return True
    except Exception as e:
        # Shut PhantomJS down explicitly if loading the page failed.
        self.driver.quit()
        return False
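The snippet above quits the driver only when an exception is raised. One possible way to make sure `quit()` runs on success as well is a context manager with `try`/`finally`. This is a sketch under assumptions, not the author's code: `managed_driver`, `FakeDriver`, and `load` are hypothetical names, and `FakeDriver` stands in for `webdriver.PhantomJS` so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver from create_driver() and always call quit() on it."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions alike.
        driver.quit()


class FakeDriver:
    """Stand-in for webdriver.PhantomJS (assumption, for illustration)."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        # Simulate a page load that fails for malformed URLs.
        if not url.startswith("http"):
            raise ValueError("bad url: " + url)

    def quit(self):
        self.closed = True


def load(url, create_driver=FakeDriver):
    """Return True if the page loaded; the driver is quit either way."""
    try:
        with managed_driver(create_driver) as driver:
            driver.get(url)
        return True
    except Exception:
        return False
```

With a real browser, `create_driver` could be a function that returns `webdriver.PhantomJS(...)`. This variant starts and stops one process per call, which trades speed for a guarantee that no PhantomJS process is left behind.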