I needed a simple HTML-only scraper. (It doesn't run JavaScript, so it won't pull down data loaded via AJAX.) I found an example on another site, thetaranights.com, but it wasn't exactly what I needed: it only pulled the data and printed it to the screen. I added a list of URLs to loop through, and each page is automatically saved to an HTML file named after its URL.
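The naming step can be sketched on its own before it gets mixed in with the login code. This is only a rough illustration: `filename_for` is a hypothetical helper name, not part of mechanize, and the trailing-slash handling and the `index` fallback are assumptions added here, not something the original example does.

```python
from urllib.parse import urlparse


def filename_for(url):
    """Return '<last path segment>.html' for a URL (hypothetical helper)."""
    # Use only the path, so a query string never ends up in the file name,
    # and drop a trailing slash so ".../page/" still yields "page".
    path = urlparse(url).path.rstrip("/")
    # Assumption: fall back to "index" when the URL has no path at all.
    name = path.split("/")[-1] or "index"
    return name + ".html"


urls = [
    "https://this.example.com/some/page",
    "https://this.example.com/some/page2",
]
for url in urls:
    print(url, "->", filename_for(url))
```

Keeping this in a small function makes it easy to check the file names without logging in or fetching anything.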
import mechanize  # pip install mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]

sign_in = br.open("https://this.example.com/login")  # the login URL
br.select_form(nr=0)  # select the form by index; there is only one form in this example, so nr=0
# br.select_form(name="form name")  # alternative to the line above, if the form has a name attribute
br["email"] = "email or username"  # "email" is the name of the form field that takes the username/email
br["password"] = "password"  # "password" is the name of the form field that takes the password
logged_in = br.submit()  # submit the login credentials
logincheck = logged_in.read()  # body of the page you are redirected to after a successful login

urls = ["https://this.example.com/some/page", "https://this.example.com/some/page2"]
for url in urls:
    req = br.open(url).read()  # page body; bytes on Python 3
    filename = url.split('/')[-1] + ".html"  # name the file after the last part of the URL
    with open(filename, 'wb') as f:  # binary mode, because read() returns bytes
        f.write(req)
This produces two files: