
Zero-based introduction to Python crawlers (3)


**Chapter Three**


**0x00 Enough gossip**

Links to the first two installments of the series:

Zero-based introduction to Python crawlers (1) - Zhihu Column

Zero-based introduction to Python crawlers (2) - Zhihu Column

Beginners' tech exchange QQ group: 317784952


Through the first two chapters, I trust you have caught up with me; some of you who started as rookies have already left me a few streets behind, and the real experts dwarf my IQ entirely. In this third chapter we do something practical, something worthy of a crawler: let's grab the images from a Baidu Tieba (Post Bar) post! You may say that right-clicking and choosing Save As works fine, and that I am just being lazy. You may say that saving the whole page would do. But how would that satisfy a neat freak? Big and small images jumbled together, unwanted ones everywhere, an utter mess. So what do you need? A crawler!


**0x001 Baidu Tieba image download: method 1**

The idea: find a feature common to the images you want, download only the images with that feature, and you're done.

We use regular expressions, via the standard module re.

First fetch the page content the same way as in the previous two chapters, then narrow it down by that feature.


# -*- coding: utf-8 -*-
import urllib
import re

def get_content(url):
    """Fetch a page and return its raw HTML."""
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    return content

def get_images(info):
    """Download every post image, i.e. tags like:
    <img class="BDE_Image"
    src="http://imgsrc.baidu.com/forum/w%3D580/sign=69b173e21a950a7b75354ecc3ad3625c/4fff8c1001e939010486a5ea7cec54e737d19607.jpg"
    size="59454" width="421" height="750">
    """
    regex = r'class="BDE_Image" src="(.+?\.jpg)"'  # note: BDE_Image, not DBE_Image
    pat = re.compile(regex)
    images_code = re.findall(pat, info)
    i = 0
    for image_url in images_code:
        print image_url
        urllib.urlretrieve(image_url, '%s.jpg' % i)  # save as 0.jpg, 1.jpg, ...
        i += 1
    print images_code

info = get_content("http://tieba.baidu.com/p/xxx")
get_images(info)



Writing the regular expression is the focus, and the difficulty, of this section, so study the pattern closely.
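The explanatory diagram has not survived in this copy, so here is a minimal sketch of how the pattern behaves, run against the sample img tag from the docstring above (the URL is the one shown there; nothing else is assumed):

# -*- coding: utf-8 -*-
import re

# Sample tag copied from the docstring in the code above.
tag = ('<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/'
       'w%3D580/sign=69b173e21a950a7b75354ecc3ad3625c/'
       '4fff8c1001e939010486a5ea7cec54e737d19607.jpg" '
       'size="59454" width="421" height="750">')

# class="BDE_Image"  anchors the match to Tieba post images only;
# (.+?\.jpg)         captures the URL non-greedily, up to ".jpg".
regex = r'class="BDE_Image" src="(.+?\.jpg)"'
print re.findall(regex, tag)  # prints a one-element list: the image URL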



The girls' photos now sit in a neat row in the folder.




**0x002 Baidu Tieba image download: method 2**


Next up is the crawler weapon of the crawler masters: BeautifulSoup (the name literally means "beautiful soup").


Official documentation:

[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)


Installation guides and first steps:

Python crawler weapon 2: usage of Beautiful Soup

Installing the Beautiful Soup module on Linux and Windows


Both are good tutorials worth picking up. I took the opportunity to install pip, easy_install and lxml as well; the installation is simple, and the web is full of write-ups on it.


Installation process: install the packages, then run a quick import test; if it raises no error, the installation succeeded.
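The original screenshots did not survive, so here is a minimal sketch of the install and the smoke test (package names are as published on PyPI; the shell commands sit in comments since they are not Python):

# In a shell, not in Python (assumes pip is already installed):
#   pip install beautifulsoup4    # the bs4 package
#   pip install lxml              # optional faster parser
# Then the smoke test in a Python interpreter:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>hello</p>')
print soup.p.string  # prints: hello -- no traceback means success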


Then we write the crawler code:


Import it with: from bs4 import BeautifulSoup


Note: if your BeautifulSoup version is 3.x, the import is instead: from BeautifulSoup import BeautifulSoup
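If you are unsure which version you have, a quick check (a sketch; bs4.__version__ is the 4.x package's version string):

import bs4
print bs4.__version__          # 4.x means the import below is the right one
from bs4 import BeautifulSoup  # BeautifulSoup 4.x
# On the legacy 3.x series the import would instead be:
#   from BeautifulSoup import BeautifulSoup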


# -*- coding: utf-8 -*-
import urllib
from bs4 import BeautifulSoup

def get_content(url):
    """Fetch a page and return its raw HTML."""
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    return content

def get_images(info):
    """Download every post image, i.e. tags like:
    <img class="BDE_Image"
    src="http://imgsrc.baidu.com/forum/w%3D580/sign=69b173e21a950a7b75354ecc3ad3625c/4fff8c1001e939010486a5ea7cec54e737d19607.jpg"
    size="59454" width="421" height="750">
    """
    soup = BeautifulSoup(info)
    all_img = soup.find_all('img', class_="BDE_Image")
    x = 1
    for img in all_img:
        print img
        image_name = '%s.jpg' % x
        urllib.urlretrieve(img['src'], image_name)  # save as 1.jpg, 2.jpg, ...
        x += 1
        print type(img)

info = get_content("http://tieba.baidu.com/p/xxx")
get_images(info)



Some of the important lines, explained:
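The annotated diagram is also missing from this copy, so as a substitute, here is a tiny self-contained demo of the two key calls (the HTML string and URL below are made up for illustration):

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '<img class="BDE_Image" src="http://example.com/a.jpg">'  # made-up sample
soup = BeautifulSoup(html)  # parse the raw HTML into a tree of tag objects
# find_all returns every <img> whose class is "BDE_Image"; the trailing
# underscore in class_ avoids clashing with the Python keyword "class".
for img in soup.find_all('img', class_="BDE_Image"):
    print img['src']  # the src attribute -- the URL that urlretrieve() downloads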




Reference article:


BeautifulSoup installation and application - Prefecter


It is definitely a good choice.



**0x003 You noticed the post changed**


Yes, you read that right. If you visit the same Baidu page too frequently, Baidu gets unhappy; you can't keep crawling one household forever. So we found a different post, full of "emoticon emperor" images, and crawled it with the second method; to save time we deliberately chose a post with only two images. No regular expressions are used, so it looks cleaner and more professional. lxml is another crawler weapon, and interested readers should keep digging into it.
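The article doesn't show how to slow down, but if Baidu's unhappiness is a concern, a common courtesy is to pause between requests and send a browser-like User-Agent. A minimal sketch (the header value is a placeholder, not from the article):

# -*- coding: utf-8 -*-
import time
import urllib2

def get_content_politely(url):
    """Fetch a page with a browser-like User-Agent, then pause briefly."""
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    content = urllib2.urlopen(req).read()
    time.sleep(2)  # be gentle: wait a couple of seconds between requests
    return content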




**0x004 Chapter summary**


In this chapter we walked through two ways to grab images from a Baidu Tieba post and download them locally. The second, which loads the third-party module BeautifulSoup, especially deserves the praise. In fact, if you truly started from zero, you should now be able to read the code in other blog tutorials and work out what it is doing. That is the beginning of your study, and also the best way to learn: when you don't understand something, go look it up. I wish you all the best in study and work.




**Postscript:**


As of today, the "zero-based Python crawler" trilogy comes to an end. I believe that if you have really read the articles carefully and followed the ideas in them, then whether you are a novice or a veteran, there will be something to take away. Of course, the veteran's gain is more likely spotting unreasonable code and methods and working out a better way; is that not also a harvest? Or simply finding plenty to mock in this rough text; your criticism is the best guide for me.


The deadline for writing this was close and I was very busy, so why was I still willing to spend so much time on it? Beyond your irresistible likes, it is that you were moved by it. Remember this: people who must be pushed to do things are never masters; even if such a person is stronger than you today, one day you will catch up. Everyone has their own business to attend to, yet some people are willing to gather material for you every day; an article takes ten minutes to read, but how long does it take to write? So be sure to give them a hand up.


There will be a follow-up. Next time I plan to write a small tool around the API of an MD5-decryption website, to batch-decrypt across multiple sites; ready-made tools like this already exist, and I am analyzing how they work. In fact, the crawler road has only just begun.


At the end, a line from the book of the sages echoes in my mind: a tree that fills the arms grows from a tiny shoot; a nine-story terrace rises from a basket of earth; a journey of a thousand miles begins with a single step...




**Recommended reading:**


Python crawler combat (4): Grab Taobao MM photos - Python - Bole Online


Home - Liao Xuefeng's official website


Python crawler weapon 2: usage of Beautiful Soup


BeautifulSoup installation and application - Prefecter


BeautifulSoup4 installation and use - Still Scarecrow - Sina Blog




————