First web crawler

Development

Install Python

First crawler program

Install dependencies

pip install requests

First crawler code

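1. Request

import requests

# Fetch the home page of the demo bookstore site
response = requests.get("http://books.toscrape.com/")

# response.ok is True when the server did not answer with an error status
if response.ok:
    print(response.text)
else:
    print("Request failed")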
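As a rough illustration of what the `response.ok` check decides, the sketch below re-creates the rule as a plain function that runs without a network connection. The name `is_ok` is hypothetical and not part of requests; it assumes that 4xx and 5xx status codes count as failures and everything else counts as success.

```python
def is_ok(status_code: int) -> bool:
    """Hypothetical stand-in for the check behind response.ok.

    Assumption: 4xx (client error) and 5xx (server error) codes count as
    failures; any other code counts as success.
    """
    return not (400 <= status_code < 600)


# Try the rule on a few common status codes
for code in (200, 301, 404, 500):
    print(code, "ok" if is_ok(code) else "failed")
```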

2. Headers

Fetching page information from Douban

1. Send a request

import requests

response = requests.get("https://movie.douban.com/top250")

print(response.status_code)

2. Disguise the request as a browser request

import requests

# A browser-like User-Agent makes the request look like it comes from Chrome
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}

response = requests.get("https://movie.douban.com/top250", headers=headers)

print(response.status_code)
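Since the same headers dict is needed for every request to this site, one option is to build it in a single place. The following is only a sketch: `build_headers` and `BROWSER_USER_AGENT` are hypothetical names, and `Accept-Language` is used purely as an example of an extra header.

```python
# Browser-like User-Agent string, copied from the step above
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
)


def build_headers(extra=None):
    """Hypothetical helper: return request headers with a browser User-Agent.

    `extra` is an optional dict of additional headers; its entries override
    the defaults when keys collide.
    """
    headers = {"User-Agent": BROWSER_USER_AGENT}
    if extra:
        headers.update(extra)
    return headers


# The resulting dict can be passed as the headers= argument of requests.get
print(build_headers())
print(build_headers({"Accept-Language": "en"}))
```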

3. Print the HTML source

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}

response = requests.get("https://movie.douban.com/top250", headers=headers)

# response.text holds the page's HTML as a string
print(response.text)

4. Install the third-party library bs4

pip install bs4

5. Print the title tags

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
# Find every <span> whose class is "title"
all_titles = soup.find_all("span", attrs={"class": "title"})
for title in all_titles:
    # Prints the whole tag, including the surrounding <span> markup
    print(title)

6. Print the title strings

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
all_titles = soup.find_all("span", attrs={"class": "title"})
for title in all_titles:
    # .string gives only the text inside the tag, without the markup
    print(title.string)
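To see what selecting `span` tags with class `title` picks out, without sending any request, here is a small offline sketch. It swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`, so it is not the code used in these notes. `SAMPLE_HTML`, `TitleCollector`, and `collect_titles` are made up for illustration, and the real Douban markup may differ.

```python
from html.parser import HTMLParser

# Made-up markup for illustration only; not copied from the real page
SAMPLE_HTML = (
    '<div>'
    '<span class="title">First Title</span>'
    '<span class="title"> / Alternate Title</span>'
    '<span class="other">Not a title</span>'
    '</div>'
)


class TitleCollector(HTMLParser):
    """Collects the text inside every <span> that has the class "title"."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching span
        if self._in_title:
            self.titles[-1] += data


def collect_titles(html):
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


for text in collect_titles(SAMPLE_HTML):
    print(text)
```

The collected values correspond to what `title.string` returns in the loop: the text only, with no surrounding tag.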

7. Print only the text without a slash

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
all_titles = soup.find_all("span", attrs={"class": "title"})
for title in all_titles:
    title_string = title.string
    # Skip tags with no single string, then keep only strings without "/"
    if title_string and "/" not in title_string:
        print(title_string)
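The slash filter can be tried on its own, separate from the request and parsing steps. This is a minimal sketch: `without_slash` is a hypothetical helper, and the sample values are invented rather than taken from the site.

```python
def without_slash(strings):
    """Hypothetical helper: keep only non-empty strings that contain no slash."""
    return [s for s in strings if s and "/" not in s]


# Made-up sample values; None stands for a tag whose .string is missing
sample = ["First Title", " / Alternate Title", None, "Second Title"]
print(without_slash(sample))
```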

8. Print all pages

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
# start takes the values 0, 25, ..., 225, one request per page
for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    all_titles = soup.find_all("span", attrs={"class": "title"})
    for title in all_titles:
        title_string = title.string
        # Keep only strings without "/"
        if title_string and "/" not in title_string:
            print(title_string)
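The paging logic comes down to the `start` offsets produced by `range(0, 250, 25)`. The sketch below builds the page URLs without requesting them, which makes it easy to check the offsets first. `page_urls` and `BASE_URL` are hypothetical names introduced here.

```python
BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Hypothetical helper: build one URL per page using the start offset."""
    return [f"{BASE_URL}?start={start}" for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```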