First web crawler

Development

Install Python

First crawler program

Install dependencies

pip install requests

First crawler code

1. Send a request

import requests

# Fetch the page; response.ok is True for status codes below 400.
response = requests.get("http://books.toscrape.com/")

if response.ok:
    print(response.text)
else:
    print("Request failed")
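
The `response.ok` check depends only on the status code. As a minimal offline sketch, the snippet below builds `requests.Response` objects by hand (an assumption made purely for illustration; real responses come from `requests.get`) to show which codes count as success. The helper name `is_ok` is hypothetical.

```python
import requests


def is_ok(status_code: int) -> bool:
    """Return response.ok for a hand-built response with the given status code."""
    # Assumption: a Response is constructed manually only to avoid a network call.
    response = requests.Response()
    response.status_code = status_code
    return response.ok


# Redirects (3xx) still count as "ok"; only 4xx and 5xx codes do not.
results = {code: is_ok(code) for code in (200, 301, 404, 500)}
for code, ok in results.items():
    print(code, ok)
```

For a stricter check, `response.raise_for_status()` raises an exception on 4xx and 5xx codes instead of returning a boolean.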

2. Request headers

Fetch page information from Douban

1. Send a request

import requests

# No custom headers yet, so the site may reject the request.
response = requests.get("https://movie.douban.com/top250")

print(response.status_code)

2. Disguise the request as a browser

import requests

# Send a browser-like User-Agent so the request looks like it comes from Chrome.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}

response = requests.get("https://movie.douban.com/top250", headers=headers)

print(response.status_code)
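
To see why the header matters, it can help to compare what `requests` sends by default with what it sends once `headers` is set. The sketch below prepares both requests without sending them, so it runs offline; the variable names `plain` and `disguised` are illustrative only.

```python
import requests

URL = "https://movie.douban.com/top250"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
)

# A Session merges its default headers into each prepared request,
# which mirrors what requests.get() does before sending.
session = requests.Session()

# Prepared but never sent: no network traffic happens here.
plain = session.prepare_request(requests.Request("GET", URL))
disguised = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": BROWSER_UA})
)

print(plain.headers["User-Agent"])      # identifies itself as python-requests
print(disguised.headers["User-Agent"])  # the browser string set above
```

The default value announces that the client is a script, which is presumably what some sites use to turn such requests away.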

3. Print the HTML source

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}

response = requests.get("https://movie.douban.com/top250", headers=headers)

# response.text is the decoded HTML of the page.
print(response.text)

4. Install the third-party library bs4

pip install bs4

5. Print the title tags

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
# Collect every <span class="title"> element (find_all replaces the older findAll).
all_titles = soup.find_all("span", attrs={"class": "title"})
for title in all_titles:
    print(title)  # prints the whole tag, markup included

6. Print the tag strings

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
all_titles = soup.find_all("span", attrs={"class": "title"})
for title in all_titles:
    print(title.string)  # prints only the text inside the tag
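
The difference between printing a tag and printing its `.string` is easier to see on a small, fixed input. The sketch below parses a made-up HTML fragment (the markup and movie names are assumptions, loosely modeled on a list page with two `span.title` elements per entry) rather than the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; not copied from any real page.
sample_html = (
    '<div class="hd">'
    '<span class="title">Example Movie</span>'
    '<span class="title">&nbsp;/&nbsp;Example Movie (Alternate Title)</span>'
    "</div>"
)

soup = BeautifulSoup(sample_html, "html.parser")
spans = soup.find_all("span", attrs={"class": "title"})

# str(tag) keeps the markup; tag.string returns only the inner text.
as_tags = [str(span) for span in spans]
as_strings = [span.string for span in spans]

for tag_text, inner in zip(as_tags, as_strings):
    print(tag_text, "->", repr(inner))
```

Note that the parser turns `&nbsp;` into the non-breaking space character `\xa0`, which is why the alternate title still contains a plain `/` that can be tested for.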

7. Print only the text without a slash

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
all_titles = soup.find_all("span", attrs={"class": "title"})
for title in all_titles:
    title_string = title.string
    # .string is None when a tag has nested children, so guard before testing for "/".
    if title_string and "/" not in title_string:
        print(title_string)
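
The slash filter itself is plain string logic and can be tried without any network access. A minimal sketch, using invented sample strings and a hypothetical helper name `primary_titles`:

```python
from typing import Iterable, List, Optional


def primary_titles(strings: Iterable[Optional[str]]) -> List[str]:
    """Keep non-empty strings that do not contain a slash."""
    # None entries stand in for tags whose .string is None.
    return [s for s in strings if s and "/" not in s]


# Hypothetical values shaped like what title.string might yield.
sample_strings = [
    "Example Movie",
    "\xa0/\xa0Example Movie (Alternate Title)",
    None,
    "Another Movie",
]

titles = primary_titles(sample_strings)
print(titles)
```

A title that legitimately contains a slash would also be dropped by this rule, so it is a heuristic rather than an exact test.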

8. Print all pages

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
# Each page lists 25 movies, so start takes the values 0, 25, ..., 225.
for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    all_titles = soup.find_all("span", attrs={"class": "title"})
    for title in all_titles:
        title_string = title.string
        # Skip missing strings and alternate titles that contain a slash.
        if title_string and "/" not in title_string:
            print(title_string)
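
The paging loop relies on `range(0, 250, 25)` producing one offset per page. As a quick sanity check, the sketch below only builds the URLs and sends nothing; the constant and function names are illustrative.

```python
from typing import List

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25      # movies per page, as used in the loop
TOTAL_MOVIES = 250  # size of the full list


def page_urls(total: int = TOTAL_MOVIES, page_size: int = PAGE_SIZE) -> List[str]:
    """Build one URL per page by stepping the start offset by the page size."""
    return [f"{BASE_URL}?start={start}" for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

When crawling for real, pausing briefly between pages (for example with `time.sleep`) is a common courtesy to the server.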