How do you configure Nginx to route crawlers to prerendering only for specific URLs?
admin
July 2, 2026, 12:09
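Below are two ready-to-copy Nginx configurations. In both:
Only the specified exact URLs or route prefixes are routed to the prerender service when the visitor is a crawler;
all other pages and static assets are still served normally from the front-end static build (index.html).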
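Core idea
Flag crawler requests by User-Agent at the server level. Then, only inside the location blocks that should be prerendered, proxy flagged requests to the prerender service. Everything else falls through to try_files and the SPA's index.html.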
Complete, ready-to-copy Nginx server configuration
server {
    listen 80;
    server_name yourdomain.com www.yourdomain.com;
    root /data/www/your-dist;
    index index.html;

    # Flag crawler requests by User-Agent (case-insensitive match)
    set $is_crawler 0;
    if ($http_user_agent ~* "Baiduspider|Googlebot|360Spider|SogouSpider|Bytespider|bingbot|YandexBot|DuckDuckBot|spider|bot|crawler") {
        set $is_crawler 1;
    }

    # Static assets are always served from disk
    location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|map)$ {
        expires 7d;
        try_files $uri =404;
    }

    # Dots are escaped so they match a literal "."
    location ~ /(robots\.txt|sitemap\.xml)$ {
        try_files $uri =404;
    }

    # Prefix match: /list/* is prerendered for crawlers
    location ^~ /list/ {
        # proxy_set_header and the proxy timeouts are not allowed inside "if",
        # so they are set at location level and inherited by the "if" block.
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 10s;
        proxy_read_timeout 10s;
        if ($is_crawler = 1) {
            proxy_pass http://127.0.0.1:3000;
            break;
        }
        try_files $uri $uri/ /index.html;
    }

    # Prefix match: /article/* is prerendered for crawlers
    location ^~ /article/ {
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 10s;
        proxy_read_timeout 10s;
        if ($is_crawler = 1) {
            proxy_pass http://127.0.0.1:3000;
            break;
        }
        try_files $uri $uri/ /index.html;
    }

    # Exact match: only /about itself is prerendered for crawlers
    location = /about {
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 10s;
        proxy_read_timeout 10s;
        if ($is_crawler = 1) {
            proxy_pass http://127.0.0.1:3000;
            break;
        }
        try_files $uri $uri/ /index.html;
    }

    # Everything else: normal SPA fallback, no crawler routing
    location / {
        try_files $uri $uri/ /index.html;
    }

    # Never cache the SPA entry page
    location = /index.html {
        expires -1;
    }
}
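As an alternative to the server-level set/if pair above, the crawler flag can be sketched with a map block. This is only a sketch, assuming you can edit the http {} context (map is not allowed inside server {}). If you use it, remove the set $is_crawler 0; line and the User-Agent if block from the server so the variable is defined in one place only.

```nginx
# Place inside http { ... }, outside any server block.
# map is evaluated lazily, only when $is_crawler is actually read.
map $http_user_agent $is_crawler {
    default 0;
    # "~*" marks a case-insensitive regular expression
    "~*Baiduspider|Googlebot|360Spider|SogouSpider|Bytespider|bingbot|YandexBot|DuckDuckBot|spider|bot|crawler" 1;
}
```

The location blocks stay unchanged, since they only read $is_crawler.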
Matching rules (adjust the paths as needed)
1. Prefix match: ^~ /xxx/
    location ^~ /product/ { ... }
2. Exact match: location = /page
3. Regex match (for several scattered paths)
    location ~ ^/(news|goods|case)/ {
        # proxy_set_header and the proxy timeouts are not allowed inside "if",
        # so they are set at location level and inherited by the "if" block.
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 10s;
        proxy_read_timeout 10s;
        if ($is_crawler = 1) {
            proxy_pass http://127.0.0.1:3000;
            break;
        }
        try_files $uri $uri/ /index.html;
    }
This matches every route under /news/, /goods/, and /case/.
Using this on an HTTPS (443) site
Copy the whole set (the set $is_crawler detection block plus each location routing rule) into the server block that has listen 443 ssl.
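A minimal sketch of what that HTTPS server block might look like, shown with a single routed prefix. The certificate paths are hypothetical placeholders and the User-Agent pattern is shortened for brevity. The domain, root, and upstream address are taken from the HTTP example.

```nginx
server {
    listen 443 ssl;
    server_name yourdomain.com www.yourdomain.com;

    # Hypothetical certificate paths; replace with your own files
    ssl_certificate     /etc/nginx/ssl/yourdomain.com.pem;
    ssl_certificate_key /etc/nginx/ssl/yourdomain.com.key;

    root /data/www/your-dist;
    index index.html;

    # Same crawler detection as in the port-80 server (pattern shortened here)
    set $is_crawler 0;
    if ($http_user_agent ~* "spider|bot|crawler") {
        set $is_crawler 1;
    }

    location ^~ /article/ {
        # Header and timeout directives live at location level, not inside "if"
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # $scheme is "https" here, so the prerender service sees the real protocol
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 10s;
        proxy_read_timeout 10s;
        if ($is_crawler = 1) {
            proxy_pass http://127.0.0.1:3000;
            break;
        }
        try_files $uri $uri/ /index.html;
    }

    location / {
        try_files $uri $uri/ /index.html;
    }
}
```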
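Testing that it works
1. A routed page (crawler UA)
    curl -A "Baiduspider" http://yourdomain.com/article/123
    Should return the rendered HTML from the prerender service.
2. A page that is not routed (crawler UA)
    curl -A "Baiduspider" http://yourdomain.com/user
    Returns only the original, empty SPA HTML; the request is not forwarded to the prerender service.
3. Any page as a normal user
    curl -A "Mozilla/5.0" http://yourdomain.com/article/123
    Should return the original SPA HTML, the same as for any other visitor.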
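To make these checks easier to read, one option is to expose the crawler flag as a response header while debugging. This is a sketch only: X-Is-Crawler is a made-up header name, and the directive should be removed once testing is done.

```nginx
# Place inside the server { ... } block.
# "always" adds the header to error responses as well.
# Server-level add_header is inherited only by locations that define
# no add_header of their own, which holds for the configuration above.
add_header X-Is-Crawler $is_crawler always;
```

With this in place, curl -sI -A "Baiduspider" http://yourdomain.com/article/123 should show X-Is-Crawler: 1, and the same request with a browser UA should show 0. The header only reports how Nginx classified the User-Agent; whether the response body came from the prerender service still has to be checked from the HTML itself.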
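Optimization: factor out the shared proxy settings (less duplicated code)
Add a shared named location inside the server{} block (location is not allowed directly in http{}), and have every routed location jump to it. The return directive cannot target a named location, so the usual idiom is to return an otherwise unused status code (418 here) and map that code to the named location with error_page:
server {
    # ... listen, server_name, root, and the $is_crawler detection as above ...

    # Requests that "return 418" are handed to @prerender internally
    error_page 418 = @prerender;

    location @prerender {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 10s;
        proxy_read_timeout 10s;
    }

    location ^~ /article/ {
        if ($is_crawler = 1) {
            return 418;
        }
        try_files $uri $uri/ /index.html;
    }
}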
This article was edited on 2026/7/2 12:09:07.