Recursively Crawling an Entire Website with Wget

GNU Wget is a computer program that retrieves content from web servers. It is part of the GNU Project. Its name derives from World Wide Web and get. It supports downloading via HTTP, HTTPS, and FTP.

— Wikipedia

Wget is powerful, and it works well for recursively crawling a website's content. It is certainly quicker than writing your own crawler.

Below is a Bash script that uses Wget to crawl content. It is based on: http://www.linuxjournal.com/content/downloading-entire-web-site-wget

#!/bin/bash

###
# Get website contents recursively
#
# @author YanWen <i@yanwen.email>
# @modified 2017-11-30
# @references http://www.linuxjournal.com/content/downloading-entire-web-site-wget
###

###
# Download a website with wget (supports only one domain)
#
# @param string href
# @param string domain
# @return none
###
function wget_get() {
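  # Check that the wget command is available
  if command -v wget > /dev/null 2>&1; then
    if [ "$#" -ne 2 ]; then
      echo 'Error: expected exactly 2 arguments (href and domain)' >&2
      return 1
    else
      # Keep the variables local so they do not leak into the caller's scope
      local href=$1 domain=$2

      echo "wget $href in domains $domain..."
      # Quote the arguments so URLs containing special characters survive word splitting
      wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains "$domain" --no-parent "$href"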

        #--recursive        # NOTE: Follow links and download recursively
        #--no-clobber       # NOTE: Don't overwrite any existing files
        #--page-requisites  # NOTE: Get all the elements that compose the page (images, CSS, JS)
        #--html-extension   # NOTE: Save HTML pages with an .html suffix (newer Wget releases name this --adjust-extension)
        #--convert-links    # NOTE: Convert links so that they work locally
        #--restrict-file-names=windows  # NOTE: Modify filenames so that they will work in Windows as well
        #--domains $domain  # NOTE: Don't follow links outside the domains
        #--no-parent        # NOTE: Don't follow links outside the directory (never ascend above $href)
      echo "Done"
    fi

  else
    echo 'Error: wget is not installed' >&2
    return 1
  fi
}

# Run
if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <href> <domain>" >&2
  exit 1
else
  wget_get "$1" "$2"
fi

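Of course, this simple script can only recursively crawl content under a single specified domain. Restricting the crawl to several domains requires further changes.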
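
One possible way to extend the script to several domains is sketched below. Wget's --domains option takes a comma-separated list, but it does not by itself allow the crawl to leave the starting host, so the sketch also passes --span-hosts. The names wget_get_multi, join_domains, and the DRY_RUN variable are hypothetical and not part of the original script; treat this as a minimal sketch under those assumptions rather than a finished tool.

```bash
#!/bin/bash

# Join the given domain names with commas, the list format --domains expects.
join_domains() {
  local IFS=,
  echo "$*"
}

# Hypothetical multi-domain variant of wget_get.
# Usage: wget_get_multi <href> <domain> [<domain> ...]
# Set DRY_RUN=1 to print the wget command instead of running it.
wget_get_multi() {
  if [ "$#" -lt 2 ]; then
    echo 'Usage: wget_get_multi <href> <domain> [<domain> ...]' >&2
    return 1
  fi

  local href=$1
  shift
  local domains
  domains=$(join_domains "$@")

  # Rebuild the positional parameters as the full wget command line.
  # --span-hosts lets the recursion leave the starting host;
  # --domains then limits it to the listed domains.
  set -- wget --recursive --no-clobber --page-requisites --html-extension \
    --convert-links --restrict-file-names=windows \
    --span-hosts --domains "$domains" --no-parent "$href"

  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, after sourcing the file, DRY_RUN=1 wget_get_multi http://example.com/docs/ example.com cdn.example.com prints the command that would be run, which is a cheap way to check the flags before starting a long crawl.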

Its real strength is that it also pulls down resources such as CSS and JS files.
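
One way to confirm that a crawl really saved CSS, JS, and similar resources is to tally the downloaded files by extension. The helper below is a hypothetical addition (count_by_extension is not part of the original script), and it assumes the crawl was saved under the directory passed as its argument.

```bash
#!/bin/bash

# Tally the files under a directory by (lowercased) extension.
# Hypothetical helper, not part of the original script.
# Usage: count_by_extension <dir>
count_by_extension() {
  local dir=${1:-.}
  # -name '*.*' keeps only files whose own name contains a dot,
  # so the greedy sed pattern strips everything up to the last dot.
  find "$dir" -type f -name '*.*' \
    | sed 's/.*\.//' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn
}
```

Running count_by_extension example.com after a crawl lists one line per extension with its file count. Files saved with a query string in their name (for instance when --restrict-file-names=windows rewrites the ? character) show up under the full suffix after the last dot, so they may appear as separate entries.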

References:

  1. linuxjournal.com – downloading-entire-web-site-wget
  2. Wikipedia – Wget

Author: V

Web Dev
