Nggiahao/crawler: how to create a web crawler using PHP
Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting, etc.) is a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database in table (spreadsheet) format.
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
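To make the "extract, then save in table format" idea concrete, here is a minimal sketch that uses only PHP's built-in DOMDocument and DOMXPath classes, with no third-party package. The sample HTML, the function names extractImageUrls and saveAsCsv, and the output file images.csv are all assumptions made for illustration; a real crawler would first fetch the HTML over HTTP.

```php
<?php
// Sketch only: the markup, function names and output path are hypothetical.

function extractImageUrls(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $urls = [];

    // Every <img> element that carries a src attribute.
    foreach ($xpath->query('//img[@src]') as $img) {
        $urls[] = $img->getAttribute('src');
    }

    // The Open Graph image lives in the content attribute of a <meta> tag.
    foreach ($xpath->query('//meta[@property="og:image"][@content]') as $meta) {
        $urls[] = $meta->getAttribute('content');
    }

    // Drop duplicates and re-index the array from zero.
    return array_values(array_unique($urls));
}

function saveAsCsv(array $urls, string $path): void
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    fputcsv($handle, ['image_url'], ',', '"', '\\'); // header row
    foreach ($urls as $url) {
        fputcsv($handle, [$url], ',', '"', '\\');
    }
    fclose($handle);
}

$sampleHtml = <<<HTML
<html>
  <head><meta property="og:image" content="https://example.com/cover.png"></head>
  <body>
    <img src="/logo.png" alt="logo">
    <img src="/banner.jpg" alt="banner">
    <img src="/logo.png" alt="logo again">
  </body>
</html>
HTML;

$urls = extractImageUrls($sampleHtml);
saveAsCsv($urls, sys_get_temp_dir() . '/images.csv');
print_r($urls);
```

The repeated /logo.png entry is stored only once, and the resulting CSV file can be opened directly in a spreadsheet. Relative paths such as /logo.png are kept as found; resolving them against the page URL is left out of this sketch.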
Crawling with Simple HTML DOM Parser
PHP Simple HTML DOM Parser has an adaptation for Composer and PSR-0. To crawl this way, use the following package.
Install with Composer:
composer require sunra/php-simple-html-dom-parser
Usage:
require 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;
$url = "https://obatambeienwasirherbal.com.com";
// get html from URL (file_get_html fetches the page; str_get_html only parses an HTML string)
$dom = HtmlDomParser::file_get_html($url);
// get all images on the page
$images = [];
$imagesElements = $dom->find('img');
foreach ($imagesElements as $imagesElement) {
    $images[] = $imagesElement->getAttribute('src');
}
// get meta og:image on the page (the URL is stored in the content attribute)
$metaElements = $dom->find("meta[property='og:image']");
foreach ($metaElements as $metaElement) {
    $images[] = $metaElement->getAttribute('content');
}
For more information about Simple HTML DOM Parser, read this documentation:
https://simplehtmldom.sourceforge.io/manual.htm
Crawling with Goutte
Goutte is a screen scraping and web crawling library for PHP.
Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.
Install with Composer:
composer require fabpot/goutte
Usage:
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
// get html from URL
$crawler = $client->request('GET', 'https://www.laravel.com/');
// get all images on laravel.com and print each src on its own line
$crawler->filter('img')->each(function ($node) {
    print $node->attr('src') . "\n";
});
For more information about Goutte, read this documentation:
https://goutte.readthedocs.io/en/latest/
Compare Goutte and Simple HTML DOM Parser
You can see the differences between the two approaches and choose the one that suits you better.