PHP Website Scanners And MySQL Database Reconstruction

There is often a need for custom PHP Programs that scan external websites, retrieve their content, and reconstruct it in a MySQL Database. The content types and formats are unlimited, but the major goal is the same: to collect all data and objects into a local format for reuse and distribution. The most common targets are retail websites and manufacturers of retail products.

ABC Company wants to resell widgets Online. It has the infrastructure for a PHP eCommerce Web Site and a MySQL database for data handling. Widgets are manufactured by Widgets-R-Us, who are old school: they prioritize brick-and-mortar sales over the Internet and try to remain in the Dark Ages. Therefore, they distribute their whole product list mainly in print. Their B2B website allows retailers to search for products and place orders, and is obviously powered by a database. Yet, when you call Widgets-R-Us, they make ignorant claims that there is no electronic list anywhere and you must work from their printed catalog.

Widgets-R-Us is trying to prevent the easy distribution of Widgets Online, and therefore lies to its sales staff as well as its retail community. They are shooting themselves in the foot, but hey, that’s their own fault. Let’s build a scanner to cruise their entire inventory Online and retrieve all the JPEG photos, Flash objects, and any other support artwork we need to reconstruct their entire database-driven B2B site. We can do the same with their public front end, if one is available.

The scanner is a custom build devoted to the structure of the site we’re scanning. The B2B site uses a repetitive query structure that moves from record to record in their MySQL database. (We don’t really know if they have a MySQL database, but we like to think they have at least that much sense.) We build the PHP Program to cycle through each record in their database, starting at the low end of the record count.

The query that retrieves the database records is sent via either a GET or a POST request, and we must discover that query structure and how it iterates. A simple GET query may have the form http://www.widgetsrus.com/product.php?id=345, and we need only cycle the id value from 1 through the top end of the discovered total count. We can manually change the id value in the GET query string until we know there are no more records, and so define the top end id value, let’s say 1000.
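As a minimal sketch of that iteration, the snippet below builds the GET URL for every id from 1 to 1000. The function names buildProductUrl and buildProductUrls are made up for illustration; the URL and the id parameter are the ones from the example above.

```php
<?php
// Hypothetical helper: builds the GET URL for a single product id.
function buildProductUrl(string $baseUrl, int $id): string
{
    // http_build_query() handles the URL encoding of the query string.
    return $baseUrl . '?' . http_build_query(['id' => $id]);
}

// Builds the URLs for ids $first through $last inclusive, keyed by id.
function buildProductUrls(string $baseUrl, int $first, int $last): array
{
    $urls = [];
    for ($id = $first; $id <= $last; $id++) {
        $urls[$id] = buildProductUrl($baseUrl, $id);
    }
    return $urls;
}

// 1000 is the top end id value we discovered by hand.
$urls = buildProductUrls('http://www.widgetsrus.com/product.php', 1, 1000);
echo $urls[345], PHP_EOL; // http://www.widgetsrus.com/product.php?id=345
```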

If the B2B system uses POST requests, we can discover the structure by viewing the page source code, or by placing a sniffer on our computer and capturing a POST submit. The first thing to try is whether the POST data can be sent as GET data instead, to simplify the PHP Program. We’ve often found that we can convert POST requests into GET requests with the same results.
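Here is a rough sketch of both options, assuming the captured POST submit contained two fields. The field names (id, action) and the function names are invented for illustration; the real ones come from the form you sniffed. The POST fallback uses PHP's standard HTTP stream context.

```php
<?php
// Assumed form fields captured from a POST submit (hypothetical names).
$fields = ['id' => 345, 'action' => 'view'];

// Try this first: send the same fields as a plain GET query string.
function postFieldsToGetUrl(string $url, array $fields): string
{
    return $url . '?' . http_build_query($fields);
}

// Fallback: stream context options that send the fields as a real POST.
function buildPostContextOptions(array $fields): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($fields),
            'timeout' => 30,
        ],
    ];
}

// Performs the POST and returns the page source, or false on failure.
function fetchViaPost(string $url, array $fields)
{
    $context = stream_context_create(buildPostContextOptions($fields));
    return file_get_contents($url, false, $context);
}

echo postFieldsToGetUrl('http://www.widgetsrus.com/product.php', $fields), PHP_EOL;
// http://www.widgetsrus.com/product.php?id=345&action=view
```

If the GET version returns the same page as the form submit, the rest of the scanner never needs fetchViaPost at all.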

GET request scanners are a bit simpler than POST request scanners. The POST request scanner requires slightly more complex code to handle the POST, whereas the GET request scanner simply structures the GET request as a normal query string. This is where our scanner starts collecting data. We create an iterative loop starting at the first id value of 1 and ending at the max id value, which is 1000 in this sample. On each iteration, we perform a standard web page collection that downloads the source code of the results page for the id value submitted in that iteration.
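The loop might look something like the sketch below. The name scanRange is invented, and the fetcher is passed in as a callable so the loop can be tried against a fake before pointing it at a live site. The real fetcher shown here relies on file_get_contents() with a URL, which needs allow_url_fopen enabled; cURL would work just as well.

```php
<?php
// Loops from $first to $last and collects the page source for each id.
// $fetch is any callable that takes a URL and returns the HTML, or false.
function scanRange(string $baseUrl, int $first, int $last, callable $fetch): array
{
    $pages = [];
    for ($id = $first; $id <= $last; $id++) {
        $html = $fetch($baseUrl . '?id=' . $id);
        if ($html === false || $html === '') {
            continue; // nothing came back for this id, move on to the next
        }
        $pages[$id] = $html;
    }
    return $pages;
}

// A real fetcher: returns the page source, or false if the request fails.
$httpFetch = function (string $url) {
    return @file_get_contents($url);
};

// Usage (not run here, since it would hit the live site 1000 times):
// $pages = scanRange('http://www.widgetsrus.com/product.php', 1, 1000, $httpFetch);
```

In practice you would parse and store each page inside the loop rather than hold 1000 pages in memory, and add a short sleep() between requests.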

On older sites the id value of 1 may no longer return anything, because that product no longer exists and the site removed its page, ignoring good SEO practice. The first source code we do get back gives us the details of the first product. It is most likely that the source code is a template that is assembled with MySQL Database data for display, and contains our product name, price, description, etc. We now use Regex filtration to parse out the chunks of data we need to collect. Each parse returns one piece of data, which is collected into a variable in our PHP Program. Once we finish parsing the source code and have collected our Name, Retail, Description, JPEG_Path, etc., we can start the database reconstruction.
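To make the Regex step concrete, here is a small sketch run against a made-up product page. The markup and class names are pure assumptions; every target site has its own template, and the patterns must be written to match it. The function name parseProduct is also invented.

```php
<?php
// Assumed markup: these tags and class names are hypothetical.
$html = '<h1 class="name">Blue Widget</h1>'
      . '<span class="price">$19.95</span>'
      . '<div class="desc">A sturdy blue widget.</div>'
      . '<img class="photo" src="/images/blue-widget.jpg">';

// Runs one pattern per field and returns the captured values.
// A field whose pattern does not match is set to null.
function parseProduct(string $html): array
{
    $patterns = [
        'Name'        => '~<h1 class="name">(.*?)</h1>~s',
        'Retail'      => '~<span class="price">\$?([0-9.,]+)</span>~',
        'Description' => '~<div class="desc">(.*?)</div>~s',
        'JPEG_Path'   => '~<img class="photo" src="([^"]+)"~',
    ];
    $product = [];
    foreach ($patterns as $field => $pattern) {
        $product[$field] = preg_match($pattern, $html, $m)
            ? trim(html_entity_decode($m[1]))
            : null;
    }
    return $product;
}

print_r(parseProduct($html));
```

Returning null for a missing field makes it easy to spot pages where the template differs from what the patterns expect.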

The MySQL record creation can be done before or after the page objects are retrieved. There may be JPEG Photos, GIF Images, Flash Media, MP3s, and other object types that need to be collected, tagged and stored. Each must be parsed from the source code, and the PHP Program designed to download the objects to specified directories. Whether they are collected into the same or separate directories, and under what naming scheme, is up to the PHP Programmer. An advantage of inserting the new record into the database after collecting the media objects is that we can verify that each object was actually collected. The object may be renamed to fit some new naming structure, or left with its original name. Either way, the database record should be tagged so our own product details page knows whether to try to display the object.
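One possible shape for the collect-then-insert order is sketched below. Everything named here is an assumption made for illustration: the id-based file naming, the products table and its columns, and the function names. The insert uses a PDO prepared statement.

```php
<?php
// Picks a local file name: "<id>.<original extension>". One scheme of many.
function localObjectName(int $id, string $remotePath): string
{
    $path = (string) parse_url($remotePath, PHP_URL_PATH);
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $id . ($ext !== '' ? '.' . $ext : '');
}

// Downloads one object into $dir and reports whether it was really stored.
// $fetch is any callable that takes a URL and returns the bytes, or false.
function collectObject(string $url, string $dir, string $name, callable $fetch): bool
{
    $data = $fetch($url);
    if ($data === false || $data === '') {
        return false;
    }
    return file_put_contents($dir . '/' . $name, $data) !== false;
}

// Inserts the record after collection, tagging whether the photo exists.
// The table and column names are hypothetical.
function insertProduct(PDO $pdo, int $id, array $product, string $file, bool $hasPhoto): bool
{
    $stmt = $pdo->prepare(
        'INSERT INTO products (source_id, name, retail, description, photo_file, has_photo)
         VALUES (:source_id, :name, :retail, :description, :photo_file, :has_photo)'
    );
    return $stmt->execute([
        ':source_id'   => $id,
        ':name'        => $product['Name'],
        ':retail'      => $product['Retail'],
        ':description' => $product['Description'],
        ':photo_file'  => $file,
        ':has_photo'   => $hasPhoto ? 1 : 0,
    ]);
}

echo localObjectName(345, '/images/blue-widget.jpg'), PHP_EOL; // 345.jpg
```

The details page can then check has_photo before it emits an image tag, instead of linking to a file that was never downloaded.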

The same process is repeated for every id value in the B2B website, or whatever the target site is. The MySQL database record creation on our end is designed around our own requirements for data searching and details display. Make sure the permissions for the object directories are configured correctly before the scanner runs through all its iterations; otherwise, when it finishes, you may have no objects to use, and you would have to run the whole scan again.
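A quick pre-flight check along these lines can catch a permissions problem before the first request goes out. The directory layout and the function name findUnwritableDirs are assumptions for the sake of the example.

```php
<?php
// Returns the directories that are missing or not writable, so the scan
// can be called off before any downloading starts.
function findUnwritableDirs(array $dirs): array
{
    $bad = [];
    foreach ($dirs as $dir) {
        if (!is_dir($dir) || !is_writable($dir)) {
            $bad[] = $dir;
        }
    }
    return $bad;
}

// Hypothetical object directories, relative to the scanner script.
$objectDirs = [__DIR__ . '/objects/jpeg', __DIR__ . '/objects/flash'];

$bad = findUnwritableDirs($objectDirs);
if ($bad) {
    echo 'Fix these directories before scanning: ', implode(', ', $bad), PHP_EOL;
} else {
    echo 'All object directories are writable.', PHP_EOL;
}
```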

The PHP Web Scanner can be further modified to rescan the same site looking for discontinued products, new products, and price changes. The PHP Program should create a buffer database, or a buffer flat file, or do a comparison check for each record, and flag each record as the same, updated, absent or new. The flag value can be used in the PHP website to discern between archival records kept for SEO purposes, new items to display in important locations, and other valid records.
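The comparison check could be as simple as the sketch below, which assumes both the stored records and the fresh scan are arrays keyed by the source id. The function name classifyRecords and the sample data are invented for illustration.

```php
<?php
// Compares stored records with a fresh scan, both keyed by source id.
// Returns id => 'same' | 'updated' | 'absent' | 'new'.
function classifyRecords(array $stored, array $scanned): array
{
    $status = [];
    foreach ($scanned as $id => $record) {
        if (!array_key_exists($id, $stored)) {
            $status[$id] = 'new';
        } elseif ($stored[$id] == $record) {
            // Loose array comparison: same keys and equal values.
            $status[$id] = 'same';
        } else {
            $status[$id] = 'updated';
        }
    }
    foreach ($stored as $id => $record) {
        if (!array_key_exists($id, $scanned)) {
            $status[$id] = 'absent'; // in our database, gone from the site
        }
    }
    return $status;
}

// Sample data, made up for the example.
$stored = [
    1 => ['Name' => 'Blue Widget', 'Retail' => '19.95'],
    2 => ['Name' => 'Red Widget',  'Retail' => '24.00'],
    3 => ['Name' => 'Old Widget',  'Retail' => '5.00'],
];
$scanned = [
    1 => ['Name' => 'Blue Widget',  'Retail' => '19.95'],
    2 => ['Name' => 'Red Widget',   'Retail' => '26.00'],
    4 => ['Name' => 'Green Widget', 'Retail' => '12.50'],
];

print_r(classifyRecords($stored, $scanned));
```

Here record 1 comes back as same, 2 as updated (the price changed), 4 as new, and 3 as absent.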

PHP Programs can be developed to perform many web scanning functions. Good planning before PHP Programming is important, as is a well-planned MySQL Database. If you want PHP Web Scanners for your business or eCommerce Web Site Design, contact The PHP Kemist for Web and Database Programming.
