Alright, so you want a website that displays live stock quotes? Or maybe you want it to download and save financial reports off from the SEC (Security and Exchange Commission) database? In this tutorial, I show you how to do it in 10 minutes using PHP.
Alright, for the novices out there, I suggest you read my article on how the internet works before we begin and the first few pages of w3 school’s html tutorial.
Alright, lets begin. Scraping involves several steps:
- Downloading the contents of the other page
- Interpreting/Reading the other page
- (Possibly) Using information gained to go back to #1
For example, Google “spiders” through every page on the internet. How does it do this? Well, it follows the steps above and, in number 2, it searches for “links” within each page and then uses those links to loop back to number one. Simple, right?
Alright, so lets pick a site that we want to scrape. How about we scrape tutorials off from w3schools. Below is the contents of http://www.w3schools.com/ in plain html code. To view this, go to http://www.w3schools.com/ and then click view->source in your browser. (The verbiage is a little different in each browser, but same basic idea). I have chosen to not only show the html code that interests us by searching for the specific html code that we want to read first.
Alright, so upon basic analysis, what do we find? Well, this portion of code STARTS with “<td id=”leftcolumn” width=”150″ valign=”top” align=”left” style=”padding:4px;border:none”>” and ends with “<table border=”0″ width=”100%” cellpadding=”0″ cellspacing=”0″>“. So, if we wanted to “grab” this section of html code from the whole of the site, then we’d be all set.
Alright, so lets go back to our steps again.
- Downloading the contents of the other page
- Interpreting/Reading the other page
Downloading the contents of the other page in PHP is easy. We can just use the file_get_contents function like this:
$pagecontents = file_get_contents("http://www.w3schools.com/");
The above code assigns the php variable $pagecontents the contents of w3schools.com (in html, of course).
Now, we need to “grab” the html code of interest. To do this, we need to write a function that can search for the “start”, the “end” and then grab what is in between. Here is a php function that does just that:
function getBetween($str, $start, $end, $searchpos=0) {
$startpos = strpos($str, $start, $searchpos);
if ($startpos === false) return ""; // didn't find start
$endpos = strpos($str, $end, $startpos + strlen($start));
if ($endpos === false) return ""; // didn't find end
return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}
So, to wrap it all up, we’ve learned how to download a list of the tutorials on w3schools.com. This is hardly a complete project, as you could continue to loop through each “individual” link by searching for things between “<a” and “</a>”, but this is a great start! The complete code is below.
<?php
function getBetween($str, $start, $end, $searchpos=0) {
$startpos = strpos($str, $start, $searchpos);
if ($startpos === false) return ""; // didn't find start
$endpos = strpos($str, $end, $startpos + strlen($start));
if ($endpos === false) return ""; // didn't find end
return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen
($start)));
}
$pagecontents = file_get_contents("http://w3schools.com/");
$part = getBetween($pagecontents, '<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">', '<table border="0" width="100%" cellpadding="0" cellspacing="0">');
print($part);
?>
– Jacob Beasley