How To Scrape Data Using Standard PHP

Alright, so you want a website that displays live stock quotes? Or maybe you want it to download and save financial reports off from the SEC (Security and Exchange Commission) database? In this tutorial, I show you how to do it in 10 minutes using PHP.

Alright, for the novices out there, I suggest you read my article on how the internet works before we begin and the first few pages of w3 school’s html tutorial.

Alright, lets begin. Scraping involves several steps:

  1. Downloading the contents of the other page
  2. Interpreting/Reading the other page
  3. (Possibly) Using information gained to go back to #1

For example, Google “spiders” through every page on the internet. How does it do this? Well, it follows the steps above and, in number 2, it searches for “links” within each page and then uses those links to loop back to number one. Simple, right?

Alright, so lets pick a site that we want to scrape. How about we scrape tutorials off from w3schools. Below is the contents of http://www.w3schools.com/ in plain html code. To view this, go to http://www.w3schools.com/ and then click view->source in your browser. (The verbiage is a little different in each browser, but same basic idea). I have chosen to not only show the html code that interests us by searching for the specific html code that we want to read first.

<tr>
<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">
<h2><span>HTML</span> Tutorials</h2>
<a href="html/default.asp" target="_top">Learn HTML</a><br />
<a href="xhtml/default.asp" target="_top">Learn XHTML</a><br />
<a href="css/default.asp" target="_top">Learn CSS</a><br />
<a href="tcpip/default.asp" target="_top">Learn TCP/IP</a><br />
<br />
<h2><span>Browser</span> Scripting</h2>
<a href="js/default.asp" target="_top">Learn JavaScript</a><br />
<a href="htmldom/default.asp" target="_top">Learn HTML DOM</a><br />
<a href="dhtml/default.asp" target="_top">Learn DHTML</a><br />
<a href="vbscript/default.asp" target="_top">Learn VBScript</a><br />
<a href="ajax/default.asp" target="_top">Learn AJAX</a><br />
<a href="jquery/default.asp" target="_top">Learn jQuery</a><br />
<a href="e4x/default.asp" target="_top">Learn E4X</a><br />
<br />
<h2><span>XML</span> Tutorials</h2>
<a href="xml/default.asp" target="_top">Learn XML</a><br />
<a href="dtd/default.asp" target="_top">Learn DTD</a><br />
<a href="dom/default.asp" target="_top">Learn XML DOM</a><br />
<a href="xsl/default.asp" target="_top">Learn XSLT</a><br />
<a href="xslfo/default.asp" target="_top">Learn XSL-FO</a><br />
<a href="xpath/default.asp" target="_top">Learn XPath</a><br />
<a href="xquery/default.asp" target="_top">Learn XQuery</a><br />
<a href="xlink/default.asp" target="_top">Learn XLink</a><br />
<a href="xlink/default.asp" target="_top">Learn XPointer</a><br />
<a href="schema/default.asp" target="_top">Learn Schema</a><br />
<a href="xforms/default.asp" target="_top">Learn XForms</a><br />
<br />
<h2><span>Server</span> Scripting</h2>
<a href="sql/default.asp" target="_top">Learn SQL</a><br />
<a href="asp/default.asp" target="_top">Learn ASP</a><br />
<a href="ado/default.asp" target="_top">Learn ADO</a><br />
<a href="php/default.asp" target="_top">Learn PHP</a><br />
<a href="aspnet/default.asp" target="_top">Learn ASP.NET</a><br />
<a href="dotnetmobile/default.asp" target="_top">Learn .NET Mobile</a><br />
<br />
<h2><span>Web</span> Services</h2>
<a href="webservices/default.asp" target="_top">Learn Web Services</a><br />
<a href="wsdl/default.asp" target="_top">Learn WSDL</a><br />
<a href="soap/default.asp" target="_top">Learn SOAP</a><br />
<a href="rss/default.asp" target="_top">Learn RSS</a><br />
<a href="rdf/default.asp" target="_top">Learn RDF</a><br />
<a href="wap/default.asp" target="_top">Learn WAP</a><br />
<a href="wmlscript/default.asp" target="_top">Learn WMLScript</a><br />
<br />
<h2><span>Multimedia</span></h2>
<a href="media/default.asp" target="_top">Learn Media</a><br />
<a href="smil/default.asp" target="_top">Learn SMIL</a><br />
<a href="svg/default.asp" target="_top">Learn SVG</a><br />
<a href="flash/default.asp" target="_top">Learn Flash</a><br />
<br />
<h2><span>Web</span> Building</h2>
<a href="site/default.asp" target="_top">Web Building</a><br />
<a href="browsers/default.asp" target="_top">Web Browsers</a><br />
<a href="cert/default.asp" target="_top">Web Certification</a><br />
<a href="hosting/default.asp" target="_top">Web Hosting</a><br />
<a href="w3c/default.asp" target="_top">Web W3C</a><br />
<a href="quality/default.asp" target="_top">Web Quality</a><br />
<a href="semweb/default.asp" target="_top">Web Semantic</a><br />
<br />
</td>
<td valign="top" align="left">
<table border="0" width="100%" cellpadding="0" cellspacing="0">
<tr>

Alright, so upon basic analysis, what do we find? Well, this portion of code STARTS with “<td id=”leftcolumn” width=”150″ valign=”top” align=”left” style=”padding:4px;border:none”>” and ends with “<table border=”0″ width=”100%” cellpadding=”0″ cellspacing=”0″>“. So, if we wanted to “grab” this section of html code from the whole of the site, then we’d be all set.

Alright, so lets go back to our steps again.

  1. Downloading the contents of the other page
  2. Interpreting/Reading the other page

Downloading the contents of the other page in PHP is easy. We can just use the file_get_contents function like this:

$pagecontents = file_get_contents("http://www.w3schools.com/");

The above code assigns the php variable $pagecontents the contents of w3schools.com (in html, of course).

Now, we need to “grab” the html code of interest. To do this, we need to write a function that can search for the “start”, the “end” and then grab what is in between. Here is a php function that does just that:

function getBetween($str, $start, $end, $searchpos=0) {
$startpos = strpos($str, $start, $searchpos);
if ($startpos === false) return ""; // didn't find start
$endpos = strpos($str, $end, $startpos + strlen($start));
if ($endpos === false) return ""; // didn't find end
return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}

So, to wrap it all up, we’ve learned how to download a list of the tutorials on w3schools.com. This is hardly a complete project, as you could continue to loop through each “individual” link by searching for things between “<a” and “</a>”, but this is a great start! The complete code is below.

<?php

function getBetween($str, $start, $end, $searchpos=0) {
$startpos = strpos($str, $start, $searchpos);
if ($startpos === false) return ""; // didn't find start
$endpos = strpos($str, $end, $startpos + strlen($start));
if ($endpos === false) return ""; // didn't find end
return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen
($start)));
}

$pagecontents = file_get_contents("http://w3schools.com/");
$part = getBetween($pagecontents, '<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">', '<table border="0" width="100%" cellpadding="0" cellspacing="0">');

print($part);

?>

– Jacob Beasley

  • Share/Bookmark



7 Comments

This code will return the first string between the strings that you specify as the start and end tags. Pretty cool! I haven’t been getting around to web programming lately, but this article got me a bit interested.

What if we want to find all things that fall between these tags inside of $pagecontents? This would be useful if we wanted to find all things between more generic tags like (“<b","</b>”) or (“<table","”).

One algorithm that comes to mind:
1. Find tags in $pagecontents (if any), otherwise 5.
2. Tags found: Add targeted string to an array of strings.
3. Remove the beginning of $pagecontents up until the end of the end tag.
4. Go back to 1.
5. Return the array of targeted strings.

Data scraping! 😀

Lame comment formatting removed my \</b>\ and \\ from the right side of the two tuples in my second paragraph. Hope that helps!

Gah, Jake’s going to be like “Oh wow! 3 comments on my blog!”, but NO! It’s just Micah having trouble trying to communicate “LESSTHAN FORWARDSLASH b GREATERTHAN” and “LESSTHAN FORWARDSLASH table GREATERTHAN”. GRAH FORMATTING!

Yeah, great point. I’ve had contracts writing much more complex things where you do just that… plus doing crazy stuff too (like writing a piece of software that would find CEOs and keeping track of all of their stock trading scraping off from the SEC database… freelancing is fun lol).

Anyways, there are many ways to do it. Lets say that we “did” grab the above code and wanted to get each link. Here are a few ways to do that:

1) run the explode function on <br /> elements then use getbetween on each index in the resulting array.

$pieces = explode(“<br />”, $part);
foreach ($pieces as $piece) {
$url = getBetween($piece, ‘<a href=”‘, ‘”‘);
$title = getBetween($piece, ‘>’, ‘<‘);
if ($url == “” || $title == “”) {
continue; // information is missing…
}
print(“<a href=’http://www.w3schools.com/$url’ rel=”nofollow”>$title</a><br>”);
}

2) use a regular expression (I’m not gonna do it here cause I always have to look up how to do this… can never get it right on the first few tries… but is VERY fast because regular expression code is written in C++)

o snap it messed up everything because it had html in it… lets try again…

Yeah, great point. I’ve had contracts writing much more complex things where you do just that… plus doing crazy stuff too (like writing a piece of software that would find CEOs and keeping track of all of their stock trading scraping off from the SEC database… freelancing is fun lol).

Anyways, there are many ways to do it. Lets say that we “did” grab the above code and wanted to get each link. Here are a few ways to do that:

1) run the explode function on <br /> elements then use getbetween on each index in the resulting array.

$pieces = explode(“<” . “br ” . “/>”, $part);
foreach ($pieces as $piece) {
$url = getBetween($piece, ‘<‘ . ‘a href=”‘, ‘”‘);
$title = getBetween($piece, ‘>’, ‘<‘);
if ($url == “” || $title == “”) {
continue; // information is missing…
}
print(“<” . “a href=’http://www.w3schools.com/$url’>$title<” . “/a><” . “br>”);
}

2) use a regular expression (I’m not gonna do it here cause I always have to look up how to do this… can never get it right on the first few tries… but is VERY fast because regular expression code is written in C++)

Leave a Reply

Your email address will not be published. Required fields are marked *