jacobbeasley (7)

How To Scrape Data Using Standard PHP

So you want a website that displays live stock quotes? Or maybe you want it to download and save financial reports from the SEC (Securities and Exchange Commission) database? In this tutorial, I show you how to do it in 10 minutes using PHP.

For the novices out there, I suggest you read my article on how the internet works and the first few pages of W3Schools’ HTML tutorial before we begin.

Alright, let’s begin. Scraping involves several steps:

  1. Downloading the contents of the other page
  2. Interpreting/Reading the other page
  3. (Possibly) Using information gained to go back to #1

For example, Google “spiders” through every page it can reach on the internet. How does it do this? Well, it follows the steps above: in step 2 it searches for links within each page, and then it uses those links to loop back to step 1. Simple, right?
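The three steps above can be sketched as a small loop. This is only a rough illustration, not how Google actually does it: the `$fakeWeb` array is a made-up stand-in for real downloads so the sketch runs without a network, and `findLinks` and `crawl` are hypothetical names.

```php
<?php
// Hypothetical stand-in for the web: URL => HTML. A real spider would
// download each page instead of reading it from this array.
$fakeWeb = [
    'a.html' => '<a href="b.html">B</a> <a href="c.html">C</a>',
    'b.html' => '<a href="a.html">A</a>',
    'c.html' => 'no links here',
];

// Step 2: read a page and pull out every href value.
// Assumes double-quoted href attributes; real HTML is messier.
function findLinks(string $html): array {
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
    return $matches[1];
}

function crawl(array $web, string $startUrl): array {
    $queue = [$startUrl];
    $visited = [];
    while ($queue) {
        $url = array_shift($queue);
        // Skip pages we have already seen or that do not exist.
        if (isset($visited[$url]) || !isset($web[$url])) continue;
        $visited[$url] = true;
        $html = $web[$url];                    // step 1: "download"
        foreach (findLinks($html) as $link) {  // step 2: interpret
            $queue[] = $link;                  // step 3: go back to step 1
        }
    }
    return array_keys($visited);
}

print_r(crawl($fakeWeb, 'a.html'));
```

Keeping a list of visited pages is what stops the loop from running forever when two pages link to each other.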

Alright, so let’s pick a site that we want to scrape. How about we scrape the list of tutorials from w3schools? Below is part of the contents of http://www.w3schools.com/ in plain HTML code. To view it yourself, go to http://www.w3schools.com/ and then click View -> Source in your browser. (The wording is a little different in each browser, but it is the same basic idea.) I have chosen to show only the HTML code that interests us, which I found by searching the source for the specific HTML that we want to read.

<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">
<h2><span>HTML</span> Tutorials</h2>
<a href="html/default.asp" target="_top">Learn HTML</a><br />
<a href="xhtml/default.asp" target="_top">Learn XHTML</a><br />
<a href="css/default.asp" target="_top">Learn CSS</a><br />
<a href="tcpip/default.asp" target="_top">Learn TCP/IP</a><br />
<br />
<h2><span>Browser</span> Scripting</h2>
<a href="js/default.asp" target="_top">Learn JavaScript</a><br />
<a href="htmldom/default.asp" target="_top">Learn HTML DOM</a><br />
<a href="dhtml/default.asp" target="_top">Learn DHTML</a><br />
<a href="vbscript/default.asp" target="_top">Learn VBScript</a><br />
<a href="ajax/default.asp" target="_top">Learn AJAX</a><br />
<a href="jquery/default.asp" target="_top">Learn jQuery</a><br />
<a href="e4x/default.asp" target="_top">Learn E4X</a><br />
<br />
<h2><span>XML</span> Tutorials</h2>
<a href="xml/default.asp" target="_top">Learn XML</a><br />
<a href="dtd/default.asp" target="_top">Learn DTD</a><br />
<a href="dom/default.asp" target="_top">Learn XML DOM</a><br />
<a href="xsl/default.asp" target="_top">Learn XSLT</a><br />
<a href="xslfo/default.asp" target="_top">Learn XSL-FO</a><br />
<a href="xpath/default.asp" target="_top">Learn XPath</a><br />
<a href="xquery/default.asp" target="_top">Learn XQuery</a><br />
<a href="xlink/default.asp" target="_top">Learn XLink</a><br />
<a href="xlink/default.asp" target="_top">Learn XPointer</a><br />
<a href="schema/default.asp" target="_top">Learn Schema</a><br />
<a href="xforms/default.asp" target="_top">Learn XForms</a><br />
<br />
<h2><span>Server</span> Scripting</h2>
<a href="sql/default.asp" target="_top">Learn SQL</a><br />
<a href="asp/default.asp" target="_top">Learn ASP</a><br />
<a href="ado/default.asp" target="_top">Learn ADO</a><br />
<a href="php/default.asp" target="_top">Learn PHP</a><br />
<a href="aspnet/default.asp" target="_top">Learn ASP.NET</a><br />
<a href="dotnetmobile/default.asp" target="_top">Learn .NET Mobile</a><br />
<br />
<h2><span>Web</span> Services</h2>
<a href="webservices/default.asp" target="_top">Learn Web Services</a><br />
<a href="wsdl/default.asp" target="_top">Learn WSDL</a><br />
<a href="soap/default.asp" target="_top">Learn SOAP</a><br />
<a href="rss/default.asp" target="_top">Learn RSS</a><br />
<a href="rdf/default.asp" target="_top">Learn RDF</a><br />
<a href="wap/default.asp" target="_top">Learn WAP</a><br />
<a href="wmlscript/default.asp" target="_top">Learn WMLScript</a><br />
<br />
<a href="media/default.asp" target="_top">Learn Media</a><br />
<a href="smil/default.asp" target="_top">Learn SMIL</a><br />
<a href="svg/default.asp" target="_top">Learn SVG</a><br />
<a href="flash/default.asp" target="_top">Learn Flash</a><br />
<br />
<h2><span>Web</span> Building</h2>
<a href="site/default.asp" target="_top">Web Building</a><br />
<a href="browsers/default.asp" target="_top">Web Browsers</a><br />
<a href="cert/default.asp" target="_top">Web Certification</a><br />
<a href="hosting/default.asp" target="_top">Web Hosting</a><br />
<a href="w3c/default.asp" target="_top">Web W3C</a><br />
<a href="quality/default.asp" target="_top">Web Quality</a><br />
<a href="semweb/default.asp" target="_top">Web Semantic</a><br />
<br />
<td valign="top" align="left">
<table border="0" width="100%" cellpadding="0" cellspacing="0">

Alright, so upon basic analysis, what do we find? Well, this portion of code STARTS with <td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none"> and ENDS just before <table border="0" width="100%" cellpadding="0" cellspacing="0">. So, if we can “grab” everything between those two markers out of the whole page, then we’re all set.

Alright, so let’s go back to our steps again.

  1. Downloading the contents of the other page
  2. Interpreting/Reading the other page

Downloading the contents of the other page in PHP is easy. We can just use the file_get_contents function like this:

$pagecontents = file_get_contents("http://www.w3schools.com/");

The above code assigns the php variable $pagecontents the contents of w3schools.com (in html, of course).
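One thing worth knowing: file_get_contents returns false (and raises a warning) when the download fails. Below is a minimal sketch of a wrapper that checks for that. The name `downloadPage`, the timeout value, and the user agent string are all made up for illustration. The demo call uses a data: URL so the sketch runs without a network; swap in a real address such as http://www.w3schools.com/ when you use it.

```php
<?php
// Hypothetical helper: returns the page as a string, or null if the download failed.
function downloadPage(string $url, int $timeoutSeconds = 10): ?string {
    // These options only apply to http:// URLs.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/0.1', // made-up name for illustration
        ],
    ]);
    // @ hides the warning PHP raises on failure; we check the return value instead.
    $contents = @file_get_contents($url, false, $context);
    return $contents === false ? null : $contents;
}

// Demo with a data: URL so no network is needed.
$pagecontents = downloadPage('data://text/plain,hello');
if ($pagecontents === null) {
    echo "Could not download the page.\n";
} else {
    echo "Downloaded " . strlen($pagecontents) . " bytes.\n";
}
```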

Now, we need to “grab” the html code of interest. To do this, we need to write a function that can search for the “start”, the “end” and then grab what is in between. Here is a php function that does just that:

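function getBetween($str, $start, $end, $searchpos = 0) {
    // Find where the start marker begins, searching from $searchpos.
    $startpos = strpos($str, $start, $searchpos);
    if ($startpos === false) return ""; // didn't find start
    // Find the end marker, looking only after the start marker.
    $endpos = strpos($str, $end, $startpos + strlen($start));
    if ($endpos === false) return ""; // didn't find end
    // Return only the text between the two markers.
    return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}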
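To see what the function does, here is a small usage sketch on a made-up string instead of a real page. The function is repeated so the snippet runs on its own; the `$html` sample is an assumption for illustration only.

```php
<?php
function getBetween($str, $start, $end, $searchpos = 0) {
    $startpos = strpos($str, $start, $searchpos);
    if ($startpos === false) return ""; // didn't find start
    $endpos = strpos($str, $end, $startpos + strlen($start));
    if ($endpos === false) return ""; // didn't find end
    return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}

// Made-up sample HTML.
$html = '<ul><li>first</li><li>second</li></ul>';

// Grabs the text between the first <li> and the next </li>.
$first = getBetween($html, '<li>', '</li>');

// The optional 4th argument says where to start searching, so passing the
// position of the first </li> skips ahead to the second item.
$second = getBetween($html, '<li>', '</li>', strpos($html, '</li>'));

// A marker that is not in the string gives back an empty string.
$missing = getBetween($html, '<b>', '</b>');

var_dump($first, $second, $missing);
```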

So, to wrap it all up, we’ve learned how to download the list of tutorials on w3schools.com. This is hardly a complete project, as you could continue by looping through each individual link, searching for the text between "<a" and "</a>", but this is a great start! The complete code is below.


<?php
function getBetween($str, $start, $end, $searchpos = 0) {
    $startpos = strpos($str, $start, $searchpos);
    if ($startpos === false) return ""; // didn't find start
    $endpos = strpos($str, $end, $startpos + strlen($start));
    if ($endpos === false) return ""; // didn't find end
    return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}

$pagecontents = file_get_contents("http://www.w3schools.com/");
$part = getBetween($pagecontents, '<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">', '<table border="0" width="100%" cellpadding="0" cellspacing="0">');
echo $part; // the left-column list of tutorial links
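If you want to go one step further and loop through each individual link, something like the sketch below could work. It is only a rough idea under some assumptions: `getAllBetween` and `toAbsoluteUrl` are hypothetical helpers (not built into PHP), `$part` here is a two-line sample of the w3schools HTML rather than a live download, and the URL joining is deliberately simple (it does not handle ".." or port numbers).

```php
<?php
// Hypothetical helper: collect every substring found between $start and $end.
function getAllBetween(string $str, string $start, string $end): array {
    $results = [];
    $searchpos = 0;
    while (($startpos = strpos($str, $start, $searchpos)) !== false) {
        $contentpos = $startpos + strlen($start);
        $endpos = strpos($str, $end, $contentpos);
        if ($endpos === false) break; // unmatched start marker, stop here
        $results[] = substr($str, $contentpos, $endpos - $contentpos);
        $searchpos = $endpos + strlen($end); // continue after this match
    }
    return $results;
}

// Hypothetical helper: turn a relative link into a full URL so it can be downloaded.
function toAbsoluteUrl(string $baseUrl, string $href): string {
    if (preg_match('#^https?://#i', $href)) return $href; // already absolute
    $parts = parse_url($baseUrl);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (substr($href, 0, 1) === '/') return $root . $href; // relative to site root
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);      // folder of the base page
    return $root . $dir . $href;
}

$baseUrl = 'http://www.w3schools.com/';
$part = '<a href="html/default.asp" target="_top">Learn HTML</a><br />' . "\n"
      . '<a href="css/default.asp" target="_top">Learn CSS</a><br />';

$links = [];
foreach (getAllBetween($part, '<a ', '</a>') as $link) {
    // $link looks like: href="html/default.asp" target="_top">Learn HTML
    $href = getAllBetween($link, 'href="', '"')[0] ?? '';
    $text = substr($link, strpos($link, '>') + 1);
    $links[$text] = toAbsoluteUrl($baseUrl, $href);
}
print_r($links);
```

Each full URL in `$links` could then be fed back into file_get_contents, which is step 3 from the list at the top. For anything beyond a quick script, PHP’s built-in DOMDocument class is a sturdier way to read HTML than string searching.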



– Jacob Beasley


Just a few paid template websites

With a pre-designed website template, you do not need as many artists and programmers. Some templates are already pre-programmed. At times, you will want to use a template made for one CMS with a different CMS, in which case you will need a programmer. Other times, you will want to customize a template, in which case you may need both a programmer and an artist. It varies. These templates are also useful when you want to figure out what general “image” a client wants to project.


Template Monster
The premier paid themes site. They are awesome.

Free Templates
Just about every CMS has free templates. Here are some sites I turned up with a quick Google search.


Review of a number of free, specialized systems

There are a number of systems out there that don’t really fit into any category but are useful and could be used as a platform for development. Below are a few.


An open-source PHP/MySQL CRM (Customer Relationship Management) tool. CRMs are for managing and tracking marketing campaigns, client communications, and, mainly, the entire sales process. This is sort of an “open-source” Salesforce, though Salesforce may be better. Recently, it began allowing individuals to extend SugarCRM by creating custom modules, which is VERY useful.


Benefits:
Easy to use
Easy to extend
MANY features
Easy to install

Cost:
Doesn’t integrate with as much software as Salesforce
Hosted version costs money, but you can get it on your own server too.



A tool for managing human resources. It costs money for full features and doesn’t have every feature you can think of, but it does timecards, does healthcare stuff, and is relatively inexpensive. It is PHP/MySQL and could be integrated with phone systems and other things to do INCREDIBLE stuff, such as letting people mark timecards through their cell phones.


Benefits:
Easy to figure out
Has helpful technical support

Cost:
May not have all the features your client will need out of the box
Not sure if it allows for templating



A free PHP/MySQL accounting system. Not perfect, but it gets the job done. QuickBooks Online is probably better, but if for some reason you need to integrate an open-source accounting system with another piece of software, this might be the way to go.


Benefits:
A number of reports
Does what it needs to do

Cost:
Not user friendly
Reporting is difficult
Not always pretty
Dropdown menus are funky



A tool for running online classes. It could be used by colleges, universities, Bible studies, or a host of other groups, and also by businesses to train staff. It integrates with tons of tools, has a huge community, and is entirely PHP/MySQL.


Large Community
Virtually Bug-Free
Many Addons
Many templates
Lots of Integration



Though not PHP/MySQL based, it has a number of useful tools that can be used as an alternative to Google technologies. It also has tools for converting MS Access databases DIRECTLY into Zoho database systems, the advantage, of course, being that multiple users can access the data at once! You might be able to develop custom solutions using “Zoho Creator”; otherwise, you can make a client love you by getting them set up and moved to Zoho.


Benefits:
Large community
Virtually bug-free
Easy to figure out
Relatively inexpensive
Backed up regularly!

Cost:
Costs a little bit of money.

Though not PHP/MySQL based, it has a number of phone tools. Namely, it has an “API” for creating phone-based applications. This could be integrated with something like PHP or ASP. I believe it uses SOAP or some proprietary XML-based protocol. It comes equipped out of the box with tools for call queues, call lists, click-to-call (the client types in their phone number and it calls you), etc. Very useful if you want to make a website/application that uses phone systems.

Reviews of Several Video Systems (Youtube Clones)

YouTube was a cultural phenomenon. The reality is that videos sell products more than anything else. A video site can be modified into an e-commerce website or it can be used as a video sharing site. Needless to say, programming this yourself from scratch would be horribly time consuming, so using something pre-built is a must. Below are a few of these systems that are relatively well known and have most every feature that you will ever want.



A commercial, open-source PHP/MySQL system. It is tops: bug-free, fast, and full-featured. The only downside is that it costs money. The web hosting they offer through the site is with a company that backs up everything once a week, which is fine for the average mid-size company. You may want to set up some kind of daily MySQL backup that emails them the database, just in case, but you generally should be fine. Their web hosting is DIRT CHEAP in terms of space/bandwidth. Server install is free if you purchase hosting from clipshare.

bug free and tested
many templates
large user base

Costs around $250, depending on options and things

need to have special software on your server, of course, due to video editing needs



A script similar to clipshare but with fewer features. Though technically free, it costs money to remove the phpmotion logos. Server install is not free. All in all, it is not too different from clipshare, except that most people seem to think clipshare is better.

lots of features
large community
lots of templates

free unless you want to remove the phpmotion logo

need to have special software on your server, of course, due to video editing needs


Reviews of Several Social Networking Systems

Almost everybody these days has a Facebook. Wouldn’t it be nice to be able to have your own Facebook or MySpace? Well, look no further! The scripts below can do just that! A client who is setting up a networking group can use a tool like this or can build upon a tool like this. I do not suggest starting from scratch, as that is just a TON of work and then you never benefit from the work of other programmers who have “added on” to these platforms.



The premier free social networking platform. Though not as full-featured as dolphin, it has a huge community of users. Check out the site for more information. Entirely PHP/MySQL.

easy to use
huge community

doesn’t have as many features as dolphin



Without a doubt the most feature-rich and professional of any social networking tool out there. It is expensive, however, and requires its own VPS to run. Sort of the “facebook” of the group.

Most features
Most professional

Most expensive
Smaller community due to cost



A great tool. Really feels like a “myspace” clone. Check the site to learn more.

Tons of pre-built templates

Costs money, but cheaper…
Not particularly opensource


An industry standard forum system. Not a full “social networking” system, though.

Can make custom templates
excellent forum system
very configurable

has a “box” it fits in and that’s all it does


Reviews of Several E-Commerce Systems

E-commerce systems generally operate on a few basic principles: a seller lists products at certain prices with certain shipping amounts. Users can browse and search the site and then add quantities of products to their “cart.” Upon clicking “checkout” in the cart, they are sent to a payment processor (such as PayPal or Authorize.net) where credit cards are processed. The system then logs that a payment was received and, voilà, a business sells products! Below is a list of a few notable cart systems that you might consider using.

Without a doubt one of the best, most powerful PHP/MySQL carts out there. It is pretty old, but not bad. Best of all, it is totally free! Compared to X-Cart, it is much less configurable, but it is not bad at all for a startup company. Consult your programmers before you do any templating work.

Easy to figure out

many templates
large community

Fewer add-ons

X-Cart is, arguably, the “premier” paid PHP/MySQL cart system out there. It has many features and a great community. The biggest challenge is that it costs quite a bit up front to get access to. Its biggest benefit is that there are tech support plans available, to a certain extent. Check their site for more information.

Easy to figure out
Many addons
Not Free

An offshoot of OSCommerce, ZenCart has many of the features of OSCommerce, though it has a larger developer community.



Easy to figure out
many templates
large community

More Add-ons


Best Content Management Systems

Below are a few content management systems that I researched a few months back. I hope it saves you time and money knowing which system to use! In short, a Content Management System, once installed, allows the average person to manage their own website without knowing HTML, CSS, JavaScript, PHP, or anything else that would otherwise require study. As CMSs have evolved, the best ones have come to allow user-added plugins, so if you want to add a forum or gallery, it only takes about 5 minutes with a plugin that somebody else made for you to use for FREE! WordPress is, without a doubt, the best CMS out there, and if you can use it, do so. It is what I use for my blog.

WordPress is a simple blog system. It is extremely easy to set up and install. It uses its own PHP-based theme templates. It does have a reputation for being vulnerable to hackers, though this is most likely because so many people use it and, thus, there are so many opportunities for hackers to find glitches and exploit them. There are literally thousands of free templates available on the net. There are also many community-made add-ons that allow you to integrate other systems (like Twitter/Facebook) with the site. It is programmed using PHP/MySQL.

Benefits:
Very easy to install
Very easy to use
Low system resources
Lots of pre-made templates
Lots of pre-made plugins

Cost:
Not really any. Most of its limitations have been eliminated.

Drupal is a powerful CMS. Behind WordPress, it has the most templates. It occasionally has bizarre glitches when installing, but an entry-level programmer can figure them out with a little googling. It is HIGHLY configurable and has A TON of add-ons available, like WordPress. Its main difference is that it allows multiple users to use the system, unlike WordPress. WordPress is meant mostly for one-way communication, whereas Drupal allows for a little more group communication. It is programmed using PHP/MySQL. It is SERIOUSLY LACKING a WYSIWYG (What You See Is What You Get) editor that would let the user edit content without knowing HTML.

Benefits:
Moderately easy to install
Lots of pre-made templates
Lots of pre-made addons

Cost:
High system resources (slower)
No WYSIWYG pre-installed
A bit confusing to manage
Addons can be confusing

WordPress Multi-User
WordPress is a great tool. WordPress Multi-User is an even better one. It allows you to have multiple WordPress sites, one site with multiple WordPress authors, or a combination of the two. It is meant for newspapers, but can be used for many other kinds of sites. It is PHP/MySQL and, because it is used less, relatively secure.

Benefits:
All the benefits of WordPress
Can manage many sites
Can have many users

Cost:
Doesn’t wash your clothes for you or do the dishes

Joomla is another EXTREMELY powerful website system. It is functionally similar to Drupal, though the backend is much different. It has many addons, and templating can be a bit more difficult than in Drupal, though it is much cleaner to manage. It seems to have fewer pre-made templates available, but don’t let that stop you from using it. Also PHP/MySQL.

Benefits:
Moderately easy to install
Moderately easy to manage
Addons (Extensions) more clear

Cost:
Fewer templates available
Smaller user community

CMSMadeSimple is the new guy on the street. It combines the strengths of Drupal with the strengths of WordPress. It is VERY easy to use compared to Joomla and Drupal, and it has more features than WordPress. It is newer and, thus, has fewer addons and fewer pre-made templates than both. It also has a smaller user base and fewer “community” features than Drupal or Joomla, but it is EASY EASY EASY. It is PHP/MySQL.

Benefits:
Easy to install
Easy to use
Multi-user administrating

Cost:
Not community-oriented (yet)
Fewer pre-made templates
Fewer addons