jacobbeasley (7)


How To Scrape Data Using Standard PHP

Alright, so you want a website that displays live stock quotes? Or maybe you want it to download and save financial reports from the SEC (Securities and Exchange Commission) database? In this tutorial, I show you how to do it in 10 minutes using PHP.

Alright, for the novices out there, I suggest that before we begin you read my article on how the internet works and the first few pages of the W3Schools HTML tutorial.

Alright, let’s begin. Scraping involves several steps:

  1. Downloading the contents of the other page
  2. Interpreting/Reading the other page
  3. (Possibly) Using information gained to go back to #1

For example, Google “spiders” through every page on the internet. How does it do this? Well, it follows the steps above and, in number 2, it searches for “links” within each page and then uses those links to loop back to number one. Simple, right?
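To make the loop concrete, here is a minimal sketch of a spider. It is only an illustration: the $pages array is a made-up, in-memory stand-in for real websites, crawl() is a hypothetical function name, and the regular expression is a shortcut for finding links rather than anything Google actually does.

```php
<?php
// Made-up "web": URL => HTML. A real spider would download each page
// (step 1) instead of reading it from an array.
$pages = [
    'a.html' => '<a href="b.html">B</a> <a href="c.html">C</a>',
    'b.html' => '<a href="a.html">A</a>',
    'c.html' => 'no links here',
];

function crawl(array $pages, string $startUrl): array {
    $queue = [$startUrl]; // pages still to visit
    $visited = [];        // pages already processed, in visit order
    while ($queue) {
        $url = array_shift($queue);
        // Skip pages we have already seen or that do not exist.
        if (in_array($url, $visited, true) || !isset($pages[$url])) {
            continue;
        }
        $visited[] = $url;
        $html = $pages[$url];                          // step 1: "download"
        preg_match_all('/href="([^"]*)"/', $html, $m); // step 2: find the links
        foreach ($m[1] as $link) {                     // step 3: back to step 1
            $queue[] = $link;
        }
    }
    return $visited;
}

$visited = crawl($pages, 'a.html');
print(implode(", ", $visited) . "\n");
```

The $visited list is what keeps the spider from looping forever when two pages link to each other.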

Alright, so let’s pick a site that we want to scrape. How about we scrape the tutorial list from w3schools? Below is part of the contents of http://www.w3schools.com/ in plain HTML code. To view this, go to http://www.w3schools.com/ and then click view->source in your browser. (The verbiage is a little different in each browser, but it’s the same basic idea.) I have chosen to show only the HTML code that interests us, which I found by searching the page source for the part we want to read.

<tr>
<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">
<h2><span>HTML</span> Tutorials</h2>
<a href="html/default.asp" target="_top">Learn HTML</a><br />
<a href="xhtml/default.asp" target="_top">Learn XHTML</a><br />
<a href="css/default.asp" target="_top">Learn CSS</a><br />
<a href="tcpip/default.asp" target="_top">Learn TCP/IP</a><br />
<br />
<h2><span>Browser</span> Scripting</h2>
<a href="js/default.asp" target="_top">Learn JavaScript</a><br />
<a href="htmldom/default.asp" target="_top">Learn HTML DOM</a><br />
<a href="dhtml/default.asp" target="_top">Learn DHTML</a><br />
<a href="vbscript/default.asp" target="_top">Learn VBScript</a><br />
<a href="ajax/default.asp" target="_top">Learn AJAX</a><br />
<a href="jquery/default.asp" target="_top">Learn jQuery</a><br />
<a href="e4x/default.asp" target="_top">Learn E4X</a><br />
<br />
<h2><span>XML</span> Tutorials</h2>
<a href="xml/default.asp" target="_top">Learn XML</a><br />
<a href="dtd/default.asp" target="_top">Learn DTD</a><br />
<a href="dom/default.asp" target="_top">Learn XML DOM</a><br />
<a href="xsl/default.asp" target="_top">Learn XSLT</a><br />
<a href="xslfo/default.asp" target="_top">Learn XSL-FO</a><br />
<a href="xpath/default.asp" target="_top">Learn XPath</a><br />
<a href="xquery/default.asp" target="_top">Learn XQuery</a><br />
<a href="xlink/default.asp" target="_top">Learn XLink</a><br />
<a href="xlink/default.asp" target="_top">Learn XPointer</a><br />
<a href="schema/default.asp" target="_top">Learn Schema</a><br />
<a href="xforms/default.asp" target="_top">Learn XForms</a><br />
<br />
<h2><span>Server</span> Scripting</h2>
<a href="sql/default.asp" target="_top">Learn SQL</a><br />
<a href="asp/default.asp" target="_top">Learn ASP</a><br />
<a href="ado/default.asp" target="_top">Learn ADO</a><br />
<a href="php/default.asp" target="_top">Learn PHP</a><br />
<a href="aspnet/default.asp" target="_top">Learn ASP.NET</a><br />
<a href="dotnetmobile/default.asp" target="_top">Learn .NET Mobile</a><br />
<br />
<h2><span>Web</span> Services</h2>
<a href="webservices/default.asp" target="_top">Learn Web Services</a><br />
<a href="wsdl/default.asp" target="_top">Learn WSDL</a><br />
<a href="soap/default.asp" target="_top">Learn SOAP</a><br />
<a href="rss/default.asp" target="_top">Learn RSS</a><br />
<a href="rdf/default.asp" target="_top">Learn RDF</a><br />
<a href="wap/default.asp" target="_top">Learn WAP</a><br />
<a href="wmlscript/default.asp" target="_top">Learn WMLScript</a><br />
<br />
<h2><span>Multimedia</span></h2>
<a href="media/default.asp" target="_top">Learn Media</a><br />
<a href="smil/default.asp" target="_top">Learn SMIL</a><br />
<a href="svg/default.asp" target="_top">Learn SVG</a><br />
<a href="flash/default.asp" target="_top">Learn Flash</a><br />
<br />
<h2><span>Web</span> Building</h2>
<a href="site/default.asp" target="_top">Web Building</a><br />
<a href="browsers/default.asp" target="_top">Web Browsers</a><br />
<a href="cert/default.asp" target="_top">Web Certification</a><br />
<a href="hosting/default.asp" target="_top">Web Hosting</a><br />
<a href="w3c/default.asp" target="_top">Web W3C</a><br />
<a href="quality/default.asp" target="_top">Web Quality</a><br />
<a href="semweb/default.asp" target="_top">Web Semantic</a><br />
<br />
</td>
<td valign="top" align="left">
<table border="0" width="100%" cellpadding="0" cellspacing="0">
<tr>

Alright, so upon basic analysis, what do we find? Well, this portion of code STARTS with '<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">' and ENDS with '<table border="0" width="100%" cellpadding="0" cellspacing="0">'. So, if we can “grab” everything between those two markers out of the whole page, we’ll be all set.

Alright, so let’s go back to our steps again.

  1. Downloading the contents of the other page
  2. Interpreting/Reading the other page

Downloading the contents of the other page in PHP is easy. We can just use the file_get_contents function like this:

$pagecontents = file_get_contents("http://www.w3schools.com/");

The above code assigns the php variable $pagecontents the contents of w3schools.com (in html, of course).
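One thing worth knowing: file_get_contents returns false when the download fails, and reading URLs with it only works when PHP's allow_url_fopen setting is on. Below is a minimal sketch of a wrapper that checks for failure. fetchPage() is a hypothetical helper name, not a built-in function, and the sketch reads a temporary local file (file_get_contents handles local paths the same way) so that both branches can be shown without touching the network.

```php
<?php
// fetchPage() is a hypothetical helper name, not a built-in PHP function.
function fetchPage($location) {
    // @ hides the warning PHP raises on failure; we check the return value instead.
    $contents = @file_get_contents($location);
    return ($contents === false) ? null : $contents;
}

// Success branch: a temporary local file stands in for a remote page.
$tmp = tempnam(sys_get_temp_dir(), "scrape");
file_put_contents($tmp, "<html>hello</html>");
$ok = fetchPage($tmp);
unlink($tmp);

// Failure branch: the file no longer exists, so the read fails.
$missing = fetchPage($tmp);

print(($ok === null ? "failed" : "got " . strlen($ok) . " bytes") . "\n");
print(($missing === null ? "failed" : "got something") . "\n");
```

With a helper like this, fetchPage("http://www.w3schools.com/") would give you either the HTML or null, and you can decide what to do when the site is down.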

Now, we need to “grab” the html code of interest. To do this, we need to write a function that can search for the “start”, the “end” and then grab what is in between. Here is a php function that does just that:

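// Returns the text found between $start and $end inside $str,
// or "" if either marker is missing. $searchpos is where to begin looking.
function getBetween($str, $start, $end, $searchpos=0) {
    $startpos = strpos($str, $start, $searchpos);
    if ($startpos === false) return ""; // didn't find start
    // look for the end marker only after the start marker
    $endpos = strpos($str, $end, $startpos + strlen($start));
    if ($endpos === false) return ""; // didn't find end
    // cut out everything between the two markers
    return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}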
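Here is a quick way to try the function on its own before pointing it at a real page. The $html string below is a made-up snippet, not the w3schools source.

```php
<?php
function getBetween($str, $start, $end, $searchpos=0) {
    $startpos = strpos($str, $start, $searchpos);
    if ($startpos === false) return ""; // didn't find start
    $endpos = strpos($str, $end, $startpos + strlen($start));
    if ($endpos === false) return ""; // didn't find end
    return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}

// Made-up test page with one block we care about.
$html = '<p>before</p><div id="menu"><a href="x">X</a></div><p>after</p>';

// Grab whatever sits between the opening and closing markers.
$menu = getBetween($html, '<div id="menu">', '</div>');
print($menu . "\n");

// A start marker that is not in the page gives back an empty string.
$none = getBetween($html, '<span>', '</span>');
print(($none === "" ? "nothing found" : $none) . "\n");
```

Note that the markers themselves are not part of the result, only what lies between them.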

So, to wrap it all up, we’ve learned how to download a list of the tutorials on w3schools.com. This is hardly a complete project, as you could continue to loop through each “individual” link by searching for things between “<a” and “</a>”, but this is a great start! The complete code is below.

<?php

function getBetween($str, $start, $end, $searchpos=0) {
    $startpos = strpos($str, $start, $searchpos);
    if ($startpos === false) return ""; // didn't find start
    $endpos = strpos($str, $end, $startpos + strlen($start));
    if ($endpos === false) return ""; // didn't find end
    return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}

$pagecontents = file_get_contents("http://www.w3schools.com/");
$part = getBetween($pagecontents, '<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">', '<table border="0" width="100%" cellpadding="0" cellspacing="0">');

print($part);

?>
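If you want to take the next step and loop through each individual link, one possible approach is sketched below. getAllBetween() is a hypothetical companion to getBetween() that collects every match instead of just the first, and the $part string here is a shortened stand-in for what the script above prints.

```php
<?php
// getAllBetween() is a hypothetical helper: like getBetween(), but it keeps
// searching after each match and returns all of them as an array.
function getAllBetween($str, $start, $end) {
    $results = [];
    $pos = 0;
    while (($startpos = strpos($str, $start, $pos)) !== false) {
        $from = $startpos + strlen($start);
        $endpos = strpos($str, $end, $from);
        if ($endpos === false) break;  // unterminated match: stop looking
        $results[] = substr($str, $from, $endpos - $from);
        $pos = $endpos + strlen($end); // continue after this match
    }
    return $results;
}

// Shortened stand-in for the scraped left column.
$part = '<a href="html/default.asp" target="_top">Learn HTML</a><br />'
      . '<a href="css/default.asp" target="_top">Learn CSS</a><br />';

$links = [];
foreach (getAllBetween($part, '<a href="', '</a>') as $chunk) {
    // $chunk looks like: html/default.asp" target="_top">Learn HTML
    $url = substr($chunk, 0, strpos($chunk, '"'));    // up to the closing quote
    $title = substr($chunk, strpos($chunk, '>') + 1); // after the end of the tag
    $links[$title] = $url;
}
print_r($links);
```

This relies on every link being written exactly as '<a href="..."', which holds for the snippet above but not for HTML in general. For messier pages a real HTML parser is the safer choice.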

– Jacob Beasley




Just a few paid template websites

With a Pre-Designed Website Template, you do not need to have as many artists and programmers. Some templates are already pre-programmed. At times, you will want to use one template that was made for one CMS with another, in which case you will need a programmer. Other times, you will want customization of a template, in which case you may need both a programmer and an artist. It varies. These templates are also useful when you want to figure out what general “image” a client is wanting to project.

KEEP IN MIND THAT SOMETIMES THEMES FOR ONE VERSION OF A CMS ARE NOT COMPATIBLE WITH ANOTHER VERSION. THIS IS VERY IMPORTANT WHEN DRAWING UP A COST-ANALYSIS.

Template Monster
http://templatemonster.com/
The premier paid themes site. They are awesome.

Free Templates
Just about every CMS has free templates. Here are some sites I turned up with a quick Google search:
http://wordpressthemesbase.com/
http://drupal.org/project/Themes
http://joomla2u.net/
http://themes.cmsmadesimple.org/
http://drupal2u.com/




Review of a number of free, specialized systems

There are a number of systems out there that don’t really fit into any category but are useful and could be used as a platform for development. Below are a few.

SugarCRM

http://www.sugarcrm.com/crm/

An opensource, PHP/MySQL CRM (Customer Relationship Management) tool. CRMs are for managing and tracking marketing campaigns, client communications, and, mainly, the entire sales process. This is sort of an “opensource” Salesforce, though Salesforce may be better. Recently, it started allowing individuals to extend SugarCRM by creating custom modules, which is VERY useful.

Benefits
Easy to use
Easy to extend
MANY features
Easy to install
Cost

Doesn’t integrate with as much software as Salesforce

Hosted version costs money, but you can get it on your own server too.

OrangeHRM

http://www.orangehrm.com/

A tool for managing human resources. It costs money for full features and doesn’t have every feature you can think of, but it does timecards, does healthcare stuff, and is relatively inexpensive. It is PHP/MySQL and could be integrated with phone systems and other tools to do INCREDIBLE stuff, such as letting people mark timecards through their cell phones.

Benefits
Opensource
Cheap/Free
Easy to figure out
Has Helpful Technical Support
Cost

Maybe doesn’t have all the features your client will need out of the box

Not sure if it allows for templating

Phreebooks

http://www.phreebooks.com/

A free, PHP/MySQL accounting system. Not perfect, but it gets the job done. QuickBooks Online is probably better, but if for some reason you felt the need to integrate an opensource accounting system with another piece of software, this might be the way to go.

Benefits
Free
A number of reports
Does what it needs to do
Cost

Not user friendly
Reporting is difficult
Not always pretty
Dropdown menus are funky

Moodle

http://moodle.org/

A tool for doing online classes. Could be used by colleges, universities, bible studies, or a host of other groups. Could also be used by businesses to train staff. Integrates with tons of tools. Huge community. Entirely PHP/MySQL.

Benefits
Large Community
Virtually Bug-Free
Many Addons
Many templates
Lots of Integration
Cost

None

Zoho

http://www.zoho.com/

Though not PHP/MySQL based, it has a number of useful tools that can be used as an alternative to Google technologies. It also has tools for converting MS Access databases DIRECTLY OVER into Zoho database systems. The advantage, of course, is that multiple users can access it at once! You might be able to develop custom solutions using “Zoho Creator”; otherwise, you can make the client love you by getting them set up and moved to Zoho.

Benefits
Large Community
Virtually Bug-Free
Easy to figure out
Relatively inexpensive
Backed up regularly!
Cost

Costs a little bit of money.

IfByPhone

http://ifbyphone.com/

Though not PHP/MySQL based, it has a number of phone tools. Namely, it has an “API” for creating phone-based applications. This could be integrated with something like PHP or ASP. I believe it uses SOAP or some proprietary XML-based protocol. It comes equipped out of the box with tools for call queues, call lists, click-to-call (the client types in their phone number and it calls you), etc. Very useful if you want to make a website/application that uses phone systems.



Reviews of Several Video Systems (Youtube Clones)

YouTube was a cultural phenomenon. The reality is that videos sell products more than anything else. A video site can be modified into an ecommerce website or it can be used as a video sharing site. Needless to say, programming this yourself (from scratch) would be horribly time consuming, so using something pre-built is a must. Below are a few of these systems that are relatively well-known and have almost every feature that you will ever want.

Clip-Share

http://www.clip-share.com/

A commercial, opensource PHP/MySQL system. It is tops: bug-free, fast, and full-featured. The only downside is that it costs money. The web hosting offered through their site comes from a company that backs up everything once a week, which is no big deal for the average mid-size company. You may want to set up some kind of daily MySQL backup that emails them a copy of the database, just in case, but you generally should be fine. Their web hosting is DIRT CHEAP in terms of space/bandwidth. Free server install if you purchase hosting from Clip-Share.

Benefits
bug free and tested
many templates
large user base
Cost

Costs around $250, depending on options and things

need to have special software on your server, of course, due to video editing needs

PHPMotion

http://www.phpmotion.com/

A script similar to Clip-Share but with fewer features. Though technically free, it costs money to remove the PHPMotion logos. Server install is not free. All in all, not too much different from Clip-Share, except that most seem to think Clip-Share is better.

Benefits
“free”
lots of features
large community
lots of templates
Cost

free unless you want to remove the phpmotion logo

need to have special software on your server, of course, due to video editing needs





Reviews of Several Social Networking Systems

Almost everybody these days has a Facebook. Wouldn’t it be nice to be able to have your own Facebook or MySpace? Well, look no further! The scripts below can do just that! A client who is setting up a networking group can use a tool like this or can build upon a tool like this. I do not suggest starting from scratch, as that is just a TON of work and then you never benefit from the work of other programmers who have “added on” to these platforms.

Elgg

http://elgg.org/

The premier, free social networking platform. Though not as full-featured as Dolphin, it has a huge community of users. Check out the site for more information. Entirely PHP/MySQL.

Benefits
free
easy to use
huge community
opensource
Cost

doesn’t have as many features as dolphin

Dolphin

http://www.boonex.com/products/dolphin/

Without a doubt the most feature-rich and professional of any social networking tool out there. It is expensive, however, and requires its own VPS to run. Sort of the “facebook” of the group.

Benefits
Most features
Most professional
opensource
Cost

Most expensive
Smaller community due to cost

Ning

http://about.ning.com/product.php

A great tool. Really feels like a “myspace” clone. Check the site to learn more.

Benefits
Tons of pre-built templates
Cost

Costs money, but cheaper…
Not particularly opensource

PHPBB

http://www.phpbb.com/
An industry standard forum system. Not a full “social networking” system, though.

Benefits
Can make custom templates
excellent forum system
very configurable
opensource
Cost

has a “box” it fits in and that’s all it does




Reviews of Several E-Commerce Systems

E-Commerce systems generally operate on a few basic principles: a person lists products for certain prices and certain shipping amounts. Users can browse and search the site and then add quantities of products to their “cart.” Upon clicking “checkout” in the cart, they are sent to a payment processor (such as PayPal or Authorize.net) where credit cards are processed. The system then logs that a payment was received/sent and voilà, a business sells products! Below is a list of a few notable cart systems that you might consider using.
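To picture the cart part of that flow, here is a tiny sketch. Everything in it is made up for illustration: the product names, the prices, the cartTotal() function, and the assumption that shipping is charged per item. Real cart systems handle far more (taxes, coupons, stock levels).

```php
<?php
// Made-up catalogue. Amounts are in cents to avoid floating-point rounding.
$products = [
    'mug'   => ['price' => 899,  'shipping' => 300],
    'shirt' => ['price' => 1500, 'shipping' => 450],
];

// The cart maps a product key to the quantity the shopper added.
$cart = ['mug' => 2, 'shirt' => 1];

// Adds up price plus per-item shipping for everything in the cart.
function cartTotal(array $products, array $cart): int {
    $total = 0;
    foreach ($cart as $key => $quantity) {
        $item = $products[$key];
        $total += ($item['price'] + $item['shipping']) * $quantity;
    }
    return $total;
}

// This is the amount that would be handed to the payment processor at checkout.
$total = cartTotal($products, $cart);
printf('Amount due at checkout: $%.2f' . "\n", $total / 100);
```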

OSCommerce
http://www.oscommerce.com/
Without a doubt one of the most powerful PHP/MySQL carts out there. Pretty old but not bad. Best of all, it is totally free! Compared to X-Cart it is much less configurable, but that is not bad at all for a startup company. Consult your programmers before you do any templating work.

Benefits
Free
Powerful
Easy to figure out
Many templates
Large community

Costs
Fewer add-ons

X-Cart
http://www.x-cart.com/
X-Cart is, arguably, the “premier” paid PHP/MySQL cart system out there. It has many features and a great community. The biggest challenge is that it costs quite a bit up front to get access to. Its biggest benefit is that there are tech support plans available, to a certain extent. Check their site for more information.

Benefits
Powerful
Templatable
Easy to figure out
Many addons
Costs
Not Free

ZenCart
http://www.zen-cart.com/
An offshoot of OSCommerce, ZenCart has many of the features of OSCommerce, though with a larger developer community.

Benefits
Free
Powerful
Easy to figure out
Many templates
Large community

Costs
More Add-ons





Best Content Management Systems

Below are a few content management systems that I researched a few months back. I hope it saves you time and money knowing which system to use! In short, a Content Management System, once installed, allows your average hillbilly to manage their own website without any knowledge of HTML, CSS, JavaScript, PHP, or anything else that would otherwise require study. As CMSs have evolved, the best ones have come to allow user-added plugins, so that if you want to add a forum or gallery, it only takes about 5 minutes with a plugin that somebody else made for you to use for FREE! WordPress is, without a doubt, the best CMS out there and if you can use it, do so. This is what I have used with my blog.

WordPress
http://wordpress.org/
WordPress is a simple blog system. It is extremely easy to set up and install. Its themes are plain PHP template files. It does have a reputation for being vulnerable to hackers, though this is most likely because there are so many people who use it and, thus, so many opportunities for hackers to find glitches and exploit them. There are also literally thousands of free templates available on the net. There are also many community-made add-ons that allow you to integrate other systems (like Twitter/Facebook) with the site. It is programmed using PHP/MySQL.

Benefits
Very easy to install
Very easy to use
Low system resources
lots of pre-made templates
Lots of pre-made plugins
PHP/MySQL
Costs
Not really any. Most of its limitations have been eliminated

Drupal
http://drupal.org/
Drupal is a powerful CMS. It has the most templates after WordPress. It occasionally has bizarre glitches when installing, but an entry-level programmer can figure them out with a little googling. It is HIGHLY configurable and has A TON of add-ons available, like WordPress. Its main difference is that it allows multiple users to use the system, unlike WordPress. WordPress is meant for mostly one-way communication whereas Drupal allows for a little more group communication. It is programmed using PHP/MySQL. It is SERIOUSLY LACKING a WYSIWYG (What You See Is What You Get) editor that would let the user edit content without knowing HTML.

Benefits
Moderately easy to install
lots of pre-made templates
lots of pre-made addons
PHP/MySQL
Costs
high system resources (slower)
No WYSIWYG pre-installed
A bit confusing to manage
addons can be confusing

WordPress Multi-User
http://mu.wordpress.org/
Wordpress is a great tool. WordPress multi-user is an even better one. It allows you to have multiple wordpress sites or one site with multiple wordpress authors (or a combination of the above). It is meant for newspapers, but can be used for many other systems. It is PHP/MySQL and, because it is used less, relatively secure.

Benefits
All the benefits of wordpress
can manage many sites
can have many users
Costs
Doesn’t wash your clothes for you or do the dishes

Joomla
http://www.joomla.org/
Joomla is another EXTREMELY powerful website system. It is functionally similar to Drupal, though the backend is much different. It has many addons, and templating can be a bit more difficult than in Drupal, though it is much cleaner to manage. It seems to have fewer pre-made templates available, but don’t let that stop you from using it. Also PHP/MySQL.

Benefits
Moderately easy to install
Moderately easy to manage
Addons (Extensions) more clear
Costs
Fewer templates available
smaller user community

CMSMadeSimple
http://www.cmsmadesimple.org/
CMSMadeSimple is the new guy on the street. It combines the strengths of Drupal with the strengths of WordPress. It is VERY easy to use compared to Joomla and Drupal and it has more features than WordPress. It is newer and, thus, has fewer addons and fewer pre-made templates than the others. It also has a smaller user base and fewer “community” features than Drupal or Joomla, but it is EASY EASY EASY. It is PHP/MySQL.

Benefits
easy to install
easy to use
multi-user administrating
Costs
Not community-oriented (yet)
fewer pre-made templates
fewer addons