Tutorials (4)

How To Scrape Data Using Standard PHP

Alright, so you want a website that displays live stock quotes? Or maybe you want it to download and save financial reports from the SEC (Securities and Exchange Commission) database? In this tutorial, I show you how to do it in 10 minutes using PHP.

For the novices out there, I suggest you read my article on how the internet works and the first few pages of W3Schools’ HTML tutorial before we begin.

Alright, let’s begin. Scraping involves several steps:

  1. Downloading the contents of the other page
  2. Interpreting/Reading the other page
  3. (Possibly) Using information gained to go back to #1

For example, Google “spiders” through every page on the internet. How does it do this? Well, it follows the steps above and, in number 2, it searches for “links” within each page and then uses those links to loop back to number one. Simple, right?
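To make the loop concrete, here is a minimal sketch of those three steps. It is only an illustration: the $pages array is a made-up stand-in for the internet (so the example runs without a network connection), and extractLinks and crawl are hypothetical names, not anything Google actually uses.

```php
<?php
// Hypothetical mini "internet": URL => HTML. A real spider would download
// each page instead of looking it up in an array.
$pages = [
    'a.html' => '<a href="b.html">B</a> <a href="c.html">C</a>',
    'b.html' => '<a href="a.html">A</a>',
    'c.html' => 'No links here.',
];

// Step 2: read a page and pull out every href="..." value.
function extractLinks(string $html): array {
    preg_match_all('/href="([^"]*)"/', $html, $matches);
    return $matches[1];
}

// Steps 1-3: get a page, read its links, loop back for each link found.
function crawl(array $pages, string $start): array {
    $visited = [];
    $queue = [$start];
    while ($queue) {
        $url = array_shift($queue);
        if (isset($visited[$url]) || !isset($pages[$url])) {
            continue; // already seen, or the page does not exist
        }
        $visited[$url] = true;                           // step 1: "download"
        foreach (extractLinks($pages[$url]) as $link) {  // step 2: read links
            $queue[] = $link;                            // step 3: go back to #1
        }
    }
    return array_keys($visited);
}

print_r(crawl($pages, 'a.html'));
```

The $visited list is what keeps the spider from going around in circles when two pages link to each other.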

Alright, so let’s pick a site that we want to scrape. How about we scrape the list of tutorials from w3schools? Below is part of the contents of http://www.w3schools.com/ in plain HTML code. To view it yourself, go to http://www.w3schools.com/ and then click view->source in your browser. (The wording is a little different in each browser, but it’s the same basic idea.) I have chosen to show only the HTML code that interests us, which I found by searching the source for the specific HTML that we want to read.

<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">
<h2><span>HTML</span> Tutorials</h2>
<a href="html/default.asp" target="_top">Learn HTML</a><br />
<a href="xhtml/default.asp" target="_top">Learn XHTML</a><br />
<a href="css/default.asp" target="_top">Learn CSS</a><br />
<a href="tcpip/default.asp" target="_top">Learn TCP/IP</a><br />
<br />
<h2><span>Browser</span> Scripting</h2>
<a href="js/default.asp" target="_top">Learn JavaScript</a><br />
<a href="htmldom/default.asp" target="_top">Learn HTML DOM</a><br />
<a href="dhtml/default.asp" target="_top">Learn DHTML</a><br />
<a href="vbscript/default.asp" target="_top">Learn VBScript</a><br />
<a href="ajax/default.asp" target="_top">Learn AJAX</a><br />
<a href="jquery/default.asp" target="_top">Learn jQuery</a><br />
<a href="e4x/default.asp" target="_top">Learn E4X</a><br />
<br />
<h2><span>XML</span> Tutorials</h2>
<a href="xml/default.asp" target="_top">Learn XML</a><br />
<a href="dtd/default.asp" target="_top">Learn DTD</a><br />
<a href="dom/default.asp" target="_top">Learn XML DOM</a><br />
<a href="xsl/default.asp" target="_top">Learn XSLT</a><br />
<a href="xslfo/default.asp" target="_top">Learn XSL-FO</a><br />
<a href="xpath/default.asp" target="_top">Learn XPath</a><br />
<a href="xquery/default.asp" target="_top">Learn XQuery</a><br />
<a href="xlink/default.asp" target="_top">Learn XLink</a><br />
<a href="xlink/default.asp" target="_top">Learn XPointer</a><br />
<a href="schema/default.asp" target="_top">Learn Schema</a><br />
<a href="xforms/default.asp" target="_top">Learn XForms</a><br />
<br />
<h2><span>Server</span> Scripting</h2>
<a href="sql/default.asp" target="_top">Learn SQL</a><br />
<a href="asp/default.asp" target="_top">Learn ASP</a><br />
<a href="ado/default.asp" target="_top">Learn ADO</a><br />
<a href="php/default.asp" target="_top">Learn PHP</a><br />
<a href="aspnet/default.asp" target="_top">Learn ASP.NET</a><br />
<a href="dotnetmobile/default.asp" target="_top">Learn .NET Mobile</a><br />
<br />
<h2><span>Web</span> Services</h2>
<a href="webservices/default.asp" target="_top">Learn Web Services</a><br />
<a href="wsdl/default.asp" target="_top">Learn WSDL</a><br />
<a href="soap/default.asp" target="_top">Learn SOAP</a><br />
<a href="rss/default.asp" target="_top">Learn RSS</a><br />
<a href="rdf/default.asp" target="_top">Learn RDF</a><br />
<a href="wap/default.asp" target="_top">Learn WAP</a><br />
<a href="wmlscript/default.asp" target="_top">Learn WMLScript</a><br />
<br />
<a href="media/default.asp" target="_top">Learn Media</a><br />
<a href="smil/default.asp" target="_top">Learn SMIL</a><br />
<a href="svg/default.asp" target="_top">Learn SVG</a><br />
<a href="flash/default.asp" target="_top">Learn Flash</a><br />
<br />
<h2><span>Web</span> Building</h2>
<a href="site/default.asp" target="_top">Web Building</a><br />
<a href="browsers/default.asp" target="_top">Web Browsers</a><br />
<a href="cert/default.asp" target="_top">Web Certification</a><br />
<a href="hosting/default.asp" target="_top">Web Hosting</a><br />
<a href="w3c/default.asp" target="_top">Web W3C</a><br />
<a href="quality/default.asp" target="_top">Web Quality</a><br />
<a href="semweb/default.asp" target="_top">Web Semantic</a><br />
<br />
<td valign="top" align="left">
<table border="0" width="100%" cellpadding="0" cellspacing="0">

Alright, so upon basic analysis, what do we find? Well, this portion of code STARTS with “<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">” and ENDS with “<table border="0" width="100%" cellpadding="0" cellspacing="0">”. So, if we could “grab” everything between those two markers out of the whole page, we’d be all set.

Alright, so let’s go back to our steps again.

  1. Downloading the contents of the other page
  2. Interpreting/Reading the other page

Downloading the contents of the other page in PHP is easy. We can just use the file_get_contents function like this:

$pagecontents = file_get_contents("http://www.w3schools.com/");

The above code assigns the php variable $pagecontents the contents of w3schools.com (in html, of course).
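One thing to watch for: file_get_contents returns false when the download fails, and it can only open URLs if the allow_url_fopen setting is turned on in your PHP configuration. Below is a small sketch of a wrapper that checks for failure. The name downloadPage is made up for this example.

```php
<?php
// Hypothetical helper: returns the page contents, or null if the
// download failed for any reason.
function downloadPage($url) {
    // The @ hides PHP's warning; we check the return value instead.
    $contents = @file_get_contents($url);
    if ($contents === false) {
        return null;
    }
    return $contents;
}

// Usage (not run here, since it needs a network connection):
// $pagecontents = downloadPage("http://www.w3schools.com/");
// if ($pagecontents === null) { die("Could not download the page"); }
```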

Now, we need to “grab” the html code of interest. To do this, we need to write a function that can search for the “start”, the “end” and then grab what is in between. Here is a php function that does just that:

function getBetween($str, $start, $end, $searchpos=0) {
    // find where the start marker begins, searching from $searchpos
    $startpos = strpos($str, $start, $searchpos);
    if ($startpos === false) return ""; // didn't find start
    // find the end marker, looking only after the start marker
    $endpos = strpos($str, $end, $startpos + strlen($start));
    if ($endpos === false) return ""; // didn't find end
    // return everything between the two markers
    return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}

So, to wrap it all up, we’ve learned how to download a list of the tutorials on w3schools.com. This is hardly a complete project, as you could continue to loop through each “individual” link by searching for things between “<a” and “</a>”, but this is a great start! The complete code is below.


<?php
function getBetween($str, $start, $end, $searchpos=0) {
    $startpos = strpos($str, $start, $searchpos);
    if ($startpos === false) return ""; // didn't find start
    $endpos = strpos($str, $end, $startpos + strlen($start));
    if ($endpos === false) return ""; // didn't find end
    return substr($str, $startpos + strlen($start), ($endpos - $startpos - strlen($start)));
}

$pagecontents = file_get_contents("http://w3schools.com/");
$part = getBetween($pagecontents, '<td id="leftcolumn" width="150" valign="top" align="left" style="padding:4px;border:none">', '<table border="0" width="100%" cellpadding="0" cellspacing="0">');
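The “loop through each individual link” idea could be sketched like this. It is one possible approach, not the only one: getLinks is a made-up name, and the sample string is just two lines copied from the w3schools code shown earlier so that the example runs without downloading anything.

```php
<?php
function getBetween($str, $start, $end, $searchpos = 0) {
    $startpos = strpos($str, $start, $searchpos);
    if ($startpos === false) return ""; // didn't find start
    $endpos = strpos($str, $end, $startpos + strlen($start));
    if ($endpos === false) return ""; // didn't find end
    return substr($str, $startpos + strlen($start), $endpos - $startpos - strlen($start));
}

// Hypothetical helper: walk through the HTML one "<a" at a time and use
// the $searchpos parameter so each search starts at the current link.
// Caveat: "<a" would also match tags such as <abbr>, which is fine for
// this menu but not for every page.
function getLinks($html) {
    $links = [];
    $pos = 0;
    while (($pos = strpos($html, '<a', $pos)) !== false) {
        $links[] = [
            'url'  => getBetween($html, 'href="', '"', $pos),
            'text' => getBetween($html, '>', '</a>', $pos),
        ];
        $pos += 2; // move past this "<a" so the next search finds the next link
    }
    return $links;
}

$part = '<a href="html/default.asp" target="_top">Learn HTML</a><br />'
      . '<a href="css/default.asp" target="_top">Learn CSS</a><br />';

foreach (getLinks($part) as $link) {
    echo $link['text'], ' => ', $link['url'], "\n";
}
```

Each url that comes back is relative (like html/default.asp), so to download it you would put http://www.w3schools.com/ in front of it first.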



– Jacob Beasley


Applied Theory: How Facebook Works

After having studied my articles on Relational Databases and How the Internet Works, here are the basics of how Facebook works. I will only concentrate on the users and their friends (as Facebook is a HUGE undertaking).
Below are the tables for the users and friends features. Notice how Jen & Jake are friends and Rob & Davy are friends.


Users Table

id  username  password
1   jacob     beasley
2   jennifer  hamilton
3   rob       mohr
4   davy      stiles
5   charles   beasley

Friends Table

id  userid  friendid
1   1       2
2   2       1
3   3       4
4   4       3
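
To see how the two tables work together, here is a rough sketch that looks up a user’s friends. The tables are written as plain PHP arrays so the example runs by itself; on a real site this would be a query against the database instead. The function name friendsOf is made up for illustration.

```php
<?php
// The users table from above (passwords left out), one array per row.
$users = [
    ['id' => 1, 'username' => 'jacob'],
    ['id' => 2, 'username' => 'jennifer'],
    ['id' => 3, 'username' => 'rob'],
    ['id' => 4, 'username' => 'davy'],
    ['id' => 5, 'username' => 'charles'],
];

// The friends table from above: each row links a user to one friend.
$friends = [
    ['id' => 1, 'userid' => 1, 'friendid' => 2],
    ['id' => 2, 'userid' => 2, 'friendid' => 1],
    ['id' => 3, 'userid' => 3, 'friendid' => 4],
    ['id' => 4, 'userid' => 4, 'friendid' => 3],
];

// Hypothetical helper: return the usernames of everyone $userId is friends with.
function friendsOf(int $userId, array $users, array $friends): array {
    $names = array_column($users, 'username', 'id'); // id => username
    $result = [];
    foreach ($friends as $row) {
        if ($row['userid'] === $userId) {
            $result[] = $names[$row['friendid']];
        }
    }
    return $result;
}

print_r(friendsOf(1, $users, $friends)); // jacob's friends
print_r(friendsOf(5, $users, $friends)); // charles has no rows in the friends table
```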

Login Process:

  1. You go to Facebook.com. Facebook.com shows you a login form. You type in your username/password and click login.
  2. Your browser sends a request to Facebook containing your login information.
  3. The web server receives the request, recognizes that it is for a PHP file, and starts up the PHP interpreter.
  4. The PHP interpreter reads the Facebook programmer’s code. The code realizes that you are trying to log in and queries the MySQL database to see if the username/password is valid. The “query” is sent using SQL (Structured Query Language).
  5. The MySQL database searches the table and finds the user. It then sends this back to the PHP script.
  6. The PHP script continues running where it left off. It recognizes that the data was valid. It tells the browser to save a cookie with the user’s information and outputs the rest of the page.
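
Steps 4 through 6 could be sketched roughly as follows. This is a simplified illustration with several assumptions of my own: the $users array stands in for the MySQL table, the password is stored as a hash using PHP’s password_hash/password_verify rather than as plain text, and the value handed to the browser is a random token rather than the user’s actual information. The name attemptLogin is made up.

```php
<?php
// Stand-in for the users table. In the real flow this lookup is an SQL
// query sent to MySQL; an array keeps the sketch runnable on its own.
$users = [
    'jacob' => password_hash('example-password', PASSWORD_DEFAULT),
];

// Check the submitted login. On success, return a random token that the
// page would tell the browser to save as a cookie; on failure, return null.
function attemptLogin(array $users, string $username, string $password): ?string {
    if (!isset($users[$username])) {
        return null;                      // no such user
    }
    if (!password_verify($password, $users[$username])) {
        return null;                      // wrong password
    }
    return bin2hex(random_bytes(16));     // 32 hex characters
}

$token = attemptLogin($users, 'jacob', 'example-password');
echo $token === null ? "Login failed\n" : "Logged in\n";
```

In a real page, the script would also store the token on the server (so it can be checked later) and send it to the browser with setcookie.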

Viewing another page:

  1. You click on your inbox.
  2. Your browser requests the inbox page. It sends the cookie that was set in the login process over to the web server.
  3. The web server recognizes that the request is for a PHP file and runs the code through the PHP interpreter.
  4. The PHP code sends a request to the MySQL server to see if the information stored in the cookie is valid.
  5. The MySQL server sends back a response.
  6. The PHP code sees that the cookie checks out. It then queries the MySQL table for a list of messages.
  7. The MySQL server sends back a response with the messages.
  8. The PHP code then outputs the rest of the page with the messages.
  9. The web server sends what the PHP code outputs to the browser.
  10. The browser, finally, displays the page to you, the user.
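
Here is a rough sketch of what the PHP code in steps 4 through 8 might look like. The $sessions and $messages arrays are made-up stand-ins for MySQL tables, and renderInbox is a hypothetical name. In a real script the first argument would be PHP’s built-in $_COOKIE array.

```php
<?php
// Hypothetical stand-ins for two MySQL tables.
$sessions = ['abc123' => 1];   // cookie value => user id
$messages = [
    ['to' => 1, 'subject' => 'Welcome'],
    ['to' => 2, 'subject' => 'Hello Jen'],
    ['to' => 1, 'subject' => 'Lunch?'],
];

// Check the cookie first; only if it is valid, look up the messages.
function renderInbox(array $cookies, array $sessions, array $messages): string {
    $token = $cookies['session'] ?? '';
    if (!isset($sessions[$token])) {
        return "Please log in.\n";          // cookie missing or not valid
    }
    $userId = $sessions[$token];
    $out = "Inbox\n";
    foreach ($messages as $message) {
        if ($message['to'] === $userId) {
            $out .= "- " . $message['subject'] . "\n";
        }
    }
    return $out;
}

echo renderInbox(['session' => 'abc123'], $sessions, $messages);
echo renderInbox([], $sessions, $messages);
```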

Relational Databases In Ten Minutes!

Database Crash Course
How do you store large amounts of information? How does a business store, say, 20000 users each with 4 different purchases? It does this using a database system. Below are two different types of database systems. We will be focusing on relational database systems, as flat database systems are really only hypothetical these days… most everything we do will be relational in nature.

Flat Database
A flat database is simply a list of items. For example, a list of people who have signed up for a newsletter.

Relational Database
A relational database combines flat databases (or tables) and links up entries from one flat database (table) with another. For example, say you have a flat database (or table) of newsletters and a flat database (or table) of people who have signed up for newsletters. Each entry (or row) in the list of people could be linked to several entries in the newsletters flat database (or table).


A database is a collection of tables.


A table is a bit like an Excel spreadsheet; it has rows and columns. Each row is called a row, record, or entry (the terms are used interchangeably). A column is like a “field.”


Each table contains a number of fields. Each field has a type. For example, if I have a table that is meant to store customer information, I could call it “customerinformation” and give it the fields “id”, “firstname”, “lastname”, and “phonenumber”.


Each Row is an entry in a table. For example, in the above customer example, an entry in the “customerinformation” table might have an id of 1, a firstname of jacob, a lastname of beasley, and a phone number of 612 210 7533.


Each table “should” have an Index. In other words, something unique. You might set it to autoincrement, too. For example, when you are enrolled in a school system, you are given a “school id” number. Same idea… everything is given an “index” so you can tell it apart.

Relational Database Example: Customer Orders
Let’s say that we have a bunch of orders and each order, for whatever reason, can only have one product related to it. This would mean that we could have two tables: one for orders and one for products. Below, there are three orders. Two of the orders were for donuts and one was for a crescent roll:

Products Table

id  name           price
1   donut          $3
2   crescent roll  $2

Orders Table

id  products_id  status              date
1   1            done                9/13/2009
2   2            done                9/13/2009
3   1            still need to ship  9/14/2009
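
To show what “linking” the two tables means in practice, here is a small sketch that matches each order to its product using products_id. The tables are written as PHP arrays so it runs on its own (a database would do this matching for you with a query), and describeOrders is a made-up name.

```php
<?php
// Products table, keyed by id so a product can be looked up directly.
$products = [
    1 => ['name' => 'donut', 'price' => 3],
    2 => ['name' => 'crescent roll', 'price' => 2],
];

// Orders table: products_id points at a row in the products table.
$orders = [
    ['id' => 1, 'products_id' => 1, 'status' => 'done', 'date' => '9/13/2009'],
    ['id' => 2, 'products_id' => 2, 'status' => 'done', 'date' => '9/13/2009'],
    ['id' => 3, 'products_id' => 1, 'status' => 'still need to ship', 'date' => '9/14/2009'],
];

// Hypothetical helper: follow products_id from each order to its product.
function describeOrders(array $orders, array $products): array {
    $lines = [];
    foreach ($orders as $order) {
        $product = $products[$order['products_id']];
        $lines[] = sprintf(
            'Order %d: %s ($%d), %s',
            $order['id'], $product['name'], $product['price'], $order['status']
        );
    }
    return $lines;
}

echo implode("\n", describeOrders($orders, $products)), "\n";
```

Notice that the name and price of the donut are stored only once, even though two orders refer to it.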

Relationship Types
Fundamentally, there are 4 relationship types you need to be familiar with: one-to-one, one-to-many, many-to-one, many-to-many.




One-to-One

id  name         phone         sex
1   john smith   555-555-5555  male
2   ivy, poison  555-654-6456  female

id  customer_id (make it an index along with id)  amount  date
1   1                                             $3245   9/13/2009
2   2                                             $234    9/13/2009


One-to-Many
A one-to-many relationship allows one row in one table to have many associated rows in another table. For example, a person may have received, say, 50 messages. Each user has a one-to-many relationship with the messages they have sent. They also have a one-to-many relationship with the messages they have received.


id  name         phone         sex
1   john smith   555-555-5555  male
2   ivy, poison  555-654-6456  female

id  customer_id (but not an index so multiple products can be associated)  amount  date
1   1                                                                      $3245   9/13/2009
2   2                                                                      $234    9/13/2009


Many-to-One
Just like one-to-many above, but flip left and right around.


Many-to-Many
In some cases, you may have many rows of one table associated with many rows of another table. For example, you may have 50 employees and 10 different office locations. Each employee may work out of several office locations and each office location may have many employees, thus you have a many-to-many relationship between the office locations table and the employees table. Below is a demonstration of what this might look like. Jacob works at all locations, Davy works at lakeville, and Rob works in san diego:

Employees Table

id  firstname  lastname
1   jacob      beasley
2   davy       stiles
3   rob        mohr

Employees_OfficeLocations Table

employee_id (index)  location_id (index)
1                    1
1                    2
1                    3
2                    2
3                    3

OfficeLocations Table

id  city        state
1   farmington  mn
2   lakeville   mn
3   san diego   ca
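
The middle table is what makes the many-to-many relationship work, since it can be read in either direction. Here is a short sketch using PHP arrays in place of real tables. The function names locationsFor and employeesAt are made up for this example.

```php
<?php
$employees = [1 => 'jacob', 2 => 'davy', 3 => 'rob'];               // id => firstname
$locations = [1 => 'farmington', 2 => 'lakeville', 3 => 'san diego']; // id => city

// Employees_OfficeLocations: each row is [employee_id, location_id].
$links = [[1, 1], [1, 2], [1, 3], [2, 2], [3, 3]];

// Which cities does this employee work in?
function locationsFor(int $employeeId, array $links, array $locations): array {
    $cities = [];
    foreach ($links as [$empId, $locId]) {
        if ($empId === $employeeId) {
            $cities[] = $locations[$locId];
        }
    }
    return $cities;
}

// Which employees work at this location?
function employeesAt(int $locationId, array $links, array $employees): array {
    $names = [];
    foreach ($links as [$empId, $locId]) {
        if ($locId === $locationId) {
            $names[] = $employees[$empId];
        }
    }
    return $names;
}

print_r(locationsFor(1, $links, $locations)); // every city jacob works in
print_r(employeesAt(2, $links, $employees));  // everyone who works in lakeville
```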


How The Internet Works

Technology Diagram










Above, one can see that the browser sends requests to the server. The server then either returns the requested file immediately or, if the file is PHP/ASP or another “server side” language, reads and “parses” the “code.” The PHP/ASP code may connect to a database server such as MSSQL or MySQL and send a request using SQL. The database server then sends back a response. When the PHP/ASP code is done, it sends back HTML/CSS/Javascript code (generally). The browser receives it and displays it for the user. Below is a list of key vocabulary terms.

Client-Side Languages
Client-side languages are executed in the client’s browser. This means HTML, CSS, and Javascript, generally, though Flash and Java applets are also executed on the client. Below is an explanation of each.


HTML, Hyper-Text Markup Language, was given its name because it allowed people to “link up” different “pages.” It is a “markup” language because it “marks up” things. In other words, it describes information; generally, it describes how information is organized on the computer screen. HTML is the foundation for describing how web content is to be displayed and interpreted on a webpage.

XML, Extensible Markup Language, is a “general” language that is similar to HTML but can describe any type of information (not just HTML). It can be a pain to work with, but it is MUCH easier than proprietary file formats like, say, Microsoft Office! When sending data between different programming languages (say, PHP and ASP and Javascript), XML is a great way to do this.

HTML was not “uniform.” Each browser interpreted things differently. The solution was SUPPOSED to be XHTML. By using XML and CSS to describe HTML, the theory was that all browsers could talk the same language and display things identically. The reality? It was a huge pain in the butt… Internet Explorer, the most common web browser, failed to do XHTML right. XHTML failed, but generally speaking you should try and have things done according to “XHTML” standards, but NOT AT THE EXPENSE OF COMPLETING WITHIN BUDGET.


CSS, Cascading Style Sheets, is an “add-on” to HTML. HTML does not allow you to specify “specifics” very well. For example, if I want to set the colors of the scrollbar in a web browser, there is no way to do this in HTML. CSS, however, has allowed this to be done. Additionally, I can make a change throughout the entire document with CSS, such as making all table borders disappear by default.


HTML didn’t allow things to move. The solution? Well, one of the solutions was Javascript. Javascript can “change” HTML after the page loads. For example, in a dropdown menu, Javascript can be set to run when someone moves their mouse over an element and show the menu. When they click on a menu item, it goes to another page. It can do things like verify that fields have decent information, pop up new windows, or pop up “are you sure you want to?” prompts. Ask your programmers… they’ll explain what it does. Keep in mind that Javascript is run on the client, not the server, so you STILL MUST VERIFY THAT ALL DATA ENTERED IS CLEAN using a server-side language. For example, I once had a client who had a Captcha system (one of those “type in the word from the image below” things) written in Javascript… it was useless when spammers wrote programs that would automatically spider through pages, find web forms, and post junk data. The spiders don’t execute Javascript code, thus the Captcha was ABSOLUTELY USELESS. We got a few hundred to fix that 😀
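
As an illustration of checking data on the server, here is a minimal sketch of a PHP function that re-checks a submitted form. The form fields (name and email) and the function name validateSignup are made up for this example; the point is only that the check happens in PHP, where a spam bot cannot skip it.

```php
<?php
// Hypothetical signup check. In a real script you would pass in $_POST.
// Returns a list of error messages; an empty list means the data is OK.
function validateSignup(array $post): array {
    $errors = [];

    $name = trim($post['name'] ?? '');
    if ($name === '') {
        $errors[] = 'Name is required.';
    }

    // filter_var returns false when the value is not a valid email address.
    if (filter_var($post['email'] ?? '', FILTER_VALIDATE_EMAIL) === false) {
        $errors[] = 'Email is not valid.';
    }

    return $errors;
}

print_r(validateSignup(['name' => 'Jacob', 'email' => 'jacob@example.com']));
print_r(validateSignup(['name' => '   ', 'email' => 'junk']));
```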


Java was invented as a compiled-then-interpreted language. What this means, in short, is that Java code will run on any operating system that has Java installed: Windows, Linux, or even Mac. Java, however, takes a long time to load, is EXTREMELY time consuming to program, and requires a very high level of expertise to use (compared to PHP/HTML, for example). Flash has replaced Java on websites, though Java does crop up for uploading large files; you can often find pre-made Java tools that you can have a programmer “plug in” to the site. For video sites, build on top of something that is already out there, such as clipshare, that ALREADY HAS THIS BUILT FOR YOU. Java is also used occasionally on the server end in a technology called JSP (Java Server Pages), much as PHP would be used, but it failed… not many people use it anymore.


Have you ever seen an animated intro, a game, or a video player on a website? Chances are it was done in Adobe Flash. Flash has, in the past few years, become more powerful than Java in many respects (when it relates to web pages, at least). It is highly media intensive and supports communications with web servers directly (bypassing the browser). Whenever you have a site that needs a lot of “movement”, Flash will be involved. Contact your team member who is a Flash expert for advice on this. Oftentimes, you can use a CMS but just modify the template so that a Flash item is put at the top of the page in the place of, say, a large image banner.

Web Server Programs
Web Servers are computer programs. They sit and wait for incoming requests from a client. When a client connects, the client sends a request. The request contains what domain it is for, what page it wants, and any “form” information that is needed.
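
To make “the request contains what domain it is for, what page it wants, and any form information” concrete, here is a sketch that picks those three things out of a raw HTTP request. The request text is a made-up example (www.example.com and login.php are placeholders), and parseRequest is a hypothetical name. A real web server handles many more cases than this.

```php
<?php
// Hypothetical parser: split a raw HTTP request into the parts named above.
function parseRequest(string $raw): array {
    // A blank line separates the headers from the body (the form data).
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');
    $lines = explode("\r\n", $head);

    // First line looks like: POST /login.php HTTP/1.1
    [$method, $path] = explode(' ', array_shift($lines));

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip anything that is not "Name: value"
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    parse_str($body, $form); // username=jacob&... becomes an array

    return [
        'method' => $method,
        'domain' => $headers['host'] ?? '',
        'page'   => $path,
        'form'   => $form,
    ];
}

$raw = "POST /login.php HTTP/1.1\r\n"
     . "Host: www.example.com\r\n"
     . "Content-Type: application/x-www-form-urlencoded\r\n"
     . "\r\n"
     . "username=jacob&password=secret";

print_r(parseRequest($raw));
```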

Apache Web Server
Linux web servers are very common, and Apache Web Server is the most used web server on Linux systems. It is 100% free and can handle a very heavy load of requests, depending on how powerful your server is, of course. There are other web servers, but this is the most common.

Internet Information Server
Windows web servers are the second most common type of web server. The “Internet Information Server” web server is provided with professional versions of Windows and with Windows Server editions. Generally speaking, if you want to run ASP/MSSQL, you will need a Windows web host, and they most likely will have Internet Information Server.

Server-Side Programming Languages
Most websites use server-side programming these days. For example, you log in to Facebook. What language do you think does all the processing to decide what to display on the page? What language decides how it is displayed? What language figures out that it needs to query the database server, look up your login information, and log you in? Server-side languages do this.

PHP has been around for a while. Many of the new “upstart” companies have done their work in PHP; Facebook is the best-known example. Why? Because it is totally free, easy to learn, and has a HUGE COMMUNITY OF DEVELOPERS. Our entire business model is built around this language.

Perhaps the second most common server-side language today is ASP. ASP, or Active Server Pages, is a technology that was created by Microsoft. Aside from Microsoft’s own sites, few major “upstart” companies have grown from this technology. It requires a Microsoft server license on all of your servers and pushes you toward Microsoft technologies for everything, which raises your costs across the board.

Just about any programming language has been adapted by someone for server side coding. The most notable are Perl, Ruby, and Java (JSP). I used to do it in C++ when I was a kid and had too much time on my hands (as did Ebay when they first started out, I think). Those listed here are about 98% of the server side scripting.

Database Server
Database servers store information. They allow you to get access to that information fast and to do all kinds of cool searches on it. For example, if I have a list of 100,000 companies in a plain text file and I wrote a php program to search through it, it could take 20 to 30 seconds. If I send that over to mysql running on an optimized server, it could take just about .1 seconds… obviously, much better.

MySQL is a lot like MSSQL, but totally free. PHP, out of the box, integrates with MySQL completely. Combined with PhpMyAdmin (and other tools), a programmer can add/remove tables and otherwise manage the database very easily.

MSSQL (Microsoft SQL Server) is a lot like MySQL, but made by Microsoft and rather expensive. It has a few more features than MySQL, but most projects won’t take advantage of those, so they don’t matter anyway.

Browser Comparison
Every browser seems to do things differently these days. Below is a description of the major browsers and how they compare.

Internet Explorer
Internet Explorer was created by Microsoft, therefore it sucks. It is NOT standards compliant, and each version seems to display things differently. Until recently, it had no “auto-update” feature, so there are lots of people out there still using IE6, and IE6 does not do CSS right. This means that you can’t rely on all of the CSS features and will need to develop sites with tables instead of just CSS. Your programmers will know what that means.

Firefox is a great, standards-compliant browser. Through its partnership with Google, it has become the second most popular browser out there and the #1 choice among programmers. It has a ton of add-ons and mows your lawn while you’re not busy.

They both, for the most part, are “standards compliant” meaning they display XHTML how it is supposed to be displayed. If it works in firefox, it will work in these 99% of the time.

Finally, CPanel is a tool you will most likely run into in the course of your work. On most reputable Linux servers, you are given a Cpanel account. Through this account, you can manage mysql databases, php settings, files, ftp accounts, mail accounts, etc. You can give the cpanel password to trusted programmers. When you have to terminate someone, change all ftp and cpanel passwords.

FTP, or File Transfer Protocol, is a method of sending files between a client (the programmer) and a server (the web server, generally). My personal favorite FTP client is FileZilla Client. When a programmer wants to access a server, he generally wants an FTP account or a CPanel account or both.

Cookies are simply a way to store data between page loads. For example, let’s say that you log in to a site and the site then needs to remember which user is logged in. Generally, a PHP programmer will have PHP tell the browser to save a cookie with the person’s username/password and/or session information. This way, when they click on a link to go to another page in the site, the browser will send the cookie contents to the web server and the PHP script can validate the session and know that the user is logged in.

More Reading
There are several topics not covered in this article in any way that are important to learn and become familiar with if you wish to continue your studies:

  1. DNS – How domain names work.
  2. Protocols – The underlying “methods of communication” that browsers use (namely, HTTP, HTTPS, and TCP/IP)
