Old 08-07-2006, 11:13   #1 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
finding links in a web page

Hi

I have been looking for a function that finds links on a web page. I have used get_meta_tags() and file_get_contents() to get data off a web page and wondered if there was something similar for getting URLs. If not, what would be a simple and logical way to get URLs off a page?
Old 08-07-2006, 11:20   #2 (permalink)
mike_m
Work faster microphone ..
 
Join Date: Feb 2003
Location: Cambridge, UK
Posts: 1,709
I reckon there's a Firefox extension somewhere that would do it, unless you're trying to do it over a large number of pages, for which you'd need some sort of crawler script, I guess.
Old 08-07-2006, 11:53   #3 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
I'm trying to get my own crawler script together; I'm missing the URL part. That's the best photo I've seen in a forum, by the way: "Do you want me to open the beer, Frank? - Naaaah! I want you to **** it!"
Old 08-07-2006, 12:13   #4 (permalink)
paulanthony
mingin dawg baitch
 
Join Date: Apr 2004
Location: Belfast
Posts: 1,035
Look into regular expressions.
www.regular-expressions.info
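
A minimal sketch of the regular-expression approach in PHP, assuming PHP 7 or later. The function name extract_links() and the sample HTML are made up for illustration; only preg_match_all(), html_entity_decode(), array_map() and array_unique() are real PHP functions.

```php
<?php
// Hypothetical helper: pull href values out of <a> tags with a regular expression.
function extract_links(string $html): array
{
    // Match <a ... href="..."> or href='...', case-insensitively.
    // Group 1 is the opening quote, group 2 is the URL, \1 is the matching close quote.
    $pattern = '/<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);

    // Decode entities such as &amp; and drop duplicate URLs.
    $urls = array_map('html_entity_decode', $matches[2]);
    return array_values(array_unique($urls));
}

// Sample input (invented for this example).
$html = '<p><a href="/about.html">About</a> '
      . '<A class="x" HREF=\'http://example.com/?a=1&amp;b=2\'>Ex</A></p>';
print_r(extract_links($html));
```

This pattern only catches quoted href attributes, so unquoted ones are missed. Regular expressions are brittle against real-world HTML in general; PHP's DOMDocument class is a sturdier option if that becomes a problem.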
Old 08-07-2006, 13:13   #5 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
Short but civil, and a good shout.
Old 08-07-2006, 14:43   #6 (permalink)
MikeMackay
Everything is fine.
 
Join Date: Feb 2005
Location: Witham & London
Posts: 772
I usually begin a project like this by sitting down and writing out, step by step, how I want the script to work in basic terms, so that its core routine is clear in my mind. Once you have the concept lodged in your memory, it's generally easier to work out how to achieve what's needed. This means considering every step and function needed to reach your script's end goal.

After a little thought, I think you would begin by needing to do something along the lines of:

Step 1) Receive a web page URL to scrape/index (perhaps via a form submission?)
Step 2) Use PHP's cURL functions to retrieve the contents (HTML code) of the web page and store them in a variable (I presume you're using PHP?)
Step 3) Trawl through that variable and use a regular expression to pull out any 'A HREF' links, storing them in an array
Step 4) Once completed, loop through the array and repeat step 2 for each new link
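
The four steps above can be sketched roughly as follows, assuming PHP 7 or later. Everything named here (crawl(), $fetch, $maxPages, the two-page in-memory "site") is hypothetical: $fetch stands in for whatever actually retrieves a page, so the sketch runs without touching the network, and it only follows absolute URLs.

```php
<?php
// Hypothetical sketch of steps 1-4. $fetch is any callable that returns a
// page's HTML as a string, or null on failure.
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $queue   = [$startUrl];   // step 1: the submitted URL
    $visited = [];            // URLs already fetched, used as a set

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;         // never fetch the same page twice
        }
        $visited[$url] = true;

        $html = $fetch($url); // step 2: retrieve the page contents
        if ($html === null) {
            continue;
        }

        // step 3: collect the href value of every <a> tag
        preg_match_all('/<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is', $html, $m);

        // step 4: queue each link so that step 2 repeats for it
        foreach ($m[2] as $link) {
            $queue[] = $link;
        }
    }
    return array_keys($visited);
}

// Usage with an in-memory "site" standing in for real HTTP requests.
$pages = [
    'http://example.com/'  => '<a href="http://example.com/a">A</a>',
    'http://example.com/a' => '<a href="http://example.com/">Home</a>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? null;
};
$crawled = crawl('http://example.com/', $fetch);
print_r($crawled);
```

The two sample pages link to each other, and the crawl still stops after visiting each one once. The $maxPages cap is an extra safety net, not something the steps above ask for.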

One thing that immediately sticks out, though, is that you would need to check each URL against the ones already crawled so that you visit each page only once; otherwise your script would end up in an endless loop.
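
One possible way to do that check, sketched under the assumption that two URLs count as "the same page" when they differ only in letter case of the scheme or host, or in the #fragment. The helper name url_key() and the sample URLs are invented; parse_url() is the real PHP function doing the work.

```php
<?php
// Hypothetical helper: reduce a URL to a canonical key so trivially different
// spellings of the same page are only crawled once.
function url_key(string $url): ?string
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return null; // not an absolute URL with a host (e.g. mailto:); skip it
    }
    $key = strtolower($p['scheme']) . '://' . strtolower($p['host']);
    if (isset($p['port'])) {
        $key .= ':' . $p['port'];
    }
    $key .= $p['path'] ?? '/';
    if (isset($p['query'])) {
        $key .= '?' . $p['query'];
    }
    return $key; // the #fragment is deliberately dropped
}

$seen = [];
$candidates = [
    'http://Example.com/page#top',
    'http://example.com/page',
    'mailto:me@example.com',
];
foreach ($candidates as $url) {
    $key = url_key($url);
    if ($key === null || isset($seen[$key])) {
        continue; // unusable, or already visited
    }
    $seen[$key] = true; // a real crawler would fetch the page here
}
print_r(array_keys($seen));
```

Only one of the three candidates survives: the first two collapse to the same key and the mailto: link is rejected.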

You may also want to build in some code that stops your crawler from leaving the submitted website and indexing a third-party site via any external links, unless you specifically want this to happen.
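
A small sketch of such a check, assuming "same website" simply means "same host name". The function name is_same_site() is hypothetical; parse_url() with PHP_URL_HOST and strcasecmp() are standard PHP.

```php
<?php
// Hypothetical helper: true when $url is on the same host as the start URL.
function is_same_site(string $url, string $startUrl): bool
{
    $host  = parse_url($url, PHP_URL_HOST);
    $start = parse_url($startUrl, PHP_URL_HOST);

    // A non-string result means the URL had no host or could not be parsed.
    return is_string($host) && is_string($start)
        && strcasecmp($host, $start) === 0;
}

var_dump(is_same_site('http://example.com/contact', 'http://example.com/')); // bool(true)
var_dump(is_same_site('http://other.example.org/', 'http://example.com/'));  // bool(false)
```

Note that this treats www.example.com and example.com as different sites; whether that is right depends on the site being crawled.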

There is probably more that you will want to happen, and you can easily elaborate on what's written here, but those are the basics that will form your script. With this initial part done, you can move on to better parsing routines or to extracting other types of data (for example, META keywords).

Hope this helps you out a little, let us know how you get on with it.

- Mike
Old 09-07-2006, 07:20   #7 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
Thanks Mike, that's a big help. It pretty much confirms what I was going to do anyway; I just never feel that confident in my own decisions and always think there must be a smarter and quicker way to do things. Thanks again.
Old 09-07-2006, 07:48   #8 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
P.S. Why would you use cURL instead of file_get_contents()?
Old 10-07-2006, 06:24   #9 (permalink)
MikeMackay
Everything is fine.
 
Join Date: Feb 2005
Location: Witham & London
Posts: 772
Quote:
Originally Posted by Skidzy McFergus
P.S. Why would you use cURL instead of file_get_contents()?
I am still learning all of PHP's built-in functions, so I wasn't too familiar with file_get_contents(). A PHP developer I previously worked with used cURL for remote server work, so it just stuck with me, I guess.

After looking at the manual, file_get_contents() seems suitable enough for the work you are doing (provided it's enabled/set up on your server).
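
For reference, a hedged sketch of fetching a page this way, assuming PHP 7.1 or later. The "enabled" part corresponds to PHP's allow_url_fopen setting, which must be on for file_get_contents() to open remote URLs. The wrapper name fetch_page() and its default timeout are invented for illustration.

```php
<?php
// Hypothetical wrapper: fetch a URL with file_get_contents() and a timeout.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // Remote URLs only work when allow_url_fopen is enabled in php.ini.
    $isRemote = preg_match('#^https?://#i', $url) === 1;
    $allowed  = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
    if ($isRemote && !$allowed) {
        return null;
    }

    // The "http" context options are used for http:// and https:// requests;
    // "timeout" is the read timeout in seconds.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // @ suppresses the warning on failure; false is turned into null instead.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}
```

A call such as fetch_page('http://example.com/') returns the page source as a string, or null if the request failed or remote access is switched off.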

- Mike
Old 11-07-2006, 19:41   #10 (permalink)
sjd
Registered Abuser
 
Join Date: Jun 2006
Location: Manchester, England.
Posts: 174
For anyone else reading this thread (because I've already posted it on the other related one).

http://www.regexlib.com/

Searches for uri and url returned some promising results. This is one of the first that showed up: http://www.regexlib.com/REDetails.aspx?regexp_id=115
Old 12-07-2006, 07:14   #11 (permalink)
xml
Senior Member
 
Join Date: Sep 2004
Posts: 149
Quote:
Originally Posted by Skidzy McFergus
P.S. Why would you use cURL instead of file_get_contents()?

cURL supports fetch timeouts, handles HTTP redirects flawlessly and lets you easily set a custom user agent (e.g. you can set it to Google Bot, Yahoo! Slurp, MSNBot, IE, Firefox etc.). The trade-off is increased CPU usage, which is hardly a trade-off in this day and age of cheap CPU cycles.
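
A rough sketch of those three features using PHP's cURL extension, assuming the extension is installed and PHP 7.1 or later. The wrapper name fetch_with_curl(), the timeout values and the 'MyCrawler/0.1' user agent string are all made up for illustration; the CURLOPT_* constants are real cURL options.

```php
<?php
// Hypothetical wrapper around the cURL features mentioned above.
function fetch_with_curl(string $url, string $userAgent = 'MyCrawler/0.1'): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => 5,     // seconds allowed to establish the connection
        CURLOPT_TIMEOUT        => 15,    // seconds allowed for the whole request
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects...
        CURLOPT_MAXREDIRS      => 5,     // ...but give up after five hops
        CURLOPT_USERAGENT      => $userAgent,
    ]);

    // curl_exec() returns false on failure (timeout, refused connection, etc.).
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

Calling fetch_with_curl('http://example.com/') returns the page source, or null on failure, so a slow or dead server costs at most the configured timeout rather than stalling the script.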

The main drawback I've found when using file_get_contents() for HTTP calls is that it can stall the application. It is great for grabbing file contents on the local system though, e.g. file_get_contents('/home/myBackups/data.txt').