#1 (permalink) |
|
Registered User
Join Date: Nov 2005
Posts: 57
|
finding links in a web page
Hi, I have been looking for a function that finds the links on a web page. I have used get_meta_tags() and file_get_contents() to get data off a web page and wondered if there was something similar for getting URLs. If not, what would be a simple and logical way to get the URLs off a page?
|
|
|
|
|
#2 (permalink) |
|
Work faster microphone ..
Join Date: Feb 2003
Location: Cambridge, UK
Posts: 1,709
|
I reckon there's a Firefox extension somewhere that would do it, unless you're trying to do it over a large number of pages, in which case you'd need some sort of crawler script, I guess.
|
|
|
#4 (permalink) |
|
mingin dawg baitch
|
Look into regular expressions: www.regular-expressions.info
|
|
|
#6 (permalink) |
|
Everything is fine.
|
I usually begin a project like this by sitting down and writing out, step by step, how I want the script to work in basic terms, so that its core routine is clear in my mind. Once the concept is lodged in your memory, it's generally easier to work out how to achieve what's needed. This means considering everything that needs to happen for your script to reach its end target.

After a little thought, I think you would begin with something along these lines:

Step 1) Receive a web page URL to scrape/index (perhaps via a form submission?)
Step 2) Use PHP's cURL functions to retrieve the contents (HTML code) of the web page and store them in a variable (I presume you're using PHP?)
Step 3) Trawl through the variable and use a regular expression to pull out any 'A HREF' links, storing these in an array
Step 4) Once completed, loop through the array and repeat from step 2 for each new link

One thing that immediately sticks out is that you would need to check each URL against the ones already crawled so that you only visit each page once; otherwise your script could end up in an endless loop. You may also want to build in some code that prevents your crawler from leaving the submitted website and indexing a third-party site via any external links, unless you specifically want this to happen.

There is probably more that you will want to happen, and you can easily elaborate on what's written here, but those are the basics that will form your script. With this initial part done you can move on to better parsing routines or to extracting other types of data (for example META keywords).

Hope this helps you out a little. Let us know how you get on with it.

- Mike
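The steps above can be sketched in PHP roughly as follows. This is only a minimal illustration, not a finished crawler: the function names extract_links() and crawl(), the $maxPages limit and the $fetch callback (any function that takes a URL and returns the HTML, or false on failure) are all made up for the example. The regular expression is deliberately simple and will miss unquoted or unusual href attributes, and relative links are skipped rather than resolved.

```php
<?php
// Sketch only: extract_links(), crawl() and $fetch are hypothetical names.

/** Step 3: pull href values out of <a> tags with a regular expression. */
function extract_links(string $html): array
{
    preg_match_all('/<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
    return array_values(array_unique($matches[1]));
}

/**
 * Steps 2 and 4: fetch each page once, queue its links, stay on the start host.
 * $fetch takes a URL and returns the page HTML, or false on failure.
 */
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = [$startUrl];
    $visited = [];

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already crawled: this check prevents the endless loop
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if ($html === false) {
            continue; // fetch failed, nothing to parse
        }
        foreach (extract_links($html) as $link) {
            // Follow only absolute links on the same host; relative links
            // would need resolving against $url first.
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

Passing the fetcher in as a callback means the loop can be tried against a plain array of test pages before pointing it at a real site, and the fetching method (cURL or file_get_contents()) can be swapped without touching the crawl logic.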
|
|
|
#7 (permalink) |
|
Registered User
Join Date: Nov 2005
Posts: 57
|
Thanks Mike, that's a big help. It confirms pretty much what I was going to do anyway. I just never feel that confident in my own decisions and always think there must be a smarter and quicker way to do things. Thanks again.
|
|
|
#9 (permalink) | |
|
Everything is fine.
|
Quote:
After looking at the manual, file_get_contents() seems suitable enough for the work you are doing, provided it's enabled/set up on your server (fetching URLs with it depends on the allow_url_fopen setting). - Mike
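For reference, a minimal sketch of fetching a page with file_get_contents(). The helper name fetch_with_timeout() and the 10-second default are assumptions for illustration only. It checks allow_url_fopen before attempting a URL and passes a stream context with a timeout so a slow server doesn't hold the script up indefinitely.

```php
<?php
// Sketch only: fetch_with_timeout() is a hypothetical helper name.
function fetch_with_timeout(string $target, int $seconds = 10)
{
    // URL fetching via file_get_contents() requires allow_url_fopen.
    $isUrl = preg_match('#^https?://#i', $target) === 1;
    if ($isUrl && !ini_get('allow_url_fopen')) {
        return false;
    }

    // The 'http' context options also apply to https:// URLs.
    $context = stream_context_create([
        'http' => ['timeout' => $seconds],
    ]);

    // @ hides the warning on failure; the call then just returns false.
    return @file_get_contents($target, false, $context);
}
```

The return value is the page contents as a string, or false if the fetch failed or URL access is disabled.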
|
|
|
|
#10 (permalink) |
|
Registered Abuser
Join Date: Jun 2006
Location: Manchester, England.
Posts: 174
|
For anyone else reading this thread (because I've already posted it on the other related one): http://www.regexlib.com/ Searches for "uri" and "url" returned some promising results. This is one of the first that showed up: http://www.regexlib.com/REDetails.aspx?regexp_id=115
|
|
|
#11 (permalink) | |
|
Senior Member
Join Date: Sep 2004
Posts: 149
|
Quote:
cURL supports fetch timeouts, follows HTTP redirects (when CURLOPT_FOLLOWLOCATION is set) and lets you set a custom user agent very easily (e.g. you can set it as Googlebot, Yahoo! Slurp, MSNBot, IE, Firefox etc.). The trade-off is increased CPU usage, which is hardly a trade-off in this day and age of cheap CPU cycles. The main drawback I've found when using file_get_contents() for HTTP calls is that it can stall the application while waiting on a slow remote server. It is great for grabbing file contents on the local system though, e.g. file_get_contents('/home/myBackups/data.txt').
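A rough sketch of the cURL options mentioned above, assuming the curl extension is installed. The function name fetch_page(), the timeout values, the redirect limit and the 'MyCrawler/0.1' user-agent string are placeholders for illustration, not recommendations.

```php
<?php
// Sketch only: fetch_page() and the default values are hypothetical.
function fetch_page(string $url, int $timeout = 10, string $userAgent = 'MyCrawler/0.1')
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,          // but give up after a few hops
        CURLOPT_CONNECTTIMEOUT => 5,          // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout,   // seconds allowed for the whole request
        CURLOPT_USERAGENT      => $userAgent, // custom user agent string
    ]);

    // Returns the page body as a string, or false on failure or timeout.
    return curl_exec($ch);
}
```

With the timeouts set, a slow or unresponsive server makes the call return false after a bounded wait rather than holding the script up.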
|
|