Old 09-07-2006, 13:50   #1 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
regular expressions - everything after

Hi, thanks to those who have helped me so far. I am trying to write my own spider script and am in the process of extracting URLs from a page. I have managed to get the URLs with "<a href" attached at the beginning, i.e. <a href="index.htm". I then tried removing the "<a href" by putting the matching string into another regular expression preceded by $', which effectively gets 'everything after' a pattern, in my case everything after <a href. No errors are highlighted in my code, but it doesn't work. Can anyone put me straight?

Code:-

<?php

//get page contents
$page = file_get_contents("http://www.jagprops.co.uk/");

//find urls in $page, matches are put in $matches
preg_match_all ("/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"'>]+)[\"'>]/", $page, $matches);

//only first 2 urls are echoed at the moment
echo $matches[0][0];
echo $matches[0][1];

$string = $matches[0][0];

//get everything after href
preg_match("/$'href\s*=\s*/",$string,$url);


echo $url[0][1];

?>
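
A note on why the second regex finds nothing: $' is Perl's "everything after the match" variable, and PHP does not have it. Inside the PHP pattern the $ is simply the end-of-subject anchor, so a quote and "href" can never follow it. Also, $url[0] is a string, so $url[0][1] would be its second character rather than a second match. The parenthesised group in the first pattern already captures the bare URL, and preg_match_all puts those captures in $matches[1]. A minimal sketch, assuming an inline HTML string (made up here) in place of the fetched page, and adding an i flag that the original pattern does not have:

```php
<?php
// Stand-in for the fetched page (hypothetical sample markup).
$page = '<p><a href="index.htm">Home</a> <a class="nav" href=\'about.htm\'>About</a></p>';

// Same pattern as in the post, plus an i flag so <A HREF=...> would also match.
preg_match_all("/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"'>]+)[\"'>]/i", $page, $matches);

// $matches[0] holds the full matches ("<a href=..."),
// $matches[1] holds what the parenthesised group captured: the URL on its own.
$urls = $matches[1];

foreach ($urls as $url) {
    echo $url, "\n";
}
```

With this there is no need for a second regex to strip the "<a href" prefix.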

Last edited by Skidzy McFergus : 10-07-2006 at 03:51.
Old 10-07-2006, 11:30   #2 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
Have found a messy solution

<?php

//get page contents
$page = file_get_contents("http://www.jagprops.co.uk/");

//find urls in $page, matches are put in $matches
preg_match_all ("/\<[aA][^>]*href=[\"']?([^\"' >]+)/", $page, $matches);

$link = $matches[0][0];

$link = preg_replace( "/<a href=[\"']/","", $link);

echo $link;
?>
Old 10-07-2006, 16:16   #3 (permalink)
paulanthony
mingin dawg baitch
 
Join Date: Apr 2004
Location: Belfast
Posts: 1,035
I wouldn't call that a messy solution; less code is cleaner, plus regexes are (normally) memory intensive. I'm not a PHP programmer, but it looks good. Does it work for URLs like

<a href="blah.htm" onclick="javascript:blah();">blah</a>

I'm just trying to think of inputs that might break the script.
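
If odd markup keeps breaking the regex, one alternative (not what the posts above use) is to let PHP's DOM extension parse the page and read the href attributes directly. A minimal sketch, assuming the DOM extension is available and using made-up sample markup that includes the onclick case:

```php
<?php
// Hypothetical sample markup, including the onclick case from the post.
$html = <<<'HTML'
<html><body>
<a href="blah.htm" onclick="javascript:blah();">blah</a>
<a href='plain.htm'>plain</a>
<a name="top">no href</a>
</body></html>
HTML;

$doc = new DOMDocument();
// Real pages are rarely valid HTML; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    // Skip anchors without an href, such as <a name="top">.
    if ($anchor->hasAttribute('href')) {
        $links[] = $anchor->getAttribute('href');
    }
}

print_r($links);
```

The attribute value comes back without its quotes, whichever quote style the page used.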
Old 11-07-2006, 07:52   #5 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
Just tested it on this page and it outputs "blah.htm" (with the speech marks), so I will have to smooth that one over. What's more of a concern is the b****cks I get at certain sites. Check out what happens at skint.co.uk; you can see the results at this address:-

http://www.yourweb.org.uk/index2.php

code:-

<?php

//get page contents
$page = file_get_contents("http://www.skint.co.uk");


//find urls in $page, matches are put in $matches
preg_match_all ("/href=[\"']?([^\"' >]+)/", $page, $matches);

$count = count($matches[0]);

//use < rather than <= so the loop stops at the last match instead of reading one past it
for ( $counter = 0; $counter < $count; $counter++){

$link = $matches[0][$counter];

$link = preg_replace( "/href=[\"']/","", $link);

$link = preg_replace( "/href=/","", $link);
echo "$link<br/>";

}
?>

Last edited by Skidzy McFergus : 11-07-2006 at 08:24. Reason: typo
Old 11-07-2006, 08:15   #6 (permalink)
MikeMackay
Everything is fine.
 
Join Date: Feb 2005
Location: Witham & London
Posts: 772
Yeah, the script is pulling out any links to external websites. As I mentioned in a previous post, you may want to filter these out so you don't crawl them; otherwise, in theory, you could end up indexing the entire interweb thing!

- Mike
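
One way to do that filtering is to compare the host part of each link with the host of the page being crawled. A minimal sketch using parse_url; the helper name isInternalLink and the sample links are made up for illustration, and treating host-less links as internal is an assumption:

```php
<?php
// Hypothetical helper: true when $link points at the same host as $baseUrl.
// Relative links (no host part) are treated as internal.
function isInternalLink($link, $baseUrl)
{
    $linkHost = parse_url($link, PHP_URL_HOST);
    if ($linkHost === false) {
        return false;            // seriously malformed URL: do not crawl it
    }
    if ($linkHost === null) {
        return true;             // relative link such as "news.htm"
    }
    $baseHost = (string) parse_url($baseUrl, PHP_URL_HOST);
    return strcasecmp($linkHost, $baseHost) === 0;
}

$base  = 'http://www.skint.co.uk/';
$links = ['news.htm', 'http://www.skint.co.uk/artists', 'http://www.example.com/'];

// Keep only the links that stay on the same site.
$internal = array_values(array_filter($links, function ($l) use ($base) {
    return isInternalLink($l, $base);
}));

print_r($internal);
```

Links such as mailto: or javascript: have no host either, so a real spider would probably want to check the scheme as well before queueing them.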
Old 11-07-2006, 08:33   #7 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
Maybe I'm ignorant, but I thought I needed to pull out links to other sites to give the spider something to do after it's finished a site: go and do another. Maybe I've misunderstood what you mean or what I'm meant to be doing, but my impression is that I should give the spider a list of predefined sites from which it crawls other sites, indexing pages as it goes and only stopping when my database is full. I obviously don't intend to index the whole web, but I want to process a large enough subset to be able to produce a functional search engine. I think I'm probably missing something, not sure what.
Old 11-07-2006, 09:52   #8 (permalink)
MikeMackay
Everything is fine.
 
Join Date: Feb 2005
Location: Witham & London
Posts: 772
Quote:
Originally Posted by Skidzy McFergus
Maybe I'm ignorant, but I thought I needed to pull out links to other sites to give the spider something to do after it's finished a site: go and do another. Maybe I've misunderstood what you mean or what I'm meant to be doing, but my impression is that I should give the spider a list of predefined sites from which it crawls other sites, indexing pages as it goes and only stopping when my database is full. I obviously don't intend to index the whole web, but I want to process a large enough subset to be able to produce a functional search engine. I think I'm probably missing something, not sure what.
Hey no worries, perhaps I misunderstood your general purpose for the search engine.

Obviously, I understand that you want to be able to search indexed web sites but I presumed this was going to be an engine for a client or private site where you were only wanting to index "selected" sites (and not externally linked ones too) for searching at a later date.

I didn't know you were planning on becoming the next Google...in which case, carry on.

By the way, continuing to index until your database is "full" could be a bad idea. I mean, have you set a restriction on the size of your DB, or will it go until it just crashes out on you? I hope this isn't going to be run on a shared hosting server, otherwise you may find yourself on the receiving end of some angry admins.

Like I said, it all depends on your situation and how this search engine is going to be used and utilised.

- Mike
Old 11-07-2006, 09:58   #9 (permalink)
MikeMackay
Everything is fine.
 
Join Date: Feb 2005
Location: Witham & London
Posts: 772
On another note, I'm not too clued up on the technicalities of how engines such as Google actually work, but I don't think they spider every externally linked site they come across after a new submission, otherwise the spidering work would never end and it would be running continuously. I would assume they set a limit on the depth of indexing?

- Mike
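
A depth limit like that could be sketched as a breadth-first crawl that records how many links away from the start each page is, and stops following links past a chosen depth. This is only an illustration and says nothing about how Google actually does it; the $linkGraph array is a made-up stand-in for fetching real pages, and crawl is a hypothetical function name:

```php
<?php
// Hypothetical link graph standing in for real fetches: page => links found on it.
$linkGraph = [
    'a.htm' => ['b.htm', 'c.htm'],
    'b.htm' => ['d.htm'],
    'c.htm' => ['a.htm'],
    'd.htm' => ['e.htm'],
    'e.htm' => [],
];

// Breadth-first crawl that stops following links beyond $maxDepth.
function crawl(array $linkGraph, $start, $maxDepth)
{
    $visited = [$start => 0];   // page => depth at which it was first seen
    $queue   = [$start];

    while ($queue) {
        $page  = array_shift($queue);
        $depth = $visited[$page];
        if ($depth >= $maxDepth) {
            continue;           // page is recorded, but its links are not followed
        }
        $found = isset($linkGraph[$page]) ? $linkGraph[$page] : [];
        foreach ($found as $link) {
            if (!isset($visited[$link])) {   // skip pages already seen
                $visited[$link] = $depth + 1;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

$result = crawl($linkGraph, 'a.htm', 2);
print_r($result);
```

With a depth of 2, e.htm is never reached because it is three links away from a.htm, and the a.htm/c.htm loop does not cause repeat visits.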
Old 11-07-2006, 11:01   #10 (permalink)
paulanthony
mingin dawg baitch
 
Join Date: Apr 2004
Location: Belfast
Posts: 1,035
I would maintain two arrays, one for internal links and one for external. Set an upper limit on each. When a page from the internal array has been processed, pop it off and push on another. Do this until the internal array is empty (i.e. all pages have been parsed), then take the domain at the top of your external array and start repopulating the internal array for that new domain...

[internal] - processes www.domainone.com first.
pageone.htm
pagetwo.htm

[external]
www.domainone.com
www.domaintwo.com

something like that?
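
The two-array idea might look roughly like the sketch below. It is only one reading of the description above: fetchLinks is a hypothetical stand-in for fetching a page and sorting its links into internal pages and external domains, the caps are arbitrary, and starting each domain at index.htm is an assumption.

```php
<?php
$maxInternal = 3;   // assumed cap on queued pages per domain
$maxExternal = 2;   // assumed cap on queued domains

// Hypothetical stand-in for fetching a page: returns [internal pages, external domains].
function fetchLinks($domain, $page)
{
    $fake = [
        'www.domainone.com/index.htm'   => [['pageone.htm', 'pagetwo.htm'], ['www.domaintwo.com']],
        'www.domainone.com/pageone.htm' => [['pagetwo.htm'], ['www.domainthree.com', 'www.domainfour.com']],
    ];
    $key = "$domain/$page";
    return isset($fake[$key]) ? $fake[$key] : [[], []];
}

$external  = ['www.domainone.com'];            // domains waiting to be crawled
$seenHosts = ['www.domainone.com' => true];
$processed = [];

while ($external) {
    $domain    = array_shift($external);
    $internal  = ['index.htm'];                // repopulate the internal queue for this domain
    $seenPages = ['index.htm' => true];

    while ($internal) {
        $page = array_shift($internal);        // take the next page off the internal queue
        $processed[] = "$domain/$page";
        list($pages, $hosts) = fetchLinks($domain, $page);

        foreach ($pages as $p) {               // push newly found internal pages, up to the cap
            if (!isset($seenPages[$p]) && count($internal) < $maxInternal) {
                $seenPages[$p] = true;
                $internal[] = $p;
            }
        }
        foreach ($hosts as $h) {               // push newly found domains, up to the cap
            if (!isset($seenHosts[$h]) && count($external) < $maxExternal) {
                $seenHosts[$h] = true;
                $external[] = $h;
            }
        }
    }
}

print_r($processed);
```

Here www.domainfour.com is dropped because the external queue is already at its cap when it is found, which is the price of a hard upper limit.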
Old 11-07-2006, 11:52   #11 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
Cheers boys, excellent advice as always. When my company gets in the FTSE 100 I'll give you some shares, have to start a company first though :0) I obviously can't compete with any search engine as we know it and I don't intend to, but I have socialist tendencies and a simple but effective idea for addressing the search engine bias towards established pages, coined by a guy called Cho as the "rich get richer phenomenon". I'm sure you're aware, but if not check:

http://www.search-engine-book.co.uk/...nking_rich.pdf

I feel like I'm getting out of my depth, but I have spoken to several experts and they haven't told me I was delusional or mad, and they're very interested in how I get on, so I'm gonna give it a go.

All I have to do now is learn how to code properly.

When I get to the point where it turns out that I have been a deluded fool I'll give you the full details.
Hopefully all that doesn't sound too full of s**t.
Old 11-07-2006, 12:16   #12 (permalink)
MikeMackay
Everything is fine.
 
Join Date: Feb 2005
Location: Witham & London
Posts: 772
Quote:
Originally Posted by Skidzy McFergus
have spoken to several experts and they haven't told me i was delusional or mad and they're very interested in how i get
Are you implying that *we're* not professionals ?! How rude

- Mike
Old 11-07-2006, 13:36   #13 (permalink)
Skidzy McFergus
Registered User
 
Join Date: Nov 2005
Posts: 57
Do professionals hang around coding forums all day? !o) I'll rephrase myself - I have asked some OTHER experts who are friends I've known for years.
Old 11-07-2006, 19:15   #14 (permalink)
sjd
Registered Abuser
 
Join Date: Jun 2006
Location: Manchester, England.
Posts: 174
Sounds like you've already got the regexp sussed. Still, in case you don't know it...

http://www.regexlib.com/

Searches for uri and url returned some promising results. This is one of the first that showed up: http://www.regexlib.com/REDetails.aspx?regexp_id=115