#1 (permalink) |
|
Registered User
Join Date: Nov 2005
Posts: 57
|
regular expressions - everything after
Hi, thanks to those guys who have helped me so far. I am trying to write my own spider script and am in the process of extracting URLs from a page. I have managed to get the URLs with "<a href" attached at the beginning, i.e. <a href="index.htm". I then tried removing the "<a href" by feeding the matched string into another regular expression preceded by $', which effectively gets 'everything after' a pattern, in my case everything after <a href. No errors are highlighted in my code but it doesn't work. Can anyone put me straight?

Code:-
<?php
//get page contents
$page = file_get_contents("http://www.jagprops.co.uk/");
//find urls in $page, matches are put in $matches
preg_match_all ("/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"'>]+)[\"'>]/", $page, $matches);
//only first 2 urls are echoed at the moment
echo $matches[0][0];
echo $matches[0][1];
$string = $matches[0][0];
//get everything after href
preg_match("/$'href\s*=\s*/",$string,$url);
echo $url[0][1];
?>

Last edited by Skidzy McFergus : 10-07-2006 at 03:51.
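A note on why the second preg_match finds nothing: $' is Perl's "everything after the match" variable, and PHP's preg functions have no equivalent, so inside a PHP pattern the $ is read as an end-of-string anchor followed by a literal quote. The pattern in preg_match_all already has a capture group around the URL, though, so the bare URL should be sitting in $matches[1]. Below is a minimal sketch of that idea, not a tested spider. The sample markup in $page is made up for illustration, and the pattern is a slightly loosened variant of the one above (the trailing quote is no longer required).

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$page = '<a href="index.htm">Home</a> <a class="nav" href=\'about.htm\'>About</a>';

// $matches[0] holds each whole match (including "<a href=");
// $matches[1] holds only what the parenthesised group captured, i.e. the URL.
preg_match_all('/<\s*a\s+[^>]*href\s*=\s*["\']?([^"\'>\s]+)/i', $page, $matches);

$urls = $matches[1];
print_r($urls);
```

With the capture group doing the work there is no need for a second regular expression to strip the prefix.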
|
|
|
|
|
#2 (permalink) |
|
Registered User
Join Date: Nov 2005
Posts: 57
|
Have found a messy solution:

<?php
//get page contents
$page = file_get_contents("http://www.jagprops.co.uk/");
//find urls in $page, matches are put in $matches
preg_match_all ("/\<[aA][^>]*href=[\"']?([^\"' >]+)/", $page, $matches);
$link = $matches[0][0];
$link = preg_replace( "/<a href=[\"']/","", $link);
echo $link;
?>
|
|
|
#3 (permalink) |
|
mingin dawg baitch
|
I wouldn't call that a messy solution; less code is cleaner, plus regexes are memory intensive (normally). I'm not a PHP programmer but it looks good. Does it work for URLs like <a href="blah.htm" onclick="javascript:blah();">blah</a>? I'm just trying to think of alternative options to break the script.
|
|
|
#5 (permalink) |
|
Registered User
Join Date: Nov 2005
Posts: 57
|
Just tested it on this page and it outputs "blah.htm" (with the speech marks) so will have to smooth that one over. What's more of a concern is the b****cks I get at certain sites. Check out what happens at skint.co.uk; you can see the results at this address:- http://www.yourweb.org.uk/index2.php

code:-
<?php
//get page contents
$page = file_get_contents("http://www.skint.co.uk");
//find urls in $page, matches are put in $matches
preg_match_all ("/href=[\"']?([^\"' >]+)/", $page, $matches);
$count = count($matches[0]);
for ( $counter = 0; $counter <= $count; $counter++){
$link = $matches[0][$counter];
$link = preg_replace( "/href=[\"']/","", $link);
$link = preg_replace( "/href=/","", $link);
echo "$link<br/>";
}
?>

Last edited by Skidzy McFergus : 11-07-2006 at 08:24. Reason: typo
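Two things stand out in the loop above: the condition $counter <= $count runs one index past the end of the array, and the two preg_replace calls only undo what the capture group has already separated out. A rough sketch of a tidier loop follows, assuming the same href pattern. The markup in $page is invented for the example, and htmlspecialchars is added only so the links display as text; this does not claim to explain the odd output from skint.co.uk.

```php
<?php
// Hypothetical markup; in the thread this comes from file_get_contents().
$page = '<a href="one.htm">1</a> <a href=two.htm>2</a> <link href=\'style.css\'>';

// Group 1 captures the URL without the href= prefix or the opening quote.
preg_match_all('/href\s*=\s*["\']?([^"\' >]+)/i', $page, $matches);

// foreach visits exactly the captured URLs, so there is no index to overrun.
$links = [];
foreach ($matches[1] as $link) {
    $links[] = $link;
    echo htmlspecialchars($link), "<br/>\n";
}
```

Note that this pattern picks up every href on the page, including <link> tags such as stylesheets, which may or may not be what a spider wants.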
|
|
|
#7 (permalink) |
|
Registered User
Join Date: Nov 2005
Posts: 57
|
Maybe I'm ignorant, but I thought I needed to pull out links to other sites to give the spider something to do after it's finished a site: go and do another. Maybe I've misunderstood what you mean or what I'm meant to be doing, but my impression is that I should give the spider a list of predefined sites, from which it crawls on to other sites, indexing pages as it goes and only stopping when my database is full. I obviously don't intend to index the whole web, but I want to process a large enough subset to be able to produce a functional search engine. I think I'm probably missing something, not sure what.
|
|
|
#8 (permalink) | |
|
Everything is fine.
|
Quote:
Obviously, I understand that you want to be able to search indexed web sites, but I presumed this was going to be an engine for a client or private site where you only wanted to index "selected" sites (and not externally linked ones too) for searching at a later date. I didn't know you were planning on becoming the next Google... in which case, carry on. By the way, continuing to index until your database is "full" could be a bad idea. I mean, have you set a restriction on the size of your DB or will it go until it just crashes out on you? I hope this isn't going to be run on a shared hosting server, otherwise you may find yourself on the receiving end of some angry admins. Like I said, it all depends on your situation and how this search engine is going to be used. - Mike
|
|
|
|
#9 (permalink) |
|
Everything is fine.
|
On another note, I'm not too clued up on the technicalities of how engines such as Google actually work, but I don't think they spider each externally linked site they come across after a new submission, otherwise the spidering work would never end and it would be running continuously. I would assume they set a limit on the depth of indexing? - Mike
|
|
|
#10 (permalink) |
|
mingin dawg baitch
|
I would maintain two arrays, one for internal, one for external. Set an upper limit on each; when a page from the internal array has been processed, pop it off and push on another. Do this until the internal array finishes (i.e. all pages have been parsed), then start at the top of your external array and start repopulating the internal array for that new domain...

[internal] - processes www.domainone.com first.
pageone.htm
pagetwo.htm

[external]
www.domainone.com
www.domaintwo.com

Something like that?
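The two-array idea could be sketched roughly as below. This is an illustration under several assumptions rather than a working spider: the $web array is a made-up link graph standing in for real page fetches, crawl() and $maxPages are hypothetical names, and a single overall page limit is used in place of a separate upper limit on each array.

```php
<?php
// Hypothetical link graph standing in for real fetches: page => links found on it.
$web = [
    'http://www.domainone.com/' => ['http://www.domainone.com/pageone.htm', 'http://www.domaintwo.com/'],
    'http://www.domainone.com/pageone.htm' => ['http://www.domainone.com/', 'http://www.domainone.com/pagetwo.htm'],
    'http://www.domainone.com/pagetwo.htm' => [],
    'http://www.domaintwo.com/' => ['http://www.domainone.com/'],
];

function crawl(array $web, string $seed, int $maxPages): array
{
    $external = [$seed];      // entry pages of domains still waiting their turn
    $seen = [$seed => true];  // every URL queued so far, so nothing is queued twice
    $visited = [];            // pages in the order they were processed

    while ($external && count($visited) < $maxPages) {
        // Start a new domain: its entry page seeds the internal array.
        $internal = [array_shift($external)];
        $host = parse_url($internal[0], PHP_URL_HOST);

        while ($internal && count($visited) < $maxPages) {
            $url = array_shift($internal);
            $visited[] = $url;
            foreach ($web[$url] ?? [] as $link) {
                if (isset($seen[$link])) {
                    continue;
                }
                $seen[$link] = true;
                // Same host goes on the internal array, anything else on the external one.
                if (parse_url($link, PHP_URL_HOST) === $host) {
                    $internal[] = $link;
                } else {
                    $external[] = $link;
                }
            }
        }
    }
    return $visited;
}

$order = crawl($web, 'http://www.domainone.com/', 10);
print_r($order);
```

The $maxPages cap is what stops the crawl from running forever, which is the concern Mike raised; a per-domain limit or a depth limit could be added in the same place.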
|
|
|
#11 (permalink) |
|
Registered User
Join Date: Nov 2005
Posts: 57
|
Cheers boys, excellent advice as always. When my company gets in the FTSE 100 I'll give you some shares, have to start a company first though :0) I obviously can't compete with any search engine as we know it and I don't intend to, but I have socialist tendencies and a simple but effective idea for addressing the search engine bias towards established pages, coined by a guy called Cho as the "rich get richer phenomenon". I'm sure you're aware, but if not check: - http://www.search-engine-book.co.uk/...nking_rich.pdf I feel like I'm getting out of my depth, but I have spoken to several experts and they haven't told me I was delusional or mad, and they're very interested in how I get on, so I'm gonna give it a go. All I have to do now is learn how to code properly. When I get to the point where it turns out that I have been a deluded fool I'll give you the full details. Hopefully all that doesn't sound too full of s**t.
|
|
|
#14 (permalink) |
|
Registered Abuser
Join Date: Jun 2006
Location: Manchester, England.
Posts: 174
|
Sounds like you've already got the regexp sussed. Still, in case you don't know it... http://www.regexlib.com/ Searches for uri and url returned some promising results. This is one of the first that showed up: http://www.regexlib.com/REDetails.aspx?regexp_id=115 |