The W
The W - Site Bashing - A1 and wrestlingdb...how?!
User
Post (4 total)
bradbice
Chorizo
Since: 2.1.02
From: MI
#1 Posted on
Can ANYONE tell me how they do that with the headlines? I know all about RSS feeds and stuff. But we all know that WWE.com and 1wrestling don't offer feeds.

What is the secret? Anyone?
vacheroi
Chorizo
Since: 29.5.02
#2 Posted on
They either "screen scrape" (parse the relevant pages in Perl or PHP or whatever) or do it manually.
Crip
Mettwurst
Since: 1.3.03
#3 Posted on
One thing to note: a1wrestling only ever links to the headlines page, whereas wrestlingdb links to the article itself.



StableWars.com - You'll never watch wrestling the same way again.
FriedEgg
Polska kielbasa
Since: 13.6.03
From: Washington, DC
#4 Posted on
I can answer that... since wrestlingdb is my site.

It's really no secret, and vacheroi is correct. I use PHP with Perl-compatible regular expressions (the preg_* functions) to "match" the headlines and rip them out of the page. It's not an exact science and it takes a bit of trial and error, but it works pretty well.

For example, for 1wrestling, after I've grabbed the page into a buffer, I use this code:


$buffer = preg_replace("'^.*?<td width\=\"1%\">\&nbsp\;\&nbsp\;&nbsp\;\</td\>(.*?)<td width\=\"1\%\">\&nbsp\;\&nbsp\;</td>[^>]*?>.*'s",'\1',$buffer);
$buffer = str_replace('&nbsp;','',strip_tags($buffer,'<a>'));
preg_match_all("'<a href=\'/(.*?)\'>(.*?)</a>\s?by\s(.*?)\s-.*?:\s(.*?M)'s",$buffer,$items,PREG_SET_ORDER);


The first line takes the page and strips out most of the header and footer, leaving the body of headlines.
The second line strips out all of the tags except for the links.
The third line then matches certain parts of the links and puts them into an array which I can then use to insert into a database.
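
To make the three steps concrete, here is a minimal sketch run against a made-up page fragment. The sample markup, the function name extract_headlines, the field names, and the simplified patterns are all assumptions for illustration; they are not the actual wrestlingdb code or the real 1wrestling markup.

```php
<?php
// Made-up page fragment for illustration; the real 1wrestling markup is not shown in the thread.
$page = <<<'HTML'
<html><body><table><tr>
<td width="1%">&nbsp;&nbsp;&nbsp;</td>
<td><a href='/news/101.html'>Title One</a> by Joe Smith - Posted: 10:15 AM<br>
<a href='/news/102.html'>Title Two</a> by Ann Lee - Posted: 2:40 PM<br></td>
<td width="1%">&nbsp;&nbsp;</td>
</tr></table></body></html>
HTML;

// Hypothetical helper; the patterns are simplified stand-ins for the ones quoted above.
function extract_headlines(string $buffer): array
{
    // Step 1: keep only what sits between the two spacer cells (drops header and footer).
    $buffer = (string) preg_replace(
        '~^.*?<td width="1%">(?:&nbsp;){3}</td>(.*?)<td width="1%">(?:&nbsp;){2}</td>.*$~s',
        '$1',
        $buffer
    );

    // Step 2: strip every tag except <a>, then drop any leftover &nbsp; entities.
    $buffer = str_replace('&nbsp;', '', strip_tags($buffer, '<a>'));

    // Step 3: capture the link path, headline, author, and time of each item.
    preg_match_all(
        "~<a href='/(.*?)'>(.*?)</a>\s?by\s(.*?)\s-.*?:\s(.*?M)~s",
        $buffer,
        $items,
        PREG_SET_ORDER
    );

    // Turn the positional captures into named fields, ready for a database insert.
    $rows = [];
    foreach ($items as $m) {
        $rows[] = ['path' => $m[1], 'title' => $m[2], 'author' => $m[3], 'time' => $m[4]];
    }
    return $rows;
}

$headlines = extract_headlines($page);
print_r($headlines);
```

With PREG_SET_ORDER, each element of $items holds one headline's captures together, which is what makes the loop at the end a simple one-row-per-match conversion.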

Each site is a bit different, and when sites redesign, I have to go through the whole process of determining what'll work again.
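
One way to notice a redesign early is to treat "too few matches" as an error instead of quietly storing nothing. A rough sketch, assuming a hypothetical helper called match_or_fail and an arbitrary minimum of one headline; neither comes from wrestlingdb:

```php
<?php
// Hypothetical guard: the name and the threshold are illustrative only.
function match_or_fail(string $buffer, string $pattern, int $minExpected = 1): array
{
    $count = preg_match_all($pattern, $buffer, $items, PREG_SET_ORDER);
    if ($count === false) {
        // preg_match_all() returns false when the pattern itself could not be run.
        throw new RuntimeException('Pattern could not be run: ' . $pattern);
    }
    if ($count < $minExpected) {
        // Too few matches usually means the markup no longer fits the pattern.
        throw new UnexpectedValueException(
            "Only $count headline(s) matched; the site layout may have changed."
        );
    }
    return $items;
}

$pattern   = "~<a href='/(.*?)'>(.*?)</a>~s";
$oldLayout = "<a href='/news/101.html'>Title One</a>";
$newLayout = '<a class="headline" href="/news/101.html">Title One</a>';

// The old markup still matches the pattern.
$items = match_or_fail($oldLayout, $pattern);

// The redesigned markup does not, so the guard raises instead of returning an empty list.
$redesignDetected = false;
try {
    match_or_fail($newLayout, $pattern);
} catch (UnexpectedValueException $e) {
    $redesignDetected = true;
}
```

The check does not fix the pattern, but it turns a silent gap in the headlines into something that can be logged or emailed.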

If you have any other questions, let me know.

(edited by FriedEgg on 13.6.03 1520)

The W™ message board

ZimBoard
©2001-2021 Brothers Zim
