Crip
| #3 Posted on 12.6.03 0526.39 Reposted on: 12.6.10 0529.01 |
Just something to note: a1wrestling only ever links to the headlines page, whereas wrestlingdb links to the article itself.
FriedEgg
| #4 Posted on 13.6.03 1420.12 Reposted on: 13.6.10 1427.34 |
I can answer that... since wrestlingdb is my site.
It's really no secret, and vacheroi is correct. I use PHP with Perl-compatible regular expressions (the preg_* functions) to "match" the headlines and rip them out of the page. It's not an exact science and it takes a little trial and error, but it works pretty well.
For example, for 1wrestling, after I've grabbed the page into a buffer, I use this code:
$buffer = preg_replace("'^.*?<td width\=\"1%\">\&nbsp\;\&nbsp\;&nbsp\;\</td\>(.*?)<td width\=\"1\%\">\&nbsp\;\&nbsp\;</td>[^>]*?>.*'s",'\1',$buffer);
$buffer = str_replace('&nbsp;','',strip_tags($buffer,'<a>'));
preg_match_all("'<a href=\'/(.*?)\'>(.*?)</a>\s?by\s(.*?)\s-.*?:\s(.*?M)'s",$buffer,$items,PREG_SET_ORDER);
The first line takes the page and strips out most of the header and footer markup, leaving just the body of headlines. The second line strips out all of the tags except for the links. The third line then matches the parts of each link I care about (the URL path, the headline text, the author, and the time) and puts them into an array, which I can then insert into a database.
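To make those three steps easier to follow, here is a minimal, self-contained sketch of the same idea run against a made-up page. Everything specific in it is an assumption for illustration: the sample markup, the "headlines start/end" marker comments, the link paths, authors, and times, and the function name extract_headlines. It is not 1wrestling's real markup, and it assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical sample page. The markup, paths, authors and times are invented.
$page = <<<'HTML'
<html><body><div id="nav">Home | About</div>
<!-- headlines start -->
<p><a href='/news/1234.html'><b>Sample headline one</b></a>&nbsp;by Jane Doe - Posted: 10:15 AM</p>
<p><a href='/news/1235.html'>Sample headline two</a> by John Roe - Posted: 11:40 PM</p>
<!-- headlines end -->
<div id="footer">Copyright</div></body></html>
HTML;

// Hypothetical helper: returns one array per headline, in PREG_SET_ORDER layout
// (index 0 = whole match, 1 = path, 2 = title, 3 = author, 4 = time).
function extract_headlines(string $buffer): array
{
    // Step 1: keep only the region between two known markers.
    // If the markers are missing, preg_replace leaves the buffer unchanged.
    $buffer = preg_replace(
        '~^.*?<!-- headlines start -->(.*?)<!-- headlines end -->.*$~s',
        '$1',
        $buffer
    );

    // Step 2: strip every tag except <a>; turn &nbsp; entities into plain spaces
    // (the original removes them outright; a space keeps the "by" match simple here).
    $buffer = str_replace('&nbsp;', ' ', strip_tags($buffer, '<a>'));

    // Step 3: capture path, title, author and time for each link.
    preg_match_all(
        "~<a href='/(.*?)'>(.*?)</a>\s?by\s(.*?)\s-.*?:\s(.*?M)~s",
        $buffer,
        $items,
        PREG_SET_ORDER
    );

    return $items;
}

$items = extract_headlines($page);

foreach ($items as $m) {
    // Each row is ready to be handed to whatever storage layer you use.
    printf("%s | %s | %s | %s\n", $m[1], $m[2], $m[3], $m[4]);
}
```

With PREG_SET_ORDER, each element of $items is one complete headline, which is convenient when every match becomes one database row. The pattern in step 3 is only as good as the sample it was written against, so expect to adjust it for each real page.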
Each site is a bit different, and when a site redesigns, I have to go through the whole process again to work out which patterns still match.
If you have any other questions, let me know.
(edited by FriedEgg on 13.6.03 1520)