Unread 22 Jan 2003, 23:30   #1
King Elessar
Guest
 
Posts: n/a
PERL Question

Ok.... Here's what I want to do...

I think this may have to do with matching (=~), but I'm not sure. I want to be able to extract text that sits between certain HTML tags in a file. E.g., I want to get the titles for a group of files, so I'll want to find the <title> tag and get "This is the title" from <title>This is the title</title> without getting the whole document.

Any ideas?

Thanks,
Nik
Unread 22 Jan 2003, 23:59   #2
Nodrog
Registered User
 
Join Date: Jun 2000
Posts: 8,476
You could use an XML parser. Alternatively, something like this...

Code:
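#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl parser.pl FILE TAG
my ($file, $tag) = @ARGV;
my $open  = "<" . quotemeta($tag);
my $close = "</" . quotemeta($tag) . ">";

open(my $fh, '<', $file) || die "couldn't open $file: $!";
my @lines = <$fh>;
close($fh);

foreach (@lines) {
	# (.*) is greedy, and only one match per line is printed
	if (/${open}\b[^>]*>(.*)$close/) { print $1 . "\n"; }
}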
This will search a file for a tag and print out the text between the opening and closing tags.

For example...

"perl parser.pl what.html title"

will output the text between the <title> and </title> tags in the what.html file

"perl parser.pl blah.html b"

will output all the bold text in the file (i.e. all the text between <b> and </b>)



Obviously that's pretty basic, but you could play around with it till it does what you need.

edit - stupid mistake

Last edited by Nodrog; 23 Jan 2003 at 00:14.
Unread 23 Jan 2003, 00:59   #3
MT
/dev/zero
Retired Mod
 
Join Date: May 2000
Posts: 415
Code:
<b>owned</b> unbolded <b>is you</b>
heh
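
For anyone following along, the line above trips up a greedy (.*), which runs on to the last </b> on the line instead of stopping at the first. A minimal sketch of the difference, assuming the input is one string; the sub names greedy_match and lazy_match are made up for illustration and are not from any post in this thread:

```perl
use strict;
use warnings;

# Greedy: (.*) grabs as much as it can, so it runs to the LAST </b>.
sub greedy_match {
    my ($html) = @_;
    my ($text) = $html =~ m{<b>(.*)</b>};
    return $text;
}

# Non-greedy: (.*?) stops at the FIRST </b> it can reach.
sub lazy_match {
    my ($html) = @_;
    my ($text) = $html =~ m{<b>(.*?)</b>};
    return $text;
}

my $line = '<b>owned</b> unbolded <b>is you</b>';
print greedy_match($line), "\n";   # owned</b> unbolded <b>is you
print lazy_match($line), "\n";     # owned
```

The only change between the two subs is the ? after the *, which is why the fix is usually a one-character edit.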
__________________
#linux : Home of Genius

<idimmu> ok i was chained to a desk with this oriental dude
Unread 23 Jan 2003, 01:04   #4
Atamur
Ngisne
 
Join Date: Jul 2001
Location: right here
Posts: 79
($text) = $html =~ m#<blah>(.*?)</blah>#is;
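
A rough sketch of how that one-liner might be wrapped up for any tag name, assuming the whole document is already in a string. The sub name extract_first and the sample HTML are hypothetical. The /i modifier ignores case and /s lets "." match newlines, so the element can span several lines:

```perl
use strict;
use warnings;

# Return the text of the first <tag ...>...</tag> pair in $html, or undef.
# \Q...\E quotes the tag name so it is matched literally, and \b keeps
# a tag like "b" from matching "<body>".
sub extract_first {
    my ($html, $tag) = @_;
    my ($text) = $html =~ m#<\Q$tag\E\b[^>]*>(.*?)</\Q$tag\E>#is;
    return $text;
}

my $html  = "<html><head><TITLE>This is\nthe title</TITLE></head></html>";
my $title = extract_first($html, 'title');
print defined $title ? "$title\n" : "no title found\n";
```

The list assignment ($text) = ... matters here: in list context the match returns the captured group, while in scalar context it would only return true or false.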
__________________
down with signatures
Unread 23 Jan 2003, 01:06   #5
Nodrog
Registered User
 
Join Date: Jun 2000
Posts: 8,476
Quote:
Originally posted by MT
Code:
<b>owned</b> unbolded <b>is you</b>
heh
youre obviously gay
Unread 23 Jan 2003, 01:37   #6
MT
/dev/zero
Retired Mod
 
Join Date: May 2000
Posts: 415
Code:
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Arguments: the tag name first, then the HTML file
my ($tag, $doc) = @ARGV;
my $parser = HTML::TokeParser->new($doc) || die "couldn't open $doc: $!";
while (my $tok = $parser->get_tag($tag)) {
  print $parser->get_trimmed_text("/$tag") . "\n";
}
Dont oppress me homobasher!
Unread 23 Jan 2003, 06:36   #7
W
Gubbish
 
Join Date: Sep 2000
Location: #FoW
Posts: 2,323
Code:
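#!/usr/bin/perl
use strict;
use warnings;

my $tag      = quotemeta(shift(@ARGV));
my $starttag = '<' . $tag . '\b';
my $untag    = '</' . $tag . '>';
my $endtag   = '>';
my $state    = 0;  # 0: outside, 1: inside the start tag, 2: inside the element
while (<>) {
    while (length) {
        if    ($state == 0 && m/$starttag(.*)/s)   { $state = 1; $_ = $1; }
        elsif ($state == 1 && m/$endtag(.*)/s)     { $state = 2; $_ = $1; }
        elsif ($state == 2 && m/(.*?)$untag(.*)/s) { print "$1\n"; $state = 0; $_ = $2; }
        else  { print if $state == 2; last; }  # line used up; emit partial text if inside the element
    }
}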
> ./myperlcode title index.html other.html

> ./myperlcode a links.html


As you may notice, I like online algorithms.
__________________
Gubble gubble gubble gubble
Unread 23 Jan 2003, 06:40   #8
W
Gubbish
 
Join Date: Sep 2000
Location: #FoW
Posts: 2,323
Quote:
Originally posted by Nodrog
edit - stupid mistake
Well, and your program will only find one tag per line. If someone is bright enough to remove all the newlines from their HTML, your program will only ever find the first tag.
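
One possible way around the one-per-line limit, as a sketch only: loop over the line with the /g modifier so that each pass picks up the next pair. The sub name tags_in_line and the sample line are made up for illustration, and an element that spans several lines is still missed:

```perl
use strict;
use warnings;

# Collect the text of every <tag ...>...</tag> pair found on one line.
# With /g in scalar context, each pass of the loop resumes where the
# previous match ended.
sub tags_in_line {
    my ($line, $tag) = @_;
    my @found;
    while ($line =~ m{<\Q$tag\E\b[^>]*>(.*?)</\Q$tag\E>}gi) {
        push @found, $1;
    }
    return @found;
}

my $line = '<a href="one.html">One</a> and <a href="two.html">Two</a>';
print "$_\n" for tags_in_line($line, 'a');   # prints One, then Two
```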
Unread 23 Jan 2003, 11:32   #9
Atamur
Ngisne
 
Join Date: Jul 2001
Location: right here
Posts: 79
Quote:
Originally posted by W
Code:
...............
Why not simply gobble up the entire file into memory? It would be faster and simpler.
Unread 23 Jan 2003, 12:40   #10
W
Gubbish
 
Join Date: Sep 2000
Location: #FoW
Posts: 2,323
Quote:
Originally posted by Atamur
Quote:
Originally posted by W
Code:
...............
Why not simply gobble up the entire file into memory? It would be faster and simpler.
As I said, I like online algorithms. Say it's a collection of eighty thousand 10 KB HTML files?
Unread 23 Jan 2003, 13:26   #11
Atamur
Ngisne
 
Join Date: Jul 2001
Location: right here
Posts: 79
Quote:
Originally posted by W
As I said, I like online algorithms. Say it's a collection of eighty thousand 10kb html files?
Loading one file at a time into memory will be much faster than a line at a time.
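
Nobody in this thread has posted numbers for this, so here is a rough sketch for measuring it yourself with the core Benchmark module. The sub names read_slurp and read_by_line and the generated test file are made up for illustration, and the result will depend on file size, disk cache and Perl version:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Read the whole file in one go by undefining the input record separator.
sub read_slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "couldn't open $path: $!";
    local $/;                  # slurp mode, restored when the sub returns
    my $html = <$fh>;
    close($fh);
    return length $html;
}

# Read the same file one line at a time.
sub read_by_line {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "couldn't open $path: $!";
    my $bytes = 0;
    $bytes += length($_) while <$fh>;
    close($fh);
    return $bytes;
}

# Build a throwaway file of roughly 10 KB (made-up test data).
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "<p>filler line number $_</p>\n" for 1 .. 400;
close($out);

# Run each reader for at least one CPU second and print a comparison table.
cmpthese(-1, {
    slurp   => sub { read_slurp($path) },
    by_line => sub { read_by_line($path) },
});
```

This only times the reading, not the matching, so it is one half of the argument at best.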
Unread 24 Jan 2003, 00:26   #12
W
Gubbish
 
Join Date: Sep 2000
Location: #FoW
Posts: 2,323
Quote:
Originally posted by Atamur
Loading one file at a time into memory will be much faster than a line at a time.
No.