Planetarion Forums

22 Jan 2003, 23:30

Ok.... Here's what I want to do...

I think this may have to do with matching (=~), but I'm not sure. I want to be able to extract the text between certain HTML tags in a file. E.g., I want to get the titles for a group of files, so I'll want to find the <title> tag and get "This is the title" from <title>This is the title</title> without getting the whole document.

Any ideas?

Thanks,
Nik

Nodrog · 22 Jan 2003, 23:59

You could use an XML parser. Alternatively, something like this...

Code:

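#!/usr/bin/perl
# Usage: perl parser.pl FILE TAG
my(@lines, $open, $close);
$open = "<".$ARGV[1];         # start of the opening tag, e.g. "<title"
$close = "</".$ARGV[1].">";   # closing tag, e.g. "</title>"
open(FILE, "<", $ARGV[0]) || die "couldn't open file: $!";
@lines = <FILE>;
close(FILE);
foreach(@lines) {
	# ${open} stops Perl from reading $open[...] as an array element
	if(/${open}[^>]*>(.*)$close/) {print $1."\n";}
}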

This will search a file for a tag and print out the text inside it.

For example...

"perl parser.pl what.html title"

will output the text between the <title> and </title> tags in the what.html file.

"perl parser.pl blah.html b"

will output all the bold text in the file (i.e. all the text between <b> and </b>).

Obviously that's pretty basic, but you could play around with it until it does what you need.

edit - stupid mistake

MT · 23 Jan 2003, 00:59

Code:

<b>owned</b> unbolded <b>is you</b>
heh

Atamur · 23 Jan 2003, 01:04

($text) = $html =~ m#<blah>(.*?)</blah>#is;
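To see that one-liner in context, here is a minimal sketch. The sample string and the extract_first helper name are made up for illustration, and it assumes the opening tag carries no attributes. The /s flag lets . match newlines, /i ignores case, and the non-greedy .*? stops at the first closing tag.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first <$tag>...</$tag> pair in
# $html, or undef if there is none. Assumes the opening tag has no attributes,
# as in the one-liner above.
sub extract_first {
    my ($html, $tag) = @_;
    my ($text) = $html =~ m#<\Q$tag\E>(.*?)</\Q$tag\E>#is;
    return $text;
}

# Sample document: upper-case tags and a title that spans two lines.
my $html = "<html><head><TITLE>This is\nthe title</TITLE></head></html>";
my $title = extract_first($html, 'title');
print defined $title ? "$title\n" : "no title found\n";
```

$html has to hold the whole document for this to work across line breaks, so the file needs to be read into one string first.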

Nodrog · 23 Jan 2003, 01:06

Quote:

Originally posted by MT

Code:

<b>owned</b> unbolded <b>is you</b>
heh

youre obviously gay

MT · 23 Jan 2003, 01:37

Code:

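#!/usr/bin/perl -w
# Arguments: tag name, then file name
use strict;
use HTML::TokeParser;
my ($tag,$doc) = @ARGV;
my $parser = HTML::TokeParser->new($doc) or die "can't open $doc: $!";
# get_tag skips ahead to the next <$tag>; get_trimmed_text reads up to </$tag>
while (my $tok = $parser->get_tag($tag)) {
  print $parser->get_trimmed_text("/$tag") . "\n";
}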

Dont oppress me homobasher!

W · 23 Jan 2003, 06:36

Code:

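#!/usr/bin/perl
# Usage: ./myperlcode TAG FILE...
# Streams the input line by line; $state carries over from one line to the next.
use strict;
use warnings;
my $tag   = shift @ARGV;
my $open  = qr/<\Q$tag\E\b/i;      # start of the opening tag, e.g. "<title"
my $close = qr/<\/\Q$tag\E\s*>/i;  # closing tag, e.g. "</title>"
my $state = 0;  # 0 = outside the tag, 1 = inside "<tag ...", 2 = in the text
while (<>) {
    chomp;
    while (length) {
        if    ($state == 0 && /$open(.*)/)        { ($state, $_) = (1, $1) }
        elsif ($state == 1 && />(.*)/)            { ($state, $_) = (2, $1) }
        elsif ($state == 2 && /^(.*?)$close(.*)/) { print "$1\n"; ($state, $_) = (0, $2) }
        else  { print "$_ " if $state == 2; last }  # text continues on the next line
    }
}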

> ./myperlcode title index.html other.html

> ./myperlcode a links.html

As you can see, I like online algorithms.

W · 23 Jan 2003, 06:40

Quote:

Originally posted by Nodrog
edit - stupid mistake

Well, your program will also only find one tag per line. If someone is bright enough to remove all the newlines from their HTML, your program will only ever find the first tag.
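One possible way around the one-tag-per-line limit, sketched with a made-up helper name (all_in_line) and a sample line: a non-greedy match with /g in list context returns every capture on the line. It still won't handle a tag whose text spans several lines.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <$tag ...>...</$tag> pair
# found in a single line.
sub all_in_line {
    my ($line, $tag) = @_;
    # \b keeps "b" from matching "<body>"; .*? stops at the nearest close tag;
    # /g in list context collects each capture.
    return $line =~ m#<\Q$tag\E\b[^>]*>(.*?)</\Q$tag\E>#gi;
}

my $line = '<b>owned</b> unbolded <b>is you</b>';
print "$_\n" for all_in_line($line, 'b');
```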

Atamur · 23 Jan 2003, 11:32

Quote:

Originally posted by W

Code:

...............

why not simply gobble up the entire file into memory? it would be faster and simpler.

W · 23 Jan 2003, 12:40

Quote:

Originally posted by Atamur

Quote:

Originally posted by W

Code:

...............

why not simply gobble up the entire file into memory? it would be faster and simpler.

As I said, I like online algorithms. Say it's a collection of eighty thousand 10kb html files?

Atamur · 23 Jan 2003, 13:26

Quote:

Originally posted by W
As I said, I like online algorithms. Say it's a collection of eighty thousand 10kb html files?

Loading one file at a time into memory will be much faster than a line at a time.
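A minimal sketch of the one-file-at-a-time approach, for anyone who wants to try it. The slurp helper and the in-memory sample "files" are made up for illustration, and nothing here measures which approach is actually faster.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from an open filehandle in one go.
sub slurp {
    my ($fh) = @_;
    local $/;              # no record separator, so <$fh> returns the whole file
    return scalar <$fh>;
}

# In-memory "files" stand in for real ones so the sketch is self-contained.
my %files = (
    'a.html' => "<title>First\npage</title>\n<p>body</p>\n",
    'b.html' => "<p>no title here</p>\n",
);

for my $name (sort keys %files) {
    open my $fh, '<', \$files{$name} or die "can't open $name: $!";
    my $html = slurp($fh);   # only this one file is held in memory
    close $fh;
    my ($title) = $html =~ m#<title>(.*?)</title>#is;
    printf "%s: %s\n", $name, defined $title ? $title : '(none)';
}
```

With real files, the open call would take the file name instead of a scalar reference; memory use is bounded by the largest single file rather than by the whole collection.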

W · 24 Jan 2003, 00:26

Quote:

Originally posted by Atamur
Loading one file at a time into memory will be much faster than a line at a time.

No.