Searching and Parsing Text

By Mark Ciotola

First published on September 9, 2019. Last updated on February 6, 2021.

A key strength of Perl is its large set of built-in tools for searching and parsing text. This is why it is so popular in both the humanities and biotechnology (genes are expressed as strings of letters).

Below is an example that you can try out. First, some text to search is provided and placed in the variable “$speechtext”.

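Then a search condition is specified. The =~ operator tests whether the string on its left matches the pattern on its right, so here it effectively means “contains”. In some environments (the ideone environment is one example), a “\” (escape character) may need to be added before the ~ so that it is treated as a literal ~. Standard Perl, as in the listing below, needs no escape.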

The desired string is placed between the two “/” characters.

The if control structure tests the search condition and then tells the user whether or not the condition is met.

#!/usr/bin/perl

my $speechtext = "Four score and seven years ago...";

if ($speechtext =~ /seven/ ) {
   print "My text item is found.\n";
} else {
   print "My text item is not found.\n";
}
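As a minimal sketch, the same test can be wrapped in a subroutine. The name contains_text is hypothetical and not part of the original program. The sketch also shows two related features: the !~ operator (does not match) and the /i modifier (case-insensitive matching). The \Q...\E pair is used so that the search term is treated as literal text rather than as a pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $needle occurs in $text, else 0.
# \Q...\E quotes regex metacharacters so $needle is matched literally.
sub contains_text {
    my ($text, $needle) = @_;
    return $text =~ /\Q$needle\E/ ? 1 : 0;
}

my $speechtext = "Four score and seven years ago...";

# =~ is true when the pattern matches.
print "Found 'seven'.\n"        if contains_text($speechtext, "seven");

# !~ is true when the pattern does NOT match.
print "Did not find 'eight'.\n" if $speechtext !~ /eight/;

# /i ignores upper/lower case.
print "Found 'FOUR' with /i.\n" if $speechtext =~ /FOUR/i;
```

With the sample text, all three conditions hold, so all three messages are printed.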

Below is a program to count how many matches there are in a file.

use strict;
use warnings FATAL => 'all';

#### BANNER ####

print "\n=================================\n";
print "\n DOCUMENT SEARCHING PROGRAM      \n";
print "\n=================================\n\n";

#### CREATE SOME TEXT DATA ####

my $afile = 'Abe, Bill, Cathy, Darlene, Eva, Darlene, Bill';

#### WRITE FILE ####
{
    open my $fh, '>', 'afile.txt' or die "Cannot write afile.txt: $!";
    print {$fh}  $afile. "\n";
    close $fh;
}

#### READ & PRINT CONTENTS ####
{
    #### ENTER SEARCH TERM HERE:
    
    my $textitem = "Darlene";
    
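    #### OPEN FILE ####
    
    open my $fh, '<', 'afile.txt' or die "Cannot open afile.txt: $!";
    my ($filetext) = <$fh>;    # keeps only the first line of the file
    chomp $filetext;           # remove the trailing newline
    print "Data in file: ". $filetext . "\n\n";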
    
    #### SEARCH ####

    if ($filetext =~ /$textitem/ ) {
        print "My search term $textitem is found.\n\n";
    } else {
        print "My search term $textitem is not found.\n\n";
    }

    #### COUNT MATCHES ####


    my @matches = ($filetext =~ /$textitem/g);
    my $mymatches = @matches;
    print "My search term $textitem is matched $mymatches time(s).\n";

    close $fh;
}
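The counting step above can also be sketched as a small reusable subroutine. This is only an illustration: the name count_matches is hypothetical, and the text is held in a variable rather than read from a file so that the sketch is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: counts non-overlapping occurrences of $needle in $text.
sub count_matches {
    my ($text, $needle) = @_;
    # In list context, a /g match returns every match. Assigning to the
    # empty list "()" forces list context, and the outer scalar assignment
    # then receives the number of elements in that list.
    my $count = () = $text =~ /\Q$needle\E/g;
    return $count;
}

my $names = 'Abe, Bill, Cathy, Darlene, Eva, Darlene, Bill';

for my $term ('Darlene', 'Abe', 'Zed') {
    printf "%s is matched %d time(s).\n", $term, count_matches($names, $term);
}
```

Note that this counts substrings, so a search for "Bill" would also count a name such as "Billy". Adding \b (word boundary) on each side of the pattern would restrict the count to whole words.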

Below is a program that attempts to measure the strength of a relationship between two people by listing the files in which both of their names are present.

use strict;
use warnings FATAL => 'all';
# use autodie qw(:all);


#### BANNER ####

print "\n======================================\n";
print "\n DATA MINING PROGRAM -- RELATIONSHIPS \n";
print "\n======================================\n\n";

my $minerun = int(rand(100000));

print "Mining Run # $minerun \n\n";

#### DATA ####

my $meeting_attendees_1891 = 'Abe, Bill, Cathy, Darlene';
my $meeting_attendees_1892 = 'Bill, Jean';
my $meeting_attendees_1893 = 'Abe, Bill, Pat, Darlene';


#### CREATE AND WRITE DATA TO FILES ####

## In a real workflow, the files would already exist. ##
## Also, a loop would be more elegant, but trickier. ##

{
    open my $fh, '>', 'meeting_attendees_1891.txt'
        or die "Cannot write meeting_attendees_1891.txt: $!";
    print {$fh}  $meeting_attendees_1891 . "\n";
    close $fh;
}

{
    open my $fh, '>', 'meeting_attendees_1892.txt'
        or die "Cannot write meeting_attendees_1892.txt: $!";
    print {$fh}  $meeting_attendees_1892 . "\n";
    close $fh;
}

{
    open my $fh, '>', 'meeting_attendees_1893.txt'
        or die "Cannot write meeting_attendees_1893.txt: $!";
    print {$fh}  $meeting_attendees_1893 . "\n";
    close $fh;
}

#### PRINT LIST OF FILES ####

## This is somewhat manual, due to limit of this interface.

## Set up an array with the file names

my @meetingfiles = ("meeting_attendees_1891", "meeting_attendees_1892", "meeting_attendees_1893");

## Run through the array values with a loop

print "FILES:\n\n";

for my $file (@meetingfiles) {
    print "$file.txt\n";
}

#### READ DATA FROM FILES ####

## Relationship Criteria ##

my $person1 = "Darlene"; my $person2 = "Bill";

## Mine The Data ##

print "\nMY CRITERIA RESULTS:\n\n";
{
    for my $file (@meetingfiles) {

        # Read the first line of each file, stopping if it cannot be opened
        open my $fh, '<', "$file.txt" or die "Cannot open $file.txt: $!";
        my ($filetext) = <$fh>;
        close $fh;

        if ($filetext =~ /$person1/ and $filetext =~ /$person2/ ) {
            print "Both $person1 and $person2 are found in $file.txt\n";
        }
    }
}
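To turn the listing above into a single number, one possible approach is to count the meetings that both people attended. The sketch below is an assumption about how that might be done, not the author's method: the data is kept in a hash keyed by year (instead of in files) so the example is self-contained, and the name shared_meetings is hypothetical.

```perl
use strict;
use warnings;

# Same sample data as the program above, held in memory (keyed by year)
# instead of in files so that the sketch is self-contained.
my %attendees = (
    1891 => 'Abe, Bill, Cathy, Darlene',
    1892 => 'Bill, Jean',
    1893 => 'Abe, Bill, Pat, Darlene',
);

# Hypothetical helper: returns the sorted list of years in which both
# names appear. \b...\b restricts each match to a whole word, and
# \Q...\E treats each name as literal text.
sub shared_meetings {
    my ($data, $person1, $person2) = @_;
    return grep {
        $data->{$_} =~ /\b\Q$person1\E\b/ && $data->{$_} =~ /\b\Q$person2\E\b/
    } sort keys %$data;
}

my @years = shared_meetings(\%attendees, 'Darlene', 'Bill');
print "Darlene and Bill share ", scalar(@years), " meeting(s): @years\n";
```

With the sample data, Darlene and Bill both appear in the 1891 and 1893 meetings, so the count is 2. A higher count would suggest a stronger relationship under this simple measure.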


Content is copyright the author. Layout is copyright Mark Ciotola. See Corsbook.com for further notices.