Regular Expressions

Notes:
  • Series.str functions that take a pattern (startswith, endswith, and rsplit accept literal strings only, not regular expressions):
    • contains
    • count
    • endswith
    • extract
    • extractall
    • findall
    • match
    • replace
    • rsplit
    • split
    • startswith
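
A minimal sketch of a few of these methods on a small Series. The sample data and variable names are made up for illustration and are not part of the original notes.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series(["555-123-4567", "5551234567", "no phone", None])

# contains: True where the pattern is found anywhere in the string;
# na=False makes missing values count as "no match"
has_digits = s.str.contains(r"\d", na=False)

# match: the pattern must match starting at the first character
is_dashed = s.str.match(r"\d{3}-\d{3}-\d{4}$", na=False)

# extract: returns the first capture group, NaN where there is no match
area = s.str.extract(r"^(\d{3})", expand=False)

# replace: pass regex=True explicitly when the pattern is a regex
digits_only = s.str.replace(r"\D", "", regex=True)

# count: number of non-overlapping matches in each element
n_dashes = s.str.count(r"-")

print(has_digits.tolist())
print(is_dashed.tolist())
print(area.tolist())
print(digits_only.tolist())
```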

Dictionary regex:

import pandas as pd
import csv
import re

#read csv and assign to book
# book = pd.read_csv ('phonebook01.csv')
book = pd.read_csv ('../p.csv')

# print("*****Table of incorrect inputs. Rows will not be entered to phonebook"
# " if an element in that row contains data in table below******\n")
# Regex for each column: first, last, zip, and phone.
# Raw strings (r'...') keep backslashes such as \d from being read as string escapes.
dict_regex = {
    'first': r'([A-Z][a-z]*$)',
    'last' : r'([A-Z][a-z]*$)',
    'zip'  : r'(^\d{5}(-\d{4})?$)|(\d{9}$)',
    'phone': r'\d{10}|(\d\d\d[-]\d\d\d[-]\d\d\d\d)'
}
# Text that matches a column's regex is replaced with an empty string,
# which leaves only the incorrect entries showing.
# new_book = book.replace(dict_regex, value='',reg
new_book = book.replace(regex=dict_regex, value='')
print(new_book)

Output:

<complex and depends on data file>
<matches are replaced with blanks, leaving problem strings>
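
Because the real output depends on the data file, here is a small self-contained sketch of the same idea. The rows are invented, and the patterns are anchored at both ends, which is a deliberate variation on the patterns above so that only fully valid entries are blanked.

```python
import pandas as pd

# Hypothetical rows standing in for the CSV file; every value is a string,
# because regex replacement only applies to string values.
book = pd.DataFrame(
    {
        "first": ["Alice", "bob", "Carol"],
        "last": ["Smith", "Jones", "O'Neil"],
        "zip": ["12345", "1234", "12345-6789"],
        "phone": ["5551234567", "555-123-4567", "555-1234"],
    }
)

# One pattern per column name
dict_regex = {
    "first": r"^[A-Z][a-z]*$",
    "last": r"^[A-Z][a-z]*$",
    "zip": r"^\d{5}(-\d{4})?$",
    "phone": r"^(\d{10}|\d{3}-\d{3}-\d{4})$",
}

# Valid entries become empty strings; invalid entries are left unchanged.
problems = book.replace(regex=dict_regex, value="")
print(problems)
```

In this sketch only "bob", "O'Neil", "1234", and "555-1234" remain visible.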


Function with regex:

import pandas as pd

# dtype=str reads every column as text, so .str methods also work on zip codes
book = pd.read_csv ('phonebook01.csv', dtype=str)

def check (col, rx, sn):
   # rx starts with '?!', so wrapping it in parentheses builds a negative
   # lookahead: match() is True where the value does NOT fit the pattern
   b = book[col].str.match (r'('+rx+')')
   # print ("first b:\n", b)
   # missing values come back as NaN; count them as errors
   b = b.fillna (True)
   # print (type(b))
   bf = book[b].fillna ("<missing>")
   if bf.size > 0:
      print ("\nThe following lines have "+sn+" errors")
      print (bf)
   else:
      print ("\nAll "+sn+"s are ok.")
   return b

bf = check ('first' , r'?![A-Z][a-z]+$'            , 'first name')
bz = check ('zip'   , r'?!(\d{5}-\d{4}$)|(\d{5}$)' , 'zip code')


print ("good lines:\n", book [~(bz|bf)])

Output:

<complex and depends on data file>
<rows with errors are printed for each check, followed by the good lines>
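
An alternative sketch of the same check that avoids the negative lookahead by using Series.str.fullmatch and inverting the result. The helper name find_errors and the sample rows are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data standing in for phonebook01.csv
book = pd.DataFrame(
    {
        "first": ["Alice", "bob", None],
        "zip": ["12345", "12345-6789", "1234"],
    }
)

def find_errors(frame, col, pattern):
    """Return a boolean Series that is True where frame[col] is invalid."""
    # fullmatch requires the whole string to match the pattern;
    # na=False makes missing values count as not valid.
    valid = frame[col].str.fullmatch(pattern, na=False)
    return ~valid

bad_first = find_errors(book, "first", r"[A-Z][a-z]+")
bad_zip = find_errors(book, "zip", r"\d{5}(-\d{4})?")

# Keep only the rows with no error in either column
good = book[~(bad_first | bad_zip)]
print("good lines:\n", good)
```

Here only row 0 survives: row 1 has a lowercase first name, and row 2 has a missing name and a four-digit zip.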

Row by row for loop:

# df is some pandas DataFrame
for index, row in df.iterrows():
    print (f'>{row}<\n') # print each row to check
    if index == 0 : continue # skip row 0 (it is the header line only if the file was read with header=None)
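
A small runnable sketch of the loop, assuming a made-up two-row frame. Each row arrives as a Series indexed by column name, and index is the row label. When a file is read with the default read_csv settings, the header line is already consumed, so label 0 is the first data row.

```python
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({"first": ["Alice", "Bob"], "zip": ["12345", "54321"]})

labels = []
for index, row in df.iterrows():
    # row is a Series; look up individual fields by column name
    labels.append(f"{index}: {row['first']} ({row['zip']})")

print(labels)
```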

Any and all example:

a = range (-5, 25, 5)
print (a)
b = list (a)
print (b)

print ("any:", any(b))
print ("all:", all(b))

import pandas as pd
c = list (map(lambda x: x != 0, b))
print (c)
d = pd.DataFrame(a, columns=['values'])
print ("   plain:\n", d)
print ("filtered:\n", d[c])

Output (notice that row 1, whose value is 0, has been removed in the filtered display):

range(-5, 25, 5)
[-5, 0, 5, 10, 15, 20]
any: True
all: False
[True, False, True, True, True, True]
   plain:
    values
0      -5
1       0
2       5
3      10
4      15
5      20
filtered:
    values
0      -5
2       5
3      10
4      15
5      20
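
The same filter can be sketched with pandas' own vectorized comparison and the Series any/all methods, instead of map and the built-in any/all. The variable names mask and filtered are illustrative choices.

```python
import pandas as pd

d = pd.DataFrame({"values": range(-5, 25, 5)})

# Vectorized comparison: a boolean Series, no lambda needed
mask = d["values"] != 0

print("any:", mask.any())  # is at least one value nonzero?
print("all:", mask.all())  # are all values nonzero?

# Boolean indexing keeps only the rows where mask is True
filtered = d[mask]
print(filtered)
```

Note that mask tests "value != 0" explicitly, which matches the truthiness test that the built-in any and all apply to the plain list of integers above.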




(end)