How to Read Words Into a List in Python 3

How to extract specific portions of a text file using Python

Updated: 06/30/2020 by Computer Hope

Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.

Make sure you're using Python 3

In this guide, we'll be using Python version 3. Most systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.

For Microsoft Windows, Python 3 can be downloaded from the Python official website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked, as shown in the image below.

Installing Python 3.7.2 for Windows.

On Linux, you can install Python 3 with your package manager. For instance, on Debian or Ubuntu, you can install it with the following command:

sudo apt-get update && sudo apt-get install python3

For macOS, the Python 3 installer can be downloaded from python.org, as linked above. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications > Utilities), and running this command:

brew install python3

Running Python

On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.

Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().

Running Python with a file name will interpret that Python program. For instance:

python3 program.py

...runs the program contained in the file program.py.

Okay, how can we use Python to extract text from a text file?

Reading data from a text file

First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

Note

In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.

A Python program can read a text file using the built-in open() function. For example, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.

myfile = open("lorem.txt", "rt") # open lorem.txt for reading text
contents = myfile.read()         # read the entire file to string
myfile.close()                   # close the file
print(contents)                  # print string contents

Here, myfile is the name we give to our file object.

The "rt" parameter in the open() function means "we're opening this file to read text data."

The hash mark ("#") means that everything after it on that line is a comment, and it's ignored by the Python interpreter.

If you save this program in a file called read.py, you can run it with the following command.

python3 read.py

The command above outputs the contents of lorem.txt:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

Using "with open"

It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.

When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.

Using with open...as, we can rewrite our program to look like this:

with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file to a string
print(contents)                          # Print the string

Note

Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.

Example

Save the program as read.py and execute it:

python3 read.py

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
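One way to confirm that the with block really closes the file is to check the file object's closed attribute. The sketch below is illustrative only: it writes its own throwaway file (sample.txt in a temporary directory, names chosen for this example) so that it runs without lorem.txt.

```python
import os
import tempfile

# Create a small throwaway file so the sketch does not depend on lorem.txt.
# The directory and file name here are assumptions made for this example.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "wt") as outfile:
    outfile.write("first line\nsecond line\n")

with open(sample_path, "rt") as myfile:
    contents = myfile.read()
    open_inside_block = not myfile.closed   # the file is still open here

closed_after_block = myfile.closed          # the with block closed it for us

print(contents, end="")
print("Open inside the block:", open_inside_block)
print("Closed after the block:", closed_after_block)
```

The file is also closed if the code inside the block raises an exception, which is the main reason to prefer this form over calling close() by hand.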

Reading text files line-by-line

In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll encounter an error.

In almost every case, it's a better idea to read a text file one line at a time.

In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.

Example

For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:

with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
    for myline in myfile:                # For each line, read to a string,
        print(myline)                    # and print the string.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a line break of its own at the end of whatever you've asked it to print.

Let's store our lines of text in a variable (specifically, a list variable) so we can look at it more closely.

Storing text data in a variable

In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.

Example

mylines = []                             # Declare an empty list named mylines.
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text data.
    for myline in myfile:                # For each line, stored as myline,
        mylines.append(myline)           # add its contents to mylines.
print(mylines)                           # Print the list.

The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:

Output:

['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']

Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.

Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero; in other words, the nth element of a list has the numeric index n-1.

Note

If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.

Example

We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:

print(mylines[0])

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Example

Or the third line, by specifying index number 2:

print(mylines[2])

Output:

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

But if we try to access an index for which there is no value, we get an error:

Example

print(mylines[4])

Output:

Traceback (most recent call last):
  File <filename>, line <linenum>, in <module>
    print(mylines[4])
IndexError: list index out of range
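If an out-of-range index is a real possibility, one option is to compare the index against len() before using it. The helper below, get_line, is a hypothetical name introduced for this sketch, and the list is a stand-in for lines read from a file.

```python
# Stand-in for the four lines read from a file (assumed data for this sketch).
mylines = ["first\n", "second\n", "third\n", "fourth\n"]

def get_line(lines, index):
    """Return lines[index], or None when index is outside 0..len(lines)-1.

    Negative indexes are treated as out of range here, although Python
    itself accepts them and counts from the end of the list.
    """
    if 0 <= index < len(lines):
        return lines[index]
    return None

print(get_line(mylines, 0))   # the first element
print(get_line(mylines, 4))   # None: a four-element list has indexes 0 to 3
```

Catching IndexError with try...except is an equally reasonable choice; which one reads better depends on the surrounding code.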

Example

A list object is iterable, so to print every element of the list, we can iterate over it with for...in:

mylines = []                              # Declare an empty list
with open('lorem.txt', 'rt') as myfile:   # Open lorem.txt for reading text.
    for line in myfile:                   # For each line of text,
        mylines.append(line)              # add that line to the list.
    for element in mylines:               # For each element in the list,
        print(element)                    # print it.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.

We can change this default behavior by specifying an end parameter in our print() call:

print(element, end='')

By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.

Example

Our revised program looks like this:

mylines = []                              # Declare an empty list
with open('lorem.txt', 'rt') as myfile:   # Open file lorem.txt
    for line in myfile:                   # For each line of text,
        mylines.append(line)              # add that line to the list.
    for element in mylines:               # For each element in the list,
        print(element, end='')            # print it without extra newlines.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.

How to strip newlines

To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.

Tip

This process is sometimes also called "trimming."

Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.

If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string of characters to strip. For example, "123abc".rstrip("bc") returns 123a.
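A detail that is easy to miss is that chars is treated as a set of characters, not as a suffix: rstrip() keeps removing characters from the right for as long as the last character appears anywhere in chars. The short sketch below shows a few cases; the variable names are made up for illustration.

```python
suffix_like = "123abc".rstrip("bc")        # '123a'
any_order = "123abcb".rstrip("bc")         # '123a': trailing b, c, b are all in the set
all_newlines = "hello\n\n".rstrip("\n")    # 'hello': every trailing newline is removed
default_ws = "  hi  \n".rstrip()           # '  hi': no argument strips trailing whitespace

for value in (suffix_like, any_order, all_newlines, default_ws):
    print(repr(value))
```

Note that the leading spaces in the last case are untouched, because rstrip() only works on the right side of the string.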

Tip

When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted: enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.

The statement string.rstrip('\n') will strip a newline character from the right side of string. The following version of our program strips the newlines when each line is read from the text file:

mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.
for element in mylines:                     # For each element in the list,
    print(element)                          # print it.

The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.
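The same approach extends from lines to individual words. As a rough sketch, the code below builds the list of stripped lines with a list comprehension and then splits each line into words with str.split(). It writes its own small stand-in file (sample.txt in a temporary directory, an assumption made so the example runs without lorem.txt).

```python
import os
import tempfile

# Write a small stand-in file; the path and contents are assumptions for this sketch.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "wt") as outfile:
    outfile.write("Lorem ipsum dolor\nsit amet\n")

with open(sample_path, "rt") as myfile:
    mylines = [line.rstrip("\n") for line in myfile]   # one string per line

words = []
for line in mylines:
    words.extend(line.split())   # split() with no argument splits on any whitespace

print(mylines)
print(words)
```

Punctuation stays attached to the words with this approach ("amet," would keep its comma), so a real program may need an extra cleanup step.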

Now, let's search the lines in the list for a specific substring.

Searching text for a substring

Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.

The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.

Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.

In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string, it returns -1 to indicate nothing was found.

Example

print(mylines[0].find("e"))

Output:

3

The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)

The find() method takes two optional, additional parameters: a start index and a stop index, indicating where in the string the search should begin and end. For example, string.find("abc", 10, 20) searches for the substring "abc", but only from the 11th through the 20th character (the stop index itself is excluded). If stop is not specified, find() starts at index start, and stops at the end of the string.

Example

For instance, the following statement searches for "e" in mylines[0], beginning at the fifth character.

print(mylines[0].find("e", 4))

Output:

24

In other words, starting at the 5th character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").

Example

To start searching at index 10, and stop at index 30:

print(mylines[2].find("e", 10, 30))

Output:

28

(The first "e" in "Maecenas").

If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:

print(mylines[0].find("e", 25, 30))

Output:

-1

There were no "e" occurrences between indices 25 and 30.
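The start and stop arguments behave like a slice, so the stop index is not part of the search range. The following sketch uses the first line of the sample text as a plain string (so it runs without the file) to show where that boundary falls; the variable names are illustrative.

```python
line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

first = line.find("e")                  # 3: the "e" in "Lorem"
after_first = line.find("e", 4)         # 24: the "e" in "amet"
stop_excluded = line.find("e", 4, 24)   # -1: index 24 itself is not searched
stop_included = line.find("e", 4, 25)   # 24: the range now covers index 24
none_in_range = line.find("e", 25, 30)  # -1: no "e" at indexes 25 through 29

print(first, after_first, stop_excluded, stop_included, none_in_range)
```

When only a yes-or-no answer is needed, the in operator ("e" in line) is usually simpler than comparing the result of find() with -1.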

Finding all occurrences of a substring

But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.

In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string. Specifically, the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find returns -1, or the start index exceeds the length of the string, we stop.

# Build array of lines from file, strip newlines

mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.

# Locate and print all occurrences of letter "e"

substr = "e"                      # substring to search for.
for line in mylines:              # string to be searched
    index = 0                     # current index: character being compared
    prev = 0                      # previous index: last character compared
    while index < len(line):      # While index has not exceeded string length,
        index = line.find(substr, index)  # set index to next occurrence of "e"
        if index == -1:           # If nothing was found,
            break                 # exit the while loop.
        print(" " * (index - prev) + substr, end='')  # print spaces from previous
                                                      # match, then the substring.
        prev = index + len(substr)    # remember this position for next loop.
        index += len(substr)      # increment the index by the length of substr.
                                  # (Repeat until index > line length)
    print('\n' + line)            # Print the original string under the e's

Output:

   e                    e       e  e               e
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                         e  e
Nunc fringilla arcu congue metus aliquam mollis.
        e                   e e          e    e      e
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
      e
Quisque at dignissim lacus.
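When the positions are needed as data rather than as printed markers, the same loop can collect them into a list. The function below, find_all, is a hypothetical helper written for this sketch; it returns the start index of every non-overlapping occurrence.

```python
def find_all(text, substr):
    """Return the start index of every non-overlapping occurrence of substr."""
    positions = []
    if not substr:
        return positions          # an empty substring would never advance
    index = text.find(substr)
    while index != -1:
        positions.append(index)
        # Resume the search just past the end of the previous match.
        index = text.find(substr, index + len(substr))
    return positions

print(find_all("Quisque at dignissim lacus.", "e"))   # [6]
print(find_all("Lorem ipsum dolor sit amet, consectetur adipiscing elit.", "e"))
```

Advancing by len(substr) skips overlapping matches, so searching "aaaa" for "aa" reports indexes 0 and 2 but not 1; advancing by 1 instead would include overlaps.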

Incorporating regular expressions

For complex searches, use regular expressions.

The Python regular expressions module is called re. To use it in your program, import the module before you use it:

import re

The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.

For example, let's say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\w*r\b". What does this mean?

Character sequence    Meaning
\b                    A word boundary. It matches an empty string (a position, not an actual character), but only where a word character sits next to a non-word character or the start or end of the string. "Word characters" are the digits 0 through 9, the lowercase and uppercase letters, or an underscore ("_").
d                     Lowercase letter d.
\w*                   \w represents any word character, and * is a quantifier meaning "zero or more of the previous character." So \w* will match zero or more word characters.
r                     Lowercase letter r.
\b                    Word boundary.

So this regular expression will match any string that can be described as "a word boundary, then a lowercase 'd', then zero or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.
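To see the pattern at work before going further, the sketch below runs it over a made-up sentence using re.findall(), a function in the standard re module that returns every non-overlapping match as a list of strings. The sample sentence is an assumption chosen to include both matching and non-matching words.

```python
import re

# Made-up sample text: "dog" does not end in r, and "radar" does not start with d.
sample = "The doctor saw a dour destroyer, not a dog or a radar, dr"

matches = re.findall(r"\bd\w*r\b", sample)
print(matches)   # ['doctor', 'dour', 'destroyer', 'dr']
```

The "d" inside "radar" is not matched because there is no word boundary in front of it; it sits between two word characters.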

To use this regular expression in Python search operations, we first compile it into a pattern object. For instance, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.

pattern = re.compile(r"\bd\w*r\b")

Note

The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret the escape sequences such as \b in other ways. Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.

Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".

import re
text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")      # compile regex "\bd\w*r\b" to a pattern object
if pat.search(text) is not None:    # Search for the pattern. If found,
    print("Found it.")

Output:

Found it.

To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:

import re
text = "Hello, Doctor."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(text) is not None:
    print("Found it.")

Output:

Found it.
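The match object returned by search() carries more than a yes-or-no answer. As a brief sketch, its group(), start(), and end() methods report the matched text and where it sits in the string; the variable names below are illustrative.

```python
import re

text = "Hello, Doctor."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)

match = pat.search(text)
if match is not None:
    found = match.group()   # the text that matched: 'Doctor'
    begin = match.start()   # index of the first matched character: 7
    finish = match.end()    # index just past the last matched character: 13
    print(found, begin, finish)
```

As with find(), the end position is exclusive, so text[begin:finish] gives back the matched text.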

Putting it all together

So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.

Print all lines containing substring

The program below reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.

Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the errors.append() statement, we construct an output string by joining several strings with the + operator.

errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)

Input (stored in logfile.txt):

This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!

Output:

Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
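The same result can be reached with a little less bookkeeping. The variation below is only a sketch: it uses enumerate() to number the lines, the in operator instead of find(), and an f-string (Python 3.6 or later) to build the output. The log lines are held in a list, an assumption made so the example runs without logfile.txt.

```python
# Stand-in for the contents of logfile.txt (assumed data for this sketch).
log_lines = [
    "This is line 1",
    "This is line 2",
    "Line 3 has an error!",
    "This is line 4",
    "Line 5 also has an ERROR!",
]

errors = []
for linenum, line in enumerate(log_lines, start=1):   # count lines from 1
    if "error" in line.lower():                       # case-insensitive test
        errors.append(f"Line {linenum}: {line}")

for err in errors:
    print(err)
```

A file object can be passed to enumerate() in the same way as the list, so the loop works unchanged inside a with open(...) block, apart from stripping the trailing newline.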

Extract all lines containing substring, using regex

The program below is similar to the above program, but using the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similar to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.

import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                            # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])

Output:

Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 Error: usb 1-1: can't set config, exiting.

Extract all lines containing a phone number

The program below prints any line of a text file, info.txt, which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". This regex matches the following phone number notations:

  • 123-456-7890
  • (123) 456-7890
  • 123 456 7890
  • 123.456.7890
  • +91 (123) 456-7890

import re
errors = []
linenum = 0
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('info.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) != None:  # If pattern search finds a match,
            errors.append((linenum, line.rstrip('\n')))
for err in errors:
    print("Line ", str(err[0]), ": " + err[1])

Output:

Line  3 : My phone number is 731.215.8881.
Line  7 : You can reach Mr. Walters at (212) 558-3131.
Line  12 : His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line  14 : She can also be contacted at (888) 312.8403, extension 12.
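It helps to try a pattern like this on a few hand-written strings before trusting it on a real file. The sketch below uses made-up sample lines and shows one thing to watch for: the only required parts of the pattern are three digits followed by four digits, so any run of seven digits also counts as a match.

```python
import re

pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

# Made-up sample lines for this sketch.
samples = [
    "My phone number is 731.215.8881.",
    "Call (212) 558-3131 today.",
    "Order 1234567 has shipped.",    # not a phone number, but seven digits in a row
    "No digits here.",
]

matched = [s for s in samples if pattern.search(s) is not None]
for s in matched:
    print(s)
```

The first three lines are reported, including the order number. If false positives like that matter, the pattern would need to be tightened, for example by requiring the area code.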

Search a dictionary for words

The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.

import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:
        if pattern.search(line) != None:
            print(line, end='')

Output:

Hope
heliotrope
hope
hornpipe
horoscope
hype

Source: https://www.computerhope.com/issues/ch001721.htm
