Modules: I/O
In this part we do our first major exploration of modules/packages to achieve some end; in this case file input and output (I/O).
There are a variety of functions to help with I/O, including the builtins input and open, and functions in the libraries: fileinput; os; pathlib; contextlib; csv; json; pickle; markup; internet; tempfile; shutil.
We'll look at how to use these for getting hold of data, starting with basic file reading and writing, then how to handle different filetypes, before we finally look at navigating the Operating System. First up, the basics.
I/O basics (powerpoint)
Further info:
If you want to experiment with redirecting stdin and stdout, here are some files (instructions in the Python file): stdio.py; stdin.txt.
You can find a clear list of the ASCII numbers at ASCII table and a more detailed presentation on Wikipedia.
Line Endings:
The line endings in files vary depending on the operating system.
POSIX systems (Linux, macOS, etc.) use the ASCII newline (or "linefeed") character, represented by the escape sequence \n.
Windows uses two characters: ASCII carriage return (\r) (which was used by typewriters to return the typing head to the start of the line), followed by newline.
Macs now use the Linux standard, but used to use just carriage return.
You can find the OS default using the os library: os.linesep, but generally if you use \n, the Python default, Windows copes with it fine, because Python translates it when writing files opened in text mode. Directly using os.linesep as a line terminator in text mode is advised against.
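The translation described above can be seen with a minimal sketch. The file name endings.txt is hypothetical, and the file is written to a temporary directory. The code writes \n in text mode, then reads the raw bytes back to show what actually landed on disk:

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "endings.txt")

# Text mode (the default): "\n" is translated to the OS default on writing.
with open(path, "w") as f:
    f.write("first\nsecond\n")

# Binary mode: no translation, so we see the bytes stored on disk.
with open(path, "rb") as f:
    raw = f.read()

# Text mode again: line endings are translated back to "\n" on reading.
with open(path) as f:
    text = f.read()

print(repr(os.linesep))  # the OS default line ending
print(raw)               # the bytes on disk
print(repr(text))        # what Python hands back in text mode
```

The output of the first two print calls depends on the operating system, which is the point: the code itself only ever deals with \n.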
For more information on context managers, see the contextlib documentation.
For more information on open, see the functions documentation.
Quiz: This code, designed to build a list of lists, with the inner lists being rows of data, is wrong because _____________________
f = open("in.txt")
data = []
for line in f:
    parsed_line = str.split(line,",")
    data_line = []
    for word in parsed_line:
        data_line.append(float(word))
        data.append(data_line)
print(data)
f.close()
- of nothing, it's fine.
- print(data) should be after the file is closed.
- the data.append(data_line) isn't indented properly.
Correct! Because the data.append(data_line) is double indented, it gets done for every number in the file rather than once per row. As each append adds a reference to the same data_line list, you end up with a list of lists in which every row appears once for each number it contains. What we're after is a list of lists where each inner list is a single row of numbers. We need to dedent the append one level so it is done for each row.
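One possible corrected version is sketched below. It assumes the file holds only comma-separated numbers with no blank lines. The function name read_rows and the sample data are hypothetical, and the with statement is swapped in for the explicit open/close of the quiz code:

```python
import os
import tempfile

def read_rows(path):
    """Return a list of rows, each row being a list of floats."""
    data = []
    with open(path) as f:               # the file is closed automatically
        for line in f:
            data_line = []
            for word in line.split(","):
                data_line.append(float(word))
            data.append(data_line)      # once per row, not once per number
    return data

# Hypothetical sample data standing in for in.txt.
sample_path = os.path.join(tempfile.mkdtemp(), "in.txt")
with open(sample_path, "w") as f:
    f.write("1,2,3\n4,5,6\n")

rows = read_rows(sample_path)
print(rows)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```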
Having seen how to read generic text files, now let's look at specific file types. For most common generic file types there are libraries that make reading and writing even simpler.
File types (powerpoint)
Further info:
Note that there are different dialects of csv which can be accounted for (details).
For more on JSON, see JSON.org and the Python json library. For GeoJSON see the GeoJSON specification. Both are subsets of the broader markup YAML, which is used for configuration and object saving.
For processing HTML/XML and other markup, see the markup library; for dealing with the Internet and Web more generally, see the internet library.
To serialise more complicated objects, see Pickle.
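The note above on csv dialects can be illustrated with a short sketch. The semicolon-delimited data is made up for the example, and io.StringIO stands in for a real file:

```python
import csv
import io

# Hypothetical data: semicolon-delimited rather than comma-delimited.
text = "x;y\n1.5;2.5\n3.0;4.0\n"

# The delimiter argument overrides the default "excel" dialect's comma.
reader = csv.reader(io.StringIO(text), delimiter=";")
header = next(reader)
rows = [[float(value) for value in row] for row in reader]

print(header)  # ['x', 'y']
print(rows)    # [[1.5, 2.5], [3.0, 4.0]]

# The dialects currently registered with the csv library.
print(csv.list_dialects())
```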
Quiz: The following JSON is broken because of ________________________________________
{
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry":
            {
                "type": "Point",
                "coordinates": [42.0, 21.0]
            ,
            "properties":
            {
                "prop0": "value0"
            }
        }
    ]
}
- an extra comma.
- a missing data element.
- a missing }.
Correct! The correct JSON is:
{
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry":
            {
                "type": "Point",
                "coordinates": [42.0, 21.0]
            },
            "properties":
            {
                "prop0": "value0"
            }
        }
    ]
}
But you have to be pretty eagle-eyed to spot these kinds of syntax errors manually, which is why Python provides the command line tool json.tool for checking JSON files.
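The same kind of check can be done from inside a program. The sketch below is one way to do it, not the only one: the helper name check_json is hypothetical, and the two strings are cut-down versions of the quiz JSON.

```python
import json

# From the command line the equivalent check is:
#     python -m json.tool myfile.json
# (myfile.json is a hypothetical file name).

def check_json(text):
    """Return (True, None) if text parses as JSON, else (False, message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, None

# Missing the } that should close "geometry".
broken = ('{"geometry": {"type": "Point", "coordinates": [42.0, 21.0], '
          '"properties": {"prop0": "value0"}}')
fixed = ('{"geometry": {"type": "Point", "coordinates": [42.0, 21.0]}, '
         '"properties": {"prop0": "value0"}}')

print(check_json(broken))
print(check_json(fixed))
```

Note that the parser reports where it gave up, which is often some way after the actual mistake.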
Finally, if we're dealing with files, it is useful to look in detail at how we might navigate the file system to find specific files and directories.
Paths and OS (powerpoint)
Further info:
os library documentation.
pathlib library documentation.
For more info on setting Env Variables, see os.environ.
To get information about files and directories (including file sizes and modification times), see Path.stat
To set file permissions and ownership, see chmod and chown.
Understanding file permission representation, especially on POSIX systems, is one of those key skills all programmers need under their belt. In Python you'll see file permissions represented as octal numbers like 0o666 (the first two characters being Python's marker for octal numbers).
The numbers are, left-to-right: the owner's permissions, group permissions, and public permissions.
Each is a sum of the following numbers:
4 = read
2 = write
1 = execute
So in 0o666 all three are set to read and write permissions, but not execute. Because the values are 4, 2, and 1, each digit from 0 to 7 corresponds to exactly one combination of permissions.
This is the classic POSIX file permission system. The Windows one is more sophisticated, which means Python interacts with it poorly: on Windows, os.chmod can largely only set a file's read-only flag.
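The arithmetic above can be sketched in a few lines. The function name describe_permissions is hypothetical. It simply splits the number into its three octal digits and tests each one against 4, 2, and 1:

```python
def describe_permissions(mode):
    """Split an octal permission number into owner/group/public permissions."""
    flags = ((4, "read"), (2, "write"), (1, "execute"))
    description = {}
    for shift, who in ((6, "owner"), (3, "group"), (0, "public")):
        digit = (mode >> shift) & 0o7   # one octal digit is three bits
        description[who] = [name for value, name in flags if digit & value]
    return description

print(describe_permissions(0o666))
print(describe_permissions(0o754))
```

For 0o754 this gives read, write, and execute for the owner (4 + 2 + 1), read and execute for the group (4 + 1), and read only for the public (4).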
The glob library is for pattern hunting in files and directories (documentation).
The tempfile library is for generating temporary files and directories (documentation).
The shutil library is for high-level file operations, like copying files and directory structures (documentation).
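Several of the libraries listed above can be combined in one short sketch. The file and directory names are hypothetical, everything happens inside a temporary directory, and pathlib's rglob method is used for the pattern matching in place of the glob library:

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a temporary directory that is removed when the block ends.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "data.txt"          # hypothetical file name
    source.write_text("1,2,3\n")

    # Copy the file into a new subdirectory; shutil.copy returns the new path.
    backup_dir = root / "backup"
    backup_dir.mkdir()
    copy_path = Path(shutil.copy(source, backup_dir))

    # Find every .txt file below root, at any depth.
    found = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))

    # File size in bytes and last modification time (seconds since the epoch).
    info = source.stat()
    size, modified = info.st_size, info.st_mtime

print(found)  # ['backup/data.txt', 'data.txt']
print(size, modified)
```

The size printed depends on the operating system's line ending, since write_text works in text mode.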