Working with Data Files On Linux

When you have a large amount of data, handling the information and making it useful can be difficult. As you saw with the du command in the previous section, it’s easy to get data overload when working with system commands.

The Linux system provides several command line tools to help you manage large amounts of data. This section covers the basic commands that every system administrator — as well as any everyday Linux user — should know how to use to make their lives easier.

Sorting data

The sort command is a popular utility that comes in handy when working with large amounts of data. It does what its name says: it sorts data.

By default, the sort command sorts the data lines in a text file using standard sorting rules for the language you specify as the default for the session.

$ cat file1
one
two
three
four
five
$ sort file1
five
four
one
three
two
$

It’s pretty simple, but things aren’t always as easy as they appear. Look at this example:

$ cat file2
1
2
100
45
3
10
145
75
$ sort file2
1
10
100
145
2
3
45
75
$

If you were expecting the numbers to sort in numerical order, you were disappointed. By default, the sort command interprets numbers as characters and performs a standard character sort, producing output that might not be what you want. To solve this problem, use the -n parameter, which tells the sort command to recognize numbers as numbers instead of characters and to sort them based on their numerical values:

$ sort -n file2
1
2
3
10
45
75
100
145
$

Now, that’s much better! Another common parameter that’s used is -M, the month sort. Linux log files usually contain a timestamp at the beginning of the line to indicate when the event occurred:

Sep 13 07:10:09 testbox smartd[2718]: Device: /dev/sda, opened

If you sort a file that uses timestamp dates using the default sort, you get something like this:

$ sort file3
Apr
Aug
Dec
Feb
Jan
Jul
Jun
Mar
May
Nov
Oct
Sep
$

It’s not exactly what you wanted. If you use the -M parameter, the sort command recognizes the three-character month nomenclature and sorts appropriately:

$ sort -M file3
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
$
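Real log lines carry more than a month name, so the month key usually needs a tiebreaker. The following is a minimal sketch, not output from a real system: the sample.log file and its entries are made up for illustration, and LC_ALL=C is set on the assumption that the month abbreviations are English.

```shell
# Hypothetical sample log; the file name and entries are made up for illustration.
workdir=$(mktemp -d)
cat > "$workdir/sample.log" <<'EOF'
Sep 13 07:10:09 testbox smartd[2718]: Device: /dev/sda, opened
Jan 5 11:02:41 testbox sshd[811]: Server listening
Sep 2 23:59:01 testbox crond[402]: job started
Apr 21 08:15:30 testbox kernel: eth0 link up
EOF

# -k1,1M sorts on field 1 as a month name; -k2,2n breaks ties on the day number.
# LC_ALL=C pins the month abbreviations to English.
sorted=$(LC_ALL=C sort -k1,1M -k2,2n "$workdir/sample.log")
printf '%s\n' "$sorted"
rm -r "$workdir"
```

Restricting each key to a single field (1,1 and 2,2) keeps the month comparison from spilling into the rest of the line.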

The following table lists other handy sort parameters you can use.

The sort Command Parameters

[Table image not available]

The -k and -t parameters are handy when sorting data that uses fields, such as the /etc/passwd file. Use the -t parameter to specify the field separator character, and use the -k parameter to specify which field to sort on. For example, to sort the password file based on numerical userid, just do this:

$ sort -t ':' -k 3 -n /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
news:x:9:13:news:/etc/news:
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
gopher:x:13:30:gopher:/var/gopher:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin

Now the data is perfectly sorted based on the third field, which is the numerical userid value.
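One detail worth knowing: -k 3 on its own makes the sort key run from field 3 to the end of the line, while -k3,3 limits the key to the third field. Here is a small sketch of the stricter form. The records are made-up lines in /etc/passwd format, not real accounts, so it can be run without touching the system file.

```shell
# Made-up records in /etc/passwd format; not real accounts.
records='games:x:12:100:games:/usr/games:/sbin/nologin
root:x:0:0:root:/root:/bin/bash
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin'

# -t ':' sets the field separator; -k3,3n sorts numerically on field 3 only.
by_uid=$(printf '%s\n' "$records" | sort -t ':' -k3,3n)
printf '%s\n' "$by_uid"

# Show just the user name and userid of each sorted record.
printf '%s\n' "$by_uid" | cut -d ':' -f 1,3
```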

The -n parameter is great for sorting numerical outputs, such as the output of the du command:

$ du -sh * | sort -nr
1008k mrtg-2.9.29.tar.gz
972k bldg1
888k fbs2.pdf
760k Printtest
680k rsync-2.6.6.tar.gz
660k code
516k fig1001.tiff
496k test
496k php-common-4.0.4pl1-6mdk.i586.rpm
448k MesaGLUT-6.5.1.tar.gz
400k plp

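Notice that the -r option also sorts the values in descending order, so you can easily see what files are taking up the most space in your directory. Keep in mind that -n compares only the leading number and ignores the unit suffix, so this works here because every size is reported in kilobytes. When du -h mixes units such as K, M, and G, use du -sk to report every size in kilobytes, or use the -h parameter of GNU sort instead of -n.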

Note
The pipe command (|) used in this example redirects the output of the du command to the sort command.
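Numeric sorting of human-readable sizes can mislead once the units differ. The sketch below uses made-up sizes in the style of du -sh output rather than real du results, and it assumes GNU sort, which provides the -h (human-numeric) parameter.

```shell
# Illustrative sizes in the style of du -sh output; the names are made up.
sizes='972K bldg1
1.2G backups
4.0K notes.txt
35M photos'

# -n compares only the leading number and ignores the unit suffix.
numeric=$(printf '%s\n' "$sizes" | LC_ALL=C sort -nr)
# -h (GNU sort) understands the K, M, and G suffixes.
human=$(printf '%s\n' "$sizes" | LC_ALL=C sort -hr)

printf 'sort -nr:\n%s\n\nsort -hr:\n%s\n' "$numeric" "$human"
```

With -nr the 972K entry lands on top because 972 is the largest leading number; with -hr the 1.2G entry comes first.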

Searching for data

Often in a large file, you must look for a specific line of data buried somewhere in the middle of the file. Instead of manually scrolling through the entire file, you can let the grep command search for you. The command line format for the grep command is:

grep [options] pattern [file]

The grep command searches either the input or the file you specify for lines that contain characters that match the specified pattern. The output from grep is the lines that contain the matching pattern.

Here are two simple examples of using the grep command with the file1 file used in the “Sorting data” section:

$ grep three file1
three
$ grep t file1
two
three
$

The first example searches the file file1 for text matching the pattern three. The grep command produces the line that contains the matching pattern. The next example searches the file file1 for the text matching the pattern t. In this case, two lines matched the specified pattern, and both are displayed.

Because of its popularity, the grep command has gone through many development changes over its lifetime, and many features have been added along the way. If you look over the man pages for the grep command, you’ll see how versatile it really is.

If you want to reverse the search (output lines that don’t match the pattern), use the -v parameter:

$ grep -v t file1
one
four
five
$

If you need to find the line numbers where the matching patterns are found, use the -n parameter:

$ grep -n t file1
2:two
3:three
$

If you just need to see a count of how many lines contain the matching pattern, use the -c parameter:

$ grep -c t file1
2
$

If you need to specify more than one matching pattern, use the -e parameter to specify each individual pattern:

$ grep -e t -e f file1
two
three
four
five
$

This example outputs lines that contain either the string t or the string f.

By default, the grep command uses basic Unix-style regular expressions to match patterns. A Unix-style regular expression uses special characters to define how to look for matching patterns.

Here’s a simple example of using a regular expression in a grep search:

$ grep '[tf]' file1
two
three
four
five
$

The square brackets in the regular expression indicate that grep should look for matches that contain either a t or an f character. Without the regular expression, grep would search for text that would match the string tf.
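Character classes combine with other regular expression characters such as the ^ and $ anchors. The following sketch recreates the five-line file1 in a temporary directory (an assumption made so the example is self-contained) and quotes each pattern so the shell passes it to grep unchanged.

```shell
# Recreate the five-line file1 from the "Sorting data" section in a temp directory.
workdir=$(mktemp -d)
printf '%s\n' one two three four five > "$workdir/file1"

# Quoting the pattern stops the shell from expanding [tf] as a filename wildcard.
starts_with=$(grep '^[tf]' "$workdir/file1")   # lines that begin with t or f
ends_with_e=$(grep 'e$' "$workdir/file1")      # lines that end with e

printf 'begin with t or f:\n%s\n\nend with e:\n%s\n' "$starts_with" "$ends_with_e"
rm -r "$workdir"
```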

The egrep command is an offshoot of grep that accepts POSIX extended regular expressions, which provide more special characters for specifying the matching pattern. The fgrep command is another version that treats the matching patterns as a list of fixed-string values, separated by newline characters. This allows you to place a list of strings in a file and then use that list in the fgrep command to search for the strings in a larger file. Both correspond to options of grep itself (grep -E and grep -F), and recent GNU grep releases print a warning that egrep and fgrep are obsolescent, so the option forms are the safer choice in new scripts.
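As a rough illustration of both variants, the sketch below uses the option forms grep -E and grep -F. The words file is a hypothetical pattern list created just for this example, and file1 is recreated in a temporary directory.

```shell
workdir=$(mktemp -d)
printf '%s\n' one two three four five > "$workdir/file1"
# Hypothetical list of fixed strings, one per line.
printf '%s\n' one five > "$workdir/words"

# Extended regular expression: alternation with | needs no backslash.
extended=$(grep -E 'tw|fo' "$workdir/file1")
# Fixed strings read from a file: each line of words is matched literally.
fixed=$(grep -F -f "$workdir/words" "$workdir/file1")

printf 'grep -E:\n%s\n\ngrep -F -f:\n%s\n' "$extended" "$fixed"
rm -r "$workdir"
```

Fixed-string matching is handy when the search strings contain characters such as . or * that would otherwise be read as regular expression syntax.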

Compressing data

If you’ve done any work in the Microsoft Windows world, no doubt you’ve used zip files. It became such a popular feature that Microsoft eventually incorporated it into the Windows operating system starting with XP. The zip utility allows you to easily compress large files (both text and executable) into smaller files that take up less space.

Linux contains several file compression utilities. Although this may sound great, it often leads to confusion and chaos when trying to download files. The following table lists the file compression utilities available for Linux.

Linux File Compression Utilities

The compress file compression utility is not often found on Linux systems. If you download a file with a .Z extension, you can usually install the compress package (called ncompress in many Linux distributions) using the software installation methods discussed in the next articles and then uncompress the file with the uncompress command. The gzip utility is the most popular compression tool used in Linux.

The gzip package is a creation of the GNU Project, in their attempt to create a free version of the original Unix compress utility. This package includes these files:

gzip for compressing files
gzcat for displaying the contents of compressed text files (installed as zcat on most Linux distributions)
gunzip for uncompressing files

These utilities work the same way as the bzip2 utilities:

$ gzip myprog
$ ls -l my*
-rwxrwxr-x 1 rich rich 2197 2007-09-13 11:29 myprog.gz
$

The gzip command compresses the file you specify on the command line. You can also specify more than one filename or even use wildcard characters to compress multiple files at once:

$ gzip my*
$ ls -l my*
-rwxr--r-- 1 rich rich 103 Sep 6 13:43 myprog.c.gz
-rwxr-xr-x 1 rich rich 5178 Sep 6 13:43 myprog.gz
-rwxr--r-- 1 rich rich 59 Sep 6 13:46 myscript.gz
-rwxr--r-- 1 rich rich 60 Sep 6 13:44 myscript2.gz
$

The gzip command compresses every file in the directory that matches the wildcard pattern.
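To see the three gzip utilities working together, here is a minimal round-trip sketch. The sample.txt file is a made-up test file, and gzip -dc is used in place of gzcat or zcat because the name of that utility varies between systems.

```shell
workdir=$(mktemp -d)
# Hypothetical sample file: the numbers 1 to 1000, one per line.
seq 1 1000 > "$workdir/sample.txt"
original_sum=$(cksum < "$workdir/sample.txt")

gzip "$workdir/sample.txt"          # replaces sample.txt with sample.txt.gz
ls -l "$workdir"

# gzip -dc writes the uncompressed data to standard output and leaves the .gz file alone.
streamed_sum=$(gzip -dc "$workdir/sample.txt.gz" | cksum)

gunzip "$workdir/sample.txt.gz"     # restores sample.txt and removes the .gz file
restored_sum=$(cksum < "$workdir/sample.txt")
rm -r "$workdir"
```

Comparing the checksums before and after is a simple way to confirm that the compression round trip left the data unchanged.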

Archiving data

Although the zip command works great for compressing and archiving data into a single file, it’s not the standard utility used in the Unix and Linux worlds. By far the most popular archiving tool used in Unix and Linux is the tar command.

The tar command was originally used to write files to a tape device for archiving. However, it can also write the output to a file, which has become a popular way to archive data in Linux.
The following is the format of the tar command:

tar function [options] object1 object2 …

The function parameter defines what the tar command should do, as shown in the table below.

The tar Command Functions

Each function uses options to define a specific behavior for the tar archive file. The following table lists the common options that you can use with the tar command.

The tar Command Options

These options are usually combined to create the following scenarios. First, you want to create an archive file using this command:

tar -cvf test.tar test/ test2/

The above command creates an archive file called test.tar containing the contents of both the test directory and the test2 directory. Next, this command:

tar -tf test.tar

lists (but doesn’t extract) the contents of the tar file test.tar. Finally, this command:

tar -xvf test.tar

extracts the contents of the tar file test.tar. If the tar file was created from a directory structure, the entire directory structure is re-created starting at the current directory.

As you can see, using the tar command is a simple way to create archive files of entire directory structures. This is a common method for distributing source code files for open source applications in the Linux world.
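The create, list, and extract steps can be tried safely in a scratch directory. This is a sketch under a few assumptions: the directory names and file contents are made up, and the -C option (which tells tar to change to a directory before working) is used so the example never depends on the current directory.

```shell
workdir=$(mktemp -d)
# Hypothetical directories standing in for test/ and test2/.
mkdir -p "$workdir/test" "$workdir/test2/sub" "$workdir/restore"
echo 'alpha' > "$workdir/test/a.txt"
echo 'beta' > "$workdir/test2/sub/b.txt"

# Create the archive; -C makes tar resolve test and test2 relative to workdir.
tar -cf "$workdir/test.tar" -C "$workdir" test test2

# List the contents without extracting anything.
listing=$(tar -tf "$workdir/test.tar")
printf '%s\n' "$listing"

# Extract under restore/; the directory structure is re-created there.
tar -xf "$workdir/test.tar" -C "$workdir/restore"
restored=$(cat "$workdir/restore/test2/sub/b.txt")
rm -r "$workdir"
```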

Tip
If you download open source software, often you see filenames that end in .tgz.
These are gzipped tar files, which can be extracted using the command tar -zxvf filename.tgz.
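Creating a .tgz file works the same way in reverse. The sketch below assumes a made-up project directory and shows that the -z option simply runs the archive through gzip, so piping gzip output into tar gives the same listing.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/project"
echo 'int main(void) { return 0; }' > "$workdir/project/main.c"

# -z filters the archive through gzip as it is written.
tar -czf "$workdir/project.tgz" -C "$workdir" project
# -z is also used when listing or extracting the compressed archive.
tgz_listing=$(tar -tzf "$workdir/project.tgz")
printf '%s\n' "$tgz_listing"

# The same file is an ordinary gzip stream wrapped around a tar archive.
gzip_listing=$(gzip -dc "$workdir/project.tgz" | tar -tf -)
rm -r "$workdir"
```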

This article discussed some of the more advanced bash commands used by Linux system administrators and programmers. The ps and top commands are vital in determining the status of the system, allowing you to see what applications are running and how many resources they are consuming.

In this day of removable media, another popular topic for system administrators is mounting storage devices. The mount command allows you to mount a physical storage device into the Linux virtual directory structure. To remove the device, use the umount command.

Finally, the article discussed various utilities used for handling data. The sort utility easily sorts large data files to help you organize data, and the grep utility allows you to quickly scan through large data files looking for specific information. Several file compression utilities are available in Linux, including gzip and zip. Each one allows you to compress large files to help save space on your filesystem. The Linux tar utility is a popular way to archive directory structures into a single file that can easily be ported to another system.

The next article discusses Linux shells and how to interact with them. Linux allows you to communicate between shells, which can come in handy when creating subshells in your scripts.
