
Working with Data Files On Linux

When you have a large amount of data, handling the information and making it useful can be difficult. As you saw with the du command in the previous section, it’s easy to get data overload when working with system commands.


The Linux system provides several command line tools to help you manage large amounts of data. This section covers the basic commands that every system administrator — as well as any everyday Linux user — should know how to use to make their lives easier.

Sorting data

The sort command is a popular function that comes in handy when working with large amounts of data. The sort command does what it says: It sorts data.

By default, the sort command sorts the data lines in a text file using standard sorting rules for the language you specify as the default for the session.

  $ cat file1
  one
  two
  three
  four
  five
  $ sort file1
  five
  four
  one
  three
  two
  $

It’s pretty simple, but things aren’t always as easy as they appear. Look at this example:

  $ cat file2
  1
  2
  100
  45
  3
  10
  145
  75
  $ sort file2
  1
  10
  100
  145
  2
  3
  45
  75
  $

If you were expecting the numbers to sort in numerical order, you were disappointed. By default, the sort command interprets numbers as characters and performs a standard character sort, producing output that might not be what you want. To solve this problem, use the -n parameter, which tells the sort command to recognize numbers as numbers instead of characters and to sort them based on their numerical values:

  $ sort -n file2
  1
  2
  3
  10
  45
  75
  100
  145
  $

Now, that’s much better! Another common parameter that’s used is -M, the month sort. Linux log files usually contain a timestamp at the beginning of the line to indicate when the event occurred:

Sep 13 07:10:09 testbox smartd[2718]: Device: /dev/sda, opened

If you sort a file that uses timestamp dates using the default sort, you get something like this:

  $ sort file3
  Apr
  Aug
  Dec
  Feb
  Jan
  Jul
  Jun
  Mar
  May
  Nov
  Oct
  Sep
  $

It’s not exactly what you wanted. If you use the -M parameter, the sort command recognizes the three-character month nomenclature and sorts appropriately:

  $ sort -M file3
  Jan
  Feb
  Mar
  Apr
  May
  Jun
  Jul
  Aug
  Sep
  Oct
  Nov
  Dec
  $
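The month sort becomes more useful on real log lines when it is combined with a second key for the day. The following is a minimal sketch, not taken from the original text: the log lines are made up, and it assumes a sort that supports per-key type letters (as GNU sort does). The locale is forced to C because month abbreviations are locale-dependent.

```shell
# Month names depend on the locale; C gives the English Jan..Dec.
LC_ALL=C
export LC_ALL

# Made-up syslog-style lines, deliberately out of order.
log_lines='Sep 13 07:10:09 testbox smartd[2718]: Device: /dev/sda, opened
Jan  5 11:02:44 testbox sshd[901]: session opened
Sep  2 23:59:01 testbox cron[77]: job started
Feb 20 04:15:30 testbox kernel: eth0 link up'

# -k1,1M sorts field 1 as a month name;
# -k2,2n breaks ties within a month on the numeric day in field 2.
sorted_log=$(printf '%s\n' "$log_lines" | sort -k1,1M -k2,2n)
printf '%s\n' "$sorted_log"
```

The lines come out in calendar order: January, February, then the two September entries with the 2nd ahead of the 13th.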

The following table shows other handy sort parameters you can use.

The sort Command Parameters

The -k and -t parameters are handy when sorting data that uses fields, such as the /etc/passwd file. Use the -t parameter to specify the field separator character, and use the -k parameter to specify which field to sort on. For example, to sort the password file based on numerical userid, just do this:

  $ sort -t ':' -k 3 -n /etc/passwd
  root:x:0:0:root:/root:/bin/bash
  bin:x:1:1:bin:/bin:/sbin/nologin
  daemon:x:2:2:daemon:/sbin:/sbin/nologin
  adm:x:3:4:adm:/var/adm:/sbin/nologin
  lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
  sync:x:5:0:sync:/sbin:/bin/sync
  shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
  halt:x:7:0:halt:/sbin:/sbin/halt
  mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
  news:x:9:13:news:/etc/news:
  uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
  operator:x:11:0:operator:/root:/sbin/nologin
  games:x:12:100:games:/usr/games:/sbin/nologin
  gopher:x:13:30:gopher:/var/gopher:/sbin/nologin
  ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin

Now the data is perfectly sorted based on the third field, which is the numerical userid value.
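One detail worth knowing is that -k 3 on its own defines a key that starts at field 3 and runs to the end of the line. Writing the key as 3,3 limits it to the third field only. The sketch below shows that form on a few hypothetical passwd-style records (the records are sample data, not read from a real /etc/passwd).

```shell
# Hypothetical passwd-style records, deliberately out of order.
accounts='games:x:12:100:games:/usr/games:/sbin/nologin
root:x:0:0:root:/root:/bin/bash
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
bin:x:1:1:bin:/bin:/sbin/nologin'

# -t ':' sets the field separator.
# -k 3,3n sorts numerically on the third field and nothing after it.
by_uid=$(printf '%s\n' "$accounts" | sort -t ':' -k 3,3n)
printf '%s\n' "$by_uid"
```

The records are printed in userid order: root (0), bin (1), mail (8), games (12).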

The -n parameter is great for sorting numerical outputs, such as the output of the du command:

  $ du -sh * | sort -nr
  1008k mrtg-2.9.29.tar.gz
  972k bldg1
  888k fbs2.pdf
  760k Printtest
  680k rsync-2.6.6.tar.gz
  660k code
  516k fig1001.tiff
  496k test
  496k php-common-4.0.4pl1-6mdk.i586.rpm
  448k MesaGLUT-6.5.1.tar.gz
  400k plp

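Notice that the -r option sorts the values in descending order, so you can easily see which files are taking up the most space in your directory. This works here because every size is reported in the same unit: -n compares only the leading number and ignores a suffix such as k, so output that mixes units (K, M, G) needs GNU sort's -h option instead.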

Note
The pipe command (|) used in this example redirects the output of the du command to
the sort command.
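To illustrate how numeric sorting behaves on human-readable sizes, here is a minimal sketch. It assumes GNU sort, since the -h option is not part of POSIX, and it uses made-up sizes and file names in place of real du output.

```shell
# Made-up sizes in mixed units, tab-separated like du -sh output.
sizes=$(printf '2M\tvideo.mp4\n900K\tnotes.txt\n1G\tbackup.img')

# -n reads only the leading digits, so 900K ranks above 2M and 1G.
numeric_order=$(printf '%s\n' "$sizes" | sort -nr)

# -h (GNU sort) takes the K/M/G suffix into account.
human_order=$(printf '%s\n' "$sizes" | sort -hr)

printf 'sort -nr:\n%s\n\nsort -hr:\n%s\n' "$numeric_order" "$human_order"
```

Another option is to drop -h from du so that every size is a plain number of blocks, which -n then sorts correctly.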

Searching for data

Often in a large file, you must look for a specific line of data buried somewhere in the middle of the file. Instead of manually scrolling through the entire file, you can let the grep command search for you. The command line format for the grep command is:

grep [options] pattern [file]

The grep command searches either the input or the file you specify for lines that contain characters that match the specified pattern. The output from grep is the lines that contain the matching pattern.

Here are two simple examples of using the grep command with the file1 file used in the “Sorting data” section:

  $ grep three file1
  three
  $ grep t file1
  two
  three
  $

The first example searches the file file1 for text matching the pattern three. The grep command produces the line that contains the matching pattern. The next example searches the file file1 for the text matching the pattern t. In this case, two lines matched the specified pattern, and both are displayed.

Because of its popularity, the grep command has gone through many development changes over its lifetime and has gained many features along the way. If you look over the man pages for the grep command, you’ll see how versatile it really is.

If you want to reverse the search (output lines that don’t match the pattern), use the -v parameter:

  $ grep -v t file1
  one
  four
  five
  $

If you need to find the line numbers where the matching patterns are found, use the -n parameter:

  $ grep -n t file1
  2:two
  3:three
  $

If you just need to see a count of how many lines contain the matching pattern, use the -c parameter:

  $ grep -c t file1
  2
  $

If you need to specify more than one matching pattern, use the -e parameter to specify each individual pattern:

  $ grep -e t -e f file1
  two
  three
  four
  five
  $

This example outputs lines that contain either the string t or the string f.

By default, the grep command uses basic Unix-style regular expressions to match patterns. A Unix-style regular expression uses special characters to define how to look for matching patterns.

Here’s a simple example of using a regular expression in a grep search. The pattern is quoted so that the shell passes it to grep unchanged instead of treating the brackets as a filename wildcard:

  $ grep '[tf]' file1
  two
  three
  four
  five
  $

The square brackets in the regular expression indicate that grep should look for matches that contain either a t or an f character. Without the regular expression, grep would search for text that would match the string tf.
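A few more basic regular expression characters can be tried in the same way. The sketch below is an illustration only: it rebuilds the five-line file1 in a temporary directory (the workdir name is an assumption) and applies the start-of-line anchor, the end-of-line anchor, and the any-character dot.

```shell
# Recreate the five-line file1 from the "Sorting data" section.
workdir=$(mktemp -d)
printf '%s\n' one two three four five > "$workdir/file1"

# ^ anchors the pattern to the start of the line.
starts_with_t=$(grep '^t' "$workdir/file1")

# $ anchors the pattern to the end of the line.
ends_with_e=$(grep 'e$' "$workdir/file1")

# . matches any single character: f, two characters, then r.
f_two_r=$(grep '^f..r$' "$workdir/file1")

printf 'starts with t:\n%s\n' "$starts_with_t"
printf 'ends with e:\n%s\n' "$ends_with_e"
printf 'f..r:\n%s\n' "$f_two_r"

rm -rf "$workdir"
```

The first search matches two and three, the second matches one, three, and five, and the third matches only four.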

The egrep command is an offshoot of grep that lets you specify POSIX extended regular expressions, which provide more special characters for specifying the matching pattern. The fgrep command is another version that lets you specify matching patterns as a list of fixed-string values, separated by newline characters. This allows you to place a list of strings in a file and then use that list in the fgrep command to search for the strings in a larger file. On current systems, grep -E and grep -F are the standard equivalents of egrep and fgrep.
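Both variants can be sketched with the standard grep -E and grep -F spellings. The file names here (file1 and patterns in a temporary directory) are assumptions for the example, and the -f option is what reads the list of patterns from a file.

```shell
workdir=$(mktemp -d)
printf '%s\n' one two three four five > "$workdir/file1"

# Extended regular expression: | separates alternatives.
extended_matches=$(grep -E 'one|five' "$workdir/file1")

# Fixed strings, one per line, read from a pattern file with -f.
printf '%s\n' two four > "$workdir/patterns"
fixed_matches=$(grep -F -f "$workdir/patterns" "$workdir/file1")

printf 'grep -E:\n%s\n' "$extended_matches"
printf 'grep -F -f:\n%s\n' "$fixed_matches"

rm -rf "$workdir"
```

Matching lines are reported in the order they appear in the searched file, not in the order of the pattern list.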

Compressing data

If you’ve done any work in the Microsoft Windows world, no doubt you’ve used zip files. It became such a popular feature that Microsoft eventually incorporated it into the Windows operating system starting with XP. The zip utility allows you to easily compress large files (both text and executable) into smaller files that take up less space.

Linux contains several file compression utilities. Although this may sound great, it often leads to confusion and chaos when trying to download files. The following table lists the file compression utilities available for Linux.

Linux File Compression Utilities


The compress file compression utility is not often found on Linux systems. If you download a file with a .Z extension, you can usually install the compress package (called ncompress in many Linux distributions) using the software installation methods discussed in the next articles and then uncompress the file with the uncompress command. The gzip utility is the most popular compression tool used in Linux.

The gzip package is a creation of the GNU Project, in their attempt to create a free version of the original Unix compress utility. This package includes these files:


  • gzip for compressing files 
  • gzcat for displaying the contents of compressed text files (installed as zcat on most Linux distributions)
  • gunzip for uncompressing files


These utilities work the same way as their bzip2 counterparts (bzip2, bzcat, and bunzip2):

  $ gzip myprog
  $ ls -l my*
  -rwxrwxr-x 1 rich rich 2197 2007-09-13 11:29 myprog.gz
  $

The gzip command compresses the file you specify on the command line. You can also specify more than one filename or even use wildcard characters to compress multiple files at once:

  $ gzip my*
  $ ls -l my*
  -rwxr--r-- 1 rich rich 103 Sep 6 13:43 myprog.c.gz
  -rwxr-xr-x 1 rich rich 5178 Sep 6 13:43 myprog.gz
  -rwxr--r-- 1 rich rich 59 Sep 6 13:46 myscript.gz
  -rwxr--r-- 1 rich rich 60 Sep 6 13:44 myscript2.gz
  $

The gzip command compresses every file in the directory that matches the wildcard pattern.
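The full round trip can be sketched as follows. This is an illustration under a few assumptions: the sample file and the temporary directory are made up, and gunzip -c is used to view the compressed data because it behaves the same wherever gzip is installed (it does the job described for gzcat above).

```shell
workdir=$(mktemp -d)
printf 'line %s of sample data\n' 1 2 3 4 5 > "$workdir/sample.txt"
original=$(cat "$workdir/sample.txt")

# gzip replaces sample.txt with sample.txt.gz.
gzip "$workdir/sample.txt"
after_gzip=$(ls "$workdir")

# gunzip -c writes the uncompressed data to standard output
# and leaves the .gz file in place.
peeked=$(gunzip -c "$workdir/sample.txt.gz")

# gunzip restores sample.txt and removes the .gz file.
gunzip "$workdir/sample.txt.gz"
restored=$(cat "$workdir/sample.txt")

printf 'after gzip: %s\n' "$after_gzip"
rm -rf "$workdir"
```

Note that gzip and gunzip each replace their input file. To keep the original alongside the compressed copy, redirect the output of gzip -c to a new file instead.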

Archiving data

Although the zip command works great for compressing and archiving data into a single file, it’s not the standard utility used in the Unix and Linux worlds. By far the most popular archiving tool used in Unix and Linux is the tar command.

The tar command was originally used to write files to a tape device for archiving. However, it can also write the output to a file, which has become a popular way to archive data in Linux.
The following is the format of the tar command:

tar function [options] object1 object2 …

The function parameter defines what the tar command should do, as shown in the table below.

The tar Command Functions


Each function uses options to define a specific behavior for the tar archive file. The following table lists the common options that you can use with the tar command.

The tar Command Options


These options are usually combined to create the following scenarios. First, you want to create an archive file using this command:

tar -cvf test.tar test/ test2/

The above command creates an archive file called test.tar containing the contents of
both the test directory and the test2 directory. Next, this command:

tar -tf test.tar

lists (but doesn’t extract) the contents of the tar file test.tar. Finally, this command:

tar -xvf test.tar

extracts the contents of the tar file test.tar. If the tar file was created from a directory structure, the entire directory structure is re-created starting at the current directory.
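The three steps above can be tried safely in a scratch directory. The sketch below is one possible arrangement, not the only one: the directory contents are made up, and the -C option (change to a directory before acting) is used so that the script never changes its own working directory.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/test" "$workdir/test2" "$workdir/restore"
echo 'alpha' > "$workdir/test/a.txt"
echo 'beta' > "$workdir/test2/b.txt"

# Create: archive both directories, with paths relative to workdir.
tar -cf "$workdir/test.tar" -C "$workdir" test test2

# List: show the archive contents without extracting anything.
listing=$(tar -tf "$workdir/test.tar")
printf '%s\n' "$listing"

# Extract: re-create the directory structure under restore/.
tar -xf "$workdir/test.tar" -C "$workdir/restore"
restored=$(cat "$workdir/restore/test/a.txt" "$workdir/restore/test2/b.txt")

rm -rf "$workdir"
```

Leaving out -v keeps the commands quiet. Add it back to see each file name as tar processes it.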

As you can see, using the tar command is a simple way to create archive files of entire directory structures. This is a common method for distributing source code files for open source applications in the Linux world.

Tip
If you download open source software, often you see filenames that end in .tgz.
These are gzipped tar files, which can be extracted using the command tar -zxvf filename.tgz.
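To see what the -z option is doing, here is a small sketch under the same kind of assumptions as before (a made-up project directory inside a temporary workdir). It creates a .tgz file with tar -z, then extracts it by running gunzip and tar as separate steps, which shows that the file is an ordinary tar archive passed through gzip.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/project" "$workdir/out"
echo 'int main(void) { return 0; }' > "$workdir/project/main.c"

# -z filters the archive through gzip as it is written.
tar -zcf "$workdir/project.tgz" -C "$workdir" project

# -z is needed again to list the compressed archive.
tgz_listing=$(tar -ztf "$workdir/project.tgz")
printf '%s\n' "$tgz_listing"

# Decompress with gunzip, then let tar read the archive
# from standard input (-f -).
gunzip -c "$workdir/project.tgz" | tar -xf - -C "$workdir/out"
extracted=$(cat "$workdir/out/project/main.c")

rm -rf "$workdir"
```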


This article discussed some of the more advanced bash commands used by Linux system administrators and programmers. The ps and top commands are vital in determining the status of the system, allowing you to see what applications are running and how many resources they are consuming.

In this day of removable media, another popular topic for system administrators is mounting storage devices. The mount command allows you to mount a physical storage device into the Linux virtual directory structure. To remove the device, use the umount command.

Finally, the article discussed various utilities used for handling data. The sort utility easily sorts large data files to help you organize data, and the grep utility allows you to quickly scan through large data files looking for specific information. Several file compression utilities are available in Linux, including gzip and zip. Each one allows you to compress large files to help save space on your filesystem. The Linux tar utility is a popular way to archive directory structures into a single file that can easily be ported to another system.

The next article discusses Linux shells and how to interact with them. Linux allows you to communicate between shells, which can come in handy when creating subshells in your scripts.

