Dealing with big datasets
If the dataset you want to load is too big to fit in memory, you can deal with it by using a batch machine learning algorithm, which works with only a part of the data at a time. Using a batch approach also makes sense if you just need a sample of the data (let's say that you want to take a peek at it). Thanks to Python, you can actually load the data in chunks. This operation is also called data streaming, since the dataset flows into a DataFrame or some other data structure as a continuous flow. In all the previous cases, by contrast, the dataset was fully loaded into memory in a single step.
With pandas, there are two ways to chunk and load a file. The first way is by loading the dataset in chunks of the same size; each chunk is a piece of the dataset that contains all the columns and a limited number of lines, no more than the number you set in the function call (the chunksize parameter). Note that the output of the read_csv function, in this case, is not a pandas DataFrame but an iterator-like object. In fact, to get the results in memory, you need to iterate over that object:
In: import pandas as pd
    iris_chunks = pd.read_csv(iris_filename, header=None,
                              names=['C1', 'C2', 'C3', 'C4', 'C5'],
                              chunksize=10)
    for chunk in iris_chunks:
        print('Shape:', chunk.shape)
        print(chunk, '\n')
Out: Shape: (10, 5)
         C1   C2   C3   C4           C5
     0  5.1  3.5  1.4  0.2  Iris-setosa
     1  4.9  3.0  1.4  0.2  Iris-setosa
     2  4.7  3.2  1.3  0.2  Iris-setosa
     3  4.6  3.1  1.5  0.2  Iris-setosa
     4  5.0  3.6  1.4  0.2  Iris-setosa
     5  5.4  3.9  1.7  0.4  Iris-setosa
     6  4.6  3.4  1.4  0.3  Iris-setosa
     7  5.0  3.4  1.5  0.2  Iris-setosa
     8  4.4  2.9  1.4  0.2  Iris-setosa
     9  4.9  3.1  1.5  0.1  Iris-setosa
     ...
There will be 14 other pieces like this one, each of them of shape (10, 5). The other method to load a big dataset is by specifically asking for an iterator over it. In this case, you can dynamically decide the length (that is, how many lines to get) of each piece of the pandas DataFrame:
In: iris_iterator = pd.read_csv(iris_filename, header=None,
                                names=['C1', 'C2', 'C3', 'C4', 'C5'],
                                iterator=True)
In: print(iris_iterator.get_chunk(10).shape)
Out: (10, 5)
In: print(iris_iterator.get_chunk(20).shape)
Out: (20, 5)
In: piece = iris_iterator.get_chunk(2)
    piece
The output represents just a two-row chunk of the original dataset.
In this example, we first defined the iterator. Next, we retrieved a piece of data containing 10 lines. We then obtained 20 further rows, and finally the two rows that are printed at the end.
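If, instead of inspecting the chunks, you need an aggregate computed over the whole file, the pieces can be combined on the fly. The following is a minimal sketch (not part of the original example) that computes the mean of the four numeric columns while never keeping more than ten rows in memory:
In: # a minimal sketch: aggregate over chunks without loading the whole file
    running_sum, running_count = None, 0
    for chunk in pd.read_csv(iris_filename, header=None,
                             names=['C1', 'C2', 'C3', 'C4', 'C5'],
                             chunksize=10):
        partial = chunk[['C1', 'C2', 'C3', 'C4']].sum()
        running_sum = partial if running_sum is None else running_sum + partial
        running_count += len(chunk)
    print(running_sum / running_count)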
Besides pandas, you can also use the csv package, which offers two functions to iterate over small chunks of data from files: the reader and DictReader functions. Let's illustrate such functions by importing the csv package:
In: import csv
The reader function reads the data from disk into Python lists; DictReader instead transforms the data into dictionaries. Both functions work by iterating over the rows of the file being read. The reader returns exactly what it reads, stripped of the carriage return and split into a list by the separator (which is a comma by default, but this can be modified). DictReader maps the list's data into a dictionary whose keys are defined by the first row (if a header is present) or by the fieldnames parameter (a list of strings that reports the column names).
Reading the data as native Python lists is not a limitation. For instance, it makes it easier to speed up the code using a fast Python implementation, such as PyPy. Moreover, we can always convert lists into NumPy ndarrays (a data structure that we are going to introduce soon). And by reading the data into JSON-style dictionaries, it is quite easy to get a DataFrame.
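To make those two conversions concrete, here is a minimal sketch (not part of the original example, and assuming the same iris_filename used throughout this chapter): the first few rows read by reader become a NumPy ndarray, while the same rows read by DictReader become a pandas DataFrame:
In: # a minimal sketch: a few rows as an ndarray via reader,
    # and as a DataFrame via DictReader
    import csv
    from itertools import islice
    import numpy as np
    import pandas as pd
    fields = ['sepal_length', 'sepal_width',
              'petal_length', 'petal_width', 'target']
    with open(iris_filename, 'rt') as data_stream:
        rows = list(islice(csv.reader(data_stream), 3))
    print(np.array(rows))            # a list of lists becomes an ndarray
    with open(iris_filename, 'rt') as data_stream:
        records = list(islice(csv.DictReader(data_stream,
                                             fieldnames=fields), 3))
    print(pd.DataFrame(records))     # a list of dicts becomes a DataFrame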
Here is a simple example that uses such functionalities from the CSV package.
Let's pretend that our datasets-uci-iris.csv file, which was downloaded from http://mldata.org/, is a huge file that we cannot fully load into memory (actually, this is just pretense; as we saw at the beginning of this chapter, the file is made up of just 150 examples, and the CSV lacks a header row).
Therefore, our only choice is to load it in chunks. First, let's conduct an experiment:
In: with open(iris_filename, 'rt') as data_stream:
        # 'rt' mode
        for n, row in enumerate(csv.DictReader(data_stream,
                                fieldnames=['sepal_length', 'sepal_width',
                                            'petal_length', 'petal_width',
                                            'target'],
                                dialect='excel')):
            if n == 0:
                print(n, row)
            else:
                break
Out: 0 OrderedDict([('sepal_length', '5.1'), ('sepal_width', '3.5'),
     ('petal_length', '1.4'), ('petal_width', '0.2'),
     ('target', 'Iris-setosa')])
What does the preceding code accomplish? First, it opens a read-text connection to the file and aliases it as data_stream. Using the with command ensures that the file is closed after the commands in the indented block have been completely executed.
Then, it iterates (for...in) over and enumerates the rows produced by a csv.DictReader call, which wraps the flow of data from data_stream. Since we don't have a header row in the file, fieldnames provides information about the field names. dialect just specifies that we are reading standard comma-separated CSV (we'll provide some hints on how to modify this parameter later).
Inside the iteration, if the row being read is the first one, then it is printed. Otherwise, the loop is stopped by a break command. The print command presents us with row number 0 and a dictionary. Therefore, you can recall every piece of data in the row by just calling the keys bearing the variables' names.
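The following is a minimal sketch (not part of the original example) that reads just the first row and retrieves its fields by name, casting a numeric one to float:
In: # a minimal sketch: retrieve fields of a dictionary row by their keys
    with open(iris_filename, 'rt') as data_stream:
        first_row = next(csv.DictReader(data_stream,
                                        fieldnames=['sepal_length',
                                                    'sepal_width',
                                                    'petal_length',
                                                    'petal_width',
                                                    'target'],
                                        dialect='excel'))
    print(float(first_row['sepal_length']), first_row['target'])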
Similarly, we can make the same code work for the csv.reader command, as follows:
In: with open(iris_filename, 'rt') as data_stream:
        for n, row in enumerate(csv.reader(data_stream, dialect='excel')):
            if n == 0:
                print(row)
            else:
                break
Out: ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
Here, the code is even more straightforward and the output is simpler, providing a list that contains the row values in a sequence.
At this point, based on this second piece of code, we can create a generator callable from a for-loop iteration. This retrieves the data on the fly from the file in blocks of the size defined by the batch parameter of the function:
In: def batch_read(filename, batch=5):
        # open the data stream
        with open(filename, 'rt') as data_stream:
            # reset the batch
            batch_output = list()
            # iterate over the file
            for n, row in enumerate(csv.reader(data_stream, dialect='excel')):
                # if the batch is of the right size
                if n > 0 and n % batch == 0:
                    # yield back the batch as an ndarray
                    yield np.array(batch_output)
                    # reset the batch and restart
                    batch_output = list()
                # in any case, add the current row to the batch
                batch_output.append(row)
            # when the loop is over, yield what's left
            yield np.array(batch_output)
As in the previous example, the data is drawn out thanks to the csv.reader function, wrapped by the enumerate function so that each extracted list of data is accompanied by its example number (which starts from zero). Each data list is appended to a batch list; whenever the batch reaches the requested size, it is handed back to the main program by the yield statement and a new batch is started. This process is repeated until the entire file has been read and returned in batches:
In: import numpy as np
    for batch_input in batch_read(iris_filename, batch=3):
        print(batch_input)
        break
Out: [['5.1' '3.5' '1.4' '0.2' 'Iris-setosa']
      ['4.9' '3.0' '1.4' '0.2' 'Iris-setosa']
      ['4.7' '3.2' '1.3' '0.2' 'Iris-setosa']]
Such a function can provide the basic functionality for learning with stochastic gradient descent, as will be presented in Chapter 4, Machine Learning, where we will come back to this piece of code and expand it with some more advanced examples.
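Just to give a taste of how those batches might be consumed (a rough sketch under the assumption that scikit-learn is available, and not the recipe that will actually be developed in Chapter 4), an out-of-core learner exposing a partial_fit method could be trained one batch at a time:
In: # a rough sketch (assumes scikit-learn; not the Chapter 4 recipe):
    # train an out-of-core classifier one batch at a time via partial_fit
    from sklearn.linear_model import SGDClassifier
    learner = SGDClassifier(random_state=0)
    # the three iris labels (assumed to match the strings in the file)
    classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    for batch in batch_read(iris_filename, batch=10):
        X = batch[:, :4].astype(float)   # feature values arrive as strings
        y = batch[:, 4]                  # the class label is the last column
        learner.partial_fit(X, y, classes=classes)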