In this tutorial we will learn how to use Pandas sample to randomly select rows and columns from a Pandas dataframe. There are several reasons for randomly sampling our data; for instance, we may have a very large dataset and want to build our models on a smaller sample of the data. Other examples are bootstrapping and cross-validation. Here we will learn how to select rows at random, set a random seed, sample by group, sample using weights, and sample with conditions, among other useful things.
How to Take a Random Sample of Rows
In this section we are going to learn how to take a random sample of a Pandas dataframe. We are going to use an Excel file that can be downloaded here. First, we start by importing Pandas and we use read_excel to load the Excel file into a dataframe:
import pandas as pd
df = pd.read_excel('MLBPlayerSalaries.xlsx')
df.head()
- Read the Pandas Excel Tutorial to learn more about loading Excel files into Pandas dataframes.
Using the shape attribute (df.shape) we can see how many rows and columns there are: 19543 rows and 5 columns. We will now continue by using Pandas sample. Called without any parameters, as in df.sample(), the default behavior is to return one randomly selected row.
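A minimal sketch of the default behavior, using a small made-up dataframe as a stand-in for the salary data (the column names mirror the tutorial, the values are invented for illustration):

```python
import pandas as pd

# Made-up stand-in for the MLB salary data (values are illustrative only)
df = pd.DataFrame({
    "Year": [1998, 1999, 2000, 2001, 2002],
    "Player": ["A", "B", "C", "D", "E"],
    "Salary": [300000, 410000, 520000, 380000, 900000],
})

# With no arguments, sample returns a single randomly chosen row
one_row = df.sample()
print(one_row)
print(one_row.shape)  # (1, 3)
```

Which row comes back will differ from run to run, since no seed is set here.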
In most cases we want to take random samples of more than one row. Thus, in the next Pandas sample example we are going to take a random sample of size 200. We use the parameter n to accomplish this, as in df.sample(n=200).
Chaining the head method, as in df.sample(n=200).head(10), prints only the first 10 of the randomly sampled rows. In most cases, we may want to save the randomly sampled rows. To accomplish this, we will create a new dataframe:
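As a rough sketch of sampling with n, here is the same idea on synthetic data (the 1,000-row dataframe below is an assumption made for illustration, not the salary file):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 1,000 rows with made-up years and salaries
df = pd.DataFrame({
    "Year": np.arange(1000) % 30 + 1988,
    "Salary": np.arange(1000) * 1000 + 100000,
})

# Draw 200 distinct rows at random, then show only the first 10 of them
sample_200 = df.sample(n=200)
print(sample_200.head(10))
```

Because sampling is done without replacement by default, no row appears twice in sample_200.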
df200 = df.sample(n=200)
df200.shape
# Output: (200, 5)
In the code above we created a new dataframe, called df200, with 200 randomly selected rows. Again, we used the method shape to see how many rows (and columns) we now have.
Random Sampling Rows using NumPy Choice
It’s of course very easy and convenient to use the Pandas sample method to take a random sample of rows. Note, however, that it’s also possible to use NumPy and random.choice. In the example below we will get a comparable sample of 200 rows by using np.random.choice. Because np.random.choice samples with replacement by default, we set replace=False to match the default behavior of Pandas sample.
As usual when working with Python modules, we start by importing NumPy. After this is done we will continue by creating an array of indices (rows) and then use the Pandas loc method to select the rows based on the random indices:
import numpy as np
rows = np.random.choice(df.index.values, 200, replace=False)
df200 = df.loc[rows]
df200.head()
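A possible variation, sketched here under the assumption that NumPy 1.17 or later is installed, is to use a seeded Generator so that the NumPy-based sample is repeatable. The dataframe and the seed value are made up for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with made-up salaries
df = pd.DataFrame({"Salary": np.arange(1000) * 1000 + 100000})

# A seeded Generator makes the draw repeatable; the seed 1111 is arbitrary
rng = np.random.default_rng(1111)

# replace=False guarantees that no row label is picked twice
rows = rng.choice(df.index.to_numpy(), size=200, replace=False)
df200 = df.loc[rows]
print(df200.shape)  # (200, 1)
```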
How to Sample Pandas Dataframe using frac
Now that we have used NumPy we will continue this Pandas dataframe sample tutorial by using sample’s frac parameter. This parameter specifies the fraction (percentage) of rows to return in the random sample. This means that setting frac to 1 (frac=1) will return all rows, in random order. That is, if we just want to shuffle the dataframe it can be done using sample and the parameter frac.
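The shuffle described here can be sketched as follows, on a tiny made-up dataframe. The reset_index step is optional and is only included to show how to get a fresh 0..n-1 index after shuffling:

```python
import pandas as pd

# Tiny made-up dataframe so the shuffled order is easy to inspect
df = pd.DataFrame({"Player": list("ABCDEFGH"), "Salary": range(8)})

# frac=1 returns every row exactly once, in random order
shuffled = df.sample(frac=1)
print(shuffled)

# Optionally discard the old row labels so the index runs 0..n-1 again
shuffled_reset = shuffled.reset_index(drop=True)
print(shuffled_reset.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```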
After shuffling, the order of the rows is random. We can use shape, again, to see that we have the same number of rows:
df.sample(frac=1).shape
# Output: (19543, 5)
As expected there are as many rows and columns as in the original dataframe.
How to Shuffle Pandas Dataframe using Numpy
Here we will use another method to shuffle the dataframe. In the example code below we will use the Python module NumPy again. We have to use reindex (Pandas) and random.permutation (NumPy). More specifically, we will permute the dataframe using the indices:
df_shuffled = df.reindex(np.random.permutation(df.index))
We can also use frac to get 200 randomly selected rows. Before doing this we need to calculate what fraction 200 is of our total number of rows. In this case it’s 200 / 19543, or approximately 1% of the data, and using the code below will also give us 200 random rows from the dataframe.
df200 = df.sample(frac=.01023)
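Rather than typing the fraction by hand, it can be computed from the number of rows. The sketch below assumes a synthetic 1,000-row dataframe, and target_rows is a name introduced only for this illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data
df = pd.DataFrame({"Salary": np.arange(1000)})

# Fraction of the dataframe that corresponds to the desired number of rows
target_rows = 200
frac = target_rows / len(df)  # 0.2 for this 1,000-row stand-in

df200 = df.sample(frac=frac)
print(len(df200))
```

When an exact row count matters, passing n directly is usually the simpler choice.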
Note, the frac parameter cannot be used together with n. We will get a ValueError that states that we cannot enter a value for both frac and n.
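A short sketch of what happens when both parameters are given (the dataframe is made up; the exact wording of the error message may vary between Pandas versions):

```python
import pandas as pd

df = pd.DataFrame({"Salary": range(100)})

error_message = None
try:
    # n and frac are mutually exclusive, so this call is rejected
    df.sample(n=10, frac=0.1)
except ValueError as err:
    error_message = str(err)

print(error_message)
```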
Pandas Sample with Replacement
We can also, of course, sample with replacement. By default Pandas sample will sample without replacement, meaning that each row can be selected at most once. In some cases we have to sample with replacement (e.g., when bootstrapping, or when we want more rows than the dataframe contains). If we want to sample with replacement we should use the replace parameter:
df5 = df.sample(n=5, replace=True)
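To see what replacement changes, here is a small sketch on a made-up three-row dataframe, where we ask for more rows than exist:

```python
import pandas as pd

df = pd.DataFrame({"Player": ["A", "B", "C"], "Salary": [1, 2, 3]})

# With replacement the sample may be larger than the dataframe itself
boot = df.sample(n=10, replace=True)
print(boot)

# Only 3 distinct rows exist, so some row labels must repeat
print(boot.index.is_unique)  # False
```

The same call with replace=False would raise a ValueError, since 10 distinct rows cannot be drawn from 3.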
Sample Dataframe with Seed
If we want to be able to reproduce our random sample of rows we can use the random_state parameter. This is the seed for the random number generator, and the simplest option is to pass an integer:
df200 = df.sample(n=200, random_state=1111)
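A quick way to convince ourselves that the seed works is to sample twice with the same random_state and compare the results. This is a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data
df = pd.DataFrame({"Salary": np.arange(1000)})

# Two calls with the same seed select exactly the same rows
first = df.sample(n=200, random_state=1111)
second = df.sample(n=200, random_state=1111)
print(first.equals(second))  # True
```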
We can, of course, use both the parameters frac and random_state, or n and random_state, together. In the example below we randomly select 50% of the rows and set random_state. It is further possible to use the replace=True parameter together with frac and random_state to get a reproducible percentage of rows with replacement.
df_half = df.sample(frac=.5, replace=True, random_state=1111)
Pandas Sample with Weights
The sample method also has the parameter weights, and this can be used if we want to increase the probability for certain rows to be sampled. We start off the next Pandas sample example by importing NumPy.
import numpy as np
df['Weights'] = np.where(df['Year'] <= 2000, .75, .25)
df['Weights'].unique()
# Output: array([0.75, 0.25])
In the code above we used NumPy’s where to create a new column ‘Weights’. Rows up until the year 2000 get the weight .75 and later rows get .25. Pandas sample normalizes the weights so that they sum to 1, which means that each row up until 2000 is roughly three times as likely to be selected as a later row:
df2 = df.sample(frac=.5, random_state=1111, weights='Weights')
df2.shape
# Output: (9772, 6)
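To check that the weights have the intended effect, we can compare the share of early rows in the sample with their share in the full data. The sketch below uses a synthetic dataframe in which exactly half of the rows are from 2000 or earlier (the years are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 1,000 rows up to 2000 and 1,000 rows after 2000
df = pd.DataFrame({"Year": np.repeat([1995, 2005], 1000)})
df["Weights"] = np.where(df["Year"] <= 2000, 0.75, 0.25)

sample = df.sample(n=200, weights="Weights", random_state=1111)

# Share of early rows: 0.5 in the full data, expected to be
# clearly higher in the weighted sample
share_early = (sample["Year"] <= 2000).mean()
print(share_early)
```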
Pandas Sample by Group
It’s also possible to sample each group after we have used Pandas groupby method. In the example below we are going to group the dataframe by player and then take 2 samples of data from each player:
grouped = df.groupby('Player')
grouped.apply(lambda x: x.sample(n=2, replace=True)).head()
The code above may need some clarification. In the second line, we used the Pandas apply method and an anonymous Python function (lambda). What it will do is run sample on each subset (i.e., for each Player) and take 2 random rows. Note, here we have to use replace=True, because sample raises a ValueError for any player with fewer than 2 rows when sampling without replacement.
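As far as we know, Pandas 1.1 and later also offer a sample method directly on the groupby object, which avoids apply and lambda. The sketch below assumes such a version is installed and uses made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "C"],
    "Salary": [1, 2, 3, 4, 5, 6],
})

# Player C has a single row, so n=2 is only possible with replacement
per_player = df.groupby("Player").sample(n=2, replace=True, random_state=1111)
print(per_player)

# Every player contributes exactly 2 rows to the result
print(per_player["Player"].value_counts().to_dict())
```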
Pandas Random Sample with Condition
Say that we want to take a random sample of players with a salary under 421000 (or rather, of rows where the salary is under this number, which could be certain years for some players). This is quite easy; in the example below we sample 10% of the rows that meet this condition.
df[df['Salary'] < 421000].sample(frac=.1).head()
It’s also possible to have more than one condition. We just have to add some code to the above example. Now we are going to sample salaries under 421000 and prior to the year 2000:
df[(df['Salary'] < 421000) & (df['Year'] < 2000)].sample(frac=.1).head()
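When the conditions grow, it may be easier to read if the boolean mask gets its own name first. This is only a sketch on made-up data, and low_paid_before_2000 is a name introduced for the illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1995, 1998, 1999, 2001, 2003, 1997, 1996, 2005, 1994, 1993],
    "Salary": [300000, 400000, 390000, 350000, 200000,
               410000, 150000, 100000, 420000, 900000],
})

# Name the condition so the filtering step is easy to read and reuse
low_paid_before_2000 = (df["Salary"] < 421000) & (df["Year"] < 2000)

# Filter first, then sample half of the 6 matching rows
subset = df[low_paid_before_2000]
sampled = subset.sample(frac=0.5, random_state=1111)
print(sampled)
```

Note that frac refers to the filtered rows, not to the original dataframe.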
Using Pandas Sample and Remove
We may want to take a random sample from our dataframe and remove those rows. Maybe we want to create two different dataframes; one with 80% of the rows and one with the remaining 20%. Both of these things can, of course, be done using sample and the drop method. In the code example below we create two new dataframes; one with 80% of the rows and one with the remaining 20%.
df1 = df.sample(frac=0.8, random_state=138)
df2 = df.drop(df1.index)
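The sketch below repeats the 80/20 split on synthetic data and checks that the two parts do not overlap. It assumes that the index labels are unique, since drop removes rows by label; the names train and holdout are introduced only for this illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with a unique default index
df = pd.DataFrame({"Salary": np.arange(1000)})

train = df.sample(frac=0.8, random_state=138)
holdout = df.drop(train.index)

# The two parts share no rows and together cover the whole dataframe
print(len(train), len(holdout))  # 800 200
print(train.index.intersection(holdout.index).empty)  # True
```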
If we merely want to remove random rows we can use drop and the inplace parameter:
df.drop(df1.index, inplace=True)
# Same as: df.drop(df.sample(frac=0.8, random_state=138).index, inplace=True)
df.shape
# Output: (3909, 5)
Saving the Pandas Sample
Finally, we may also want to save the sample to work on later. In the example code below we are going to save a Pandas sample to csv. To accomplish this we use the to_csv method. The first parameter is the filename and because we don’t want an index column in the file, we use index=False.
import pandas as pd
df = pd.read_excel('MLBPlayerSalaries.xlsx')
df.sample(200, random_state=1111).to_csv('MBPlayerSalaries200Sample.csv', index=False)
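To confirm that no index column ends up in the file, we can write a sample and read it back. The sketch below uses made-up data and a temporary folder, and the file name is only an example:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Player": list("ABCDEFGHIJ"), "Salary": range(10)})

# Write to a temporary folder; the file name here is only an example
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
df.sample(5, random_state=1111).to_csv(path, index=False)

# Reading the file back shows only the original columns, no index column
restored = pd.read_csv(path)
print(restored.columns.tolist())  # ['Player', 'Salary']
```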
In this brief Pandas tutorial we have learned how to use the sample method. More specifically, we have learned how to:
- take a random sample of data using the n (a number of rows) and frac (a fraction of rows) parameters,
- get reproducible results using a seed (random_state),
- sample by group, sample using weights, and sample with conditions,
- create two samples and delete random rows,
- save the Pandas sample.
That was it! Now we should know how to use Pandas sample.