Book review: Python for Data Analysis, Second Edition, by Wes McKinney

posted: May 10, 2018

tl;dr: A well-written, in-depth, practical guide to using Python’s primary data analysis libraries and tools...

The field of data science has grown dramatically in the past five years, as more businesses and institutions seek to use data to improve decision making and glean new insights. Python is one of the most popular languages for data science, which might be a surprise to those who think of Python as a slow, interpreted (dare I say “scripting”) language. Python’s secret, which allows it to be performant for intensive number crunching, is that a variety of third-party libraries have been written in C and other lower-level languages and then wrapped in a Python interface, so the code that uses them can still be written in easily understandable Python.
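To make that point concrete, here is a minimal sketch (my own illustration, not an example from the book) that computes a sum of squares twice: once with a plain Python loop, and once by handing the whole array to NumPy, whose inner loop runs in compiled code. The function names and the sample data are hypothetical.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Pure-Python version: the interpreter executes every iteration."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(values):
    """NumPy version: the multiply-and-add loop runs inside compiled code."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))


# Hypothetical sample data, chosen only for illustration.
data = list(range(1000))
loop_result = sum_of_squares_loop(data)
numpy_result = sum_of_squares_numpy(data)
```

Both functions return the same number; the difference is where the work happens. On large arrays the NumPy version is typically much faster, though the exact speedup depends on the data size and the machine.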

The growing popularity of Python for performing data science, and recent advances in the library and tool ecosystem, are the reasons Wes McKinney recently came out with the second edition of Python for Data Analysis. It should find an eager audience of folks looking to start doing serious data analysis within the Python ecosystem.

McKinney’s goal for Python for Data Analysis is to provide a foundation for manipulating and analyzing large amounts of data using Python, and he succeeds admirably. Basic knowledge of Python programming, including Python’s built-in data structures, major keywords, and invoking functions and methods, is a prerequisite. From there McKinney begins by showing the reader how to install and use various tools and libraries, starting with very basic commands and gradually working up to more sophisticated examples.

McKinney focuses on the important, widely used foundational Python data analysis tools and libraries: NumPy, pandas, matplotlib, IPython, and Jupyter notebooks. This actually provides plenty of material to work with, as McKinney wants readers to become proficient with using these tools to manipulate data before diving into more advanced mathematical analysis and machine learning. McKinney goes into great detail showing how to bring in data, cleanse it, organize it, subset it, apply mathematical operations to it, aggregate it, and display visualizations of the results. As many have said, 80% of data science is cleaning the data so that algorithms can be run on it, so McKinney focuses on educating readers to become expert data wranglers.
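For a flavor of what that wrangling looks like, here is a small pandas sketch of the load, cleanse, subset, and aggregate steps. It is my own illustration under assumed data, not an example taken from the book: the CSV contents, column names, and the threshold of 100 are all made up.

```python
import io

import pandas as pd

# Hypothetical CSV data; note the missing sales value for Austin in Feb.
raw_csv = io.StringIO(
    "city,month,sales\n"
    "Austin,Jan,100\n"
    "Austin,Feb,\n"
    "Boston,Jan,80\n"
    "Boston,Feb,120\n"
)

# Bring in the data.
df = pd.read_csv(raw_csv)

# Cleanse it: drop rows where the sales figure is missing.
clean = df.dropna(subset=["sales"])

# Subset it: keep only rows at or above an arbitrary threshold.
large = clean[clean["sales"] >= 100]

# Aggregate it: total sales per city.
totals = clean.groupby("city")["sales"].sum()
```

In a Jupyter notebook, evaluating `totals` in a cell would display the per-city sums, and `totals.plot(kind="bar")` would be one way to chart them with matplotlib installed.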

McKinney provides plenty of step-by-step instructions, real-world examples, recommended design patterns, and best practices. With just a few exceptions, his examples and the accompanying discussion proceed at a gradual pace, from simple statements to more complex operations, without any sudden major leaps in the knowledge level required. This should get moderately proficient Python programmers up and running to the point where they can work on their own use cases. The next step beyond McKinney’s guidance is the online documentation for the various tools. Python for Data Analysis functions as a useful bridge to this more voluminous syntactical documentation.

I highly recommend Python for Data Analysis. It’s a great book for those looking to get started doing serious data analysis in Python, and to build a foundation for more intensive data science.