
On MOOCs and The Audacious

As already discussed at The Data Science Lab, massive open online courses are shaking up many concepts in the traditional higher-education landscape. The mere fact that thousands of students around the globe can now simultaneously attend graduate-level lectures without setting foot on campus might lead to a redefinition of the “college experience”.

Thus far, the courses offered by Coursera and edX could be understood as an extension of otherwise regular university courses, simply made available to students outside the classroom by means of technology. The materials already existed in a similar form before being offered online; transferring them to a MOOC-appropriate format certainly involves extra work for the lecturer, but the structure of the syllabus remains essentially unchanged.

But as an article in Forbes reports, an audacious player in the MOOC providers league, Udacity, is set to disrupt the market with its offers of coaching for students as well as verified certification for their final projects. As Michael Horn bluntly puts it:

But the real disruption in U.S. higher education was never going to come from slapping traditional courses online for free. That is mostly glorified edutainment—not a bad thing for humanity by any means and potentially a useful upgrade over a traditional textbook, but not disruptive to the higher education sector writ large in and of itself. The real disruption in higher education was always going to come from a new system that looks quite different from the current one, begins by serving nonconsumers of traditional higher education, and integrates with employer needs to help students make progress in their lives because of an understanding that employers are ultimately—like it or not—the end customers for higher education because they ultimately finance much of the system for students.

Browsing Udacity’s offering in the Data Science track, one finds interesting videos and catchy trailers. We can choose among Introduction to Data Science, Data Wrangling with MongoDB, Introduction to Hadoop and MapReduce, and some other courses, priced at around $100-$150 per month for a duration of 1-2 months.

I do not doubt that there is a market for Udacity’s offers, and I am sure that the quality of their mentoring and materials is worth the investment. However, let us not forget that there are other approaches to MOOCs, as this interview with Prof. Abu-Mostafa illustrates. He makes a very interesting and profound point, with which we at The Data Science Lab thoroughly agree:

Stick to your guns. Don’t water down the course to increase the numbers. Make the course as interesting as possible WITHOUT compromising the rigor and the content. What matters is what the students actually learn and retain. This is real education not a video game or a popularity contest.

What is, in your opinion, the best way to organize MOOCs in hot topics, such as Data Science, that attract tons of attention from media and aspiring practitioners alike?


Put Some Pandas in Your Python


Every scientist needs tools to perform their experiments, and data scientists are no exception. Much advice has been written on what the best stack of tools for machine learning, data analysis and big data is. At least for the past few years, the champions in the programming language league seem to be R, Python and SQL. SQL needs no introduction, being the standard tool for managing records held in relational databases. R is undeniably the language of choice for most data analysts and statisticians; the size of its user community and the thousands of contributed libraries and add-on packages alone turn it into a reliable workhorse for a broad range of tasks. But one thing R lacks is a consistent syntax. To use the words of somebody else, the R language is pathologically flexible, and that might be very confusing to new users.

Python as a language for data science

Python, by contrast, might not be king when it comes to statistics-specific problems; it is rather a general-purpose programming language. Luis Pedro Coelho summarizes it well:

Python may often be the second best choice: for linear algebra, Matlab may have nicer syntax; for statistics, R is probably nicer; for heavy regular expression usage, Perl (ugh) might still be nicer; if you want speed, Fortran or C(++) may be a better choice. To design a webpage; perhaps you want node.js. Python is not perfect for any of these, but is acceptable for all of them.

Still and all, Python is growing to include an increasingly mature set of packages for data manipulation and analysis. One that has gotten a respectable amount of attention lately is pandas, a library that offers data structures and operations for manipulating numerical tables and time series. Its creator, Wes McKinney, regularly posts materials and tutorials on his blog, and has written a very handy book on data analysis with Python. In combination with the IPython notebook, pandas provides an easy environment in which to develop and present data analysis routines.

Data analysis with pandas: a beginner's tutorial

Let us have a brief look at the basics of pandas with a beginner's tutorial. Installing both IPython and pandas is trivial via pip or Anaconda, which is preferable to the standard Linux apt-get install utility because it typically provides more recent versions. At the time of writing, the latest releases of pandas and IPython are 0.12.0 and 1.1.0 respectively. Once installed, the notebook starts in the browser when the following command is invoked from the terminal:

$ ipython notebook --pylab inline

Pandas has very neat routines to load datasets in various formats, which can be browsed by typing pd.read*? in IPython. For this table-top experiment we will read a CSV file containing attributes of photo cameras, found on the Aviz project page. The data specifies model, weight, focal length and price, among others, and can be downloaded from the Aviz wiki.

We load the file, indicating that the fields are separated by semicolons, and rename the columns with descriptive labels for the variables. This creates a DataFrame, a fundamental structure that stores entries in tabular form, with an index (dataframe.index) for the rows and multiple columns for the different variables (dataframe.columns). Each column can be retrieved by dictionary-like notation (dataframe['column_name']) or by attribute (dataframe.column_name). The method head(n) applied to a data frame displays the first n rows of data (default = 5). The method apply(f) applies a function to each column or row of a data frame, or to each entry when called on a single column. For our camera dataset, we also want access to the maker, which corresponds to the first word of the model name. Thus, we create a new column 'Maker' by operating on the existing column 'Model'.

import pandas as pd
df = pd.read_csv('Camera.csv', sep=';')
columns = ['Model', 'Date', 'MaxRes', 'LowRes', 'EffPix', 'ZoomW', 'ZoomT',
           'NormalFR', 'MacroFR', 'Storage', 'Weight', 'Dimensions', 'Price']
df.columns = columns
df['Maker'] = df['Model'].apply(lambda s:s.split()[0])
df[['Maker','Model','Date','MaxRes','LowRes','Weight','Dimensions','Price']].head()

   Maker  Model                    Date  MaxRes  LowRes  Weight  Dimensions  Price
0  Agfa   Agfa ePhoto 1280        1997    1024     640     420          95    179
1  Agfa   Agfa ePhoto 1680        1998    1280     640     420         158    179
2  Agfa   Agfa ePhoto CL18        2000     640       0       0           0    179
3  Agfa   Agfa ePhoto CL30        1999    1152     640       0           0    269
4  Agfa   Agfa ePhoto CL30 Clik!  1999    1152     640     300         128   1299
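
To make the DataFrame vocabulary above concrete, here is a minimal sketch on a three-row toy table. The rows, the maker names and the variable name `toy` are invented for illustration (they are not taken from the Aviz file), and the snippet assumes only that a reasonably recent pandas is installed.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, standing in for Camera.csv
raw = io.StringIO(
    "Model;Date;Price\n"
    "Acme Alpha 1;1997;179\n"
    "Bolt Beta 2;1998;149\n"
    "Core Gamma 3;1999;62\n"
)
toy = pd.read_csv(raw, sep=';')

# Rows are labelled by the index, variables by the columns
print(list(toy.index))
print(list(toy.columns))

# Dictionary-like and attribute access return the same column
same = toy['Price'].equals(toy.Price)

# On a single column, apply() calls the function once per entry
toy['Maker'] = toy['Model'].apply(lambda s: s.split()[0])

# head(2) keeps only the first two rows
first_two = toy.head(2)
print(first_two)
```

Attribute access is convenient for typing, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.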



A number of handy operations can be performed on a data frame with virtually no effort:

      • Sorting data: display the 5 most recent models
        df.sort(['Date'], ascending = False).head()
      • Filtering rows by value: show only models made by Nikon
        df[df['Maker'] == 'Nikon']
      • Filtering rows by range of values: return cameras with prices above 350 and up to 500
        df[(df['Price'] > 350) & (df['Price'] <= 500)]
      • Getting statistical descriptions of the data set: find maxima, minima, averages, standard deviations, percentiles
        df[['MaxRes','LowRes','Storage','Weight','Dimensions','Price']].describe()

            MaxRes       LowRes      Storage       Weight   Dimensions        Price
count  1038.000000  1038.000000  1036.000000  1036.000000  1036.000000  1038.000000
mean   2474.672447  1773.936416    17.447876   319.265444   105.363417   457.384393
std     759.513608   830.897955    27.440655   260.410137    24.262761   760.452918
min       0.000000     0.000000     0.000000     0.000000     0.000000    14.000000
25%    2048.000000  1120.000000     8.000000   180.000000    92.000000   149.000000
50%    2560.000000  2048.000000    16.000000   226.000000   101.000000   199.000000
75%    3072.000000  2560.000000    20.000000   350.000000   115.000000   399.000000
max    5616.000000  4992.000000   450.000000  1860.000000   240.000000  7999.000000
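
For readers on a recent pandas release, the same operations can be sketched on a small invented table. One difference is worth flagging: `DataFrame.sort` was later replaced by `sort_values` (available since pandas 0.17), which is what the sketch uses. The rows and names below are hypothetical.

```python
import pandas as pd

# Invented rows for illustration; not taken from the camera dataset
toy = pd.DataFrame({
    'Maker': ['Acme', 'Acme', 'Bolt', 'Bolt', 'Core'],
    'Date':  [1999, 2003, 2001, 2005, 2004],
    'Price': [120, 360, 499, 500, 800],
})

# Sorting: most recent models first
newest = toy.sort_values('Date', ascending=False).head(2)

# Filtering rows by value: a boolean mask selects the matching rows
bolt_only = toy[toy['Maker'] == 'Bolt']

# Filtering by range: combine masks with & and parenthesise each comparison
mid_range = toy[(toy['Price'] > 350) & (toy['Price'] <= 500)]

# Summary statistics of a numeric column
summary = toy['Price'].describe()

print(newest)
print(bolt_only)
print(mid_range)
print(summary)
```

The parentheses around each comparison matter, because `&` binds more tightly than `>` and `<=` in Python.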


Plotting data frames with pandas

Pandas comes with handy wrappers around standard matplotlib routines that make it very easy to plot data frames. In the example below we create a histogram of all camera prices under 500 in our data set. Most camera models are priced in the bracket between 100 and 150 (units are not provided in the metadata of this set, but we might safely assume that we are talking about US dollars here).

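# Explicit imports, so that the snippet also runs outside a --pylab session
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams.update({'font.size': 16})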
df[(df['Price'] < 500)][['Price']].hist(figsize=(8,5), bins=10, alpha=0.5)
plt.title('Histogram of camera prices\n')
plt.savefig('PricesHist.png', bbox_inches='tight')

[Figure: histogram of camera prices (PricesHist.png)]
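
The counts behind such a histogram can also be inspected without drawing anything. The sketch below uses `pd.cut` to bin an invented list of prices into ten brackets of width 50 between 0 and 500. Both the numbers and the bracket edges are assumptions made here: `hist(bins=10)` spans the range of the data instead, and the counts say nothing about the real camera prices.

```python
import pandas as pd

# Invented prices for illustration only
prices = pd.Series([45, 110, 120, 135, 149, 160, 199, 230, 310, 480, 650])

# Keep prices under 500, mirroring the filter used for the plot
cheap = prices[prices < 500]

# Ten equal-width brackets: [0, 50), [50, 100), ..., [450, 500)
edges = list(range(0, 501, 50))
brackets = pd.cut(cheap, bins=edges, right=False)

# Count entries per bracket, ordered by bracket rather than by frequency
counts = brackets.value_counts().sort_index()
print(counts)

# The most populated bracket
busiest = counts.idxmax()
print(busiest)
```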

Grouping data with pandas

A powerful way of manipulating data in a frame is via processes that involve splitting it, applying functions to groups and combining the results. This is the “group by” approach that most SQL users will immediately recognize. Here is a clear exposition of how groupby works in pandas. In our table-top experiment, we group the data according to release year of the models, and study the evolution of the remaining variables with time. To the grouped structure gDate we apply the aggregating function mean(), which averages the non-grouped variables year by year. Quite reassuringly, we find out that photo cameras have become lighter over the past decade, while pixels and storage of newer models have steadily increased.

gDate = df[df['Date'] > 1998].groupby('Date').mean()
dates = [str(s) for s in gDate.index]
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20,5))
cols = ['b', 'r', 'g']
vars = ['EffPix', 'Weight', 'Storage']
titles = ['effective pixels', 'weight', 'storage']
for i, var in enumerate(vars):
    gDate[[var]].plot(ax=axes[i], alpha=0.5, legend=False, lw=4, c=cols[i])
    axes[i].set_xticklabels(dates, rotation=40)
    axes[i].set_title('Evolution of %s\n' % titles[i])
plt.savefig('CameraEvolution.png', bbox_inches='tight')

[Figure: evolution of effective pixels, weight and storage by release year (CameraEvolution.png)]
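
The split-apply-combine idea behind `groupby` can be spelled out by hand. In this sketch, on an invented table, a manual loop and the one-line `groupby` call produce the same yearly means. `numeric_only=True` is passed because recent pandas releases no longer silently drop text columns such as 'Model' when averaging.

```python
import pandas as pd

# Invented rows for illustration; not the real camera data
toy = pd.DataFrame({
    'Model':  ['A1', 'A2', 'B1', 'B2', 'C1'],
    'Date':   [2000, 2000, 2001, 2001, 2001],
    'Weight': [400, 300, 250, 200, 150],
})

by_hand = {}
for year in sorted(toy['Date'].unique()):
    subset = toy[toy['Date'] == year]               # split: rows of one year
    by_hand[int(year)] = subset['Weight'].mean()    # apply: average them
by_hand = pd.Series(by_hand)                        # combine: one result per year

# The same three steps in a single call
grouped = toy.groupby('Date').mean(numeric_only=True)

print(by_hand)
print(grouped)
```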

Grouping by maker instead, we can see which ones make the smallest and most economical models, on average.

gMak = df.groupby('Maker').median()
gMak.index.name = ''
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18,8))
c = ['y','c']
vars = ['Dimensions', 'Price']
for i, var in enumerate(vars):
    gMak[[var]].plot(kind='barh', ax=axes[i], alpha=0.5, legend=False, color=c[i])
    axes[i].set_title('Average %s by maker\n' % vars[i])
    axes[i].grid(False)
plt.savefig('MeanDimensionsPrices.png', bbox_inches='tight')

[Figure: average dimensions and price by maker (MeanDimensionsPrices.png)]
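
One detail is worth a sketch: the code above aggregates with `median()`, which is less sensitive to a few very expensive models than `mean()` (in the summary statistics earlier, the mean price is about 457 while the median is 199). The invented numbers below illustrate the effect; 'Acme' and 'Bolt' are hypothetical makers.

```python
import pandas as pd

# Invented prices: one Acme model is far more expensive than the rest
toy = pd.DataFrame({
    'Maker': ['Acme', 'Acme', 'Acme', 'Bolt', 'Bolt', 'Bolt'],
    'Price': [100, 120, 2000, 300, 310, 320],
})

# Compute both aggregates per maker in one pass
stats = toy.groupby('Maker')['Price'].agg(['mean', 'median'])
print(stats)

# The maker that looks cheapest depends on the aggregate chosen
cheapest_by_mean = stats['mean'].idxmin()
cheapest_by_median = stats['median'].idxmin()
print(cheapest_by_mean, cheapest_by_median)
```

Here a single outlier is enough to flip the ranking, which is one reason to prefer the median when comparing makers with very different product ranges.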

Pandas provides much more functionality, and this is just a first look at it here at The Data Science Lab. We will continue the exploration of this and other datasets in due time. In the meantime, a quick Google search for pandas should be enough to keep us entertained.

Learning Machine Learning Online

The concept of distance, asynchronous learning is not an invention of the digital age. Correspondence and radio courses were already a thing in the past century, offering value to continuous learners and to people with otherwise no possibility of attending traditional schools. With the spread of the Internet over the last decade, though, a market for massive open online courses has emerged. Coursera and edX might well be two of the best-known providers. Incidentally, Andrew Ng, a co-founder of the former, is a professor of Computer Science and a researcher in the field of artificial intelligence. Considering that data science is enjoying a lot of popularity at the moment, and that few higher education institutions yet offer comprehensive data science degrees, it is perhaps no surprise that some of the best-attended online courses are on machine learning and related data analysis topics.

Below is a brief review of some of my favorite courses.

Machine Learning by Andrew Ng, Stanford (via Coursera)

Ng’s course has become something of a classic for machine learning beginners and a good introduction to the topic. Over 10 weeks, the Stanford professor covers single- and multi-variable linear regression, logistic regression, regularization, neural networks, support vector machines, clustering, and recommender systems, and finishes with general advice for real applications of machine learning. The review quizzes, which can be repeated multiple times, allow the materials to sink in, and the programming exercises, which must be completed in Octave/MATLAB, provide the opportunity to see the methods in action. Most of the gory mathematical details are skipped, which arguably appeals to students seeking a hands-on approach. The programming exercises are designed as supervised step-by-step milestones, which somewhat constrains the room for creativity. On the other hand, the provided scripts work flawlessly and help one understand and visualize what one is doing.

Learning From Data by Yaser S. Abu-Mostafa, Caltech (via edX)

Prof. Abu-Mostafa’s introductory machine learning course is a real gem. He’s an excellent educator who, not surprisingly, won the Feynman prize for excellence in teaching in 1996. The course covers material similar to Ng’s; however, the focus is more on the underlying mathematical notions, and it is thus more formal. Each week two one-hour lectures are presented, followed by a homework set containing ten questions. At the end of the ten weeks a final exam is given. The participants are free to choose their favorite language to solve the exercises; only solutions, and no code, are submitted. The questions in the quizzes can be answered only once, making the course more challenging than its Coursera equivalent. I particularly enjoyed the rigorous way of explaining the theory of generalization, which really makes you understand when and why machine learning is possible. Here is an interesting interview with Abu-Mostafa.

Introduction to Data Science by Bill Howe, University of Washington (via Coursera)

This course, taught by Prof. Howe, has an impressive syllabus to begin with. From relational algebra to parallel databases, including Hadoop, MapReduce, (No)SQL, text analysis, and visualization, it seems to cover everything a data scientist needs. In practice, the workload is slightly unbalanced, with some assignments being clearly more challenging than others. The assignments offer the possibility of getting familiar with catchy topics such as sentiment analysis of tweets, Kaggle competitions, MapReduce, Amazon Elastic Cloud and the popular visualization software Tableau. The homework comprises both automatically and peer-graded exercises. The 8-week duration feels a bit rushed, especially towards the end, but overall it was a fun compilation of assignments.

Natural Language Processing by Michael Collins, Columbia University (via Coursera)

Michael Collins does a great job in this NLP course, which covers very interesting topics in a nicely formal and rigorous way. His notes are superb, and the topics chosen provide a solid basis for computational linguistics. There are bi-weekly quizzes, plus three mandatory programming assignments that are challenging enough to keep students occupied for the 10-week course duration. Coding can be done in any language; only the result files, which need to comply with a specific format, are submitted. I really enjoyed implementing three of the algorithms that Collins very clearly presents in his videos: hidden Markov models for classification, the CKY decoder for parsing, and the translation alignments for the IBM models. NLP skills are certainly a nice addition to any data scientist’s set of tools.

Data Analysis by Jeff Leek, Johns Hopkins University (via Coursera)

This is a course on applied statistics focused on data analysis. It puts an emphasis on teaching students how to organize and carry out a data analysis, and to write up a report end-to-end. In addition to the weekly review quizzes, there are two peer-reviewed data analysis assignments, which involve the submission of a full report plus figures and references. That in itself is a nice touch, since it forces you to explain clearly what you learned. The presentation of the statistics concepts and the weekly quizzes rely quite heavily on the R language. The course is not loaded with mathematical derivations, but it does provide a good introduction to the usage of R in data analysis, arguably the most widespread tool among statisticians.

This is by no means a comprehensive list of online courses for machine learning. New programs are continuously added to the stack of materials of interest for data scientists. Don’t be afraid of enrolling and trying out some of the courses; you might find that the level and focus are exactly right for you, or else you can always let it sit for a semester and come back in a future session, for many courses are offered yearly.