The post Some Essential Hacks and Tricks for Machine Learning with Python appeared first on DPhi.


“I am a student of computer science/engineering. How do I get into the field of machine learning/deep learning/AI?”

It’s never been easier to get started with machine learning. In addition to structured MOOCs, there is also a huge number of incredible, free resources available around the web. Here are just a few that have helped me:

- Start with some cool videos on YouTube. Read a couple of good books or articles. For example, have you read “The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World”? And I can guarantee you’ll fall in love with this cool interactive page about machine learning.
- Learn to clearly differentiate between buzzwords first: *machine learning, artificial intelligence, deep learning, data science, computer vision, robotics*. Read or listen to talks given by experts on each of them. Watch this amazing video by Brandon Rohrer, an influential data scientist, or this video about the clear definitions of, and differences between, the various roles associated with data science.
- Have your goal clearly set for what you want to learn. Then go and take that Coursera course, or take the other one from Univ. of Washington, which is pretty good too.
- **Follow some good blogs**: KDnuggets, Mark Meloon’s blog about data science careers, Brandon Rohrer’s blog, OpenAI’s blog about their research, and of course, Heartbeat.
- If you are enthusiastic about taking online MOOCs, check out this article for guidance.
- Most of all, develop a feel for it. Join some good social forums, but **resist the temptation to latch onto sensationalized headlines and news bytes** posted there. Do your own reading; understand what it is and what it is not, where it might go, and what possibilities it can open up. Then sit back and think about how you can apply machine learning or imbue data science principles into your daily work. Build a simple regression model to predict the cost of your next lunch, or download your electricity usage data from your energy provider and do a simple time-series plot in Excel to discover some pattern of usage. And after you are thoroughly enamored with machine learning, you can watch this video.
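The lunch-cost idea above really is a five-minute exercise. As a minimal sketch with scikit-learn (the numbers below are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: number of items ordered vs. lunch cost in dollars
items = np.array([[1], [2], [2], [3], [4], [5]])
cost = np.array([4.0, 7.5, 8.0, 11.0, 14.5, 18.0])

# Fit a straight line through the points
model = LinearRegression()
model.fit(items, cost)

# Predict the cost of a 3-item lunch
predicted = model.predict(np.array([[3]]))[0]
```

Toy as it is, this is the same fit/predict workflow used for far bigger models later on.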

Familiarity with, and moderate expertise in, at least one high-level programming language are useful for beginners in machine learning. Unless you are a Ph.D. researcher working on a purely theoretical proof of some complex algorithm, you are expected to mostly use existing machine learning algorithms and apply them to solving novel problems. This requires you to put on a programming hat.

There’s a lot of debate on the ‘*best language for data science*’ (in fact, here’s a take on why data scientists should learn Swift).

While the debate rages on, grab a coffee and read this insightful article to get an idea and see your choices. Or, check out this post on KDnuggets. For now, it’s widely believed that Python helps developers to be more productive from development to deployment and maintenance. Python’s syntax is simpler and of a higher level when compared to Java, C, and C++. It has a vibrant community, open-source culture, hundreds of high-quality libraries focused on machine learning, and a huge support base from big names in the industry (e.g. Google, Dropbox, Airbnb, etc.).

This article will focus on some essential hacks and tricks in Python focused on machine learning.

There are a few core Python packages/libraries you need to master to practice machine learning effectively. Brief descriptions of them are given below.

Short for Numerical Python, NumPy is the fundamental package required for high-performance scientific computing and data analysis in the Python ecosystem. It’s the foundation on which nearly all of the higher-level tools, such as Pandas and scikit-learn, are built. TensorFlow uses NumPy arrays as the fundamental building block on top of which it builds its Tensor objects and computation graphs for deep learning tasks. Many NumPy operations are implemented in C, making them super fast. For data science and modern machine learning tasks, this is an invaluable advantage.
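A quick sketch of what “implemented in C” buys you: elementwise arithmetic and aggregations over a million numbers, with no explicit Python loop anywhere:

```python
import numpy as np

# Vectorized arithmetic: the loop runs in compiled C code, not Python
a = np.arange(1_000_000, dtype=np.float64)
b = a * 2.0 + 1.0          # elementwise, applied to every entry at once

# Aggregations are also implemented in C
total = b.sum()
mean = b.mean()
```

The equivalent pure-Python loop would be orders of magnitude slower, which is why nearly every library below hands its heavy lifting to NumPy arrays.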

This is the most popular library in the scientific Python ecosystem for doing general-purpose data analysis. Pandas is built upon the NumPy array, thereby preserving fast execution speed and offering many **data engineering features**, including:

- Reading/writing many different data formats
- Selecting subsets of data
- Calculating across rows and down columns
- Finding and filling missing data
- Applying operations to independent groups within the data
- Reshaping data into different forms
- Combining multiple datasets together
- Advanced time-series functionality
- Visualization through Matplotlib and Seaborn
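A few of the features listed above can be sketched in a dozen lines (the dataset is made up for illustration):

```python
import numpy as np
import pandas as pd

# A small made-up dataset with a missing value
df = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston", "Boston"],
    "sales": [100.0, np.nan, 80.0, 120.0],
})

# Finding and filling missing data
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Applying operations to independent groups within the data
per_city = df.groupby("city")["sales"].mean()
```

The same `fillna`/`groupby` pattern scales unchanged from four rows to millions.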

Data visualization and storytelling with your data are essential skills that every data scientist needs in order to communicate insights gained from analyses effectively to any audience. This is equally critical in the pursuit of machine learning mastery, as you often need to perform exploratory analysis of a dataset in your ML pipeline before deciding which algorithm to apply.

Matplotlib is the most widely used 2-D Python visualization library, equipped with a dazzling array of commands and interfaces for producing publication-quality graphics from your data. Here is an amazingly detailed and rich article to get you started with Matplotlib.
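For orientation, a minimal labeled figure takes only a handful of Matplotlib calls (the Agg backend is selected here so the sketch renders off-screen on any machine):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("A minimal Matplotlib figure")
ax.legend()
fig.savefig("sine.png", dpi=150)
```

The object-oriented `fig, ax` style used here scales better to multi-panel, publication-quality figures than the implicit `plt.plot` style.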

Seaborn is another great visualization library **focused on statistical plotting**. It’s worth learning for machine learning practitioners. Seaborn provides an API (with flexible choices for plot style and color defaults) on top of Matplotlib, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas. Here is a great tutorial on Seaborn for beginners.
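As a small taste of Seaborn's high-level API, one function call produces a styled statistical plot directly from a Pandas dataframe (the data below is made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; omit this line in a notebook
import pandas as pd
import seaborn as sns

# A made-up two-group dataset for illustration
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "value": [1, 2, 2, 3, 4, 5, 6, 6, 7, 8],
})

# One high-level call: Seaborn reads the column names, draws the boxes,
# and labels the axes for you
ax = sns.boxplot(x="group", y="value", data=df)
ax.figure.savefig("boxplot.png", dpi=150)
```

The equivalent raw-Matplotlib boxplot would take noticeably more code for the grouping and labeling alone.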

Scikit-learn is the most important general machine learning Python package you must master. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, *k*-means, and DBSCAN, and is designed to inter-operate with the Python numerical and scientific libraries NumPy and SciPy. It provides a range of supervised and unsupervised learning algorithms via a consistent interface. The vision for the library is a level of robustness and support suitable for use in production systems, which means a deep focus on concerns such as ease of use, code quality, collaboration, documentation, and performance. Look at this gentle introduction to machine learning vocabulary as used in the Scikit-learn universe. Here is another article demonstrating a simple machine learning pipeline method using Scikit-learn.


Scikit-learn is a great package to master for machine learning beginners and seasoned professionals alike. However, even experienced ML practitioners may not be aware of all the hidden gems of this package that can aid their work significantly. Here I list a few of these relatively lesser-known methods/interfaces available in Scikit-learn.

**Pipeline**: This can be used to chain multiple estimators into one. It is useful because there is often a fixed sequence of steps in processing data, for example feature selection, normalization, and classification. Here is more info about it.
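A minimal sketch of such a chain on the built-in iris data, following the feature-selection / normalization / classification sequence just mentioned:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Chain feature selection, normalization, and classification into one estimator
pipe = Pipeline([
    ("select", SelectKBest(k=2)),   # keep the 2 most informative features
    ("scale", StandardScaler()),    # normalize them
    ("clf", SVC()),                 # classify
])
pipe.fit(X, y)
score = pipe.score(X, y)
```

The whole chain now behaves like a single estimator: one `fit`, one `predict`, and (as the next section shows) one object to pass into a hyper-parameter search.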

**Grid search**: Hyper-parameters are parameters that are not directly learned within estimators. In Scikit-learn they are passed as arguments to the constructor of the estimator classes. It is possible, and recommended, to search the hyper-parameter space for the best cross-validation score. Any parameter provided when constructing an estimator may be optimized in this manner. Read more about it here.
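A short sketch of a grid search over two SVC hyper-parameters on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Any constructor argument of the estimator can be searched over
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 5-fold cross-validation over all 6 combinations
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

After fitting, `search` itself acts as an estimator refit with the best combination, so `search.predict(...)` uses the winning model directly.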

**Validation curves**: Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance, and noise. A validation curve plots the training and cross-validation scores of an estimator as a function of a single hyper-parameter, making it easy to see whether the model is underfitting or overfitting at particular values.
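In Scikit-learn this is provided by the `validation_curve` helper; a minimal sketch on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Score the model across a range of values of one hyper-parameter (C)
param_range = np.logspace(-3, 2, 6)
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name="C", param_range=param_range, cv=5
)

# One row of scores per parameter value, one column per CV fold
mean_valid = valid_scores.mean(axis=1)
```

Plotting `train_scores.mean(axis=1)` and `mean_valid` against `param_range` gives the familiar bias/variance picture: both low means underfitting, a widening gap means overfitting.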

**One-hot encoding of categorical data**: It is an extremely common data preprocessing task to transform input categorical features into one-of-K binary encodings for use in classification or prediction tasks (e.g. logistic regression with mixed numerical and text features). Scikit-learn offers powerful yet simple methods to accomplish this. They operate directly on Pandas DataFrames or NumPy arrays, freeing the user from writing any special map/apply function for these transformations.
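Both the Pandas one-liner and the reusable Scikit-learn encoder can be sketched as follows (the toy column is made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Option 1: the Pandas one-liner; returns a DataFrame of indicator columns
dummies = pd.get_dummies(df["color"])

# Option 2: the Scikit-learn encoder; fitted once, reusable inside a Pipeline
enc = OneHotEncoder()
encoded = enc.fit_transform(df[["color"]]).toarray()
```

`get_dummies` is handy for one-off exploration; `OneHotEncoder` remembers the categories it saw at fit time, which matters when new data arrives at prediction time.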

**Polynomial feature generation**: For countless regression modeling tasks, it is often useful to add complexity to the model by considering nonlinear features of the input data. A simple and common method is to use polynomial features, which capture features’ higher-order and interaction terms. Scikit-learn has a ready-made function to generate such higher-order cross-terms from a given feature set, at the user’s choice of highest polynomial degree.
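That ready-made function is `PolynomialFeatures`; for two inputs x1 and x2 at degree 2 it emits the bias, linear, square, and interaction terms:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample, two features: x1=2, x2=3

# Degree-2 expansion: 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
```

Feeding `X_poly` into an ordinary `LinearRegression` is the standard way to fit a polynomial model without writing any feature-construction code yourself.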

**Dataset generators**: Scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity. It has functions for classification, clustering, regression, matrix decomposition, and manifold testing.
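A short sketch of three of these generators:

```python
from sklearn.datasets import make_blobs, make_classification, make_regression

# A 2-class classification problem with controlled dimensionality
X_cls, y_cls = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

# Gaussian blobs, handy for clustering experiments
X_blob, y_blob = make_blobs(n_samples=150, centers=3, random_state=0)

# A regression problem with a known noise level
X_reg, y_reg = make_regression(
    n_samples=100, n_features=4, noise=0.1, random_state=0
)
```

Because you control size, noise, and class separation, these synthetic sets are ideal for sanity-checking an algorithm before pointing it at messy real data.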

Project Jupyter was born out of the IPython Project in 2014 and evolved rapidly to support interactive data science and scientific computing across all major programming languages. There is no doubt that it has had one of the biggest impacts on how a data scientist can quickly test and prototype his/her ideas and showcase the work to peers and the open-source community.

However, **learning and experimenting with data become truly immersive when the user can interactively control the parameters of the model** and see the effect in (almost) real time. Most of the common renderings in Jupyter are static.

But suppose you want more control: you want to change variables with a simple swipe of your mouse, not by writing a for-loop. What should you do? You can use an IPython widget.

Widgets are eventful Python objects that have a representation in the browser, often as a control like a slider or a text box, through a front-end (HTML/JavaScript) rendering channel.
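As a minimal sketch (assuming a Jupyter environment with `ipywidgets` installed), a one-line `interact` call attaches a slider to an ordinary plotting function:

```python
import matplotlib.pyplot as plt
import numpy as np
from ipywidgets import FloatSlider, interact

def plot_sine(freq=1.0):
    # Redraws the curve each time the slider value changes
    x = np.linspace(0, 2 * np.pi, 200)
    plt.plot(x, np.sin(freq * x))
    plt.title(f"sin({freq:.1f} x)")
    plt.show()

# In a notebook, this one line renders a slider above the plot
interact(plot_sine, freq=FloatSlider(min=0.5, max=5.0, step=0.1, value=1.0))
```

Dragging the slider re-invokes `plot_sine` with the new `freq`, which is exactly the "change variables with a swipe of your mouse" experience described above.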

In this article, I demonstrate a simple curve-fitting exercise using basic widget controls. In a follow-up article, this is extended further into the realm of interactive machine learning techniques.


This article covered some essential tips for jump-starting your journey into the fascinating world of machine learning with Python. It does not cover deep learning frameworks like TensorFlow, Keras, or PyTorch, as they merit deep discussions of their own. You can read some great articles about them below, and we may come back later with a dedicated discussion of these amazing frameworks.

- 7 great articles on TensorFlow (Datascience Central)
- Datacamp tutorial on neural nets and Keras example
- AnalyticsVidhya tutorial on PyTorch

You can also try the following:

**Deep Learning Course (with TensorFlow) by SimpliLearn**: This course has been crafted by industry experts and aligned with the latest best practices. You will learn foundational concepts and the TensorFlow open-source framework, implement the most popular deep learning architectures, and traverse layers of data abstraction to understand the power of data.

**Google’s Cloud-based TensorFlow specialization (Coursera)**: This 5-course specialization focuses on advanced machine learning topics using Google Cloud Platform, where you will get hands-on lab experience optimizing, deploying, and scaling production ML models of various types.

Thanks for reading this article. Machine learning is currently one of the most exciting and promising intellectual fields, with applications ranging from e-commerce to healthcare and virtually everything in between. There is hype and hyperbole, but there is also solid research and best practice. If properly learned and applied, this field of study can bring immense intellectual and practical rewards to practitioners and their professional work.

It’s impossible to cover even a small fraction of machine learning topics in the space of one (or ten) articles. But hopefully, the current article has piqued your interest in the field and given you solid pointers to some of the powerful frameworks already available in the Python ecosystem for starting your machine learning tasks.

**Note:** *This article was originally published on Heartbeat, and kindly contributed to DPhi to spread the knowledge.*

Become a guide. Become a mentor. We at DPhi welcome you to share your experience in data science, be it your learning journey, your experience while participating in Data Science Challenges, data science projects, tutorials, or anything else related to Data Science. Your learnings could help a large number of aspiring data scientists! Interested? Submit here.


The post Essential beginners’ Q/A for machine learning/data science appeared first on DPhi.

I have been an R&D engineer for the last 8 years, following my PhD in Electrical Engineering. I work in the domain of semiconductor devices and have been learning about and practicing machine learning (ML) and related data science (DS) concepts and techniques for only the past year or so.

Apart from taking numerous online MOOCs and building my GitHub repos, I have taken time to write some articles on Medium on the topics of data science and machine learning. Some of them have received good responses and stimulated interesting discussion. During this time, many of my friends, who are from similar technical backgrounds, have asked me how I have continued to study and what I have focused on to learn the essential ML/DS concepts. I have also silently observed hundreds of such questions on the social media forums I am part of. I have received emails from complete strangers in my personal inbox, discussing issues with my GitHub code or seeking tips on how to manage time for studying data science after a full day’s hard work as a mechanical engineer!

In response, I am trying to list the most essential insights I have had and found valuable during my journey so far. Of course, these may look pretty regular and commonplace to an experienced practitioner. However, my goal is to give some introductory ideas to beginners like me. At the same time, I will try to write down answers to some oft-repeated questions I have received in my inbox from friends and strangers alike.

This will be highly *biased* by my own experience. And there is no *variance*, because the sample set consists of only one mind: mine. Therefore, if you read on, please apply a suitable filter (convolutional or not) and then put on the final classification label.

“I am a student of computer science/engineering. How do I get into the field of machine learning/deep learning/AI?”

— You are lucky. When I was a student, this field was still kind of ‘chilly’ and I had never heard of it. No matter what your main area of study is, you can read about and acquire knowledge of machine learning. And please, DO NOT go straight to Coursera and register for Prof. Ng’s course just because somebody gushed about it on the internet or you saw the certificate on your friend’s LinkedIn profile.

- Start with some cool videos on YouTube. Read a couple of good books or articles. For example, have you read “The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World”? And I can guarantee you’ll fall in love with this cool interactive page about machine learning.
- Learn to clearly differentiate between buzzwords first: *machine learning, artificial intelligence, deep learning, data science, computer vision, robotics*. Read or listen to talks given by experts on each of them. Watch this amazing video by Brandon Rohrer, an influential data scientist, or this video about the clear definitions of, and differences between, the various roles associated with data science.
- Have your goal clearly set for what you want to learn. Then go and take that Coursera course. Make sure you are comfortable with MATLAB before starting it and that you know by heart how to do a matrix multiplication. Or take the other one from Univ. of Washington, which is pretty good too. And please, do not take the deeplearning.ai courses before taking a foundational course.
- **Follow some good blogs**: KDnuggets, Mark Meloon’s blog about data science careers, Brandon Rohrer’s blog, OpenAI’s blog about their research.
- If you have already started to study and practice machine learning but are not feeling confident about your understanding, you can try this **12 Important Machine Learning Interview Questions** guide from Simplilearn to align your learning compass.
- Most of all, develop a feel for it. Join some good social forums, but **resist the temptation to latch onto sensationalized headlines and news bytes** posted there. Do your own reading; understand what it is and what it is not, where it might go, and what possibilities it can open up. Then sit back and think about how you can apply machine learning or imbue data science principles into your daily work. Build a simple regression model to predict the cost of your next lunch, or download your electricity usage data from your energy provider and do a simple time-series plot in Excel to discover some pattern of usage. And after you are thoroughly enamored with machine learning, you can watch this video. *If you can get down to that kind of personal level, then the love affair is natural and guaranteed.*

“What are some good books to read on AI/machine learning/deep learning?”

— There seem to be plenty of solid answers to this question. Instead of stacking up my own bias, I am inclined to point you to some top links with the most useful curated collections:

- 10 free must-read books for machine learning and data science by Matthew Mayo.
- Must Read Books for Beginners on Machine Learning and Artificial Intelligence — by AnalyticsVidhya.
- This awesome GitHub repo.
- A solid guide for all levels of practitioners from machinelearningmastery.com.

“Now tell me about some of the best online courses for beginners.”

— Again, a couple of links and some words of advice. First, the links:

- Top Machine Learning MOOCs and Online Lectures: A Comprehensive Survey.
- 15 minute guide to choose effective courses for machine learning and data science.
- The Best Machine Learning Courses — Class Central Career Guides

**The advice is that only you can decide the right sequence and pace of study through these MOOCs**. Some people are highly attuned to taking online courses and can keep themselves motivated enough to download all the programming material and practice it. Others are passionate but cannot stay awake through online video lectures. And there are plenty in between. You can gain a great amount of knowledge and build a solid foundation as a beginner from online courses, but you have to pace them and approach them right.

“OK, so what are your special tips for taking online courses? Are paid courses/certificates worth it?”

- Study the curated lists of best courses (above) and visit them all. It should not cost you anything. **List down the requirements, the difficulty level, and the breadth of topics covered.**
- Always **keep a personal roadmap of things you want to learn. It should be almost like a time-series**, i.e. definite goals to be achieved at definite points in time. **Bring out your data scientist hat and try to fit each course to this time-series.** Does it match, i.e. teach you valuable concepts and tools that you want to acquire in the near future? Does it go deeper into frameworks and specializations you already know but want to polish further? Score the courses, weighted on their utility and your personal requirements.
- Make a list of the top 5 courses for a quarter. **I have found it useful to think in terms of quarters.**
- Now, look at cost implications and read reviews from past students of those top 5. Try to quickly judge which reviewer’s personal situation is most like yours and weight that review highly.
- I have mostly taken the audited/free option. I paid only for the courses/certificates which I thought would give me a boost in terms of credibility in this field. Quizzes are not that important, but programming assignments/Jupyter notebooks are. **Most audited courses will let you download the notebooks for free. That’s all you need.** Trust me, you can make those notebooks much better than the instructor-provided versions if you work on them diligently.
- If you have some budget for spending, I would suggest looking for **local universities offering certificates or boot-camp-style courses with classroom options**. This works for working professionals too, as these are mostly delivered in the evenings or on weekends. **Spend money only when you get a highly credible certificate out of it**, preferably from good university-affiliated camps or programs.
- **If you need to brush up programming language skills from scratch, there are a few focused and easy-paced courses on Udemy**. Generally, these courses are designed to teach you the essentials of programming in a language of your choice without the burden of deeper subject matter. I have found that **Coursera or edX courses are mostly offered by academic researchers and are most suitable for subject learning, not as a programming 101. Udemy courses, on the other hand, are perfect for brushing up fundamentals**. Wait for Udemy’s $10 offer deals (they have them almost every month) and buy those courses in a pack. I particularly like Jose Marcial Portilla’s courses on Python, R, SQL, and Apache Spark.
- **The internet is one big university and library rolled into one. Consider and utilize any online resource, whole or piecemeal, as you need it.** If you don’t know anything about Git, look for specific short courses on it. If you need quick web-scraping techniques in a specific language, just watch lectures on those from a bigger course. If you really think that basic data structures are mysterious, go and take a freshman-level computer science course. Harvard’s CS50 is an excellent one. Or there is even a version of it for business professionals with a different flavor.

Inverting Leo Tolstoy: “**every bad course is similar; every great course has a unique element of greatness about it**”. Develop an eye for discovering that greatness in the top courses.

“So, what is the absolute bedrock knowledge needed to start the journey?”

— Nothing but a curiosity to learn new things and a passion to work hard for it. You have to acquire knowledge, practice, and internalize concepts as you go. But even then, some general points can be made about a structured learning approach. Again, this is from my very personal experience and is therefore subject to complete re-tuning based on your personal situation and goal.

- **Make yourself comfortable with data and patterns in numbers**. And accept the fact that data can come from any type of signal or source: spreadsheet financials, bank transactions, ad clicks, hospital patient records, wearable electronics, Amazon Echo, factory assembly lines, customer surveys. So they can be messy, incomplete, and hard to decipher, and it is useful to learn how to be patient and deal with them cleverly so that great models can be built later. This is not hard knowledge but an aptitude.
- You should be **very, very comfortable with high-school-level mathematics, including basic calculus**. Here is an article about it. Specifically, concepts of multi-variable functions, linear algebra, derivatives, and graphs should be clear.

- **Learn the principles of basic statistics by heart**. Most, if not all, *classical machine learning is nothing but sound statistical modeling wrapped in an interface of computer programming and optimization techniques*. Deep learning is a different beast, and theories pertaining to it are still being actively developed. However, if you have a solid grasp of basic statistics, you are far along the path to becoming a good machine learning practitioner. This is an excellent free online book to start with. Or, this course from UC San Diego really nails down the basics in a fun and interesting way. Or, if watching fun YouTube videos is your thing, you can try this channel.

- **Familiarity with at least one high-level programming language is useful**. Unless you are a PhD researcher in machine learning working on a purely theoretical proof of some complex algorithm, you are expected to mostly use existing machine learning algorithms, apply them to solving novel problems, and create cool predictive models based on those techniques. This requires a programming hat to be put on. Nothing extraordinary is needed though, just **basic familiarity with the syntax, variables, data structures, conditionals, functions, and class definitions of a particular language**. There is a great deal of debate and active discussion about the ‘best language for data science’. While the debate rages on, grab your coffee and read this insightful article to get an idea and see your choices. Or, check out this post on KDnuggets.

- **Learn Jupyter fundamentals**. This is a kick-ass programming and experimenting tool for a data scientist. Learn not only how to code but how to write a full technical document or report using Jupyter, including images, Markdown, LaTeX formatting, etc. Here is a GitHub notebook that I put together on Markdown features. Always remember, the Jupyter project started as an offshoot of IPython but developed into a full-blown development platform with support for any number of programming languages you can think of.

- **Familiarity with Linux/command-line interfaces, Git, and Stack exchanges can be immensely helpful** to get you started on the practical implementation side. Surprisingly, this is harder than it sounds. Particularly for many working professionals like me, using Windows- or Mac-based enterprise-level tools becomes ingrained. For technical help with a software tool, we expect a shiny little electronic manual to pop up. Registering an account and posting a well-framed question on a Stack forum is too much for us. But learn them early and **embrace the open-source culture. You will find that the whole world is eager to solve your problem before you can even finish typing your question on Stack Overflow.** And you will also be pleasantly surprised to see that installing a powerful software library is not as intimidating as those geeks make you believe. It is as easy as “*pip install …*”.

That’s it for Part I. In Part II, I hope to cover the most important Q/A on two more aspects:

- Some key machine learning/statistical modeling concepts that are useful to beginners and,
- Other much-needed skills such as being visible on social/professional platforms and focusing on building stuff regularly.

**Note:** *This article was originally published on towardsdatascience.com, and kindly contributed to DPhi to spread the knowledge.*



The post How to choose effective courses for machine learning and data science appeared first on DPhi.

Bill Gates proclaimed at a recent graduation ceremony that artificial intelligence (AI), energy, and bioscience are the three most exciting and rewarding career choices today’s young college graduates can choose from.

I couldn’t agree more.

I have come to believe strongly that some of the most important questions of our generation, related to sustainability, energy generation and distribution, transportation, access to basic amenities of life, etc., depend on how intelligently we can mix the first two branches of knowledge Mr. Gates mentions.

In other words, the world of physical electronics (semiconductor industry comprises a central portion of that world), must do more to embrace fully the fruits of information technology and new developments in AI or data science.

I am a semiconductor professional with 8+ years of post-PhD experience at a top technology company. I take pride in the fact that I work at the intersection of physical electronics and the energy sector. I develop power semiconductor devices. They are built to carry electrical power efficiently and reliably, and they power everything from the tiny sensors inside your smartphone to the large industrial motor drives that process food or cloth for everyday consumption.

Therefore, naturally, I want to learn and apply the techniques of modern data science and machine learning to improve the design, reliability, and operation of such devices and systems.

But I am no computer science graduate. I could not tell a linked list from a heap. Support vector machines sounded (a few months back) like some special equipment for people with disabilities. And the only AI keyword I remembered (from my junior-year elective course) was ‘*first-order predicate calculus*’, a remnant of the so-called ‘*old AI*’ or *knowledge-engineering* approach, as opposed to the newer machine-learning-based approach.

I had to start somewhere to learn the basics and then study my way deeper. The obvious choice was MOOCs (Massive Open Online Courses). I am still very much in the learning phase, but I believe that I have at least gathered some good experience in choosing the right MOOCs for this path. In this article, I want to share my insights on that aspect.

Sorry for the bad analogy. It’s from Netflix’s latest superhero ensemble saga, The Defenders.

But it’s true that you should know your strengths, weaknesses, and technical inclinations very well before you start the learning-through-MOOC process.

Because, let’s face it, time and energy are limited, and you cannot afford to waste your precious resources on something you are highly unlikely to practice in your current work or future job. And this is assuming that you want to take the **(almost) free learning path, i.e. auditing the MOOCs rather than paying for the certificates**. I have an ‘

In this picture, I just want to show the possibilities and impossibilities of this process, i.e. what you can hope to learn through self-study and practice, what must be learned on the job, and what kind of mentality must be cultivated no matter what your profession is. Having said that, however, these circles broadly encompass the core skills that one can study to venture into the field of data science/machine learning from a non-CS background. Please note that even if you are in the information technology (IT) sector, you may have a steep learning curve ahead, because traditional IT is being disrupted by these new fields, and the core skills and good practices are often different.

I, for one, view the data science field as more democratic than many other professional domains (e.g. my own area of work, semiconductor technology): the entry barrier is low, and with sufficient hard work and zeal, anybody can acquire meaningful skills. For me personally, I have no burning desire to ‘break into’ this field; rather, I just have a passion to borrow its fruits and apply them to my own area of expertise. However, that end goal does not change the initial learning curve one has to traverse. So, you could be aiming to be a data engineer, a business analyst, a machine learning scientist, or a visualization expert; the field and choices are wide open. And if your aim is like mine, to stay in your current domain of expertise and apply the newly learned techniques, you are fine too.

I started with the real basics — **learning Python on Codecademy**. In all likelihood, you cannot go more basic than this :-). It worked, though. I had an aversion to coding, but the simple, fun interface and the right pace of Codecademy’s free course were enough to excite me to keep going. I could have picked a Java or C++ course on Coursera, DataCamp, or Udacity, but some reading and research told me that Python is the optimal choice, balancing learning complexity and utility (especially for data science), and I decided to trust that insight.

Codecademy’s introduction was a fine base to start with. I had choices from so many online MOOC platforms, and predictably enough, I signed up for multiple courses at the same time. However, after dabbling with a Coursera class for a few days, I realized I was not ready to learn Python from a professor! I was looking for a course taught by an enthusiastic instructor who would take the time to go over the concepts in great detail, teach me other essential tools like Git and the Jupyter notebook system, and maintain the right balance between basic concepts and advanced topics in the curriculum. And I found the right man for the job: Jose Marcial Portilla. He offers multiple courses on Udemy and is one of the most popular and positively reviewed instructors on that platform. I signed up for and completed his **Python Bootcamp** course. It was an amazing introduction to the language, with the right pace, depth, and rigor. I highly recommend this course for new learners, even though you have to fork out $10 (*Udemy courses are generally not free and their regular price is $190 or $200, but you can always wait a few days for the recurring promotion cycle and sign up for $10 or $15*).

The next step proved crucial for me. I could have gone astray and tried to study anything and everything I could about Python — especially the object-oriented programming and class-definition part, which can easily suck you into a long and arduous journey. Taking nothing away from that key sphere of the Python universe, one can safely say that you can practice deep learning and good data science without being able to define your own classes and methods in Python. One of the fundamental reasons for Python’s ever-increasing popularity as the de facto language of choice for data science is the availability of a large number of high-quality, peer-reviewed, expert-written libraries, classes, and methods, just waiting to be downloaded in a nicely packaged form and unwrapped for seamless integration into your code.

Therefore, it was important for me to quickly jump into the packages and methods used most widely for data science — NumPy, Pandas, and Matplotlib.
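To give a flavor of how little ceremony these packages require, here is a tiny, illustrative sketch (the temperature data is made up) of NumPy for arrays, Pandas for labeled tables, and Matplotlib (driven through Pandas) for plotting:

```python
import numpy as np
import pandas as pd

# NumPy holds the raw numbers; Pandas wraps them in a labeled table.
temps = np.array([21.5, 23.0, 22.1, 24.3, 25.0])
df = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
                   "temp_c": temps})

mean_temp = df["temp_c"].mean()   # one-line summary statistic
# df.plot(x="day", y="temp_c")    # would draw a Matplotlib line chart
```

Three lines of setup and one method call already give you a descriptive statistic; that economy is exactly why these three libraries are the usual starting point.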

I was introduced to those by a neat little course from edX. Although most courses on edX are from universities and rigorous (and long-ish) in nature, there are a few shorter, more hands-on, less theoretical courses offered by technology companies like Microsoft. One of them is the **Microsoft Professional Program in Data Science**. You can register for as many courses under this program as you want. However, I took only the following courses (and I intend to come back for the others in the future):

- **Data Science Orientation**: Discusses the everyday life of a typical data scientist and touches upon the core skills one is expected to have in this role, along with a basic introduction to the constituent subjects.
- **Introduction to Python for Data Science**: Teaches the basics of Python — data structures, loops, functions — and then introduces *NumPy*, *Matplotlib*, and *Pandas*.
- **Introduction to Data Analysis using Excel**: Teaches basic and a few advanced data analysis functions, plotting, and tools with Excel (e.g. *pivot table*, *power pivot*, and the *solver plug-in*).
- **Introduction to R for Data Science**: Introduces R syntax, data types, vector and matrix operations, factors, functions, data frames, and graphics with *ggplot2*.

Although these courses present the material in a rudimentary fashion and cover only the most basic examples, they were enough to light the spark. Boy, I was hooked!

The last course made me realize a few important things: (a) statistics and linear algebra are at the core of the data science process, (b) I did not know, or had forgotten, enough of both, and (c) R is naturally suited for the kind of work I want to do with my data sets — a few MB of data generated by controlled wafer fab experiments or TCAD simulations, primed for basic inferential analysis.

This prompted me to search for a solid introductory course in the R language, and who better to turn to than Jose Portilla again! I signed up for his “**Data Science and Machine Learning Bootcamp with R**” class. This was a ‘buy one, get one free’ deal, as the course covered the essentials of the R language in the first half and then switched to teaching basic machine learning concepts (all the important concepts expected in an introductory course were covered with sufficient care). Unlike the edX Microsoft course, which used a server-based hands-on lab environment, this course covered the installation and setup of RStudio and the necessary packages, introduced me to Kaggle, and gave me the push required to graduate from being a passive learner (aka MOOC video watcher) to a person who is not afraid of playing with data. It also followed the great “**An Introduction to Statistical Learning**” (ISLR) book by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, chapter by chapter.

If you are allowed to read only one book in your lifetime to learn machine learning and nothing else, pick this book and read all the chapters, no exceptions. By the way, there is no neural network or deep learning material in this book, so there’s that…

Armed with the course materials, the ISLR book, and practice on random data sets downloaded from Kaggle (or even my own electricity usage data from PG&E), I was no longer afraid of writing small bits of code that could actually model something interesting or useful. I analyzed some US county-level crime data, showed why a large design-of-experiment can lead to spurious correlations, and even modeled my apartment’s electricity usage over the past 3 months. I also successfully used R to build predictive models based on some real-world data sets from my work. The statistical/functional nature of the language, and the ready-made measures of statistical confidence (p-values, z-scores, confidence intervals) for a variety of models (regression or classification), really help a new learner gain an easy foothold in the domain of statistical modeling.

This aspect of learning cannot be over-emphasized — especially for non-CS graduates and IT engineers who have not been in touch with rigorous mathematics for some years into their professional lives. I even wrote a Medium article on what mathematics knowledge is necessary for machine learning and data science.

For this, I chose a few courses from Coursera and edX. A few of them stand out in their depth and rigor. Those are:

- **Statistical Thinking for Data Science and Analytics (Columbia Univ.)**: A foundational statistics course from Columbia University’s Data Science Executive certificate program on edX. Rigorous, but it drills down into the concepts very well, in a structured manner.
- **Computational Probability and Inference (MIT)**: This is a hard one from MIT, be aware! It covers advanced topics like Bayesian models and graphical models in unparalleled depth.
- **Statistics with R Specialization (Duke Univ.)**: This is a 5-course specialization (the last one is a capstone project, which you can ignore) from Duke University to strengthen your statistics foundation along with hands-on programming exercises. Recommended for its balanced difficulty level and rigor.
- **LAFF: Linear Algebra — Foundations to Frontiers (UT Austin)**: This is an amazing course on the foundations of linear algebra (along with deep discussion of high-performance computing of linear algebra routines) that you must try. Offered by the University of Texas at Austin on the edX platform. Trust me when I say that after taking this course you will never want to invert a matrix to solve a linear system of equations, however tempting and easy to understand that may be; instead, you will look for a QR factorization or Cholesky decomposition to reduce the computational complexity.
- **Optimization Methods in Business Analytics (MIT)**: This is a course on optimization/operations research methods for business analytics from MIT. I signed up because this was the only highly-rated course on a good platform (edX) that I could find about linear and dynamic programming techniques. I believed that learning those techniques could be immensely helpful, as an **optimization problem turns up in almost every machine learning algorithm**.
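The point about never inverting a matrix is easy to verify yourself. Here is a minimal NumPy sketch (made-up, well-conditioned matrix) comparing the explicit-inverse route with a factorization-based solver:

```python
import numpy as np

# Build a random symmetric positive definite system Ax = b.
rng = np.random.default_rng(42)
n = 300
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)       # SPD and well conditioned by construction
b = rng.standard_normal(n)

x_inv = np.linalg.inv(A) @ b      # explicit inverse: more flops, less stable
x_fact = np.linalg.solve(A, b)    # factorization-based solve — the preferred route
```

Both give the same answer here, but the factorization route is cheaper and numerically safer; for an SPD matrix like this one, `scipy.linalg.cho_factor`/`cho_solve` would exploit the Cholesky decomposition the course discusses.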

Please note that I did not search and sign up for any calculus course as I was comfortable with the level of knowledge I could remember (from college days) and what I expected to be useful for any machine learning or data science study and practice. If you are rusty in that area, please search for a good one.

Somewhere among all these side-studies, I managed to complete the course that is considered one of the pioneers of all MOOCs — Andrew Ng’s machine learning course on Coursera. I guess there are plenty of articles written about it already, so I will not waste any more of your time describing this course. Just take it, do all the homework and programming assignments, **learn to think in terms of vectorized code for all the major machine learning algorithms** that you know of, and save the notes for ready reference in your future work.
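To make “vectorized” concrete, here is an illustrative sketch (in Python rather than the course’s MATLAB, with synthetic data) of batch gradient descent for linear regression, where one matrix expression replaces any per-example loop:

```python
import numpy as np

# Synthetic data: y = 2 + 3x + noise, with a bias column prepended to X.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 10, size=100)]
true_theta = np.array([2.0, 3.0])
y = X @ true_theta + rng.normal(0, 0.1, size=100)

theta = np.zeros(2)
alpha = 0.01
for _ in range(5000):
    # The whole gradient over all 100 examples in one matrix expression:
    grad = X.T @ (X @ theta - y) / len(y)
    theta -= alpha * grad
```

After training, `theta` lands close to the true `[2.0, 3.0]` — and the same vectorized habit carries over to every other algorithm in the course.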

Oh, by the way, if you want to brush up on MATLAB or learn it from scratch (you will need to write MATLAB code for this course, not R or Python), then you can check out this course: **Introduction to Programming with MATLAB**.

Now, I want to talk about personalities.

I took multiple machine learning courses, and the aspect I enjoyed most was realizing how the treatment of the same fundamental subject becomes a function of the personality and worldview of different instructors. This was a fascinating experience.

I am listing down the various machine learning MOOCs I signed up for and covered:

- **Machine Learning (Stanford Univ.)**: Andrew Ng’s widely known course. Talked about it in the paragraph above.
- **Machine Learning Specialization (Univ. of Washington)**: This comes with a different flavor than Ng’s course. Emily Fox and Carlos Guestrin present the concepts from a statistician’s and a practitioner’s perspective, respectively. I could not install the Python package that Carlos’ company offers under a free license, but this specialization is worth completing for its theory lectures alone. **The proofs and discussion of some of the fundamental concepts, like the bias-variance trade-off, cost computation, and the comparison of analytic vs. numerical approaches for cost function minimization, are presented even more intuitively and carefully than in Prof. Ng’s course** (and that’s saying something, given the superb quality of Prof. Ng’s teaching).
- **Machine Learning for Data Science and Analytics (Columbia Univ.)**: This course had a somewhat unusual syllabus for a general machine learning course, devoting the full first half to lectures on conventional algorithms. It covered essential sorting, searching, graph-traversal, and scheduling algorithms. There is not much one-to-one discussion about how these algorithms are used in machine learning problems, but studying them gives you an idea of the traditional computer science knowledge necessary to appreciate how large-scale data science problems are tackled. **Think O(n^3) whenever you are about to multiply two matrices, or O(n log(n)) whenever you are sorting a list.** You may not use this knowledge explicitly in your day-to-day job, but knowing about these nuts and bolts of the computation process certainly broadens your worldview about the problem at hand.
- **Data Science: Data to Insights (MIT xPro, 6-week online course)**: This is among the very few paid courses I have taken (I generally go the audit route for MOOCs). It is not available on the public edX website, although it uses the edX platform for delivering content. The 6-week course is well structured and full of interesting content that opens up the wide world of data science and machine learning to the uninitiated. The case studies are very interesting but reasonably hard and time-consuming to codify. The lectures are very engaging, illustrated with those case studies. My particular favorite module was the one about recommendation systems. **I literally started viewing the Netflix screen on my laptop in terms of an adjacency matrix after taking this class**!
- **Neural Networks for Machine Learning (Univ. of Toronto)**: This is a somewhat underrated course on Coursera, even with the neural network pioneer Geoffrey Hinton as the instructor. I realize that Andrew Ng’s new Deep Learning specialization will directly compete with this course, and I would not be surprised if Coursera removes it in the near future. However, while it is there, **a deep learning enthusiast should sit through this one, even if just to gauge the pattern of the historical development of deep networks.**
- **Deep Learning Specialization (deeplearning.ai)**: This is the newest kid on the block, but it stands on the very broad shoulders of Andrew Ng and therefore boasts very strong legs. I have finished the 2nd course and am on to the 3rd now. The jury is still out, but you should definitely consider completing this series if you want to brush up on the latest trends in deep learning. Even if the programming assignments look hard and you want to stay away from programming a deep network by hand (you can argue there are always excellent open-source packages like TensorFlow, Keras, and Theano out there to take care of the nuts and bolts under the hood), **it is imperative to have a deep understanding of the essential concepts such as regularization, exploding gradients, hyperparameter tuning, and batch normalization to effectively use those high-level deep learning frameworks.**
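The O(n^3) remark above is worth making concrete once. The textbook matrix multiply below does n × m × k multiply-adds — cubic in n for square matrices — which is exactly why library routines (and smarter algorithms) matter at scale:

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook triple-loop matrix multiplication: O(n*m*k) multiply-adds."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(n):            # n rows ...
        for j in range(m):        # ... times m columns ...
            for p in range(k):    # ... times k terms per output entry
                C[i, j] += A[i, p] * B[p, j]
    return C
```

Doubling n multiplies the work by eight; `A @ B` in NumPy computes the same result via heavily optimized BLAS routines.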

As we draw closer to the end of this long article, I wanted to list a few multi-course MOOC programs I found interesting and useful to go along with the specific subject areas mentioned above.

- **Data Science Specialization (Johns Hopkins Univ.)**: This is a well-known 10-course specialization offered on Coursera. Not every course will appeal to every learner; I personally completed only 5 of the 10. **The key thing is the timing, i.e. when to start this specialization**. It often comes up at the top of the Google results when one researches MOOCs for data science, and therefore it becomes the first MOOC for many new learners. Personally, I would have had a problem getting the full value from it if I had done that. The introductory Microsoft and Udemy courses on R, and a few statistics and linear algebra courses before this, helped me immensely to extract the full benefit from this set of courses. As the specialization is taught by professors from the biostatistics department of JHU, **one gets an excellent treatment of two aspects of data science which are often under-represented in many curricula — research studies and design of experiments.**
- **Data Science MicroMasters certificate program (UC San Diego)**: I have just enrolled in and started the 1st of the 4 courses in this series/certificate program. I like the fact that it is similar in breadth and goals to the Johns Hopkins specialization, except that it chooses Python as the working language for the hands-on portion. The structure and content seem well thought out, covering the basics of Python, Git, and Jupyter all the way up to big data processing with the Apache Spark framework (with statistics and machine learning courses thrown in the middle). The case studies and hands-on examples are drawn from real-world applications of data science such as wildfire modeling, cholera outbreaks, and world development indicator analysis. One of the lead instructors is Ilkay Altintas, who has created an amazing platform for helping predict wildfire dynamics and is putting the fruits of data science research toward societal good. I am sure my journey with this specialization will be an exciting and rewarding one. You are welcome to join the party!
- **Data Science Capstone Project (Simplilearn)**: Through dedicated mentoring sessions, you’ll learn how to solve a real-world, industry-aligned data science problem, from data processing and model building to reporting your business results and insights.

As the usage of Big Data (petabyte scale) grows in every facet of business and analytics, engineers who can mix Big Data and machine learning skills effectively will be in high demand. Therefore, it makes sense to focus on a couple of courses centered around this mix.

**Simplilearn’s Machine Learning Certification Course**: This course helps you master the concepts of supervised and unsupervised learning, recommendation engines, and time series modeling through a hands-on approach that includes working on four major end-to-end projects and 25+ hands-on exercises. Major industry partners such as Uber and Mercedes-Benz advise on the curriculum.

**Google’s Cloud-based TensorFlow specialization (Coursera)**: This 5-course specialization focuses on advanced machine learning topics using Google Cloud Platform, where you get hands-on lab experience optimizing, deploying, and scaling production ML models of various types.

With the advent of MOOCs, open-source programming platforms, collaboration tools, and virtually unlimited free cloud-based storage, learning is as democratized, ubiquitous, and universally accessible as it can get. If you are not a specialist in data science/machine learning but want to learn the subject, write some code for higher productivity at work, strive for a career enhancement, or just have some fun, now is the time to start learning. A few parting comments:

- **You are a data scientist**: Do not let any so-called expert demoralize you by saying something like “*MOOCs are for kids, you won’t learn real data science like that*”. The very fact that you are trying to learn data science by enrolling in a MOOC means two things: (a) you already deal with data in your professional life, and (b) you want to learn a scientific, structured manner of extracting maximum value from your data and generating intelligent questions around that data. That means you, my friend, are already a data scientist. If you are still not convinced, **read this blog by Brandon Rohrer**, one of the most admired and inspirational data scientists that I know of.
- **You don’t have to spend a large sum for this learning**: I know that I listed a lot of courses and they may look expensive, but fortunately, most (if not all) can be enrolled in free of cost. edX courses are always free to enroll in, and they generally don’t have any restrictions on course content, i.e. you can view, execute, and submit all the graded assignments (unlike Coursera, which lets you watch all the videos but hides the graded material). If you think some certificate is worth showcasing on your resume, you can always pay for it in the middle of the course, after you have completed some videos and judged its merit and utility.
- **Practice, code, and build things to supplement your online learning**: There is a real algorithm called ‘online learning’ in the context of machine learning. In this technique, instead of processing a full matrix of millions of data points, the algorithm works with the latest few data points and updates its prediction. You can work in this mode too. The halting problem/parking problem is always a fascinating one, and it applies to learning too: we always wonder how much to study and assimilate before building things, i.e. where to halt the learning and start implementing. Don’t hesitate, don’t procrastinate. Learn a concept and test it with simple code. Work with the latest trick or technique you watched a video about; don’t wait to achieve mastery over the entire topic. You will be amazed by how a simple 20 lines of code can give you solid practice (and make you sweat enough) on the most complex concept you learned from that video.

- **There is plenty of data out there**: You will also be amazed by how many rich sources of free data are out there on the web. Don’t just go to Kaggle; try something different for fun.

*About the Author: Tirthajyoti Sarkar holds a Ph.D. in EE | AI, data science, and semiconductors | AI/ML certifications (Stanford, MIT) | Data science author | Speaker | Open-source contributor.*

**Note:** This article was originally published on towardsdatascience.com, and kindly contributed to DPhi to spread the knowledge.

The post How to choose effective courses for machine learning and data science appeared first on DPhi.

The post How much mathematics does an IT engineer need to learn to get into data science/machine learning? appeared first on DPhi.

First, the disclaimer: I am not an IT engineer.

I work in the field of semiconductors, specifically high-power semiconductors, as a technology development engineer, whose day job consists of dealing primarily with semiconductor physics, finite-element simulation of the silicon fabrication process, or electronic circuit theory. There is, of course, some mathematics in this endeavor, but for better or worse, I don’t need to dabble in the kind of mathematics that will be necessary for a data scientist.

However, I have many friends in the IT industry and have observed that a great many traditional IT engineers are enthusiastic about learning or contributing to the exciting fields of data science, machine learning, and artificial intelligence.

I am dabbling in this field myself, to learn some tricks of the trade which I can apply to the domain of semiconductor device or process design. But when I started diving deep into these exciting subjects (by self-study), I quickly discovered that I don’t know, have only a rudimentary idea about, or have mostly forgotten the essential mathematics I studied as an undergraduate. In this LinkedIn article, I ramble about it…

Now, I have a Ph.D. in Electrical Engineering from a reputed US university, and still I felt my preparation was incomplete: a solid grasp of machine learning or data science techniques was not possible without a refresher in some essential mathematics.

Meaning no disrespect to an IT engineer, I must say that the very nature of his/her job and long training generally leave him/her distanced from the world of applied mathematics. (S)he may be dealing with a lot of data and information on a daily basis but there may not be an emphasis on rigorous modeling of that data. Often, there is immense time pressure, and the emphasis is on ‘*use the data for your immediate need and move on*’ rather than on deep probing and scientific exploration of the same.

However, data science should always be about science (*not* data), and following that thread, certain tools and techniques become indispensable.

These tools and techniques — modeling a process (physical or informational) by probing the underlying dynamics, rigorously estimating the quality of a data source, training one’s sense for identifying hidden patterns in a stream of information, or understanding clearly the limitations of a model — are the hallmarks of a sound scientific process.

They are often taught in advanced graduate-level courses in an applied science/engineering discipline. Or, one can imbibe them through high-quality graduate-level research work in a similar field. Unfortunately, even a decade-long career in traditional IT (DevOps, databases, or QA/testing) will fall short of rigorously imparting this kind of training. There is, simply, no need.

Until now.

You see, in most cases, having impeccable knowledge of SQL queries, a clear sense of the overarching business need, and an idea of the general structure of the corresponding RDBMS is good enough for any IT engineer worth his/her salt to perform the extract-transform-load cycle and thereby generate value for the company.

But what happens if someone drops by and starts asking a weird question like “*is your artificially synthesized test data set random enough?*” or “*how would you know if the next data point is within the 3-sigma limit of the underlying distribution of your data?*” Or the occasional quip from the next-cubicle computer science graduate/nerd — that the computational load for *any meaningful mathematical operation with a table of data (aka a matrix) grows non-linearly with the size of the table*, i.e. the number of rows and columns — can be exasperating and confusing.
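The 3-sigma question above, by the way, takes only a few lines to answer once you know the underlying idea. A toy sketch with made-up measurements, using only the standard library:

```python
import statistics

# Made-up sample of measurements from some process.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.05, 9.95, 10.1]
mu = statistics.mean(data)
sigma = statistics.stdev(data)    # sample standard deviation

def within_3_sigma(x):
    """Is a new point within 3 standard deviations of the sample mean?"""
    return abs(x - mu) <= 3 * sigma
```

A reading of 10.0 passes; a reading of 13.0 is flagged as an outlier. The mathematics is trivial; knowing to ask the question is the skill.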

And these types of questions are growing in frequency and urgency, simply because data is the new currency.

Executives, technical managers, and decision-makers are no longer satisfied with just the dry description of a table obtained by traditional ETL tools. They want to see the hidden patterns, yearn to feel the subtle interactions between the columns, and would like to get the full descriptive and inferential statistics that may help in predictive modeling and in extending the projection power of the data set far beyond the immediate range of values it contains.

Today’s data must tell a story, or, sing a song if you like. However, to listen to its beautiful tune, one must be versed in the fundamental notes of the music, and those are mathematical truths.

Without much further ado, let us come to the crux of the matter. What are the essential topics/sub-topics of mathematics that an average IT engineer must study/refresh if (s)he wants to enter the field of business analytics/data science/data mining? I’ll show my idea in the following chart.

**Basic Algebra, Functions, Set Theory, Plotting, Geometry**

Always a good idea to start at the root. The edifice of modern mathematics is built upon some key foundations — set theory, functional analysis, number theory, etc. From an applied mathematics learning point of view, we can simplify studying these topics through some concise modules (in no particular order):

a) set theory basics, b) real and complex numbers and basic properties, c) polynomial functions, exponential, logarithms, trigonometric identities, d) linear and quadratic equations, e) inequalities, infinite series, binomial theorem, f) permutation and combination, g) graphing and plotting, Cartesian and polar coordinate systems, conic sections, h) basic geometry and theorems, triangle properties.

Sir Isaac Newton wanted to explain the behavior of heavenly bodies. But he did not have a good enough mathematical tool to describe his physical concepts, so he invented this branch of mathematics (or at least a certain modern form of it) while hiding away on his countryside farm from the plague outbreak in urban England. Since then, it has been considered the gateway to advanced learning in any analytical study — pure or applied science, engineering, social science, economics, …

Not surprisingly, then, the concepts and applications of calculus pop up in numerous places in the field of data science and machine learning. The most essential topics to be covered are as follows:

a) Functions of single variable, limit, continuity and differentiability, b) mean value theorems, indeterminate forms and L’Hospital rule, c) maxima and minima, d) product and chain rule, e) Taylor’s series, f) fundamental and mean value-theorems of integral calculus, g) evaluation of definite and improper integrals, h) Beta and Gamma functions, i) Functions of two variables, limit, continuity, partial derivatives, j) basics of ordinary and partial differential equations.
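A simple way to make items like (d) stick is to check them numerically. The sketch below (an illustrative choice of function) verifies the chain rule for f(x) = sin(x²) against a central finite difference:

```python
import math

def f(x):
    return math.sin(x ** 2)

def f_prime(x):
    # Chain rule: d/dx sin(g(x)) = cos(g(x)) * g'(x), with g(x) = x^2.
    return math.cos(x ** 2) * 2 * x

x0, h = 1.3, 1e-6
# Central finite difference approximates the derivative numerically.
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)
```

The analytic and numeric derivatives agree to many decimal places — a tiny confidence boost that the symbolic rule you half-remember is indeed right.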

Got a new friend suggestion on **Facebook**? A long-lost professional contact suddenly added you on

Doesn’t it feel good to know that if you learn basics of linear algebra, then you are empowered with the knowledge about the basic mathematical object that is at the heart of all these exploits by the high and mighty of the tech industry?

At least, you will know the basic properties of the mathematical structure that controls what you shop for on **Target**, how you drive using

The essential topics to study are (not an ordered or exhaustive list by any means):

a) basic properties of matrices and vectors — scalar multiplication, linear transformation, transpose, conjugate, rank, determinant, b) inner and outer products, c) matrix multiplication rule and various algorithms, d) matrix inverse, e) special matrices — square matrix, identity matrix, triangular matrix, idea about sparse and dense matrices, unit vectors, symmetric matrix, Hermitian, skew-Hermitian and unitary matrices, f) matrix factorization concept/LU decomposition, Gaussian/Gauss-Jordan elimination, solving the Ax=b linear system of equations, g) vector space, basis, span, orthogonality, orthonormality, linear least squares, h) singular value decomposition, i) eigenvalues, eigenvectors, and diagonalization.
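Item (i) in the list above, for instance, is two NumPy calls away. A minimal sketch (toy symmetric matrix) of computing an eigendecomposition and rebuilding the matrix from it:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# For a symmetric matrix, eigh returns real eigenvalues (ascending)
# and orthonormal eigenvectors as columns of Q.
vals, vecs = np.linalg.eigh(A)

# Diagonalization: A = Q @ diag(lambda) @ Q.T
A_rebuilt = vecs @ np.diag(vals) @ vecs.T
```

Reconstructing A from its eigenpairs is a quick sanity check that you understand what diagonalization actually claims.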

Here is a nice Medium article on what you can accomplish with linear algebra.

Only death and taxes are certain; for everything else, there is the normal distribution.

The importance of having a solid grasp of the essential concepts of statistics and probability cannot be overstated in a discussion about data science. Many practitioners in the field actually call machine learning nothing but statistical learning. I followed the widely known “An Introduction to Statistical Learning” while working on my first MOOC in machine learning and immediately realized the conceptual gaps I had in the subject. To plug those gaps, I started taking other MOOCs focused on basic statistics and probability and reading up on/watching videos about related topics. The subject is vast and endless, and therefore focused planning is critical to cover the most essential concepts. I am trying to list them as best I can, but I fear this is the area where I will fall short the most.

a) data summaries and descriptive statistics, central tendency, variance, covariance, correlation, b) Probability: basic idea, expectation, probability calculus, Bayes theorem, conditional probability, c) probability distribution functions — uniform, normal, binomial, chi-square, Student’s t-distribution, central limit theorem, d) sampling, measurement, error, random numbers, e) hypothesis testing, A/B testing, confidence intervals, p-values, f) ANOVA, g) linear regression, h) power, effect size, testing means, i) research studies and design-of-experiment.
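Item (b) in the list, Bayes’ theorem, is a good first thing to code up, because the arithmetic is trivial while the result surprises most people. A toy sketch (all numbers made up) of computing P(disease | positive test):

```python
# Prior, sensitivity, and false-positive rate (illustrative values).
p_disease = 0.01                  # prevalence
p_pos_given_disease = 0.95        # sensitivity
p_pos_given_healthy = 0.05        # false-positive rate

# Law of total probability: P(positive) over both populations.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

With these numbers the posterior is only about 16% — a positive test on a rare condition is far weaker evidence than intuition suggests, which is exactly the kind of statistical instinct a data scientist needs.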

Here is a nice article on the necessity of statistics knowledge for a data scientist.

These topics are a little different from the traditional discourse in applied mathematics, as they are most relevant and most widely used in specialized fields of study — theoretical computer science, control theory, or operations research. However, a basic understanding of these powerful techniques can be so fruitful in the practice of machine learning that they are worth mentioning here.

For example, virtually every machine learning algorithm/technique aims to minimize some kind of estimation error subject to various constraints. That, right there, is an optimization problem, which is generally solved by linear programming or similar techniques. On the other hand, it is always a deeply satisfying and insightful experience to understand a computer algorithm’s time complexity, as it becomes extremely important when the algorithm is applied to a large data set. In this era of big data, where a data scientist is routinely expected to extract, transform, and analyze billions of records, (s)he must be extremely careful about choosing the right algorithm, as it can make all the difference between amazing performance and abject failure. The general theory and properties of algorithms are best studied in a formal computer science course, but to understand how their time complexity (i.e. how much time the algorithm will take to run for a given size of data) is analyzed and calculated, one must have rudimentary familiarity with mathematical concepts such as *dynamic programming* and *recurrence equations*. Familiarity with the technique of *proof by mathematical induction* can be extremely helpful too.
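The recurrence/dynamic-programming point is easy to feel in code. The same Fibonacci recurrence costs exponential time when evaluated naively but linear time once intermediate results are memoized — a minimal sketch using only the standard library:

```python
from functools import lru_cache

def fib_naive(n):
    # Recomputes subproblems over and over: roughly O(phi^n) calls.
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    # Dynamic programming via memoization: each n computed once, O(n) total.
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)
```

`fib_naive(50)` would take hours; `fib_memo(50)` returns instantly. The algorithm is identical — only the treatment of the recurrence changed, which is the whole lesson of complexity analysis in miniature.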

Scared? Mind-bending list of topics to learn just as pre-requisite? Fear not, you will learn on the go and as needed. But the goal is to keep the windows and doors of your mind open and welcoming.

There is even a concise MOOC course to get you started. Note that this is a beginner-level course for refreshing your high-school or freshman-year knowledge. And here is a summary article on the 15 best math courses for data science on KDnuggets.

But you can be assured that, after refreshing these topics (many of which you may have studied as an undergraduate) or even learning the new concepts, you will feel so empowered that you will definitely start to hear the hidden music that the data sings. And that’s a big leap towards becoming a data scientist…

**Note:** *This article was originally published on towardsdatascience.com, and kindly contributed to DPhi to spread the knowledge.*

Become a guide. Become a mentor. We at DPhi welcome you to share your experience in data science — be it your learning journey, your experience participating in data science challenges, data science projects, tutorials, or anything else related to data science. Your learnings could help a large number of aspiring data scientists! Interested? Submit here.

