Uncategorized

Coding for Data Science Tips 2 – Standardize csv reading between Windows and Mac

When one starts to learn data science, it is extremely useful to ask other data scientists and data enthusiasts for feedback on the quality of our code and on the process we are using to analyse data. To ease this, we often send notebooks and projects back and forth. But way too often the code looks like this.
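A hypothetical example of that kind of hard-coded path (the user name, folders and file name are made up for illustration):

```python
from pathlib import PureWindowsPath

# Hypothetical path: the user name, folders and file name are made up.
hard_coded = r"C:\Users\maria\Documents\projects\analysis\data\survey.csv"

# Passing this string to a csv reader only works on a machine that has
# exactly this drive, user and folder layout.
is_machine_specific = PureWindowsPath(hard_coded).is_absolute()
print(is_machine_specific)
```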

Don’t do this. This points the code to a file path specific to your computer and makes life a lot more... boring for those wanting to help you.

Instead, try something like this. Keep a folder for your data inside your project folder and name it something straightforward like... data. The function that loads a .csv or .xlsx resolves relative paths from the working directory, which is normally the folder where the notebook is, so a path that starts at the data folder won’t be a problem. The code will look like this:
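A minimal sketch of the idea, assuming a folder called data next to the notebook. The file name sample.csv and its contents are made up, and the sketch uses Python's built-in csv module so that it runs on its own; with pandas (an assumption, since the post does not name the library) the same relative path would be passed to pd.read_csv.

```python
import csv
from pathlib import Path

# Hypothetical layout: a "data" folder sitting next to the notebook.
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

# Write a tiny made-up sample so the sketch runs on its own.
sample = data_dir / "sample.csv"
sample.write_text("name,value\na,1\nb,2\n", encoding="utf-8")

# Relative path: resolved from the current working directory,
# which is normally the folder the notebook lives in.
with open("data/sample.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

Note that the path uses forward slashes, which Python accepts on Windows as well as on Mac and Linux.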

Much simpler, isn’t it? With this, all you need to do to ask another data enthusiast for feedback is to copy the folder where the notebook is (data folder included), and the code should work fine.

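You can take an extra step to make the path more robust across environments (Mac, Linux and Windows). Add an r before the path to turn it into a raw string, so that any backslashes in it are kept as typed instead of being read as escape characters. Keep in mind that the backslash is only a path separator on Windows; forward slashes work on all three systems. The r goes right before the opening quote, like this: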
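A small sketch of what the r prefix changes, using a made-up file name (new_table.csv is hypothetical, picked because it starts with an n). The last part is an addition not taken from the post: it uses pathlib, from the standard library, as one way to build a path that does not depend on the separator at all.

```python
from pathlib import Path, PureWindowsPath

# Hypothetical file name, chosen because "\n" is an escape sequence.
plain = "data\new_table.csv"   # "\n" is turned into a newline character
raw = r"data\new_table.csv"    # raw string: the backslash is kept as typed

print(repr(plain))
print(repr(raw))

# A raw backslash path is still Windows-style. Building the path from
# its parts lets pathlib pick the right separator on each system.
portable = Path("data") / "new_table.csv"
print(portable.as_posix())

# Parsed as a Windows path, the raw string names the same two parts.
windows_parts = PureWindowsPath(raw).parts
print(windows_parts)
```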

As usual, all tips are stored as code at https://github.com/insilicobiologyblog/DScodingTips so you can check them.

Any tips & tricks you might have for coding in Data Science for all levels of data scientists? Share them with me! =)

Biology

Lessons from Complexity Sciences that make me a Better Data Scientist

The whole is more than the simple sum of its parts.

It was with this very short phrase that my fascination with complexity sciences began.

To sum up what is a huge area of knowledge, complexity sciences aim to study complex systems: systems that display emergent behaviours which cannot be predicted from the characteristics of their parts. From societies to astronomy, biology, medicine, politics and economics... we find complex systems everywhere.

From the birth of complexity sciences (around the 1950s) to today, those who study this area of knowledge have gained access to plenty of new techniques and methodologies, mainly from technology and mathematics. One of the most used techniques is to model the system being studied mathematically and use data to assess the model's accuracy.

Using data to back up the mathematical model is crucial and eventually... this led to a whole different area inside the complexity sciences constellation... ya know... data science. 🙂 We can say that complexity sciences are the grandma of data science.

Who can find data science in this map of the complexity sciences? =)

All these years and a Master's in Complexity Sciences later, I still keep with me very important lessons from my time as a complexity sciences student.

1 – Knowing where your data comes from matters, a lot – All data has an origin, and that origin, more often than not, is the reason for many peculiar things in the data. Going to the source, as much as possible, allows us to understand the story of the data and the system. And more often than not, the problems one finds in data analysis... yep, you’ll find why they happen at the origin.

2 – Considering the system where the data exists is crucial in your work – Your data doesn’t exist in a void. If you want a productive project that will be used by those around you, you need to consider the system in which the data is integrated. What is the tech stack available to you? How reliable is it? Who are your “clients” within the system? What is the level of mathematics and data visualization they are used to? Are people going to directly interact with your model? If yes, how? All of these are questions we need to answer when working on a project in order to provide a suitable answer for it. Every system has its own peculiarities, so take the time to get to know them.

3 – Simplify whenever possible, explain whenever needed, keep everything registered – Do not overcomplicate a model. This seems counterintuitive, doesn’t it? However, when we spend our time trying our best to find the underlying rules of a model and simplifying them as much as possible, we gain a better understanding of the system and are better able to work within it. Add to this a careful record of your work, with useful explanations for your client that make everything as clear as possible, and you’ll be an effective data scientist. It will also help you maintain the model for a long time, avoiding the “Wtf is this?” situations as much as possible.

4 – Finally, be open to the wonders of emergent behaviour – One of the coolest parts of a complex system is what is called emergent behaviour, or behaviour that cannot be deduced from the characteristics of the parts of the system alone. It gives us a peek at the wonders of communication and network behaviour that exist in such wonders of nature as a beehive, the ocean or even yourself, dear human. Ya know that you are a freakish awesome marvel of nature that is more than the simple sum of your cells? =) Right. These emergent behaviours are, more often than not, sources of great study and research projects that aim to discover the wonders of everything around us, and they might give you a peek at crucial factors to study in the system you are working on. Let some control go and let nature lead you, you’ll be amazed.

And if none of the four points got you thinking about complexity sciences, maybe these adorable pups will. Remember, none of them knows how to create a pinwheel, all they know is that they want that milk... badly =)

Uncategorized

Coding for Data Science Tips 1 – Discovering the encoding of a .csv file

“Grrr, why can’t I upload this csv?”

Does this little rant sound familiar? Sometimes csv files are a struggle to load, mainly because of their encoding, that is, the scheme used to store the file's characters as bytes. Files come in different encodings, so how can we discover the encoding of our .csv to start researching how to load it into the notebook?

Simple, the code is pretty straightforward 🙂
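The notebook linked at the end of this post holds the actual code. As a stand-in, here is a minimal sketch of one possible approach using only the standard library: try a short list of candidate encodings and keep the first one that decodes the file without errors. The function name guess_encoding, the candidate list and the sample file are all assumptions made for illustration, and this trial-and-error method is a heuristic rather than a guaranteed detector. Third-party detectors such as chardet are another option.

```python
import tempfile
from pathlib import Path


def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the file cleanly.

    latin-1 accepts any byte sequence, so it goes last as a fallback.
    """
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeError:
            continue
        return enc
    return None


# Made-up sample: a csv saved in cp1252 (common for Windows exports).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes("name,city\nZoé,Besançon\n".encode("cp1252"))
    detected = guess_encoding(sample)

print(detected)
```

Once a plausible encoding is found, it can be passed to the csv reader, for example through the encoding argument of open() or of pandas' read_csv.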

Let’s hope that the next time you face a problem loading a csv, the solution comes easier 🙂

Check the code on https://github.com/insilicobiologyblog/DScodingTips/blob/main/CodingTips1/DSCodeTips1-DiscoveringEncodingCSVfile.ipynb

Any tips & tricks you might have for coding in Data Science for all levels of data scientists? Share them with me! =)

Uncategorized

Hire a Data Scientist, not a Data Technician

Every time I write a new blogpost here or a new post on LinkedIn, I get the same question: “What programming languages do I need to learn to become a data scientist?”

The short answer: “There is no answer to that question. Focus on becoming a Scientist, rather than a technician.” The market knows that the need for data scientists has increased, but it doesn’t know how to hire them or even... find them.

With the change in thinking that data science has brought to the world, the paradigm in recruitment has to change as well. Data Scientist means a scientific mindset, not a technical mindset.

So, what can change to ease the life of data scientists and companies? Here are a few ideas.

Change or Ban Technical CV’s – Data Science exists where data exists. It really doesn’t matter if a data scientist has programmed in R, Python, Java or even Excel, or for how many years. We are not developers, and a developer's knowledge of a specific language should not be expected of a data scientist. Why not switch to a storytelling format that reflects the candidate's experience with different projects, including academic ones and volunteer work? Then you can assess how the candidate deals with change and how they adapt and contribute to new projects. A good data scientist should be happy to talk about projects and the pros and cons of each one, be careful not to reveal sensitive information, and be able to explain which lessons each project taught them.

Focus interviews on Story Telling and Challenge Solving – Data Science is not a checklist kind of job, so it should not be treated as one. Instead of making a data scientist enumerate the tools and technologies they have worked with, why not have them tell how a project was developed and how the solutions were found? That will help you assess the candidate's influence in a group and their ability to communicate results: two crucial traits in a data scientist.

Discuss current themes in Data Science and ask for their opinion – Is data privacy in the spotlight nowadays? Maybe the problems with driverless cars are in the news? Do neural networks spark your curiosity? Ask for their opinion on the subject and what they know about it. This lets you assess their curiosity and passion for the area, and how up to date they are in a field that can change from day to day.

Hire Diverse instead of just claiming it – Data Science is the best opportunity for a diverse team. Engineers, life scientists, physicists, chemists, etc... all walks of life can bring a unique perspective to a project. Instead of demanding an engineering background, why not ask for an analytical background? As long as there is some knowledge and experience of statistics and linear algebra, does it really matter if that person is an engineer or a chemist? In fact, diverse backgrounds mean different methodologies and techniques within a group, which fosters creativity.

What do you think? How would you evaluate a potential scientific mindset?

A bit of Banksy fun to remind you that office work is not mandatory for data scientists =)

Uncategorized

In Silico Biology – What is this? Who am I?

Data is at the center of today’s world. In every transaction, in every event, in every single tiny thing happening in our universe.

This is what motivates a data scientist. Understanding this data.

If we then move to the specific data that exists within nature and that can help us improve healthcare, science and ecology, and even help us protect the environment, we find computational biologists working with it.

We’re talking about terabytes upon terabytes of data. But fear not, it’s far simpler and more fun than you might think. And it’s my responsibility to help you understand it.

This is my little corner of the web to show what a computational biologist and science communicator does. Let’s go 🙂