Prashant Gupta

Probability Basic Questions

  • Always pay attention in probability questions to whether multiple objects are being picked simultaneously or one by one. For practice, try this and this

  • Circular permutations of n objects: (n-1)!. How? Solve this
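
A quick brute-force sketch of mine (not from the original notes) that verifies the (n-1)! count: treat two linear permutations as the same circular arrangement if one is a rotation of the other, and count the equivalence classes.

    # Verify that the number of distinct circular arrangements of n objects
    # is (n-1)! by counting rotation-equivalence classes of permutations.
    from itertools import permutations
    from math import factorial

    def circular_arrangements(n):
        seen = set()
        for p in permutations(range(n)):
            # canonical form: rotate so that element 0 comes first
            i = p.index(0)
            seen.add(p[i:] + p[:i])
        return len(seen)

    for n in range(1, 7):
        assert circular_arrangements(n) == factorial(n - 1)
        print(n, circular_arrangements(n), factorial(n - 1))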


Context Free Grammar

Parsing:

Input: sentence. Output: parse tree.

Parsing can be treated as a supervised machine learning problem. Training data comes from a treebank, which consists of many sentences paired with their parse trees. One example is the Penn WSJ Treebank.

The leaf nodes make up the sentence; above them sit the part-of-speech tags, and above those the phrases/constituents.

S - sentence, NP - noun phrase, VP - verb phrase, DT - determiner, V - verb, N - noun
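
As a small illustration (my own sketch, assuming the nltk package is installed), the bracketed tree below uses exactly these labels, and nltk.Tree can parse and display it:

    # Build the parse tree for "the dog saw a cat" using the labels above.
    from nltk import Tree

    t = Tree.fromstring(
        "(S (NP (DT the) (N dog)) (VP (V saw) (NP (DT a) (N cat))))"
    )
    print(t.leaves())   # the words of the sentence (the leaf nodes)
    print(t.pos())      # (word, tag) pairs from the part-of-speech level
    t.pretty_print()    # draws the constituent structure as text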


ROC Curve...AUC...What is that?

YouTube explainer Video

Visualization

Research paper


Confusion Matrix...confusing?


Sentiment Analysis

Sample sentence 1

Sample Sentence 2

read more

Serving Flask with Nginx + uWSGI

How does this setup work?

Flask is managed by uWSGI.

uWSGI talks to nginx.

nginx handles contact with the outside world.

When a client connects to your server trying to reach your Flask app:

  1. nginx opens the connection and proxies it to uWSGI

  2. uWSGI handles the Flask instances you have and connects one to the client

  3. Flask talks to the client happily
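
For reference, here is a minimal sketch of the kind of Flask app uWSGI would load (my own example; the module name app.py and the callable name app are assumptions and just need to match your uWSGI configuration):

    # app.py - a minimal Flask application that uWSGI can serve.
    # uWSGI would be pointed at this module and the "app" callable
    # (names here are assumptions, not taken from the original post).
    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "Hello from Flask behind uWSGI and nginx!"

    if __name__ == "__main__":
        # Only for local development; in production uWSGI runs the app.
        app.run(host="127.0.0.1", port=5000)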

read more

Cross Validation techniques

Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen. One way to overcome this problem is to not use the entire data set when training a learner. Some of the data is removed before training begins. Then when training is done, the data that was removed can be used to test the performance of the learned model on new data. This is the basic idea for a whole class of model evaluation methods called cross validation.
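
As a concrete illustration (my own sketch using scikit-learn, which the original post may or may not use), k-fold cross validation holds out a different slice of the data on each round and scores the model on it:

    # k-fold cross validation sketch (assumes scikit-learn is installed).
    # Each of the 5 folds is held out once while the model is trained on
    # the remaining data, then the held-out fold is used for evaluation.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    scores = cross_val_score(model, X, y, cv=5)
    print("fold accuracies:", scores)
    print("mean accuracy:", scores.mean())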

Know more techniques? Start here


Machine learning evaluation metrics

One method of judging the quality of a particular model is by residuals. That means the model is fit using all the data points and the prediction for each data point is compared with its actual output. The absolute value of each error is taken and the mean of those values is computed to arrive at the mean absolute residual error. Models with lower values of this measure are deemed to be better.
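
In code, the mean absolute residual error described above is simply (my own sketch, with toy numbers):

    # Mean absolute residual error, computed exactly as described above.
    import numpy as np

    actual = np.array([3.0, -0.5, 2.0, 7.0])
    predicted = np.array([2.5, 0.0, 2.0, 8.0])

    residuals = actual - predicted
    mean_absolute_error = np.mean(np.abs(residuals))
    print(mean_absolute_error)  # 0.5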

There is always a plethora of metrics in machine learning that can be used to evaluate the performance of an ML model. This is an attempt to draw a metric map, just to keep them all in one place.

read more

Machine learning algorithms map

Machine Learning Algorithms in one glance.

read more

Benchmarking the right way

Table of contents

  1. Benchmarking the right way - Introduction
    1. What is?
      1. Latency
      2. Throughput
      3. Packet Loss
      4. Processing time
      5. Response time
  2. Benchmarking time
    1. Network Latency
      1. Measure by ping
      2. Measure by flent
    2. Test using curl
    3. Benchmark POST API using ab
    4. Benchmark POST API using wrk
  3. Memory benchmark
  4. CPU benchmark

Benchmarking the right way - Introduction

Benchmarking an API is mainly about time, CPU and memory. One should also be concerned with the number of concurrent connections the API can handle in a production environment. This post will help you understand the mechanisms behind network slowdowns.

When troubleshooting network degradation or an outage, we need ways to measure network performance, determine when the network is slow, and identify the root cause (saturation, bandwidth outage, misconfiguration, network device defect, etc.). This helps maintain a flawless service for your customers even in bad times.

Whatever approach you take to the problem (traffic capture with network analyzers like Wireshark; SNMP polling with tools such as PRTG or Cacti; or active testing by generating traffic with tools such as SmokePing, or a simple ping or traceroute to track network response times), you need indicators. These are usually called metrics, and they put tangible figures on the performance status of the network.

There are several major network performance indicators; let us look at what they reflect and how they interact with each other in TCP and UDP traffic streams.

What is?

Important metrics

Latency - the time required to carry a packet across a network. Latency may be measured in different ways: round trip or one way. It can be affected by any element in the chain that carries the data: workstation, WAN links, routers, local area network, server… and ultimately, for large networks, it may be limited by the speed of light.

Throughput - the quantity of data sent/received per unit of time.

Packet loss - the number of packets lost per 100 packets sent by a host.

Processing time -

is the amount of time a system takes to process a given request, not including the time it takes the message to get from the user to the system or the time it takes to get from the system back to the user.

Processing time can be affected by changes to your code, changes to systems that your code depends on (e.g. databases), or improvements in hardware.

Response time -

is the total time it takes from when a user makes a request until they receive a response.

Response time can be affected by changes to the processing time of your system and by changes in latency, which occur due to changes in hardware resources or utilization.

In many cases, you can assert that your latency is nominal, thus making your response time and your processing time pretty much the same. I guess it doesn’t matter what you call things as long as everybody involved in your performance analysis understands these different aspects of the system. For example, it is useful to graph latency vs. response time, and it is important for all the parties involved to know the difference between the two.

UDP Throughput is not impacted by latency

UDP is a protocol used to carry data over IP networks. One of the principles of UDP is that we assume all packets sent are received by the other party (or that such checks are performed at a different layer, for example by the application itself).

In theory or for some specific protocols (where no control is undertaken at a different layer – e.g. one-way transmissions), the rate at which packets can be sent by the sender is not impacted by the time required to deliver the packets to the other party (= latency). Whatever that time is, the sender will send a given number of packets per second, which depends on other factors (application, operating system, resources, …).

TCP is directly impacted by latency

TCP is a more complex protocol, as it integrates a mechanism which checks that all packets are correctly delivered. This mechanism is called acknowledgment: it consists of the receiver sending a specific packet or flag to the sender to confirm the proper reception of a packet.

TCP Congestion Window

For efficiency purposes, not all packets are acknowledged one by one: the sender does not wait for each acknowledgment before sending new packets. The number of packets which may be sent before receiving the corresponding acknowledgment packet is managed by a value called the TCP congestion window.

How the TCP Congestion Window impacts throughput

If we assume that no packet gets lost, the sender will send a first quota of packets (corresponding to the TCP congestion window); when it receives the acknowledgment packets, it will increase the TCP congestion window, and the number of packets that can be sent in a given period of time (the throughput) will progressively increase. The delay before acknowledgment packets are received (= latency) therefore affects how fast the TCP congestion window, and hence the throughput, grows.

When latency is high, it means that the sender spends more time idle (not sending any new packets), which reduces how fast throughput grows.

Roundtrip latency    TCP throughput
0 ms                 93.5 Mbps
30 ms                16.2 Mbps
60 ms                8.07 Mbps
90 ms                5.32 Mbps

TCP is impacted by retransmission and packet loss

How the TCP congestion window handles missing acknowledgment packets

The TCP congestion window mechanism deals with missing acknowledgment packets as follows: if an acknowledgment packet is still missing after a period of time, the packet is considered lost and the TCP congestion window is halved (and hence the throughput too, which corresponds to the sender perceiving limited capacity on the route); the TCP window size can then start increasing again if acknowledgment packets are received properly.
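
To make the interplay between latency, loss and throughput concrete, here is a toy simulation of mine (a deliberately simplified model, not real TCP: the window grows by one segment per round trip and is halved on loss):

    # Toy model of the TCP congestion window: additive increase per RTT,
    # multiplicative decrease on loss. Throughput falls as RTT grows.
    def simulated_throughput(rtt_s, duration_s=10.0, segment_bytes=1500,
                             loss_every_n_rounds=0):
        cwnd = 1            # congestion window, in segments
        delivered = 0       # total segments delivered
        elapsed, rounds = 0.0, 0
        while elapsed + rtt_s <= duration_s:
            delivered += cwnd
            rounds += 1
            elapsed += rtt_s
            if loss_every_n_rounds and rounds % loss_every_n_rounds == 0:
                cwnd = max(1, cwnd // 2)   # halve the window on loss
            else:
                cwnd += 1                  # grow by one segment per RTT
        return delivered * segment_bytes * 8 / duration_s  # bits per second

    for rtt_ms in (10, 30, 60, 90):
        bps = simulated_throughput(rtt_ms / 1000.0)
        print(f"RTT {rtt_ms:>2} ms -> ~{bps / 1e6:.2f} Mbps (no loss)")

The absolute numbers are arbitrary, but the trend matches the table above: the higher the round-trip latency, the slower the window grows and the lower the achievable throughput.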

Benchmarking Time

We will first discuss ways to measure network latency and then move to response time and concurrent connections.

Network Latency

Measure by ping

Sample output

Test using curl

Content of curl-format.txt is

read more

Scalability

Reliability…what is that?

“continuing to work correctly, even when things go wrong.” The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space—good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.

Note that a fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.

read more

Writing secure python applications

Table of contents

  1. Injection Attacks
    1. SQL Injection
    2. XML Injection
    3. Command Injection
  2. Broken authentication & session management
    1. Session fixation
    2. Use of Insufficiently random values
  3. Cross site scripting
    1. Reflected XSS
    2. Persistent XSS
    3. Document Object Model (DOM) Based XSS
  4. Insecure direct object references
    1. Directory (Path) Traversal
  5. Security Misconfiguration
    1. Privileged Interface Exposure
    2. Leftover debug code
  6. Sensitive data exposure
    1. Authentication credentials in URL
    2. Session Exposure within URL
    3. User Enumeration
  7. Missing function level access control
    1. Horizontal Privilege Escalation
    2. Vertical Privilege Escalation
  8. Cross site request forgery
    1. Cross site request forgery (POST)
    2. Cross site request forgery (GET)
    3. Click Jacking
  9. Unvalidated redirects & Forwards
    1. Insecure URL redirect
read more

System design

Table of contents

  1. Software Design Theoretical concepts - Introduction
    1. CRC Card
    2. Four concepts revolving around OOP
    3. Coupling & Cohesion
    4. Separation of concerns
    5. SOLID
      1. The Single Responsibility Principle
      2. The Open Closed Principle
      3. The Liskov Substitution Principle
      4. The Interface Segregation Principle
      5. The Dependency Inversion Principle
  2. Class Diagrams
    1. Tool to draw UML
    2. UML class diagram rules
  3. Sample UML Diagrams examples
  4. Object oriented cheat-sheet
  5. References
read more

Using Anaconda in CI/CD pipeline

To successfully build the pipeline, I needed to automate all the interactive "yes" prompts while executing the Anaconda installer script.

I did this by invoking the installer with the -b (batch mode) option:

    bash Anaconda2-5.0.1-Linux-x86_64.sh -b


Shipping machine learning modules in a single executable

You have done all your research, prototyped it, optimized it, and now you are ready to ship it. This post focuses not only on shipping machine learning modules but on Python-based codebases in general.

How do you ship?

  1. Expose an API
  2. Package your code in a single executable
read more

Feature Extraction

To start with

feature selection: select a subset of the original feature set.

feature extraction: build a new set of features from the original feature set.
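
A small side-by-side sketch of the two ideas (my own example using scikit-learn; the dataset and parameter choices are arbitrary):

    # Feature selection vs. feature extraction (assumes scikit-learn).
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)          # 4 original features

    # Feature selection: keep 2 of the original 4 features.
    X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

    # Feature extraction: build 2 brand-new features (principal components)
    # as combinations of the original 4.
    X_extracted = PCA(n_components=2).fit_transform(X)

    print(X.shape, X_selected.shape, X_extracted.shape)  # (150, 4) (150, 2) (150, 2)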


Language Modelling in NLP

What is Language Modelling ?

Language modeling, in very simple terms, is the task of assigning a probability to sentences in a language. Besides assigning a probability to each sequence of words, a language model also assigns a probability to the likelihood of a given word (or a sequence of words) following a sequence of words.
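
As a tiny illustration (my own sketch, not from the original post), a bigram model estimates these probabilities from counts over a corpus:

    # A minimal bigram language model: estimate P(next_word | previous_word)
    # from counts over a tiny toy corpus.
    from collections import Counter, defaultdict

    corpus = [
        "<s> the cat sat on the mat </s>",
        "<s> the dog sat on the rug </s>",
    ]

    bigram_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            bigram_counts[prev][nxt] += 1

    def prob(nxt, prev):
        total = sum(bigram_counts[prev].values())
        return bigram_counts[prev][nxt] / total if total else 0.0

    print(prob("cat", "the"))   # 0.25 -> "the" is followed by "cat" 1 time out of 4
    print(prob("sat", "dog"))   # 1.0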

read more

All you need to know about Word2vec

Feature Extraction from texts using Word2vec

To get a better semantic understanding of words, word2vec was introduced to the NLP community.
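
A minimal training sketch of mine (assumes the gensim library; parameter names follow gensim 4.x, where the dimensionality argument is vector_size rather than the older size):

    # Minimal word2vec training sketch on a toy corpus (assumes gensim 4.x).
    from gensim.models import Word2Vec

    sentences = [
        ["all", "my", "cats", "in", "a", "row"],
        ["my", "cat", "sits", "down", "near", "the", "toy"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    print(model.wv["cat"][:5])           # first few dimensions of the word vector
    print(model.wv.most_similar("cat"))  # nearest neighbours in vector space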

read more

All you need to know about BOW

Feature Extraction from texts using Bag of words

The bag of words model ignores grammar and the order of words. Take two example sentences: ‘All my cats in a row’ and ‘When my cat sits down, she looks like a Furby toy!’.

We break the given sentences down into words and assign each word a unique ID.
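
A quick sketch of that ID assignment and the resulting count vectors (my own example; scikit-learn's CountVectorizer is one convenient way to do it):

    # Bag of words over the two example sentences (assumes scikit-learn).
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "All my cats in a row",
        "When my cat sits down, she looks like a Furby toy!",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)

    # Each word gets a unique ID (its column index); grammar and order are lost.
    print(vectorizer.vocabulary_)
    print(X.toarray())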

read more

Visual recognition

I am really fascinated by this broad research topic. So, I decided to play around with things, and this post is a sequentially arranged record of my attempts at visual recognition.

What inspired me?

Dr. Fei Fei Li with her TED talk

read more

Basics of probability and stats

Data?

Data are pieces of information about individuals organized into variables. By an individual, we mean a particular person or object. By a variable, we mean a particular characteristic of the individual.

Variables can be classified into one of two types: categorical or quantitative.
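
A tiny illustration of mine (assumes pandas is installed): each row is an individual, each column a variable, and the columns split into categorical and quantitative types.

    # Individuals as rows, variables as columns (illustrative toy data).
    import pandas as pd

    data = pd.DataFrame({
        "name":      ["Asha", "Ravi", "Meera"],   # identifier for each individual
        "gender":    ["F", "M", "F"],             # categorical variable
        "smokes":    [False, True, False],        # categorical variable
        "height_cm": [162.0, 175.5, 158.0],       # quantitative variable
        "age":       [29, 34, 26],                # quantitative variable
    })
    print(data.dtypes)
    print(data.describe())   # summary statistics for the quantitative columns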

read more

Striving better at Competitive programming

I stepped into competitive programming in college. I started with SPOJ, attempted Life, the Universe, and Everything, and voila, got a compilation error :laughing:

Anyways, that was a learning curve and I continued with other platforms like Codechef and Codeforces along with SPOJ. Topcoder problems were tough then and now as well :wink:

In the spirit of making myself a better developer, I am releasing all my submitted solutions to various problems across all platforms. The main reason behind putting all my code in one place… read more


Git utility commands

  1. Ignore a file/folder while committing

    For a File

         git add -u
         git reset -- main/dontcheckmein.txt
         git add .
         git commit -m "commit message"
         git push -u origin master
    

    For a folder

         git add -u
         git reset -- main/*
         git add .
         git commit -m "commit message"
         git push -u origin master
    
  2. Use these commands to revert/delete your last commit (only the most recent one) from the repo, but keep in mind that if other contributors have already pulled the code before you revert, it may cause problems.

read more

Linux utility commands

To monitor CPU usage on any Linux distribution, use

read more

Play vocals #1

For me, it turned out to be quite tough when it comes to singing, with or without karaoke. I tried a mix of both in these three beautiful songs.

My first experience of singing with a recording? Out of key, out of time… Still, here they are. read more